Loading Big (ish) Data into R
So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff:
This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)
I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. Matti Pastell (@mpastell) recommended using the sqldf (SQL to data frame) package to do the import. I’ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn’t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an instance of the sqlite database server then shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn’t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with this discussion on the R mailing list.
So enough background, here’s my setup: I have a Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here’s the code I ran to test:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, ‘bigdf.csv’, quote = F)
That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That’s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:
library(sqldf)
f <- file(“bigdf.csv”)
system.time(bigdf <- sqldf(“select * from f”, dbname = tempfile(), file.format = list(header = T, row.names = F)))
That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don’t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.
So how long did it take to run? Just under 5 minutes.
So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:
system.time(big.df <- read.csv(‘bigdf.csv’))
And I would love to tell you how long that took to run, but it’s been running for half an hour all night and I just don’t have that kind of patience.
-JD
Did you try specifying colClasses in read.csv? If I remember, that will speed things up significantly (see ?read.csv Notes). This is on 1 million lines of observations in a csv file.
I ran the following:
df = data.frame(x = rnorm(1e6), y = rnorm(1e6))
write.csv(df, file=”df.csv”)
rm(list=ls())
gc()
sytem.time(read.csv(“df.csv”, row.names = 1))
gc()
system.time(read.csv(“df.csv”,
colClasses = c(“character”, “numeric”, “numeric”), row.names=1))
Also, I ran your sqldf code:
gc()
library(sqldf)
f <- file("df.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
Results:
No colClasses: elapsed = 31.345s
with colClasses: elapsed = 11.801s
sqldf: elapsed = 29.565s
I'm trying to remember the mechanism for read.csv, but I think it may import the data as "character", then tries to cast to another type with as.*. The Note section of ?read.csv explains that "character" is much slower than "integer" (and "numeric", in my example).
Nice post.
Greg
sessionInfo()
R version 2.9.0 (2009-04-17)
i386-apple-darwin8.11.1
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices datasets utils methods base
other attached packages:
[1] sqldf_0-1.5 gsubfn_0.5-0 proto_0.3-8 RSQLite_0.7-1 DBI_0.2-4
[6] cimis_0.1-2 RLastFM_0.1-4 RCurl_0.98-1 bitops_1.0-4.1 XML_2.5-3
[11] lattice_0.17-22
loaded via a namespace (and not attached):
[1] grid_2.9.0
I second Greg’s message. colClasses should help a great deal. In addition setting strings as factors to FALSE should also be helpful in terms of speed.
The link from Dirk’s post covers a few of th eoptions nicely:
1.Set nrows=the number of records in your data (nmax in scan).
2.Make sure that comment.char=”" to turn off interpretation of comments.
3.Explicitly define the classes of each column using colClasses in read.table.
4.Setting multi.line=FALSE may also improve performance in scan.