Loading Big (ish) Data into R
So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff:
This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)
I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. Matti Pastell (@mpastell) recommended using the sqldf (SQL to data frame) package to do the import. I’ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn’t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an instance of the sqlite database server then shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn’t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with this discussion on the R mailing list.
So enough background, here’s my setup: I have a Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here’s the code I ran to test:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, ‘bigdf.csv’, quote = F)
That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That’s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:
library(sqldf)
f <- file(“bigdf.csv”)
system.time(bigdf <- sqldf(“select * from f”, dbname = tempfile(), file.format = list(header = T, row.names = F)))
That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don’t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.
So how long did it take to run? Just under 5 minutes.
So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:
system.time(big.df <- read.csv(‘bigdf.csv’))
And I would love to tell you how long that took to run, but it’s been running for half an hour all night and I just don’t have that kind of patience.
-JD
Did you try specifying colClasses in read.csv? If I remember, that will speed things up significantly (see ?read.csv Notes). This is on 1 million lines of observations in a csv file.
I ran the following:
df = data.frame(x = rnorm(1e6), y = rnorm(1e6))
write.csv(df, file=”df.csv”)
rm(list=ls())
gc()
sytem.time(read.csv(“df.csv”, row.names = 1))
gc()
system.time(read.csv(“df.csv”,
colClasses = c(“character”, “numeric”, “numeric”), row.names=1))
Also, I ran your sqldf code:
gc()
library(sqldf)
f <- file("df.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
Results:
No colClasses: elapsed = 31.345s
with colClasses: elapsed = 11.801s
sqldf: elapsed = 29.565s
I'm trying to remember the mechanism for read.csv, but I think it may import the data as "character", then tries to cast to another type with as.*. The Note section of ?read.csv explains that "character" is much slower than "integer" (and "numeric", in my example).
Nice post.
Greg
sessionInfo()
R version 2.9.0 (2009-04-17)
i386-apple-darwin8.11.1
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices datasets utils methods base
other attached packages:
[1] sqldf_0-1.5 gsubfn_0.5-0 proto_0.3-8 RSQLite_0.7-1 DBI_0.2-4
[6] cimis_0.1-2 RLastFM_0.1-4 RCurl_0.98-1 bitops_1.0-4.1 XML_2.5-3
[11] lattice_0.17-22
loaded via a namespace (and not attached):
[1] grid_2.9.0
I second Greg’s message. colClasses should help a great deal. In addition setting strings as factors to FALSE should also be helpful in terms of speed.
The link from Dirk’s post covers a few of th eoptions nicely:
1.Set nrows=the number of records in your data (nmax in scan).
2.Make sure that comment.char=”" to turn off interpretation of comments.
3.Explicitly define the classes of each column using colClasses in read.table.
4.Setting multi.line=FALSE may also improve performance in scan.
Hi…
Can you help me? I need to load a file hosted at web and I am using:
f<-file('http://www.the-direction-data.txt')
bigdf <- sqldf('select * from f', file.format = list(header = F, row.names = F))
I have checked the file format but I get an error:
Error en sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: no such table: f)
Do you know how can I solve this situation?
Thanks.
Freddy, I’m not sure the full extent of your issues, but I’m pretty sure “http://www.the-direction-data.txt” is not the right URL. “http://www.the-direction.com/data.txt” might be. Or something else. But getting that step right is the first step.
Hi,
I’ve tried this sqldf approach with a file of 611 MB (2099 columns, 96360 rows), but I get the error message that it cannot allocate a vector of 411 Kb.
My memory.limit is 3000 MB, so I guess it should be possible.
What am I doing wrong?
Thanks,
Sandy
Sandy,
What platform are you on and how much memory do you have. i.e. are you on 64 bit Linux with 16gb ram?
Hello,
Thanks for the post. I successfully read a 18660916 x 4 table using sqldf. However, when I tried it on a 4000 x 4500 table. Only header was read in. I couldn’t figure out why.
Hui
That’s odd, I agree.
Since the link to “this discussion” is dead, here is it again:
http://r-project.markmail.org/thread/iatn4krvfbbykfmk
Thanks Dieter. I really appreciate the link refresh.