Loading Big (ish) Data into R

So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff:

This week on Twitter Vince Buffalo asked about loading a 2 gig comma-separated (CSV) file into R (OK, he asked about tab-delimited data, but I ignored that because I mostly use comma-delimited data and I wanted to test CSV. Sue me.)


I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. Matti Pastell (@mpastell) recommended using the sqldf (SQL to data frame) package to do the import. I’ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn’t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an instance of the sqlite database server then shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn’t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with this discussion on the R mailing list.
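
If you haven't seen the data-frame-manipulation side of sqldf, here's a rough sketch of what that looks like (toy data, made-up column names):

library(sqldf)

# a toy data frame to demonstrate on
df <- data.frame(grp = sample(letters[1:3], 100, replace = TRUE),
                 val = rnorm(100))

# sqldf copies df into a temporary SQLite table, runs the query,
# and hands the result back as an ordinary R data frame
sqldf("select grp, avg(val) as mean_val from df group by grp")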

So enough background, here’s my setup: I have a Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here’s the code I ran to test:

bigdf <- data.frame(dim   = sample(letters, 4e7, replace = TRUE),
                    fact1 = rnorm(4e7),
                    fact2 = rnorm(4e7, 20, 50))
write.csv(bigdf, "bigdf.csv", quote = FALSE)

That code creates a data frame with 3 columns: a single-letter text column and two floating point columns, with 40,000,000 records. When I run the write.csv step on my machine I get a file of about 1.8GiB. That's close enough to 2 gigs for me. After creating the text file I ran rm(list=ls()) to kill all objects, then ran gc() and saw that I still had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:

library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(),
                           file.format = list(header = TRUE, row.names = FALSE)))

That code loads the CSV into an SQLite DB, then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, eh? Well, except for the dbname = tempfile() bit. In sqldf you can choose where it builds the SQLite db. If you don't specify at all, it builds it in memory, which is what I first tried. I ran out of memory even on my 10GB box. So I read a little more and added dbname = tempfile(), which creates a temporary SQLite file on disk. If I wanted to use an existing SQLite file I could have specified that instead.
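
For reference, the three flavors look roughly like this (a sketch; 'mydata.sqlite' is just a made-up file name):

# default: build the SQLite db in memory (this is what blew past my 10GB)
bigdf <- sqldf("select * from f",
               file.format = list(header = TRUE, row.names = FALSE))

# temporary SQLite file on disk (goes away with the R session's temp dir)
bigdf <- sqldf("select * from f", dbname = tempfile(),
               file.format = list(header = TRUE, row.names = FALSE))

# or point it at a specific SQLite file you want to keep or reuse
bigdf <- sqldf("select * from f", dbname = "mydata.sqlite",
               file.format = list(header = TRUE, row.names = FALSE))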

So how long did it take to run? Just under 5 minutes.

So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:

system.time(big.df <- read.csv("bigdf.csv"))

And I would love to tell you how long that took to run, but it's been running all night and I just don't have that kind of patience.

-JD

15 Comments

  1. Greg says:

    Did you try specifying colClasses in read.csv? If I remember correctly, that speeds things up significantly (see the Note section of ?read.csv). My test below is on 1 million rows of observations in a CSV file.

    I ran the following:
    df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
    write.csv(df, file = "df.csv")
    rm(list = ls())
    gc()
    system.time(read.csv("df.csv", row.names = 1))
    gc()
    system.time(read.csv("df.csv",
                         colClasses = c("character", "numeric", "numeric"),
                         row.names = 1))

    Also, I ran your sqldf code:
    gc()
    library(sqldf)
    f <- file("df.csv")
    system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

    Results:
    No colClasses: elapsed = 31.345s
    with colClasses: elapsed = 11.801s
    sqldf: elapsed = 29.565s

    I'm trying to remember the mechanism for read.csv, but I think it imports the data as "character" and then tries to cast it to another type with as.*. The Note section of ?read.csv explains that "character" is much slower than "integer" (and "numeric", in my example).
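
    You can see that guessing step in isolation with type.convert, which is what read.table falls back on when colClasses isn't supplied:

    # everything is read as character first, then converted afterwards
    type.convert(c("1.5", "2.5", "3.5"), as.is = TRUE)   # becomes numeric
    type.convert(c("a", "b", "c"), as.is = FALSE)        # becomes a factor (like read.csv's default)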

    Nice post.

    Greg
    sessionInfo()
    R version 2.9.0 (2009-04-17)
    i386-apple-darwin8.11.1

    locale:
    en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] tcltk stats graphics grDevices datasets utils methods base

    other attached packages:
    [1] sqldf_0-1.5 gsubfn_0.5-0 proto_0.3-8 RSQLite_0.7-1 DBI_0.2-4
    [6] cimis_0.1-2 RLastFM_0.1-4 RCurl_0.98-1 bitops_1.0-4.1 XML_2.5-3
    [11] lattice_0.17-22

    loaded via a namespace (and not attached):
    [1] grid_2.9.0

  2. Jay says:

    I second Greg's message. colClasses should help a great deal. In addition, setting stringsAsFactors = FALSE should also help with speed.
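
    Roughly like this (an untested sketch against the file from the post; the first class is for the unnamed row-number column that write.csv adds):

    bigdf <- read.csv("bigdf.csv",
                      colClasses = c("integer", "character", "numeric", "numeric"),
                      stringsAsFactors = FALSE)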

  3. Jay says:

    The link from Dirk's post covers a few of the options nicely (a combined sketch follows this list):

    1. Set nrows = the number of records in your data (nmax in scan).

    2. Make sure that comment.char = "" to turn off interpretation of comments.

    3. Explicitly define the classes of each column using colClasses in read.table.

    4. Setting multi.line = FALSE may also improve performance in scan.
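
    Put together, that's roughly (the nrows and colClasses values here assume the file from the post):

    bigdf <- read.table("bigdf.csv", sep = ",", header = TRUE,
                        nrows = 4e7,          # 1: expected number of records
                        comment.char = "",    # 2: turn off comment scanning
                        colClasses = c("integer", "character",
                                       "numeric", "numeric"))   # 3: explicit column classes
    # (4: multi.line = FALSE is an argument to scan(), not read.table)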

  4. Freddy says:

    Hi…

    Can you help me? I need to load a file hosted on the web, and I am using:

    f<-file('http://www.the-direction-data.txt')
    bigdf <- sqldf('select * from f', file.format = list(header = F, row.names = F))

    I have checked the file format but I get an error:

    Error en sqliteExecStatement(con, statement, bind.data) :
    RS-DBI driver: (error in statement: no such table: f)

    Do you know how I can solve this?

    Thanks.

  5. JD Long says:

    Freddy, I'm not sure of the full extent of your issues, but I'm pretty sure "http://www.the-direction-data.txt" is not the right URL. "http://www.the-direction.com/data.txt" might be. Or something else. But getting that URL right is the first step.
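
    If the URL is right and sqldf still balks at the connection, one workaround (an untested sketch; the URL is a placeholder) is to download the file to a local temp file first and then hand sqldf an ordinary file connection:

    # pull the remote file down to a local file first
    tmp <- tempfile()
    download.file("http://www.the-direction.com/data.txt", tmp)

    library(sqldf)
    f <- file(tmp)
    bigdf <- sqldf("select * from f", dbname = tempfile(),
                   file.format = list(header = FALSE, row.names = FALSE))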

  6. Sandy says:

    Hi,

    I’ve tried this sqldf approach with a file of 611 MB (2099 columns, 96360 rows), but I get the error message that it cannot allocate a vector of 411 Kb.
    My memory.limit is 3000 MB, so I guess it should be possible.
    What am I doing wrong?

    Thanks,
    Sandy

  7. JD Long says:

    Sandy,

    What platform are you on, and how much memory do you have? I.e., are you on 64-bit Linux with 16 GB of RAM?

  8. Hui Cheng says:

    Hello,

    Thanks for the post. I successfully read an 18,660,916 x 4 table using sqldf. However, when I tried it on a 4000 x 4500 table, only the header was read in. I couldn't figure out why.

    Hui

  9. JD Long says:

    That’s odd, I agree.

  10. Dieter Menne says:

    Since the link to "this discussion" is dead, here it is again:

    http://r-project.markmail.org/thread/iatn4krvfbbykfmk

  11. JD Long says:

    Thanks Dieter. I really appreciate the link refresh.

  12. odp says:

    Instead of a file, can we read from an ODBC connection, through, say, RODBC?

  13. JD Long says:

    I think so, although you may find that, because of the inefficiency of ODBC drivers, it's faster to use native DB tools to dump the data to a text file and then slurp that file into R. You'll have to experiment to know for sure.
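
    If you do want to try the direct route, the usual RODBC pattern is roughly this (the DSN and table name are placeholders):

    library(RODBC)

    # "my_dsn" is whatever ODBC data source name you have configured
    ch <- odbcConnect("my_dsn")
    bigdf <- sqlQuery(ch, "select * from my_big_table")
    odbcClose(ch)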

  14. Francois B says:

    Hello,

    The post and the comments were really useful. Adding the right parameters to read.table let me improve the speed, and I also tried your sqldf example.
    However, when I try to use sqldf inside a function I get a file-not-found error that I don't know how to solve. I should mention that I am not an R expert. Maybe someone could help?

    Francois

  15. JD Long says:

    Francois, you should post an example and a question on stackoverflow.com… that’s your best route for getting technical support.
