<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication &#187; sqldf</title>
	<atom:link href="http://www.cerebralmastication.com/tag/sqldf/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Fri, 16 Jul 2010 22:07:12 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Loading Big (ish) Data into R</title>
		<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/</link>
		<comments>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 23:14:06 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[sqldf]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=416</guid>
		<description><![CDATA[So for the rest of this conversation big data == 2 Gigs. Done. Don&#8217;t give me any of this &#8216;that&#8217;s not big, THIS is big&#8217; shit. There now, on with the cool stuff:
This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab [...]]]></description>
			<content:encoded><![CDATA[<p>So for the rest of this conversation big data == 2 Gigs. Done. Don&#8217;t give me any of this &#8216;that&#8217;s not big, THIS is big&#8217; shit. There now, on with the cool stuff:</p>
<p>This week on Twitter Vince Buffalo asked about loading a 2 gig comma separated file (CSV) into R. (OK, he asked about tab delimited data, but I ignored that because I use mostly comma delimited data and I wanted to test CSV. Sue me.)</p>
<p><a href="http://twitter.com/vsbuffalo/statuses/5987999475" onclick="pageTracker._trackPageview('/outgoing/twitter.com/vsbuffalo/statuses/5987999475?referer=');"><img class="size-full wp-image-417 alignnone" style="border: 2px solid black; margin: 2px;" title="2gib" src="http://www.cerebralmastication.com/wp-content/uploads/2009/11/2gib.PNG" alt="2gib" width="512" height="316" /></a></p>
<p>I thought this was a dang good question. What I have always done in the past is load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. <a href="http://twitter.com/mpastell/statuses/6002853376" onclick="pageTracker._trackPageview('/outgoing/twitter.com/mpastell/statuses/6002853376?referer=');">Matti Pastell (@mpastell) recommended </a>using the <a href="http://code.google.com/p/sqldf/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/sqldf/?referer=');">sqldf </a>(SQL to data frame) package to do the import. I&#8217;ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn&#8217;t know it could import data, but that makes sense, given how sqldf works. How does it work? Well, sqldf sets up an instance of the <a href="http://www.sqlite.org/" onclick="pageTracker._trackPageview('/outgoing/www.sqlite.org/?referer=');">sqlite </a>database (it&#8217;s an embedded engine, so there is no server to run), shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn&#8217;t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with <a href="http://old.nabble.com/Re%3A-Memory-Experimentation%3A-Rule-of-Thumb-%3D-10-15-Times-the-Memory-to12076668.html#a12078165" onclick="pageTracker._trackPageview('/outgoing/old.nabble.com/Re_3A-Memory-Experimentation_3A-Rule-of-Thumb-_3D-10-15-Times-the-Memory-to12076668.html_a12078165?referer=');">this discussion </a>on the R mailing list.</p>
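<p>For anyone who hasn&#8217;t seen the &#8220;SQL syntax on R data frames&#8221; side of sqldf, here is a minimal sketch. The data frame toy and its columns grp and val are made up for illustration; the only real API used is the sqldf() call itself:</p>
<blockquote><p>library(sqldf)<br />
# toy data frame; toy, grp and val are hypothetical names for this example<br />
toy &lt;- data.frame(grp = c("a", "b", "a"), val = c(1, 2, 3))<br />
# sqldf copies toy into sqlite, runs the query there, and returns a data frame<br />
totals &lt;- sqldf("select grp, sum(val) as total from toy group by grp")</p></blockquote>
<p>The table name in the query is just the name of the R object, which is the same trick that makes the file import work.</p>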
<p>So enough background, here&#8217;s my setup: I have an Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here&#8217;s the code I ran to test:</p>
<blockquote><p>bigdf &lt;- data.frame(dim = sample(letters, 4e7, replace = TRUE), fact1 = rnorm(4e7), fact2 = rnorm(4e7, 20, 50))<br />
write.csv(bigdf, 'bigdf.csv', quote = FALSE)</p></blockquote>
<p>That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get a file of about 1.8 GiB. That&#8217;s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I still had hundreds of megs of something or other in use (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:</p>
<blockquote><p>library(sqldf)<br />
f &lt;- file("bigdf.csv")<br />
system.time(bigdf &lt;- sqldf("select * from f", dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)))</p></blockquote>
<p>That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, eh? Well, except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don&#8217;t specify one at all it makes the db in memory, which is what I first tried. I ran out of memory even on my 10GB box. So I read a little more and added the dbname = tempfile() argument, which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.</p>
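<p>A quick aside for the tab delimited crowd, since that was Vince&#8217;s actual question. This is only a sketch, and I have not timed it on a big file. It assumes the sep entry in file.format is passed through to the sqlite import the way the sqldf documentation describes. The names tsv_path and small are made up, and the tiny file stands in for a real 2 gig one:</p>
<blockquote><p>library(sqldf)<br />
# small stand-in for a big tab delimited file; tsv_path and small are hypothetical names<br />
tsv_path &lt;- tempfile(fileext = ".tsv")<br />
write.table(data.frame(dim = c("a", "b"), fact1 = c(1.5, 2.5)), tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)<br />
# same import as for the CSV, with sep set to a tab<br />
f &lt;- file(tsv_path)<br />
small &lt;- sqldf("select * from f", dbname = tempfile(), file.format = list(header = TRUE, sep = "\t", row.names = FALSE))</p></blockquote>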
<p>So how long did the sqldf import of the CSV take to run? Just under 5 minutes.</p>
<p>So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:</p>
<blockquote><p>system.time(big.df &lt;- read.csv('bigdf.csv'))</p></blockquote>
<p>And I would love to tell you how long that took to run, but it&#8217;s been running <span style="text-decoration: line-through;">for half an hour</span> all night and I just don&#8217;t have that kind of patience.</p>
<p>-JD</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
