<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Loading Big (ish) Data into R</title>
	<atom:link href="http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Mon, 19 Jul 2010 21:30:07 -0400</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Jay</title>
		<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/comment-page-1/#comment-52</link>
		<dc:creator>Jay</dc:creator>
		<pubDate>Fri, 27 Nov 2009 06:01:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=416#comment-52</guid>
		<description>The link from Dirk&#039;s post covers a few of the options nicely:
1. Set nrows to the number of records in your data (nmax in scan).

2. Set comment.char=&quot;&quot; to turn off interpretation of comments.

3. Explicitly define the classes of each column using colClasses in read.table.

4. Setting multi.line=FALSE may also improve performance in scan.</description>
		<content:encoded><![CDATA[<p>The link from Dirk&#8217;s post covers a few of the options nicely:<br />
1. Set nrows to the number of records in your data (nmax in scan).</p>
<p>2. Set comment.char=&quot;&quot; to turn off interpretation of comments.</p>
<p>3. Explicitly define the classes of each column using colClasses in read.table.</p>
<p>4. Setting multi.line=FALSE may also improve performance in scan.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jay</title>
		<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/comment-page-1/#comment-51</link>
		<dc:creator>Jay</dc:creator>
		<pubDate>Fri, 27 Nov 2009 05:54:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=416#comment-51</guid>
		<description>I second Greg&#039;s message. colClasses should help a great deal. In addition, setting stringsAsFactors to FALSE should also help with speed.</description>
		<content:encoded><![CDATA[<p>I second Greg&#8217;s message. colClasses should help a great deal. In addition, setting stringsAsFactors to FALSE should also help with speed.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg</title>
		<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/comment-page-1/#comment-50</link>
		<dc:creator>Greg</dc:creator>
		<pubDate>Wed, 25 Nov 2009 20:02:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=416#comment-50</guid>
		<description>Did you try specifying colClasses in read.csv? If I remember correctly, that speeds things up significantly (see the Note section of ?read.csv). I tested this on a csv file with 1 million rows of observations.

I ran the following:
df = data.frame(x = rnorm(1e6), y = rnorm(1e6))
write.csv(df, file=&quot;df.csv&quot;)
rm(list=ls())
gc()
system.time(read.csv(&quot;df.csv&quot;, row.names = 1))
gc()
system.time(read.csv(&quot;df.csv&quot;,
     colClasses = c(&quot;character&quot;, &quot;numeric&quot;, &quot;numeric&quot;), row.names=1))

Also, I ran your sqldf code:
gc()
library(sqldf)
f &lt;- file(&quot;df.csv&quot;)
system.time(bigdf &lt;- sqldf(&quot;select * from f&quot;, dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)))

Results:
No colClasses:   elapsed = 31.345s
with colClasses: elapsed = 11.801s
sqldf:           elapsed = 29.565s

I&#039;m trying to remember the mechanism for read.csv, but I think it may import the data as &quot;character&quot; and then try to cast it to another type with as.*. The Note section of ?read.csv explains that &quot;character&quot; is much slower than &quot;integer&quot; (and &quot;numeric&quot;, in my example).


Nice post.

Greg
sessionInfo()
R version 2.9.0 (2009-04-17)
i386-apple-darwin8.11.1

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk     stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
 [1] sqldf_0-1.5     gsubfn_0.5-0    proto_0.3-8     RSQLite_0.7-1   DBI_0.2-4
 [6] cimis_0.1-2     RLastFM_0.1-4   RCurl_0.98-1    bitops_1.0-4.1  XML_2.5-3
[11] lattice_0.17-22

loaded via a namespace (and not attached):
[1] grid_2.9.0</description>
		<content:encoded><![CDATA[<p>Did you try specifying colClasses in read.csv? If I remember correctly, that speeds things up significantly (see the Note section of ?read.csv). I tested this on a csv file with 1 million rows of observations.</p>
<p>I ran the following:<br />
df = data.frame(x = rnorm(1e6), y = rnorm(1e6))<br />
write.csv(df, file=&quot;df.csv&quot;)<br />
rm(list=ls())<br />
gc()<br />
system.time(read.csv(&quot;df.csv&quot;, row.names = 1))<br />
gc()<br />
system.time(read.csv(&quot;df.csv&quot;,<br />
     colClasses = c(&quot;character&quot;, &quot;numeric&quot;, &quot;numeric&quot;), row.names=1))</p>
<p>Also, I ran your sqldf code:<br />
gc()<br />
library(sqldf)<br />
f &lt;- file(&quot;df.csv&quot;)<br />
system.time(bigdf &lt;- sqldf(&quot;select * from f&quot;, dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)))</p>
<p>Results:<br />
No colClasses:   elapsed = 31.345s<br />
with colClasses: elapsed = 11.801s<br />
sqldf:           elapsed = 29.565s</p>
<p>I&#039;m trying to remember the mechanism for read.csv, but I think it may import the data as &quot;character&quot; and then try to cast it to another type with as.*. The Note section of ?read.csv explains that &quot;character&quot; is much slower than &quot;integer&quot; (and &quot;numeric&quot;, in my example).</p>
<p>Nice post.</p>
<p>Greg<br />
sessionInfo()<br />
R version 2.9.0 (2009-04-17)<br />
i386-apple-darwin8.11.1</p>
<p>locale:<br />
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8</p>
<p>attached base packages:<br />
[1] tcltk     stats     graphics  grDevices datasets  utils     methods   base</p>
<p>other attached packages:<br />
 [1] sqldf_0-1.5     gsubfn_0.5-0    proto_0.3-8     RSQLite_0.7-1   DBI_0.2-4<br />
 [6] cimis_0.1-2     RLastFM_0.1-4   RCurl_0.98-1    bitops_1.0-4.1  XML_2.5-3<br />
[11] lattice_0.17-22</p>
<p>loaded via a namespace (and not attached):<br />
[1] grid_2.9.0</p>
]]></content:encoded>
	</item>
</channel>
</rss>
