<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication &#187; sqlite</title>
	<atom:link href="http://www.cerebralmastication.com/tag/sqlite/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Fri, 16 Jul 2010 22:07:12 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Loading Big (ish) Data into R</title>
		<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/</link>
		<comments>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 23:14:06 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[sqldf]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=416</guid>
		<description><![CDATA[So for the rest of this conversation big data == 2 Gigs. Done. Don&#8217;t give me any of this &#8216;that&#8217;s not big, THIS is big&#8217; shit. There now, on with the cool stuff:
This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab [...]]]></description>
			<content:encoded><![CDATA[<p>So for the rest of this conversation big data == 2 Gigs. Done. Don&#8217;t give me any of this &#8216;that&#8217;s not big, THIS is big&#8217; shit. There now, on with the cool stuff:</p>
<p>This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)</p>
<p><a href="http://twitter.com/vsbuffalo/statuses/5987999475" onclick="pageTracker._trackPageview('/outgoing/twitter.com/vsbuffalo/statuses/5987999475?referer=');"><img class="size-full wp-image-417 alignnone" style="border: 2px solid black; margin: 2px;" title="2gib" src="http://www.cerebralmastication.com/wp-content/uploads/2009/11/2gib.PNG" alt="2gib" width="512" height="316" /></a></p>
<p>I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. <a href="http://twitter.com/mpastell/statuses/6002853376" onclick="pageTracker._trackPageview('/outgoing/twitter.com/mpastell/statuses/6002853376?referer=');">Matti Pastell (@mpastell) recommended </a>using the <a href="http://code.google.com/p/sqldf/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/sqldf/?referer=');">sqldf </a>(SQL to data frame) package to do the import. I&#8217;ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn&#8217;t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an embedded <a href="http://www.sqlite.org/" onclick="pageTracker._trackPageview('/outgoing/www.sqlite.org/?referer=');">sqlite </a>database (there is no separate server process), shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn&#8217;t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with <a href="http://old.nabble.com/Re%3A-Memory-Experimentation%3A-Rule-of-Thumb-%3D-10-15-Times-the-Memory-to12076668.html#a12078165" onclick="pageTracker._trackPageview('/outgoing/old.nabble.com/Re_3A-Memory-Experimentation_3A-Rule-of-Thumb-_3D-10-15-Times-the-Memory-to12076668.html_a12078165?referer=');">this discussion </a>on the R mailing list.</p>
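<p>For anyone who hasn&#8217;t seen sqldf used the ordinary way, here is a minimal sketch of running SQL against a data frame that is already in memory. The data frame toy and its values are made up for illustration; the only thing taken from the package is the sqldf() call itself.</p>
<blockquote><p>library(sqldf)<br />
# toy is a hypothetical three-row data frame, not the data from this post<br />
toy &lt;- data.frame(dim = c("a", "a", "b"), fact1 = c(1, 3, 10))<br />
# sqldf copies toy into SQLite, runs the query, and hands back a data frame<br />
means &lt;- sqldf("select dim, avg(fact1) as mean_fact1 from toy group by dim order by dim")</p></blockquote>
<p>The result, means, is a plain R data frame with one row per value of dim.</p>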
<p>So enough background, here&#8217;s my setup: I have an Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here&#8217;s the code I ran to test:</p>
<blockquote><p>bigdf &lt;- data.frame(dim=sample(letters, 4e7, replace=TRUE), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))<br />
write.csv(bigdf, 'bigdf.csv', quote = FALSE)</p></blockquote>
<p>That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That&#8217;s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:</p>
<blockquote><p>library(sqldf)<br />
f &lt;- file("bigdf.csv")<br />
system.time(bigdf &lt;- sqldf("select * from f", dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)))</p></blockquote>
<p>That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don&#8217;t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.</p>
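<p>A related sketch, on a tiny stand-in file rather than the 1.8GiB one: sqldf also ships read.csv.sql(), which stages the file in an SQLite database and returns only what the query selects, so rows you don&#8217;t need never become an R data frame. The file, the 1,000-row size, and the where clause below are assumptions for illustration, not something I timed.</p>
<blockquote><p>library(sqldf)<br />
# small is a hypothetical stand-in for the 40-million-row data frame<br />
small &lt;- data.frame(dim = sample(letters, 1000, replace = TRUE), fact1 = rnorm(1000), fact2 = rnorm(1000, 20, 50))<br />
csv_path &lt;- tempfile(fileext = ".csv")<br />
write.csv(small, csv_path, quote = FALSE, row.names = FALSE)<br />
# inside the query the CSV is referred to as "file"; dbname = tempfile() keeps the staging DB on disk<br />
a_rows &lt;- read.csv.sql(csv_path, sql = "select * from file where dim = 'a'", dbname = tempfile())</p></blockquote>
<p>Here a_rows holds only the records whose dim column is the letter a.</p>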
<p>So how long did it take to run? Just under 5 minutes.</p>
<p>So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:</p>
<blockquote><p>system.time(big.df &lt;- read.csv(&#8216;bigdf.csv&#8217;))</p></blockquote>
<p>And I would love to tell you how long that took to run, but it&#8217;s been running <span style="text-decoration: line-through;">for half an hour</span> all night and I just don&#8217;t have that kind of patience.</p>
<p>-JD</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Choosing an SQL Engine for Analytics</title>
		<link>http://www.cerebralmastication.com/2009/03/chosing-an-sql-engine-for-analytics/</link>
		<comments>http://www.cerebralmastication.com/2009/03/chosing-an-sql-engine-for-analytics/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 21:37:57 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[firebird]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[sql server]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=212</guid>
		<description><![CDATA[I&#8217;ve been struggling for a while on which database to use for my working data. I used to use MS Access quite a lot. The problems with MS Access include but are not limited to:

2 GB file size limit, at least historically
Versions change with each edition of MS Office
Sort of tough to write SQL scripts
Very [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been struggling for a while on which database to use for my working data. I used to use MS Access quite a lot. The problems with MS Access include but are not limited to:</p>
<ul>
<li>2 GB file size limit, at least historically</li>
<li>Versions change with each edition of MS Office</li>
<li>Sort of tough to write SQL scripts</li>
<li>Very little automation, i.e. compression, backup, etc.</li>
<li>Windows only</li>
</ul>
<p>I used Oracle for a few years as a result of my previous employer being an Oracle shop. I then switched to SQL Server when I changed jobs. A full blown client/server DB really does not make a lot of sense for much of what I do. I don&#8217;t run a transactional data store. I don&#8217;t need to have dozens of users hooked to the DB. And I do sometimes need access to my data when I am not hooked to the mother-ship. So I could run the free version of SQL Server on my laptop or run MySQL on my laptop, but both of these options rub me the wrong way. Why? I do a lot of data analysis in R which is RAM intensive. Running a DB server on my laptop means that some fraction of my RAM is going to be taken up by the db server software which is hanging out waiting for me to throw requests at it. I could manually hack around this by starting the server before I load data and then killing it after the data is loaded. That&#8217;s just too big of a pain in my rectum. Oh yeah, one more design requirement: I want to be able to push the whole DB out to a storage blob at Amazon and pound on it using EC2 machines, running Linux. Plus I am cheap and don&#8217;t want to pay a lot.</p>
<p>I&#8217;ll probably end up with a model where I keep some master data sets on a client/server DB and then I will replicate chunks of that to my laptop into my serverless db. I&#8217;ll probably also put output from my desktop db back into the server after analytic work is  done.</p>
<p>I knew about SQLite because of an <a href="http://www.twit.tv/floss26" onclick="pageTracker._trackPageview('/outgoing/www.twit.tv/floss26?referer=');">interview with its author, Richard Hipp on FLOSS Weekly</a>. There&#8217;s also a <a href="http://video.google.com/videoplay?docid=-5160435487953918649" onclick="pageTracker._trackPageview('/outgoing/video.google.com/videoplay?docid=-5160435487953918649&amp;referer=');">video of Hipp talking at the Googleplex</a>. I wish that guy was my neighbor. He seems like the type of guy who would shovel your walk for you then apologize for not doing it perfectly by sending over homemade cookies. Unrelated to the cookies, I really like that SQLite is <a href="http://en.wikipedia.org/wiki/Type_system#Strong_and_weak_typing" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Type_system_Strong_and_weak_typing?referer=');">weakly typed</a>.  I&#8217;m a free spirit like that.</p>
<p>I did some digging for SQLite alternatives and came up with <a href="http://stackoverflow.com/questions/417917/alternatives-to-sqlite" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/417917/alternatives-to-sqlite?referer=');">some stuff at StackOverflow</a>. You can read the post, but the short version is that it reminded me of Firebird. I&#8217;m immediately drawn to Firebird since its logo looks so dang much like the Ruger logo:</p>
<p><img class="alignleft size-full wp-image-214" title="downloads" src="http://www.cerebralmastication.com/wp-content/uploads/2009/03/downloads.jpg" alt="downloads" width="327" height="76" /></p>
<p><img class="alignleft size-full wp-image-215" title="fb-facts" src="http://www.cerebralmastication.com/wp-content/uploads/2009/03/fb-facts.png" alt="fb-facts" width="312" height="70" /></p>
<p>But can Firebird be run serverless? If I have to install a server then I might just as well run MySQL.</p>
<p><a href="http://www.oracle.com/database/berkeley-db/index.html" onclick="pageTracker._trackPageview('/outgoing/www.oracle.com/database/berkeley-db/index.html?referer=');">Berkeley DB </a>seems like another option worth investigating, although I am not sure if I can use it without really embedding it in another program the way that I can with SQLite.</p>
<p>SQLite gets bonus points for having native R drivers meaning that I don&#8217;t have to go through a connector technology like ODBC. This is important enough that I should probably make that a requirement. I think Berkeley DB has support in R as well. I know for a fact that writing back to SQL Server through the R ODBC package (RODBC) is like pushing a car with a rope, but only slower. Plus I don&#8217;t know how to make ODBC work on Linux. Not rocket science, I am sure, but still one more thing I would have to learn before I do that which I am paid to do.</p>
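<p>To make the native driver point concrete, here is a minimal sketch using the DBI and RSQLite packages. The file path, the table name results, and the numbers in it are all hypothetical; the point is only that the whole database is one ordinary file and nothing has to be running in the background.</p>
<blockquote><p>library(DBI)<br />
# the database is a single file; a temp path is used here purely for illustration<br />
db_path &lt;- tempfile(fileext = ".sqlite")<br />
con &lt;- dbConnect(RSQLite::SQLite(), db_path)<br />
# write a made-up data frame to a table, then query it back with plain SQL<br />
dbWriteTable(con, "results", data.frame(run = 1:3, score = c(0.5, 0.7, 0.9)))<br />
best &lt;- dbGetQuery(con, "select run, score from results where score &gt; 0.6 order by run")<br />
dbDisconnect(con)</p></blockquote>
<p>After the disconnect the file at db_path is still there, so it can be copied to another machine like any other file.</p>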
<p>I&#8217;m going to do some testing. It looks like I should compare the real-life performance of SQLite and Firebird on my own data. More to come on this, I am sure.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/03/chosing-an-sql-engine-for-analytics/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
