<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication</title>
	<atom:link href="http://www.cerebralmastication.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Wed, 17 Feb 2010 16:54:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Real-World, Real-Time Analytics</title>
		<link>http://www.cerebralmastication.com/2010/02/real-world-real-time-analytics/</link>
		<comments>http://www.cerebralmastication.com/2010/02/real-world-real-time-analytics/#comments</comments>
		<pubDate>Wed, 17 Feb 2010 16:54:32 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[interview]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[rockstars]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=645</guid>
		<description><![CDATA[Stop wasting time reading my drivel. You need to head over to the DataWrangling.com blog and read Peter Skomoroch&#8217;s interview with Bradford Cross of FlightCaster.
Peter wrote up this interview back in August 2009, so I&#8217;m a little late to this party. There&#8217;s some really great quotes in this interview. Here&#8217;s a few of my fav [...]]]></description>
			<content:encoded><![CDATA[<p>Stop wasting time reading my drivel. You need to head over to the DataWrangling.com blog and <a href="http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data" onclick="pageTracker._trackPageview('/outgoing/www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data?referer=');">read Peter Skomoroch&#8217;s interview with Bradford Cross </a>of <a href="http://www.flightcaster.com/" onclick="pageTracker._trackPageview('/outgoing/www.flightcaster.com/?referer=');">FlightCaster</a>.</p>
<p>Peter wrote up this interview back in August 2009, so I&#8217;m a little late to this party. There are some really great quotes in it. Here are a few of my favorites from Cross:</p>
<blockquote><p>At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs.</p></blockquote>
<blockquote><p>Building layer upon layer of abstraction is a big key&#8230;    The technical term for this is “wrap the crap.”</p></blockquote>
<p>Here&#8217;s a problem I think anyone who works with data and models can relate to:</p>
<blockquote><p>I made a lot of mistakes early in my career in building trading models where I let my theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/real-world-real-time-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>You can Hadoop it! It&#8217;s elastic! Boogie woogie woog-ie!</title>
		<link>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/</link>
		<comments>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 18:31:23 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=592</guid>
		<description><![CDATA[I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng  (肏你娘) which your friend in grad school told you means &#8220;Live happy with many blessings&#8221;. Trust me, I&#8217;ve been hanging with Madam Wu and she told me [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_594" class="wp-caption alignleft" style="width: 271px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/bad_egg.png"><img class="size-full wp-image-594 " style="border: 1px solid black; margin: 3px;" title="I paid an old man in Chinatown $200 for this!" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/bad_egg.png" alt="" width="261" height="144" /></a><p class="wp-caption-text">This blog&#39;s name in Chinese! </p></div>
<p>I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng  (肏你娘) which your friend in grad school told you means &#8220;Live happy with many blessings&#8221;. Trust me, I&#8217;ve been hanging with Madam Wu and she told me it doesn&#8217;t mean that.</p>
<p>So how did I travel to the future to visit with Madam Wu, you ask? Well the short answer is Hadoop. Yeah, the cute little elephant. <a href="http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/">As I have told you before</a>, multicore makes your R code run fast by using worm holes to shoot your results back from the future. Well Hadoop actually takes you to the future on the back of an elephant and you can bring your own results back! I couldn&#8217;t make this up if I tried, so you know it&#8217;s true! And what&#8217;s fantastic about all of this is Hadoop works with R! And Amazon will let you rent a time traveling elephant through their <a href="http://aws.amazon.com/elasticmapreduce/" onclick="pageTracker._trackPageview('/outgoing/aws.amazon.com/elasticmapreduce/?referer=');">Elastic MapReduce service</a>! I think Amazon coined the term &#8220;Time Travel as a Service&#8221; or TTaaS  generally pronounced as &#8220;ta-tas&#8221; in <a href="http://www.savethetatas.com/" onclick="pageTracker._trackPageview('/outgoing/www.savethetatas.com/?referer=');">the industry</a>. If you are a CTO be sure and use this in your next &#8220;vision statement&#8221; pitch so everyone will know you&#8217;re hip to all this cloud stuff.</p>
<p>So you use R and you want to travel into the future on the back of an elephant to visit Madam Wu and get your model results back, don&#8217;t you? Well it&#8217;s a damn good thing you read this blog because I&#8217;m going to give you the keys to the Wu dynasty and a little 福寿 while we&#8217;re at it.</p>
<p>I&#8217;ve never had an original thought in my life so I started with <a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=128995&amp;#128995" onclick="pageTracker._trackPageview('/outgoing/developer.amazonwebservices.com/connect/thread.jspa?messageID=128995_amp_128995&amp;referer=');">this discussion </a>over at the AMZN E M/R discussion forum. Peter Skomoroch from <a href="http://www.datawrangling.com/" onclick="pageTracker._trackPageview('/outgoing/www.datawrangling.com/?referer=');">Data Wrangling </a>gives a very good example with all the data and code provided so you can run it yourself.  Pete&#8217;s example really shakes the  yáng guǐzi, as we say in the future. In addition I read the documentation for David Rosenberg&#8217;s <a href="http://docs.google.com/viewer?url=http%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2FHadoopStreaming%2FHadoopStreaming.pdf" onclick="pageTracker._trackPageview('/outgoing/docs.google.com/viewer?url=http_3A_2F_2Fcran.r-project.org_2Fweb_2Fpackages_2FHadoopStreaming_2FHadoopStreaming.pdf&amp;referer=');">HadoopStreaming package</a> which was good for insight, but I didn&#8217;t use the package as it&#8217;s really focused on the &#8216;big data&#8217; problem.</p>
<div id="attachment_639" class="wp-caption alignleft" style="width: 218px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/hadoop-elephant.jpeg"><img class="size-full wp-image-639 " style="border: 1px solid black; margin: 3px;" title="hadoop elephant" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/hadoop-elephant.jpeg" alt="" width="208" height="156" /></a><p class="wp-caption-text">That elephant is so freaking cute! </p></div>
<p>Prior to my foray into time travel, I knew that Hadoop could be used to process big text files and do something like rip out all the links and count them. But I thought that Hadoop was all about processing big data. I never paid attention to the big Hadoop elephant in the room because I don&#8217;t have big data. I have big CPU hogging models (mostly slow because I don&#8217;t code worth a shit). What got me reconsidering my world view was <cite></cite><a onclick="pageTracker._trackPageview('/outgoing/www.johnmyleswhite.com/?referer=');pageTracker._trackPageview('/outgoing/www.johnmyleswhite.com?referer=http%3A%2F%2Fwww.cerebralmastication.com%2F');" rel="external nofollow" href="http://www.johnmyleswhite.com/">John Myles White</a>&#8217;s comment on my <a href="http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/">multicore post </a>earlier. John encouraged me to look into running my simulations on AMZN&#8217;s E M/R service using Hadoop streaming. So instead of giving Hadoop  a big fat text file to parse, I just gave it a text file with 10,000 rows each containing an integer from 1:10,000. Then I refactored my R code to read a line from stdin, trim it down to just the integer, and then go run the simulation with that number. When done I had it serialize the resulting model output and return that to stdout. Hadoop takes care of chopping up the input and pulling together the output.</p>
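<p>Here&#8217;s a minimal sketch of that mapper pattern, under some loud assumptions: <code>runSimulation()</code> is a hypothetical stand-in for the real model, and hex-encoding the serialized object is just one assumed way to keep each result on a single line of stdout. It is not the code from the actual job.</p>

```r
# Sketch of a Hadoop streaming mapper: one integer per input line,
# one serialized result per output line.

# Hypothetical stand-in for the real simulation; returns a small list.
runSimulation <- function(drawNumber) {
  set.seed(drawNumber)
  list(draw = drawNumber, value = mean(rnorm(100)))
}

# Encode any R object as a single line of hex text (no embedded newlines).
toHexLine <- function(obj) {
  paste(as.character(serialize(obj, NULL)), collapse = "")
}

# Decode a hex line produced by toHexLine() back into the R object.
fromHexLine <- function(hexLine) {
  starts <- seq(1, nchar(hexLine), by = 2)
  bytes <- strtoi(substring(hexLine, starts, starts + 1), 16L)
  unserialize(as.raw(bytes))
}

# Trim the line down to the integer, run the sim, build a key<TAB>value line.
mapLine <- function(line) {
  drawNumber <- as.integer(gsub("[^0-9]", "", line))
  paste(drawNumber, toHexLine(runSimulation(drawNumber)), sep = "\t")
}

# Read an already-open connection line by line and print one result per line.
runMapper <- function(inCon) {
  while (length(line <- readLines(inCon, n = 1)) > 0) {
    if (grepl("[0-9]", line)) {   # skip blank or junk lines
      cat(mapLine(line), "\n", sep = "")
    }
  }
}
```

<p>To run it under Hadoop streaming or in a shell pipe, the script would end with something like <code>con &lt;- file("stdin", open = "r"); runMapper(con); close(con)</code>. Back on the desktop, <code>fromHexLine()</code> turns each value column back into the original R object.</p>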
<p>I learned a few &#8220;gotchas&#8221; or, as we say in the future: 臭婊子 (I think that should be plural). I&#8217;ll do a whole blog post on gotchas soon, but here are the bullet points:</p>
<ul>
<li>AMZN is currently running the version of Debian Linux named Lenny which has version 2.7.1 of R installed. No matter what the documentation says, don&#8217;t let Lenny tend to the rabbits.</li>
<li>Test all code by firing up an interactive Pig instance and logging in as &#8216;hadoop&#8217;. Instead of running Pig, run R and test your code. And as it says in the FAQ: &#8220;The Pig don&#8217;t care either way.&#8221; Which, despite sounding like buggery, is the truth.</li>
<li>If your code runs inside of R on a Hadoop instance, drop back to the command line on the Hadoop instance and run &#8216;cat infile.txt | yourMapper.R | sort | yourReducer.R &gt; outfile.txt&#8217;. This pipes your input file into your mapper file which does its thing and then pipes the results to your reducer file which then &#8220;pumps up the jam&#8221; into an output file.  What you see in the outfile.txt is what Hadoop will produce. So if you don&#8217;t like what you see, you better do some more coding.</li>
<li>You CAN load packages into R in a Hadoop instance running in AMZN E M/R. There are a few caveats, of course:</li>
</ul>
<ol>
<li>Your package has to work in R 2.7.1 (until AMZN upgrades to the next stable version of Debian).</li>
<li>As far as I can tell, all the output has to come out of stdout. So if you want to end up with R objects which you use for other things, you should get comfortable with the serialize() command and reading text files back into R. Which, as you can see <a href="http://stackoverflow.com/questions/2258511/r-serialize-objects-to-text-file-and-back-again" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/2258511/r-serialize-objects-to-text-file-and-back-again?referer=');">from this question</a>, I am not yet comfortable with.</li>
<li>There will be multiple instances of R running on every machine. So if they are all trying to download a package to the same directory, you are going to get file lock errors. One solution is to have each R instance create a directory for packages that includes the PID of that R instance. That way there&#8217;s no possibility for a conflict! Here&#8217;s an example of how I load the Hmisc package:</li>
<p><script src="http://gist.github.com/304262.js?file=AMZNloadPackage.R"></script></ol>
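<p>In case the embedded gist doesn&#8217;t show up in your feed reader, here is a rough sketch of the PID-directory idea. It is not the gist itself: the function names, the use of <code>tempdir()</code> as the base directory, and the CRAN mirror URL are all assumptions made for illustration.</p>

```r
# Sketch: give each R process its own package library so parallel
# installs on one machine never fight over the same directory.

# Build (and register) a library path that embeds this process's PID.
makePidLibrary <- function(baseDir = tempdir()) {
  libDir <- file.path(baseDir, paste("Rlib", Sys.getpid(), sep = "_"))
  dir.create(libDir, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(libDir, .libPaths()))  # search the private library first
  libDir
}

# Load a package, installing it into the private library first if needed.
# The mirror URL is an assumption; use whichever CRAN mirror you like.
loadIntoPidLibrary <- function(pkg, libDir,
                               repos = "https://cloud.r-project.org") {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg, lib = libDir, repos = repos)
    library(pkg, character.only = TRUE, lib.loc = libDir)
  }
}
```

<p>Usage would look something like <code>libDir &lt;- makePidLibrary()</code> followed by <code>loadIntoPidLibrary("Hmisc", libDir)</code> at the top of the mapper script.</p>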
<ul>
<li>You&#8217;ll probably want to provide some data to R. This is done by uploading your files to S3 and then passing the &#8220;-cacheFile&#8221; option to Hadoop. To get the plyr package to load in R 2.7.1 I had to edit the package. I then uploaded the altered package thusly:</li>
</ul>
<blockquote><p>-cacheFile s3n://rdata/plyr_0.1.9.tar.gz#plyr_0.1.9.tar.gz</p></blockquote>
<p>More to come later. I&#8217;ve gotta get back to the future.</p>
<div id="attachment_631" class="wp-caption alignleft" style="width: 314px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/christopher_lloyd.jpg"><img class="size-full wp-image-631" style="border: 1px solid black; margin: 3px;" title="christopher_lloyd" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/christopher_lloyd.jpg" alt="" width="304" height="224" /></a><p class="wp-caption-text">You hold the elephant and I&#39;ll plug this in. </p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using the R multicore package in Linux with wild and passionate abandon</title>
		<link>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/</link>
		<comments>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 19:57:20 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=562</guid>
		<description><![CDATA[One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It&#8217;s not uncommon for each of my simulations to take 20 seconds or more to complete (if you&#8217;re doing the math, that&#8217;s about 55 hours for 10K sims or approximately 453 games of solitaire). Initially I ran [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/amd_mc_processing.jpg"><img class="alignleft size-full wp-image-586" style="border: 0pt none; margin: 20px;" title="amd_mc_processing" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/amd_mc_processing.jpg" alt="" width="214" height="193" /></a>One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It&#8217;s not uncommon for each of my simulations to take 20 seconds or more to complete (if you&#8217;re doing the math, that&#8217;s about 55 hours for 10K sims or approximately 453 games of solitaire). Initially I ran my sims in R running on an <a href="http://www.virtualbox.org/" onclick="pageTracker._trackPageview('/outgoing/www.virtualbox.org/?referer=');">Oracle VirtualBox </a>(Oracle now owns VirtualBox! *gasp* ) running Ubuntu. Lately I&#8217;ve moved to running my sims on EC2 machines. I&#8217;m not yet doing Rmpi clustering, although that is on my roadmap. Currently I just fire up a couple of 8 core instances and run 5K sims on each one, then FTP the results back to my desktop. It&#8217;s not very sexy, but it gets the job done&#8230; I guess the same could be said of myself, except substitute &#8220;makes slurping sounds eating udon&#8221; in the place of &#8220;gets the job done.&#8221;</p>
<p>When running processor intensive crap (that&#8217;s a stochastic modeling term) the single threaded nature of R is painful. In Linux or Mac (i.e. NOT Windows) the <a href="http://www.rforge.net/doc/packages/multicore/multicore.html" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/doc/packages/multicore/multicore.html?referer=');">multicore package </a>is a real godsend. I did a quick code review and, from what I can tell, multicore exploits worm holes to travel back in time and reports your results in a fraction of the time you would expect it to take. Seriously. I expect that as the code matures my computer will fill up with simulation results from simulations which I have not even coded yet. It&#8217;s almost like magic, except without the rabbit and hat.</p>
<p>The crux of the package is a parallel-ized version of lapply() called mclapply(). I believe the mc stands for &#8216;magic carpet&#8217; and is an allusion to the worm hole technology. So how does one harness this package for <span style="text-decoration: line-through;">nefarious self interest </span>doing parallel operations in R? The ultra short answer is: write your R code so that the most processor intensive bit is done with an lapply() function. Then replace the lapply() with mclapply().  Of course you have to load the multicore package before you run it. But that&#8217;s basically it.</p>
<p>How I implement mclapply() is thusly: I build a table with all my random draws for my simulations. So if I have 20 variables and want to run 10,000 simulations then I&#8217;ll build a data frame with all 200,000 values (generally 10K rows and 21 columns for 20 variables + an index). The index keeps track of the draw number. Then I have code that performs the &#8216;valuation&#8217; based on a single observation of the 20 variables. I wrap the valuation step in a function and then call the valuation process 10,000 times with mclapply(). So it might look something like this:</p>
<blockquote><p>myOutput &lt;- mclapply( drawList, function(x) valuationReturns(drawNumber=x))</p></blockquote>
<p>The drawList object is simply a list of the possible indexes (i.e. 1:10000). When the code has iterated over each value from drawList the results will be in the myOutput object. Tada!</p>
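<p>For the curious, here&#8217;s a self-contained sketch of that whole setup. Assumptions are flagged in the comments: it loads <code>mclapply()</code> from the <code>parallel</code> package (which provides the same function), the valuation formula is a made-up toy, and the sim count is shrunk so it runs in seconds.</p>

```r
library(parallel)  # assumption: mclapply() taken from parallel, same interface

nSims <- 200   # the post uses 10,000; smaller here so the sketch runs quickly
nVars <- 20

# One row per draw: an index column plus 20 columns of random draws.
set.seed(1)
draws <- data.frame(index = seq_len(nSims),
                    matrix(runif(nSims * nVars), nrow = nSims))

# Hypothetical valuation: a toy formula standing in for the real model.
valuationReturns <- function(drawNumber) {
  oneDraw <- draws[draws$index == drawNumber, -1]  # drop the index column
  sum(unlist(oneDraw)) / nVars
}

# The list of draw numbers to iterate over.
drawList <- as.list(draws$index)

# mclapply() forks, so it needs Linux or Mac; fall back to one core on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mc.preschedule = FALSE hands out one job at a time instead of
# splitting the whole list across the cores up front.
myOutput <- mclapply(drawList,
                     function(x) valuationReturns(drawNumber = x),
                     mc.cores = cores, mc.preschedule = FALSE)
```

<p>Swap <code>mclapply</code> for plain <code>lapply</code> (dropping the two <code>mc.</code> arguments) and you should get the same list back, just slower.</p>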
<p>I recommend the <a href="http://htop.sourceforge.net/" onclick="pageTracker._trackPageview('/outgoing/htop.sourceforge.net/?referer=');">htop program </a>for tracking what&#8217;s going on with processor utilization in Linux (I presume Mac too if you ask Steve Jobs nicely). If everything is cranking well, and you have 8 cores, you might see an image that looks something like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/r-on-ec21.png"><img class="size-full wp-image-564 alignnone" title="r on ec2" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/r-on-ec21.png" alt="" width="535" height="400" /></a></p>
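<p>I don&#8217;t understand time travel, but I&#8217;ve found that I have better luck if I set mc.preschedule=FALSE. Apparently prescheduled magic carpets are finicky: with prescheduling the jobs get divvied up among the cores up front, so when some sims run longer than others a few cores finish early and sit idle. If I leave mc.preschedule at the default of TRUE then I find that often some of my cores go underutilized.</p>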
<p>Let me know if you have other multicore tips and tricks.</p>
<p>If you want to give me shit for running my simulations as root, feel free. I&#8217;m impervious to your &#8220;best practices&#8221; mumbo jumbo. La la la la la la!! Not listening!</p>
<p>Special thanks to <a href="http://www.cis.udel.edu/~cavazos/index.php?page=multicore-programming" onclick="pageTracker._trackPageview('/outgoing/www.cis.udel.edu/_cavazos/index.php?page=multicore-programming&amp;referer=');">John Cavazos over at the University of Delaware</a> from whom I stole the MC for Dummies image. John, you&#8217;re a gentleman and a humble scholar. Damn few of us left.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Remote Backup Fail and How to Silently Copy Files</title>
		<link>http://www.cerebralmastication.com/2010/01/remote-backup-fail-and-how-to-silently-copy-files/</link>
		<comments>http://www.cerebralmastication.com/2010/01/remote-backup-fail-and-how-to-silently-copy-files/#comments</comments>
		<pubDate>Tue, 19 Jan 2010 23:33:01 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[batch files]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=549</guid>
		<description><![CDATA[Today I called my firm&#8217;s desktop support to talk to them about how to get Iron Mountain Connected Backup to archive files located somewhere other than [C:\Documents and Settings\user\] and through talking with my desktop support guy I discovered that it doesn&#8217;t support that. Oh, and by the way it&#8217;s a &#8220;desktop backup&#8221; so it&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/01/pee-on-iron-mountain.jpg"><img class="alignleft size-full wp-image-554" style="margin: 10px; border: 1px solid black;" title="pee on iron mountain" src="http://www.cerebralmastication.com/wp-content/uploads/2010/01/pee-on-iron-mountain.jpg" alt="" width="210" height="210" /></a>Today I called my firm&#8217;s desktop support to talk to them about how to get <a href="http://backup.ironmountain.com/" onclick="pageTracker._trackPageview('/outgoing/backup.ironmountain.com/?referer=');">Iron Mountain Connected Backup</a> to archive files located somewhere other than [C:\Documents and Settings\user\] and through talking with my desktop support guy I discovered that it doesn&#8217;t support that. Oh, and by the way it&#8217;s a &#8220;desktop backup&#8221; so it&#8217;s not backing up my MS Access files or Outlook PST files. I told the guy that I had gone in and made sure it was backing those files up and they were checked in the UI. He informed me that it may look like they are backed up, but I can&#8217;t restore them. To which I responded <span style="color: #800000;"><strong>&#8220;Any developer who writes backup software that will back up a file it can&#8217;t restore should be kicked squarely in the nuts and then never allowed near a computer for life.&#8221;</strong></span> I&#8217;m not kidding. Honest to god I would kick an Iron Mountain developer right in the baby maker for passing this piece of shit program off as &#8220;enterprise ready.&#8221; The only way this program could be more useless is if it actually deleted files from my PC instead of backing them up. If the software is crippled because they are selling it as a &#8220;desktop backup&#8221; then, by god, they better tell me that in big fucking blinking letters and a marching band playing John Philip Sousa on my lap.</p>
<p><strong>Alternatives:</strong> I&#8217;ve been running <a href="http://www.jungledisk.com/" onclick="pageTracker._trackPageview('/outgoing/www.jungledisk.com/?referer=');">Jungle Disk</a> at home and really like it. I could use that at work except I have not set up an Amazon or RackSpace account with my work credit card. But I am in Chicago and my database server/ file server is in Dallas TX. So I decided to just mirror my laptop onto a shared drive on my server. There are lots of ways to do this, but the path I chose was to use <a href="http://en.wikipedia.org/wiki/Robocopy" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Robocopy?referer=');">RoboCopy, a command line copy tool from Microsoft</a> that is part of the Windows Server 2003 Resource Kit. I&#8217;m running XP and I wanted the mirroring of my machine to be invisible, silent, and scheduled. To do this I found I needed to take the following steps:</p>
<ol>
<li>Install RoboCopy</li>
<li>Create a batch file to mirror the directory I wanted</li>
<li>Create a windows script to call the batch silently</li>
<li>Schedule the windows script to run automagically</li>
</ol>
<p><strong>Install RoboCopy:</strong> Download the <a href="http://www.microsoft.com/downloads/details.aspx?familyid=9d467a69-57ff-4ae7-96ee-b18c4790cffd&amp;displaylang=en" onclick="pageTracker._trackPageview('/outgoing/www.microsoft.com/downloads/details.aspx?familyid=9d467a69-57ff-4ae7-96ee-b18c4790cffd_amp_displaylang=en&amp;referer=');">Windows Server 2003 Resource Kit</a> and install it. Very easy.</p>
<p><strong>Create a batch file to run RoboCopy</strong>: I named mine c:\backup.bat and it looks something like this:</p>
<blockquote>
<div id="_mcePaste">Set Source="C:\Documents and Settings\jdlong"</div>
<div id="_mcePaste">Set Dest="\\myDallasServer\backup\jdlong"</div>
<div id="_mcePaste">Robocopy %Source% %Dest% /MIR /Z /R:0  &gt;nul</div>
</blockquote>
<p>This simply sets the source and destination and then runs RoboCopy with the /MIR (mirror), /Z (restartable), and /R:0 (no retries on failed copies) switches invoked.</p>
<p><strong>Create a windows script</strong>: The problem with the batch file is that it is noisy when it runs. Even piping the output to nul it still produces a CMD window that stays up until it finishes running. That&#8217;s where the Windows Script file comes into play. It calls the batch file but hides the CMD window. I created a file called c:\runBackup.vbs that has this in it:</p>
<blockquote><p>Set WshShell = CreateObject("WScript.Shell")<br />
WshShell.Run Chr(34) &amp; "C:\backup.bat" &amp; Chr(34), 0<br />
Set WshShell = Nothing</p></blockquote>
<div><strong>Schedule the windows script:</strong> Control Panel -&gt; Scheduled Tasks. Then I created a new task that runs  c:\runBackup.vbs every night at 11PM. The only down side is that when I change my password I have to remember to change the password associated with the scheduled task or it will fail.</div>
<div>The only upside is that I figured out that Iron Mountain sucks prior to having data loss. I got lucky. Next week I am going to test my backup. And then test it every quarter after that. And I won&#8217;t depend on my corporate IT to do my backups.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/01/remote-backup-fail-and-how-to-silently-copy-files/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Struggling with apply() in R</title>
		<link>http://www.cerebralmastication.com/2009/12/struggling-with-apply-in-r/</link>
		<comments>http://www.cerebralmastication.com/2009/12/struggling-with-apply-in-r/#comments</comments>
		<pubDate>Fri, 11 Dec 2009 19:30:55 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[apply]]></category>
		<category><![CDATA[plyr]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=432</guid>
		<description><![CDATA[It&#8217;s common knowledge that I struggle wrapping my head around the apply functions in R. That is illustrated very clearly in the following discussion on Stack Overflow:

Dirk&#8217;s comment is actually spot on. I&#8217;ve asked the same damn question at least 4-5 times. Only I didn&#8217;t really understand it was the same question. That&#8217;s one of [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s common knowledge that I struggle wrapping my head around the apply functions in R. That is illustrated very clearly in the <a href="http://stackoverflow.com/questions/1355355/how-to-avoid-a-loop-in-r-selecting-items-from-a-list" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/1355355/how-to-avoid-a-loop-in-r-selecting-items-from-a-list?referer=');">following discussion </a>on Stack Overflow:</p>
<p><a href="http://stackoverflow.com/questions/1355355/how-to-avoid-a-loop-in-r-selecting-items-from-a-list" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/1355355/how-to-avoid-a-loop-in-r-selecting-items-from-a-list?referer=');"><img class="alignnone size-full wp-image-433" style="border: 2px solid black; margin: 2px;" title="apply_struggle" src="http://www.cerebralmastication.com/wp-content/uploads/2009/12/apply_struggle.PNG" alt="apply_struggle" width="536" height="217" /></a></p>
<p>Dirk&#8217;s comment is actually spot on. I&#8217;ve asked the same damn question at least 4-5 times. Only I didn&#8217;t really understand it was the same question. That&#8217;s one of the problems of not really being good at something; it&#8217;s hard to think abstractly about it. I&#8217;m not really good at R, so sometimes I don&#8217;t realize that multiple concepts are related. As I talk with other new users of R it&#8217;s clear that unless they come from a programming language with an apply-esque construct they likely are struggling with R. I think most of the confusion comes from a) not understanding what data format apply() is going to return and b) not understanding anonymous functions.</p>
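<p>Both stumbling blocks fit in a few lines. This is a small sketch with made-up toy data (the matrix and the <code>years</code> vector are assumptions, not anything from the screencast) showing what comes back from each function when you hand it an anonymous function:</p>

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# apply() over rows (MARGIN = 1) with a scalar result gives a plain vector.
rowTotals <- apply(m, 1, sum)

# When the anonymous function returns a vector, apply() gives a matrix,
# and each result becomes a COLUMN, which is a common source of surprise.
rowDoubled <- apply(m, 1, function(r) r * 2)

# lapply() always returns a list; sapply() simplifies to a vector when it can.
years <- c(2007, 2008, 2009)
asList   <- lapply(years, function(y) y + 1)
asVector <- sapply(years, function(y) y + 1)
```

<p>Calling <code>str()</code> on each result is a quick way to see which shape you actually got back before building anything on top of it.</p>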
<p>With this in mind I did a little screencast illustrating how this struggle plays out for new users. I also show why I use the plyr package for much of the stuff other folks use apply() for.</p>
<p>Any feedback you have is appreciated. This is my first stab at a screencast, so I am still trying to figure out the best approach/method as well as how many drinks puts me on the <a href="http://xkcd.com/323/" onclick="pageTracker._trackPageview('/outgoing/xkcd.com/323/?referer=');">Ballmer Peak</a>.</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="350" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="src" value="http://www.youtube.com/v/tdoIwXT_lP8" /><embed type="application/x-shockwave-flash" width="425" height="350" src="http://www.youtube.com/v/tdoIwXT_lP8"></embed></object></p>
<p><strong>EDIT</strong>: it&#8217;s been pointed out that I misuse some terminology a number of times. I should have named my year vector &#8220;yearVector.&#8221; By calling it &#8220;yearList&#8221; I then refer to the vector as a list. I was using &#8220;list&#8221; in the vernacular, but since list is a specific R data structure it is confusing that I named a vector a name with &#8220;list&#8221; in it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/12/struggling-with-apply-in-r/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Loading Big (ish) Data into R</title>
		<link>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/</link>
		<comments>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 23:14:06 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[sqldf]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=416</guid>
		<description><![CDATA[So for the rest of this conversation big data == 2 Gigs. Done. Don&#8217;t give me any of this &#8216;that&#8217;s not big, THIS is big&#8217; shit. There now, on with the cool stuff:
This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab [...]]]></description>
			<content:encoded><![CDATA[<p>So for the rest of this conversation big data == 2 Gigs. Done. Don&#8217;t give me any of this &#8216;that&#8217;s not big, THIS is big&#8217; shit. There now, on with the cool stuff:</p>
<p>This week on Twitter Vince Buffalo asked about loading a 2 gig comma-separated file (CSV) into R. (OK, he asked about tab-delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)</p>
<p><a href="http://twitter.com/vsbuffalo/statuses/5987999475" onclick="pageTracker._trackPageview('/outgoing/twitter.com/vsbuffalo/statuses/5987999475?referer=');"><img class="size-full wp-image-417 alignnone" style="border: 2px solid black; margin: 2px;" title="2gib" src="http://www.cerebralmastication.com/wp-content/uploads/2009/11/2gib.PNG" alt="2gib" width="512" height="316" /></a></p>
<p>I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. <a href="http://twitter.com/mpastell/statuses/6002853376" onclick="pageTracker._trackPageview('/outgoing/twitter.com/mpastell/statuses/6002853376?referer=');">Matti Pastell (@mpastell) recommended </a>using the <a href="http://code.google.com/p/sqldf/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/sqldf/?referer=');">sqldf </a>(SQL to data frame) package to do the import. I&#8217;ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn&#8217;t know it could import data, but that makes sense, given how sqldf works. How does it work? Well, sqldf sets up an instance of the <a href="http://www.sqlite.org/" onclick="pageTracker._trackPageview('/outgoing/www.sqlite.org/?referer=');">sqlite </a>database engine (sqlite is embedded, so there is no separate server to run), shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn&#8217;t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with <a href="http://old.nabble.com/Re%3A-Memory-Experimentation%3A-Rule-of-Thumb-%3D-10-15-Times-the-Memory-to12076668.html#a12078165" onclick="pageTracker._trackPageview('/outgoing/old.nabble.com/Re_3A-Memory-Experimentation_3A-Rule-of-Thumb-_3D-10-15-Times-the-Memory-to12076668.html_a12078165?referer=');">this discussion </a>on the R mailing list.</p>
<p>So enough background, here&#8217;s my setup: I have an Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here&#8217;s the code I ran to test:</p>
<blockquote><p>bigdf &lt;- data.frame(dim = sample(letters, 4e7, replace = TRUE), fact1 = rnorm(4e7), fact2 = rnorm(4e7, 20, 50))<br />
write.csv(bigdf, 'bigdf.csv', quote = FALSE)</p></blockquote>
<p>That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That&#8217;s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:</p>
<blockquote><p>library(sqldf)<br />
f &lt;- file("bigdf.csv")<br />
system.time(bigdf &lt;- sqldf("select * from f", dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)))</p></blockquote>
<p>That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don&#8217;t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.</p>
<p>So how long did it take to run? Just under 5 minutes.</p>
<p>So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:</p>
<blockquote><p>system.time(big.df &lt;- read.csv('bigdf.csv'))</p></blockquote>
<p>And I would love to tell you how long that took to run, but it&#8217;s been running <span style="text-decoration: line-through;">for half an hour</span> all night and I just don&#8217;t have that kind of patience.</p>
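<p>One more sketch on the dbname point from earlier. This assumes sqldf&#8217;s read.csv.sql helper, which as far as I know stages the file in a temporary on-disk sqlite database by default (dbname = tempfile()) and lets the query decide which rows ever reach R. The tiny file and its columns are made up for illustration; this is not the 40-million-row test:</p>
<pre>library(sqldf)

# small stand-in file: 1040 rows, each letter appears 40 times
small &lt;- data.frame(dim = rep(letters, times = 40), fact1 = rnorm(1040))
csv_path &lt;- tempfile(fileext = ".csv")
write.csv(small, csv_path, quote = FALSE, row.names = FALSE)

# the file is referred to as "file" inside the SQL statement;
# only the rows matching the where clause are returned to R
only_a &lt;- read.csv.sql(csv_path, sql = "select * from file where dim = 'a'")
nrow(only_a)</pre>
<p>Filtering in SQL like this means R only has to hold the rows you actually asked for, which is handy when the full file will not fit in memory.</p>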
<p>-JD</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using Amazon EC2 to Thwart Crappy Internal IT Services</title>
		<link>http://www.cerebralmastication.com/2009/11/using-amazon-ec2-to-thwart-crappy-internal-it-services/</link>
		<comments>http://www.cerebralmastication.com/2009/11/using-amazon-ec2-to-thwart-crappy-internal-it-services/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 15:28:26 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=391</guid>
		<description><![CDATA[
The alternative title of this blog post is &#8220;How to get your sorry ass fired by violating your internal IT policies.&#8221; So keep that in mind as you read this.
I say lots of silly crap. Twitter allows me the pleasure of sharing this blather with the world. I was a little surprised that of all [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://twitter.com/CMastication/status/5294564298" onclick="pageTracker._trackPageview('/outgoing/twitter.com/CMastication/status/5294564298?referer=');"><img class="alignleft size-full wp-image-393" style="margin: 6px;" title="ec2 tweet" src="http://www.cerebralmastication.com/wp-content/uploads/2009/11/ec2-tweet.PNG" alt="ec2 tweet" width="417" height="233" /></a></p>
<p>The alternative title of this blog post is &#8220;How to get your sorry ass fired by violating your internal IT policies.&#8221; So keep that in mind as you read this.</p>
<p>I say lots of silly crap. Twitter allows me the pleasure of sharing this blather with the world. I was a little surprised that of all the things I have said over the last few months the above Tweet received the most discussion. Apparently this tweet captured the imagination and consternation of some fellow Tweeters. I had people follow up with me and basically ask, &#8220;what do you mean?&#8221; Twitter is good for a sound bite, but less so for an elaborate answer. Which brings us to this:</p>
<p>What are the top ways Amazon EC2 can allow a business user to escape the manipulative and counterproductive grip of corporate IT? Well I&#8217;m glad you asked!</p>
<p><strong>1) Over-restrictive web filtering policies</strong>:  When I worked as a risk manager for a Fortune 500 insurance firm I was shocked on the first day when I could not search Google Groups. At the time Google Groups was one of my favorite resources for figuring out everything from SQL syntax to Excel formulas. The firm, like most firms, outsourced the filtering of web content. Apparently they signed up for &#8220;Super Freaking Restrictive&#8221; filtering. I could not even search the web for &#8220;Ubuntu&#8221; as all sites with the word Ubuntu in the title or with the word &#8220;Ubuntu&#8221; passed as a form submission were blocked. Apparently Ubuntu is not just a Linux distro, but also a militant organization of African computer programmers, or something. So how did I get around this with EC2? I would fire up an EC2 Ubuntu instance running Squid proxy before I left home, then ssh into the cloud from work and use a little SSH port forwarding to route my web traffic through the ssh connection and out via Squid. I set up my EC2 instance to listen for ssh on port 443 and my firm&#8217;s firewall would let the connection pass as it assumed it was simply SSL traffic into Amazon. Brilliant!</p>
<p><strong>2) Underpowered database servers: </strong>At another point I was responsible for data analytics on a portfolio of insurance policies. I had to join together data from multiple systems (underwriting, admin, claims, etc.). The firm was an Oracle shop and none of the Oracle machines had enough user space for me to make the big ass join that had to be made in order to cobble together my analytics. For a while I hobbled along using PROC SQL in SAS to bring all the data together inside of SAS running on a PC. Finally I just gave up and built my own data mart in the cloud. And I could totally cut my internal IT politics out of the system. Whew, once the politics and begging for resources were over I could kick ass at analytics without having to beg, borrow, and plead for permissions and space.</p>
<p><strong>3) Failure to back up desktop machines / inadequate shared drive space: </strong>Another experience I had was with a firm that decided it was a good policy to NOT back up desktop PCs at all. Each department was given shared drive space on a central server where &#8220;business critical&#8221; files were supposed to be kept (whatever the hell that means). Only the files on the central server were backed up. I was in the risk management department (ironically) and we had a whopping 100 MB allocated to us. Yes, this was 2004 and 100 MB was not enough to hold 2 years of risk reviews, not to mention any ad hoc analysis and all the supporting documents. So everyone had their desktop drives, at least one USB drive, and no off site backup. It was during this period that I discovered <a href="http://www.jungledisk.com/" onclick="pageTracker._trackPageview('/outgoing/www.jungledisk.com/?referer=');">Jungle Disk </a>which allows client side encrypted data to be backed up to Amazon! Off site backup problem solved! And, once again, corp IT cut out of the system. (Yes, this is a use of S3, not EC2.) By the way, I paid for backups out of my own pocket because I felt it was very important. Well, I did have the firm buy me books which I happily kept when I left. We&#8217;ll call it even.</p>
<p>Let me reiterate that all three of the above uses <span style="text-decoration: line-through;">may have</span> <span style="color: #000000;">put me in direct violation of my corporate IT policies. And let me also state that ultimately I found a job at a firm where internal IT sees their job as helping the business units get crap done. If you are an IT professional and you find yourself thinking, &#8220;damn, I have to make sure I restrict my users from all of these crafty uses of EC2&#8221; then, <strong><span style="color: #993300;">jackass, you are the problem with your firm&#8217;s IT department</span></strong>. If you see your job as stopping users then you are a useless burden on your firm and you should be not only fired, but spat upon. The way to prevent users from doing these and other &#8220;shadow IT&#8221; behaviors is to <strong><span style="color: #993300;">provide the IT services that help your users be awesome!</span></strong> If you do that then you don&#8217;t have to worry about what your users are up to. They&#8217;ll be too damn busy being awesome to have time to mess with Amazon EC2.</span></p>
<p>All the examples above took place at previous places of employment. I currently use Amazon EC2 in order to scale some of my analytics, but it is done with the knowledge and support of my internal IT team. They fully understand what I am doing and they want to help me be awesome at analysis. It&#8217;s amazing how much less time I am wasting these days now that I don&#8217;t have to be so creative about avoiding the manipulative and counterproductive intervention of my internal IT team.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/11/using-amazon-ec2-to-thwart-crappy-internal-it-services/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Kicking Ass with plyr</title>
		<link>http://www.cerebralmastication.com/2009/10/kicking-ass-with-plry/</link>
		<comments>http://www.cerebralmastication.com/2009/10/kicking-ass-with-plry/#comments</comments>
		<pubDate>Thu, 29 Oct 2009 16:17:45 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=381</guid>
		<description><![CDATA[Tonight (October 29, 2009) at 5:30 PM is the Chicago R meetup at Jak&#8217;s Tap. Here&#8217;s more info.  I&#8217;ll be making a presentation based on my earlier blog post about plyr. The presentation will only be 8 minutes long so I&#8217;ve had to pick and choose my info carefully. OK, who am I kidding? I [...]]]></description>
			<content:encoded><![CDATA[<p>Tonight (October 29, 2009) at 5:30 PM is the Chicago R meetup at Jak&#8217;s Tap. <a href="http://www.nabble.com/R-CMD---meetup%3DChicago---when%3DOct-29---where%3DJak'sTap-td25873425.html" onclick="pageTracker._trackPageview('/outgoing/www.nabble.com/R-CMD---meetup_3DChicago---when_3DOct-29---where_3DJak_sTap-td25873425.html?referer=');">Here&#8217;s more info</a>.  I&#8217;ll be making a presentation based on my <a href="http://www.cerebralmastication.com/?p=339">earlier blog post about plyr</a>. The presentation will only be 8 minutes long so I&#8217;ve had to pick and choose my info carefully. OK, who am I kidding? I had a couple of Schlitz (in a bottle!) for lunch over at <a href="http://chicago.menupages.com/restaurants/boni-vino/" onclick="pageTracker._trackPageview('/outgoing/chicago.menupages.com/restaurants/boni-vino/?referer=');">Boni Vinos</a> and slammed some slides together rather haphazardly. At any rate, here&#8217;s the presentation. I owe special thanks to all the folks on Twitter who reviewed these slides this week. A special shout out to <a href="http://twitter.com/kenahoo/status/5237929377" onclick="pageTracker._trackPageview('/outgoing/twitter.com/kenahoo/status/5237929377?referer=');">@kenahoo</a> who caught my one code typo! And also to <a href="http://twitter.com/hadleywickham/status/5235012169" onclick="pageTracker._trackPageview('/outgoing/twitter.com/hadleywickham/status/5235012169?referer=');">@hadleywickham</a> (author of plyr) who made some good suggestions, some of which I heeded. As a professor he should consider 15% application of his information to be a phenomenally high rate.</p>
<p>Click the graphic to download the slides as a PDF:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2009/10/r-plyr-20091029.pdf"><img class="size-full wp-image-382 alignnone" title="kickingasswithplry" src="http://www.cerebralmastication.com/wp-content/uploads/2009/10/kickingasswithplry.PNG" alt="kickingasswithplry" width="473" height="365" /></a></p>
<p>If you&#8217;re wondering what my favorite beer is, I&#8217;ll give you a secret. My favorite beer is #3. That&#8217;s the one that makes me a persuasive and articulate public speaker. #4 makes me dance well.</p>
<p>I hope to see you tonight.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/10/kicking-ass-with-plry/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Why Stack Overflow Careers is a Disruptive Innovation</title>
		<link>http://www.cerebralmastication.com/2009/10/why-stack-overflow-careers-is-a-disruptive-innovation/</link>
		<comments>http://www.cerebralmastication.com/2009/10/why-stack-overflow-careers-is-a-disruptive-innovation/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 20:27:12 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=370</guid>
		<description><![CDATA[Today Joel (typo fixed) Jeff Atwood announced via the Stack Overflow blog a new site called Stack Overflow Careers, a programming job site aimed at job hunters.  This is a complement to the job listing service which allows companies who are hiring to advertise on Stack Overflow. Seems like the world&#8217;s most &#8216;no shit&#8217; [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.stackoverflow.com/2009/10/introducing-stack-overflow-careers/" onclick="pageTracker._trackPageview('/outgoing/blog.stackoverflow.com/2009/10/introducing-stack-overflow-careers/?referer=');"><img class="alignnone" src="http://careers.stackoverflow.com/Content/cso/Img/logo.png" alt="" width="363" height="64" /></a>Today <span style="text-decoration: line-through;">Joel</span> (typo fixed) Jeff Atwood announced <a href="http://blog.stackoverflow.com/2009/10/introducing-stack-overflow-careers/" onclick="pageTracker._trackPageview('/outgoing/blog.stackoverflow.com/2009/10/introducing-stack-overflow-careers/?referer=');">via the Stack Overflow blog </a>a new site called Stack Overflow Careers, a programming job site aimed at job hunters.  This is a complement to the job listing service which allows companies who are hiring to advertise on Stack Overflow. Seems like the world&#8217;s most &#8216;no shit&#8217; idea, right? But this is more than a simple idea, this is <a href="http://en.wikipedia.org/wiki/Disruptive_technology" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Disruptive_technology?referer=');">disruptive innovation </a>in job hunting that will revolutionize  how programmers, and later other technical talent, get jobs. Why so revolutionary? Because Stack Overflow is the venue where programmers actually prove their mettle. What SO has that no other job site can offer hiring companies is access to not just resumes, but code samples, writing samples, and conversations which the candidates have provided through use of SO. I&#8217;ve hired technical talent and it is hard to figure out if the person can get stuff done and do critical thinking. It&#8217;s even hard to figure out their level of mastery of a given technology without spending a lot of time testing. But with SO I can see how people answer questions, how they ask questions, etc.</p>
<p>The value of this service to job hunters is high enough that I think SO is smart to charge for providing the link between a user&#8217;s resume and their profile. The price is low (&lt;$10/yr right now), but this could provide a meaningful revenue stream for SO without cluttering the site with ad noise.</p>
<p>This should also have positive feedback effects on Stack Overflow. I anticipate that the quality of answers and questions both will go up. While I&#8217;m still stupid enough to post a comment that says &#8220;hey dumb ass&#8230;&#8221; most users will be smarter than that. Even if it&#8217;s just in the back of someone&#8217;s mind that a potential employer might be reading, it should improve their answers.</p>
<p>Now, I&#8217;m an economist, so I think there will be some unintended consequences to the Stack Overflow Careers site. First unintended consequence will be an increase in the number of new accounts. Why? Because everyone has to set up a &#8220;dumbass&#8221; account for when they want to ask stupid questions or make a sarcastic remark. But then when someone is about to post a 750 word answer with 5 pictures and 3 screen shots they will put on their &#8220;game day&#8221; account and knock that SOB out of the park! But over time most folks will let the idea of future job hunts slip out of their conscious mind and will basically use their main account for all activity. Plus even dumb questions earn karma points, so why not!</p>
<p>The second unintended consequence will be a plethora of clinger-on-ers. These will range from the simple: I give Dice 2 months before they allow some sort of linking to a SO profile; to the more complex: some job site is going to not only link to a SO profile but also scrape all that user&#8217;s SO activity and dynamically link it to their resume. I am unsure how the Stack Overflow Creative Commons license will protect SO from this type of activity. Stack Overflow may end up spending more money on lawyers than they ever anticipated.</p>
<p>The third unintended consequence is that Stack Overflow, the company, just went from taking nickels away from Experts Exchange to taking dollars away from job listing sites. I hope that SO has both lawyers and sales people because now that real money is on the table they will have some savvy competition.</p>
<p>Good luck Joel and Jeff. I&#8217;m rooting for you!</p>
<p><strong>Disclaimer: </strong>I have no financial interest in Stack Overflow at all. I have been involved in getting the R programming language community involved in Stack Overflow because I think it is the best information exchange platform available and I want more R information to be exchanged. Plus it pissed me off that the R community was using a freaking mailing list as its primary Q/A platform. A mailing list. In 2009. Honestly.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/10/why-stack-overflow-careers-is-a-disruptive-innovation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Fast Intro to PLYR for R</title>
		<link>http://www.cerebralmastication.com/2009/08/a-fast-intro-to-plyr-for-r/</link>
		<comments>http://www.cerebralmastication.com/2009/08/a-fast-intro-to-plyr-for-r/#comments</comments>
		<pubDate>Thu, 27 Aug 2009 20:00:52 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[plyr]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=339</guid>
		<description><![CDATA[I&#8217;m not dead yet! Although it has been rumored that I am. The new job is going great and I&#8217;m thrilled to be with a new firm doing interesting work alongside smart people. It makes me seem smarter by simple association.
There&#8217;s been a lot going on recently in the R user community. There was an [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-340" title="pliers" src="http://www.cerebralmastication.com/wp-content/uploads/2009/08/pliers.jpg" alt="pliers" width="235" height="87" />I&#8217;m not dead yet! Although it has been rumored that I am. The new job is going great and I&#8217;m thrilled to be with a new firm doing interesting work alongside smart people. It makes me seem smarter by simple association.</p>
<p>There&#8217;s been a lot going on recently in the R user community. There was an <a href="http://en.oreilly.com/oscon2009/public/schedule/detail/10432" onclick="pageTracker._trackPageview('/outgoing/en.oreilly.com/oscon2009/public/schedule/detail/10432?referer=');">R flash mob of Stack Overflow</a> which resulted in a noticeable increase in the number of <a href="http://stackoverflow.com/questions/tagged/r" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/tagged/r?referer=');">R questions and answers</a> on SO. I&#8217;ve been blown away by the quality of the <a href="http://stackoverflow.com/questions/tagged?tagnames=r&amp;sort=stats&amp;pagesize=50" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/tagged?tagnames=r_amp_sort=stats_amp_pagesize=50&amp;referer=');">participants</a>. There have also been more quality discussions on Twitter, which are being <a href="http://twitter.com/#search?q=%23rstats" onclick="pageTracker._trackPageview('/outgoing/twitter.com/_search?q=_23rstats&amp;referer=');">tagged with #rstats</a>. These changes in the community have <a href="http://www.iq.harvard.edu/blog/sss/archives/2009/08/the_changing_na.shtml" onclick="pageTracker._trackPageview('/outgoing/www.iq.harvard.edu/blog/sss/archives/2009/08/the_changing_na.shtml?referer=');">not gone unnoticed</a>.</p>
<p>Recently I posted a question about how to do a &#8216;group by&#8217; in a regression with R. I had a way I had been doing this but I was suspicious there was a better way. <a href="http://stackoverflow.com/questions/1169539/linear-regression-and-group-by-in-r/1214432#1214432" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/1169539/linear-regression-and-group-by-in-r/1214432_1214432?referer=');">One of the answers</a> proposed using the PLYR package. I think I had seen the plyr package a few times but never really understood it. Although I didn&#8217;t select this as my top answer, it prompted me to look into PLYR more. What I discovered was really interesting.</p>
<p>The <a href="http://had.co.nz/plyr/" onclick="pageTracker._trackPageview('/outgoing/had.co.nz/plyr/?referer=');">PLYR package </a>is a tool for doing split-apply-combine (SAC) procedures. I&#8217;m very fluent in SQL so the best analogy for me was the GROUP BY statement in SQL. PLYR adds very little new functionality to R. What it does do is take the process of SAC and make it cleaner, more tidy and easier. I think I&#8217;m not the only one who wants a clean and tidy SAC. Here&#8217;s a quick example of making some summary stats using PLYR:</p>
<pre># install.packages("plyr")  # run this if you don't have the package already
library(plyr)

# make some example data: three numeric columns and two grouping columns
dd &lt;- data.frame(matrix(rnorm(216), 72, 3),
                 c(rep("A", 24), rep("B", 24), rep("C", 24)),
                 c(rep("J", 36), rep("K", 36)))
colnames(dd) &lt;- c("v1", "v2", "v3", "dim1", "dim2")

# ddply is the plyr function: split dd by dim1 and dim2,
# then take the mean of v1 within each piece
ddply(dd, c("dim1", "dim2"), function(df) mean(df$v1))</pre>
<p>result:</p>
<blockquote>
<pre>  dim1 dim2          V1
1    A    J  0.02554362
2    B    J -0.15839675
3    B    K -0.06077399
4    C    K -0.02326776</pre>
</blockquote>
<p>PLYR functions have a neat naming convention. The first two letters of the function name tell you the input and output data types, respectively. The one I use the most is ddply, which takes a data frame in and spits out a data frame. Let me see if I can explain what ddply is doing. The first argument, dd, is the input data frame. The next argument is the &#8220;group by&#8221; variables. Since I want to group by two variables I send them as a vector (that&#8217;s what the c() bit does).</p>
<p>What threw me for a loop initially was the third argument, the function. What I found myself trying (unsuccessfully) was just using mean(v1) as the third argument. If I did that, R would spit at me and bring the marital status of my parents into question. I discovered that the problem was that ddply was splitting the data by my &#8216;group by&#8217; variables and then wanted to pass each of the resulting data frames to a function. So what does it mean to pass a data frame to mean(v1)? Yeah, it means Jack Crap, that&#8217;s what it means. In one of the PLYR examples I saw they were using these inline functions. The idea behind function(df) mean(df$v1) is to create a function to which we can pass a data frame and get out a meaningful result. The subset (or split) of the data gets passed to the function and that subset is then known as df. mean(df$v1) calculates the mean of v1 and returns an answer. ddply holds on to the answers of each split and then reassembles them all in the end. Slick, ey?</p>
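<p>If it helps to see the moving parts, here is a rough base R sketch of the same split, apply, and combine steps done by hand. The toy data frame and its column names are made up for illustration, and this is only meant to mirror the idea, not how plyr is implemented internally:</p>
<pre># toy data: two groups with values that are easy to check by eye
toy &lt;- data.frame(g = c("A", "A", "B", "B"), v1 = c(1, 3, 10, 20))

# split: one small data frame per group
pieces &lt;- split(toy, toy$g)

# apply: the same kind of function(df) you would hand to ddply
means &lt;- lapply(pieces, function(df) mean(df$v1))

# combine: stitch the per-group answers back into one data frame
result &lt;- data.frame(g = names(means), V1 = unname(unlist(means)))
result</pre>
<p>Each element of pieces is a data frame, which is why the function has to accept a data frame rather than a bare column name.</p>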
<p>As with most things in R the idea can be extended: have the function return a vector of results in order to compute many statistics on each split:</p>
<pre>ddply(dd, c("dim1", "dim2"), function(df)
  c(mean(df$v1), mean(df$v2), mean(df$v3),
    sd(df$v1), sd(df$v2), sd(df$v3)))</pre>
<p>The result looks like this:</p>
<blockquote>
<pre>  dim1 dim2          V1        V2         V3        V4        V5       V6
1    A    J  0.02554362 0.3400250  0.1206980 0.9326424 1.0044120 1.100762
2    B    J -0.15839675 0.3662559 -0.1784193 0.7447807 0.8752162 1.105258
3    B    K -0.06077399 0.5184403 -0.2076024 1.0385107 1.0609706 1.153153
4    C    K -0.02326776 0.2639328  0.1352895 0.7940938 0.9025207 1.072460</pre>
</blockquote>
<p>Pretty nifty.</p>
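<p>The V1 through V6 column names are not very descriptive. A small sketch of one way around that, assuming (as I understand ddply to behave) that the names of a returned vector become the output column names. The labels mean_v1 and sd_v1 are my own, and the data is a cut-down stand-in for dd:</p>
<pre>library(plyr)

# cut-down example data with the same grouping layout as above
dd2 &lt;- data.frame(v1 = rnorm(72), v2 = rnorm(72),
                  dim1 = rep(c("A", "B", "C"), each = 24),
                  dim2 = rep(c("J", "K"), each = 36))

# naming the elements of the returned vector labels the result columns
stats &lt;- ddply(dd2, c("dim1", "dim2"), function(df)
  c(mean_v1 = mean(df$v1), sd_v1 = sd(df$v1)))
names(stats)</pre>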
<p>The author of PLYR is Hadley Wickham who is also the man behind <a href="http://had.co.nz/ggplot2/" onclick="pageTracker._trackPageview('/outgoing/had.co.nz/ggplot2/?referer=');">GGPLOT2</a>. If you like PLYR or GGPLOT2 then you should immediately <a href="http://www.amazon.com/gp/product/0387981403?ie=UTF8&amp;tag=hadlwick-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0387981403" onclick="pageTracker._trackPageview('/outgoing/www.amazon.com/gp/product/0387981403?ie=UTF8_amp_tag=hadlwick-20_amp_linkCode=as2_amp_camp=1789_amp_creative=390957_amp_creativeASIN=0387981403&amp;referer=');">buy Hadley&#8217;s GGPLOT2 book on Amazon</a>. But be sure to use the link on this site or the link on <a href="http://had.co.nz/ggplot2/book/" onclick="pageTracker._trackPageview('/outgoing/had.co.nz/ggplot2/book/?referer=');">Hadley&#8217;s site </a>so he gets the Amazon Associates payment. The authors I have talked to told me they get more from the Associates program than they get from publishing royalties.</p>
<p>My father is a retired pilot turned crop farmer. He ALWAYS carries a pair of pliers in a nylon pouch on his belt. I can see that Hadley&#8217;s PLYR package is going to become my proverbial &#8216;belt pliers.&#8217;</p>
<p>Of course if I wrote an R package I&#8217;d have to name it <a href="http://www.paratech.us/html/FET/Crw/CrwSRB/ParatechNFSRB.htm" onclick="pageTracker._trackPageview('/outgoing/www.paratech.us/html/FET/Crw/CrwSRB/ParatechNFSRB.htm?referer=');">Super RamBar</a>, cause that&#8217;s just how I roll.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/08/a-fast-intro-to-plyr-for-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
