<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication &#187; howto</title>
	<atom:link href="http://www.cerebralmastication.com/tag/howto/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Wed, 07 Dec 2011 13:08:46 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Fitting Distribution X to Data From Distribution Y</title>
		<link>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/</link>
		<comments>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/#comments</comments>
		<pubDate>Thu, 12 May 2011 20:31:31 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=1009</guid>
		<description><![CDATA[I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I&#8217;m not a &#8220;closed form&#8221; kinda guy. I&#8217;m more of a &#8220;numerical simulation&#8221; type of fellow. So I whipped up a little R code to illustrate the process then we changed [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/05/rstudio-plot.png"><img class="alignleft size-medium wp-image-1010" title="rstudio-plot" src="http://www.cerebralmastication.com/wp-content/uploads/2011/05/rstudio-plot-300x240.png" alt="" width="300" height="240" /></a>I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I&#8217;m not a &#8220;closed form&#8221; kinda guy. I&#8217;m more of a &#8220;numerical simulation&#8221; type of fellow. So I whipped up a little R code to illustrate the process then we changed the parameters of the gamma distribution to see how it impacted fit. An exercise like this is what I call building a &#8220;toy model&#8221; and I think this is invaluable as a method for building intuition and a visceral understanding of data.<br />
Here&#8217;s some example code which we played with:</p>
<blockquote>
<div style="overflow:auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family:monospace;"><a href="http://inside-r.org/r-doc/base/set.seed" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/set.seed?referer=');"><span style="color: #003399; font-weight: bold;">set.seed</span></a><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span>
x <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/stats/rgamma" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/rgamma?referer=');"><span style="color: #003399; font-weight: bold;">rgamma</span></a><span style="color: #009900;">&#40;</span>1e5<span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">.2</span><span style="color: #009900;">&#41;</span>
<a href="http://inside-r.org/r-doc/graphics/plot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/plot?referer=');"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;"># normalize the gamma so it's between 0 &amp; 1</span>
<span style="color: #666666; font-style: italic;"># .0001 added because having exactly 1 causes fail</span>
xt <span style="">&lt;-</span> x <span style="">/</span> <span style="color: #009900;">&#40;</span> <a href="http://inside-r.org/r-doc/base/max" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/max?referer=');"><span style="color: #003399; font-weight: bold;">max</span></a><span style="color: #009900;">&#40;</span> x <span style="color: #009900;">&#41;</span> <span style="">+</span> <span style="color: #cc66cc;">.0001</span> <span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;"># fit a beta distribution to xt</span>
<a href="http://inside-r.org/r-doc/base/library" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/library?referer=');"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">&#40;</span> <a href="http://inside-r.org/packages/cran/MASS" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/packages/cran/MASS?referer=');"><span style="">MASS</span></a> <span style="color: #009900;">&#41;</span>
fit.beta <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/MASS/fitdistr" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/MASS/fitdistr?referer=');"><span style="color: #003399; font-weight: bold;">fitdistr</span></a><span style="color: #009900;">&#40;</span> xt<span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;beta&quot;</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/stats/start" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/start?referer=');"><span style="color: #003399; font-weight: bold;">start</span></a> = <a href="http://inside-r.org/r-doc/base/list" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/list?referer=');"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">&#40;</span> shape1=<span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> shape2=<span style="color: #cc66cc;">5</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span>
&nbsp;
x.beta <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/stats/rbeta" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/rbeta?referer=');"><span style="color: #003399; font-weight: bold;">rbeta</span></a><span style="color: #009900;">&#40;</span>1e5<span style="color: #339933;">,</span>fit.beta<span style="">$</span>estimate<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>fit.beta<span style="">$</span>estimate<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">## plot the pdfs on top of each other</span>
<a href="http://inside-r.org/r-doc/graphics/plot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/plot?referer=');"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>xt<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<a href="http://inside-r.org/r-doc/graphics/lines" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/lines?referer=');"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>x.beta<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/col" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/col?referer=');"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: #0000ff;">&quot;red&quot;</span> <span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">## plot the qqplots</span>
<a href="http://inside-r.org/r-doc/stats/qqplot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/qqplot?referer=');"><span style="color: #003399; font-weight: bold;">qqplot</span></a><span style="color: #009900;">&#40;</span>xt<span style="color: #339933;">,</span> x.beta<span style="color: #009900;">&#41;</span></pre>
</div>
</div>
<p><a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org" onclick="pageTracker._trackPageview('/outgoing/www.inside-r.org/pretty-r?referer=');">Created by Pretty R at inside-R.org</a></p>
</blockquote>
<p>It&#8217;s not illustrated above, but it&#8217;s probably useful to transform the simulated data (x.beta) back into the pre-normalized space by multiplying by max( x ) + .0001. (I swore I&#8217;d never say this, but I lied.) I&#8217;ll leave that as an exercise for the reader.</p>
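<p>For anyone who wants to check their answer, here&#8217;s a minimal sketch of that back-transformation. It repeats the fit from above so it runs on its own; scale.factor and x.sim are names made up for this illustration, not part of the original script.</p>
<pre>library(MASS)

set.seed(3)
x &lt;- rgamma(1e5, 2, .2)

# keep the scaling constant around so the transform can be undone
scale.factor &lt;- max(x) + .0001
xt &lt;- x / scale.factor

fit.beta &lt;- fitdistr(xt, "beta", start = list(shape1 = 2, shape2 = 5))
x.beta &lt;- rbeta(1e5, fit.beta$estimate[["shape1"]], fit.beta$estimate[["shape2"]])

# undo the normalization: simulated values back on the gamma's scale
x.sim &lt;- x.beta * scale.factor

# compare the original gamma draws with the rescaled beta draws
plot(density(x))
lines(density(x.sim), col = "red")</pre>
<p>Since x.beta lives between 0 and 1, x.sim can never exceed max( x ) + .0001, which is worth keeping in mind when looking at the right tail.</p>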
<p>Another very useful tool in building a mental road map of distributions is the <a href="http://www.johndcook.com/distribution_chart.html" onclick="pageTracker._trackPageview('/outgoing/www.johndcook.com/distribution_chart.html?referer=');">graphical chart of distribution relationships that John Cook introduced me to</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Shell scripting EC2 for fun and profit</title>
		<link>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/</link>
		<comments>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/#comments</comments>
		<pubDate>Fri, 06 May 2011 20:57:40 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=993</guid>
		<description><![CDATA[Lately I&#8217;ve been doing some work with creating ad-hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis&#8217;s very cool doRedis backend for the R foreach package. But that&#8217;s a whole other post. What I was scratching my [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.thinkgeek.com/tshirts-apparel/unisex/frustrations/374d/" onclick="pageTracker._trackPageview('/outgoing/www.thinkgeek.com/tshirts-apparel/unisex/frustrations/374d/?referer=');"><img class="alignleft size-full wp-image-994" style="border: 1px solid black; margin: 2px;" title="lg-go-away-tshirt" src="http://www.cerebralmastication.com/wp-content/uploads/2011/05/lg-go-away-tshirt.jpg" alt="" width="179" height="218" /></a>Lately I&#8217;ve been doing some work with creating ad-hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis&#8217;s very cool <a href="http://cran.r-project.org/web/packages/doRedis/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/doRedis/index.html?referer=');">doRedis backend</a> for the R <a href="http://cran.r-project.org/web/packages/foreach/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/foreach/index.html?referer=');">foreach package</a>. But that&#8217;s a whole other post. What I was scratching my head about today was that I&#8217;d really just like to, with a single command, spin up an EC2 instance, wait for it to come up, and then ssh into it. I do this iteration about 20 times a day when I&#8217;m testing things, so it seemed to make sense to shell script it.<br />
To do this, you need the EC2 command line tools installed on your workstation. In Ubuntu that&#8217;s as easy as `sudo apt-get install ec2-api-tools`</p>
<p>So here&#8217;s a short shell script to spin up an instance, wait 30 seconds, then connect:<br />
<script src="http://gist.github.com/959780.js"></script></p>
<p>If you&#8217;re reading this through an RSS reader, you can see the script over at <a href="https://gist.github.com/959780" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/959780?referer=');">github</a>.</p>
<p>Obviously you&#8217;ll need to change the parameters at the top of the script to suit your needs. But since this was a bit of a pain in the donkey hole for me to figure out, I thought I would share.</p>
<p>If you want to help out, I&#8217;d love you to enlighten me on how to have the script figure out if an instance has finished booting so I could eliminate the sleep step.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Details of two-way sync between two Ubuntu machines</title>
		<link>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/</link>
		<comments>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/#comments</comments>
		<pubDate>Mon, 18 Apr 2011 20:48:32 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=966</guid>
		<description><![CDATA[In a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two-way file syncing. This is the detail of how I set up two machines, both running Ubuntu 10.10, to perform two-way sync where a file change on either machine [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png"><img class="alignleft size-full wp-image-956" title="sync" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png" alt="" width="128" height="128" /></a>In a <a href="http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/">previous post</a> I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two-way file syncing. This is the detail of how I set up two machines, both running Ubuntu 10.10, to perform two-way sync where a file change on either machine will result in that change being replicated on the other machine.</p>
<p>I initially tried running Unison on BOTH my laptop and the server and had the server Unison set to sync with my laptop back through an SSH reverse proxy. After testing this for a while I discovered this is totally the wrong way to do it. The problem is that the Unison process makes temp directories and files in the file system of the target. So my Unison job on the laptop would be trying to sync files and, in the process, create temp files which would kick off a Unison sync on the server which would make temp files on the laptop&#8230; I think you can see how convoluted this gets.</p>
<p>So a much better solution is to only run Unison from one machine (I chose my laptop) and have the other machine (server in my case) send an SSH command (over the aforementioned reverse proxy) to the laptop asking the laptop to kick off a Unison sync. This way all of the syncs happen from the laptop.</p>
<p>So, in short, both machines run lsyncd which monitors files for changes. I keep up an SSH tunnel with reverse port forwarding which forwards a remote machine port back to my laptop&#8217;s port 22 (SSH). Unison is launched ONLY from my laptop. When a change happens on my laptop, lsyncd fires off a Unison sync from my laptop that syncs it with the server. When a file changes on the server, the lsyncd job on the server makes a connection to my laptop via ssh and fires off a Unison sync between my laptop and the server.</p>
<p>Here&#8217;s an example of my lsyncd config scripts:</p>
<p><strong>Laptop:</strong></p>
<blockquote><p>settings = {<br />
logfile    = "/home/jal/lsyncd/laptop/lsyncd.log",<br />
statusFile = "/home/jal/lsyncd/laptop/lsyncd.status",<br />
maxDelays  = 15,<br />
--nodaemon   = true,<br />
}</p>
<p>runUnison2 = {<br />
maxProcesses = 1,<br />
delay = 15,<br />
onAttrib  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onCreate  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onDelete  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onModify  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onMove    = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
}</p>
<p>sync{runUnison2, source="/home/jal/Documents"}</p></blockquote>
<p><strong>Server:</strong></p>
<blockquote><p>settings = {<br />
logfile    = "/home/jal/lsyncd/server/lsyncd.log",<br />
statusFile = "/home/jal/lsyncd/server/lsyncd.status",<br />
maxDelays  = 15,<br />
--nodaemon   = true,<br />
}</p>
<p>runUnison2 = {<br />
maxProcesses = 1,<br />
delay = 15,<br />
onAttrib  = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onCreate  = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onDelete  = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onModify  = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onMove    = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
}</p>
<p>sync{runUnison2, source="/home/jal/Documents"}</p></blockquote>
<p>Keep in mind that I am using version 2 of lsyncd which can be downloaded here: <a href="http://code.google.com/p/lsyncd/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/lsyncd/?referer=');">http://code.google.com/p/lsyncd/</a></p>
<p>The version of lsyncd available in the Ubuntu repo is version 1.x which does not use the same config format as I illustrate above. However, if you run into dependency issues with v2, the easiest thing to do is install the repo version which will install dependencies and then manually download and install v2 from the above URL.</p>
<p>My reverse port forwarding set up looks like this:</p>
<blockquote><p>autossh -2 -4 -X -R 5432:localhost:22 12.34.56.78</p></blockquote>
<p>The -R bit forwards remote port 5432 to my laptop&#8217;s port 22, which is the SSH port. So if I run ssh localhost -p 5432 on my server, what actually happens is that I am sshing from the remote machine to my laptop.</p>
<p><strong>Notes:</strong></p>
<ul>
<li>The IP address of my server in this example is 12.34.56.78.</li>
<li>Don&#8217;t try and sync the directories where the lsyncd logs are kept. That will result in an endless sync cycle as each machine keeps noticing changes. Don&#8217;t ask me how I know this.</li>
<li>The command to start the sync on the laptop is &#8220;lsyncd /home/jal/lsyncd/laptop/configfile&#8221; where configfile is the above lsyncd configuration file.</li>
<li>lsyncd could, conceivably, tell Unison to sync only the part of the directory tree that changed. I have not been able to make that feature work right, however. And it only takes Unison a few seconds to sync, so I&#8217;ve not worried about it.</li>
</ul>
<p>This has greatly sped up my <a href="http://rstudio.org" onclick="pageTracker._trackPageview('/outgoing/rstudio.org?referer=');">RStudio</a> based workflow when doing analysis with R. Now when I change files on my server using RStudio they are immediately (well it waits 15 seconds) replicated to my local machine and vice versa!</p>
<p>Good luck and if you have any suggestions please post a comment!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Connecting to SQL Server from R using RJDBC</title>
		<link>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/</link>
		<comments>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/#comments</comments>
		<pubDate>Wed, 22 Sep 2010 18:00:26 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[sql server]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=891</guid>
		<description><![CDATA[A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sql_server_2008_logo.png"><img class="alignleft size-medium wp-image-901" style="border: 2px solid black; margin: 3px;" title="sql_server_2008_logo" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sql_server_2008_logo-300x187.png" alt="" width="235" height="146" /></a>A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days and never was able to connect from R on Ubuntu to my corp SQL Server.</p>
<p><a href="http://www.fosstrading.com/" onclick="pageTracker._trackPageview('/outgoing/www.fosstrading.com/?referer=');">Joshua Ulrich</a> was kind enough to help me out by pointing me to <a href="http://www.rforge.net/RJDBC/" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/RJDBC/?referer=');">RJDBC</a> which scared me a little (I&#8217;m easily spooked) because it involves Java. The only thing I know about Java is every time I touch it I <a href="http://stackoverflow.com/questions/3311940/r-rjava-package-install-failing" target="_blank" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/3311940/r-rjava-package-install-failing?referer=');">spend days trying to get environment variables</a> loaded just exactly the way it wants them. But Josh assured me that it was really not that hard. Here&#8217;s the short version:</p>
<p><a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a737000d-68d0-4531-b65d-da0f2a735707&amp;displaylang=en" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.microsoft.com/downloads/en/details.aspx?FamilyID=a737000d-68d0-4531-b65d-da0f2a735707_amp_displaylang=en&amp;referer=');">Download the SQL Server JDBC driver from Microsoft</a>. There are Windows and *nix versions, so grab whichever you need. Unpack the driver in a known location (I used /etc/sqljdbc_2.0/). Then access the driver from R like so:</p>
<pre>library(RJDBC)

# load the Microsoft JDBC driver from wherever it was unpacked
drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "/etc/sqljdbc_2.0/sqljdbc4.jar")
conn &lt;- dbConnect(drv, "jdbc:sqlserver://serverName", "userID", "password")

# then build a query and run it
sqlText &lt;- "SELECT * FROM myTable"
queryResults &lt;- dbGetQuery(conn, sqlText)

# close the connection when finished
dbDisconnect(conn)</pre>
<p>I have a few scripts that I want to run on both my Ubuntu laptop and my Windows Server. To accommodate that I made my scripts compatible with both by doing the following to my drv line:</p>
<pre># pick the driver location based on the operating system
if (.Platform$OS.type == "unix") {
  driverJar &lt;- "/etc/sqljdbc_2.0/sqljdbc4.jar"
} else {
  # keep the path on one line: a line break inside the string breaks the path
  driverJar &lt;- "C:/Program Files/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar"
}
drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", driverJar)</pre>
<p>Obviously if you unpacked your drivers in different locations you&#8217;ll need to tweak the code to fit your life situation.</p>
<p><span style="color: #ff6600;"><strong>EDIT: </strong>A MUCH better place to put the JDBC drivers in Ubuntu would be the /opt/ path as opposed to /etc/ which I used above. By convention the /opt/ directory is where add-on software that isn&#8217;t managed by the package manager belongs, and /etc/ should be reserved for system configuration files. I&#8217;m not familiar with all the conventions in Ubuntu (or even Linux in general) so I didn&#8217;t realize this until I got some reader feedback. </span></p>
<p>Be forewarned, RJDBC is pretty damn slow and it appears to no longer be in active development. For my use case, RODBC was clearly faster. But RJDBC works for me in Ubuntu and that was my biggest need.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Principal Component Analysis (PCA) vs Ordinary Least Squares (OLS): A Visual Explanation</title>
		<link>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/</link>
		<comments>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 17:11:27 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=866</guid>
		<description><![CDATA[Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was &#8220;Thanks to my college class I can do the math, but what does it MEAN?&#8221;
I felt like this a number of times in my life. Many of my classes were focused on the technical implementations they kinda [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sa.png"><img class="size-full wp-image-876 alignleft" style="border: 2px solid black; margin: 3px;" title="sa" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sa.png" alt="" width="299" height="82" /></a>Over at stats.stackexchange.com recently, a <a href="http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/2700#2700" onclick="pageTracker._trackPageview('/outgoing/stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/2700_2700?referer=');">really interesting question was raised</a> about principal component analysis (PCA). The gist was &#8220;Thanks to my college class I can do the math, but what does it <strong>MEAN</strong>?&#8221;</p>
<p>I have felt like this a number of times in my life. Many of my classes were so focused on the technical implementations that they kinda missed the section titled &#8220;Why I give a shit.&#8221; A perfect example was my Mathematics Principles of Economics class which taught me how to manually calculate a bordered Hessian but, for the life of me, I have no idea why I would ever want to calculate such a monster. OK, that&#8217;s a lie. Later in life I learned that bordered Hessian matrices are a second derivative test used in some optimizations. Not that I would EVER do that shit by hand. I&#8217;d use some R package and blindly trust that it was coded properly.</p>
<p>So back to PCA: as I was reading the aforementioned stats question I was reminded of a recent presentation that <a href="http://quanttrader.info/public/" onclick="pageTracker._trackPageview('/outgoing/quanttrader.info/public/?referer=');">Paul Teetor</a> gave at an August Chicago R User Group meeting. In his presentation on spread trading with R he showed a graphic that illustrated the difference between OLS and PCA. I took some notes and went home and made sure I could recreate the same thing. If you have wondered what makes OLS and PCA different, open up an R session and play along.</p>
<p><strong>Your Independent Variable Matters:</strong></p>
<p>The first observation to make is that regressing x ~ y is not the same as y ~ x even in a simple univariate regression. You can illustrate this by doing the following:</p>
<blockquote><p>set.seed(2)<br />
x &lt;- 1:100</p>
<p>y &lt;- 20 + 3 * x<br />
e &lt;- rnorm(100, 0, 60)<br />
y &lt;- 20 + 3 * x + e</p>
<p>plot(x,y)<br />
yx.lm &lt;- lm(y ~ x)<br />
lines(x, predict(yx.lm), col="red")</p>
<p>xy.lm &lt;- lm(x ~ y)<br />
lines(predict(xy.lm), y, col="blue")</p></blockquote>
<p>You should get something that looks like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSols.png"><img class="size-medium wp-image-867 alignnone" title="olsVSols" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSols-280x300.png" alt="" width="280" height="300" /></a></p>
<p>So it&#8217;s obvious they give different lines. But why? Well, OLS minimizes the squared error between the dependent variable and the model&#8217;s prediction, measured along the dependent variable&#8217;s axis. Two of these errors are illustrated for the y ~ x case in the following picture:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS1.png"><img class="alignnone size-medium wp-image-870" title="OLS1" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS1-280x300.png" alt="" width="280" height="300" /></a></p>
<p>But when we flip the model around and regress x ~ y then OLS minimizes these errors:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS2.png"><img class="alignnone size-medium wp-image-871" title="OLS2" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS2-280x300.png" alt="" width="280" height="300" /></a></p>
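<p>A quick way to convince yourself of this is to compute the slopes by hand. Here&#8217;s a minimal sketch, assuming the same simulated data as above (slope.yx and slope.xy are just names picked for the illustration): the y ~ x slope is cov(x, y) / var(x), while the x ~ y slope is cov(x, y) / var(y), so the two lines only coincide when the points fall exactly on a line.</p>
<pre>set.seed(2)
x &lt;- 1:100
y &lt;- 20 + 3 * x + rnorm(100, 0, 60)

# OLS slope when y is the dependent variable (vertical errors)
slope.yx &lt;- cov(x, y) / var(x)
# OLS slope when x is the dependent variable (horizontal errors)
slope.xy &lt;- cov(x, y) / var(y)

# both should match what lm() reports
all.equal(slope.yx, unname(coef(lm(y ~ x))[2]))
all.equal(slope.xy, unname(coef(lm(x ~ y))[2]))

# drawn on the same x/y axes, the x ~ y line has slope 1 / slope.xy
c(yx = slope.yx, xy.flipped = 1 / slope.xy)</pre>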
<p>Ok, so what about PCA?</p>
<p>Well let&#8217;s draw the first principal component the old school way:</p>
<blockquote><p>#normalize means and cbind together<br />
xyNorm &lt;- cbind(x=x-mean(x), y=y-mean(y))<br />
plot(xyNorm)</p>
<p>#covariance<br />
xyCov &lt;- cov(xyNorm)<br />
eigenValues &lt;- eigen(xyCov)$values<br />
eigenVectors &lt;- eigen(xyCov)$vectors</p>
<p>plot(xyNorm, ylim=c(-200,200), xlim=c(-200,200))<br />
lines(xyNorm[,"x"], eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[,"x"])<br />
lines(xyNorm[,"x"], eigenVectors[2,2]/eigenVectors[1,2] * xyNorm[,"x"])</p>
<p># the largest eigenValue is the first one<br />
# so that&#8217;s our principal component.<br />
# but the principal component is in normalized terms (mean=0)<br />
# and we want it back in real terms like our starting data<br />
# so let&#8217;s denormalize it<br />
plot(x, y)<br />
lines(x, (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[,"x"]) + mean(y))<br />
# that looks right. line through the middle as expected</p>
<p># what if we bring back our other two regressions?<br />
lines(x, predict(yx.lm), col="red")<br />
lines(predict(xy.lm), y, col="blue")</p></blockquote>
<p>PCA minimizes the error orthogonal (perpendicular) to the model line. So the first principal component looks like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/pca.png"><img class="alignnone size-medium wp-image-872" title="pca" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/pca-280x300.png" alt="" width="280" height="300" /></a></p>
<p>The two yellow lines, as in the previous images, are examples of two of the errors which the routine minimizes.</p>
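<p>If you want to check the perpendicular-error claim with numbers rather than pictures, here is a small sketch. It rebuilds the toy data, gets the first principal component from prcomp() in base R instead of the hand-rolled eigen() steps, and scores three candidate lines through the point (mean(x), mean(y)) by their summed squared perpendicular distance. The helper orthSS() and the variable names are made up for this illustration.</p>
<pre>
set.seed(2)
x &lt;- 1:100
y &lt;- 20 + 3 * x + rnorm(100, 0, 60)

# first principal component; prcomp() centers the data by default
pc1 &lt;- prcomp(cbind(x, y))$rotation[, 1]
pcaSlope &lt;- unname(pc1["y"] / pc1["x"])

# the two OLS fits, both expressed as slopes in the (x, y) plane
olsYX &lt;- unname(coef(lm(y ~ x))[2])
olsXY &lt;- unname(1 / coef(lm(x ~ y))[2])

# sum of squared perpendicular distances from the points to a line
# with slope m that passes through the centroid of the data
orthSS &lt;- function(m) {
  u &lt;- x - mean(x)
  v &lt;- y - mean(y)
  sum((v - m * u)^2) / (1 + m^2)
}

sapply(c(olsYX = olsYX, pca = pcaSlope, olsXY = olsXY), orthSS)
</pre>
<p>The pca entry should come out smallest of the three, which is the whole point: each OLS line wins on its own vertical or horizontal error, and the principal component wins on the perpendicular one.</p>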
<p>So if we plot all three lines on the same scatter plot we can see the differences:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSpca.png"><img class="alignnone size-medium wp-image-873" title="olsVSpca" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSpca-280x300.png" alt="" width="280" height="300" /></a></p>
<p>The x ~ y OLS and the first principal component are pretty close, but click on the image to get a better view and you will see they are not exactly the same.</p>
<p>All the code from the above examples can be found in a <a href="http://gist.github.com/582767" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/582767?referer=');">gist over at GitHub.com</a>. It&#8217;s best to copy and paste from the gist, as WordPress sometimes mangles my quotes and breaks the codez.</p>
<p>The best introduction to PCA which I have read is the one I link to on Stats.StackExchange.com. It&#8217;s titled <a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf" onclick="pageTracker._trackPageview('/outgoing/www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf?referer=');">&#8220;A Tutorial on Principal Components Analysis&#8221; by Lindsay I Smith</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Third, and Hopefully Final, Post on Correlated Random Normal Generation (Cholesky Edition)</title>
		<link>http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/</link>
		<comments>http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 18:03:21 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[risk]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=824</guid>
		<description><![CDATA[When I did a brief post three days ago I had no plans on writing two more posts on correlated random number generation. But I&#8217;ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, Gappy, calls me out and says, &#8220;the way mensches do multivariate (log)normal [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_825" class="wp-caption alignleft" style="width: 260px"><a href="http://www.sabix.org/bulletin/b39/vie.html" onclick="pageTracker._trackPageview('/outgoing/www.sabix.org/bulletin/b39/vie.html?referer=');"><img class="size-medium wp-image-825 " title="39-cholesky" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/39-cholesky-250x300.jpg" alt="" width="250" height="300" /></a><p class="wp-caption-text">André-Louis Cholesky is my homeboy</p></div>
<p>When I did a <a href="http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/">brief post three days ago</a> I had no plans on writing two more posts on correlated random number generation. But I&#8217;ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, <a href="http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/comment-page-1/#comment-5068">Gappy, calls me out</a> and says, &#8220;the way mensches do multivariate (log)normal variates is via Cholesky. It’s simple, instructive, and fast.&#8221;  And I think we&#8217;re all smart enough to read through Mr. Gappy&#8217;s comment and see that he&#8217;s saying I&#8217;m a complicated, opaque, and slow, גוי‎. My wife called and said his list would be more accurate if he added &#8216;emotionally detached.&#8217; I have no idea what she means.</p>
<p>At any rate, in response to Gappy&#8217;s comment, here is the third verse (same as the first). The crux of the change is the following lines:</p>
<pre>
<blockquote>

# shift the mean of ourData to zero
ourData0 &lt;- as.data.frame(sweep(ourData,2,colMeans(ourData),"-"))

#Cholesky Decomposition of the covariance matrix
C &lt;- chol(nearPD(cov(ourData0))$mat)

#create a matrix of random standard normals
Z &lt;- matrix(rnorm(n * ncol(ourData)), ncol(ourData))

#multiply the standard normals by the transpose of the Cholesky
X &lt;- t(C) %*% Z

myDraws &lt;- data.frame(as.matrix(t(X)))
names(myDraws) &lt;- names(ourData)

# we still need to shift the means of the samples.

# shift the mean of the draws over to match the starting data
myDraws &lt;- as.data.frame(sweep(myDraws,2,colMeans(ourData),"+"))
</blockquote>
</pre>
<p><em><strong>Edit: </strong>When I first published this example, I didn&#8217;t shift the means prior to taking the cov(). I&#8217;ve since corrected that. Also thanks to @fdaapproved on Twitter who pointed out that I can replace the loop I originally used with myDraws &lt;- as.data.frame(sweep(t(X),2,colMeans(ourData),"+"))</em></p>
<p>This method, which uses Cholesky decomposition, is how I initially learned to create correlated random draws. I think this method is comparable to the mvrnorm() method. mvrnorm() is handy because it wraps everything above in one single line of code. But the above method is reliant only on the Matrix package and that&#8217;s only for the nearPD() function. If you are familiar with the guts of the mvrnorm() function and the chol() function, I&#8217;d love for you to comment on any technical differences. I looked briefly at the code for both and quickly realized my matrix math was rusty enough that it was going to take a while for me to sort through the code.</p>
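<p>Since ourData and n come from the full script, here is a self-contained sketch of the same Cholesky steps using a made-up three-column data set (toyData and its columns are hypothetical). It skips nearPD() on the assumption that the sample covariance is already positive definite, so it needs nothing beyond base R.</p>
<pre>
set.seed(42)
n &lt;- 1e5

# made-up correlated starting data (a stand-in for ourData)
a &lt;- rnorm(500, 10, 2)
b &lt;- 0.5 * a + rnorm(500, 0, 1)
c3 &lt;- rnorm(500, -3, 4)
toyData &lt;- data.frame(a = a, b = b, c = c3)

# upper-triangular Cholesky factor: t(C) %*% C equals cov(toyData)
C &lt;- chol(cov(toyData))

# standard normals: one row per variable, one column per draw
Z &lt;- matrix(rnorm(n * ncol(toyData)), nrow = ncol(toyData))

# correlate the draws, flip to one row per draw, then add the means back
X &lt;- t(t(C) %*% Z)
toyDraws &lt;- as.data.frame(sweep(X, 2, colMeans(toyData), "+"))
names(toyDraws) &lt;- names(toyData)

# the simulated covariance should land close to the original
round(cov(toyData), 2)
round(cov(toyDraws), 2)
</pre>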
<p>If you want the whole script you can find it embedded below <a href="http://gist.github.com/562567" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/562567?referer=');">and on Github</a>.</p>
<script src="http://gist.github.com/562567.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Even Simpler Multivariate Correlated Simulations</title>
		<link>http://www.cerebralmastication.com/2010/08/even-simpler-multivariate-correlated-simulations/</link>
		<comments>http://www.cerebralmastication.com/2010/08/even-simpler-multivariate-correlated-simulations/#comments</comments>
		<pubDate>Tue, 31 Aug 2010 15:17:27 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[risk]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=804</guid>
		<description><![CDATA[So after yesterday&#8217;s post on Simple Simulation using Copulas I got a very nice email that basically begged the question, &#8220;Dude, why are you making this so hard?&#8221; The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/08/Screenshot-Untitled-Window-3.png"><img class="alignleft size-full wp-image-803" title="mvrnorm example" src="http://www.cerebralmastication.com/wp-content/uploads/2010/08/Screenshot-Untitled-Window-3.png" alt="" width="341" height="221" /></a>So after yesterday&#8217;s post on <a href="http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/">Simple Simulation using Copulas</a> I got a very nice email that basically begged the question, &#8220;Dude, why are you making this so hard?&#8221; The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from the MASS package. Well I did a quick</p>
<blockquote><p>?mvrnorm</p></blockquote>
<p>and, I&#8217;ll be damned, he&#8217;s right! The advantage of using a copula is the ability to simulate correlation structures where the correlation is different for different levels of values. So that gives the flexibility to make the tails of the distributions more correlated, for example. But my example yesterday was purposefully simple&#8230; so simple that a copula was not even needed.</p>
<p>After creating my sample data all I really needed to do was this:</p>
<blockquote><p>myDraws &lt;- mvrnorm(1e5, mu=colMeans(ourData), Sigma=cov(ourData))</p></blockquote>
<p>So I  took my example from yesterday and updated it using the mvrnorm() code and, as is my custom, put up a <a href="http://gist.github.com/559082" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/559082?referer=');">Github gist.</a> The code is embedded below as well. I added a little ggplot2 code at the end that will create a facet plot of the 4 distributions showing the shape of the distributions of both the starting data and the simulated data. The plot in the upper left of this post is the ggplot output.</p>
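<p>If you just want to poke at mvrnorm() without pulling the gist, here is a minimal sketch with made-up data (toyData is hypothetical). It assumes the MASS package is available, which is normally the case since it ships with the standard R distribution.</p>
<pre>
library(MASS)   # provides mvrnorm()

set.seed(42)
a &lt;- rnorm(500, 10, 2)
toyData &lt;- data.frame(a = a, b = 0.5 * a + rnorm(500))

# mu is the vector of column means, Sigma the covariance matrix
toyDraws &lt;- mvrnorm(1e5, mu = colMeans(toyData), Sigma = cov(toyData))

# compare the moments of the draws with the starting data
rbind(data = colMeans(toyData), draws = colMeans(toyDraws))
round(cov(toyDraws) - cov(toyData), 3)
</pre>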
<p><em><strong>EDIT: </strong></em>The email hipping me to this was sent by <a href="http://dirk.eddelbuettel.com" onclick="pageTracker._trackPageview('/outgoing/dirk.eddelbuettel.com?referer=');">Dirk Eddelbuettel</a> who&#8217;s been very helpful to me more times than I can count. I had omitted his name initially. However after confirming with Dirk, he told me it was OK to mention him by name in this post.</p>
<script src="http://gist.github.com/559082.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/08/even-simpler-multivariate-correlated-simulations/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Stochastic Simulation With Copulas in R</title>
		<link>http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/</link>
		<comments>http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/#comments</comments>
		<pubDate>Mon, 30 Aug 2010 20:12:34 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[risk]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=782</guid>
		<description><![CDATA[A friend of mine gave me a call last week and was wondering if I had a little R code that could illustrate how to do a Cholesky decomposition. He ultimately wanted to build a Monte Carlo model with correlated variables. I pointed him to a number of packages that do Cholesky decomp but then [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cafepress.com/+ringer_t,350602392" onclick="pageTracker._trackPageview('/outgoing/www.cafepress.com/+ringer_t_350602392?referer=');"><img class="alignleft size-full wp-image-792" style="margin: 5px; border: 2px solid black;" title="econModels" src="http://www.cerebralmastication.com/wp-content/uploads/2010/08/econModels.jpg" alt="You know we do! " width="206" height="162" /></a>A friend of mine gave me a call last week and was wondering if I had a little R code that could illustrate how to do a <a href="http://en.wikipedia.org/wiki/Cholesky_decomposition" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Cholesky_decomposition?referer=');">Cholesky decomposition</a>. He ultimately wanted to build a Monte Carlo model with correlated variables. I pointed him to a number of packages that do Cholesky decomp but then I recommended he consider just using a Gaussian <a href="http://en.wikipedia.org/wiki/Copula_%28statistics%29" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Copula_28statistics_29?referer=');">Copula </a> and R for the whole simulation. For most of my copula needs in R, I use the <a href="http://cran.r-project.org/web/packages/QRMlib/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/QRMlib/index.html?referer=');">QRMlib package</a> which is a code companion to the book <a href="http://www.amazon.com/gp/product/0691122555?ie=UTF8&amp;tag=riskthou-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0691122555" onclick="pageTracker._trackPageview('/outgoing/www.amazon.com/gp/product/0691122555?ie=UTF8_amp_tag=riskthou-20_amp_linkCode=as2_amp_camp=1789_amp_creative=390957_amp_creativeASIN=0691122555&amp;referer=');"><span style="text-decoration: underline;">Quantitative Risk Management: Concepts, Techniques and Tools</span></a> by Alexander J. McNeil, Rudiger Frey and Paul Embrechts. 
The book is only loosely coupled (pun intended) with the code in the QRMlib package. I really wish the book had been written with code examples and tight linkage between the book and the code. Of course I&#8217;m the type of guy who prefers code snippets to mathematical notation.</p>
<p>I had some code where I used the QRMlib package, but it was really messy and fairly specific to my use case. So I whipped up a very simple example of how to create correlated random draws from a multivariate distribution. In this example I used normally distributed marginals and Gaussian correlation to keep things simple and easy to follow. Rather than blogging through the code, I added a shit load (metric ass ton, if you&#8217;re in Canada) of comments. The code is designed to be stepped through. So don&#8217;t just run the whole blob and wonder what happened.</p>
<p>Walk through the code and if you find any errors be sure and let me know.</p>
<p>The code is embedded in a Github gist below, but if you are reading this in an aggregator (shout out to <a href="http://www.r-bloggers.com/" onclick="pageTracker._trackPageview('/outgoing/www.r-bloggers.com/?referer=');">R-Bloggers</a>) you&#8217;ll need to <a href="http://gist.github.com/557900" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/557900?referer=');">manually go to the gist</a>.</p>
<script src="http://gist.github.com/557900.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>You can Hadoop it! It&#8217;s elastic! Boogie woogie woog-ie!</title>
		<link>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/</link>
		<comments>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 18:31:23 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=592</guid>
		<description><![CDATA[I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng  (肏你娘) which your friend in grad school told you means &#8220;Live happy with many blessings&#8221;. Trust me, I&#8217;ve been hanging with Madam Wu and she told me [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_594" class="wp-caption alignleft" style="width: 271px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/bad_egg.png"><img class="size-full wp-image-594 " style="border: 1px solid black; margin: 3px;" title="I paid an old man in Chinatown $200 for this!" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/bad_egg.png" alt="" width="261" height="144" /></a><p class="wp-caption-text">This blog&#39;s name in Chinese! </p></div>
<p>I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng  (肏你娘) which your friend in grad school told you means &#8220;Live happy with many blessings&#8221;. Trust me, I&#8217;ve been hanging with Madam Wu and she told me it doesn&#8217;t mean that.</p>
<p>So how did I travel to the future to visit with Madam Wu, you ask? Well the short answer is Hadoop. Yeah, the cute little elephant. <a href="http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/">As I have told you before</a>, multicore makes your R code run fast by using worm holes to shoot your results back from the future. Well Hadoop actually takes you to the future on the back of an elephant and you can bring your own results back! I couldn&#8217;t make this up if I tried, so you know it&#8217;s true! And what&#8217;s fantastic about all of this is Hadoop works with R! And Amazon will let you rent a time traveling elephant through their <a href="http://aws.amazon.com/elasticmapreduce/" onclick="pageTracker._trackPageview('/outgoing/aws.amazon.com/elasticmapreduce/?referer=');">Elastic MapReduce service</a>! I think Amazon coined the term &#8220;Time Travel as a Service&#8221; or TTaaS  generally pronounced as &#8220;ta-tas&#8221; in <a href="http://www.savethetatas.com/" onclick="pageTracker._trackPageview('/outgoing/www.savethetatas.com/?referer=');">the industry</a>. If you are a CTO be sure and use this in your next &#8220;vision statement&#8221; pitch so everyone will know you&#8217;re hip to all this cloud stuff.</p>
<p>So you use R and you want to travel into the future on the back of an elephant to visit Madam Wu and get your model results back, don&#8217;t you? Well it&#8217;s a damn good thing you read this blog because I&#8217;m going to give you the keys to the Wu dynasty and a little 福寿 while we&#8217;re at it.</p>
<p>I&#8217;ve never had an original thought in my life so I started with <a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=128995&amp;#128995" onclick="pageTracker._trackPageview('/outgoing/developer.amazonwebservices.com/connect/thread.jspa?messageID=128995_amp_128995&amp;referer=');">this discussion </a>over at the AMZN E M/R discussion forum. Peter Skomoroch from <a href="http://www.datawrangling.com/" onclick="pageTracker._trackPageview('/outgoing/www.datawrangling.com/?referer=');">Data Wrangling </a>gives a very good example with all the data and code provided so you can run it yourself.  Pete&#8217;s example really shakes the  yáng guǐzi, as we say in the future. In addition I read the documentation for David Rosenberg&#8217;s <a href="http://docs.google.com/viewer?url=http%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2FHadoopStreaming%2FHadoopStreaming.pdf" onclick="pageTracker._trackPageview('/outgoing/docs.google.com/viewer?url=http_3A_2F_2Fcran.r-project.org_2Fweb_2Fpackages_2FHadoopStreaming_2FHadoopStreaming.pdf&amp;referer=');">HadoopStreaming package</a> which was good for insight, but I didn&#8217;t use the package as it&#8217;s really focused on the &#8216;big data&#8217; problem.</p>
<div id="attachment_639" class="wp-caption alignleft" style="width: 218px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/hadoop-elephant.jpeg"><img class="size-full wp-image-639 " style="border: 1px solid black; margin: 3px;" title="hadoop elephant" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/hadoop-elephant.jpeg" alt="" width="208" height="156" /></a><p class="wp-caption-text">That elephant is so freaking cute! </p></div>
<p>Prior to my foray into time travel, I knew that Hadoop could be used to process big text files and do something like rip out all the links and count them. But I thought that Hadoop was all about processing big data. I never paid attention to the big Hadoop elephant in the room because I don&#8217;t have big data. I have big CPU hogging models (mostly slow because I don&#8217;t code worth a shit). What got me reconsidering my world view was <cite></cite><a onclick="pageTracker._trackPageview('/outgoing/www.johnmyleswhite.com/?referer=');pageTracker._trackPageview('/outgoing/www.johnmyleswhite.com?referer=http%3A%2F%2Fwww.cerebralmastication.com%2F');" rel="external nofollow" href="http://www.johnmyleswhite.com/">John Myles White</a>&#8217;s comment on my <a href="http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/">multicore post </a>earlier. John encouraged me to look into running my simulations on AMZN&#8217;s E M/R service using Hadoop streaming. So instead of giving Hadoop  a big fat text file to parse, I just gave it a text file with 10,000 rows each containing an integer from 1:10,000. Then I refactored my R code to read a line from stdin, trim it down to just the integer, and then go run the simulation with that number. When done I had it serialize the resulting model output and return that to stdout. Hadoop takes care of chopping up the input and pulling together the output.</p>
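<p>A stripped-down sketch of that mapper pattern might look like the following. Treat it as an illustration under assumptions, not my production script: runSimulation() is a made-up stand-in for the real model, and hex-encoding the serialized bytes is just one way to keep each result on a single line of text. It is also written against a current version of R rather than the 2.7.1 that was on the cluster.</p>
<pre>
# hypothetical stand-in for a slow simulation keyed by an integer
runSimulation &lt;- function(i) {
  set.seed(i)
  list(id = i, result = mean(rnorm(1000)))
}

# serialize any R object to one line of hex text, and back again
toHex &lt;- function(obj) paste(as.character(serialize(obj, NULL)), collapse = "")
fromHex &lt;- function(txt) {
  starts &lt;- seq(1, nchar(txt), by = 2)
  unserialize(as.raw(strtoi(substring(txt, starts, starts + 1), 16L)))
}

# read one integer per line, run the sim, write "key TAB payload" lines
mapper &lt;- function(input, output) {
  for (line in readLines(input)) {
    i &lt;- as.integer(gsub("[^0-9]", "", line))   # trim down to just the integer
    if (is.na(i)) next                            # skip blank or junk lines
    writeLines(paste(i, toHex(runSimulation(i)), sep = "\t"), output)
  }
}

# in a real streaming script the call would be: mapper(file("stdin"), stdout())
# here text connections stand in for stdin and stdout so it runs anywhere
inCon &lt;- textConnection(c("1", " 2 ", "3"))
outCon &lt;- textConnection(NULL, "w")
mapper(inCon, outCon)
mapped &lt;- textConnectionValue(outCon)
close(inCon)
close(outCon)

# pull the second record back into an R object
second &lt;- fromHex(strsplit(mapped[2], "\t")[[1]][2])
second$id
</pre>
<p>The key before the tab is what Hadoop sorts on between the map and reduce steps, so a reducer written in the same style can split each line on the tab and call fromHex() on the payload.</p>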
<p>I learned a few &#8220;gotchas&#8221; or, as we say in the future: 臭婊子(I think that should be plural). I&#8217;ll do a whole blog post on gotchas soon, but here&#8217;s the bullet points:</p>
<ul>
<li>AMZN is currently running the version of Debian Linux named Lenny which has version 2.7.1 of R installed. No matter what the documentation says, don&#8217;t let Lenny tend to the rabbits.</li>
<li>Test all code by firing up an interactive Pig instance and logging in as &#8216;hadoop&#8217;. Instead of running Pig, run R and test your code. And as it says in the FAQ: &#8220;The Pig don&#8217;t care either way. &#8221; Which, despite sounding like buggery, is the truth.</li>
<li>If your code runs inside of R on a Hadoop instance, drop back to the command line on the Hadoop instance and run &#8216;cat infile.txt | yourMapper.R | sort | yourReducer.R &gt; outfile.txt&#8217;. This pipes your input file into your mapper file which does its thing and then pipes the results to your reducer file which then &#8220;pumps up the jam&#8221; into an output file. What you see in the outfile.txt is what Hadoop will produce. So if you don&#8217;t like what you see, you better do some more coding.</li>
<li>You CAN load packages into R in a Hadoop instance running in AMZN E M/R. There are a few caveats, of course:</li>
</ul>
<ol>
<li>Your package has to work in R 2.7.1 (until AMZN upgrades to the next stable version of Debian).</li>
<li>As far as I can tell, all the output has to come out of stdout. So if you want to end up with R objects which you use for other things, you should get comfortable with the serialize() command and reading text files back into R. Which, as you can see <a href="http://stackoverflow.com/questions/2258511/r-serialize-objects-to-text-file-and-back-again" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/2258511/r-serialize-objects-to-text-file-and-back-again?referer=');">from this question</a>, I am not yet comfortable with.</li>
<li>There will be multiple instances of R running on every machine. So if they are all trying to download a package to the same directory, you are going to get file lock errors. One solution is to have each R instance create a directory for packages that includes the PID of that R instance. That way there&#8217;s no possibility for a conflict! Here&#8217;s an example of how I load the Hmisc package:</li>
<p><script src="http://gist.github.com/304262.js?file=AMZNloadPackage.R"></script></ol>
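<p>For readers whose aggregator drops the embedded gist, the per-process library idea looks roughly like this. It is a sketch rather than the exact code in the gist: the helper name makePidLib() and the directory naming are my own, and the install step is left commented out because it needs network access.</p>
<pre>
# build a package library directory unique to this R process
makePidLib &lt;- function(base = tempdir()) {
  libDir &lt;- file.path(base, paste("Rlib", Sys.getpid(), sep = "_"))
  dir.create(libDir, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(libDir, .libPaths()))   # search the private directory first
  libDir
}

myLib &lt;- makePidLib()

# then install and load from the private directory, for example:
# install.packages("Hmisc", lib = myLib, repos = "http://cran.r-project.org")
# library(Hmisc, lib.loc = myLib)
</pre>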
<ul>
<li>You&#8217;ll probably want to provide some data to R. This is done by uploading your files to S3 and then passing the &#8220;-cacheFile&#8221; option to Hadoop. To get the plyr package to load in R 2.7.1 I had to edit the package. I then uploaded the altered package thusly:</li>
</ul>
<blockquote><p>-cacheFile s3n://rdata/plyr_0.1.9.tar.gz#plyr_0.1.9.tar.gz</p></blockquote>
<p>More to come later. I&#8217;ve gotta get back to the future.</p>
<div id="attachment_631" class="wp-caption alignleft" style="width: 314px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/christopher_lloyd.jpg"><img class="size-full wp-image-631" style="border: 1px solid black; margin: 3px;" title="christopher_lloyd" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/christopher_lloyd.jpg" alt="" width="304" height="224" /></a><p class="wp-caption-text">You hold the elephant and I&#39;ll plug this in. </p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using the R multicore package in Linux with wild and passionate abandon</title>
		<link>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/</link>
		<comments>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 19:57:20 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=562</guid>
		<description><![CDATA[One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It&#8217;s not uncommon for each of my simulations to take 20 seconds or more to complete (if you&#8217;re doing the math, that&#8217;s 55 hours for 10K sims or, approximately 453 games of solitaire) . Initially I ran [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/amd_mc_processing.jpg"><img class="alignleft size-full wp-image-586" style="border: 0pt none; margin: 20px;" title="amd_mc_processing" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/amd_mc_processing.jpg" alt="" width="214" height="193" /></a>One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It&#8217;s not uncommon for each of my simulations to take 20 seconds or more to complete (if you&#8217;re doing the math, that&#8217;s 55 hours for 10K sims or, approximately 453 games of solitaire) . Initially I ran my sims in R running on an <a href="http://www.virtualbox.org/" onclick="pageTracker._trackPageview('/outgoing/www.virtualbox.org/?referer=');">Oracle VirtualBox </a>(Oracle now owns Virtualbox! *gasp* ) running Ubuntu. Lately I&#8217;ve moved to running my sims on EC2 machines. I&#8217;m not yet doing RMPI clustering, although that is on my roadmap. Currently I just fire up a couple of 8 core instances and run 5K sims on each one then FTP the results back to my desktop. It&#8217;s not very sexy, but it gets the job done&#8230; I guess the same could be said of myself, except substitute &#8220;makes slurping sounds eating udon&#8221; in the place of &#8220;gets the job done.&#8221;</p>
<p>When running processor intensive crap (that&#8217;s a stochastic modeling term) the single threaded nature of R is painful. In Linux or Mac (i.e. NOT Windows) the <a href="http://www.rforge.net/doc/packages/multicore/multicore.html" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/doc/packages/multicore/multicore.html?referer=');">multicore package </a>is a real godsend. I did a quick code review and, from what I can tell, multicore exploits worm holes to travel back in time and reports your results in a fraction of the time you would expect it to take. Seriously. I expect that as the code matures my computer will fill up with simulation results from simulations which I have not even coded yet. It&#8217;s almost like magic, except without the rabbit and hat.</p>
<p>The crux of the package is a parallel-ized version of lapply() called mclapply(). I believe the mc stands for &#8216;magic carpet&#8217; and is an allusion to the worm hole technology. So how does one harness this package for <span style="text-decoration: line-through;">nefarious self interest </span>doing parallel operations in R? The ultra short answer is: write your R code so that the most processor intensive bit is done with an lapply() function. Then replace the lapply() with mclapply().  Of course you have to load the multicore package before you run it. But that&#8217;s basically it.</p>
<p>How I implement mclapply() is thusly: I build a table with all my random draws for my simulations. So if I have 20 variables and want to run 10,000 simulations then I&#8217;ll build a data frame with all 200,000 values (generally 10K rows and 21 columns for 20 variables + an index). The index keeps track of the draw number. Then I have code that performs the &#8216;valuation&#8217; based on a single observation of the 20 variables. I wrap the valuation step in a function and then call the valuation process 10,000 times with mclapply(). So it might look something like this:</p>
<blockquote><p>myOutput &lt;- mclapply( drawList, function(x) valuationReturns(drawNumber=x))</p></blockquote>
<p>The drawList object is simply a list of the possible indexes (i.e. 1:10000). When the code has iterated over each value from drawList the results will be in the myOutput object. Tada!</p>
<p>I recommend the <a href="http://htop.sourceforge.net/" onclick="pageTracker._trackPageview('/outgoing/htop.sourceforge.net/?referer=');">htop program </a>for tracking what&#8217;s going on with processor utilization in Linux (I presume Mac too if you ask Steve Jobs nicely). If everything is cranking well, and you have 8 cores, you might see an image that looks something like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/r-on-ec21.png"><img class="size-full wp-image-564 alignnone" title="r on ec2" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/r-on-ec21.png" alt="" width="535" height="400" /></a></p>
<p>I don&#8217;t understand time travel, but I&#8217;ve found that I have better luck if I set mc.preschedule=FALSE. Apparently prescheduled magic carpets are finicky. If I leave mc.preschedule to the default of TRUE then I find that often some of my cores go underutilized.</p>
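<p>Here is a toy version of the whole pattern, draws table and all. One assumption to flag loudly: the multicore code has since been folded into the parallel package that ships with R, so this sketch calls mclapply() from parallel rather than from the original package. The draws table and valuationReturns() are made-up stand-ins for a real model.</p>
<pre>
library(parallel)

set.seed(1)
nSims &lt;- 200

# table of random draws: an index column plus one column per variable
draws &lt;- data.frame(index = 1:nSims, v1 = rnorm(nSims), v2 = rnorm(nSims))

# hypothetical valuation based on a single row of the draws table
valuationReturns &lt;- function(drawNumber) {
  row &lt;- draws[draws$index == drawNumber, ]
  row$v1 * 100 + row$v2 * 50
}

drawList &lt;- as.list(draws$index)

# forking is not available on Windows, so fall back to one core there
nCores &lt;- if (.Platform$OS.type == "windows") 1L else 2L

myOutput &lt;- mclapply(drawList, function(x) valuationReturns(drawNumber = x),
                     mc.cores = nCores, mc.preschedule = FALSE)

# same answers as plain lapply(), just computed in parallel
identical(myOutput, lapply(drawList, valuationReturns))
</pre>
<p>With mc.preschedule=FALSE each draw gets its own forked job, which balances the load when some sims run much longer than others but costs a fork per draw. For thousands of quick sims the default may well be faster, so it is worth timing both.</p>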
<p>Let me know if you have other multicore tips and tricks.</p>
<p>If you want to give me shit for running my simulations as root, feel free. I&#8217;m impervious to your &#8220;best practices&#8221; mumbo jumbo. La la la la la la!! Not listening!</p>
<p>Special thanks to <a href="http://www.cis.udel.edu/~cavazos/index.php?page=multicore-programming" onclick="pageTracker._trackPageview('/outgoing/www.cis.udel.edu/_cavazos/index.php?page=multicore-programming&amp;referer=');">John Cavazos over at the University of Delaware</a> from whom I stole the MC for Dummies image. John, you&#8217;re a gentleman and a humble scholar. Damn few of us left.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
	</channel>
</rss>

