<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication &#187; R</title>
	<atom:link href="http://www.cerebralmastication.com/tag/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Wed, 07 Dec 2011 13:08:46 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Fitting Distribution X to Data From Distribution Y</title>
		<link>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/</link>
		<comments>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/#comments</comments>
		<pubDate>Thu, 12 May 2011 20:31:31 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=1009</guid>
		<description><![CDATA[I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I&#8217;m not a &#8220;closed form&#8221; kinda guy. I&#8217;m more of a &#8220;numerical simulation&#8221; type of fellow. So I whipped up a little R code to illustrate the process then we changed [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/05/rstudio-plot.png"><img class="alignleft size-medium wp-image-1010" title="rstudio-plot" src="http://www.cerebralmastication.com/wp-content/uploads/2011/05/rstudio-plot-300x240.png" alt="" width="300" height="240" /></a>I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I&#8217;m not a &#8220;closed form&#8221; kinda guy. I&#8217;m more of a &#8220;numerical simulation&#8221; type of fellow. So I whipped up a little R code to illustrate the process, then we changed the parameters of the gamma distribution to see how it affected the fit. An exercise like this is what I call building a &#8220;toy model&#8221; and I think it&#8217;s invaluable as a method for building intuition and a visceral understanding of data.<br />
Here&#8217;s some example code which we played with:</p>
<blockquote>
<div style="overflow:auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family:monospace;"><a href="http://inside-r.org/r-doc/base/set.seed" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/set.seed?referer=');"><span style="color: #003399; font-weight: bold;">set.seed</span></a><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span>
x <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/stats/rgamma" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/rgamma?referer=');"><span style="color: #003399; font-weight: bold;">rgamma</span></a><span style="color: #009900;">&#40;</span>1e5<span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">.2</span><span style="color: #009900;">&#41;</span>
<a href="http://inside-r.org/r-doc/graphics/plot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/plot?referer=');"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;"># normalize the gamma so it's between 0 &amp; 1</span>
<span style="color: #666666; font-style: italic;"># add .0001 to the max because the fit fails if any value is exactly 1</span>
xt <span style="">&lt;-</span> x <span style="">/</span> <span style="color: #009900;">&#40;</span> <a href="http://inside-r.org/r-doc/base/max" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/max?referer=');"><span style="color: #003399; font-weight: bold;">max</span></a><span style="color: #009900;">&#40;</span> x <span style="color: #009900;">&#41;</span> <span style="">+</span> <span style="color: #cc66cc;">.0001</span> <span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;"># fit a beta distribution to xt</span>
<a href="http://inside-r.org/r-doc/base/library" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/library?referer=');"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">&#40;</span> <a href="http://inside-r.org/packages/cran/MASS" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/packages/cran/MASS?referer=');"><span style="">MASS</span></a> <span style="color: #009900;">&#41;</span>
fit.beta <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/MASS/fitdistr" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/MASS/fitdistr?referer=');"><span style="color: #003399; font-weight: bold;">fitdistr</span></a><span style="color: #009900;">&#40;</span> xt<span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;beta&quot;</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/stats/start" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/start?referer=');"><span style="color: #003399; font-weight: bold;">start</span></a> = <a href="http://inside-r.org/r-doc/base/list" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/list?referer=');"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">&#40;</span> shape1=<span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> shape2=<span style="color: #cc66cc;">5</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span>
&nbsp;
x.beta <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/stats/rbeta" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/rbeta?referer=');"><span style="color: #003399; font-weight: bold;">rbeta</span></a><span style="color: #009900;">&#40;</span>1e5<span style="color: #339933;">,</span>fit.beta<span style="">$</span>estimate<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>fit.beta<span style="">$</span>estimate<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">## plot the pdfs on top of each other</span>
<a href="http://inside-r.org/r-doc/graphics/plot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/plot?referer=');"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>xt<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<a href="http://inside-r.org/r-doc/graphics/lines" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/lines?referer=');"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>x.beta<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/col" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/col?referer=');"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: #0000ff;">&quot;red&quot;</span> <span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">## plot the qqplots</span>
<a href="http://inside-r.org/r-doc/stats/qqplot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/qqplot?referer=');"><span style="color: #003399; font-weight: bold;">qqplot</span></a><span style="color: #009900;">&#40;</span>xt<span style="color: #339933;">,</span> x.beta<span style="color: #009900;">&#41;</span></pre>
</div>
</div>
<p><a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org" onclick="pageTracker._trackPageview('/outgoing/www.inside-r.org/pretty-r?referer=');">Created by Pretty R at inside-R.org</a></p>
</blockquote>
<p>It&#8217;s not illustrated above, but it&#8217;s probably useful to transform the simulated data (x.beta) back into pre-normalized space by multiplying by max( x ) + .0001. (I swore I&#8217;d never say this, but I lied.) I&#8217;ll leave that as an exercise for the reader.</p>
<p>Another very useful tool in building a mental road map of distributions is the <a href="http://www.johndcook.com/distribution_chart.html" onclick="pageTracker._trackPageview('/outgoing/www.johndcook.com/distribution_chart.html?referer=');">graphical chart of distribution relationships that John Cook introduced me to</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Shell scripting EC2 for fun and profit</title>
		<link>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/</link>
		<comments>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/#comments</comments>
		<pubDate>Fri, 06 May 2011 20:57:40 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=993</guid>
		<description><![CDATA[Lately I&#8217;ve been doing some work with creating ad-hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis&#8217;s very cool doRedis backend for the R foreach package. But that&#8217;s a whole other post. What I was scratching my [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.thinkgeek.com/tshirts-apparel/unisex/frustrations/374d/" onclick="pageTracker._trackPageview('/outgoing/www.thinkgeek.com/tshirts-apparel/unisex/frustrations/374d/?referer=');"><img class="alignleft size-full wp-image-994" style="border: 1px solid black; margin: 2px;" title="lg-go-away-tshirt" src="http://www.cerebralmastication.com/wp-content/uploads/2011/05/lg-go-away-tshirt.jpg" alt="" width="179" height="218" /></a>Lately I&#8217;ve been doing some work with creating ad-hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis&#8217;s very cool <a href="http://cran.r-project.org/web/packages/doRedis/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/doRedis/index.html?referer=');">doRedis backend</a> for the R <a href="http://cran.r-project.org/web/packages/foreach/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/foreach/index.html?referer=');">foreach package</a>. But that&#8217;s a whole other post. What I was scratching my head about today was that I&#8217;d really just like to, with a single command, spin up an EC2 instance, wait for it to come up, and then ssh into it. I do this iteration about 20 times a day when I&#8217;m testing things, so it seemed to make sense to shell script it.<br />
To do this, you need the EC2 command line tools installed on your workstation. In Ubuntu that&#8217;s as easy as `sudo apt-get install ec2-api-tools`</p>
<p>So here&#8217;s a short shell script to spin up an instance, wait 30 seconds, then connect:<br />
<script src="http://gist.github.com/959780.js"></script></p>
<p>If you&#8217;re reading this through an RSS reader, you can see the script over at <a href="https://gist.github.com/959780" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/959780?referer=');">github</a>.</p>
<p>Obviously you&#8217;ll need to change the parameters at the top of the script to suit your needs. But since this was a bit of a pain in the donkey hole for me to figure out, I thought I would share.</p>
<p>If you want to help out, I&#8217;d love you to enlighten me on how to have the script figure out if an instance has finished booting so I could eliminate the sleep step.</p>
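<p>One idea I have not battle-tested: instead of a fixed sleep, keep retrying a cheap check until it succeeds or you run out of patience. Below is a rough sketch of that retry loop. The function name retry_until is made up for illustration, and the commented usage assumes that "port 22 accepts connections" is a good enough stand-in for "finished booting", that nc (netcat) is installed, and that HOST holds the instance's public DNS name.</p>

```shell
#!/bin/sh
# retry_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds. Gives up (returns 1) after MAX_TRIES
# failed attempts, sleeping DELAY_SECONDS between attempts.
# NOTE: retry_until is a hypothetical helper, not part of the EC2 tools.
retry_until() {
  _max=$1
  _delay=$2
  shift 2
  _try=1
  until "$@"; do
    if [ "$_try" -ge "$_max" ]; then
      return 1
    fi
    _try=$((_try + 1))
    sleep "$_delay"
  done
  return 0
}

# Possible usage (assumptions: HOST is set, nc is installed):
#   if retry_until 30 5 nc -z "$HOST" 22; then
#     ssh "ubuntu@$HOST"
#   fi
```

<p>With 30 tries and a 5 second delay this gives up after roughly two and a half minutes, and it connects as soon as the port opens rather than always waiting the full 30 seconds.</p>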
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Details of two-way sync between two Ubuntu machines</title>
		<link>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/</link>
		<comments>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/#comments</comments>
		<pubDate>Mon, 18 Apr 2011 20:48:32 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=966</guid>
		<description><![CDATA[In a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two way file syncing. This is the detail of how I set up two machines, both Ubuntu 10.10, to perform two way sync where a file change on either machine [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png"><img class="alignleft size-full wp-image-956" title="sync" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png" alt="" width="128" height="128" /></a>In a <a href="http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/">previous post</a> I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two way file syncing. This is the detail of how I set up two machines, both Ubuntu 10.10, to perform two way sync where a file change on either machine will result in that change being replicated on the other machine.</p>
<p>I initially tried running Unison on BOTH my laptop and the server and had the server Unison set to sync with my laptop back through an SSH reverse proxy. After testing this for a while I discovered this is totally the wrong way to do it. The problem is that the Unison process makes temp directories and files in the file system of the target. So my Unison job on the laptop would be trying to sync files and, in the process, create temp files which would kick off a Unison sync on the server which would make temp files on the laptop&#8230; I think you can see how convoluted this gets.</p>
<p>So a much better solution is to only run Unison from one machine (I chose my laptop) and have the other machine (server in my case) send an SSH command (over the aforementioned reverse proxy) to the laptop asking the laptop to kick off a Unison sync. This way all of the syncs happen from the laptop.</p>
<p>So, in short, both machines run lsyncd which monitors files for changes. I keep up an SSH tunnel with reverse port forwarding which forwards a remote machine port back to my laptop&#8217;s port 22 (SSH). Unison only ever gets launched on my laptop (it still has to be installed on the server, since a Unison sync over SSH starts a Unison process on the remote end). When a change happens on my laptop, lsyncd fires off a Unison sync from my laptop that syncs it with the server. When a file changes on the server, the lsyncd job on the server makes a connection to my laptop via ssh and fires off a Unison sync between my laptop and the server.</p>
<p>Here&#8217;s an example of my lsyncd config scripts:</p>
<p><strong>Laptop:</strong></p>
<blockquote><p>settings = {<br />
logfile    = "/home/jal/lsyncd/laptop/lsyncd.log",<br />
statusFile = "/home/jal/lsyncd/laptop/lsyncd.status",<br />
maxDelays  = 15,<br />
--nodaemon   = true,<br />
}</p>
<p>runUnison2 = {<br />
maxProcesses = 1,<br />
delay = 15,<br />
onAttrib  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onCreate  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onDelete  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onModify  = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onMove    = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
}</p>
<p>sync{runUnison2, source="/home/jal/Documents"}</p></blockquote>
<p><strong>Server:</strong></p>
<blockquote><p>settings = {<br />
logfile    = "/home/jal/lsyncd/server/lsyncd.log",<br />
statusFile = "/home/jal/lsyncd/server/lsyncd.status",<br />
maxDelays  = 15,<br />
--nodaemon   = true,<br />
}</p>
<p>runUnison2 = {<br />
maxProcesses = 1,<br />
delay = 15,<br />
onAttrib  = "ssh localhost -p 5432 unison -batch  /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onCreate  = "ssh localhost -p 5432 unison -batch  /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onDelete  = "ssh localhost -p 5432 unison -batch  /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onModify  = "ssh localhost -p 5432 unison -batch  /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
onMove    = "ssh localhost -p 5432 unison -batch  /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",<br />
}</p>
<p>sync{runUnison2, source="/home/jal/Documents"}</p></blockquote>
<p>Keep in mind that I am using version 2 of lsyncd which can be downloaded here: <a href="http://code.google.com/p/lsyncd/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/lsyncd/?referer=');">http://code.google.com/p/lsyncd/</a></p>
<p>The version of lsyncd available in the Ubuntu repo is version 1.x which does not use the same config format as I illustrate above. However, if you run into dependency issues with v2, the easiest thing to do is install the repo version which will install dependencies and then manually download and install v2 from the above URL.</p>
<p>My reverse port forwarding set up looks like this:</p>
<blockquote><p>autossh -2 -4 -X -R 5432:localhost:22 12.34.56.78</p></blockquote>
<p>The -R bit forwards port 5432 on the server to port 22 (SSH) on my laptop. So if I run ssh localhost -p 5432 on my server, what actually happens is I am sshing from the server into my laptop.</p>
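<p>If it helps, here is a rough sketch of the same tunnel wrapped in a few shell functions. The function names (reverse_spec, open_tunnel, check_tunnel) and the variable names are made up for illustration; the flags, IP address and ports are the ones from my command above.</p>

```shell
#!/bin/sh
# Hypothetical wrapper around the autossh command shown above.
SERVER="12.34.56.78"   # example server IP used throughout this post
REMOTE_PORT=5432       # port opened on the server
LOCAL_SSH_PORT=22      # sshd port on the laptop

# Build the REMOTE_PORT:localhost:LOCAL_PORT argument that -R expects.
reverse_spec() {
  printf '%s:localhost:%s' "$1" "$2"
}

# Run on the laptop: opens the tunnel and keeps it alive via autossh.
open_tunnel() {
  autossh -2 -4 -X -R "$(reverse_spec "$REMOTE_PORT" "$LOCAL_SSH_PORT")" "$SERVER"
}

# Run on the server: if the tunnel is up, this prints the laptop's hostname.
check_tunnel() {
  ssh -p "$REMOTE_PORT" localhost hostname
}
```

<p>Nothing runs until you call open_tunnel on the laptop. Calling check_tunnel on the server is a quick way to confirm that port 5432 really lands on the laptop before pointing lsyncd at it.</p>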
<p><strong>Notes:</strong></p>
<ul>
<li>The IP address of my server in this example is 12.34.56.78.</li>
<li>Don&#8217;t try to sync the directories where the lsyncd logs are kept. That will result in an endless sync cycle as each machine keeps noticing changes. Don&#8217;t ask me how I know this.</li>
<li>The command to start the sync on the laptop is &#8220;lsyncd /home/jal/lsyncd/laptop/configfile&#8221; where configfile is the above lsyncd configuration file.</li>
<li>lsyncd could, conceivably, tell Unison to sync only the part of the directory tree that changed. I have not been able to make that feature work right, however. And it only takes Unison a few seconds to sync, so I&#8217;ve not worried about it.</li>
</ul>
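<p>To guard against the log directory mistake in the notes above, a startup script could refuse to launch lsyncd when the logs live inside the synced tree. This is only a sketch: is_inside and start_sync are hypothetical helper names, the paths are the ones from my laptop config, and the check is a plain string comparison that does not resolve symlinks or relative paths.</p>

```shell
#!/bin/sh
# Paths taken from the laptop config above.
SYNC_DIR="/home/jal/Documents"
LOG_DIR="/home/jal/lsyncd/laptop"
CONFIG="/home/jal/lsyncd/laptop/configfile"

# is_inside PATH DIR: succeeds if PATH is DIR itself or sits underneath it.
# Hypothetical helper; compares path strings only.
is_inside() {
  _dir=${2%/}            # tolerate a trailing slash on DIR
  case "$1/" in
    "$_dir"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Start lsyncd only if the logs are kept outside the synced directory.
start_sync() {
  if is_inside "$LOG_DIR" "$SYNC_DIR"; then
    echo "refusing to start: $LOG_DIR is inside $SYNC_DIR" >&2
    return 1
  fi
  lsyncd "$CONFIG"
}
```

<p>Calling start_sync then either launches lsyncd with the config file or bails out with a message, which beats finding out about the endless sync cycle the hard way.</p>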
<p>This has greatly sped up my <a href="http://rstudio.org" onclick="pageTracker._trackPageview('/outgoing/rstudio.org?referer=');">RStudio</a> based workflow when doing analysis with R. Now when I change files on my server using RStudio they are immediately (well it waits 15 seconds) replicated to my local machine and vice versa!</p>
<p>Good luck and if you have any suggestions please post a comment!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Fast Two Way Sync in Ubuntu!</title>
		<link>http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/</link>
		<comments>http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/#comments</comments>
		<pubDate>Sat, 09 Apr 2011 15:32:48 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[sync]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=955</guid>
		<description><![CDATA[I love the portability of a laptop. I have a 45 min train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long running analytics on my laptop when I&#8217;m in the office because it bogs down my [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png"><img class="alignleft size-full wp-image-956" title="sync" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png" alt="" width="128" height="128" /></a>I love the portability of a laptop. I have a 45 min train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long running analytics on my laptop when I&#8217;m in the office because it bogs down my laptop and all those videos on <a href="http://www.thesuperficial.com/" onclick="pageTracker._trackPageview('/outgoing/www.thesuperficial.com/?referer=');">The Superficial</a> get all jerky and stuff.</p>
<p>I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I&#8217;m going to call these collectively &#8220;my servers&#8221; for the rest of this post). The nagging problem with this has been keeping files in sync. <a href="http://rstudio.org/" onclick="pageTracker._trackPageview('/outgoing/rstudio.org/?referer=');">RStudio Server</a> has been a great help to my workflow because it lets me edit files in my browser and they run on my servers. But when a long running R job blows out files I want those IMMEDIATELY synced with my laptop. That way I know when I undock my laptop to run to the train station that all my files will be there for me to spill Old Style beer on as I ride the Metra North line.</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/dropbox_logo_home.png"><img class="alignleft size-full wp-image-958" style="margin: 5px;" title="dropbox_logo_home" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/dropbox_logo_home.png" alt="" width="209" height="54" /></a>I experimented with <a href="https://www.dropbox.com/" onclick="pageTracker._trackPageview('/outgoing/www.dropbox.com/?referer=');">Dropbox</a> and I gotta say, it&#8217;s great. It really is well engineered, fast, and drop dead simple. I love that with Dropbox I could pull up most any file from my Dropbox on my iPad or iPhone. That&#8217;s a very handy feature. And it&#8217;s fast. If I created a small text file on my server, it would be synced with my laptop in a few seconds. Perfect! Well&#8230; almost. Dropbox has a huge limitation: encryption. Dropbox encrypts for transmission and may even store files encrypted on their end. However, Dropbox controls the key. So if a rogue employee, a crafty Russian hacker, or a law enforcement officer with a subpoena gained access to Dropbox, they could get access to my files without my knowledge. As a risk manager I can&#8217;t help but see Dropbox&#8217;s security as a huge, targeted, single point of failure. It&#8217;s hard to say which would be a bigger payday: cracking GMail, or cracking Dropbox. But I suspect it&#8217;s Dropbox. There are some workarounds to try and shoehorn file encryption into Dropbox, and they all suck.</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/logo.gif"><img class="alignleft size-full wp-image-960" style="margin: 5px; border: 0pt none;" title="logo" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/logo.gif" alt="" width="85" height="80" /></a>So Dropbox can&#8217;t really give me what I want (what I really really want). But I stumbled into <a href="https://spideroak.com/" onclick="pageTracker._trackPageview('/outgoing/spideroak.com/?referer=');">Spideroak</a> who are like the smarter, but lesser known cousins of Dropbox. Their software does everything Dropbox does (including tracking all revisions!) but they have a &#8220;trust no one&#8221; model which encrypts all files before leaving my computer using, and this is critical, MY key which they don&#8217;t store. Pretty cool, eh? Spideroak also has an iPad/iPhone app and offers a neat feature that allows emailing any file in my Spideroak &#8220;bucket&#8221; to anyone using my iPhone without having to upload the file to my iPhone first. They do this by sending a special link to the email recipient that allows them to open only the file you wanted them to have. This could be a huge bacon saver on the road.</p>
<p>So Spideroak&#8217;s the panacea then? Well&#8230; um&#8230; no. They have two critical flaws: 1) They depend on time stamps on files to determine the most recent file. 2) Syncs are slow, sometimes taking more than 5 minutes for very small files. The time stamp issue is an engineering failure, plain and simple. I&#8217;ve talked to their tech support and been assured that they are going to change this and index using server time, not system time in the future. But as of April 6, 2011, Spideroak uses local system time. For most users this is no big deal. For my use case this is painful. The clocks on my server and my laptop were 6 seconds apart and that time difference was enough to get Spideroak confused about which files were the freshest. This is a big deal when syncing two file systems with fast changing files. The other issue, slow sync, was actually more painful but probably the result of their attempt to be nice with CPU time and also encryption. When jobs on my server finished, I expected those files to start syncing within seconds and the only delay I expected was bandwidth constraints. With Spideroak syncs might take 5 minutes to start and then it would go out for coffee, come back jittery and then finally complete. Even if Spideroak fixed the time sync issue (or I forced my laptop to set its time based on my server), it still would not work for my sync because of the huge lags.</p>
<p>So looking at Dropbox and Spideroak I realized that I liked everything about Spideroak except its sync. It&#8217;s a great cloud backup tool that seems to properly do encryption, it&#8217;s multiplatform (win, linux, mac), has an iPad/iPhone app for viewing/sending files, it&#8217;s smart about backups and won&#8217;t upload the same file twice (even if the file is on two different computers). For my business use, I just can&#8217;t use Dropbox. The lack of &#8220;trust no one&#8221; encryption is a deal killer. So what I really need is a sync solution to use alongside Spideroak.</p>
<p>There are some neat projects out there for sync. Projects like <a href="http://www.sparkleshare.org/" onclick="pageTracker._trackPageview('/outgoing/www.sparkleshare.org/?referer=');">Sparkleshare</a> look really promising but they are trying to do all sorts of things, not just sync. I&#8217;ve already settled on letting Spideroak do backup and version tracking so I don&#8217;t really need all those features&#8230; OK, OK, I can hear you muttering, &#8220;just use rsync and be done with it already.&#8221; Yeah, that&#8217;s a good idea. But rsync is single directional and does a lot of things well, but can also be a bit of an asshole if you don&#8217;t set all the flags right and rub its belly the right way. If you google for &#8220;bidirectional sync&#8221; you&#8217;re going to see this problem has plagued a lot of folks. This blog post has already gone on long enough so I&#8217;ll cut to the chase. Here&#8217;s the stack of tools I settled on for cobbling together my own secure, real-time, bidirectional sync between two Ubuntu boxes (one of which changes IP address and is often behind a NAT router):</p>
<p>1) <a href="http://www.cis.upenn.edu/~bcpierce/unison/" onclick="pageTracker._trackPageview('/outgoing/www.cis.upenn.edu/_bcpierce/unison/?referer=');">Unison</a> &#8211; Fast sync using rsync-esque algos and really fast caching/scanning</p>
<p>2) <a href="http://code.google.com/p/lsyncd/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/lsyncd/?referer=');">lsyncd</a> &#8211; Live (real-time) sync daemon</p>
<p>3) <a href="http://linux.die.net/man/1/autossh" onclick="pageTracker._trackPageview('/outgoing/linux.die.net/man/1/autossh?referer=');">autossh</a> &#8211; ssh client with a nifty wrapper that keeps the connection alive and respawns the connection if dropped</p>
<p>I&#8217;ll do another post with the nitty-gritty of how I set this up, but the short version is that I installed Unison and lsyncd on both the laptop and the server. Single direction sync from my laptop to the server is pretty straightforward: lsyncd watches files, if one changes it calls unison which syncs the files with the server. The tricky bit was getting my server to be able to sync with my laptop which is often behind a NAT router. The solution was to open an ssh connection from my laptop to my server using autossh and reverse port forward port 5555 from the server back to my laptop&#8217;s port 22. That way an lsyncd process on the server can monitor the file system and when it sees a change can kick off a unison job that syncs the server to ssh://localhost:5555//some/path which is forwarded to my laptop! Autossh makes sure that connection does not get dropped and respawns if it does get dropped. So with a little shell scripting to start the lsyncd daemon on both machines, some config of lsyncd, and a local shell script to fire off the autossh connection, I&#8217;ve got real-time bidirectional sync!</p>
<p>In a follow up post I&#8217;ll put up the details of this configuration. Stay tuned. (EDIT: <a href="http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/">Update posted</a>!)</p>
<p>If you&#8217;ve solved sync a different way and you like your solution, please comment. I&#8217;ve not settled that this is my long-term solution. It&#8217;s just a solution that works. Which is more than I had yesterday.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Where the heck has JD been?</title>
		<link>http://www.cerebralmastication.com/2011/03/where-the-heck-has-jd-been/</link>
		<comments>http://www.cerebralmastication.com/2011/03/where-the-heck-has-jd-been/#comments</comments>
		<pubDate>Wed, 23 Mar 2011 02:14:37 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=947</guid>
		<description><![CDATA[It&#8217;s been pointed out to me that I haven&#8217;t had any blog posts in a while. It&#8217;s true. I&#8217;m fairly slack. But in the last few months I&#8217;ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been pointed out to me that I haven&#8217;t had any blog posts in a while. It&#8217;s true. I&#8217;m fairly slack. But in the last few months I&#8217;ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, I&#8217;m nothing compared to <a href="http://www.badassoftheweek.com/akaiwa.html" onclick="pageTracker._trackPageview('/outgoing/www.badassoftheweek.com/akaiwa.html?referer=');">Hideaki Akaiwa</a>.</p>
<p>And you may have heard that the R Cookbook by Chicago&#8217;s own Paul Teetor has been printed! Way to go Paul! And for a limited time you can get the book 50% off <a href="http://oreilly.com/store/dd-jpn.html" onclick="pageTracker._trackPageview('/outgoing/oreilly.com/store/dd-jpn.html?referer=');">direct from O&#8217;Reilly</a>.</p>
<p>And let it be known: I double dog dare you to find a stats or programming book with any better back cover quotes:</p>
<p><img class="alignnone" title="back cover" src="http://static.ow.ly/photos/original/9rtx.png" alt="" width="553" height="572" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/03/where-the-heck-has-jd-been/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Controlling Amazon Web Services using rJava and the AWS Java SDK</title>
		<link>http://www.cerebralmastication.com/2010/11/controlling-amazon-web-services-using-rjava-and-the-aws-java-sdk/</link>
		<comments>http://www.cerebralmastication.com/2010/11/controlling-amazon-web-services-using-rjava-and-the-aws-java-sdk/#comments</comments>
		<pubDate>Tue, 30 Nov 2010 19:51:17 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[rJava]]></category>
		<category><![CDATA[S3]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=937</guid>
		<description><![CDATA[ I&#8217;ve been messing around with using Amazon Web Services for a while. I&#8217;ve had some projects where I wanted to upload files to S3 or fire off EMR jobs. I&#8217;ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call the tools from the command line. [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignnone" style="border: 1px solid black; margin: 4px;" title="aws" src="http://awsmedia.s3.amazonaws.com/logo_aws.gif" alt="" width="164" height="60" /> I&#8217;ve been messing around with using Amazon Web Services for a while. I&#8217;ve had some projects where I wanted to upload files to S3 or fire off EMR jobs. I&#8217;ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call the tools from the command line. This has some real disadvantages, however. Using the command line tools means each tool has to be configured individually which is painful on a new machine. It&#8217;s also much harder to roll my R code up into a CRAN package because I have to check dependencies on the command line tools and ensure that the user has properly configured each tool. Clearly a pain in the ass.</p>
<p>So I was looking for more simple/elegant solutions. After thinking the <a href="http://code.google.com/p/boto/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/boto/?referer=');">Boto</a> library for Python might be helpful, I realized that the easiest way to use that would be with <a href="http://rjython.r-forge.r-project.org/" onclick="pageTracker._trackPageview('/outgoing/rjython.r-forge.r-project.org/?referer=');">rJython</a> which meant having to interact with R, Python, AND Java. Considering I don&#8217;t program in Python or Java, that seemed like a fair bit of complexity. Then I realized that the canonical implementation of the AWS API was the AWS Java SDK. The <a href="http://www.rforge.net/rJava/" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/rJava/?referer=');">rJava</a> package makes interacting with Java from R a viable option.</p>
<p>Since I&#8217;ve never written a single line of Java code in my pathetic life, this was somewhat harder than it could have been. But with some help from <a href="http://romainfrancois.blog.free.fr/" onclick="pageTracker._trackPageview('/outgoing/romainfrancois.blog.free.fr/?referer=');">Romain Francois</a> I was able to cobble together &#8220;something that works.&#8221; The code below gives a simple example of interfacing with S3. The example will look to see if a given bucket exists on S3, if not it will create the bucket. Then it will upload a single file from your PC into that bucket. You will have to <a href="http://aws.amazon.com/sdkforjava/" onclick="pageTracker._trackPageview('/outgoing/aws.amazon.com/sdkforjava/?referer=');">download the SDK</a>, unzip it in the location of your choice, and then change the script to reflect your configuration.</p>
<p>If you are running R in Ubuntu, you should install rJava using apt-get instead of using install.packages() from inside of R:</p>
<blockquote><p>sudo apt-get install r-cran-rjava</p></blockquote>
<p>Here&#8217;s the codez. And a <a href="https://gist.github.com/722230" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/722230?referer=');">direct link</a> for you guys reading this through an RSS reader:<br />
<script src="http://gist.github.com/722230.js"></script></p>
<p>I realize that Duncan Temple Lang has created the <a href="http://www.omegahat.org/RAmazonS3/" onclick="pageTracker._trackPageview('/outgoing/www.omegahat.org/RAmazonS3/?referer=');">RAmazonS3</a> package which can easily do what the above code sample does. The advantage of using rJava and the AWS Java SDK is the ability to apply the same approach to <strong>ALL</strong> the AWS services. And since Amazon maintains the SDK this guarantees that future AWS services and features will be supported as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/11/controlling-amazon-web-services-using-rjava-and-the-aws-java-sdk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Connecting to SQL Server from R using RJDBC</title>
		<link>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/</link>
		<comments>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/#comments</comments>
		<pubDate>Wed, 22 Sep 2010 18:00:26 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[sql server]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=891</guid>
		<description><![CDATA[A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sql_server_2008_logo.png"><img class="alignleft size-medium wp-image-901" style="border: 2px solid black; margin: 3px;" title="sql_server_2008_logo" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sql_server_2008_logo-300x187.png" alt="" width="235" height="146" /></a>A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days and never was able to connect from R on Ubuntu to my corp SQL Server.</p>
<p><a href="http://www.fosstrading.com/" onclick="pageTracker._trackPageview('/outgoing/www.fosstrading.com/?referer=');">Joshua Ulrich</a> was kind enough to help me out by pointing me to <a href="http://www.rforge.net/RJDBC/" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/RJDBC/?referer=');">RJDBC</a> which scared me a little (I&#8217;m easily spooked) because it involves Java. The only thing I know about Java is every time I touch it I <a href="http://stackoverflow.com/questions/3311940/r-rjava-package-install-failing" target="_blank" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/3311940/r-rjava-package-install-failing?referer=');">spend days trying to get environment variables</a> loaded just exactly the way it wants them. But Josh assured me that it was really not that hard. Here&#8217;s the short version:</p>
<p><a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a737000d-68d0-4531-b65d-da0f2a735707&amp;displaylang=en" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.microsoft.com/downloads/en/details.aspx?FamilyID=a737000d-68d0-4531-b65d-da0f2a735707_amp_displaylang=en&amp;referer=');">Download the JDBC driver from Microsoft</a>. There are Win and *nix versions, so grab whichever you need. Unpack the driver in a known location (I used /etc/sqljdbc_2.0/). Then access the driver from R like so:</p>
<pre>library(RJDBC)
# point JDBC() at the driver class and the jar you unpacked
drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "/etc/sqljdbc_2.0/sqljdbc4.jar")
conn &lt;- dbConnect(drv, "jdbc:sqlserver://serverName", "userID", "password")
# then build a query and run it
sqlText &lt;- "SELECT * FROM myTable"
queryResults &lt;- dbGetQuery(conn, sqlText)
# close the connection when you are done
dbDisconnect(conn)</pre>
<p>I have a few scripts that I want to run on both my Ubuntu laptop and my Windows Server. To accommodate that I made my scripts compatible with both by doing the following to my drv line:</p>
<pre># pick the jar location by platform, then load the driver once
if (.Platform$OS.type == "unix") {
  jarPath &lt;- "/etc/sqljdbc_2.0/sqljdbc4.jar"
} else {
  jarPath &lt;- "C:/Program Files/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar"
}
drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", jarPath)</pre>
<p>Obviously if you unpacked your drivers in different locations you&#8217;ll need to molest the code to fit your life situation.</p>
<p><span style="color: #ff6600;"><strong>EDIT: </strong>A MUCH better place to put the JDBC drivers in Ubuntu would be the /opt/ path as opposed to /etc/ which I used above. In Ubuntu the /opt/ directory is the conventional home for add-on software that isn&#8217;t managed by apt, while /etc/ is meant for system configuration files. I&#8217;m not familiar with all the conventions in Ubuntu (or even Linux in general) so I didn&#8217;t realize this until I got some reader feedback. </span></p>
<p>Be forewarned, RJDBC is pretty damn slow and it appears to no longer be in active development. For my use case, RODBC was clearly faster. But RJDBC works for me in Ubuntu and that was my biggest need.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Principal Component Analysis (PCA) vs Ordinary Least Squares (OLS): A Visual Explanation</title>
		<link>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/</link>
		<comments>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 17:11:27 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=866</guid>
		<description><![CDATA[Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was &#8220;Thanks to my college class I can do the math, but what does it MEAN?&#8221;
I felt like this a number of times in my life. Many of my classes were focused on the technical implementations they kinda [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sa.png"><img class="size-full wp-image-876 alignleft" style="border: 2px solid black; margin: 3px;" title="sa" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sa.png" alt="" width="299" height="82" /></a>Over at stats.stackexchange.com recently, a <a href="http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/2700#2700" onclick="pageTracker._trackPageview('/outgoing/stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/2700_2700?referer=');">really interesting question was raised</a> about principal component analysis (PCA). The gist was &#8220;Thanks to my college class I can do the math, but what does it <strong>MEAN</strong>?&#8221;</p>
<p>I&#8217;ve felt like this a number of times in my life. Many of my classes were so focused on the technical implementations that they kinda missed the section titled &#8220;Why I give a shit.&#8221; A perfect example was my Mathematics Principles of Economics class which taught me how to manually calculate a bordered Hessian but, for the life of me, I have no idea why I would ever want to calculate such a monster. OK, that&#8217;s a lie. Later in life I learned that bordered Hessian matrices are a second derivative test used in some optimizations. Not that I would EVER do that shit by hand. I&#8217;d use some R package and blindly trust that it was coded properly.</p>
<p>So back to PCA: as I was reading the aforementioned stats question I was reminded of a recent presentation that <a href="http://quanttrader.info/public/" onclick="pageTracker._trackPageview('/outgoing/quanttrader.info/public/?referer=');">Paul Teetor</a> gave at an August meeting of the Chicago R User Group. In his presentation on spread trading with R he showed a graphic that illustrated the difference between OLS and PCA. I took some notes and went home and made sure I could recreate the same thing. If you have wondered what makes OLS and PCA different, open up an R session and play along.</p>
<p><strong>Your Independent Variable Matters:</strong></p>
<p>The first observation to make is that regressing x ~ y is not the same as y ~ x even in a simple univariate regression. You can illustrate this by doing the following:</p>
<blockquote><p>set.seed(2)<br />
x &lt;- 1:100</p>
<p>e &lt;- rnorm(100, 0, 60)<br />
y &lt;- 20 + 3 * x + e</p>
<p>plot(x,y)<br />
yx.lm &lt;- lm(y ~ x)<br />
lines(x, predict(yx.lm), col="red")</p>
<p>xy.lm &lt;- lm(x ~ y)<br />
lines(predict(xy.lm), y, col="blue")</p></blockquote>
<p>You should get something that looks like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSols.png"><img class="size-medium wp-image-867 alignnone" title="olsVSols" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSols-280x300.png" alt="" width="280" height="300" /></a></p>
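<p>So it&#8217;s obvious they give different lines. But why? Well, OLS minimizes the squared error between the dependent variable and the model, measured along the dependent variable&#8217;s axis. For y ~ x that means vertical distances. Two of these errors are illustrated for the y ~ x case in the following picture:</p>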
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS1.png"><img class="alignnone size-medium wp-image-870" title="OLS1" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS1-280x300.png" alt="" width="280" height="300" /></a></p>
<p>But when we flip the model around and regress x ~ y then OLS minimizes these errors:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS2.png"><img class="alignnone size-medium wp-image-871" title="OLS2" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS2-280x300.png" alt="" width="280" height="300" /></a></p>
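<p>If you&#8217;d rather see numbers than pictures, here&#8217;s a minimal sketch that puts both regressions in the same dy/dx terms. It rebuilds the same toy data as above so it runs on its own. The names yxSlope and xySlope are just ones I made up for this illustration.</p>
<pre>set.seed(2)
x &lt;- 1:100
y &lt;- 20 + 3 * x + rnorm(100, 0, 60)

# slope of y ~ x as-is, and slope of x ~ y flipped back into dy/dx terms
yxSlope &lt;- coef(lm(y ~ x))[["x"]]
xySlope &lt;- 1 / coef(lm(x ~ y))[["y"]]
c(yx = yxSlope, xy = xySlope)

# the two raw OLS slopes multiply to the squared correlation,
# so the two lines can only coincide when the fit is perfect
coef(lm(y ~ x))[["x"]] * coef(lm(x ~ y))[["y"]]
cor(x, y)^2</pre>
<p>The last two lines should print the same number. The noisier the data, the smaller that number gets and the further apart the two lines swing.</p>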
<p>Ok, so what about PCA?</p>
<p>Well let&#8217;s draw the first principal component the old school way:</p>
<blockquote><p>#normalize means and cbind together<br />
xyNorm &lt;- cbind(x=x-mean(x), y=y-mean(y))<br />
plot(xyNorm)</p>
<p>#covariance<br />
xyCov &lt;- cov(xyNorm)<br />
xyEigen &lt;- eigen(xyCov)<br />
eigenValues &lt;- xyEigen$values<br />
eigenVectors &lt;- xyEigen$vectors</p>
<p>plot(xyNorm, ylim=c(-200,200), xlim=c(-200,200))<br />
lines(xyNorm[,"x"], eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[,"x"])<br />
lines(xyNorm[,"x"], eigenVectors[2,2]/eigenVectors[1,2] * xyNorm[,"x"])</p>
<p># the largest eigenValue is the first one<br />
# so that&#8217;s our principal component.<br />
# but the principal component is in normalized terms (mean=0)<br />
# and we want it back in real terms like our starting data<br />
# so let&#8217;s denormalize it<br />
plot(x, y)<br />
lines(x, (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[,"x"]) + mean(y))<br />
# that looks right. line through the middle as expected</p>
<p># what if we bring back our other two regressions?<br />
lines(x, predict(yx.lm), col="red")<br />
lines(predict(xy.lm), y, col="blue")</p></blockquote>
<p>PCA minimizes the error orthogonal (perpendicular) to the model line. So the first principal component looks like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/pca.png"><img class="alignnone size-medium wp-image-872" title="pca" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/pca-280x300.png" alt="" width="280" height="300" /></a></p>
<p>The two yellow lines, as in the previous images, are examples of two of the errors which the routine minimizes.</p>
<p>So if we plot all three lines on the same scatter plot we can see the differences:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSpca.png"><img class="alignnone size-medium wp-image-873" title="olsVSpca" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSpca-280x300.png" alt="" width="280" height="300" /></a></p>
<p>The x ~ y OLS and the first principal component are pretty close, but click on the image to get a better view and you will see they are not exactly the same.</p>
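<p>To convince yourself that the three lines really are chasing three different errors, here&#8217;s a rough sketch that scores each line by its sum of squared perpendicular distances. I&#8217;m using prcomp() here instead of the by-hand eigen() approach above, and orthoSS() is a helper I made up for this example, not anything from a package.</p>
<pre>set.seed(2)
x &lt;- 1:100
y &lt;- 20 + 3 * x + rnorm(100, 0, 60)

# slope of the first principal component, via prcomp()
pc &lt;- prcomp(cbind(x, y))
pcaSlope &lt;- pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]

# slopes of the two OLS lines, both expressed as dy/dx
yxSlope &lt;- coef(lm(y ~ x))[["x"]]
xySlope &lt;- 1 / coef(lm(x ~ y))[["y"]]

# sum of squared perpendicular distances from the points to a line
# that runs through the point of means with the given slope
orthoSS &lt;- function(slope) {
  vertResid &lt;- (y - mean(y)) - slope * (x - mean(x))
  sum(vertResid^2) / (1 + slope^2)
}

sapply(c(yx = yxSlope, pca = pcaSlope, xy = xySlope), orthoSS)</pre>
<p>All three lines pass through the point of means, so the slope is the only thing that differs. The pca entry should come out smallest, since perpendicular error is exactly what the first principal component minimizes.</p>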
<p>All the code from the above examples can be found in a <a href="http://gist.github.com/582767" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/582767?referer=');">gist over at GitHub.com</a>. It&#8217;s best to copy and paste from the gist as sometimes WordPress molests my quotes and breaks the codez.</p>
<p>The best introduction to PCA which I have read is the one I link to on Stats.StackExchange.com. It&#8217;s titled <a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf" onclick="pageTracker._trackPageview('/outgoing/www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf?referer=');">&#8220;A Tutorial on Principal Components Analysis&#8221; by Lindsay I Smith</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Third, and Hopefully Final, Post on Correlated Random Normal Generation (Cholesky Edition)</title>
		<link>http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/</link>
		<comments>http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 18:03:21 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[risk]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=824</guid>
		<description><![CDATA[When I did a brief post three days ago I had no plans on writing two more posts on correlated random number generation. But I&#8217;ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, Gappy, calls me out and says, &#8220;the way mensches do multivariate (log)normal [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_825" class="wp-caption alignleft" style="width: 260px"><a href="http://www.sabix.org/bulletin/b39/vie.html" onclick="pageTracker._trackPageview('/outgoing/www.sabix.org/bulletin/b39/vie.html?referer=');"><img class="size-medium wp-image-825 " title="39-cholesky" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/39-cholesky-250x300.jpg" alt="" width="250" height="300" /></a><p class="wp-caption-text">André-Louis Cholesky is my homeboy</p></div>
<p>When I did a <a href="http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/">brief post three days ago</a> I had no plans on writing two more posts on correlated random number generation. But I&#8217;ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, <a href="http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/comment-page-1/#comment-5068">Gappy, calls me out</a> and says, &#8220;the way mensches do multivariate (log)normal variates is via Cholesky. It’s simple, instructive, and fast.&#8221;  And I think we&#8217;re all smart enough to read through Mr. Gappy&#8217;s comment and see that he&#8217;s saying I&#8217;m a complicated, opaque, and slow, גוי‎. My wife called and said his list would be more accurate if he added &#8216;emotionally detached.&#8217; I have no idea what she means.</p>
<p>At any rate, in response to Gappy&#8217;s comment, here is the third verse (same as the first). The crux of the change is the following lines:</p>
<pre>
<blockquote>

# shift the mean of ourData to zero
ourData0 &lt;- as.data.frame(sweep(ourData,2,colMeans(ourData),"-"))

#Cholesky Decomposition of the covariance matrix
C &lt;- chol(nearPD(cov(ourData0))$mat)

#create a matrix of random standard normals, one column per draw
Z &lt;- matrix(rnorm(n * ncol(ourData)), nrow=ncol(ourData))

#multiply the standard normals by the transpose of the Cholesky
X &lt;- t(C) %*% Z

myDraws &lt;- data.frame(as.matrix(t(X)))
names(myDraws) &lt;- names(ourData)

# we still need to shift the means of the samples.

# shift the mean of the draws over to match the starting data
myDraws &lt;- as.data.frame(sweep(myDraws,2,colMeans(ourData),"+"))
</blockquote>
</pre>
<p><em><strong>Edit: </strong>When I first published this example, I didn&#8217;t shift the means prior to taking the cov(). I&#8217;ve since corrected that.  Also thanks to @fdaapproved on Twitter who pointed out that I can replace the loop I originally used with myDraws &lt;- as.data.frame(sweep(t(X),2,colMeans(ourData),"+"))</em></p>
<p>This method, which uses Cholesky decomposition, is how I initially learned to create correlated random draws. I think this method is comparable to the mvrnorm() method. mvrnorm() is handy because it wraps everything above in one single line of code. But the above method is reliant only on the Matrix package and that&#8217;s only for the nearPD() function. If you are familiar with the guts of the mvrnorm() function and the chol() function, I&#8217;d love for you to comment on any technical differences. I looked briefly at the code for both and quickly realized my matrix math was rusty enough that it was going to take a while for me to sort through the code.</p>
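<p>If you don&#8217;t have ourData handy, here&#8217;s a minimal base-R sketch of the same Cholesky recipe. The covariance matrix Sigma and the means mu are numbers I made up for the illustration, so treat this as a toy rather than the script from the gist. Since Sigma is positive definite by construction, nearPD() isn&#8217;t needed here.</p>
<pre>set.seed(42)

# hypothetical target covariance matrix and means
Sigma &lt;- matrix(c(4, 2,
                  2, 3), nrow = 2, byrow = TRUE)
mu &lt;- c(10, -5)
n &lt;- 1e5

# chol() returns the upper triangular factor, so t(C) %*% C gives Sigma back
C &lt;- chol(Sigma)

# one column of independent standard normals per draw
Z &lt;- matrix(rnorm(n * 2), nrow = 2)

# correlate the draws, then shift them onto the target means
X &lt;- t(C) %*% Z
draws &lt;- sweep(t(X), 2, mu, "+")

round(cov(draws), 2)
round(colMeans(draws), 2)</pre>
<p>The sample covariance and means of the draws should land close to Sigma and mu. On the mvrnorm() comparison: as far as I can tell from the MASS source, mvrnorm() factors Sigma with eigen() rather than chol(), which is why it can cope with a covariance matrix that is only positive semi-definite.</p>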
<p>If you want the whole script you can find it embedded below <a href="http://gist.github.com/562567" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/562567?referer=');">and on Github</a>.</p>
<script src="http://gist.github.com/562567.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Even Simpler Multivariate Correlated Simulations</title>
		<link>http://www.cerebralmastication.com/2010/08/even-simpler-multivariate-correlated-simulations/</link>
		<comments>http://www.cerebralmastication.com/2010/08/even-simpler-multivariate-correlated-simulations/#comments</comments>
		<pubDate>Tue, 31 Aug 2010 15:17:27 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[risk]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=804</guid>
		<description><![CDATA[So after yesterday&#8217;s post on Simple Simulation using Copulas I got a very nice email that basically begged the question, &#8220;Dude, why are you making this so hard?&#8221; The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/08/Screenshot-Untitled-Window-3.png"><img class="alignleft size-full wp-image-803" title="mvrnorm example" src="http://www.cerebralmastication.com/wp-content/uploads/2010/08/Screenshot-Untitled-Window-3.png" alt="" width="341" height="221" /></a>So after yesterday&#8217;s post on <a href="http://www.cerebralmastication.com/2010/08/stochastic-simulation-with-copulas-in-r/">Simple Simulation using Copulas</a> I got a very nice email that basically begged the question, &#8220;Dude, why are you making this so hard?&#8221; The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from the MASS package. Well I did a quick</p>
<blockquote><p>?mvrnorm</p></blockquote>
<p>and, I&#8217;ll be damned, he&#8217;s right! The advantage of using a copula is the ability to simulate correlation structures where the correlation is different for different levels of values. So that gives the flexibility to make the tails of the distributions more correlated, for example. But my example yesterday was purposefully simple&#8230; so simple that a copula was not even needed.</p>
<p>After creating my sample data all I really needed to do was this:</p>
<blockquote><p>library(MASS)<br />
myDraws &lt;- mvrnorm(1e5, mu=colMeans(ourData), Sigma=cov(ourData))</p></blockquote>
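<p>For anyone playing along without my sample data, here&#8217;s a self-contained sketch. The ourData built below is a made-up stand-in with three columns, not the data from yesterday&#8217;s post. It only assumes that MASS is installed, which it is in a standard R install.</p>
<pre>library(MASS)
set.seed(1)

# hypothetical stand-in for ourData: three correlated columns
ourData &lt;- data.frame(a = rnorm(1000, 10, 2), b = rnorm(1000, 5, 1))
ourData$c &lt;- ourData$a + ourData$b + rnorm(1000)

# draw from a multivariate normal with the same means and covariance
myDraws &lt;- mvrnorm(1e5, mu = colMeans(ourData), Sigma = cov(ourData))

# differences between the starting and simulated correlations
round(cor(ourData) - cor(myDraws), 2)</pre>
<p>The differences should all sit near zero, which is the whole point: the simulated draws carry the same correlation structure as the starting data.</p>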
<p>So I  took my example from yesterday and updated it using the mvrnorm() code and, as is my custom, put up a <a href="http://gist.github.com/559082" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/559082?referer=');">Github gist.</a> The code is embedded below as well. I added a little ggplot2 code at the end that will create a facet plot of the 4 distributions showing the shape of the distributions of both the starting data and the simulated data. The plot in the upper left of this post is the ggplot output.</p>
<p><em><strong>EDIT: </strong></em>The email hipping me to this was sent by <a href="http://dirk.eddelbuettel.com" onclick="pageTracker._trackPageview('/outgoing/dirk.eddelbuettel.com?referer=');">Dirk Eddelbuettel</a> who&#8217;s been very helpful to me more times than I can count. I had omitted his name initially. However after confirming with Dirk, he told me it was OK to mention him by name in this post.</p>
<script src="http://gist.github.com/559082.js"></script>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/08/even-simpler-multivariate-correlated-simulations/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

