<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication &#187; Uncategorized</title>
	<atom:link href="http://www.cerebralmastication.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Wed, 07 Dec 2011 13:08:46 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Fitting Distribution X to Data From Distribution Y</title>
		<link>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/</link>
		<comments>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/#comments</comments>
		<pubDate>Thu, 12 May 2011 20:31:31 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=1009</guid>
		<description><![CDATA[I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I&#8217;m not a &#8220;closed form&#8221; kinda guy. I&#8217;m more of a &#8220;numerical simulation&#8221; type of fellow. So I whipped up a little R code to illustrate the process then we changed [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/05/rstudio-plot.png"><img class="alignleft size-medium wp-image-1010" title="rstudio-plot" src="http://www.cerebralmastication.com/wp-content/uploads/2011/05/rstudio-plot-300x240.png" alt="" width="300" height="240" /></a>I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I&#8217;m not a &#8220;closed form&#8221; kinda guy. I&#8217;m more of a &#8220;numerical simulation&#8221; type of fellow. So I whipped up a little R code to illustrate the process then we changed the parameters of the gamma distribution to see how it impacted fit. An exercise like this is what I call building a &#8220;toy model&#8221; and I think this is invaluable as a method for building intuition and a visceral understanding of data.<br />
Here&#8217;s some example code which we played with:</p>
<blockquote>
<div style="overflow:auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family:monospace;"><a href="http://inside-r.org/r-doc/base/set.seed" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/set.seed?referer=');"><span style="color: #003399; font-weight: bold;">set.seed</span></a><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">3</span><span style="color: #009900;">&#41;</span>
x <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/stats/rgamma" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/rgamma?referer=');"><span style="color: #003399; font-weight: bold;">rgamma</span></a><span style="color: #009900;">&#40;</span>1e5<span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">.2</span><span style="color: #009900;">&#41;</span>
<a href="http://inside-r.org/r-doc/graphics/plot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/plot?referer=');"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;"># normalize the gamma so it's between 0 &amp; 1</span>
<span style="color: #666666; font-style: italic;"># .0001 added because having exactly 1 causes fail</span>
xt <span style="">&lt;-</span> x <span style="">/</span> <span style="color: #009900;">&#40;</span> <a href="http://inside-r.org/r-doc/base/max" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/max?referer=');"><span style="color: #003399; font-weight: bold;">max</span></a><span style="color: #009900;">&#40;</span> x <span style="color: #009900;">&#41;</span> <span style="">+</span> <span style="color: #cc66cc;">.0001</span> <span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;"># fit a beta distribution to xt</span>
<a href="http://inside-r.org/r-doc/base/library" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/library?referer=');"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">&#40;</span> <a href="http://inside-r.org/packages/cran/MASS" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/packages/cran/MASS?referer=');"><span style="">MASS</span></a> <span style="color: #009900;">&#41;</span>
fit.beta <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/MASS/fitdistr" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/MASS/fitdistr?referer=');"><span style="color: #003399; font-weight: bold;">fitdistr</span></a><span style="color: #009900;">&#40;</span> xt<span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;beta&quot;</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/stats/start" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/start?referer=');"><span style="color: #003399; font-weight: bold;">start</span></a> = <a href="http://inside-r.org/r-doc/base/list" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/list?referer=');"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">&#40;</span> shape1=<span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> shape2=<span style="color: #cc66cc;">5</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span>
&nbsp;
x.beta <span style="">&lt;-</span> <a href="http://inside-r.org/r-doc/stats/rbeta" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/rbeta?referer=');"><span style="color: #003399; font-weight: bold;">rbeta</span></a><span style="color: #009900;">&#40;</span>1e5<span style="color: #339933;">,</span>fit.beta<span style="">$</span>estimate<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>fit.beta<span style="">$</span>estimate<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">## plot the pdfs on top of each other</span>
<a href="http://inside-r.org/r-doc/graphics/plot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/plot?referer=');"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>xt<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<a href="http://inside-r.org/r-doc/graphics/lines" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/graphics/lines?referer=');"><span style="color: #003399; font-weight: bold;">lines</span></a><span style="color: #009900;">&#40;</span><a href="http://inside-r.org/r-doc/stats/density" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/density?referer=');"><span style="color: #003399; font-weight: bold;">density</span></a><span style="color: #009900;">&#40;</span>x.beta<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/col" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/base/col?referer=');"><span style="color: #003399; font-weight: bold;">col</span></a>=<span style="color: #0000ff;">&quot;red&quot;</span> <span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">## plot the qqplots</span>
<a href="http://inside-r.org/r-doc/stats/qqplot" onclick="pageTracker._trackPageview('/outgoing/inside-r.org/r-doc/stats/qqplot?referer=');"><span style="color: #003399; font-weight: bold;">qqplot</span></a><span style="color: #009900;">&#40;</span>xt<span style="color: #339933;">,</span> x.beta<span style="color: #009900;">&#41;</span></pre>
</div>
</div>
<p><a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org" onclick="pageTracker._trackPageview('/outgoing/www.inside-r.org/pretty-r?referer=');">Created by Pretty R at inside-R.org</a></p>
</blockquote>
<p>It&#8217;s not illustrated above, but it&#8217;s probably useful to transform the simulated data (x.beta) back into pre-normalized space by multiplying by max( x ) + .0001. (I swore I&#8217;d never say this, but I lied.) I&#8217;ll leave that as an exercise for the reader.</p>
<p>Another very useful tool in building a mental road map of distributions is the <a href="http://www.johndcook.com/distribution_chart.html" onclick="pageTracker._trackPageview('/outgoing/www.johndcook.com/distribution_chart.html?referer=');">graphical chart of distribution relationships that John Cook introduced me to</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/05/fitting-distribution-x-to-data-from-distribution-y/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Shell scripting EC2 for fun and profit</title>
		<link>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/</link>
		<comments>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/#comments</comments>
		<pubDate>Fri, 06 May 2011 20:57:40 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=993</guid>
		<description><![CDATA[Lately I&#8217;ve been doing some work with creating ad-hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis&#8217;s very cool doRedis backend for the R foreach package. But that&#8217;s a whole other post. What I was scratching my [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.thinkgeek.com/tshirts-apparel/unisex/frustrations/374d/" onclick="pageTracker._trackPageview('/outgoing/www.thinkgeek.com/tshirts-apparel/unisex/frustrations/374d/?referer=');"><img class="alignleft size-full wp-image-994" style="border: 1px solid black; margin: 2px;" title="lg-go-away-tshirt" src="http://www.cerebralmastication.com/wp-content/uploads/2011/05/lg-go-away-tshirt.jpg" alt="" width="179" height="218" /></a>Lately I&#8217;ve been doing some work with creating ad-hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis&#8217;s very cool <a href="http://cran.r-project.org/web/packages/doRedis/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/doRedis/index.html?referer=');">doRedis backend</a> for the R <a href="http://cran.r-project.org/web/packages/foreach/index.html" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/packages/foreach/index.html?referer=');">foreach package</a>. But that&#8217;s a whole other post. What I was scratching my head about today was that I&#8217;d really just like to, with a single command, spin up an EC2 instance, wait for it to come up, and then ssh into it. I do this iteration about 20 times a day when I&#8217;m testing things, so it seemed to make sense to shell script it.<br />
To do this, you need the EC2 command line tools installed on your workstation. In Ubuntu that&#8217;s as easy as `sudo apt-get install ec2-api-tools`</p>
<p>So here&#8217;s a short shell script to spin up an instance, wait 30 seconds, then connect:<br />
<script src="http://gist.github.com/959780.js"></script></p>
<p>If you&#8217;re reading this through an RSS reader, you can see the script over at <a href="https://gist.github.com/959780" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/959780?referer=');">github</a>.</p>
<p>Obviously you&#8217;ll need to change the parameters at the top of the script to suit your needs. But since this was a bit of a pain in the donkey hole for me to figure out, I thought I would share.</p>
<p>If you want to help out, I&#8217;d love you to enlighten me on how to have the script figure out if an instance has finished booting so I could eliminate the sleep step.</p>
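<p>One possible way to drop the fixed sleep is to poll until something on the instance answers. What follows is only a minimal sketch, not a tested EC2 recipe: <code>wait_until</code> is a hypothetical helper name, and the probe in the usage comment (<code>nc -z</code> against port 22, with the host name in a <code>HOST</code> variable) is an assumption about what is installed and how the instance is reached.</p>

```shell
#!/bin/sh
# wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or MAX_TRIES attempts are used up.
# Returns 0 on the first success, 1 if every attempt failed.
# (wait_until is a hypothetical helper, not part of the EC2 tools.)
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=0
    while [ "$attempt" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Only pause when another attempt is still to come.
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use (assumes nc is installed and HOST holds the instance's address):
#   if wait_until 30 5 nc -z "$HOST" 22; then ssh "$HOST"; fi
```

<p>An open port 22 only says that sshd is listening, so the first login can still fail for a moment while the instance finishes its setup; wrapping the ssh call in the same helper is one way around that.</p>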
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/05/shell-scripting-ec2-for-fun-and-profit/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>The best interview question I&#8217;ve ever been asked</title>
		<link>http://www.cerebralmastication.com/2011/04/the-best-interview-question-ive-ever-been-asked/</link>
		<comments>http://www.cerebralmastication.com/2011/04/the-best-interview-question-ive-ever-been-asked/#comments</comments>
		<pubDate>Wed, 20 Apr 2011 15:33:17 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[risk]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=979</guid>
		<description><![CDATA[In 2005 I was interviewing for a job as Risk Manager with Genworth Financial. I was working a gig up in Armonk, NY so I hopped a car to the GNW office and met with Mark Griffin, at that point the Chief Risk Officer (CRO) for GNW. After some small talk, Mark asked me the [...]]]></description>
			<content:encoded><![CDATA[<p>In 2005 I was interviewing for a job as Risk Manager with <a href="http://www.google.com/finance?client=ob&amp;q=NYSE:GNW" onclick="pageTracker._trackPageview('/outgoing/www.google.com/finance?client=ob_amp_q=NYSE_GNW&amp;referer=');">Genworth Financial</a>. I was working a gig up in Armonk, NY so I hopped a car to the GNW office and met with Mark Griffin, at that point the Chief Risk Officer (CRO) for GNW. After some small talk, Mark asked me the single most interesting interview question I&#8217;ve ever been asked. I don&#8217;t recall the exact wording, but the gist was:</p>
<blockquote><p><strong>If you could go back and work more on one project from your past, what would it be and why?</strong></p></blockquote>
<p>This immediately struck me as a good question. Like all really good interview questions, there is no right answer, but any answer tells a LOT about the person answering it. I talked about a few projects I had really enjoyed from my past: a fuel hedging dashboard for an international airline, data mining government program data, but said that the one thing I wish I could work more on was reinsurance ceding strategies for insurance companies. Naturally he responded, &#8220;Why so?&#8221; So I explained the challenge and how I felt that with a little more time and a little more data I could numerically optimize reinsurance strategies. I had last worked on the problem in 2001 and now, four years later, the computing power was better and I thought I could really get it right.</p>
<p>I&#8217;m pretty sure I didn&#8217;t explain very well. Mark was obviously fishing around to see if I got a little OCD about analytical challenges and if I loved digging. I thought about Mark&#8217;s question a lot three years later when I left Genworth to go work in reinsurance, optimizing reinsurance strategies.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/04/the-best-interview-question-ive-ever-been-asked/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Details of two-way sync between two Ubuntu machines</title>
		<link>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/</link>
		<comments>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/#comments</comments>
		<pubDate>Mon, 18 Apr 2011 20:48:32 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=966</guid>
		<description><![CDATA[In a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two way file syncing. This is the detail of how I set up two machines, both Ubuntu 10.10, to perform two way sync where a file change on either machine [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png"><img class="alignleft size-full wp-image-956" title="sync" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png" alt="" width="128" height="128" /></a>In a <a href="http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/">previous post</a> I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two way file syncing. This is the detail of how I set up two machines, both Ubuntu 10.10, to perform two way sync where a file change on either machine will result in that change being replicated on the other machine.</p>
<p>I initially tried running Unison on BOTH my laptop and the server and had the server Unison set to sync with my laptop back through an SSH reverse proxy. After testing this for a while I discovered this is totally the wrong way to do it. The problem is that the Unison process makes temp directories and files in the file system of the target. So my Unison job on the laptop would be trying to sync files and, in the process, create temp files which would kick off a Unison sync on the server which would make temp files on the laptop&#8230; I think you can see how convoluted this gets.</p>
<p>So a much better solution is to only run Unison from one machine (I chose my laptop) and have the other machine (server in my case) send an SSH command (over the aforementioned reverse proxy) to the laptop asking the laptop to kick off a Unison sync. This way all of the syncs happen from the laptop.</p>
<p>So, in short, both machines run lsyncd which monitors files for changes. I keep up an SSH tunnel with reverse port forwarding which forwards a remote machine port back to my laptop&#8217;s port 22 (SSH). Unison syncs are launched ONLY from my laptop (the Unison binary still has to be present on the server, because an ssh:// root starts a Unison server process on the far end). When a change happens on my laptop, lsyncd fires off a Unison sync from my laptop that syncs it with the server. When a file changes on the server, the lsyncd job on the server makes a connection to my laptop via ssh and fires off a Unison sync between my laptop and the server.</p>
<p>Here&#8217;s an example of my lsyncd config scripts:</p>
<p><strong>Laptop:</strong></p>
<blockquote><p>settings = {<br />
logfile    = &quot;/home/jal/lsyncd/laptop/lsyncd.log&quot;,<br />
statusFile = &quot;/home/jal/lsyncd/laptop/lsyncd.status&quot;,<br />
maxDelays  = 15,<br />
--nodaemon   = true,<br />
}</p>
<p>runUnison2 = {<br />
maxProcesses = 1,<br />
delay = 15,<br />
onAttrib  = &quot;/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onCreate  = &quot;/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onDelete  = &quot;/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onModify  = &quot;/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onMove    = &quot;/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
}</p>
<p>sync{runUnison2, source=&quot;/home/jal/Documents&quot;}</p></blockquote>
<p><strong>Server:</strong></p>
<blockquote><p>settings = {<br />
logfile    = &quot;/home/jal/lsyncd/server/lsyncd.log&quot;,<br />
statusFile = &quot;/home/jal/lsyncd/server/lsyncd.status&quot;,<br />
maxDelays  = 15,<br />
--nodaemon   = true,<br />
}</p>
<p>runUnison2 = {<br />
maxProcesses = 1,<br />
delay = 15,<br />
onAttrib  = &quot;ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onCreate  = &quot;ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onDelete  = &quot;ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onModify  = &quot;ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
onMove    = &quot;ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents&quot;,<br />
}</p>
<p>sync{runUnison2, source=&quot;/home/jal/Documents&quot;}</p></blockquote>
<p>Keep in mind that I am using version 2 of lsyncd which can be downloaded here: <a href="http://code.google.com/p/lsyncd/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/lsyncd/?referer=');">http://code.google.com/p/lsyncd/</a></p>
<p>The version of lsyncd available in the Ubuntu repo is version 1.x which does not use the same config format as I illustrate above. However, if you run into dependency issues with v2, the easiest thing to do is install the repo version which will install dependencies and then manually download and install v2 from the above URL.</p>
<p>My reverse port forwarding set up looks like this:</p>
<blockquote><p>autossh -2 -4 -X -R 5432:localhost:22 12.34.56.78</p></blockquote>
<p>The -R bit forwards remote port 5432 to my laptop&#8217;s port 22, which is the SSH port. So if I run ssh localhost -p 5432 on my server, what actually happens is that I am sshing from the server to my laptop.</p>
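<p>To make the port plumbing explicit, here is a small sketch that assembles that same autossh command from named pieces and can print it instead of running it. The function names (<code>reverse_spec</code>, <code>start_tunnel</code>), the <code>DRY_RUN</code> switch, and the variable names are all made up for illustration; the flags and values are simply the ones from the command in this post.</p>

```shell
#!/bin/sh
# Values taken from the example in this post; change them for your own setup.
SERVER=12.34.56.78     # server the laptop connects out to
REMOTE_PORT=5432       # port that -R opens on the server
LOCAL_SSH_PORT=22      # sshd port on the laptop

# reverse_spec prints the argument for ssh's -R option.
# REMOTE_PORT:localhost:LOCAL_SSH_PORT means: connections made to REMOTE_PORT
# on the server are carried back through the tunnel to LOCAL_SSH_PORT here.
reverse_spec() {
    echo "${REMOTE_PORT}:localhost:${LOCAL_SSH_PORT}"
}

# start_tunnel opens the tunnel, or only prints the command when DRY_RUN=1.
# (DRY_RUN is a hypothetical switch for this sketch, not an autossh feature.)
start_tunnel() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "autossh -2 -4 -X -R $(reverse_spec) ${SERVER}"
    else
        autossh -2 -4 -X -R "$(reverse_spec)" "$SERVER"
    fi
}
```

<p>Running it with <code>DRY_RUN=1</code> first is a cheap way to eyeball the port numbers before the server-side lsyncd config starts depending on them.</p>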
<p><strong>Notes:</strong></p>
<ul>
<li>The IP address of my server in this example is 12.34.56.78.</li>
<li>Don&#8217;t try and sync the directories where the lsyncd logs are kept. That will result in an endless sync cycle as each machine keeps noticing changes endlessly. Don&#8217;t ask me how I know this.</li>
<li>The command to start the sync on the laptop is &#8220;lsyncd /home/jal/lsyncd/laptop/configfile&#8221; where configfile is the above lsyncd configuration file.</li>
<li>lsyncd could, conceivably, tell Unison to sync only the part of the directory tree that changed. I have not been able to make that feature work right, however. And it only takes Unison a few seconds to sync, so I&#8217;ve not worried about it.</li>
</ul>
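<p>The warning about log directories in the notes above can be checked mechanically before lsyncd is started. This is a minimal sketch under stated assumptions: <code>is_inside</code> and <code>check_log_dir</code> are hypothetical helper names, the paths in the usage comment are the ones from the example configs, and the check compares path strings only, so it does not resolve symlinks or relative paths.</p>

```shell
#!/bin/sh
# is_inside DIR TREE: succeeds if DIR is TREE itself or lies somewhere below it.
# Pure string comparison on absolute paths; symlinks are not resolved.
is_inside() {
    case "${1%/}/" in
        "${2%/}/"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# check_log_dir LOG_DIR SYNC_ROOT: prints a message and fails if the lsyncd
# log directory would itself be part of the synced tree.
check_log_dir() {
    if is_inside "$1" "$2"; then
        echo "refusing to start: $1 is inside synced tree $2"
        return 1
    fi
    return 0
}

# Example use before starting lsyncd on the laptop:
#   check_log_dir /home/jal/lsyncd/laptop /home/jal/Documents
```

<p>Appending a slash to both paths before comparing keeps a sibling such as /home/jal/Documents2 from being mistaken for something inside /home/jal/Documents.</p>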
<p>This has greatly sped up my <a href="http://rstudio.org" onclick="pageTracker._trackPageview('/outgoing/rstudio.org?referer=');">RStudio</a> based workflow when doing analysis with R. Now when I change files on my server using RStudio they are immediately (well it waits 15 seconds) replicated to my local machine and vice versa!</p>
<p>Good luck and if you have any suggestions please post a comment!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Fast Two Way Sync in Ubuntu!</title>
		<link>http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/</link>
		<comments>http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/#comments</comments>
		<pubDate>Sat, 09 Apr 2011 15:32:48 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[sync]]></category>
		<category><![CDATA[workflow]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=955</guid>
		<description><![CDATA[I love the portability of a laptop. I have a 45 min train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long running analytics on my laptop when I&#8217;m in the office because it bogs down my [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png"><img class="alignleft size-full wp-image-956" title="sync" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/SyncDifferent.png" alt="" width="128" height="128" /></a>I love the portability of a laptop. I have a 45 min train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long running analytics on my laptop when I&#8217;m in the office because it bogs down my laptop and all those videos on <a href="http://www.thesuperficial.com/" onclick="pageTracker._trackPageview('/outgoing/www.thesuperficial.com/?referer=');">The Superficial</a> get all jerky and stuff.</p>
<p>I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I&#8217;m going to call these collectively &#8220;my servers&#8221; for the rest of this post). The nagging problem with this has been keeping files in sync. <a href="http://rstudio.org/" onclick="pageTracker._trackPageview('/outgoing/rstudio.org/?referer=');">RStudio Server</a> has been a great help to my workflow because it lets me edit files in my browser and they run on my servers. But when a long running R job blows out files I want those IMMEDIATELY synced with my laptop. That way I know when I undock my laptop to run to the train station that all my files will be there for me to spill Old Style beer on as I ride the Metra North line.</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/dropbox_logo_home.png"><img class="alignleft size-full wp-image-958" style="margin: 5px;" title="dropbox_logo_home" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/dropbox_logo_home.png" alt="" width="209" height="54" /></a>I experimented with <a href="https://www.dropbox.com/" onclick="pageTracker._trackPageview('/outgoing/www.dropbox.com/?referer=');">Dropbox</a> and I gotta say, it&#8217;s great. It really is well engineered, fast, and drop dead simple. I love that with Dropbox I could pull up most any file from my Dropbox on my iPad or iPhone. That&#8217;s a very handy feature. And it&#8217;s fast. If I created a small text file on my server, it would be synced with my laptop in a few seconds. Perfect! Well&#8230; almost. Dropbox has a huge limitation: encryption. Dropbox encrypts for transmission and may even store files encrypted on their end. However, Dropbox controls the key. So if a rogue employee, a crafty Russian hacker, or a law enforcement officer with a subpoena gained access to Dropbox, they could get access to my files without my knowledge. As a risk manager I can&#8217;t help but see Dropbox&#8217;s security as a huge, targeted, single point of failure. It&#8217;s hard to say which would be a bigger payday: cracking GMail, or cracking Dropbox. But I suspect it&#8217;s Dropbox. There are some workarounds to try and shoehorn file encryption into Dropbox, and they all suck.</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2011/04/logo.gif"><img class="alignleft size-full wp-image-960" style="margin: 5px; border: 0pt none;" title="logo" src="http://www.cerebralmastication.com/wp-content/uploads/2011/04/logo.gif" alt="" width="85" height="80" /></a>So Dropbox can&#8217;t really give me what I want (what I really really want). But I stumbled into <a href="https://spideroak.com/" onclick="pageTracker._trackPageview('/outgoing/spideroak.com/?referer=');">Spideroak</a> who are like the smarter, but lesser known cousins of Dropbox. Their software does everything Dropbox does (including tracking all revisions!) but they have a &#8220;trust no one&#8221; model which encrypts all files before leaving my computer using, and this is critical, MY key which they don&#8217;t store. Pretty cool, eh? Spideroak also has an iPad/iPhone app and offers a neat feature that allows emailing any file in my Spideroak &#8220;bucket&#8221; to anyone using my iPhone without having to upload the file to my iPhone first. They do this by sending a special link to the email recipient that allows them to open only the file I wanted them to have. This could be a huge bacon saver on the road.</p>
<p>So Spideroak&#8217;s the panacea then? Well&#8230; um&#8230; no. They have two critical flaws: 1) They depend on time stamps on files to determine the most recent file. 2) Syncs are slow, sometimes taking more than 5 minutes for very small files. The time stamp issue is an engineering failure, plain and simple. I&#8217;ve talked to their tech support and been assured that they are going to change this and index using server time, not system time, in the future. But as of April 6, 2011, Spideroak uses local system time. For most users this is no big deal. For my use case this is painful. My server and my laptop were 6 seconds different and that time difference was enough for me to get Spideroak confused about which files were the freshest. This is a big deal when syncing two file systems with fast changing files. The other issue, slow sync, was actually more painful but probably the result of their attempt to be nice with CPU time and also encryption. When jobs on my server finished, I expected those files to start syncing within seconds and the only delay I expected was bandwidth constraints. With Spideroak, syncs might take 5 minutes to start and then it would go out for coffee, come back jittery and then finally complete. Even if Spideroak fixed the time sync issue (or I forced my laptop to set its time based on my server), it still would not work for my sync because of the huge lags.</p>
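<p>Clock drift like that 6 second gap is easy to measure before blaming the sync tool. A rough sketch, assuming both machines support <code>date +%s</code> (GNU and BSD date do); <code>clock_skew</code> is a hypothetical helper and <code>SERVER</code> in the usage comment is a placeholder host name.</p>

```shell
#!/bin/sh
# clock_skew LOCAL_EPOCH REMOTE_EPOCH: prints the absolute difference in
# seconds between two Unix timestamps.
clock_skew() {
    diff=$(( $2 - $1 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}

# Example use (SERVER is a placeholder; the ssh round trip adds some error,
# so treat the result as approximate):
#   clock_skew "$(date +%s)" "$(ssh "$SERVER" date +%s)"
```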
<p>So looking at Dropbox and Spideroak I realized that I liked everything about Spideroak except its sync. It&#8217;s a great cloud backup tool that seems to properly do encryption, it&#8217;s multiplatform (win, linux, mac), has an iPad/iPhone app for viewing/sending files, it&#8217;s smart about backups and won&#8217;t upload the same file twice (even if the file is on two different computers). For my business use, I just can&#8217;t use Dropbox. The lack of &#8220;trust no one&#8221; encryption is a deal killer. So what I really need is a sync solution to use alongside Spideroak.</p>
<p>There are some neat projects out there for sync. Projects like <a href="http://www.sparkleshare.org/" onclick="pageTracker._trackPageview('/outgoing/www.sparkleshare.org/?referer=');">Sparkleshare</a> look really promising but they are trying to do all sorts of things, not just sync. I&#8217;ve already settled on letting Spideroak do backup and version tracking so I don&#8217;t really need all those features&#8230; OK, OK, I can hear you muttering, &#8220;just use rsync and be done with it already.&#8221; Yeah, that&#8217;s a good idea. But rsync is single directional and does a lot of things well, but can also be a bit of an asshole if you don&#8217;t set all the flags right and rub its belly the right way. If you google for &#8220;bidirectional sync&#8221; you&#8217;re going to see this problem has plagued a lot of folks. This blog post has already gone on long enough so I&#8217;ll cut to the chase. Here&#8217;s the stack of tools I settled on for cobbling together my own secure, real-time, bidirectional sync between two Ubuntu boxes (one of which changes IP address and is often behind a NAT router):</p>
<p>1) <a href="http://www.cis.upenn.edu/~bcpierce/unison/" onclick="pageTracker._trackPageview('/outgoing/www.cis.upenn.edu/_bcpierce/unison/?referer=');">Unison</a> &#8211; Fast sync using rsync-esque algos and really fast caching/scanning</p>
<p>2) <a href="http://code.google.com/p/lsyncd/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/lsyncd/?referer=');">lsyncd</a> &#8211; Live (real-time) sync daemon</p>
<p>3) <a href="http://linux.die.net/man/1/autossh" onclick="pageTracker._trackPageview('/outgoing/linux.die.net/man/1/autossh?referer=');">autossh</a> &#8211; ssh client with a nifty wrapper that keeps the connection alive and respawns the connection if dropped</p>
<p>I&#8217;ll do another post with the nitty-gritty of how I set this up, but the short version is that I installed Unison and lsyncd on both the laptop and the server. Single direction sync from my laptop to the server is pretty straightforward: lsyncd watches files, and if one changes it calls unison, which syncs the files with the server. The tricky bit was getting my server to be able to sync with my laptop which is often behind a NAT router. The solution was to open an ssh connection from my laptop to my server using autossh and reverse port forward port 5555 from the server back to my laptop&#8217;s port 22. That way an lsyncd process on the server can monitor the file system and when it sees a change can kick off a unison job that syncs the server to ssh://localhost:5555//some/path which is forwarded to my laptop! Autossh makes sure that connection does not get dropped and respawns if it does get dropped. So with a little shell scripting to start the lsyncd daemon on both machines, some config of lsyncd, and a local shell script to fire off the autossh connection, I&#8217;ve got real-time bidirectional sync!</p>
<p>In a follow up post I&#8217;ll post the details of this configuration. Stay tuned. (EDIT: <a href="http://www.cerebralmastication.com/2011/04/details-of-two-way-sync-between-two-ubuntu-machines/">Update posted</a>!)</p>
<p>If you&#8217;ve solved sync a different way and you like your solution, please comment. I&#8217;ve not settled that this is my long-term solution. It&#8217;s just a solution that works. Which is more than I had yesterday.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/04/fast-two-way-sync-in-ubuntu/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Where the heck has JD been?</title>
		<link>http://www.cerebralmastication.com/2011/03/where-the-heck-has-jd-been/</link>
		<comments>http://www.cerebralmastication.com/2011/03/where-the-heck-has-jd-been/#comments</comments>
		<pubDate>Wed, 23 Mar 2011 02:14:37 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=947</guid>
		<description><![CDATA[It&#8217;s been pointed out to me that I haven&#8217;t had any blog posts in a while. It&#8217;s true. I&#8217;m fairly slack. But in the last few months I&#8217;ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been pointed out to me that I haven&#8217;t had any blog posts in a while. It&#8217;s true. I&#8217;m fairly slack. But in the last few months I&#8217;ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, I&#8217;m nothing compared to <a href="http://www.badassoftheweek.com/akaiwa.html" onclick="pageTracker._trackPageview('/outgoing/www.badassoftheweek.com/akaiwa.html?referer=');">Hideaki Akaiwa</a>.</p>
<p>And you may have heard that the R Cookbook by Chicago&#8217;s own Paul Teetor has been printed! Way to go Paul! And for a limited time you can get the book 50% off <a href="http://oreilly.com/store/dd-jpn.html" onclick="pageTracker._trackPageview('/outgoing/oreilly.com/store/dd-jpn.html?referer=');">direct from O&#8217;Reilly</a>.</p>
<p>And let it be known: I&#8217;ve double dog dared you to find a stats or programming book with any better back cover quotes:</p>
<p><img class="alignnone" title="back cover" src="http://static.ow.ly/photos/original/9rtx.png" alt="" width="553" height="572" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2011/03/where-the-heck-has-jd-been/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Controlling Amazon Web Services using rJava and the AWS Java SDK</title>
		<link>http://www.cerebralmastication.com/2010/11/controlling-amazon-web-services-using-rjava-and-the-aws-java-sdk/</link>
		<comments>http://www.cerebralmastication.com/2010/11/controlling-amazon-web-services-using-rjava-and-the-aws-java-sdk/#comments</comments>
		<pubDate>Tue, 30 Nov 2010 19:51:17 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[rJava]]></category>
		<category><![CDATA[S3]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=937</guid>
		<description><![CDATA[ I&#8217;ve been messing around with using Amazon Web Services for a while. I&#8217;ve had some projects where I wanted to upload files to S3 or fire off EMR jobs. I&#8217;ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call the tools from the command line. [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignnone" style="border: 1px solid black; margin: 4px;" title="aws" src="http://awsmedia.s3.amazonaws.com/logo_aws.gif" alt="" width="164" height="60" /> I&#8217;ve been messing around with using Amazon Web Services for a while. I&#8217;ve had some projects where I wanted to upload files to S3 or fire off EMR jobs. I&#8217;ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call the tools from the command line. This has some real disadvantages, however. Using the command line tools means each tool has to be configured individually which is painful on a new machine. It&#8217;s also much harder to roll my R code up into a CRAN package because I have to check dependencies on the command line tools and ensure that the user has properly configured each tool. Clearly a pain in the ass.</p>
<p>So I was looking for more simple/elegant solutions. After thinking the <a href="http://code.google.com/p/boto/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/boto/?referer=');">Boto</a> library for Python might be helpful, I realized that the easiest way to use that would be with <a href="http://rjython.r-forge.r-project.org/" onclick="pageTracker._trackPageview('/outgoing/rjython.r-forge.r-project.org/?referer=');">rJython</a> which meant having to interact with R, Python, AND Java. Considering I don&#8217;t program in Python or Java, that seemed like a fair bit of complexity. Then I realized that the canonical implementation of the AWS API was the AWS Java SDK. The <a href="http://www.rforge.net/rJava/" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/rJava/?referer=');">rJava</a> package makes interacting with Java from R a viable option.</p>
<p>Since I&#8217;ve never written a single line of Java code in my pathetic life, this was somewhat harder than it could have been. But with some help from <a href="http://romainfrancois.blog.free.fr/" onclick="pageTracker._trackPageview('/outgoing/romainfrancois.blog.free.fr/?referer=');">Romain Francois</a> I was able to cobble together &#8220;something that works.&#8221; The code below gives a simple example of interfacing with S3. The example will look to see if a given bucket exists on S3, if not it will create the bucket. Then it will upload a single file from your PC into that bucket. You will have to <a href="http://aws.amazon.com/sdkforjava/" onclick="pageTracker._trackPageview('/outgoing/aws.amazon.com/sdkforjava/?referer=');">download the SDK</a>, unzip it in the location of your choice, and then change the script to reflect your configuration.</p>
<p>If you are running R in Ubuntu, you should install rJava using apt-get instead of using install.packages() from inside of R:</p>
<blockquote><p>sudo apt-get install r-cran-rjava</p></blockquote>
<p>Here&#8217;s the codez. And a <a href="https://gist.github.com/722230" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/722230?referer=');">direct link</a> for you guys reading this through an RSS reader:<br />
<script src="http://gist.github.com/722230.js"></script></p>
<p>I realize that Duncan Temple Lang has created the <a href="http://www.omegahat.org/RAmazonS3/" onclick="pageTracker._trackPageview('/outgoing/www.omegahat.org/RAmazonS3/?referer=');">RAmazonS3</a> package which can easily do what the above code sample does. The advantage of using rJava and the AWS Java SDK is the ability to apply the same approach to <strong>ALL</strong> the AWS services. And since Amazon maintains the SDK this guarantees that future AWS services and features will be supported as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/11/controlling-amazon-web-services-using-rjava-and-the-aws-java-sdk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The O&#8217;Reilly Safari Books Online app broke my heart</title>
		<link>http://www.cerebralmastication.com/2010/11/the-oreilly-safari-books-online-app-broke-my-heart/</link>
		<comments>http://www.cerebralmastication.com/2010/11/the-oreilly-safari-books-online-app-broke-my-heart/#comments</comments>
		<pubDate>Thu, 11 Nov 2010 22:15:54 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[strategy]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=910</guid>
		<description><![CDATA[
I&#8217;m a huge O&#8217;Reilly Media fan boy. I can&#8217;t hide it. I hear Tim O&#8217;Reilly speak at conferences and I think to myself, &#8220;Screw being president, I want to be Tim O&#8217;Reilly.&#8221; I&#8217;ve been a subscriber to their online book services called Safari Books Online for years. Every month I see the bill for $43 [...]]]></description>
			<content:encoded><![CDATA[
<p>I&#8217;m a huge O&#8217;Reilly<a href="http://www.cerebralmastication.com/wp-content/uploads/2010/11/Screenshot-8.png"><img class="alignleft size-medium wp-image-913" style="border: 1px solid black; margin: 5px;" title="Screenshot-8" src="http://www.cerebralmastication.com/wp-content/uploads/2010/11/Screenshot-8-300x200.png" alt="" width="300" height="200" /></a> Media fan boy. I can&#8217;t hide it. I hear Tim O&#8217;Reilly speak at conferences and I think to myself, &#8220;Screw being president, I want to be Tim O&#8217;Reilly.&#8221; I&#8217;ve been a subscriber to their online book service called Safari Books Online for years. Every month I see the bill for $43 come through and I think to myself, &#8220;Self, that&#8217;s the best $43 you spent all month.&#8221; But the real downside of Safari Books Online is that it is, as the name implies, an online service. I spend 90 minutes each day on a train and I would LOVE to spend a huge chunk of that time reading O&#8217;Reilly books. My iPad is not the 3G model so reading Safari Books Online is not an option for me. Then earlier this week I read that they had released the O&#8217;Reilly Safari to Go app for the iPad. I was stoked and excited! I got so breathless that I even tweeted my excitement and then was re-tweeted by @OreillyMedia as you can see from the image in the upper left corner.</p>
<p>I immediately downloaded the app and started playing with it. The fit and finish was not too good, but this is a first release product so I was cutting it some slack. It was a little slow and the screens visibly flashed when I changed screens. Typing<a href="http://www.cerebralmastication.com/wp-content/uploads/2010/11/photo1.png"><img class="alignleft size-medium wp-image-919" title="photo" src="http://www.cerebralmastication.com/wp-content/uploads/2010/11/photo1-280x300.png" alt="" width="280" height="300" /></a> was so sluggish that the cursor would lag behind my typing for 5-6 letters. This was all annoying but I was so excited to have these books on my train ride. After struggling a little to figure out how to get books into my offline book bag I loaded 6 books into the bag and then left the app up and iPad running so they could download while I worked. When I got on the train I was dismayed to discover that I had no books at all in my offline book bag. Odd. I know I put 6 in there. After tucking my daughter into bed I spent 3 hours fighting with the app. My final conclusion is that the app is complete and utter shit. It&#8217;s poorly designed, poorly executed, and horrible to use. And the UI is nothing like an iPad app. It has zero redeeming value. The offline book bag is so buggy that it takes me ~8 tries to get a single book in the book bag. Often this is after waiting &gt; 5 minutes for the book to download only to have it fail so I have to start over. For online book reading on the iPad the mobile version of the Safari Books website is far superior to the iPad app. Most of my time with the app was spent reading error messages like the one to the left. What I found bemusing was that I really did feel mad at O&#8217;Reilly for this app. It wasn&#8217;t the mad that I feel when I get ripped off, it was the mad that I feel when my 3 year old dumps her plate out on the table like a baby. It was a feeling of being let down by someone who I know can do better. 
And it appears I&#8217;m not the only one. The pissed off comments on the Safari Books Online <a href="http://safaribooksonline.wordpress.com/2010/11/08/safari-to-go-update/" onclick="pageTracker._trackPageview('/outgoing/safaribooksonline.wordpress.com/2010/11/08/safari-to-go-update/?referer=');">official blog</a> are downright angry. So I did a little soul searching and asked myself why I felt so angry about my experience with the app. What I uncovered I tried to capture in a response post I made to CJ Rayhill, SVP Product Management &amp; Technology. You can see my response <a href="http://safaribooksonline.wordpress.com/2010/11/08/safari-to-go-update/#comment-939" onclick="pageTracker._trackPageview('/outgoing/safaribooksonline.wordpress.com/2010/11/08/safari-to-go-update/_comment-939?referer=');">here</a>. And here&#8217;s the same text for your easy reading enjoyment:</p>
<blockquote><p>CJ, I know you and your team have to be in pain over this app. It&#8217;s terrible. You know that. And now you have a sunk cost problem, a vendor issue, and a “pissed off geeks with pitchforks” problem. Many of us have been there. There are bound to be multiple come-to-Jesus meetings over this. I&#8217;ve sat in meetings like that. I&#8217;ve led meetings like that. It sucks for every single person at the table.</p>
<p>I&#8217;m not sure if the vitriol in the tone of the comments above makes sense to you or your leadership team. Some folks reading this blog might think that the responses are a little over the top. Let me take a shot at helping this make sense through a personal anecdote.</p>
<p>I love O&#8217;Reilly Publishing. Recently I was invited to be a tech reviewer for _R Cookbook_ and I was over the moon to be asked by O&#8217;Reilly to be a reviewer because I love O&#8217;Reilly and I have a ton of positive feelings about those fantastic animal clad book covers. So, it&#8217;s an understatement to say I&#8217;m a fan. And I have this very personal device, my iPad, which I also love. This device is so intimate that I bring it to bed with me and my wife sometimes feels jealousy toward the time and attention I give to this device. So I invited O&#8217;Reilly, who I love and trust, to come join me for a shared experience on this very personal device. And when O&#8217;Reilly came over, in the form of the Safari to Go app, it was like having a trusted friend over who then decides to rub their muddy shoes on my suede couch while yelling &#8220;F*ck your couch! F*ck your couch!&#8221;  The app is shockingly bad and totally inconsistent with the rest of my experience with O&#8217;Reilly. Hours which I could have spent kicking ass were spent being mocked by this poorly coded and dysfunctional app now hogging the resources of my most intimate personal companion.</p>
<p>You can see this level of hurt and frustration in the blog comments above. The relationship O&#8217;Reilly has with its customers is special. You help us kick ass each and every day. When we want to learn something we go to you and you teach us through your books, your blogs, and your magazines. We&#8217;re the ones who download IT Conversations podcast and scan through the playlist deleting Dr. Moira Gunn in order to move Tim O&#8217;Reilly higher up in the playlist. When we daydream about being rock stars, we don&#8217;t think about which model of Fender we&#8217;ll play, we think about which animal the editors will pick to go on the cover of our book. And we hope to god they don&#8217;t pick some overly cuddly critter or a 3 toed sloth. We want to be like Randal Schwartz and have our book known simply by the animal on the cover.</p>
<p>CJ, you&#8217;re an ass kicker too. You graduated from the Navel [sic] Academy, for crying out loud. You&#8217;re a trail blazer and the Safari to Go app is a trailblazer. But I (and many others) think this project has lost its way. It seems the trail you tried to blaze was creating a multi-platform reader. Please allow me to be so bold as to suggest this is not the right goal. A better goal is to thrill your rock star fans with the best possible mobile off line Safari reading experience that helps them kick serious ass.  You&#8217;ve got some hard choices to make about your vendor, your technology stack, and your implementation strategy. They are hard choices. But hard choices are the cost of being a trailblazer. If it was easy, someone else would have already done it.</p>
<p>I believe that the Safari mobile initiative could revolutionize not only technical books, but also text books. But the core of the platform has to be solid. Not only is the current core not solid, it&#8217;s unusable. But I know you can fix it. I&#8217;m glad Safari Books Online has you at the helm of their ship. Fix this thing, CJ, so we can all be rock stars with you. We&#8217;re mad because were disappointed. But we want so much to be thrilled.</p>
<p>-JD Long<br />
@CMastication</p></blockquote>
<p>If the couch reference is not entirely obvious, then you should brush up on your Dave Chappelle:</p>
<p><object width="640" height="385"><param name="movie" value="http://www.youtube.com/v/dUb06iLjTKA?fs=1&amp;hl=en_US"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/dUb06iLjTKA?fs=1&amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/11/the-oreilly-safari-books-online-app-broke-my-heart/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Connecting to SQL Server from R using RJDBC</title>
		<link>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/</link>
		<comments>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/#comments</comments>
		<pubDate>Wed, 22 Sep 2010 18:00:26 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[sql server]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=891</guid>
		<description><![CDATA[A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sql_server_2008_logo.png"><img class="alignleft size-medium wp-image-901" style="border: 2px solid black; margin: 3px;" title="sql_server_2008_logo" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sql_server_2008_logo-300x187.png" alt="" width="235" height="146" /></a>A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days and never was able to connect from R on Ubuntu to my corp SQL Server.</p>
<p><a href="http://www.fosstrading.com/" onclick="pageTracker._trackPageview('/outgoing/www.fosstrading.com/?referer=');">Joshua Ulrich</a> was kind enough to help me out by pointing me to <a href="http://www.rforge.net/RJDBC/" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/RJDBC/?referer=');">RJDBC</a> which scared me a little (I&#8217;m easily spooked) because it involves Java. The only thing I know about Java is every time I touch it I <a href="http://stackoverflow.com/questions/3311940/r-rjava-package-install-failing" target="_blank" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/3311940/r-rjava-package-install-failing?referer=');">spend days trying to get environment variables</a> loaded just exactly the way it wants them. But Josh assured me that it was really not that hard. Here&#8217;s the short version:</p>
<p><a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a737000d-68d0-4531-b65d-da0f2a735707&amp;displaylang=en" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.microsoft.com/downloads/en/details.aspx?FamilyID=a737000d-68d0-4531-b65d-da0f2a735707_amp_displaylang=en&amp;referer=');">Download the JDBC driver from Microsoft</a>. There are Win and *nix versions, so grab whichever you need. Unpack the driver in a known location (I used /etc/sqljdbc_2.0/). Then access the driver from R like so:</p>
<pre>library(RJDBC)
# load the SQL Server JDBC driver from wherever you unpacked it
drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "/etc/sqljdbc_2.0/sqljdbc4.jar")
conn &lt;- dbConnect(drv, "jdbc:sqlserver://serverName", "userID", "password")
# then build a query and run it
sqlText &lt;- "SELECT * FROM myTable"
queryResults &lt;- dbGetQuery(conn, sqlText)
# close the connection when you're done
dbDisconnect(conn)</pre>
<p>I have a few scripts that I want to run on both my Ubuntu laptop and my Windows Server. To accommodate that I made my scripts compatible with both by doing the following to my drv line:</p>
<pre>if (.Platform$OS.type == "unix") {
  drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "/etc/sqljdbc_2.0/sqljdbc4.jar")
} else {
  # the jar path has to sit on a single line inside the quotes
  drv &lt;- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "C:/Program Files/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar")
}</pre>
<p>Obviously if you unpacked your drivers in different locations you&#8217;ll need to molest the code to fit your life situation.</p>
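<p>If you would rather keep the paths in one place, a small helper along these lines might do. It is just a sketch: sqlserver_jar() is a made-up name, not part of RJDBC, and the defaults are simply the two locations used in this post.</p>

```r
# Hypothetical helper: pick the JDBC jar location by operating system.
# The default paths are the ones used in this post; pass your own if yours differ.
sqlserver_jar <- function(os_type = .Platform$OS.type,
                          unix_jar = "/etc/sqljdbc_2.0/sqljdbc4.jar",
                          windows_jar = "C:/Program Files/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar") {
  if (os_type == "unix") unix_jar else windows_jar
}

jar <- sqlserver_jar()
if (!file.exists(jar)) {
  message("JDBC jar not found at ", jar, "; pass the right path to sqlserver_jar()")
}

# then, with RJDBC loaded:
# drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", jar)
```

<p>Checking that the jar exists up front tends to give a friendlier message than whatever Java throws when it can&#8217;t find the driver class.</p>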
<p><span style="color: #ff6600;"><strong>EDIT: </strong>A MUCH better place to put the JDBC drivers in Ubuntu would be the /opt/ path as opposed to /etc/ which I used above. By convention /opt/ is where add-on software that doesn&#8217;t come from the package manager goes, while /etc/ is meant for system configuration files. I&#8217;m not familiar with all the conventions in Ubuntu (or even Linux in general) so I didn&#8217;t realize this until I got some reader feedback. </span></p>
<p>Be forewarned, RJDBC is pretty damn slow and it appears to no longer be in active development. For my use case, RODBC was clearly faster. But RJDBC works for me in Ubuntu and that was my biggest need.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/connecting-to-sql-server-from-r-using-rjdbc/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Principal Component Analysis (PCA) vs Ordinary Least Squares (OLS): A Visual Explanation</title>
		<link>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/</link>
		<comments>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 17:11:27 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=866</guid>
		<description><![CDATA[Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was &#8220;Thanks to my college class I can do the math, but what does it MEAN?&#8221;
I felt like this a number of times in my life. Many of my classes were focused on the technical implementations they kinda [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sa.png"><img class="size-full wp-image-876 alignleft" style="border: 2px solid black; margin: 3px;" title="sa" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/sa.png" alt="" width="299" height="82" /></a>Over at stats.stackexchange.com recently, a <a href="http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/2700#2700" onclick="pageTracker._trackPageview('/outgoing/stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/2700_2700?referer=');">really interesting question was raised</a> about principal component analysis (PCA). The gist was &#8220;Thanks to my college class I can do the math, but what does it <strong>MEAN</strong>?&#8221;</p>
<p>I&#8217;ve felt like this a number of times in my life. Many of my classes were so focused on the technical implementations that they kinda missed the section titled &#8220;Why I give a shit.&#8221; A perfect example was my Mathematics Principles of Economics class which taught me how to manually calculate a bordered Hessian but, for the life of me, I have no idea why I would ever want to calculate such a monster.  OK, that&#8217;s a lie. Later in life I learned that bordered Hessian matrices are a second derivative test used in some optimizations. Not that I would EVER do that shit by hand. I&#8217;d use some R package and blindly trust that it was coded properly.</p>
<p>So back to PCA: as I was reading the aforementioned stats question I was reminded of a recent presentation that <a href="http://quanttrader.info/public/" onclick="pageTracker._trackPageview('/outgoing/quanttrader.info/public/?referer=');">Paul Teetor</a> gave at an August Chicago R User Group meeting. In his presentation on spread trading with R he showed a graphic that illustrated the difference between OLS and PCA. I took some notes and went home and made sure I could recreate the same thing. If you have wondered what makes OLS and PCA different, open up an R session and play along.</p>
<p><strong>Your Independent Variable Matters:</strong></p>
<p>The first observation to make is that regressing x ~ y is not the same as y ~ x even in a simple univariate regression. You can illustrate this by doing the following:</p>
<blockquote><p>set.seed(2)<br />
x &lt;- 1:100</p>
<p># the true line is y = 20 + 3x, plus noisy errors<br />
e &lt;- rnorm(100, 0, 60)<br />
y &lt;- 20 + 3 * x + e</p>
<p>plot(x,y)<br />
# regress y on x (red line)<br />
yx.lm &lt;- lm(y ~ x)<br />
lines(x, predict(yx.lm), col="red")</p>
<p># regress x on y (blue line)<br />
xy.lm &lt;- lm(x ~ y)<br />
lines(predict(xy.lm), y, col="blue")</p></blockquote>
<p>You should get something that looks like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSols.png"><img class="size-medium wp-image-867 alignnone" title="olsVSols" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSols-280x300.png" alt="" width="280" height="300" /></a></p>
<p>So it&#8217;s obvious they give different lines. But why? Well, OLS minimizes the sum of squared errors between the dependent variable and the model&#8217;s predictions, measured along the dependent variable&#8217;s axis. Two of these errors are illustrated for the y ~ x case in the following picture:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS1.png"><img class="alignnone size-medium wp-image-870" title="OLS1" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS1-280x300.png" alt="" width="280" height="300" /></a></p>
<p>But when we flip the model around and regress x ~ y then OLS minimizes these errors:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS2.png"><img class="alignnone size-medium wp-image-871" title="OLS2" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/OLS2-280x300.png" alt="" width="280" height="300" /></a></p>
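<p>If you want numbers to go with the pictures, here is a quick sketch that compares the slopes of the two lines on the same axes. It regenerates the same toy data as above; slope_yx and slope_xy are just names I made up for this example.</p>

```r
set.seed(2)
x <- 1:100
y <- 20 + 3 * x + rnorm(100, 0, 60)

yx.lm <- lm(y ~ x)
xy.lm <- lm(x ~ y)

# slope of each fitted line on the usual plot (x horizontal, y vertical)
slope_yx <- unname(coef(yx.lm)["x"])      # dy/dx straight from the model
slope_xy <- 1 / unname(coef(xy.lm)["y"])  # the model gives dx/dy, so invert it

# the same numbers from first principles
all.equal(slope_yx, cov(x, y) / var(x))
all.equal(slope_xy, var(y) / cov(x, y))

c(y_on_x = slope_yx, x_on_y = slope_xy)
```

<p>The ratio of the two slopes works out to 1/r^2, where r is the correlation between x and y, so the two lines only coincide when the points fall exactly on a line.</p>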
<p>Ok, so what about PCA?</p>
<p>Well let&#8217;s draw the first principal component the old school way:</p>
<blockquote><p>#normalize means and cbind together<br />
xyNorm &lt;- cbind(x=x-mean(x), y=y-mean(y))<br />
plot(xyNorm)</p>
<p>#covariance<br />
xyCov &lt;- cov(xyNorm)<br />
eigenValues &lt;- eigen(xyCov)$values<br />
eigenVectors &lt;- eigen(xyCov)$vectors</p>
<p>plot(xyNorm, ylim=c(-200,200), xlim=c(-200,200))<br />
lines(xyNorm[x], eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x])<br />
lines(xyNorm[x], eigenVectors[2,2]/eigenVectors[1,2] * xyNorm[x])</p>
<p># the largest eigenValue is the first one<br />
# so that&#8217;s our principal component.<br />
# but the principal component is in normalized terms (mean=0)<br />
# and we want it back in real terms like our starting data<br />
# so let&#8217;s denormalize it<br />
plot(xy)<br />
lines(x, (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y))<br />
# that looks right. line through the middle as expected</p>
<p># what if we bring back our other two regressions?<br />
lines(x, predict(yx.lm), col="red")<br />
lines(predict(xy.lm), y, col="blue")</p></blockquote>
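<p>As a sanity check on the old school approach, the hand-rolled eigen decomposition can be compared against R&#8217;s built-in prcomp(). This is only a sketch on simulated data (the seed and parameters are assumptions made for this example):</p>
<blockquote><p># hypothetical data, simulated only for this sketch<br />
set.seed(42)<br />
x &lt;- 1:100<br />
y &lt;- 20 + 3 * x + rnorm(100, mean = 0, sd = 60)<br />
xy &lt;- cbind(x, y)</p>
<p>xyEigen &lt;- eigen(cov(xy)) # cov() centers the data internally<br />
xyPca &lt;- prcomp(xy) # centers by default, does not rescale</p>
<p># the eigenvalues are the variances along each principal component<br />
print(all.equal(xyEigen$values, xyPca$sdev^2))<br />
# the eigenvectors match the rotation matrix up to sign<br />
print(all.equal(abs(xyEigen$vectors), abs(unname(xyPca$rotation))))</p></blockquote>
<p>The absolute values are compared because an eigenvector is only defined up to its sign, so the two routines may legitimately return vectors pointing in opposite directions.</p>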
<p>PCA minimizes the squared error measured orthogonal (perpendicular) to the model line. So the first principal component looks like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/pca.png"><img class="alignnone size-medium wp-image-872" title="pca" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/pca-280x300.png" alt="" width="280" height="300" /></a></p>
<p>The two yellow lines, as in the previous images, are examples of two of the errors which the routine minimizes.</p>
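<p>The perpendicular-error claim can be checked numerically. The sketch below uses simulated data (the seed and parameters are assumptions made for this example) and a hypothetical helper, perpSS(), which sums the squared perpendicular distances from the points to a line through the means with a given slope:</p>
<blockquote><p># hypothetical data, simulated only for this sketch<br />
set.seed(42)<br />
x &lt;- 1:100<br />
y &lt;- 20 + 3 * x + rnorm(100, mean = 0, sd = 60)</p>
<p># sum of squared perpendicular distances from the points to the<br />
# line through (mean(x), mean(y)) with the given slope<br />
perpSS &lt;- function(slope) {<br />
resid &lt;- (y - mean(y)) - slope * (x - mean(x))<br />
sum(resid^2) / (1 + slope^2)<br />
}</p>
<p># slopes of the three lines in the (x, y) plane<br />
slopeYX &lt;- unname(coef(lm(y ~ x))["x"])<br />
slopeXY &lt;- 1 / unname(coef(lm(x ~ y))["y"])<br />
pc1 &lt;- prcomp(cbind(x, y))$rotation[, 1]<br />
slopePCA &lt;- unname(pc1["y"] / pc1["x"])</p>
<p># the PCA line should have the smallest perpendicular error of the three<br />
print(c(yx = perpSS(slopeYX), xy = perpSS(slopeXY), pca = perpSS(slopePCA)))</p></blockquote>
<p>All three lines pass through the point of means, so comparing them by slope alone is enough here.</p>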
<p>So if we plot all three lines on the same scatter plot we can see the differences:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSpca.png"><img class="alignnone size-medium wp-image-873" title="olsVSpca" src="http://www.cerebralmastication.com/wp-content/uploads/2010/09/olsVSpca-280x300.png" alt="" width="280" height="300" /></a></p>
<p>The x ~ y OLS and the first principal component are pretty close, but click on the image to get a better view and you will see they are not exactly the same.</p>
<p>All the code from the above examples can be found in a <a href="http://gist.github.com/582767" onclick="pageTracker._trackPageview('/outgoing/gist.github.com/582767?referer=');">gist over at GitHub.com</a>. It&#8217;s best to copy and paste from GitHub, as WordPress sometimes mangles my quotes and breaks the codez.</p>
<p>The best introduction to PCA which I have read is the one I link to on Stats.StackExchange.com. It&#8217;s titled <a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf" onclick="pageTracker._trackPageview('/outgoing/www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf?referer=');">&#8220;A Tutorial on Principal Components Analysis&#8221; by Lindsay I Smith</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/09/principal-component-analysis-pca-vs-ordinary-least-squares-ols-a-visual-explination/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>

