<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cerebral Mastication &#187; howto</title>
	<atom:link href="http://www.cerebralmastication.com/tag/howto/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Fri, 16 Jul 2010 22:07:12 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>You can Hadoop it! It&#8217;s elastic! Boogie woogie woog-ie!</title>
		<link>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/</link>
		<comments>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 18:31:23 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=592</guid>
		<description><![CDATA[I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng  (肏你娘) which your friend in grad school told you means &#8220;Live happy with many blessings&#8221;. Trust me, I&#8217;ve been hanging with Madam Wu and she told me [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_594" class="wp-caption alignleft" style="width: 271px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/bad_egg.png"><img class="size-full wp-image-594 " style="border: 1px solid black; margin: 3px;" title="I paid an old man in Chinatown $200 for this!" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/bad_egg.png" alt="" width="261" height="144" /></a><p class="wp-caption-text">This blog&#39;s name in Chinese! </p></div>
<p>I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng  (肏你娘) which your friend in grad school told you means &#8220;Live happy with many blessings&#8221;. Trust me, I&#8217;ve been hanging with Madam Wu and she told me it doesn&#8217;t mean that.</p>
<p>So how did I travel to the future to visit with Madam Wu, you ask? Well the short answer is Hadoop. Yeah, the cute little elephant. <a href="http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/">As I have told you before</a>, multicore makes your R code run fast by using worm holes to shoot your results back from the future. Well Hadoop actually takes you to the future on the back of an elephant and you can bring your own results back! I couldn&#8217;t make this up if I tried, so you know it&#8217;s true! And what&#8217;s fantastic about all of this is Hadoop works with R! And Amazon will let you rent a time traveling elephant through their <a href="http://aws.amazon.com/elasticmapreduce/" onclick="pageTracker._trackPageview('/outgoing/aws.amazon.com/elasticmapreduce/?referer=');">Elastic MapReduce service</a>! I think Amazon coined the term &#8220;Time Travel as a Service&#8221; or TTaaS  generally pronounced as &#8220;ta-tas&#8221; in <a href="http://www.savethetatas.com/" onclick="pageTracker._trackPageview('/outgoing/www.savethetatas.com/?referer=');">the industry</a>. If you are a CTO be sure and use this in your next &#8220;vision statement&#8221; pitch so everyone will know you&#8217;re hip to all this cloud stuff.</p>
<p>So you use R and you want to travel into the future on the back of an elephant to visit Madam Wu and get your model results back, don&#8217;t you? Well it&#8217;s a damn good thing you read this blog because I&#8217;m going to give you the keys to the Wu dynasty and a little 福寿 while we&#8217;re at it.</p>
<p>I&#8217;ve never had an original thought in my life so I started with <a href="http://developer.amazonwebservices.com/connect/thread.jspa?messageID=128995&amp;#128995" onclick="pageTracker._trackPageview('/outgoing/developer.amazonwebservices.com/connect/thread.jspa?messageID=128995_amp_128995&amp;referer=');">this discussion </a>over at the AMZN E M/R discussion forum. Peter Skomoroch from <a href="http://www.datawrangling.com/" onclick="pageTracker._trackPageview('/outgoing/www.datawrangling.com/?referer=');">Data Wrangling </a>gives a very good example with all the data and code provided so you can run it yourself.  Pete&#8217;s example really shakes the  yáng guǐzi, as we say in the future. In addition I read the documentation for David Rosenberg&#8217;s <a href="http://docs.google.com/viewer?url=http%3A%2F%2Fcran.r-project.org%2Fweb%2Fpackages%2FHadoopStreaming%2FHadoopStreaming.pdf" onclick="pageTracker._trackPageview('/outgoing/docs.google.com/viewer?url=http_3A_2F_2Fcran.r-project.org_2Fweb_2Fpackages_2FHadoopStreaming_2FHadoopStreaming.pdf&amp;referer=');">HadoopStreaming package</a> which was good for insight, but I didn&#8217;t use the package as it&#8217;s really focused on the &#8216;big data&#8217; problem.</p>
<div id="attachment_639" class="wp-caption alignleft" style="width: 218px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/hadoop-elephant.jpeg"><img class="size-full wp-image-639 " style="border: 1px solid black; margin: 3px;" title="hadoop elephant" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/hadoop-elephant.jpeg" alt="" width="208" height="156" /></a><p class="wp-caption-text">That elephant is so freaking cute! </p></div>
<p>Prior to my foray into time travel, I knew that Hadoop could be used to process big text files and do something like rip out all the links and count them. But I thought that Hadoop was all about processing big data. I never paid attention to the big Hadoop elephant in the room because I don&#8217;t have big data. I have big CPU hogging models (mostly slow because I don&#8217;t code worth a shit). What got me reconsidering my world view was <a onclick="pageTracker._trackPageview('/outgoing/www.johnmyleswhite.com/?referer=');pageTracker._trackPageview('/outgoing/www.johnmyleswhite.com?referer=http%3A%2F%2Fwww.cerebralmastication.com%2F');" rel="external nofollow" href="http://www.johnmyleswhite.com/">John Myles White</a>&#8217;s comment on my <a href="http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/">multicore post </a>earlier. John encouraged me to look into running my simulations on AMZN&#8217;s E M/R service using Hadoop streaming. So instead of giving Hadoop a big fat text file to parse, I just gave it a text file with 10,000 rows, each containing a single integer from 1 to 10,000. Then I refactored my R code to read a line from stdin, trim it down to just the integer, and then go run the simulation with that number. When done I had it serialize the resulting model output and write that to stdout. Hadoop takes care of chopping up the input and pulling together the output.</p>
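<p>To make that concrete, here is a minimal sketch of what such a mapper could look like. It is not my production code: runSimulation() is a made-up stand-in for the real model, mapLines() is a hypothetical helper name, and I&#8217;m assuming a plain &#8220;key, tab, value&#8221; output line is good enough for the example.</p>

```r
# Hypothetical stand-in for the real (slow) simulation; not from the original post.
runSimulation <- function(drawNumber) {
  set.seed(drawNumber)
  mean(rnorm(100))
}

# Read one line at a time from `con`, trim it down to an integer, run one
# simulation per line, and write the draw number, a tab, then the result.
mapLines <- function(con, out = stdout()) {
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    drawNumber <- as.integer(gsub("[^0-9]", "", line))
    if (is.na(drawNumber)) next  # skip lines that hold no integer
    result <- runSimulation(drawNumber)
    cat(drawNumber, "\t", format(result, digits = 15), "\n", sep = "", file = out)
  }
  invisible(NULL)
}

# In a real mapper script the input would be file("stdin"); a text
# connection is used here so the sketch runs anywhere.
input <- textConnection(c("1", " 2 ", "3"))
mapLines(input)
close(input)
```

<p>Swapping the text connection for file("stdin") and adding an Rscript shebang line is, as far as I can tell, all it takes to turn this into something Hadoop streaming can call.</p>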
<p>I learned a few &#8220;gotchas&#8221; or, as we say in the future: 臭婊子 (I think that should be plural). I&#8217;ll do a whole blog post on gotchas soon, but here are the bullet points:</p>
<ul>
<li>AMZN is currently running the version of Debian Linux named Lenny which has version 2.7.1 of R installed. No matter what the documentation says, don&#8217;t let Lenny tend to the rabbits.</li>
<li>Test all code by firing up an interactive Pig instance and logging in as &#8216;hadoop&#8217;. Instead of running Pig, run R and test your code. And as it says in the FAQ: &#8220;The Pig don&#8217;t care either way. &#8221; Which, despite sounding like buggery, is the truth.</li>
<li>If your code runs inside of R on a Hadoop instance, drop back to the command line on the Hadoop instance and run &#8216;cat infile.txt | yourMapper.R | sort | yourReducer.R &gt; outfile.txt&#8217;. This pipes your input file into your mapper file which does its thing and then pipes the results to your reducer file which then &#8220;pumps up the jam&#8221; into an output file. What you see in the outfile.txt is what Hadoop will produce. So if you don&#8217;t like what you see, you better do some more coding.</li>
<li>You CAN load packages into R in a Hadoop instance running in AMZN E M/R. There are a few caveats, of course:</li>
</ul>
<ol>
<li>Your package has to work in R 2.7.1 (until AMZN upgrades to the next stable version of Debian).</li>
<li>As far as I can tell, all the output has to come out of stdout. So if you want to end up with R objects which you use for other things, you should get comfortable with the serialize() command and reading text files back into R. Which, as you can see <a href="http://stackoverflow.com/questions/2258511/r-serialize-objects-to-text-file-and-back-again" onclick="pageTracker._trackPageview('/outgoing/stackoverflow.com/questions/2258511/r-serialize-objects-to-text-file-and-back-again?referer=');">from this question</a>, I am not yet comfortable with.</li>
<li>There will be multiple instances of R running on every machine. So if they are all trying to download a package to the same directory, you are going to get file lock errors. One solution is to have each R instance create a directory for packages that includes the PID of that R instance. That way there&#8217;s no possibility for a conflict! Here&#8217;s an example of how I load the Hmisc package:<br />
<script src="http://gist.github.com/304262.js?file=AMZNloadPackage.R"></script></li>
</ol>
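<p>On the stdout caveat: one way to squeeze an arbitrary R object onto a single line of text is to serialize it to raw bytes and print the bytes as hex. This is only a sketch under my own assumptions. The hex encoding and the helper names toTextLine() and fromTextLine() are mine for illustration, not something E M/R requires, and I have only reasoned about it on a current R, not on 2.7.1.</p>

```r
# Turn any R object into one line of hex text (no tabs or newlines inside).
toTextLine <- function(obj) {
  bytes <- serialize(obj, connection = NULL)
  paste(as.character(bytes), collapse = "")
}

# Reverse the trip: split the hex text into byte pairs and unserialize.
fromTextLine <- function(txt) {
  starts <- seq(1, nchar(txt), by = 2)
  bytes <- as.raw(strtoi(substring(txt, starts, starts + 1), base = 16L))
  unserialize(bytes)
}

# Round-trip a small made-up result object.
modelResult <- list(draw = 7L, loss = c(1.5, 2.25), label = "demo")
encoded <- toTextLine(modelResult)
decoded <- fromTextLine(encoded)
print(identical(decoded, modelResult))
```

<p>A mapper would cat() the encoded line to stdout, and back on the desktop you would read the output file with readLines() and run each line through fromTextLine().</p>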
<ul>
<li>You&#8217;ll probably want to provide some data to R. This is done by uploading your files to S3 and then passing the &#8220;-cacheFile&#8221; option to Hadoop. To get the plyr package to load in R 2.7.1 I had to edit the package. I then uploaded the altered package thusly:</li>
</ul>
<blockquote><p>-cacheFile s3n://rdata/plyr_0.1.9.tar.gz#plyr_0.1.9.tar.gz</p></blockquote>
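<p>The &#8216;cat | mapper | sort | reducer&#8217; test from the bullet list can also be mimicked inside a single R session, which is handy while you are still debugging logic. A rough sketch follows; exampleMapper(), exampleReducer() and simulateStreaming() are hypothetical names I made up, and the sort step only approximates what Hadoop does between the map and reduce phases.</p>

```r
# Hypothetical mapper: one "key, tab, value" line per input line.
exampleMapper <- function(line) {
  n <- as.integer(line)
  paste(n %% 2, n * n, sep = "\t")
}

# Hypothetical reducer: sum the values for each key in the sorted lines.
exampleReducer <- function(sortedLines) {
  parts <- strsplit(sortedLines, "\t", fixed = TRUE)
  keys <- vapply(parts, function(p) p[1], character(1))
  values <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))
  totals <- tapply(values, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

# Mimic the shell pipeline: map every line, sort, then reduce.
simulateStreaming <- function(inputLines, mapper, reducer) {
  mapped <- vapply(inputLines, mapper, character(1), USE.NAMES = FALSE)
  reducer(sort(mapped))
}

result <- simulateStreaming(as.character(1:4), exampleMapper, exampleReducer)
print(result)
```

<p>It is no substitute for the real command line test on the Hadoop instance, but it catches the dumb mistakes before you pay for the elephant.</p>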
<p>More to come later. I&#8217;ve gotta get back to the future.</p>
<div id="attachment_631" class="wp-caption alignleft" style="width: 314px"><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/christopher_lloyd.jpg"><img class="size-full wp-image-631" style="border: 1px solid black; margin: 3px;" title="christopher_lloyd" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/christopher_lloyd.jpg" alt="" width="304" height="224" /></a><p class="wp-caption-text">You hold the elephant and I&#39;ll plug this in. </p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/you-can-hadoop-it-its-elastic-boogie-woogie-woog-ie/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using the R multicore package in Linux with wild and passionate abandon</title>
		<link>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/</link>
		<comments>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 19:57:20 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=562</guid>
		<description><![CDATA[One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It&#8217;s not uncommon for each of my simulations to take 20 seconds or more to complete (if you&#8217;re doing the math, that&#8217;s 55 hours for 10K sims or, approximately 453 games of solitaire) . Initially I ran [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/amd_mc_processing.jpg"><img class="alignleft size-full wp-image-586" style="border: 0pt none; margin: 20px;" title="amd_mc_processing" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/amd_mc_processing.jpg" alt="" width="214" height="193" /></a>One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It&#8217;s not uncommon for each of my simulations to take 20 seconds or more to complete (if you&#8217;re doing the math, that&#8217;s 55 hours for 10K sims or, approximately 453 games of solitaire). Initially I ran my sims in R running on an <a href="http://www.virtualbox.org/" onclick="pageTracker._trackPageview('/outgoing/www.virtualbox.org/?referer=');">Oracle VirtualBox </a>(Oracle now owns VirtualBox! *gasp*) running Ubuntu. Lately I&#8217;ve moved to running my sims on EC2 machines. I&#8217;m not yet doing Rmpi clustering, although that is on my roadmap. Currently I just fire up a couple of 8 core instances and run 5K sims on each one then FTP the results back to my desktop. It&#8217;s not very sexy, but it gets the job done&#8230; I guess the same could be said of me, except substitute &#8220;makes slurping sounds eating udon&#8221; in the place of &#8220;gets the job done.&#8221;</p>
<p>When running processor intensive crap (that&#8217;s a stochastic modeling term) the single threaded nature of R is painful. In Linux or Mac (i.e. NOT Windows) the <a href="http://www.rforge.net/doc/packages/multicore/multicore.html" onclick="pageTracker._trackPageview('/outgoing/www.rforge.net/doc/packages/multicore/multicore.html?referer=');">multicore package </a>is a real godsend. I did a quick code review and, from what I can tell, multicore exploits worm holes to travel back in time and reports your results in a fraction of the time you would expect it to take. Seriously. I expect that as the code matures my computer will fill up with simulation results from simulations which I have not even coded yet. It&#8217;s almost like magic, except without the rabbit and hat.</p>
<p>The crux of the package is a parallel-ized version of lapply() called mclapply(). I believe the mc stands for &#8216;magic carpet&#8217; and is an allusion to the worm hole technology. So how does one harness this package for <span style="text-decoration: line-through;">nefarious self interest </span>doing parallel operations in R? The ultra short answer is: write your R code so that the most processor intensive bit is done with an lapply() function. Then replace the lapply() with mclapply().  Of course you have to load the multicore package before you run it. But that&#8217;s basically it.</p>
<p>How I implement mclapply() is thusly: I build a table with all my random draws for my simulations. So if I have 20 variables and want to run 10,000 simulations then I&#8217;ll build a data frame with all 200,000 values (generally 10K rows and 21 columns for 20 variables + an index). The index keeps track of the draw number. Then I have code that performs the &#8216;valuation&#8217; based on a single observation of the 20 variables. I wrap the valuation step in a function and then call the valuation process 10,000 times with mclapply(). So it might look something like this:</p>
<blockquote><p>myOutput &lt;- mclapply( drawList, function(x) valuationReturns(drawNumber=x))</p></blockquote>
<p>The drawList object is simply a list of the possible indexes (i.e. 1:10000). When the code has iterated over each value from drawList the results will be in the myOutput object. Tada!</p>
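<p>Here is a slightly fuller sketch of that workflow that you can paste and run. Two assumptions to flag: I&#8217;m loading the parallel package, which now ships with R and carries mclapply(), rather than the original multicore package, and the valuationReturns() body is a toy (it just adds up the draws) standing in for a real valuation. The sizes are shrunk so it finishes quickly.</p>

```r
library(parallel)  # ships with R and provides mclapply()

set.seed(42)
nSims <- 100  # 10,000 in the post; kept small here
nVars <- 20

# One row per simulation: an index column plus 20 columns of random draws.
draws <- data.frame(index = seq_len(nSims),
                    matrix(runif(nSims * nVars), nrow = nSims))

# Toy valuation: look up one row of draws by index and add them up.
valuationReturns <- function(drawNumber) {
  sum(unlist(draws[drawNumber, -1]))
}

drawList <- as.list(draws$index)

# Forking is not available on Windows, so fall back to a single core there.
nCores <- if (.Platform$OS.type == "windows") 1L else 2L

myOutput <- mclapply(drawList,
                     function(x) valuationReturns(drawNumber = x),
                     mc.cores = nCores)
print(length(myOutput))
```

<p>myOutput comes back as a list in the same order as drawList, so unlist(myOutput) lines up with the index column of the data frame.</p>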
<p>I recommend the <a href="http://htop.sourceforge.net/" onclick="pageTracker._trackPageview('/outgoing/htop.sourceforge.net/?referer=');">htop program </a>for tracking what&#8217;s going on with processor utilization in Linux (I presume Mac too if you ask Steve Jobs nicely). If everything is cranking well, and you have 8 cores, you might see an image that looks something like this:</p>
<p><a href="http://www.cerebralmastication.com/wp-content/uploads/2010/02/r-on-ec21.png"><img class="size-full wp-image-564 alignnone" title="r on ec2" src="http://www.cerebralmastication.com/wp-content/uploads/2010/02/r-on-ec21.png" alt="" width="535" height="400" /></a></p>
<p>I don&#8217;t understand time travel, but I&#8217;ve found that I have better luck if I set mc.preschedule=FALSE. Apparently prescheduled magic carpets are finicky. If I leave mc.preschedule at the default of TRUE then I find that often some of my cores go underutilized.</p>
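<p>My reading of the mclapply() help page is that with mc.preschedule=TRUE the inputs are divided up front into as many chunks as there are cores, while FALSE forks one job per input value. If a few tasks are much slower than the rest, the up-front split can leave a core idle. The sketch below fakes that with Sys.sleep(); the task durations are invented, it uses the parallel package rather than multicore, and the timings you see will vary by machine.</p>

```r
library(parallel)

# Made-up task durations in seconds: a couple of slow tasks among quick ones.
taskSeconds <- c(0.2, 0.01, 0.01, 0.01, 0.2, 0.01, 0.01, 0.01)
slowTask <- function(s) {
  Sys.sleep(s)
  s
}

# Forking is not available on Windows, so fall back to a single core there.
nCores <- if (.Platform$OS.type == "windows") 1L else 2L

# Time one full run for a given prescheduling setting.
timeRun <- function(preschedule) {
  timing <- system.time(
    mclapply(taskSeconds, slowTask,
             mc.cores = nCores, mc.preschedule = preschedule)
  )
  unname(timing["elapsed"])
}

cat("prescheduled:    ", timeRun(TRUE), "seconds\n")
cat("not prescheduled:", timeRun(FALSE), "seconds\n")
```

<p>The trade-off is that forking a job per value has its own overhead, so for thousands of very short tasks the default may well win. My sims run 20 seconds each, which is probably why FALSE works out better for me.</p>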
<p>Let me know if you have other multicore tips and tricks.</p>
<p>If you want to give me shit for running my simulations as root, feel free. I&#8217;m impervious to your &#8220;best practices&#8221; mumbo jumbo. La la la la la la!! Not listening!</p>
<p>Special thanks to <a href="http://www.cis.udel.edu/~cavazos/index.php?page=multicore-programming" onclick="pageTracker._trackPageview('/outgoing/www.cis.udel.edu/_cavazos/index.php?page=multicore-programming&amp;referer=');">John Cavazos over at the University of Delaware</a> from whom I stole the MC for Dummies image. John, you&#8217;re a gentleman and a humble scholar. Damn few of us left.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2010/02/using-the-r-multicore-package-in-linux-with-wild-and-passionate-abandon/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Not Just Normal&#8230; Gaussian</title>
		<link>http://www.cerebralmastication.com/2009/06/not-just-normal-gaussian/</link>
		<comments>http://www.cerebralmastication.com/2009/06/not-just-normal-gaussian/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 16:26:46 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=296</guid>
		<description><![CDATA[Dave, over at The Revolutions Blog, posted about the big ol&#8217; list of graphs created with R that are over at Wikimedia Commons. As I was scrolling through the list I recognized the standard normal distribution from the Wikipedia article on the same topic.
Below is the fairly simple source code with lots of comments. Here&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<div class="wp-caption alignleft" style="width: 198px"><a href="http://en.wikipedia.org/wiki/Normal_distribution#Standard_deviation_and_confidence_intervals" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Normal_distribution_Standard_deviation_and_confidence_intervals?referer=');"><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/325px-Standard_deviation_diagram.svg.png" alt="Pretty Normal" width="188" height="94" /></a><p class="wp-caption-text">Pretty Normal</p></div>
<p>Dave, over at The Revolutions Blog,<a href="http://blog.revolution-computing.com/2009/06/graphs-created-with-r-on-wikimedia-commons.html" onclick="pageTracker._trackPageview('/outgoing/blog.revolution-computing.com/2009/06/graphs-created-with-r-on-wikimedia-commons.html?referer=');"> posted about the big ol&#8217; list of graphs</a> created with R that are over at <a href="http://commons.wikimedia.org/wiki/Category:Created_with_R" onclick="pageTracker._trackPageview('/outgoing/commons.wikimedia.org/wiki/Category_Created_with_R?referer=');">Wikimedia Commons</a>. As I was scrolling through the list I recognized the standard normal distribution from the <a href="http://en.wikipedia.org/wiki/Normal_distribution#Standard_deviation_and_confidence_intervals" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Normal_distribution_Standard_deviation_and_confidence_intervals?referer=');">Wikipedia article on the same topic</a>.</p>
<p>Below is the fairly simple source code with lots of comments. <a href="http://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg" onclick="pageTracker._trackPageview('/outgoing/commons.wikimedia.org/wiki/File_Standard_deviation_diagram.svg?referer=');">Here&#8217;s the source</a>. Run it at home&#8230; for fun and profit.</p>
<blockquote>
<pre># # External package to generate four shades of blue
# library(RColorBrewer)
# cols &lt;- rev(brewer.pal(4, "Blues"))
cols &lt;- c("#2171B5", "#6BAED6", "#BDD7E7", "#EFF3FF")

# Sequence between -4 and 4 with 0.1 steps
x &lt;- seq(-4, 4, 0.1)

# Plot an empty chart with tight axis boundaries, and axis lines on bottom and left
plot(x, type="n", xaxs="i", yaxs="i", xlim=c(-4, 4), ylim=c(0, 0.4),
     bty="l", xaxt="n", xlab="x-value", ylab="probability density")

# Function to plot each coloured portion of the curve, between "a" and "b" as a
# polygon; the function "dnorm" is the normal probability density function
polysection &lt;- function(a, b, col, n=11){
    dx &lt;- seq(a, b, length.out=n)
    polygon(c(a, dx, b), c(0, dnorm(dx), 0), col=col, border=NA)
    # draw a white vertical line on "inside" side to separate each section
    segments(a, 0, a, dnorm(a), col="white")
}

# Build the four left and right portions of this bell curve
for(i in 0:3){
    polysection(   i, i+1,  col=cols[i+1]) # Right side of 0
    polysection(-i-1,  -i,  col=cols[i+1]) # Left side of 0
}

# Black outline of bell curve
lines(x, dnorm(x))

# Bottom axis values, where sigma represents standard deviation and mu is the mean
axis(1, at=-3:3, labels=expression(-3*sigma, -2*sigma, -1*sigma, mu,
                                    1*sigma,  2*sigma,  3*sigma))

# Add percent densities to each division, between x and x+1
pd &lt;- sprintf("%.1f%%", 100*(pnorm(1:4) - pnorm(0:3)))
text(c((0:3)+0.5,(0:-3)-0.5), c(0.16, 0.05, 0.04, 0.02), pd, col=c("white","white","black","black"))
segments(c(-2.5, -3.5, 2.5, 3.5), dnorm(c(2.5, 3.5)), c(-2.5, -3.5, 2.5, 3.5), c(0.03, 0.01))</pre>
</blockquote>
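<p>The percent labels on the chart come from differences of the normal CDF, pnorm(). If you want to check the arithmetic without drawing anything, a small standalone sketch looks like this (the variable names lower, upper and bandShare are mine, not from the Wikimedia source):</p>

```r
# Share of a standard normal that falls in each one-sigma band right of the mean.
lower <- 0:3
upper <- 1:4
bandShare <- pnorm(upper) - pnorm(lower)

# Same formatting as the chart labels: one decimal place plus a percent sign.
pd <- sprintf("%.1f%%", 100 * bandShare)
print(pd)

# Doubling the running total gives the coverage within 1, 2, 3 and 4 sigma
# of the mean, counting both sides of the curve.
withinSigma <- 2 * cumsum(bandShare)
print(round(withinSigma, 4))
```

<p>The first band works out to about 34.1%, which is where the familiar &#8220;68% within one standard deviation&#8221; comes from once you count both sides.</p>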
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/06/not-just-normal-gaussian/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Box plot vs. Violin plot in R</title>
		<link>http://www.cerebralmastication.com/2009/02/box-plot-vs-violin-plot-in-r/</link>
		<comments>http://www.cerebralmastication.com/2009/02/box-plot-vs-violin-plot-in-r/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 19:50:15 +0000</pubDate>
		<dc:creator>JD Long</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=156</guid>
		<description><![CDATA[So Andrew Gelman hates box plots. Not that you should give a buck what Gelman thinks. I&#8217;m just setting this blog post up, OK. So stick with me. Gelman also thought this XKCD cartoon was NOT funny :

There&#8217;s some correlation as well as causation. I could be wrong, but I suspect that the reason Gelman [...]]]></description>
			<content:encoded><![CDATA[<p>So Andrew Gelman hates box plots. Not that you should give a buck what Gelman thinks. I&#8217;m just setting this blog post up, OK. So stick with me. Gelman also thought this XKCD cartoon <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2009/02/cartoon.html" onclick="pageTracker._trackPageview('/outgoing/www.stat.columbia.edu/_cook/movabletype/archives/2009/02/cartoon.html?referer=');">was NOT funny</a> :</p>
<p><a href="http://xkcd.com/539/" onclick="pageTracker._trackPageview('/outgoing/xkcd.com/539/?referer=');"><img class="alignnone" src="http://imgs.xkcd.com/comics/boyfriend.png" alt="" width="636" height="189" /></a></p>
<p>There&#8217;s some correlation as well as causation. I could be wrong, but I suspect that the reason Gelman does not like the XKCD cartoon is because he&#8217;s very literal, as geeks can be. Trust me, my wife is married to a geek. It probably also has something to do with how much Gelman hates box plots. He hates them so much that he is holding a contest to see if anyone can prove to him that a box plot is an <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2009/02/boxplot-challen.html" onclick="pageTracker._trackPageview('/outgoing/www.stat.columbia.edu/_cook/movabletype/archives/2009/02/boxplot-challen.html?referer=');">appropriate way to show something</a>. I don&#8217;t think I can persuade him that box plots are &#8216;appropriate&#8217; as that sounds like a matter of taste&#8230; like when I break wind at the breakfast table and my wife says, &#8216;that&#8217;s not appropriate.&#8217; However I can demonstrate the ease with which one can make box plots, and my preferred violin plots, using R.</p>
<p>So stick with me and I&#8217;ll give you some free code to take home and try!<span id="more-156"></span></p>
<p>Here&#8217;s the type of thing you see a lot with box plots:</p>
<blockquote><p>x &lt;- rnorm(200)<br />
y &lt;- rlnorm(200)<br />
plot(x, y, xlim=c(-5,5), ylim=c(-2,8))<br />
boxplot(x, col="gold", horizontal=TRUE, at=-1, add=TRUE, lty=2)<br />
boxplot(y, col="blue", horizontal=FALSE, at=-4, add=TRUE, lty=2)</p></blockquote>
<p>That produces output like this:</p>
<div id="attachment_163" class="wp-caption alignnone" style="width: 310px"><img class="size-medium wp-image-163" title="tdbox1" src="http://www.cerebralmastication.com/wp-content/uploads/2009/02/tdbox1-300x282.png" alt="Box Plot" width="300" height="282" /><p class="wp-caption-text">Box Plot</p></div>
<p>That&#8217;s kinda cute. You can see the log normal shape of y and the normal shape of x.  An alternative would be to use a violin chart using the following syntax:</p>
<blockquote><p>library(vioplot)<br />
plot(x, y, xlim=c(-5,5), ylim=c(-2,8))<br />
vioplot(x, col="gold", horizontal=TRUE, at=-1, add=TRUE, lty=2, rectCol="gray")<br />
vioplot(y, col="blue", horizontal=FALSE, at=-4, add=TRUE, lty=2)</p></blockquote>
<p>I kept the values for X and Y the same, but the new plot looks like this:</p>
<div id="attachment_164" class="wp-caption alignnone" style="width: 310px"><img class="size-medium wp-image-164" title="vdbox1" src="http://www.cerebralmastication.com/wp-content/uploads/2009/02/vdbox1-300x282.png" alt="Violin Plot" width="300" height="282" /><p class="wp-caption-text">Violin Plot</p></div>
<p>I like that a little better. The violin plot captures the shape of the probability density function (PDF). But in both of these examples we would probably be just as well off if we simply plotted the PDF instead of either the violin plot or the box plot. So they aren&#8217;t really adding anything. So is Gelman right, the box/violin plot is useless? Here&#8217;s what I think it IS good for:</p>
<p><img class="alignnone size-full wp-image-159" title="vioplot" src="http://www.cerebralmastication.com/wp-content/uploads/2009/02/vioplot.png" alt="vioplot" width="451" height="425" /></p>
<p>These are plots of state crop yields in terms of deviation from an expected trend. So 0 on the Y axis means no deviation from trend and 2 is 200% better than trend and you just can&#8217;t do any worse than 100% below trend (that&#8217;s -1 on the Y axis). I look at this type of stuff all the time, and a box or violin chart is really nice because I can lay out a bunch of states along the X axis and look at how they compare. It&#8217;s easy to compare all the moments of the distributions visually. We can easily see that most states have a mean around 0, but Kansas has MUCH more dispersion as well as a lot of skew. And since you are wondering, no, I didn&#8217;t force the symmetry you see in the graph, it just turned out that way. Luck put Kansas in the middle and luck gave MO and IL the same relative tail. *shrug* sometimes this stuff just looks good. Kinda like me!</p>
<p>Here&#8217;s the same info as above but in a box plot:</p>
<p><img class="alignnone size-full wp-image-165" title="boxplot" src="http://www.cerebralmastication.com/wp-content/uploads/2009/02/boxplot.png" alt="boxplot" width="453" height="427" /></p>
<p>I think the violin plot is more elegant. The box plot is still OK, albeit rather noisy.</p>
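<p>I can&#8217;t hand out the yield data, but here is a rough sketch of the same side-by-side idea using simulated numbers. The three groups A, B and C and their distributions are invented for illustration, and the violin half only runs if you happen to have the vioplot package installed.</p>

```r
set.seed(1)

# Simulated deviations from trend for three made-up groups (not real yield data).
yields <- list(
  A = rnorm(200, mean = 0, sd = 0.1),
  B = rnorm(200, mean = 0, sd = 0.3),
  C = rlnorm(200, meanlog = 0, sdlog = 0.5) - exp(0.125)  # skewed, mean near 0
)

# A deviation cannot fall more than 100% below trend, so floor at -1.
yields <- lapply(yields, function(v) pmax(v, -1))

# Box plots laid out along the X axis, one per group.
boxplot(yields, col = "gold", ylab = "deviation from trend")

# Same comparison as violins, if the vioplot package is available.
if (requireNamespace("vioplot", quietly = TRUE)) {
  vioplot::vioplot(yields$A, yields$B, yields$C,
                   names = names(yields), col = "gold")
}
```

<p>Even with fake data you can see the point: the wide, skewed group jumps out next to the tight one without reading a single number.</p>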
<p>So that&#8217;s all the free code you get from me today. Try not to spend it all on candy this time, OK?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cerebralmastication.com/2009/02/box-plot-vs-violin-plot-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
