<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Bootstrapping the latest R into Amazon Elastic Map Reduce</title>
	<atom:link href="http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/</link>
	<description>Something to Chew On</description>
	<lastBuildDate>Wed, 07 Dec 2011 13:07:56 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: JD Long</title>
		<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/comment-page-1/#comment-27766</link>
		<dc:creator>JD Long</dc:creator>
		<pubDate>Thu, 07 Apr 2011 16:00:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=736#comment-27766</guid>
		<description>Aliona, I can think of about 3 ways to do what you are talking about. 

1) Create your own AMI with all needed tools and use that for all nodes. I&#039;ve been thinking about building &amp; maintaining an R/Ubuntu AMI but I&#039;ve just not made the time to do it. 

2) Use Chef or Puppet (sys admin tools) to fire up the cluster and do the configuration. This is kind of a pain since you&#039;d have to learn an admin tool. But if you want a script for loading the latest R and related packages on Debian (may work on Ubuntu too), I have one here: http://code.google.com/p/segue/source/browse/inst/bootstrapLatestR.sh

3) If your R computational problem can be structured as an lapply() across a list, you might be interested in my Segue package: http://code.google.com/p/segue/

If you end up trying out Segue feel free to post questions/comments on our discussion list: http://groups.google.com/group/segue-r

Segue is currently in alpha. It&#039;s under active development, but I&#039;m using it on a regular basis for real-life work.

-JD</description>
		<content:encoded><![CDATA[<p>Aliona, I can think of about 3 ways to do what you are talking about. </p>
<p>1) Create your own AMI with all needed tools and use that for all nodes. I&#8217;ve been thinking about building &#038; maintaining an R/Ubuntu AMI but I&#8217;ve just not made the time to do it. </p>
<p>2) Use Chef or Puppet (sys admin tools) to fire up the cluster and do the configuration. This is kind of a pain since you&#8217;d have to learn an admin tool. But if you want a script for loading the latest R and related packages on Debian (may work on Ubuntu too), I have one here: <a href="http://code.google.com/p/segue/source/browse/inst/bootstrapLatestR.sh" rel="nofollow" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/segue/source/browse/inst/bootstrapLatestR.sh?referer=');">http://code.google.com/p/segue/source/browse/inst/bootstrapLatestR.sh</a></p>
<p>3) If your R computational problem can be structured as an lapply() across a list, you might be interested in my Segue package: <a href="http://code.google.com/p/segue/" rel="nofollow" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/segue/?referer=');">http://code.google.com/p/segue/</a></p>
<p>If you end up trying out Segue feel free to post questions/comments on our discussion list: <a href="http://groups.google.com/group/segue-r" rel="nofollow" onclick="pageTracker._trackPageview('/outgoing/groups.google.com/group/segue-r?referer=');">http://groups.google.com/group/segue-r</a></p>
<p>Segue is currently in alpha. It&#8217;s under active development, but I&#8217;m using it on a regular basis for real-life work.</p>
<p>-JD</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aliona</title>
		<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/comment-page-1/#comment-27765</link>
		<dc:creator>Aliona</dc:creator>
		<pubDate>Thu, 07 Apr 2011 15:47:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=736#comment-27765</guid>
		<description>Hi JD, 
I&#039;ve been familiarizing myself with Amazon&#039;s web services lately. Your presentation and posts are extremely helpful. Thank you!

To be honest, I am pretty new to AWS. I have played with S3/EMR and EC2 separately, but I lack an understanding of how they interact. If you have any resources on that, I would greatly appreciate those as well.
 
My main question, though, is related to the one you are describing here. 
I need to run an R computation on 20 instances. The issue is: it requires some latest packages and R.version &gt; 2.11. 
Your bootstrapping suggestion is great. Unfortunately, though, with 20 instances it means that each one will go separately to CRAN to update R and install all packages. As you can imagine, this could be very time-consuming, not to mention the pressure on bandwidth...
So, I am wondering if there is a way to set up an EC2 instance or image with the required R version and packages, and direct the jobflow to use that instance/image as a master node?

Thanks,
Aliona</description>
		<content:encoded><![CDATA[<p>Hi JD,<br />
I&#8217;ve been familiarizing myself with Amazon&#8217;s web services lately. Your presentation and posts are extremely helpful. Thank you!</p>
<p>To be honest, I am pretty new to AWS. I have played with S3/EMR and EC2 separately, but I lack an understanding of how they interact. If you have any resources on that, I would greatly appreciate those as well.</p>
<p>My main question, though, is related to the one you are describing here.<br />
I need to run an R computation on 20 instances. The issue is: it requires some latest packages and R.version &gt; 2.11.<br />
Your bootstrapping suggestion is great. Unfortunately, though, with 20 instances it means that each one will go separately to CRAN to update R and install all packages. As you can imagine, this could be very time-consuming, not to mention the pressure on bandwidth&#8230;<br />
So, I am wondering if there is a way to set up an EC2 instance or image with the required R version and packages, and direct the jobflow to use that instance/image as a master node?</p>
<p>Thanks,<br />
Aliona</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Ramey</title>
		<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/comment-page-1/#comment-3151</link>
		<dc:creator>John Ramey</dc:creator>
		<pubDate>Thu, 15 Jul 2010 20:36:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=736#comment-3151</guid>
		<description>Thanks for the feedback. I certainly have a problem that is &quot;embarrassingly parallel,&quot; and my serial implementation of it is too slow for practical purposes. It&#039;s for my dissertation, so I don&#039;t have an immediate need to fix it, but it&#039;s hard for me to play around with new ideas when each iteration is very time-consuming (much like you described with your 40,000 sims example).

Hadley Wickham informed me last night that &quot;plyr will be parallel by end of summer.&quot; He turned me onto that library when I met him at a conference in Mexico a few months ago, and I haven&#039;t looked back since. So the possibility of using this in parallel will be phenomenal!

I enjoy your blog. Your ramblings have given me a lot to think about, so therefore they must be of some value.</description>
		<content:encoded><![CDATA[<p>Thanks for the feedback. I certainly have a problem that is &#8220;embarrassingly parallel,&#8221; and my serial implementation of it is too slow for practical purposes. It&#8217;s for my dissertation, so I don&#8217;t have an immediate need to fix it, but it&#8217;s hard for me to play around with new ideas when each iteration is very time-consuming (much like you described with your 40,000 sims example).</p>
<p>Hadley Wickham informed me last night that &#8220;plyr will be parallel by end of summer.&#8221; He turned me onto that library when I met him at a conference in Mexico a few months ago, and I haven&#8217;t looked back since. So the possibility of using this in parallel will be phenomenal!</p>
<p>I enjoy your blog. Your ramblings have given me a lot to think about, so therefore they must be of some value.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: JD Long</title>
		<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/comment-page-1/#comment-3145</link>
		<dc:creator>JD Long</dc:creator>
		<pubDate>Thu, 15 Jul 2010 13:32:51 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=736#comment-3145</guid>
		<description>John, 

Glad I&#039;ve piqued your interest! Let me see if I can shed some light on your questions:

1) M/R with other infrastructure: I have not done M/R on infrastructure other than Amazon/Hadoop. I&#039;m interested in investigating other infrastructures, but EMR just works so dang well that I&#039;ve not spent any time looking at other options.

2) Yep, I&#039;ve done a little bit of &#039;larger than memory&#039; work with R. Generally the rule of thumb is &#039;big data&#039; means &quot;more data than will fit on one machine.&quot; I&#039;ve NOT done any work with that amount of data. I just don&#039;t have any petabyte-scale problems. Most of the &#039;larger than memory&#039; work I&#039;ve done is the type of thing that can be broken into chunks and analyzed one chunk at a time. For example, if I&#039;m looking at statistical data for the US and my modeling resolution is at the state level, then I can read one state&#039;s worth of data into R, do my analysis, spit out some results, and then dump my source data and read in the next state from an external DB.
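
To make the chunk-at-a-time idea concrete, here is a minimal sketch. It is written in Python rather than R purely for illustration, and everything in it is an assumption: FAKE_DB stands in for the external DB, and load_state and analyze are hypothetical helpers, not functions from any real package.

```python
# Hypothetical stand-in for an external DB: state code -> observations.
FAKE_DB = {
    "IA": [1.0, 2.0, 3.0],
    "NE": [10.0, 20.0],
    "KS": [5.0],
}


def load_state(state):
    """Stand-in for a query that returns one state's rows."""
    return list(FAKE_DB[state])


def analyze(rows):
    """Stand-in for the real model; here it is just the mean."""
    return sum(rows) / len(rows)


def analyze_by_state(states):
    """Process one state at a time so only one chunk is held in memory."""
    results = {}
    for state in states:
        rows = load_state(state)        # read a single state's data
        results[state] = analyze(rows)  # keep only the small result
        del rows                        # drop the source data before the next read
    return results


results = analyze_by_state(sorted(FAKE_DB))
print(results)
```

The only thing that accumulates across iterations is the small per-state result, which is what keeps the working set bounded.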

If you&#039;re interested in &#039;larger than memory&#039; problems and high performance computing with R you should read through the CRAN HPC task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html Lots of good stuff in there. 

In terms of where to start: 

A) Define your problem. IMO it&#039;s very hard to just &quot;learn all you can about HPC and R.&quot; The problem I started with was fairly straightforward: I had simulations that took a minute each and I had to run 40,000 of them. I just wanted to parallelize that. And as I solved that one particular use case I learned a lot about other use cases and when I might want to use them.

B) foreach is a great abstraction. What makes it great is its pluggable backend infrastructure: you write your foreach code once, then run it on a single machine in single-threaded mode, use the multicore backend to run it in parallel on a single machine, or change backends and run it on a grid. With foreach, changing the backend means changing only one line of code. That, IMO, is a very flexible abstraction.
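
The "write once, swap the backend" idea can be sketched as follows. This is an analogy in Python using the standard concurrent.futures module, not the foreach API itself; simulate and run_all are hypothetical names made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed):
    """Hypothetical stand-in for one simulation run."""
    return seed * seed


def run_all(seeds, map_backend):
    """The work is written once; map_backend decides how it executes."""
    return list(map_backend(simulate, seeds))


seeds = range(5)

# Backend 1: plain sequential execution with the built-in map.
sequential = run_all(seeds, map)

# Backend 2: a thread pool on one machine. Only the backend argument changes.
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = run_all(seeds, pool.map)

print(sequential, threaded)
```

Both calls return results in input order, so the calling code does not need to know which backend ran the work.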

I&#039;m working on code to allow parallel processing from R on Amazon EMR. It&#039;s really just a mapper with no reducer. After I get the kinks worked out, it&#039;s my intent to create a foreach backend out of it. Actually, I hope the community will help me with that ;)

Thanks for reading my blog and I hope my ramblings are of some value... at least on the margin.</description>
		<content:encoded><![CDATA[<p>John, </p>
<p>Glad I&#8217;ve piqued your interest! Let me see if I can shed some light on your questions:</p>
<p>1) M/R with other infrastructure: I have not done M/R on infrastructure other than Amazon/Hadoop. I&#8217;m interested in investigating other infrastructures, but EMR just works so dang well that I&#8217;ve not spent any time looking at other options.</p>
<p>2) Yep, I&#8217;ve done a little bit of &#8216;larger than memory&#8217; work with R. Generally the rule of thumb is &#8216;big data&#8217; means &#8220;more data than will fit on one machine.&#8221; I&#8217;ve NOT done any work with that amount of data. I just don&#8217;t have any petabyte-scale problems. Most of the &#8216;larger than memory&#8217; work I&#8217;ve done is the type of thing that can be broken into chunks and analyzed one chunk at a time. For example, if I&#8217;m looking at statistical data for the US and my modeling resolution is at the state level, then I can read one state&#8217;s worth of data into R, do my analysis, spit out some results, and then dump my source data and read in the next state from an external DB.</p>
<p>If you&#8217;re interested in &#8216;larger than memory&#8217; problems and high performance computing with R you should read through the CRAN HPC task view: <a href="http://cran.r-project.org/web/views/HighPerformanceComputing.html" rel="nofollow" onclick="pageTracker._trackPageview('/outgoing/cran.r-project.org/web/views/HighPerformanceComputing.html?referer=');">http://cran.r-project.org/web/views/HighPerformanceComputing.html</a> Lots of good stuff in there. </p>
<p>In terms of where to start: </p>
<p>A) Define your problem. IMO it&#8217;s very hard to just &#8220;learn all you can about HPC and R.&#8221; The problem I started with was fairly straightforward: I had simulations that took a minute each and I had to run 40,000 of them. I just wanted to parallelize that. And as I solved that one particular use case I learned a lot about other use cases and when I might want to use them.</p>
<p>B) foreach is a great abstraction. What makes it great is its pluggable backend infrastructure: you write your foreach code once, then run it on a single machine in single-threaded mode, use the multicore backend to run it in parallel on a single machine, or change backends and run it on a grid. With foreach, changing the backend means changing only one line of code. That, IMO, is a very flexible abstraction.</p>
<p>I&#8217;m working on code to allow parallel processing from R on Amazon EMR. It&#8217;s really just a mapper with no reducer. After I get the kinks worked out, it&#8217;s my intent to create a foreach backend out of it. Actually, I hope the community will help me with that <img src='http://www.cerebralmastication.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Thanks for reading my blog and I hope my ramblings are of some value&#8230; at least on the margin.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Ramey</title>
		<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/comment-page-1/#comment-3139</link>
		<dc:creator>John Ramey</dc:creator>
		<pubDate>Thu, 15 Jul 2010 04:19:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=736#comment-3139</guid>
		<description>JD,

After reading this post and watching your presentation, my interest in Map/Reduce (M/R) is certainly piqued.  A couple of questions:

1) Have you attempted to use this M/R with R using another infrastructure, say a cluster? If so, does the code stay much the same as your Amazon code?

2) In the video presentation, you make it clear that the data that you use is &quot;small&quot;.  Have you played with any larger data sets using M/R and R? I&#039;m curious if I&#039;d have to use something like the R package bigmatrix in order to deal with this situation on top of M/R.

Right now I know all the buzz words, but I don&#039;t know what steps to take -- I&#039;m kind of overwhelmed with all of the different options for parallel computation in R (e.g. foreach and multicore), so any advice on where the hell I should start would be much appreciated.</description>
		<content:encoded><![CDATA[<p>JD,</p>
<p>After reading this post and watching your presentation, my interest in Map/Reduce (M/R) is certainly piqued.  A couple of questions:</p>
<p>1) Have you attempted to use this M/R with R using another infrastructure, say a cluster? If so, does the code stay much the same as your Amazon code?</p>
<p>2) In the video presentation, you make it clear that the data that you use is &#8220;small&#8221;.  Have you played with any larger data sets using M/R and R? I&#8217;m curious if I&#8217;d have to use something like the R package bigmatrix in order to deal with this situation on top of M/R.</p>
<p>Right now I know all the buzz words, but I don&#8217;t know what steps to take &#8212; I&#8217;m kind of overwhelmed with all of the different options for parallel computation in R (e.g. foreach and multicore), so any advice on where the hell I should start would be much appreciated.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Siah</title>
		<link>http://www.cerebralmastication.com/2010/06/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/comment-page-1/#comment-2729</link>
		<dc:creator>Siah</dc:creator>
		<pubDate>Mon, 28 Jun 2010 22:30:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.cerebralmastication.com/?p=736#comment-2729</guid>
		<description>Any chance of seeing some of your EC2/R code? I&#039;d love to put some of this fancy elastic code in my dissertation and dress it up a little!

This Map/Reduce thing sounds like a great buzzword for my dissertation :)</description>
		<content:encoded><![CDATA[<p>Any chance of seeing some of your EC2/R code? I&#8217;d love to put some of this fancy elastic code in my dissertation and dress it up a little!</p>
<p>This Map/Reduce thing sounds like a great buzzword for my dissertation <img src='http://www.cerebralmastication.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
</channel>
</rss>

