Bootstrapping the latest R into Amazon Elastic MapReduce

I’ve been continuing to muck around with using R inside of Amazon Elastic MapReduce jobs. I’ve been working on abstracting the lapply() logic so that R will farm the pieces out to Amazon EMR. This is coming along really well, thanks in no small part to the Stack Overflow [r] community. I have no idea how crappy coders like me got anything at all done before the Interwebs.

One of the immediate hurdles faced when trying to use AMZN EMR in anger is that the default version of R on EMR is 2.7.1. Yes, that is indeed the version that Moses taught the Israelites to use while they wandered in the desert. I’m impressed by your religious knowledge. At any rate, all kinds of things go to hell when you try to run code and load packages in 2.7.1. When I first started fighting with EMR, the only solution was to backport my code and alter any packages so they would run in 2.7.1. Yes, that is, as Moses would say, a Nudnik. Nudnik also happens to be the pet name my neighbors have given me. They love me. Where was I? Oh yeah, Methuselah’s R version.

Recently Amazon released a neat feature called “Bootstrapping” for EMR. Before you start thinking about sampling and resampling and all that crap, let me clarify: this is NOT statistical bootstrapping. It’s called bootstrapping because it’s code that runs after each node boots up, but before the mapper procedure runs. So to get a more modern version of R loaded onto each node, I set up a little script that updates the sources.list file and then installs the latest version of R. And since I’m a caring, sharing guy, here’s my script:

And if that doesn’t show up for some reason, you can find all 5 lines of its bash glory here over at github.
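In case that link ever rots too, here’s a sketch of what the script does (the mirror URL and Debian release name below are stand-ins; my actual script points at a Chicago-area CRAN mirror):

    #!/bin/bash
    # Runs on every node after it boots, before the mapper ever starts.

    # Point apt at a CRAN Debian repository so a current R build is visible.
    echo "deb http://cran.r-project.org/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list

    # Refresh the package index and install the newest R available.
    # (--force-yes skips the repo signature check; add the CRAN key instead if that bugs you.)
    sudo apt-get update
    sudo apt-get install -y --force-yes r-base r-base-dev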

If you’re not conveniently located in Chicago, IL, you may want to change your R mirror location. The bootstrap action can be set up from the EMR web GUI, or, if you’re firing jobs off with the elastic-mapreduce command line tools, you just add the option “--bootstrap-action s3://myBucket/bootstrap.sh”, assuming myBucket is the bucket with your script in it and bootstrap.sh contains your bootstrap shell script. And then, as my buddies in Dublin say, “Bob’s your mother’s brother.”
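To put that option in context, a full job kickoff might look something like this (the bucket, mapper script, and instance count are all placeholders):

    # Launch a streaming job flow with the bootstrap action bolted on.
    elastic-mapreduce --create --stream \
      --bootstrap-action s3://myBucket/bootstrap.sh \
      --input   s3://myBucket/input/ \
      --output  s3://myBucket/output/ \
      --mapper  s3://myBucket/mapper.R \
      --reducer cat \
      --num-instances 5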

And before you ask, yes, this slows crap down. I’ll probably hack together a script that will take the R binaries and other needed upgrades out of Amazon S3 and load them in a bootstrap action which will greatly speed things up. The above example has one clear advantage over loading binaries from S3: It works right now. And remember folks, code that works right now kicks code that “might work someday” right in the balls. And then mocks it while it cries.

6 Comments

  1. Siah says:

    Any chance of seeing some of your EC2/R code? I’d love to put some of this fancy elastic code in my dissertation and dress it up a little!

    This Map/Reduce thing sounds like a great buzzword for my dissertation :)

  2. John Ramey says:

    JD,

    After reading this post and watching your presentation, my interest in Map/Reduce (M/R) is certainly piqued. A couple of questions:

    1) Have you attempted to use this M/R with R using another infrastructure, say a cluster? If so, does the code stay much the same as your Amazon code?

    2) In the video presentation, you make it clear that the data that you use is “small”. Have you played with any larger data sets using M/R and R? I’m curious if I’d have to use something like the R package bigmemory in order to deal with this situation on top of M/R.

    Right now I know all the buzz words, but I don’t know what steps to take — I’m kind of overwhelmed with all of the different options for parallel computation in R (e.g. foreach and multicore), so any advice on where the hell I should start would be much appreciated.

  3. JD Long says:

    John,

    Glad I’ve piqued your interest! Let me see if I can shed some light on your questions:

    1) M/R with other infrastructure: I have not done M/R on infrastructure other than Amazon/Hadoop. I’m interested in investigating other setups, but EMR just works so dang well that I’ve not spent any time looking at other options.

    2) Yep, I’ve done a little bit of ‘larger than memory’ work with R. Generally the rule of thumb is that ‘big data’ means “more data than will fit on one machine.” I’ve NOT done any work with that amount of data. I just don’t have any petabyte scale problems. Most of the ‘larger than memory’ work I’ve done is the type of thing that can be broken into chunks and analyzed one chunk at a time. For example, if I’m looking at statistical data for the US and my modeling resolution is at the state level, then I can read one state’s worth of data into R, do my analysis, spit out some results, then dump my source data and read in the next state from an external DB.
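    In code, that state-at-a-time pattern looks roughly like this (RSQLite and the table/column names here are just stand-ins for whatever external DB you actually have):

        library(DBI)
        library(RSQLite)   # stand-in for your external DB of choice

        con <- dbConnect(SQLite(), "claims.sqlite")

        # Read one state's worth of data, analyze it, and let it go
        # before the next read, so only one chunk is in memory at a time.
        results <- lapply(state.abb, function(st) {
          dat <- dbGetQuery(con, sprintf(
            "SELECT * FROM claims WHERE state = '%s'", st))
          data.frame(state = st, mean_loss = mean(dat$loss))
        })

        dbDisconnect(con)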

    If you’re interested in ‘larger than memory’ problems and high performance computing with R you should read through the CRAN HPC task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html Lots of good stuff in there.

    In terms of where to start:

    A) Define your problem. IMO it’s very hard to just “learn all you can about HPC and R.” The problem I started with was fairly straightforward: I had simulations that took a minute each and I had to run 40,000 of them. I just wanted to parallelize that. And as I solved that one particular use case, I learned a lot about other use cases and when I might want to use them.

    B) foreach is a great abstraction. The thing that makes it great is that it has a pluggable backend infrastructure. So you write your foreach code once. Then you can run it on a single machine in single-threaded mode, or use the multicore backend to run it multithreaded on a single machine, or change backends and run it on a grid. With foreach, changing the backend means changing only one line of code. That, IMO, is a very flexible abstraction.
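    Just to make that concrete, here’s a tiny sketch (the doMC backend and the toy loop body are stand-ins; register whatever backend fits your hardware):

        library(foreach)
        library(doMC)   # one of several available backends

        # Register a backend. Changing where the work runs is this one line.
        registerDoSEQ()            # plain sequential, handy for debugging
        # registerDoMC(cores = 4)  # ...or multicore on a single machine

        # The loop itself never changes, only the registered backend does.
        results <- foreach(i = 1:100, .combine = c) %dopar% {
          mean(rnorm(1000))        # stand-in for a one-minute simulation
        }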

    I’m working on code to allow parallel processing from R on Amazon EMR. It’s really just a mapper with no reducer. After I get the kinks worked out, it’s my intent to create a foreach backend out of it. Actually, I hope the community will help me with that ;)

    Thanks for reading my blog and I hope my ramblings are of some value… at least on the margin.

  4. John Ramey says:

    Thanks for the feedback. I certainly have a problem that is “embarrassingly parallel,” and my serial implementation of it is too slow for practical purposes. It’s for my dissertation, so I don’t have an immediate need to fix it, but it’s hard for me to play around with new ideas when each iteration is very time-consuming (much like you described with your 40,000 sims example).

    Hadley Wickham informed me last night that “plyr will be parallel by end of summer.” He turned me onto that library when I met him at a conference in Mexico a few months ago, and I haven’t looked back since. So the possibility of using this in parallel will be phenomenal!

    I enjoy your blog. Your ramblings have given me a lot to think about, so therefore they must be of some value.

  5. Aliona says:

    Hi JD,
    I’ve been familiarizing myself with Amazon’s web services lately. Your presentation and posts are extremely helpful. Thank you!

    To be honest, I am pretty new to AWS. I have played with S3/EMR and EC2 separately, but I’m still missing an understanding of how they interact. If you have any resources on that, I would greatly appreciate them as well.

    My main question, though, is related to the one you are describing here.
    I need to run an R computation on 20 instances. The issue is that it requires some recent packages and an R version > 2.11.
    Your bootstrapping suggestion is great. Unfortunately, though, for 20 instances it means that each one will go to CRAN separately to update R and install all the packages. As you can imagine, this could be very time-consuming, not to mention the pressure on bandwidth…
    So I am wondering: is there a way to set up an EC2 instance or image with the necessary R version and packages already installed, and direct the job flow to use that instance/image as the master node?

    Thanks,
    Aliona

  6. JD Long says:

    Aliona, I can think of about 3 ways to do what you are talking about.

    1) Create your own AMI with all needed tools and use that for all nodes. I’ve been thinking about building & maintaining an R/Ubuntu AMI but I’ve just not made the time to do it.

    2) Use Chef or Puppet (sys admin tools) to fire up the cluster and do the configuration. This is kind of a pain since you’d have to learn an admin tool. But if you want a script for loading the latest R and related packages on Debian (it may work on Ubuntu too), I have one here: http://code.google.com/p/segue/source/browse/inst/bootstrapLatestR.sh

    3) If your R computational problem can be structured as an lapply() across a list, you might be interested in my Segue package: http://code.google.com/p/segue/
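    To give you a flavor, the basic Segue pattern looks roughly like this (it’s alpha, so the details may drift; the credentials and the toy function are obviously placeholders):

        library(segue)
        setCredentials("YOUR_AWS_KEY", "YOUR_AWS_SECRET_KEY")

        # Spin up a small EMR cluster (this takes a few minutes).
        myCluster <- createCluster(numInstances = 2)

        # emrlapply() works like lapply(), except each element of the
        # list gets farmed out to the cluster as its own map task.
        results <- emrlapply(myCluster, as.list(1:10), function(i) i^2)

        # Shut the cluster down when you're done paying for it.
        stopCluster(myCluster)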

    If you end up trying out Segue feel free to post questions/comments on our discussion list: http://groups.google.com/group/segue-r

    Segue is currently in alpha. It’s under active development, but I’m using it on a regular basis for real-life work.

    -JD
