Using the R multicore package in Linux with wild and passionate abandon

One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It’s not uncommon for each of my simulations to take 20 seconds or more to complete (if you’re doing the math, that’s 55+ hours for 10K sims or, approximately, 453 games of solitaire). Initially I ran my sims in R on an Oracle VirtualBox VM (Oracle owns VirtualBox now! *gasp*) running Ubuntu. Lately I’ve moved to running my sims on EC2 machines. I’m not yet doing RMPI clustering, although that is on my roadmap. Currently I just fire up a couple of 8-core instances, run 5K sims on each one, then FTP the results back to my desktop. It’s not very sexy, but it gets the job done… I guess the same could be said of myself, except substitute “makes slurping sounds eating udon” in the place of “gets the job done.”

When running processor-intensive crap (that’s a stochastic modeling term), the single-threaded nature of R is painful. In Linux or Mac (i.e. NOT Windows) the multicore package is a real godsend. I did a quick code review and, from what I can tell, multicore exploits worm holes to travel back in time and reports your results in a fraction of the time you would expect it to take. Seriously. I expect that as the code matures my computer will fill up with results from simulations I have not even coded yet. It’s almost like magic, except without the rabbit and hat.

The crux of the package is a parallelized version of lapply() called mclapply(). I believe the mc stands for ‘magic carpet’ and is an allusion to the worm hole technology. So how does one harness this package for nefarious self-interest doing parallel operations in R? The ultra-short answer is: write your R code so that the most processor-intensive bit is done with an lapply() function. Then replace the lapply() with mclapply(). Of course you have to load the multicore package before you run it. But that’s basically it.
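
Here’s a minimal sketch of that swap (slowSquare() below is just a stand-in for whatever your expensive bit is):

library(multicore)

slowSquare <- function(x) { Sys.sleep(1); x^2 }  # pretend this is expensive

serialResults   <- lapply(1:8, slowSquare)    # one core, roughly 8 seconds
parallelResults <- mclapply(1:8, slowSquare)  # all cores; ~1 second on 8 cores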

How I implement mclapply() is thusly: I build a table with all my random draws for my simulations. So if I have 20 variables and want to run 10,000 simulations, then I’ll build a data frame with all 200,000 values (generally 10K rows and 21 columns for the 20 variables + an index). The index keeps track of the draw number. Then I have code that performs the ‘valuation’ based on a single observation of the 20 variables. I wrap the valuation step in a function and then call the valuation process 10,000 times with mclapply(). So it might look something like this:

myOutput <- mclapply(drawList, function(x) valuationReturns(drawNumber = x))

The drawList object is simply a list of the possible indexes (i.e. 1:10000). When the code has iterated over each value from drawList, the results will be in the myOutput object. Tada!
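
Fleshed out a bit, the whole pattern looks something like this (the body of valuationReturns() here is a dummy; the real one is my secret sauce):

library(multicore)

nSims <- 10000
draws <- data.frame(index = 1:nSims,
                    matrix(rnorm(nSims * 20), ncol = 20))  # 20 variables per draw

valuationReturns <- function(drawNumber) {
  obs <- unlist(draws[drawNumber, -1])  # the 20 variables for this draw
  sum(obs)                              # placeholder for the real valuation
}

drawList <- as.list(1:nSims)
myOutput <- mclapply(drawList, function(x) valuationReturns(drawNumber = x))
results  <- unlist(myOutput)            # collapse the list back into a vector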

I recommend the htop program for tracking what’s going on with processor utilization in Linux (I presume on a Mac too, if you ask Steve Jobs nicely). If everything is cranking well and you have 8 cores, you should see all 8 of them pegged at or near 100%.

I don’t understand time travel, but I’ve found that I have better luck if I set mc.preschedule=FALSE. Apparently prescheduled magic carpets are finicky. If I leave mc.preschedule at the default of TRUE, I find that some of my cores often go underutilized.
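
In code that’s just one extra argument to the same call from above:

myOutput <- mclapply(drawList, function(x) valuationReturns(drawNumber = x),
                     mc.preschedule = FALSE)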

Let me know if you have other multicore tips and tricks.

If you want to give me shit for running my simulations as root, feel free. I’m impervious to your “best practices” mumbo jumbo. La la la la la la!! Not listening!

Special thanks to John Cavazos over at the University of Delaware from whom I stole the MC for Dummies image. John, you’re a gentleman and a humble scholar. Damn few of us left.

19 Comments

  1. Boris Shor says:

    Very cool, JD. I’ve also dipped my toes in the multiprocessing water. In my case I use 64-bit Windows, so Revolution R Enterprise is necessary for me. And it works really well. Here’s an example of how it works under Windows:

    http://blog.revolution-computing.com/2009/05/parallelized-backtesting-with-foreach.html

  2. JD Long says:

    I’ll certainly be looking at foreach(). I’ve gotten some good input that foreach() makes migrating to MPI easier later. And I can’t wait to try out some MPI magic!

  3. Since it sounds like your base simulation code runs in each thread totally separately, why not hack together a use of Elastic MapReduce? If you want N simulations with some summary statistics put out for each of them, call a Mapper that maps an index running from 1 to N to a simulation’s summary statistics, then just have the Reducer pool these statistics into a single output file.

    Of course, if there are any dependencies between simulations, this approach is not going to help you one bit.

  4. JD Long says:

    JM White, that sounds like a hell of a good idea!! I don’t do path-dependent sims, so each one is totally independent. Do you have any links to R examples using Elastic MapReduce? I see a lot of stuff that says I should be able to do it but, honestly, I don’t know where to start. For example, I’m not even sure if I write my code on the desktop and then call the cloud and have it pass back answers as if my desktop were the supernode… or do I write special code that gets batched up and sent out, then automagically aggregated back together with AMZN working like the supernode? Any tips on where to start would be fabulous in the most geeky way possible.

  5. JD Long,

    I actually haven’t run an R script using Elastic MapReduce so far, but have worked with Ruby and Python a decent amount. The Python example that I found most helpful was the following,

    http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2294

    which I also think is just a really cool data analysis to begin with. I think that example provides a pretty smooth walkthrough of everything, but feel free to e-mail me if you have specific questions.

    The big takeaway message for using Hadoop is really simple: your mapper just needs to (in theory) get input from standard in and output data in a canonical key-value pair output format. In practice, you can ignore standard in and just have the mappers execute arbitrary code. For R, this sums up to saying: just make sure you use the right cat() commands. You can write a test version on your desktop, but eventually you need to upload the mapper and reducer scripts to AWS and let their infrastructure do all the magic.
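
    A bare-bones R mapper might look something like this (simulateOne() is just a stand-in for the real model):

    #!/usr/bin/env Rscript
    # Hadoop Streaming mapper sketch: read one simulation index per line from
    # stdin and emit a tab-separated key-value pair per simulation.
    simulateOne <- function(i) {
      set.seed(i)
      mean(rnorm(1000))  # placeholder summary statistic
    }

    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      if (!nzchar(line)) next  # skip blank lines
      i <- as.integer(line)
      cat(i, simulateOne(i), sep = "\t")
      cat("\n")
    }
    close(con)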

    There are also more sophisticated tools available for working with Hadoop in R, like David Rosenberg’s Hadoop Streaming package:

    http://r-forge.r-project.org/projects/hadoopstreaming/

  6. Saptarshi Hija says:

    Hello,
    Is there any way I can see how you generated the 10,000 rows and 20 columns? Would it be possible to share the code for valuationReturns?

    Thank you
    Saptarshi

  7. JD Long says:

    Saptarshi, in short I use the QRMlib library to create correlated random deviates. valuationReturns() is an internal model that I can’t share. Sorry about that, but it’s my secret sauce.
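
    I can sketch the shape of the draw table, though. Here’s a generic stand-in that uses MASS::mvrnorm() instead of QRMlib, with a toy constant-correlation matrix:

    library(MASS)

    nSims <- 10000
    nVars <- 20
    Sigma <- matrix(0.2, nVars, nVars)  # toy: 0.2 correlation everywhere
    diag(Sigma) <- 1

    draws <- as.data.frame(mvrnorm(n = nSims, mu = rep(0, nVars), Sigma = Sigma))
    draws <- cbind(index = seq_len(nSims), draws)  # 10K rows x 21 columns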

  8. Santosh says:

    Thanks JD. I have a bunch of nonlinear estimations with bootstrapped confidence intervals; I think this will make the task easier. I was wondering if this will be useful for calculating VaR for large portfolios (too many assets) where one has to simulate the joint path of the constituent assets (which include derivatives).

  9. JD Long says:

    Santosh, multicore should work for your applications. One thing to consider, however, is path dependency. Multicore runs the jobs in parallel, so one job can’t depend on another job currently running in parallel. You could, for example, value all the positions at time 1 using mclapply(), then when that step is complete do some sequential logic (a random draw, or whatever), then have that feed another valuation step done with mclapply(), and so on.
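
    A rough sketch of that staged pattern, with a toy random walk standing in for the real valuation:

    library(multicore)

    nSteps <- 10
    positions <- 1:5000
    prices <- rep(100, length(positions))  # toy starting prices

    for (t in 1:nSteps) {
      # parallel within a time step: value each position independently
      vals <- mclapply(positions,
                       function(p) prices[p] * exp(rnorm(1, 0, 0.01)))
      # sequential between steps: the next step depends on this one
      prices <- unlist(vals)
    }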

  10. JD,

    Have you profiled your code to determine the bottleneck? Is it something in QRMlib or does your secret sauce need some spice?

    Best,
    Josh

  11. JD Long says:

    Josh:

    Yeah, I profiled and know exactly what’s time-consuming. Unfortunately it’s the sauce ;)

    I do all my random draws prior to starting the runs, so the QRMlib process doesn’t hold things up. In my sims I take my random draw and do some curve fitting to resolve a number of relationships; then the model has to calculate gains/losses for 40K+ policy units. It’s basically the valuation of the policies that takes the longest.

    Probably a reasonable analogy would be valuing 40K+ options in 40 markets.

  12. [...] to visit with Madam Wu, you ask? Well the short answer is Hadoop. Yeah, the cute little elephant. As I have told you before, multicore makes your R code run fast by using worm holes to shoot your results back from the [...]

  13. dude says:

    How about some seamless integration of plyr with multicore – that would be awesome!

  14. userR says:

    Have you tried running Ubuntu’s revolution-r package in your VM? It includes a library with faster (multi-threaded) versions of many of the base functions.

    I’ve noticed that for certain operations, there are some substantial speedups (for my toy comparison on a 2-core Ubuntu VM, going from 8 seconds to 1.7)

    On Ubuntu 9.10, just run “sudo aptitude install revolution-r”

    My benchmark was:

    m <- matrix(rnorm(2 * 10^7), ncol = 10^4)  # a 2,000 x 10,000 random matrix
    system.time(crossprod(m))  # t(m) %*% m; a multithreaded BLAS speeds this up

    Good luck!

  15. ayman says:

    The way mc.preschedule works is this. Let’s say you have 2 cores and a list of 10 things.

    If it’s set to TRUE, like mclapply(1:10, identity, mc.preschedule=TRUE), then R spawns two children (one for each core) and roughly sends all the odd numbers to the first core and all the even numbers to the second core. Now suppose one of the even numbers (say #6) takes a long time. If the odd numbers finish faster, core #1 will sit idle while core #2 finishes.

    If it’s set to FALSE, like mclapply(1:10, identity, mc.preschedule=FALSE), then R spawns one child per item, and as each child finishes it launches the next. So in this example, while #6 holds up one core for a long time, the remaining jobs can still be throttled through the other core. In the end R will have launched as many children as the length of the input sequence (1:10 in this case), which takes some extra time since it has to launch a child and clean up afterwards for each item.

    So generally, if you have a lot of little fast jobs, set it to TRUE to get better performance. If you have a lot of variance in the run time, set it to FALSE.
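
    You can see the difference with a toy job where one item is slow (timings illustrative; assumes a 2-core box):

    library(multicore)

    slowSixth <- function(i) { Sys.sleep(if (i == 6) 5 else 0.5); i }

    system.time(mclapply(1:10, slowSixth, mc.cores = 2, mc.preschedule = TRUE))
    system.time(mclapply(1:10, slowSixth, mc.cores = 2, mc.preschedule = FALSE))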

  16. JD Long says:

    Ayman, thanks for posting that info! That is VERY helpful and I did not understand this before you posted it.

  17. Aleks Clark says:

    Ever tried snowfall? sfClusterApplyLB() is pretty efficient at maxing out my cores. Not sure how the internals work, but it gets the job done without having to mess with what kind of jobs they are.
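
    For reference, the pattern is about as simple as it gets (placeholder worker function here):

    library(snowfall)
    sfInit(parallel = TRUE, cpus = 4)  # local cluster with 4 workers
    out <- sfClusterApplyLB(1:10000, function(x) x^2)  # load-balanced apply
    sfStop()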

  18. John Ramey says:

    I have several networked computers that each have multiple cores. What R package would you recommend to take advantage of not only the multiple cores but the multiple computers?

    An even nicer feature that I’m seeking is a job queueing system for this kind of setup; if this is not available, that is okay. I’m really trying to figure out what my options are.

    Please note that I prefer an easier setup over an efficient setup because many of the people that will use this system are not proficient in R.

  19. JD Long says:

    John, I think you want to use snow or RMPI. If you are already running MPI, then RMPI is the way to go. If you just have a bunch of computers that are not currently running as a grid, I think snow is the way to go. I highly recommend also using the foreach package. foreach has backends for multicore, snow, and RMPI, so you only have to change one line of code to switch from one parallel method to another.
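
    The pattern looks roughly like this with the multicore backend (swap in doSNOW or doMPI and the loop body stays the same):

    library(foreach)
    library(doMC)  # or doSNOW / doMPI for the other backends
    registerDoMC()  # register the local cores

    results <- foreach(i = 1:10000, .combine = c) %dopar% {
      sqrt(i)  # placeholder for the real per-sim work
    }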
