Chicago R User Group… It’s for the sexy people!

Give it up for Morris Day and The Mother Fucking Time!!!!

Morris Day, y'all!

I think we all know that Morris Day was talking about when he wrote the lyrics to “The Bird”:

Yes! Hold on now, this dance ain’t for everybody.
Just the sexy people.
White folks, you’re much too tight.
You gotta shake your head like the black folks.
You might get some tonight.
Look out!

That’s right, he was talking about the new R User Group in Chicago! a.k.a Chicago RUG! We know that R is sexy because statistical analysis is sexy. That is, if you’re doing it right! Even Mike Driscol at Dataspora knows that Data Geeks have to get their sexy on.  There is no doubt that Chicago is sexy. The second city is so damned sexy that Karen Abbott wrote Sin in the Second City and managed to get it on the NYT best sellers list. She makes me reconsider my agrarian interpretation of Chicago’s “meat packing” heritage. *rim shot* Thank you, thank you. I’ll be here all week. Try the veal!

If you’re in Chicagoland and reading this blog then you have every reason to get over to the Chicago R User Group web site and sign up! I’m looking forward to meeting all the Chicago R users in the near future. In case you’re afraid you won’t recognize me I’ll be the one that looks just like Morris Day… only white… and not as well dressed… and kinda nerdy. But otherwise, just like Morris.

Now shut up and dance!

Morris Day and the Time on Grooveshark!

The Future of Math is Statistics

The future of math is statistics… and the language of that future is R:

I’ve often thought there was way too little “statistical intuition” in the workplace. I think Author Benjamin would agree.

Lookup Performance in R

Rumor has it that Joe Adler, author of the O’Reilly Book R in a Nutshell, has joined Linked In as a data scientist.  But that does not keep him from still pumping out some interesting content over at OReilly.com. His latest article is about lookup performance in R. He does a great job giving code samples and explaining what he is doing. Worth reading, for sure.

Real-World, Real-Time Analytics

Stop wasting time reading my drivel. You need to head over the the DataWrangling.com blog and read Peter Skomoroch’s interview with Bradford Cross of FlightCaster.

Peter wrote up this interview back in August 2009, so I’m a little late to this party. There’s some really great quotes in this interview. Here’s a few of my fav quotes from Cross:

At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs.

Building layer upon layer of abstraction is a big key…  The technical term for this is “wrap the crap.”

Here’s a problem I think anyone who works with data and models can relate to:

I made a lot of mistakes early in my career in building trading models where I let me theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.

You can Hadoop it! It’s elastic! Boogie woogie woog-ie!

This blog's name in Chinese!

I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng (肏你娘) which your friend in grad school told you means “Live happy with many blessings”. Trust me, I’ve been hanging with Madam Wu and she told me it doesn’t mean that.

So how did I travel to the future to visit with Madam Wu, you ask? Well the short answer is Hadoop. Yeah, the cute little elephant. As I have told you before, multicore makes your R code run fast by using worm holes to shoot your results back from the future. Well Hadoop actually takes you to the future on the back of an elephant and you can bring your own results back! I couldn’t make this up if I tried, so you know it’s true! And what’s fantastic about all of this is Hadoop works with R! And Amazon will let you rent a time traveling elephant through their Elastic MapReduce service! I think Amazon coined the term “Time Travel as a Service” or TTaaS  generally pronounced as “ta-tas” in the industry. If you are a CTO be sure and use this in your next “vision statement” pitch so everyone will know you’re hip to all this cloud stuff.

So you use R and you want to travel into the future on the back of an elephant to visit Madam Wu and get your model results back, don’t you? Well it’s a damn good thing you read this blog because I’m going to give you the keys to the Wu dynasty and a little 福寿 while we’re at it.

I’ve never had an original thought in my life so I started with this discussion over at the AMZN E M/R discussion forum. Peter Skomoroch from Data Wrangling gives a very good example with all the data and code provided so you can run it yourself.  Pete’s example really shakes the yáng guǐzi, as we say in the future. In addition I read the documentation for David Rosenberg’s HadoopStreaming package which was good for insight, but I didn’t use the package as it’s really focused on the ‘big data’ problem.

That elephant is so freaking cute!

Prior to my foray into time travel, I knew that Hadoop could be used to process big text files and do something like rip out all the links and count them. But I thought that Hadoop was all about processing big data. I never paid attention to the big Hadoop elephant in the room because I don’t have big data. I have big CPU hogging models (mostly slow because I don’t code worth a shit). What got me reconsidering my world view was John Myles White’s comment on my multicore post earlier. John encouraged me to look into running my simulations on AMZN’s E M/R service using Hadoop streaming. So instead of giving Hadoop  a big fat text file to parse, I just gave it a text file with 10,000 rows each containing an integer from 1:10,000. Then I refactored my R code to read a line from stdin, trim it down to just the integer, and then go run the simulation with that number. When done I had it serialize the resulting model output and return that to stdout. Hadoop takes care of chopping up the input and pulling together the output.

I learned a few “gotchas” or, as we say in the future: 臭婊子(I think that should be plural). I’ll do a whole blog post on gotchas soon, but here’s the bullet points:

  • AMZN is currently running the version of Debian Linux named Lenny which has version 2.7.1 of R installed. No matter what the documentation says, don’t let Lenny tend to the rabbits.
  • Test all code by firing up an interactive Pig instance and logging in as ‘hadoop’. Instead of running Pig, run R and test your code. And as it says in the FAQ: “The Pig don’t care either way. ” Which, despite sounding like buggery, is the truth.
  • If your code runs inside of R on a Hadoop instance, drop back to the command line on the Hadoop instance and run ‘cat infile.txt | yourMapper.R | sort | yourReducer.R > outfile.txt’. This pipes your input file into your mapper file which does it’s thing and then pipes the results to your reducer file which then “pumps up the jam” into an output file.  What you see in the outfile.txt is what Hadoop will produce. So it you don’t like what you see, you better do some more coding.
  • You CAN load packages into R in a Hadoop instance running in AMZN E M/R. There are a few caveats, of course:
  1. Your package has to work in R 2.7.1. (until AMZN upgrades to the next stable version of Debian.
  2. As far as I can tell, all the output has to come out of stdout. So if you want to end up with R objects which you use for other things, you should get comfortable with the serialize() command and reading text files back into R. Which, as you can see from this question, I am not yet comfortable with.
  3. There will be multiple instances of R running on every machine. So if they are all trying to download a package to the same directory, you are going to get file lock errors. One solution is to have each R instance create a directory for packages that includes the PID of the R instances. That way there’s no possibility for a conflict! Here’s an example of how I load the Hmisc package:
  • You’ll probably want to provide some data to R. This is done by uploading your files to S3 and then passing the “-cacheFile” option to Hadoop. To get the plyr package to load in R 2.7.1 I had to edit the package. I then uploaded the altered package thusly:

-cacheFile s3n://rdata/plyr_0.1.9.tar.gz#plyr_0.1.9.tar.gz

More to come later. I’ve gotta get back to the future.

You hold the elephant and I'll plug this in.

Using the R multicore package in Linux with wild and passionate abandon

One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It’s not uncommon for each of my simulations to take 20 seconds or more to complete (if you’re doing the math, that’s 55 hours for 10K sims or, approximately 453 games of solitaire) . Initially I ran my sims in R running on an Oracle VirtualBox (Oracle now owns Virtualbox! *gasp* ) running Ubuntu. Lately I’ve moved to running my sims on EC2 machines. I’m not yet doing RMPI clustering, although that is on my roadmap. Currently I just fire up a couple of 8 core instances and run 5K sims on each one then FTP the results back to my desktop. It’s not very sexy, but it gets the job done… I guess the same could be said of myself, except substitute “makes slurping sounds eating udon” in the place of “gets the job done.”

When running processor intensive crap (that’s a stochastic modeling term) the single threaded nature of R is painful. In Linux or Mac (i.e. NOT Windows) the multicore package is a real godsend. I did a quick code review and, from what I can tell, multicore exploits worm holes to travel back in time and reports your results in a fraction of the time you would expect it to take. Seriously. I expect that as the code matures my computer will fill up with simulation results from simulations which I have not even coded yet. It’s almost like magic, except without the rabbit and hat.

The crux of the package is a parallel-ized version of lapply() called mclapply(). I believe the mc stands for ‘magic carpet’ and is an allusion to the worm hole technology. So how does one harness this package for nefarious self interest doing parallel operations in R? The ultra short answer is: write your R code so that the most processor intensive bit is done with an lapply() function. Then replace the lapply() with mclapply().  Of course you have to load the multicore package before you run it. But that’s basically it.

How I implement mcapply() is thusly: I build a table with all my random draws for my simulations. So if I have 20 variables and want to run 10,000 simulations then I’ll build a data frame with all 200,000 values (generally 10K rows and 21 columns for 20 variables + and index). The index keeps track of the draw number. Then I have code that performs the ‘valuation’ based on a single observation of the 20 variables. I wrap the valuation step in a function and then call the valuation process 10,000 times with mclapply(). So it might look something like this:

myOutput <- mclapply( drawList, function(x) valuationReturns(drawNumber=x))

The drawList object is simply a list of the possible indexes (i.e. 1:10000). When the code has iterated over each value from drawList the results will be in the myOutput object. Tada!

I recommend the htop program for tracking what’s going on with processor utilization in Linux (I presume Mac too if you ask Steve Jobs nicely). If everything is cranking well, and you have 8 cores, you might see an image that looks something like this:

I don’t understand time travel, but I’ve found that I have better luck if I set mc.preschedule=FALSE. Apparently prescheduled magic carpets are finicky. If I leave mc.preschedule to the default of TRUE then I find that often some of my cores go underutilized.

Let me know if you have other multicore tips and tricks.

If you want to give me shit for running my simulations as root, feel free. I’m impervious to your “best practices” mumbo jumbo. La la la la la la!! Not listening!

Special thanks to John Cavazos over at the University of Delaware from whom I stole the MC for Dummies image. John, your a gentleman and a humble scholar. Damn few of us left.

Remote Backup Fail and How to Silently Copy Files

Today I called my firms desktop support to talk to them about how to get Iron Mountain Connected Backup to archive files located somewhere other than [C:\Documents and Settings\user\] and through talking with my desktop support guy I discovered that it doesn’t support that. Oh, and by the way it’s a “desktop backup” so it’s not backup up my MS Access files or Outlook PST files. I told the guy that I had gone in and made sure it was backing those files up and they were checked in the UI. He informed me that it may look like they are backed up, but I can’t restore them. To which I responded “Any developer who writes backup software that will backup a file it can’t restore should be kicked squarely in the nuts and then never allowed near a computer for life” I’m not kidding. Honest to god I would kick an Iron Mountain developer right in the baby maker for passing this piece of shit program off as “enterprise ready.” The only way this program could be more useless is if it actually deleted files from my PC instead of backing them up. If the software is crippled because they are selling it as a “desktop backup” then, by god, they better tell me that in big fucking blinking letters and a marching band playing John Philip Sousa on my lap.

Alternatives: I’ve been running Jungle Disk at home and really like it. I could use that at work except I have not set up an Amazon or RackSpace account with my work credit card. But I am in Chicago and my database server/ file server is in Dallas TX. So I decided to just create a mirror on my laptop onto a shared drive on my server. There’s lots of ways to do this, but the path I chose was to use RoboCopy, a command line copy tool from Microsoft that is part of the Windows Server 2003 Resource Kit. I’m running XP and I wanted the mirroring of my machine to be invisible, silent, and scheduled. To do this I found I needed to take the following steps:

  1. Install RoboCopy
  2. Create a batch file to mirror the directory I wanted
  3. Create a windows script to call the batch silently
  4. Schedule the windows script to run automagically

Install RoboCopy: Download the Windows Server 2003 Resource Kit and install it. Very easy.

Create a batch file to run RoboCopy: I named mine c:/backup.bat and it looks something like this:

Set Source=”C:\Documents and Settings\jdlong”
Set Dest=”\\myDallasServer\backup\jdlong”
Robocopy %Source% %Dest% /MIR /Z /R:0  >nul

This simply sets the source and destination and then runs RoboCopy with the /MIR (mirror) and /Z (restartable) switches invoked

Create a windows script: The problem with the batch file is that it is noisy when it runs. Even piping the output to nul it still produces a CMD window that stays up until it finishes running. That’s where the Windows Script file comes into play. It calls the batch file but hides the CMD window. I created a file called c:\runBackup.vbs that has this in it:

Set WshShell = CreateObject(“WScript.Shell”)
WshShell.Run chr(34) & “C:\backup.bat” & Chr(34), 0
Set WshShell = Nothing

Schedule the windows script: Control Panel -> Scheduled Tasks. Then I created a new task that runs  c:\runBackup.vbs every night at 11PM. The only down side is that when I change my password I have to remember to change the password associated with the scheduled task or it will fail.
The only upside is that I figured out that Iron Mountain sucks prior to having data loss. I got lucky. Next week I am going to test my backup. And then test it every quarter after that. And I won’t depend on my corporate IT do to my backups.

Struggling with apply() in R

It’s common knowledge that I struggle wrapping my head around the apply functions in R. That is illustrated very clearly in the following discussion on Stack Overflow:

apply_struggle

Dirk’s comment is actually spot on. I’ve asked the same damn question at least 4-5 times. Only I didn’t really understand it was the same question. That’s one of the problems of not really being good at something; it’s hard to think abstractly about it. I’m not really good at R, so sometimes I don’t realize that multiple concepts are related. As I talk with other new users of R it’s clear that unless they come from a programming language with an apply-esque construct they likely are struggling with R. I think most of the confusion comes from a) not understanding what data format apply() is going to return and b) not understanding anonymous functions.

With this in mind I did a little screencast illustrating how this struggle plays out for a new users. I also show why I use the plyr package for much of the stuff other folks use apply() for.

Any feedback you have is appreciated. This is my first stab at a screencast, so I am still trying to figure out the best approach/method as well as how many drinks puts me on the Ballmer Peak.

EDIT: it’s been pointed out that I misuse some terminology a number of times. I should have named my year vector “yearVector.” By calling it “yearList” I then refer to the vector as a list. I was using “list” in the vernacular, but since list is a specific R data structure it is confusing that I named a vector a name with “list” in it.

Loading Big (ish) Data into R

So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff:

This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)

2gib

I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. Matti Pastell (@mpastell) recommended using the sqldf (SQL to data frame) package to do the import. I’ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn’t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an instance of the sqlite database server then shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn’t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with this discussion on the R mailing list.

So enough background, here’s my setup: I have a Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here’s the code I ran to test:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, ‘bigdf.csv’, quote = F)

That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That’s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:

library(sqldf)
f <- file(“bigdf.csv”)
system.time(bigdf <- sqldf(“select * from f”, dbname = tempfile(), file.format = list(header = T, row.names = F)))

That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don’t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.

So how long did it take to run? Just under 5 minutes.

So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:

system.time(big.df <- read.csv(‘bigdf.csv’))

And I would love to tell you how long that took to run, but it’s been running for half an hour all night and I just don’t have that kind of patience.

-JD

Using Amazon EC2 to Thwart Crappy Internal IT Services

ec2 tweet

The alternative title of this blog post is “How to get your sorry ass fired by violating your internal IT policies.” So keep that in mind as you read this.

I say lots of silly crap. Twitter allows me the pleasure of sharing this blather with the world. I was a little surprised that of all the things I have said over the last few months the above Tweet received the most discussion. Apparently this tweet captured the imagination and consternation of some fellow Tweeters. I had people follow up with me and basically ask, “what do you mean?” Twitter is good for a sound bite, but less so for an elaborate answer. Which brings us to this:

What are the top ways Amazon EC2 can allow a business user to escape the manipulative and counterproductive grip of corporate IT? Well I’m glad you asked!

1) Over-restrictive web filtering policies:  When I worked as a risk manager for a Fortune 500 insurance firm I was shocked on the first day when I could not search Google Groups. At the time Google Groups was one of my favorite resources for figuring out everything from SQL syntax to Excel formulas. The firm, like most firms, outsourced the filtering of web content. Apparently they signed up for “Super Freaking Restrictive” filtering. I could not even search the web for “Ubuntu” as all sites with the word Ubuntu in the title or with the world “Ubuntu” passed as a form submission were blocked. Apparently Ubuntu is not just a Linux distro, but also a militant organization of African computer programmers, or something. So how did I get around this with EC2? I would fire up an EC2 Ubuntu instance running Squid proxy before I left home, then ssh into the cloud from work and use a little SSH port forwarding to route my web traffic through the ssh connection and out via Squid. I set up my EC2 instance to listen for ssh on port 443 and my firm’s firewall would let the connection pass as it assumed it was simply ssl traffic into Amazon. Brilliant!

2) Under powered database servers: At another point I was responsible for data analytics on a portfolio of insurance policies. I had to join together data from multiple systems (underwriting, admin, claims, etc.). The firm was an Oracle shop and none of the Oracle machines had enough user space for me to make the big ass join that had to be made in order to cobble together my analytics. For a while I hobbled along using PROC SQL in SAS to bring all the data together inside of SAS running on a PC. Finally I just gave up and built my own data mart in the cloud. And I could totally cut my internal IT politics out of the system. Whew, once the politics and begging for resources was over I could kick ass at analytics without having to beg borrow and plead for permissions and space.

3) Failure to backup desktop machines / inadequate shared drive space: Another experience I had was with a firm that decided it was a good policy to NOT back up desktop PCs at all. Each department was given shared drive space on a central server where “business critical” files were supposed to be kept (whatever the hell that means). Only the files on the central server were backed up. I was in the risk management department (ironically) and we had a whopping 100 MB allocated to us. Yes, this was 2004 and 100 MB was not enough to hold 2 years of risk reviews. Not to mention any ad hoc analysis and all the supporting documents. So everyone had their desktop drives, at least one USB drive, and no off site backup. It was during this period that I discovered Jungle Disk which allows client side encrypted data to be backed up to Amazon! Off site backup problem solved! And, once again, corp IT cut out of the system. (yes, this is a use of S3, not EC2) By the way, I paid for backups out of my own pocket because I felt it was very important. Well, I did have the firm buy me books which I happily kept when I left. We’ll call it even.

Let me reiterate that all three of the above uses may have put me in direct violation of my corporate IT policies. And let me also state that ultimately I found a job at a firm where internal IT sees their job as helping the business units get crap done. If you are an IT professional and you find your self thinking, “damn, I have to make sure I restrict my users from all of these crafty uses of EC2″ then, jackass,you are the problem with your firm’s IT department. If you see your job as stopping users then you are a useless burden on your firm and you should be not only fired, but spat upon. The way to prevent users from doing these, and other “shadow IT” behaviors is to provide the IT services that help your users be awesome! If you do that then you don’t have to worry about what your users are up to. They’ll be too damn busy being awesome to have time to mess with Amazon EC2.

All the examples above took place at previous places of employment. I currently use Amazon EC2 in order to scale some of my analytics, but it is done with the knowledge and support of my internal IT team. They fully understand what I am doing and they want to help me be awesome at analysis. It’s amazing how much less time I am wasting these days now that I don’t have to be so creative about avoiding the manipulative and counterproductive intervention of my internal IT team.