Starting an EC2 Machine Then Setting Up a Socks Proxy… From R!

I do some work from home, some work from an office in Chicago and some work on the road. It’s not uncommon for me to want to send all my web traffic through an encrypted tunnel. In one of my previous blog posts I alluded to using Amazon EC2 as a way to get around your corporate IT mind control voyeurs, er, service providers. This tunneling method is one of the 5 or so ways I have used EC2 to set up a tunnel. I used to fire these tunnels up manually by starting an instance in the AWS Management Console, then opening a shell prompt and entering:

ssh -i ~/MyPersonalKey.pem -D 9999 root@ec2-184-73-41-72.compute-1.amazonaws.com

the -i switch tells ssh to use my RSA identity file stored in ~/MyPersonalKey.pem

the machine name (ec2-184-73-41-72.compute-1.amazonaws.com) I get from the AWS Management Console

the -D is the magic. -D opens a dynamic port forwarding tunnel between my Linux box and the EC2 machine. This is, for all intents and purposes, an encrypted SOCKS proxy on port 9999 of localhost. Then I just have to change my proxy settings in Firefox to use a SOCKS host (localhost, port 9999).

Now that’s all pretty easy. And I like easy. But it’s not easy ENOUGH. You see, I’m lazy. I’m not just lazy in the “I’ll do it mañana” sort of way, but in the “I’m too damn lazy to click my mouse 5 times” way.

So I want this easier. Well, I can make the proxy settings in Firefox easier through the use of the Quick Proxy extension for Firefox. That’s a good start. It turns the proxy on and off with a single mouse click. But I still have to go into the AWS management web site, fire up a machine, then log in via SSH. Let’s make that part easier!

While it’s not simple to install and configure, the EC2 command line tools are going to be required in order to make a script that fires up EC2 and then connects to the instance with ssh. I struggled getting the tools to run until I found this tutorial.

Your file locations and names may be different than the tutorial. Change appropriately. I followed the tutorial instructions but I created a key named ec2ApiTools which will come in handy later.

After you get the EC2 tools up and running and you can do something like list the available AMIs without an error, you can stop with the tutorial. I’ve been doing a lot of shell scripting lately so I said to myself, “Self, let’s script the ssh connection in R!” For the record, I always end my imperatives with an exclamation point which I verbally pronounce as, “BANG!” As a result, when I talk to myself it sounds like two 10-year-old boys playing cops and robbers. Anyhow, I did script it with R using Rscript. Because I’m a man who listens to myself.

And since you were kind enough to slog through my channeling the drunken ghost of James Joyce, here’s my script:

If you’re reading this in an RSS reader or for some other reason don’t see an R script above, here’s your link.

The only two EC2 API commands I use in the script are ec2-run-instances, which starts the instance, and ec2-describe-instances, which gives me a list of running instances and their details. The rest of the script is simply parsing the output and figuring out which instance was started last.
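For anyone who can’t see the embedded script, the parsing step can be sketched roughly like this. This is not the original script, just a minimal illustration. The sample lines are made up, the column layout of the real ec2-describe-instances output may differ (which is why the sketch hunts for the host name and launch time with regular expressions instead of counting columns), and latest_instance_host() and ssh_command() are hypothetical helper names.

```r
# Made-up sample of tab-separated describe output; the real layout may differ.
sample_output <- c(
  "RESERVATION\tr-11111111\t123456789012\tdefault",
  "INSTANCE\ti-aaaa1111\tami-12345678\tec2-184-73-41-72.compute-1.amazonaws.com\trunning\tec2ApiTools\t2010-04-20T14:05:11+0000",
  "RESERVATION\tr-22222222\t123456789012\tdefault",
  "INSTANCE\ti-bbbb2222\tami-12345678\tec2-75-101-200-9.compute-1.amazonaws.com\trunning\tec2ApiTools\t2010-04-21T09:30:45+0000",
  "INSTANCE\ti-cccc3333\tami-12345678\t\tterminated\tec2ApiTools\t2010-04-22T08:00:00+0000"
)

# Return the public DNS name of the running instance with the latest launch time.
latest_instance_host <- function(lines) {
  host_pat <- "ec2-[0-9-]+\\.[a-z0-9.-]*amazonaws\\.com"
  time_pat <- "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
  inst <- grep("^INSTANCE", lines, value = TRUE)
  # Keep only running instances that show both a host name and a timestamp.
  keep <- grepl("\trunning\t", inst, fixed = TRUE) &
    grepl(host_pat, inst) & grepl(time_pat, inst)
  inst <- inst[keep]
  if (length(inst) == 0) stop("no running instance found in the output")
  hosts <- regmatches(inst, regexpr(host_pat, inst))
  times <- regmatches(inst, regexpr(time_pat, inst))
  launched <- as.POSIXct(times, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
  hosts[which.max(as.numeric(launched))]
}

# Build the same ssh command I used to type by hand.
ssh_command <- function(host, key_file = "~/MyPersonalKey.pem", port = 9999) {
  sprintf("ssh -i %s -D %d root@%s", key_file, as.integer(port), host)
}

ssh_command(latest_instance_host(sample_output))

# Against the real tools (assuming they are on the PATH) it would look like:
# lines <- system("ec2-describe-instances", intern = TRUE)
# system(ssh_command(latest_instance_host(lines)))
```

Keeping the parsing in a function that takes plain text means it can be tried on pasted output without starting (and paying for) an instance.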

I’ve now set up a launcher panel item that starts the script. Then when I see the xterm window come up I click the little red button in the lower right corner of my browser which switches on the Firefox proxy. Then I’m safe to surf Soldier of Fortune Magazine without the interference of my corp firewall.

Bootstrapping the latest R into Amazon Elastic Map Reduce

I’ve been continuing to muck around with using R inside of Amazon Elastic MapReduce jobs. I’ve been working on abstracting the lapply() logic so that R will farm the pieces out to Amazon EMR. This is coming along really well, thanks in no small part to the Stack Overflow [r] community. I have no idea how crappy coders like me got anything at all done before the Interwebs.

One of the immediate hurdles faced when trying to use AMZN EMR in anger is that the default version of R on EMR is 2.7.1. Yes, that is indeed the version that Moses taught the Israelites to use while they wandered in the desert. I’m impressed by your religious knowledge. At any rate, all kinds of things go to hell when you try to run code and load packages in 2.7.1. When I first started fighting with EMR the only solution was to backport my code and alter any packages so they would run in 2.7.1. Yes, that is, as Moses would say, a Nudnik. Nudnik also happens to be the pet name my neighbors have given me. They love me. Where was I? Oh yeah, Methuselah’s R version.

Recently Amazon released a neat feature called “Bootstrapping” for EMR. Before you start thinking about sampling and resampling and all that crap, let me clarify. This is NOT statistical bootstrapping. It’s called bootstrapping because it’s code that runs after each node boots up, but before the mapper procedure runs. So to get a more modern version of R loaded onto each node I set up a little script that updates the sources.list file and then installs the latest version of R. And since I’m a caring, sharing guy, here’s my script:

And if that doesn’t show up for some reason, you can find all 5 lines of its bash glory here over at github.

If you’re not conveniently located in Chicago, IL you may want to change your R mirror location. The bootstrap action can be set up from the EMR web GUI or, if you’re firing the jobs off using the elastic-mapreduce command line tools, you just add the following option: “--bootstrap-action s3://myBucket/bootstrap.sh” assuming myBucket is the bucket with your script in it and bootstrap.sh contains your bootstrap shell script. And then, as my buddies in Dublin say, “Bob’s your mother’s brother.”

And before you ask, yes, this slows crap down. I’ll probably hack together a script that will take the R binaries and other needed upgrades out of Amazon S3 and load them in a bootstrap action which will greatly speed things up. The above example has one clear advantage over loading binaries from S3: It works right now. And remember folks, code that works right now kicks code that “might work someday” right in the balls. And then mocks it while it cries.

Chicago R Meetup: Healthier than Drinking Alone

I’m kinda blown away by the number of folks who have joined the Chicago R User Group (RUG) in the last few weeks. As of this morning we have 65 people signed up for the group and 25 who have said that they are planning on attending the meetup this Thursday (yes, only 3 days away!) I’m very pleased that this many people in Chicago find the R language interesting and/or valuable. Of course, there is the possibility that some of the 25 who are attending are simply hoping for some free beer. I was a member of a vegan society for 2 years because they had free beer. The week I accidentally showed up with a six pack of White Castle sliders really blew my cover. That’s how I discovered that you can scare off angry vegans by waving a steaming hot onion covered meat-like patty in their face. True story. And when I say “true story” I mean “total lie”.

By the way, I’m already recruiting presenters for next month’s RUG meetup. And I’m also looking for locations. So if you have an idea for either, let me know. I promise to not throw any mini burgers at you.

Virtual Conference: R the Language

On Tuesday May 4th at 9:30 PM central, 10:30 eastern, I’ll be giving a live online presentation as part of the Vconf.org open conference series. I’ll be speaking about R and why I started using R a couple years ago. This is NOT going to be a technical presentation but rather an illustration of how an R convert was created and why R became part of my daily tool set.

If you’re not familiar with the vconf.org project, you should read a little about it. It’s just getting started but I love the idea that it’s not for profit and all presentations are licensed under Creative Commons. You know that cool new technology you’ve been playing with? Yeah that one. You really should give a vconf about it. I know I’d like to hear about it!

Simulating Dart Throws in R

Back in November 2009 Wired wrote an article about some grad students who decided to try to stochastically model throwing darts. Because I don’t actually read printed material I didn’t see the article until a couple of months ago. My immediate thought was, “hey, I drink beer. I throw darts. I build stochastic models. Why haven’t I done this?” Well we all know why I haven’t done this. I have a job and a 2 year old daughter and I like my wife. Well a funny thing happened a few weeks ago. I sat down and was thinking about this problem and then 5 hours later I had a working dart simulator in my text editor. I don’t remember writing this. So Occam’s Razor says that the most likely explanation is the simplest explanation. So clearly I was abducted by aliens and someone broke into my office and built a dart simulator.

I do reinsurance modeling to pay the bills and it immediately hit me that this type of modeling is very similar to what I do for work. This similarity became the impetus for my presentation at R in Finance 2010 which starts today.

I dumped the dart board code into a github gist which can be found here:

If the embedded code is not showing up, you can get to it directly on Github.
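If you just want the flavor of the idea without clicking through, here is a stripped-down sketch. It is not the code from the gist. It assumes the usual tournament board dimensions in millimetres and assumes that throws scatter around the aim point with independent normal errors of the same size in both directions, which is my simplification for illustration. score_dart() and simulate_throws() are names made up for this sketch.

```r
# Sector values clockwise, starting with 20 at the top of the board.
sectors <- c(20, 1, 18, 4, 13, 6, 10, 15, 2, 17,
             3, 19, 7, 16, 8, 11, 14, 9, 12, 5)

# Score a dart landing at (x, y) mm from the centre, y pointing up.
score_dart <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  # Angle in degrees measured clockwise from straight up, in [0, 360).
  theta <- (atan2(x, y) * 180 / pi) %% 360
  # Each sector spans 18 degrees and the 20 is centred on straight up.
  base <- sectors[floor(((theta + 9) %% 360) / 18) + 1]
  ifelse(r <= 6.35, 50,                          # inner bull
  ifelse(r <= 15.9, 25,                          # outer bull
  ifelse(r > 170, 0,                             # off the scoring area
  ifelse(r >= 162, 2 * base,                     # double ring
  ifelse(r >= 99 & r <= 107, 3 * base, base))))) # treble ring or single
}

# Throw n darts at (aim_x, aim_y) with normal scatter of size sd (in mm).
simulate_throws <- function(n, aim_x, aim_y, sd) {
  score_dart(rnorm(n, aim_x, sd), rnorm(n, aim_y, sd))
}

score_dart(0, 103)   # dead centre of the treble 20: 60
score_dart(0, 0)     # bullseye: 50

# Compare average scores for a steady and a shaky thrower.
set.seed(1)
for (s in c(10, 40)) {
  cat("sd =", s,
      " aim treble 20:", mean(simulate_throws(10000, 0, 103, s)),
      " aim bull:", mean(simulate_throws(10000, 0, 0, s)), "\n")
}
```

Changing the aim point and the scatter is all it takes to start asking where a given thrower ought to aim.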

I don’t even know how wrong I am!

"as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know." US Defense Secretary Donald Rumsfeld, February 12, 2002

I’ve been a long time reader of the blog “Messy Matters” (which invokes terrible images now that I am potty training a toddler). The authors, Sharad Goel and Daniel Reeves, are academics who work in the Microeconomics and Social Systems (get it, MESS?!?) lab funded by Yahoo!. (What do Strunk and White say about punctuation after a proper noun which includes punctuation as part of the proper noun?) Anyhow, the Messy Matters blog had a very interesting post recently about testing to see if you are overconfident. The gist is this: take a test where, instead of answering each question exactly, you give an upper and lower bound which you think represents a 90% confidence band around the right answer. If you haven’t seen this done, you should go take a look and then read the rest of this blog post.

I didn’t do worth a shit on their “overconfidence” test. I think I got 5 of the ranges right. The other 5 times the real answer fell outside my bounds. As I was answering the questions I had this strong feeling of not being confident at all. I was very tempted to answer HUGE ranges on some of the questions because I felt totally unable to make a good guess. But I took a swag and tried to put in big ranges, but not TOO big, if I didn’t know the answer. I’m not the only one who struggled with this test. In their summary of results I fall in the 76th percentile. Hey, I’m above average… or at least above the median. Clearly I didn’t know how wrong I was in many cases. But does this mean I am “overconfident”? I don’t think so. I think this means something a bit more subtle. This exercise reminded me of creating a forecasting model and trying to predict values far outside the training data.
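To put a number on how far 5 out of 10 is from well calibrated, here is a quick back-of-the-envelope check. It assumes the 10 questions are independent and that a perfectly calibrated person hits each 90% range with probability 0.9. hit_rate() is a hypothetical helper and the bounds fed to it are invented, not my actual answers.

```r
# Share of questions where the truth landed inside the stated range.
hit_rate <- function(lower, upper, truth) {
  mean(truth >= lower & truth <= upper)
}

# Invented example: three ranges, two of which contain the true value.
hit_rate(lower = c(10, 100, 1000),
         upper = c(20, 200, 2000),
         truth = c(15, 250, 1500))

# Chance that a perfectly calibrated person gets 5 or fewer of 10 ranges right.
n_questions <- 10
hits <- 5
p_this_bad <- pbinom(hits, size = n_questions, prob = 0.9)
round(p_this_bad, 4)   # 0.0016, i.e. roughly 1 time in 600
```

So the result is clearly not just bad luck. The question is what to call it.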

Having read the book On Intelligence I am convinced that one of the main functions of the human brain (or at least the prefrontal cortex) is to be a pattern matching machine. We all build little mental models in our head all the time. And these models are trained, by definition, on the situations which we run into day in and day out. And these models are VERY accurate around the mean (i.e. around the experiences we are used to having). For example, how small of a piece of sand can you feel between your teeth? Our brains have a ‘model’ of what it normally feels like when our teeth close against each other. The slightest unexpected disruption in that pattern triggers our brain to notice. Ever miss a step when walking down stairs? When did you know you were in trouble? Probably when your foot was about 2 inches past where you expected the next step to be. You didn’t have to wait for your face to hit the railing before your mental model of step walking was ringing warning bells. We humans are freaking amazing mental model makers!

Well we’re amazing… except when we suck. When we suck is when we are faced with trying to predict something that is orders of magnitude outside our experience. The question on the MESS test which I struggled with the most was the question about how much an empty 747 weighs. I don’t ever deal with massive weights. Ever. I only had two reference points which I could think up: 1) my first car was a ‘69 Cadillac which I know weighed 5,040 lbs. We used to call it “Two and a half tons of fun.” and 2) a hopper bottom rail car carries ~3500 bushels of corn which is ~196,000 lbs. And I’ve never been up next to a 747. But they are HUGE. I’ve seen pictures of the space shuttle riding around on the back of one of those bad boys. But they have to be pretty light relative to their volume because they have a lot of cargo room. And then I did the math on how many 1969 Cadillacs = 1 rail car of corn… almost 39!?!? But rail cars are not that big. I’ve climbed up on rail cars of grain. Kinda seems like it should be about 10 Cadillacs big. At that point I was pretty perplexed and just guessed a range which turned out to be WAAAY too high. It turns out that a 747 weighs around 360,000 lbs, which is less than 2 rail cars of corn (not including the actual cars, just the weight of the corn!). My intuition, as trained by my two data points, didn’t do worth a tinker’s damn at guessing the weight of airplanes.
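For the record, here is that arithmetic spelled out, using only the numbers quoted above. The 56 lbs per bushel figure is the standard bushel weight for shelled corn and is what 196,000 / 3,500 works out to.

```r
cadillac_lbs        <- 5040     # my '69 Cadillac
corn_lbs_per_bushel <- 56       # standard bushel of shelled corn
rail_car_corn_lbs   <- 3500 * corn_lbs_per_bushel   # 196000
empty_747_lbs       <- 360000

rail_car_corn_lbs / cadillac_lbs    # about 38.9 Cadillacs per rail car of corn
empty_747_lbs / rail_car_corn_lbs   # about 1.84 rail cars of corn per 747
empty_747_lbs / cadillac_lbs        # about 71.4 Cadillacs per 747
```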

But here’s the whole point of that last paragraph: If a human has no reference points and no experience with a domain, we (or at least I) can’t make good guesses and, more importantly, we can’t know how bad our guesses are! We CAN’T know how much we suck! If you think in terms of distributions, this exercise is akin to having a very small sample size and trying to guess the distribution’s second moment (the standard deviation). Well shit, we know in practice that if we have small samples the mean has a big error term but the standard deviation has an even BIGGER error term.
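A tiny simulation makes the small-sample point concrete. Treat this as a sketch under my own assumptions: samples of just 5 points drawn from a skewed lognormal distribution, which seems a fair stand-in for things like weights and dollar losses. How big the gap is depends on the distribution you pick, so don’t read the exact numbers as gospel.

```r
set.seed(42)
n    <- 5       # a very small sample
reps <- 20000   # number of simulated samples

# True mean and standard deviation of a lognormal(0, 1) distribution.
true_mean <- exp(0.5)
true_sd   <- sqrt((exp(1) - 1) * exp(1))

# One simulated sample per row.
draws    <- matrix(rlnorm(reps * n, meanlog = 0, sdlog = 1), nrow = reps)
est_mean <- rowMeans(draws)
est_sd   <- apply(draws, 1, sd)

# Root mean squared error, relative to the true value being estimated.
rel_rmse <- function(est, truth) sqrt(mean((est - truth)^2)) / truth

c(mean = rel_rmse(est_mean, true_mean),
  sd   = rel_rmse(est_sd, true_sd))

# How often does a 5-point sample understate the true spread?
mean(est_sd < true_sd)
```

The last line is the one that bites: with a skewed distribution most small samples never see the tail, so the estimated spread comes out too small and the confidence band built on it comes out too narrow.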

So simply put, providing confidence bands around a guess which is out of my area of experience is really hard and I’m not good at it. The biggest problem is knowing when I’m out of my domain. In both The Black Swan and Fooled by Randomness, Nassim Nicholas Taleb points out that the single strongest predictor for how badly someone is going to do at the confidence band game is if they hold a PhD. If anyone has a reference on the study he refers to, I’d love to see it. I’m resisting the temptation to throw stones at both actuaries and finance quants right here. And if I didn’t live in a glass house, I would!

My takeaway from all this is that confidence bands around a guess should not be expected to be statistically accurate. That’s the very nature of not knowing something at all. We don’t even know what we don’t know (thank you Donald Rumsfeld). The very definition of an expert might be someone who, if they don’t know the exact answer, can at least put confidence bands around their guess. In other words, you have to have some level of knowledge to put accurate confidence bands around a guess. And failing to be able to do that is not necessarily overconfidence. It might just be ignorance.

Chicago R User Group… It’s for the sexy people!

Give it up for Morris Day and The Mother Fucking Time!!!!

Morris Day, y'all!

I think we all know what Morris Day was talking about when he wrote the lyrics to “The Bird”:

Yes! Hold on now, this dance ain’t for everybody.
Just the sexy people.
White folks, you’re much too tight.
You gotta shake your head like the black folks.
You might get some tonight.
Look out!

That’s right, he was talking about the new R User Group in Chicago! a.k.a. Chicago RUG! We know that R is sexy because statistical analysis is sexy. That is, if you’re doing it right! Even Mike Driscoll at Dataspora knows that Data Geeks have to get their sexy on. There is no doubt that Chicago is sexy. The Second City is so damned sexy that Karen Abbott wrote Sin in the Second City and managed to get it on the NYT best sellers list. She makes me reconsider my agrarian interpretation of Chicago’s “meat packing” heritage. *rim shot* Thank you, thank you. I’ll be here all week. Try the veal!

If you’re in Chicagoland and reading this blog then you have every reason to get over to the Chicago R User Group web site and sign up! I’m looking forward to meeting all the Chicago R users in the near future. In case you’re afraid you won’t recognize me I’ll be the one that looks just like Morris Day… only white… and not as well dressed… and kinda nerdy. But otherwise, just like Morris.

Now shut up and dance!

Morris Day and the Time on Grooveshark!

The Future of Math is Statistics

The future of math is statistics… and the language of that future is R:

I’ve often thought there was way too little “statistical intuition” in the workplace. I think Arthur Benjamin would agree.

Lookup Performance in R

Rumor has it that Joe Adler, author of the O’Reilly book R in a Nutshell, has joined LinkedIn as a data scientist. But that does not keep him from still pumping out some interesting content over at OReilly.com. His latest article is about lookup performance in R. He does a great job giving code samples and explaining what he is doing. Worth reading, for sure.

Real-World, Real-Time Analytics

Stop wasting time reading my drivel. You need to head over to the DataWrangling.com blog and read Peter Skomoroch’s interview with Bradford Cross of FlightCaster.

Peter wrote up this interview back in August 2009, so I’m a little late to this party. There are some really great quotes in this interview. Here are a few of my fav quotes from Cross:

At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs.

Building layer upon layer of abstraction is a big key…  The technical term for this is “wrap the crap.”

Here’s a problem I think anyone who works with data and models can relate to:

I made a lot of mistakes early in my career in building trading models where I let my theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.