I don’t even know how wrong I am!

"as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know." US Defense Secretary Donald Rumsfeld, February 12, 2002

I’ve been a long time reader of the blog “Messy Matters” (which invokes terrible images now that I am potty training a toddler). The authors, Sharad Goel and Daniel Reeves are academics who work in the Microeconomics and Social Systems (get it, MESS?!?) lab funded by Yahoo!. (What does Strunk and White say about punctuation after a proper noun which includes punctuation as part of the proper noun?) Anyhow, the Messy Matters blog had a very interesting post recently about testing to see if you are overconfident. The gist is this: take a test and try to not answer each question exactly but give an upper and lower bound which you think represents a 90% confidence band around the right answer. If you haven’t seen this done, you should go take and look and then read the rest of this blog post.

I didn’t do worth a shit on their “overconfidence” test. I think I got 5 of the ranges right. The other 5 times the real answer fell outside my bounds. As I was answering the questions I had this strong feeling of not being confident at all. I was very tempted to answer HUGE ranges on some of the questions because I felt totally unable to make a good guess. But I took a swag and tried to put in big ranges, but not TOO big, if I didn’t know the answer. I’m not the only one who struggled with this test. In their summary of results I fall in the 76th percentile. Hey, I’m above average… or at least above the median. Clearly I didn’t know how wrong I was in many cases. But does this mean I am “overconfident”? I don’t think so. I think this means something a bit more subtle. This exercise reminded me of creating a forecasting model and trying to predict values far outside the training data.

Having read the book On Intelligence I am convinced that one of the main functions of the human brain (or at least the prefrontal  cortex)  is to be a pattern matching machine. We all build little mental models in our head all the time. And these models are trained, by definition, on the situations which we run into day in and day out. And these models are VERY accurate around the mean (i.e. around the experiences we are used to having). For example, how small of a piece of sand can you feel between your teeth? Our brains have a ‘model’ of what it normally feels like when our teeth close against each other. The slightest unexpected disruption in that pattern triggers our brain to notice. Ever miss a step when walking down stairs? When did you know you were in trouble? Probably when your foot was about 2 inches past where you expected the next step to be. You didn’t have to wait for your face to hit the railing before your mental model of step walking was throwing warning bells. Us humans are freaking amazing mental model makers!

Well we’re amazing… except when we suck. When we suck is when we are faced with trying to predict something that is orders of magnitude outside our experience. The question on the MESS test which I struggled the most was the question about how much an empty 747 weighs. I don’t ever deal with massive weights. Ever. I only had two reference points which I could think up: 1) my first car was a ‘69 Cadillac which I know weighed 5,040 lbs. We used to call it “Two and a half tons of fun.” and 2) a hopper bottom rail car carries ~3500 bushels of corn which is ~ 196,000 lbs. And I’ve never been up next to a 747. But they are HUGE. I’ve seen pictures of the space shuttle riding around on the back of one of those bad boys. But they have to be pretty light relative to their volume because they have a lot of cargo room. And then I did the math on how many 1969 Cadillacs = 1 rail car of corn… almost 39!?!? But rail cars on not that big. I’ve climbed up on rail cars of grain. Kinda seems like it should be about 10 Cadillacs big. At that point I was pretty perplexed and just guessed a range which turned out to be WAAAY too high. It turns out that a 747 weighs around 360,000 lbs, which is less than 2 rail cars of corn (not including the actual cars, just the weight of the corn!). My intuition, as trained by my two data points, didn’t do worth a tinker’s damn at guessing the weight of airplanes.

But here’s the whole point of that last paragraph: If a human has no reference points and no experience with a domain, we (or at least me) can’t make good guesses and, more importantly, we can’t know how bad our guesses are!  We CAN’T know how much we suck! If you think in terms of distributions, this exercise is akin to having a very small sample size and trying to guess the distribution’s second moment (the standard deviation). Well shit, we know in practice that if we have small samples the mean has a big error term but the standard deviation has an even BIGGER error term.

So simply put, providing confidence bands around a guess which is out of my area of experience is really hard and I’m not good at it. The biggest problem is knowing when I’m out of my domain. In both The Black Swan and Fooled by Randomness, Nassim Nicholas Taleb points out that the single strongest predictor for how bad someone is going to do at the confidence band game is if they hold a PhD. If anyone has a reference on the study he refers to, I’d love to see it. I’m resisting the temptation to throw stones at both actuaries and finance quants right here. And if I didn’t live in a glass house, I would!

My take away from all this is that confidence bands around a guess should not be expected to be statistically accurate. That’s the very nature of not knowing something at all. We don’t even know what we don’t know (thank you Donald Rumsfeld). The very definition of an expert might be someone who, if they don’t know the exact answer, can at least put confidence bands around their guess. In other words, you have to have some level of knowledge to put accurate confidence bands around a guess. And failing to be able to do that is not necessarily overconfidence. It might just be ignorance.

Chicago R User Group… It’s for the sexy people!

Give it up for Morris Day and The Mother Fucking Time!!!!

Morris Day, y'all!

I think we all know that Morris Day was talking about when he wrote the lyrics to “The Bird”:

Yes! Hold on now, this dance ain’t for everybody.
Just the sexy people.
White folks, you’re much too tight.
You gotta shake your head like the black folks.
You might get some tonight.
Look out!

That’s right, he was talking about the new R User Group in Chicago! a.k.a Chicago RUG! We know that R is sexy because statistical analysis is sexy. That is, if you’re doing it right! Even Mike Driscol at Dataspora knows that Data Geeks have to get their sexy on.  There is no doubt that Chicago is sexy. The second city is so damned sexy that Karen Abbott wrote Sin in the Second City and managed to get it on the NYT best sellers list. She makes me reconsider my agrarian interpretation of Chicago’s “meat packing” heritage. *rim shot* Thank you, thank you. I’ll be here all week. Try the veal!

If you’re in Chicagoland and reading this blog then you have every reason to get over to the Chicago R User Group web site and sign up! I’m looking forward to meeting all the Chicago R users in the near future. In case you’re afraid you won’t recognize me I’ll be the one that looks just like Morris Day… only white… and not as well dressed… and kinda nerdy. But otherwise, just like Morris.

Now shut up and dance!

Morris Day and the Time on Grooveshark!

The Future of Math is Statistics

The future of math is statistics… and the language of that future is R:

I’ve often thought there was way too little “statistical intuition” in the workplace. I think Author Benjamin would agree.

Lookup Performance in R

Rumor has it that Joe Adler, author of the O’Reilly Book R in a Nutshell, has joined Linked In as a data scientist.  But that does not keep him from still pumping out some interesting content over at OReilly.com. His latest article is about lookup performance in R. He does a great job giving code samples and explaining what he is doing. Worth reading, for sure.

Real-World, Real-Time Analytics

Stop wasting time reading my drivel. You need to head over the the DataWrangling.com blog and read Peter Skomoroch’s interview with Bradford Cross of FlightCaster.

Peter wrote up this interview back in August 2009, so I’m a little late to this party. There’s some really great quotes in this interview. Here’s a few of my fav quotes from Cross:

At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs.

Building layer upon layer of abstraction is a big key…  The technical term for this is “wrap the crap.”

Here’s a problem I think anyone who works with data and models can relate to:

I made a lot of mistakes early in my career in building trading models where I let me theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.

You can Hadoop it! It’s elastic! Boogie woogie woog-ie!

This blog's name in Chinese!

I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng (肏你娘) which your friend in grad school told you means “Live happy with many blessings”. Trust me, I’ve been hanging with Madam Wu and she told me it doesn’t mean that.

So how did I travel to the future to visit with Madam Wu, you ask? Well the short answer is Hadoop. Yeah, the cute little elephant. As I have told you before, multicore makes your R code run fast by using worm holes to shoot your results back from the future. Well Hadoop actually takes you to the future on the back of an elephant and you can bring your own results back! I couldn’t make this up if I tried, so you know it’s true! And what’s fantastic about all of this is Hadoop works with R! And Amazon will let you rent a time traveling elephant through their Elastic MapReduce service! I think Amazon coined the term “Time Travel as a Service” or TTaaS  generally pronounced as “ta-tas” in the industry. If you are a CTO be sure and use this in your next “vision statement” pitch so everyone will know you’re hip to all this cloud stuff.

So you use R and you want to travel into the future on the back of an elephant to visit Madam Wu and get your model results back, don’t you? Well it’s a damn good thing you read this blog because I’m going to give you the keys to the Wu dynasty and a little 福寿 while we’re at it.

I’ve never had an original thought in my life so I started with this discussion over at the AMZN E M/R discussion forum. Peter Skomoroch from Data Wrangling gives a very good example with all the data and code provided so you can run it yourself.  Pete’s example really shakes the yáng guǐzi, as we say in the future. In addition I read the documentation for David Rosenberg’s HadoopStreaming package which was good for insight, but I didn’t use the package as it’s really focused on the ‘big data’ problem.

That elephant is so freaking cute!

Prior to my foray into time travel, I knew that Hadoop could be used to process big text files and do something like rip out all the links and count them. But I thought that Hadoop was all about processing big data. I never paid attention to the big Hadoop elephant in the room because I don’t have big data. I have big CPU hogging models (mostly slow because I don’t code worth a shit). What got me reconsidering my world view was John Myles White’s comment on my multicore post earlier. John encouraged me to look into running my simulations on AMZN’s E M/R service using Hadoop streaming. So instead of giving Hadoop  a big fat text file to parse, I just gave it a text file with 10,000 rows each containing an integer from 1:10,000. Then I refactored my R code to read a line from stdin, trim it down to just the integer, and then go run the simulation with that number. When done I had it serialize the resulting model output and return that to stdout. Hadoop takes care of chopping up the input and pulling together the output.

I learned a few “gotchas” or, as we say in the future: 臭婊子(I think that should be plural). I’ll do a whole blog post on gotchas soon, but here’s the bullet points:

  • AMZN is currently running the version of Debian Linux named Lenny which has version 2.7.1 of R installed. No matter what the documentation says, don’t let Lenny tend to the rabbits.
  • Test all code by firing up an interactive Pig instance and logging in as ‘hadoop’. Instead of running Pig, run R and test your code. And as it says in the FAQ: “The Pig don’t care either way. ” Which, despite sounding like buggery, is the truth.
  • If your code runs inside of R on a Hadoop instance, drop back to the command line on the Hadoop instance and run ‘cat infile.txt | yourMapper.R | sort | yourReducer.R > outfile.txt’. This pipes your input file into your mapper file which does it’s thing and then pipes the results to your reducer file which then “pumps up the jam” into an output file.  What you see in the outfile.txt is what Hadoop will produce. So it you don’t like what you see, you better do some more coding.
  • You CAN load packages into R in a Hadoop instance running in AMZN E M/R. There are a few caveats, of course:
  1. Your package has to work in R 2.7.1. (until AMZN upgrades to the next stable version of Debian.
  2. As far as I can tell, all the output has to come out of stdout. So if you want to end up with R objects which you use for other things, you should get comfortable with the serialize() command and reading text files back into R. Which, as you can see from this question, I am not yet comfortable with.
  3. There will be multiple instances of R running on every machine. So if they are all trying to download a package to the same directory, you are going to get file lock errors. One solution is to have each R instance create a directory for packages that includes the PID of the R instances. That way there’s no possibility for a conflict! Here’s an example of how I load the Hmisc package:
  • You’ll probably want to provide some data to R. This is done by uploading your files to S3 and then passing the “-cacheFile” option to Hadoop. To get the plyr package to load in R 2.7.1 I had to edit the package. I then uploaded the altered package thusly:

-cacheFile s3n://rdata/plyr_0.1.9.tar.gz#plyr_0.1.9.tar.gz

More to come later. I’ve gotta get back to the future.

You hold the elephant and I'll plug this in.

Using the R multicore package in Linux with wild and passionate abandon

One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It’s not uncommon for each of my simulations to take 20 seconds or more to complete (if you’re doing the math, that’s 55 hours for 10K sims or, approximately 453 games of solitaire) . Initially I ran my sims in R running on an Oracle VirtualBox (Oracle now owns Virtualbox! *gasp* ) running Ubuntu. Lately I’ve moved to running my sims on EC2 machines. I’m not yet doing RMPI clustering, although that is on my roadmap. Currently I just fire up a couple of 8 core instances and run 5K sims on each one then FTP the results back to my desktop. It’s not very sexy, but it gets the job done… I guess the same could be said of myself, except substitute “makes slurping sounds eating udon” in the place of “gets the job done.”

When running processor intensive crap (that’s a stochastic modeling term) the single threaded nature of R is painful. In Linux or Mac (i.e. NOT Windows) the multicore package is a real godsend. I did a quick code review and, from what I can tell, multicore exploits worm holes to travel back in time and reports your results in a fraction of the time you would expect it to take. Seriously. I expect that as the code matures my computer will fill up with simulation results from simulations which I have not even coded yet. It’s almost like magic, except without the rabbit and hat.

The crux of the package is a parallel-ized version of lapply() called mclapply(). I believe the mc stands for ‘magic carpet’ and is an allusion to the worm hole technology. So how does one harness this package for nefarious self interest doing parallel operations in R? The ultra short answer is: write your R code so that the most processor intensive bit is done with an lapply() function. Then replace the lapply() with mclapply().  Of course you have to load the multicore package before you run it. But that’s basically it.

How I implement mcapply() is thusly: I build a table with all my random draws for my simulations. So if I have 20 variables and want to run 10,000 simulations then I’ll build a data frame with all 200,000 values (generally 10K rows and 21 columns for 20 variables + and index). The index keeps track of the draw number. Then I have code that performs the ‘valuation’ based on a single observation of the 20 variables. I wrap the valuation step in a function and then call the valuation process 10,000 times with mclapply(). So it might look something like this:

myOutput <- mclapply( drawList, function(x) valuationReturns(drawNumber=x))

The drawList object is simply a list of the possible indexes (i.e. 1:10000). When the code has iterated over each value from drawList the results will be in the myOutput object. Tada!

I recommend the htop program for tracking what’s going on with processor utilization in Linux (I presume Mac too if you ask Steve Jobs nicely). If everything is cranking well, and you have 8 cores, you might see an image that looks something like this:

I don’t understand time travel, but I’ve found that I have better luck if I set mc.preschedule=FALSE. Apparently prescheduled magic carpets are finicky. If I leave mc.preschedule to the default of TRUE then I find that often some of my cores go underutilized.

Let me know if you have other multicore tips and tricks.

If you want to give me shit for running my simulations as root, feel free. I’m impervious to your “best practices” mumbo jumbo. La la la la la la!! Not listening!

Special thanks to John Cavazos over at the University of Delaware from whom I stole the MC for Dummies image. John, your a gentleman and a humble scholar. Damn few of us left.

Remote Backup Fail and How to Silently Copy Files

Recently I’ve run into frustrations with Iron Mountain Connected Backup so I’ve been looking for alternatives.

Alternatives: I’ve been running Jungle Disk at home and really like it. I could use that at work except I have not set up an Amazon or RackSpace account with my work credit card. But I am in Chicago and my database server/ file server is in Dallas TX. So I decided to just create a mirror on my laptop onto a shared drive on my server. There’s lots of ways to do this, but the path I chose was to use RoboCopy, a command line copy tool from Microsoft that is part of the Windows Server 2003 Resource Kit. I’m running XP and I wanted the mirroring of my machine to be invisible, silent, and scheduled. To do this I found I needed to take the following steps:

  1. Install RoboCopy
  2. Create a batch file to mirror the directory I wanted
  3. Create a windows script to call the batch silently
  4. Schedule the windows script to run automagically

Install RoboCopy: Download the Windows Server 2003 Resource Kit and install it. Very easy.

Create a batch file to run RoboCopy: I named mine c:/backup.bat and it looks something like this:

Set Source=”C:\Documents and Settings\jdlong”
Set Dest=”\\myDallasServer\backup\jdlong”
Robocopy %Source% %Dest% /MIR /Z /R:0  >nul

This simply sets the source and destination and then runs RoboCopy with the /MIR (mirror) and /Z (restartable) switches invoked

Create a windows script: The problem with the batch file is that it is noisy when it runs. Even piping the output to nul it still produces a CMD window that stays up until it finishes running. That’s where the Windows Script file comes into play. It calls the batch file but hides the CMD window. I created a file called c:\runBackup.vbs that has this in it:

Set WshShell = CreateObject(“WScript.Shell”)
WshShell.Run chr(34) & “C:\backup.bat” & Chr(34), 0
Set WshShell = Nothing

Schedule the windows script: Control Panel -> Scheduled Tasks. Then I created a new task that runs  c:\runBackup.vbs every night at 11PM. The only down side is that when I change my password I have to remember to change the password associated with the scheduled task or it will fail.
The only upside is that I figured out that Iron Mountain sucks prior to having data loss. I got lucky. Next week I am going to test my backup. And then test it every quarter after that. And I won’t depend on my corporate IT do to my backups.

Struggling with apply() in R

It’s common knowledge that I struggle wrapping my head around the apply functions in R. That is illustrated very clearly in the following discussion on Stack Overflow:

apply_struggle

Dirk’s comment is actually spot on. I’ve asked the same damn question at least 4-5 times. Only I didn’t really understand it was the same question. That’s one of the problems of not really being good at something; it’s hard to think abstractly about it. I’m not really good at R, so sometimes I don’t realize that multiple concepts are related. As I talk with other new users of R it’s clear that unless they come from a programming language with an apply-esque construct they likely are struggling with R. I think most of the confusion comes from a) not understanding what data format apply() is going to return and b) not understanding anonymous functions.

With this in mind I did a little screencast illustrating how this struggle plays out for a new users. I also show why I use the plyr package for much of the stuff other folks use apply() for.

Any feedback you have is appreciated. This is my first stab at a screencast, so I am still trying to figure out the best approach/method as well as how many drinks puts me on the Ballmer Peak.

EDIT: it’s been pointed out that I misuse some terminology a number of times. I should have named my year vector “yearVector.” By calling it “yearList” I then refer to the vector as a list. I was using “list” in the vernacular, but since list is a specific R data structure it is confusing that I named a vector a name with “list” in it.

Loading Big (ish) Data into R

So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff:

This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)

2gib

I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. Matti Pastell (@mpastell) recommended using the sqldf (SQL to data frame) package to do the import. I’ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn’t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an instance of the sqlite database server then shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn’t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with this discussion on the R mailing list.

So enough background, here’s my setup: I have a Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here’s the code I ran to test:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, ‘bigdf.csv’, quote = F)

That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That’s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:

library(sqldf)
f <- file(“bigdf.csv”)
system.time(bigdf <- sqldf(“select * from f”, dbname = tempfile(), file.format = list(header = T, row.names = F)))

That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don’t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.

So how long did it take to run? Just under 5 minutes.

So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:

system.time(big.df <- read.csv(‘bigdf.csv’))

And I would love to tell you how long that took to run, but it’s been running for half an hour all night and I just don’t have that kind of patience.

-JD