Connecting to SQL Server from R using RJDBC

A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days and never was able to connect from R on Ubuntu to my corp SQL Server.

Joshua Ulrich was kind enough to help me out by pointing me to RJDBC which scared me a little (I’m easily spooked) because it involves Java. The only thing I know about Java is every time I touch it I spend days trying to get environment variables loaded just exactly the way it wants them. But Josh assured me that it was really not that hard. Here’s the short version:

Download the Microsoft SQL Server JDBC driver. There are Windows and *nix versions, so grab whichever you need. Unpack the driver in a known location (I used /etc/sqljdbc_2.0/). Then access the driver from R like so:

require(RJDBC)
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "/etc/sqljdbc_2.0/sqljdbc4.jar")
conn <- dbConnect(drv, "jdbc:sqlserver://serverName", "userID", "password")

# then build a query and run it
sqlText <- paste("
   SELECT * FROM myTable
  ", sep="")
queryResults <- dbGetQuery(conn, sqlText)

I have a few scripts that I want to run on both my Ubuntu laptop and my Windows Server. To accommodate that I made my scripts compatible with both by doing the following to my drv line:

if (.Platform$OS.type == "unix") {
  drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "/etc/sqljdbc_2.0/sqljdbc4.jar")
} else {
  drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "C:/Program Files/Microsoft SQL Server JDBC Driver 3.0/sqljdbc_3.0/enu/sqljdbc4.jar")
}

Obviously if you unpacked your drivers in different locations you’ll need to molest the code to fit your life situation.

EDIT: A MUCH better place to put the JDBC drivers in Ubuntu would be the /opt/ path as opposed to /etc/ which I used above. In Ubuntu the /opt/ directory is where one should put user executables and /etc/ should be reserved for packages installed by apt. I’m not familiar with all the conventions in Ubuntu (or even Linux in general) so I didn’t realize this until I got some reader feedback.

Be forewarned, RJDBC is pretty damn slow and it appears to no longer be in active development. For my use case, RODBC was clearly faster. But RJDBC works for me in Ubuntu and that was my biggest need.

Principal Component Analysis (PCA) vs Ordinary Least Squares (OLS): A Visual Explanation

Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was “Thanks to my college class I can do the math, but what does it MEAN?”

I’ve felt like this a number of times in my life. Many of my classes were so focused on the technical implementation that they kinda missed the section titled “Why I give a shit.” A perfect example was my Mathematical Principles of Economics class, which taught me how to manually calculate a bordered Hessian but, for the life of me, I had no idea why I would ever want to calculate such a monster. OK, that’s a lie. Later in life I learned that bordered Hessian matrices are a second derivative test used in some constrained optimizations. Not that I would EVER do that shit by hand. I’d use some R package and blindly trust that it was coded properly.

So back to PCA: as I was reading the aforementioned stats question I was reminded of a recent presentation that Paul Teetor gave at an August Chicago R User Group meetup. In his presentation on spread trading with R he showed a graphic that illustrated the difference between OLS and PCA. I took some notes, went home, and made sure I could recreate the same thing. If you have ever wondered what makes OLS and PCA different, open up an R session and play along.

Your Independent Variable Matters:

The first observation to make is that regressing x ~ y is not the same as y ~ x even in a simple univariate regression. You can illustrate this by doing the following:

set.seed(2)
x <- 1:100

y <- 20 + 3 * x
e <- rnorm(100, 0, 60)
y <- 20 + 3 * x + e

plot(x, y)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col="red")

xy.lm <- lm(x ~ y)
lines(predict(xy.lm), y, col="blue")

You should get something that looks like this:

So it’s obvious they give different lines. But why? Well, OLS minimizes the errors between the dependent variable and the model. Two of these errors are illustrated for the y ~ x case in the following picture:

But when we flip the model around and regress x ~ y then OLS minimizes these errors:

Ok, so what about PCA?

Well let’s draw the first principal component the old school way:

#normalize means and cbind together
xyNorm <- cbind(x=x-mean(x), y=y-mean(y))
plot(xyNorm)

#covariance
xyCov <- cov(xyNorm)
xyEigen <- eigen(xyCov)
eigenValues <- xyEigen$values
eigenVectors <- xyEigen$vectors

plot(xyNorm, ylim=c(-200,200), xlim=c(-200,200))
lines(xyNorm[, "x"], eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[, "x"])
lines(xyNorm[, "x"], eigenVectors[2,2]/eigenVectors[1,2] * xyNorm[, "x"])

# the largest eigenValue is the first one
# so that’s our principal component.
# but the principal component is in normalized terms (mean=0)
# and we want it back in real terms like our starting data
# so let’s denormalize it
plot(x, y)
lines(x, (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[, "x"]) + mean(y))
# that looks right. line through the middle as expected

# what if we bring back our other two regressions?
lines(x, predict(yx.lm), col="red")
lines(predict(xy.lm), y, col="blue")

PCA minimizes the error orthogonal (perpendicular) to the model line. So the first principal component looks like this:

The two yellow lines, as in the previous images, are examples of two of the errors which the routine minimizes.

So if we plot all three lines on the same scatter plot we can see the differences:

The x ~ y OLS and the first principal component are pretty close, but click on the image to get a better view and you will see they are not exactly the same.
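
If you want to double-check the hand-rolled eigen math, here’s a quick sanity check of my own (it’s not part of the original gist): prcomp() should return the same first-component slope we built from the eigenvectors, and the product of the two OLS slopes equals the squared correlation, which is why the red and blue lines only agree when the fit is perfect.

# sanity check (not in the original gist): compare the hand-rolled PCA slope
# with what prcomp() reports
pcaFit <- prcomp(cbind(x, y), center = TRUE, scale. = FALSE)
c(byHand = eigenVectors[2, 1] / eigenVectors[1, 1],
  prcomp = pcaFit$rotation[2, 1] / pcaFit$rotation[1, 1])

# slope(y ~ x) * slope(x ~ y) equals the squared correlation, so the two
# OLS lines only coincide when |cor(x, y)| is exactly 1
coef(yx.lm)[2] * coef(xy.lm)[2]
cor(x, y)^2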

All the code from the above examples can be found in a gist over at GitHub.com. It’s best to copy and paste from the GitHub gist as sometimes WordPress molests my quotes and breaks the codez.

The best introduction to PCA which I have read is the one I link to on Stats.StackExchange.com. It’s titled “A Tutorial on Principal Components Analysis” by Lindsay I Smith.

Third, and Hopefully Final, Post on Correlated Random Normal Generation (Cholesky Edition)

André-Louis Cholesky is my homeboy

When I did a brief post three days ago I had no plans on writing two more posts on correlated random number generation. But I’ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, Gappy calls me out and says, “the way mensches do multivariate (log)normal variates is via Cholesky. It’s simple, instructive, and fast.” And I think we’re all smart enough to read through Mr. Gappy’s comment and see that he’s saying I’m a complicated, opaque, and slow גוי. My wife called and said his list would be more accurate if he added ‘emotionally detached.’ I have no idea what she means.

At any rate, in response to Gappy’s comment, here is the third verse (same as the first). The crux of the change is the following lines:

library(Matrix)  # for nearPD()

# shift the mean of ourData to zero
ourData0 <- as.data.frame(sweep(ourData, 2, colMeans(ourData), "-"))

# Cholesky decomposition of the covariance matrix
C <- chol(nearPD(cov(ourData0))$mat)

# create a matrix of random standard normals
Z <- matrix(rnorm(n * ncol(ourData)), ncol(ourData))

# multiply the standard normals by the transpose of the Cholesky factor
X <- t(C) %*% Z

myDraws <- data.frame(as.matrix(t(X)))
names(myDraws) <- names(ourData)

# we still need to shift the means of the samples:
# shift the mean of the draws over to match the starting data
myDraws <- as.data.frame(sweep(myDraws, 2, colMeans(ourData), "+"))

Edit: When I first published this example, I didn’t shift the means prior to taking the cov(). I’ve since corrected that. Also thanks to @fdaapproved on Twitter who pointed out that I can replace the loop above with myDraws <- as.data.frame(sweep(t(X), 2, colMeans(ourData), "+"))

This method, which uses Cholesky decomposition, is how I initially learned to create correlated random draws. I think this method is comparable to the mvrnorm() method. mvrnorm() is handy because it wraps everything above in one single line of code. But the above method is reliant only on the Matrix package and that’s only for the nearPD() function. If you are familiar with the guts of the mvrnorm() function and the chol() function, I’d love for you to comment on any technical differences. I looked briefly at the code for both and quickly realized my matrix math was rusty enough that it was going to take a while for me to sort through the code.
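
If you want to kick the tires without digging my earlier posts out, here’s a self-contained sketch (the toy data and variable names are my own invention, not the gist) that runs the Cholesky recipe above next to mvrnorm() and compares the correlation matrices of the draws:

# self-contained comparison sketch: toy starting data, then draws via the
# Cholesky recipe above and via MASS::mvrnorm()
library(MASS)
library(Matrix)

set.seed(42)
n <- 1e4

# toy stand-in for ourData: 500 rows of two correlated normals
ourData <- as.data.frame(mvrnorm(500, mu = c(10, -5),
                                 Sigma = matrix(c(4, 3, 3, 9), 2)))
names(ourData) <- c("a", "b")

# the Cholesky route (same steps as the snippet above)
ourData0 <- as.data.frame(sweep(ourData, 2, colMeans(ourData), "-"))
C <- chol(nearPD(cov(ourData0))$mat)
Z <- matrix(rnorm(n * ncol(ourData)), ncol(ourData))
cholDraws <- as.data.frame(sweep(as.matrix(t(t(C) %*% Z)), 2,
                                 colMeans(ourData), "+"))

# the mvrnorm() route
mvDraws <- as.data.frame(mvrnorm(n, mu = colMeans(ourData),
                                 Sigma = cov(ourData)))

# both sets of draws should come back with roughly the same correlation
# structure as the starting data
round(cor(ourData), 2)
round(cor(cholDraws), 2)
round(cor(mvDraws), 2)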

If you want the whole script you can find it embedded below and on Github.

Even Simpler Multivariate Correlated Simulations

So after yesterday’s post on Simple Simulation using Copulas I got a very nice email that basically asked, “Dude, why are you making this so hard?” The author pointed out that if what I really wanted was a Gaussian correlation structure with Gaussian marginals, then I could simply use the mvrnorm() function from the MASS package. Well I did a quick

?mvrnorm

and, I’ll be damned, he’s right! The advantage of using a copula is the ability to simulate correlation structures where the strength of the correlation differs across different parts of the distribution. That gives you the flexibility to make the tails of the distributions more correlated, for example. But my example yesterday was purposefully simple… so simple that a copula was not even needed.

After creating my sample data all I really needed to do was this:

myDraws <- mvrnorm(1e5, mu=colMeans(ourData), Sigma=cov(ourData))

So I took my example from yesterday and updated it using the mvrnorm() code and, as is my custom, put up a GitHub gist. The code is embedded below as well. I added a little ggplot2 code at the end that will create a facet plot of the 4 distributions showing the shape of the distributions of both the starting data and the simulated data. The plot in the upper left of this post is the ggplot output.
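
In case the gist embed doesn’t load for you, here’s a rough sketch of that comparison plot (my own reconstruction with toy data, not the exact gist code): melt the starting data and the simulated draws into long format and facet the densities by variable.

# rough reconstruction of the comparison plot (toy data, not the gist):
# density of each variable, starting data vs. simulated draws
library(MASS)
library(ggplot2)
library(reshape2)

set.seed(1)
# toy stand-in for ourData: two correlated normal columns
ourData <- as.data.frame(mvrnorm(1000, mu = c(0, 5),
                                 Sigma = matrix(c(1, 0.8, 0.8, 2), 2)))
myDraws <- as.data.frame(mvrnorm(1e5, mu = colMeans(ourData),
                                 Sigma = cov(ourData)))

plotData <- rbind(data.frame(melt(ourData), source = "starting data"),
                  data.frame(melt(myDraws), source = "simulated draws"))

ggplot(plotData, aes(x = value, colour = source)) +
  geom_density() +
  facet_wrap(~ variable, scales = "free")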

EDIT: The email hipping me to this was sent by Dirk Eddelbuettel who’s been very helpful to me more times than I can count. I had omitted his name initially. However after confirming with Dirk, he told me it was OK to mention him by name in this post.

Stochastic Simulation With Copulas in R

You know we do!

A friend of mine gave me a call last week and was wondering if I had a little R code that could illustrate how to do a Cholesky decomposition. He ultimately wanted to build a Monte Carlo model with correlated variables. I pointed him to a number of packages that do Cholesky decomp but then I recommended he consider just using a Gaussian copula and R for the whole simulation. For most of my copula needs in R, I use the QRMlib package which is a code companion to the book Quantitative Risk Management: Concepts, Techniques and Tools by Alexander J. McNeil, Rudiger Frey and Paul Embrechts. The book is only loosely coupled (pun intended) with the code in the QRMlib package. I really wish the book had been written with code examples and tight linkage between the book and the code. Of course I’m the type of guy who prefers code snippets to mathematical notation.

I had some code where I used the QRMlib package, but it was really messy and fairly specific to my use case. So I whipped up a very simple example of how to create correlated random draws from a multivariate distribution. In this example I used normally distributed marginals and Gaussian correlation to keep things simple and easy to follow. Rather than blogging through the code, I added a shit load (metric ass ton, if you’re in Canada) of comments. The code is designed to be stepped through. So don’t just run the whole blob and wonder what happened.

Walk through the code and if you find any errors be sure and let me know.

The code is embedded in a Github gist below, but if you are reading this in an aggregator (shout out to R-Bloggers) you’ll need to manually go to the gist.
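
If the gist doesn’t render wherever you’re reading this, here’s a stripped-down sketch of the idea in base R plus MASS (my reconstruction, not the QRMlib code in the gist): correlated normals get pushed through pnorm() to become correlated uniforms (that’s the Gaussian copula), and the uniforms then get fed to whatever marginal quantile functions you like.

# stripped-down Gaussian copula sketch (base R + MASS; not the QRMlib code)
library(MASS)

set.seed(3)
n <- 1e4

# target correlation for the copula
rho <- matrix(c(1, 0.7,
                0.7, 1), 2)

# step 1: correlated standard normals
z <- mvrnorm(n, mu = c(0, 0), Sigma = rho)

# step 2: pnorm() turns them into correlated uniforms -- the Gaussian copula
u <- pnorm(z)

# step 3: feed the uniforms to the quantile functions of the marginals you
# want (normal here to match the post, but the q*() of anything works)
sims <- data.frame(a = qnorm(u[, 1], mean = 10, sd = 2),
                   b = qnorm(u[, 2], mean = -5, sd = 6))

cor(sims)  # off-diagonal should be close to 0.7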

Give Away Something Then Sell Something

Radiant Heat System. Not in my house... yet!

My wife and I bought a foreclosed house a few months ago. This house had been part of mortgage fraud and we bought it at auction. Interesting life experience, to say the least. The finished basement was built with radiant heat tubing poured into the concrete. These pipes are designed to be hooked to a hot water heater so the warm water can provide radiant heat through the floors in the basement. I love radiant heat in basements. It makes the floor warm and the whole basement feels cozy. The heat radiates up to the rest of the house and it’s fairly energy efficient.

The radiant heat system in our basement was never finished, however. The pipes were there but there was no hot water heater, pumps, or thermostat. I started scouring the Web for information on radiant heat systems. There are lots of sites selling radiant-heat-related bits, but it was VERY hard to find detailed info on the types of systems or what the options are for radiant heat. I wanted to educate myself on the pros and cons of different systems. Do I use a hot water heater? Maybe a boiler? Should I hook it up to my potable hot water heater? I was full of questions and struggled to find anything useful. Until I stumbled on Radiant Floor Company. Their whole business model is around selling assemblies to help the DIY market install/upgrade/maintain their radiant floor systems. And their site has, hands down, the best information about radiant floor systems. It’s not the prettiest site in the world, but they have a great intro, then sections on each major type of system.

They provided me information that I couldn’t get anywhere else. And as a result I’m probably going to buy my parts from them (they are working up a quote for me today!). I’ll probably go with an on-demand hot water heater and a fully closed system. And they will get my business because they gave me the best information AND they have good prices. It takes BOTH info and price to make a sale on-line. Amazon gives me info through their customer reviews and ratings. Radiant Floor Co gives me info through great documentation and background info. This combo of price + info means that some providers will compete on information + reasonable price while others will compete on absolute lowest price with little info. I love this. It gives me, as a consumer, options.

Starting an EC2 Machine Then Setting Up a Socks Proxy… From R!

I do some work from home, some work from an office in Chicago and some work on the road. It’s not uncommon for me to want to tunnel all my web traffic through a VPN tunnel. In one of my previous blog posts I alluded to using Amazon EC2 as a way to get around your corporate IT mind control voyeurs service providers. This tunneling method is one of the 5 or so ways I have used EC2 to set up a tunnel. I used to fire these tunnels up manually using the Amazon AWS Management Console then opening a shell prompt and entering:

ssh -i ~/MyPersonalKey.pem -D 9999 root@ec2-184-73-41-72.compute-1.amazonaws.com

The -i switch tells ssh to use my RSA identity file stored in ~/MyPersonalKey.pem.

The machine name (ec2-184-73-41-72.compute-1.amazonaws.com) I get from the AWS Management Console.

The -D is the magic. -D opens a dynamic port forwarding tunnel between my Linux box and the EC2 machine. This is, for all intents and purposes, an encrypted SOCKS proxy on port 9999 of localhost. Then I just have to change my proxy settings in Firefox to use a SOCKS host.

Now that’s all pretty easy. And I like easy. But it’s not easy ENOUGH. You see, I’m lazy. I’m not just lazy in the “I’ll do it mañana” sort of way, but in the “I’m too damn lazy to click my mouse 5 times” way.

So I want this easier. Well, I can make the proxy settings in Firefox easier through the use of the Quick Proxy extension for Firefox. That’s a good start. It turns on and off the proxy with a single mouse click. But I still have to go into the AWS management web site, fire up a machine then log in via SSH. Let’s make that part easier!

While they’re not simple to install and configure, the EC2 command line tools are required in order to make a script that fires up EC2 and then connects to the instance with ssh. I struggled getting the tools to run until I found this tutorial.

Your file locations and names may be different from the tutorial’s. Change them appropriately. I followed the tutorial instructions but I created a key named ec2ApiTools, which will come in handy later.

After you get the EC2 tools up and running and you can do something like list the available AMIs without an error, you can stop with the tutorial. I’ve been doing a lot of shell scripting lately so I said to myself, “Self, let’s script the ssh connection in R!” For the record, I always end my imperatives with an exclamation point, which I verbally pronounce as, “BANG!” As a result, when I talk to myself it sounds like two 10 year old boys playing cops and robbers. Anyhow, I did script it with R using Rscript. Because I’m a man who listens to himself.

And since you were kind enough to slog through my channeling the drunken ghost of James Joyce, here’s my script:

If you’re reading this in an RSS reader or for some other reason don’t see an R script above, here’s your link.

The only two EC2 API commands I use in the script are ec2-run-instances, which starts the instance, and ec2-describe-instances, which gives me a list of running instances and their details. The rest of the script is simply parsing the output and figuring out which instance was started last.
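
If the gist isn’t showing up for you, here’s a rough sketch of the approach (not my exact script; the AMI id, key names, and the tab-delimited output parsing are assumptions you’ll want to adjust for your own setup and tool version):

#!/usr/bin/env Rscript
# rough sketch, not the exact gist: start an instance with the EC2 API
# tools, poll until it has a public DNS name, then open an ssh -D tunnel

amiId   <- "ami-xxxxxxxx"        # your AMI id (placeholder)
keyName <- "ec2ApiTools"         # key pair created while following the tutorial
pemFile <- "~/MyPersonalKey.pem" # matching private key for ssh

# fire up the instance and pull the instance id off the INSTANCE line
runOut     <- system(paste("ec2-run-instances", amiId, "-k", keyName),
                     intern = TRUE)
instLine   <- grep("^INSTANCE", runOut, value = TRUE)[1]
instanceId <- strsplit(instLine, "\t")[[1]][2]

# poll ec2-describe-instances until a public DNS name shows up
publicDns <- ""
while (!grepl("amazonaws\\.com", publicDns)) {
  Sys.sleep(15)
  descOut  <- system(paste("ec2-describe-instances", instanceId),
                     intern = TRUE)
  descLine <- grep("^INSTANCE", descOut, value = TRUE)[1]
  fields   <- strsplit(descLine, "\t")[[1]]
  dnsField <- fields[grepl("amazonaws\\.com", fields)][1]
  if (!is.na(dnsField)) publicDns <- dnsField
}

# open the SOCKS tunnel on localhost:9999
system(paste0("ssh -i ", pemFile, " -D 9999 root@", publicDns))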

I’ve now set up a launcher panel item that starts the script. Then when I see the xterm window come up I click the little red button in the lower right corner of my browser which switches on the Firefox proxy. Then I’m safe to surf Soldier of Fortune Magazine without the interference of my corp firewall.

Bootstrapping the latest R into Amazon Elastic Map Reduce

I’ve been continuing to muck around with using R inside of Amazon Elastic Map Reduce jobs. I’ve been working on abstracting the lapply() logic so that R will farm the pieces out to Amazon EMR. This is coming along really well, thanks in no small part to the Stack Overflow [r] community. I have no idea how crappy coders like me got anything at all done before the Interwebs.

One of the immediate hurdles faced when trying to use AMZN EMR in anger is that the default version of R on EMR is 2.7.1. Yes, that is indeed the version that Moses taught the Israelites to use while they wandered in the desert. I’m impressed by your religious knowledge. At any rate, all kinds of things go to hell when you try to run code and load packages in 2.7.1. When I first started fighting with EMR the only solution was to backport my code and alter any packages so they would run in 2.7.1. Yes, that is, as Moses would say, a Nudnik. Nudnik also happens to be the pet name my neighbors have given me. They love me. Where was I? Oh yeah, Methuselah’s R version. Recently Amazon released a neat feature called “Bootstrapping” for EMR. Before you start thinking about sampling and resampling and all that crap, let me clarify. This is NOT statistical bootstrapping. It’s called bootstrapping because it’s code that runs after each node boots up, but before the mapper procedure runs. So to get a more modern version of R loaded onto each node I set up a little script that updates the sources.list file and then installs the latest version of R. And since I’m a caring, sharing guy, here’s my script:

And if that doesn’t show up for some reason, you can find all 5 lines of its bash glory here over at github.

If you’re not conveniently located in Chicago, IL, you may want to change your R mirror location. The bootstrap action can be set up from the EMR web GUI, or if you’re firing the jobs off using the elastic-mapreduce command line tools you just add the following option: "--bootstrap-action s3://myBucket/bootstrap.sh", assuming myBucket is the bucket with your script in it and bootstrap.sh contains your bootstrap shell script. And then, as my buddies in Dublin say, “Bob’s your mother’s brother.”

And before you ask, yes, this slows crap down. I’ll probably hack together a script that will take the R binaries and other needed upgrades out of Amazon S3 and load them in a bootstrap action which will greatly speed things up. The above example has one clear advantage over loading binaries from S3: It works right now. And remember folks, code that works right now kicks code that “might work someday” right in the balls. And then mocks it while it cries.

Chicago R Meetup: Healthier than Drinking Alone

I’m kinda blown away by the number of folks who have joined the Chicago R User Group (RUG) in the last few weeks. As of this morning we have 65 people signed up for the group and 25 who have said that they are planning on attending the meetup this Thursday (yes, only 3 days away!). I’m very pleased that this many people in Chicago find the R language interesting and/or valuable. Of course, there is the possibility that some of the 25 who are attending are simply hoping for some free beer. I was a member of a vegan society for 2 years because they had free beer. The week I accidentally showed up with a six pack of White Castle sliders really blew my cover. That’s how I discovered that you can scare off angry vegans by waving a steaming hot, onion-covered, meat-like patty in their face. True story. And when I say “true story” I mean “total lie”.

By the way, I’m already recruiting presenters for next month’s RUG meetup. And I’m also looking for locations. So if you have an idea for either, let me know. I promise to not throw any mini burgers at you.

Virtual Conference: R the Language

On Tuesday, May 4th at 9:30 PM Central (10:30 Eastern), I’ll be giving a live online presentation as part of the Vconf.org open conference series. I’ll be speaking about R and why I started using it a couple of years ago. This is NOT going to be a technical presentation but rather an illustration of how an R convert was created and why R became part of my daily tool set.

If you’re not familiar with the vconf.org project, you should read a little about it. It’s just getting started, but I love the idea that it’s not for profit and all presentations are Creative Commons licensed. You know that cool new technology you’ve been playing with? Yeah, that one. You really should give a vconf about it. I know I’d like to hear about it!