Remote Backup Fail and How to Silently Copy Files

Today I called my firms desktop support to talk to them about how to get Iron Mountain Connected Backup to archive files located somewhere other than [C:\Documents and Settings\user\] and through talking with my desktop support guy I discovered that it doesn’t support that. Oh, and by the way it’s a “desktop backup” so it’s not backup up my MS Access files or Outlook PST files. I told the guy that I had gone in and made sure it was backing those files up and they were checked in the UI. He informed me that it may look like they are backed up, but I can’t restore them. To which I responded “Any developer who writes backup software that will backup a file it can’t restore should be kicked squarely in the nuts and then never allowed near a computer for life” I’m not kidding. Honest to god I would kick an Iron Mountain developer right in the baby maker for passing this piece of shit program off as “enterprise ready.” The only way this program could be more useless is if it actually deleted files from my PC instead of backing them up. If the software is crippled because they are selling it as a “desktop backup” then, by god, they better tell me that in big fucking blinking letters and a marching band playing John Philip Sousa on my lap.

Alternatives: I’ve been running Jungle Disk at home and really like it. I could use that at work except I have not set up an Amazon or RackSpace account with my work credit card. But I am in Chicago and my database server/ file server is in Dallas TX. So I decided to just create a mirror on my laptop onto a shared drive on my server. There’s lots of ways to do this, but the path I chose was to use RoboCopy, a command line copy tool from Microsoft that is part of the Windows Server 2003 Resource Kit. I’m running XP and I wanted the mirroring of my machine to be invisible, silent, and scheduled. To do this I found I needed to take the following steps:

  1. Install RoboCopy
  2. Create a batch file to mirror the directory I wanted
  3. Create a windows script to call the batch silently
  4. Schedule the windows script to run automagically

Install RoboCopy: Download the Windows Server 2003 Resource Kit and install it. Very easy.

Create a batch file to run RoboCopy: I named mine c:/backup.bat and it looks something like this:

Set Source=”C:\Documents and Settings\jdlong”
Set Dest=”\\myDallasServer\backup\jdlong”
Robocopy %Source% %Dest% /MIR /Z /R:0  >nul

This simply sets the source and destination and then runs RoboCopy with the /MIR (mirror) and /Z (restartable) switches invoked

Create a windows script: The problem with the batch file is that it is noisy when it runs. Even piping the output to nul it still produces a CMD window that stays up until it finishes running. That’s where the Windows Script file comes into play. It calls the batch file but hides the CMD window. I created a file called c:\runBackup.vbs that has this in it:

Set WshShell = CreateObject(“WScript.Shell”)
WshShell.Run chr(34) & “C:\backup.bat” & Chr(34), 0
Set WshShell = Nothing

Schedule the windows script: Control Panel -> Scheduled Tasks. Then I created a new task that runs  c:\runBackup.vbs every night at 11PM. The only down side is that when I change my password I have to remember to change the password associated with the scheduled task or it will fail.
The only upside is that I figured out that Iron Mountain sucks prior to having data loss. I got lucky. Next week I am going to test my backup. And then test it every quarter after that. And I won’t depend on my corporate IT do to my backups.

Struggling with apply() in R

It’s common knowledge that I struggle wrapping my head around the apply functions in R. That is illustrated very clearly in the following discussion on Stack Overflow:

apply_struggle

Dirk’s comment is actually spot on. I’ve asked the same damn question at least 4-5 times. Only I didn’t really understand it was the same question. That’s one of the problems of not really being good at something; it’s hard to think abstractly about it. I’m not really good at R, so sometimes I don’t realize that multiple concepts are related. As I talk with other new users of R it’s clear that unless they come from a programming language with an apply-esque construct they likely are struggling with R. I think most of the confusion comes from a) not understanding what data format apply() is going to return and b) not understanding anonymous functions.

With this in mind I did a little screencast illustrating how this struggle plays out for a new users. I also show why I use the plyr package for much of the stuff other folks use apply() for.

Any feedback you have is appreciated. This is my first stab at a screencast, so I am still trying to figure out the best approach/method as well as how many drinks puts me on the Ballmer Peak.

EDIT: it’s been pointed out that I misuse some terminology a number of times. I should have named my year vector “yearVector.” By calling it “yearList” I then refer to the vector as a list. I was using “list” in the vernacular, but since list is a specific R data structure it is confusing that I named a vector a name with “list” in it.

Loading Big (ish) Data into R

So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff:

This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV. Sue me.)

2gib

I thought this was a dang good question. What I have always done in the past was load my data into SQL Server or Oracle using an ETL tool and then suck it from the database to R using either native database connections or the RODBC package. Matti Pastell (@mpastell) recommended using the sqldf (SQL to data frame) package to do the import. I’ve used sqldf before, but only to allow me to use SQL syntax to manipulate R data frames. I didn’t know it could import data, but that makes sense, given how sqldf works. How does it work? Well sqldf sets up an instance of the sqlite database server then shoves R data into the DB, does operations on the tables, and then spits out an R data frame of the results. What I didn’t realize is that we can call sqldf from within R and have it import a text file directly into sqlite and then return the data from sqlite directly into R using a pretty fast native connection. I did a little Googling and came up with this discussion on the R mailing list.

So enough background, here’s my setup: I have a Ubuntu virtual machine running with 2 cores and 10 gigs of memory. Here’s the code I ran to test:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, ‘bigdf.csv’, quote = F)

That code creates a data frame with 3 columns. I created a single letter text column, then two floating point columns. There are 40,000,000 records. When I run the write.csv step on my machine I get about 1.8GiB. That’s close enough to 2 gigs for me. I created the text file and then ran rm(list=ls()) to kill all objects. I then ran gc() and saw that I had hundreds of megs of something or other (I have not invested the brain cycles to understand the output that gc() gives). So I just killed and restarted R. I then ran the following:

library(sqldf)
f <- file(“bigdf.csv”)
system.time(bigdf <- sqldf(“select * from f”, dbname = tempfile(), file.format = list(header = T, row.names = F)))

That code loads the CSV into an sqlite DB then executes a select * query and returns the results to the R data frame bigdf. Pretty straightforward, ey? Well except for the dbname = tempfile() bit. In sqldf you can choose where it makes the sqlite db. If you don’t specify at all it makes it in memory which is what I first tried. I ran out of mem even on my 10GB box. So I read a little more and added the dbname = tempfile() which creates a temporary sqlite file on the disk. If I wanted to use an existing sqlite file I could have specified that instead.

So how long did it take to run? Just under 5 minutes.

So how long would the read.csv method take? Funny you should ask. I ran the following code to compare:

system.time(big.df <- read.csv(‘bigdf.csv’))

And I would love to tell you how long that took to run, but it’s been running for half an hour all night and I just don’t have that kind of patience.

-JD

Using Amazon EC2 to Thwart Crappy Internal IT Services

ec2 tweet

The alternative title of this blog post is “How to get your sorry ass fired by violating your internal IT policies.” So keep that in mind as you read this.

I say lots of silly crap. Twitter allows me the pleasure of sharing this blather with the world. I was a little surprised that of all the things I have said over the last few months the above Tweet received the most discussion. Apparently this tweet captured the imagination and consternation of some fellow Tweeters. I had people follow up with me and basically ask, “what do you mean?” Twitter is good for a sound bite, but less so for an elaborate answer. Which brings us to this:

What are the top ways Amazon EC2 can allow a business user to escape the manipulative and counterproductive grip of corporate IT? Well I’m glad you asked!

1) Over-restrictive web filtering policies:  When I worked as a risk manager for a Fortune 500 insurance firm I was shocked on the first day when I could not search Google Groups. At the time Google Groups was one of my favorite resources for figuring out everything from SQL syntax to Excel formulas. The firm, like most firms, outsourced the filtering of web content. Apparently they signed up for “Super Freaking Restrictive” filtering. I could not even search the web for “Ubuntu” as all sites with the word Ubuntu in the title or with the world “Ubuntu” passed as a form submission were blocked. Apparently Ubuntu is not just a Linux distro, but also a militant organization of African computer programmers, or something. So how did I get around this with EC2? I would fire up an EC2 Ubuntu instance running Squid proxy before I left home, then ssh into the cloud from work and use a little SSH port forwarding to route my web traffic through the ssh connection and out via Squid. I set up my EC2 instance to listen for ssh on port 443 and my firm’s firewall would let the connection pass as it assumed it was simply ssl traffic into Amazon. Brilliant!

2) Under powered database servers: At another point I was responsible for data analytics on a portfolio of insurance policies. I had to join together data from multiple systems (underwriting, admin, claims, etc.). The firm was an Oracle shop and none of the Oracle machines had enough user space for me to make the big ass join that had to be made in order to cobble together my analytics. For a while I hobbled along using PROC SQL in SAS to bring all the data together inside of SAS running on a PC. Finally I just gave up and built my own data mart in the cloud. And I could totally cut my internal IT politics out of the system. Whew, once the politics and begging for resources was over I could kick ass at analytics without having to beg borrow and plead for permissions and space.

3) Failure to backup desktop machines / inadequate shared drive space: Another experience I had was with a firm that decided it was a good policy to NOT back up desktop PCs at all. Each department was given shared drive space on a central server where “business critical” files were supposed to be kept (whatever the hell that means). Only the files on the central server were backed up. I was in the risk management department (ironically) and we had a whopping 100 MB allocated to us. Yes, this was 2004 and 100 MB was not enough to hold 2 years of risk reviews. Not to mention any ad hoc analysis and all the supporting documents. So everyone had their desktop drives, at least one USB drive, and no off site backup. It was during this period that I discovered Jungle Disk which allows client side encrypted data to be backed up to Amazon! Off site backup problem solved! And, once again, corp IT cut out of the system. (yes, this is a use of S3, not EC2) By the way, I paid for backups out of my own pocket because I felt it was very important. Well, I did have the firm buy me books which I happily kept when I left. We’ll call it even.

Let me reiterate that all three of the above uses may have put me in direct violation of my corporate IT policies. And let me also state that ultimately I found a job at a firm where internal IT sees their job as helping the business units get crap done. If you are an IT professional and you find your self thinking, “damn, I have to make sure I restrict my users from all of these crafty uses of EC2″ then, jackass,you are the problem with your firm’s IT department. If you see your job as stopping users then you are a useless burden on your firm and you should be not only fired, but spat upon. The way to prevent users from doing these, and other “shadow IT” behaviors is to provide the IT services that help your users be awesome! If you do that then you don’t have to worry about what your users are up to. They’ll be too damn busy being awesome to have time to mess with Amazon EC2.

All the examples above took place at previous places of employment. I currently use Amazon EC2 in order to scale some of my analytics, but it is done with the knowledge and support of my internal IT team. They fully understand what I am doing and they want to help me be awesome at analysis. It’s amazing how much less time I am wasting these days now that I don’t have to be so creative about avoiding the manipulative and counterproductive intervention of my internal IT team.

Kicking Ass with plyr

Tonight (October 29, 2009) at 5:30 PM is the Chicago R meetup at Jaks tap. Here’s more info.  I’ll be making a presentation based on my earlier blog post about plyr. The presentation will only be 8 minutes long so I’ve had to pick and choose my info carefully. OK, who am I kidding? I had a couple of Schlitz (in a bottle!) for lunch over at Boni Vinos and slammed some slides together rather haphazardly. At any rate, here’s the presentation. I owe special thanks to all the folks in Twitter who reviewed these slides this week. A special shout out to @kenahoo who caught my one code typo! And also to @hadleywickham (author of plyr) who made some good suggestions, some of which I heeded. As a professor he should consider 15% application of his information to be a phenomenally high rate.

Click the graphic to download the slides as a PDF:

kickingasswithplry

If you’re wondering what my favorite beer is, I’ll give you a secret. My favorite beer is #3. That’s the one that makes me a persuasive and articulate public speaker. #4 makes me dance well.

I hope to see you tonight.

Why Stack Overflow Careers is a Disruptive Innovation

Today Joel (typo fixed) Jeff Atwood announced via the Stack Overflow blog a new site called Stack Overflow Careers, a programming job site focused at job hunters.  This is a compliment to the job listing service which allows companies who are hiring to advertise on Stack Overflow. Seems like the the world’s most ‘no shit’ idea, right? But this is more than a simple idea, this is disruptive innovation in job hunting that will revolutionize  how programmers, and later other technical talent, get jobs. Why so revolutionary? Because Stack Overflow is the venue where programmers actually prove their mettle. What SO has that no other job site can offer hiring companies is access to not just resumes, but code samples, writing samples, and conversations which the candidates have provided through use of SO. I’ve hired technical talent and it is hard to figure out if the person can get stuff done and do critical thinking. It’s even hard to figure out their level of mastery of a given technology without spending a lot of time testing. But with SO I can see how people answer questions, how they ask questions, etc.

The value of this service to job hunters is high enough that I think SO is smart to charge for providing the link between a user’s resume and their profile. The price is low (<$10/yr right now) But this could provide a meaningful revenue stream for SO without cluttering the site with ad noise.

This should also have positive feedback effects on Stack Overflow. I anticipate that the quality of answers and questions both will go up. While I’m still stupid enough to post a comment that says “hey dumb ass…” most users will be smarter than that. Even if it’s just in the back of someone’s mind that a potential employer might be reading, it should improve their answers.

Now, I’m an economist, so I think there will be some unintended consequences to the Stack Overflow Careers site. First unintended consequence will be an increase in the number of new accounts. Why? Because everyone has to set up a “dumbass” account for when they want to ask stupid questions or make a sarcastic remark. But then when someone is about to post a 750 word answer with 5 pictures and 3 screen shots they will put on their “game day” account and knock that SOB out of the park! But over time most folks will let the idea of future job hunts slip out of their conscious mind and will basically use their main account for all activity. Plus even dumb questions earn karma points, so why not!

The second unintended consequence will be a plethora of clinger-on-ers. These will range from the simple: I give Dice 2 months before they allow some sort of linking to a SO profile; to the more complex: some job site is going to not only link to a SO profile but also scrape all that user’s SO activity and dynamically link it to their resume. I am unsure how the Stack Overflow Creative Commons license will protect SO from this type of activity. Stack Overflow may end up spending more on money on lawyers than they ever anticipated.

The third unintended consequence is that Stack Overflow, the company, just went from taking nickels away from Expert’s Exchange to taking dollars away from job listing sites. I hope that SO has both lawyers and sales people because now that real money is on the table they will have some savvy competition.

Good luck Joel and Jeff. I’m rooting for you!

Disclaimer: I have no financial interest in Stack Overflow at all. I have been involved in getting the R programming language community involved in Stack Overflow because I think it is the best information exchange platform available and I want more R information to be exchanged. Plus it pissed me off that the R community was using a freaking mailing list as it’s primary Q/A platform. A mailing list. In 2009. Honestly.

A Fast Intro to PLYR for R

pliersI’m not dead yet! Although it has been rumored that I am. The new job is going great and I’m thrilled to be with a new firm doing interesting work alongside smart people. It makes me seem smarter by simple association.

There’s been a lot going on recently in the R user community. There was an R flash mob of Stack Overflow which resulted in a noticeable increase in the number of R questions and answers in SO. I’ve been blown away by the quality of the participants. There has also been increased quality discussions on Twitter which are being tagged with #rstats. These changes in the community have not gone unnoticed.

Recently I posted a question about how to do a ‘group by’ in a regression with R. I had a way I had been doing this but I was suspicious there was a better way. One of the answers proposed using the PLYR package. I think I had seen the plyr package a few times but never really understood it. Although I didn’t select this as my top answer, it prompted me to look into PLYR more. What I discovered was really interesting.

The PLYR package is a tool for doing split-apply-combine (SAC) procedures. I’m very fluent in SQL so the best analogy for me was the GROUP BY statement in SQL. PLYR adds very little new functionality to R. What it does do is take the process of SAC and make it cleaner, more tidy and easier. I think I’m not the only one who wants a clean and tidy SAC. Here’s a quick example of making some summary stats using PLYR:

# install.packages("plyr") #run this if you don't have the package already
 library(plyr)

#make some example data
dd<-data.frame(matrix(rnorm(216),72,3),c(rep("A",24),rep("B",24),rep("C",24)),c(rep("J",36),rep("K",36)))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")

#ddply is the plyr function
ddply(dd, c("dim1","dim2"), function(df)mean(df$v1))

result:

    dim1 dim2          V1
    1    A    J  0.02554362
    2    B    J -0.15839675
    3    B    K -0.06077399
    4    C    K -0.02326776

PLYR functions have a neat naming convention. The first two letters of the function tells the input and output data types, respectively. The one I use the most is ddply which takes a data frame in and spits out a data frame. Let me see if I can explain what ddply is doing. The first argument, dd, is the input data frame. The next argument is the “group by” variables. Since I want to group by two variables I send them as a vector (that’s what the c() bit does). What threw me for a loop initially was the third argument, the function. What I found myself trying (unsuccessfully) was just using mean(v1) as the third argument. If I did that, R would spit at me and bring the marital status of my parents into question. I discovered that the problem was the ddply function was splitting the data by my ‘group by’ variables and then it wanted to pass each of the resulting data frames to a function. So what does it mean to pass a data frame to mean(v1)? Yeah, it means Jack Crap, that’s what it means. So in one of the PLYR examples I saw they were using these inline functions. The idea behind function(df)mean(df$v1) is to create a function to which we can pass a data frame and get out a meaningful result. The subset (or split) of the data gets passed to the function and that subset is then known as df. mean(df$v1) calculates the mean of v1 and returns an answer. ddply holds on to the answers of each split and then reassembles them all in the end. Slick, ey?

As with most things in R the idea can be extended to a vector of functions in order to perform many operations on each split:

ddply(dd, c("dim1","dim2"), function(df)c(mean(df$v1),mean(df$v2),mean(df$v3),sd(df$v1),sd(df$v2),sd(df$v3)))

The result looks like this:

dim1 dim2          V1        V2         V3        V4        V5       V6
1    A    J  0.02554362 0.3400250  0.1206980 0.9326424 1.0044120 1.100762
2    B    J -0.15839675 0.3662559 -0.1784193 0.7447807 0.8752162 1.105258
3    B    K -0.06077399 0.5184403 -0.2076024 1.0385107 1.0609706 1.153153
4    C    K -0.02326776 0.2639328  0.1352895 0.7940938 0.9025207 1.072460

Pretty nifty.

The author of PLYR is Hadley Wickham who is also the man behind GGPLOT2. If you like PLYR or GGPLOT2 then you should immediately buy Hadley’s GGPLOT2 book on Amazon. But be sure and use the link on this site or the link on Hadley’s site so he can get Amazon associate payment. The authors I have talked to told me they get more from the Associate program than they get from publishing royalties.

My father is a retired pilot turned crop farmer. He ALWAYS carries a pair of pliers in a nylon pouch on his belt. I can see that Hadley’s PLRY package is going to become my proverbial ‘belt pliers.’

Of course if I wrote an R package I’d have to name it Super RamBar, cause that’s just how I roll.

Tolstoy Dichotomy, Part Two

Tatiana Samoilova as Anna-bobanna

Tatiana Samoilova as Anna-bobanna

So back in March, 2009 I blogged about a phenomenon I called the Anna Karenina Yield Anomaly. In short, I postulated that in the production of crops the idea of a national ‘good year’ pretty much means everyone had a good yield and a national ‘bad year’ meant that some had an OK year and some were having a terrible year. And thus I made myself seem more literate than I am by linking that phenomenon back to Tolstoy and his line “All happy families are happy in the same way. All miserable families are miserable in their own way.”

So a couple weeks ago a big ass storm conspired against me and resulted in me listening to the audio version of Guns Germs and Steel by Jared Diamond while stuck in the Minneapolis airport. I was struck that Diamond also used the Anna Kerenina quote when discussing the domestication of animals. He presented the idea that all domesticated animals are similar in a number of key characteristics and all the non-domesticated critters are different and failed to be put under the yolk of humans for their own special reasons.

Dont call him Jared Yoder. He hates that.

Don't call him Jared Yoder. He hates that.

Ok, so Diamond and I both like to make connections between Tolstoy and modern agriculture. And all us white guys like to look smarter than we probably are (actually he really is that smart, I’m just gonna fake it till I make it). But besides those obvious things, what’s behind the Tolstoy quote? How applicable is this idea to other situations?

I spent a long time in the Minneapolis Airport pondering what conditions result in what I began to call the Tolstoy Dichotomy: one outcome and all are similar vs. other outcome with different causes. OK, just for transparency, I was also trying not to tap my toe or do anything else obviously gay while in the pooper. That’s why I listen to audio books and not music when stuck in MSP.

While it’s not exactly profound, it seems to me that any time a given outcome of a process is dependent on many necessary and few/no sufficient conditions then the final set of actors who went through that process will comprise one group where all necessary conditions were met and another group where varying different conditions were not met. So all the actors in one group are the same but the actors in the other group are in that group together, but all for a different set of reasons.

In crop yields this makes a lot of sense. The growing season is 8 months + for corn in the Midwest US. During that period a LOT of things can go wrong. There can be a spring that is wet and delays planting, like this year. There can be a mid-season drought, like 1987. A big cloud can park itself over the Midwest and limit light and temp and increase flooding (1993). But all good years mean that there was moisture at the right times, not too much heat during pollination, etc. So with crops there are many necessary conditions and no sufficient conditions for a good crop.

One of the interesting analytical consequences of this dichotomy is that the two groups have very different correlations. If you are measuring the group that met all the conditions, they are a homogeneous bunch (i.e. high correlation). But the group that fell off the wagon for one reason or another… well they are all pretty heterogeneous (i.e. low correlation). Depending on your model it may be very important to consider the difference between the correlations in these groups and then vary your covariance accordingly. Or you can just assume that correlations are stable over time and find yourself kicked in the nuts by reality. Which ever. It’s your call.

So, as I said earlier, not exactly profound but every time you see a system with many necessary conditions and  no sufficient conditions be like me and pull a Tolstoy Anna Kerenina quote from your butt. You’ll be glad you did.

Who's Tweets Do I Read… Magic R Code Says…

tweetingme

So one glace at my user logs shows the truth: no one gives a rat’s rump that I just quit my job; you just love you some Twitter R code. And I’m nothing but an attention whore, so come get some!

So in my last ‘Twitter with R’ post I gave you some code I’d written ripped off that allowed you to update your status from R. That’s kinda cool, but really just for annoying your friends, tweeting when your code is finished running or, as Eva pointed out in the comments, maybe Tweeting the outcome of a routine. But R is good for analyzing data, plotting graphics, and cool stuff like that. Seems like under kill to just Tweet from it.

So let’s make some pretty pictures and stuff. Or more specifically, let’s plot a histogram of the last 200 Tweets you received and the people who sent them. An example of said histogram is above.

If you don’t already have the libraries XML, lattice, and RCurl, you will need them:
install.packages('RCurl')
install.packages('XML')
install.packages('lattice')

Then once you get those bad boys, you can run this code:

library("RCurl")
library("XML")
library("lattice")
#
#be sure and put your username and passy here
username<-"YourUserName"
passy<-"YourPass"
#
#This sets up the options for curl
#then makes the request from the Twitter API
#the count=200 option pulls the last 200 tweets from your friends
#the twitter api limits you to a max of 200.... yeah, well that's life
#
opts <- curlOptions(header = FALSE, userpwd = paste(username,":",passy,sep=""))
request <- "http://twitter.com/statuses/friends_timeline.xml?count=200"
timeline <- getURL(request,.opts = opts)
#
#Now let's beat up on the XML like it owes us money
doc <- xmlInternalTreeParse(timeline, useInternalNodes = TRUE)
#
#grab only the screen_names and make a list
xml_names_of_posters <- getNodeSet(doc, "/statuses/status/user/screen_name")
text_names_of_posters <- lapply(xml_names_of_posters,xmlValue)
#
#let's take it out of a list... just for kicks
Twitterbaters <- unlist(text_names_of_posters)
#
#and shove it into a data frame... seems like going around my

#ass to get to my elbow, but I want to put it in a table eventually
#table is kinda like a cross tab. It calcs the frequency for me
posters_list_df<-as.data.frame(Twitterbaters)
Tweets = table(posters_list_df)
#
#lets graph this monkey business with lattice
#
NiftyChart<-barchart(~sort(Tweets), main=list("Who's Tweets Am I Getting?" ,cex=1),xlab=list("Number of Tweets",cex=1))
NiftyChart
update(NiftyChart, col="brown")
#

EDIT: Look in the comments for a great base graphics solution from Paolo. He makes the same graph without Lattice.

Now be sure and change the username and password then run that mofo. So now you have a pretty picture like the one I made above. Pretty slick, no?

Special credit goes to @gappy3000 who tipped me off to making this with Lattice instead of ggplot because of the difficulty sorting with ggplot. @HarlanH for helping me know that my struggles with ggplot were not of my own making but were systemic.

The Twitter syntax I hacked together is from the Twitter API documentation. Have fun! And come back later for more attenion whoring blogging from your’s truly, @CMastication.

BTW, the reason I didn’t structure this as a function is that you should be stepping through this one line at a time to figure out how it works. That’s just harder with a function. So I did this for your own good. One day you’ll thank me.

Not Just Normal… Gaussian

Pretty Normal

Pretty Normal

Dave, over at The Revolutions Blog, posted about the big ‘ol list of graphs created with R that are over at Wikimedia Commons. As I was scrolling through the list I recognized the standard normal distribution from the Wikipedia article on the same topic.

Below is the fairly simple source code with lots of comments. Here’s the source. Run it at home… for fun and profit.

# # External package to generate four shades of blue
# library(RColorBrewer)
# cols <- rev(brewer.pal(4, "Blues"))
cols <- c("#2171B5", "#6BAED6", "#BDD7E7", "#EFF3FF")

# Sequence between -4 and 4 with 0.1 steps
x <- seq(-4, 4, 0.1)

# Plot an empty chart with tight axis boundaries, and axis lines on bottom and left
plot(x, type="n", xaxs="i", yaxs="i", xlim=c(-4, 4), ylim=c(0, 0.4),
     bty="l", xaxt="n", xlab="x-value", ylab="probability density")

# Function to plot each coloured portion of the curve, between "a" and "b" as a
# polygon; the function "dnorm" is the normal probability density function
polysection <- function(a, b, col, n=11){
    dx <- seq(a, b, length.out=n)
    polygon(c(a, dx, b), c(0, dnorm(dx), 0), col=col, border=NA)
    # draw a white vertical line on "inside" side to separate each section
    segments(a, 0, a, dnorm(a), col="white")
}

# Build the four left and right portions of this bell curve
for(i in 0:3){
    polysection(   i, i+1,  col=cols[i+1]) # Right side of 0
    polysection(-i-1,  -i,  col=cols[i+1]) # Left right of 0
}

# Black outline of bell curve
lines(x, dnorm(x))

# Bottom axis values, where sigma represents standard deviation and mu is the mean
axis(1, at=-3:3, labels=expression(-3*sigma, -2*sigma, -1*sigma, mu,
                                    1*sigma,  2*sigma,  3*sigma))

# Add percent densities to each division, between x and x+1
pd <- sprintf("%.1f%%", 100*(pnorm(1:4) - pnorm(0:3)))
text(c((0:3)+0.5,(0:-3)-0.5), c(0.16, 0.05, 0.04, 0.02), pd, col=c("white","white","black","black"))
segments(c(-2.5, -3.5, 2.5, 3.5), dnorm(c(2.5, 3.5)), c(-2.5, -3.5, 2.5, 3.5), c(0.03, 0.01))