A Fast Intro to PLYR for R

I’m not dead yet! Although it has been rumored that I am. The new job is going great and I’m thrilled to be with a new firm doing interesting work alongside smart people. It makes me seem smarter by simple association.

There’s been a lot going on recently in the R user community. There was an R flash mob on Stack Overflow which resulted in a noticeable increase in the number of R questions and answers on SO. I’ve been blown away by the quality of the participants. There has also been an increase in quality discussions on Twitter, which are being tagged with #rstats. These changes in the community have not gone unnoticed.

Recently I posted a question about how to do a ‘group by’ in a regression with R. I had a way I had been doing this but I was suspicious there was a better way. One of the answers proposed using the PLYR package. I think I had seen the plyr package a few times but never really understood it. Although I didn’t select this as my top answer, it prompted me to look into PLYR more. What I discovered was really interesting.

The PLYR package is a tool for doing split-apply-combine (SAC) procedures. I’m very fluent in SQL so the best analogy for me was the GROUP BY statement in SQL. PLYR adds very little new functionality to R. What it does do is take the process of SAC and make it cleaner, more tidy and easier. I think I’m not the only one who wants a clean and tidy SAC. Here’s a quick example of making some summary stats using PLYR:

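# install.packages("plyr") #run this if you don't have the package already
library(plyr)

#make some example data
#(the data-creation line is missing from this copy of the post; the call below is
# an assumed stand-in: three random numeric columns plus two grouping columns)
dd <- data.frame(matrix(rnorm(216), 72, 3),
                 rep(c("A", "B", "C"), each = 24),
                 rep(c("J", "K"), each = 36))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")

#ddply is the plyr function
ddply(dd, c("dim1","dim2"), function(df)mean(df$v1))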


      dim1 dim2          V1
    1    A    J  0.02554362
    2    B    J -0.15839675
    3    B    K -0.06077399
    4    C    K -0.02326776

PLYR functions have a neat naming convention. The first two letters of the function name tell you the input and output data types, respectively. The one I use the most is ddply, which takes a data frame in and spits out a data frame.

Let me see if I can explain what ddply is doing. The first argument, dd, is the input data frame. The next argument is the “group by” variables. Since I want to group by two variables I send them as a vector (that’s what the c() bit does). What threw me for a loop initially was the third argument, the function. What I found myself trying (unsuccessfully) was just using mean(v1) as the third argument. If I did that, R would spit at me and bring the marital status of my parents into question. I discovered that the problem was that ddply splits the data by my ‘group by’ variables and then wants to pass each of the resulting data frames to a function. So what does it mean to pass a data frame to mean(v1)? Yeah, it means Jack Crap, that’s what it means. mean(v1) is not a function at all, just a call that R tries to evaluate on the spot.

So in one of the PLYR examples I saw they were using these inline functions. The idea behind function(df)mean(df$v1) is to create a function to which we can pass a data frame and get out a meaningful result. The subset (or split) of the data gets passed to the function and that subset is then known as df. mean(df$v1) calculates the mean of v1 and returns an answer. ddply holds on to the answers of each split and then reassembles them all in the end. Slick, eh?
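To make the split, apply and combine steps concrete, here is a rough base-R sketch of what a call like the ddply one above is doing conceptually. This is not how plyr is actually implemented, and the data frame toy, the helper mean_v1 and the result by_hand are all made-up names for illustration.

```r
# Toy data (hypothetical; not the data frame used in the post)
toy <- data.frame(
  v1   = c(1, 3, 5, 7, 9, 11),
  dim1 = c("A", "A", "B", "B", "C", "C"),
  dim2 = c("J", "J", "J", "K", "K", "K")
)

# The function applied to each piece: it takes a whole data frame
mean_v1 <- function(df) mean(df$v1)

# Split: one data frame per dim1/dim2 combination that actually occurs
pieces <- split(toy, list(toy$dim1, toy$dim2), drop = TRUE)

# Apply: call the function on each piece
answers <- lapply(pieces, mean_v1)

# Combine: put the group labels and the answers back into one data frame
by_hand <- do.call(rbind, lapply(names(pieces), function(nm) {
  piece <- pieces[[nm]]
  data.frame(dim1 = piece$dim1[1], dim2 = piece$dim2[1], V1 = answers[[nm]])
}))
by_hand
```

The point of the sketch is that the function only ever sees one piece at a time, which is why it has to accept a data frame as its argument.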

As with most things in R the idea can be extended: have the function return a vector of results in order to perform many operations on each split:

ddply(dd, c("dim1","dim2"), function(df)c(mean(df$v1),mean(df$v2),mean(df$v3),sd(df$v1),sd(df$v2),sd(df$v3)))

The result looks like this:

  dim1 dim2          V1        V2         V3        V4        V5       V6
1    A    J  0.02554362 0.3400250  0.1206980 0.9326424 1.0044120 1.100762
2    B    J -0.15839675 0.3662559 -0.1784193 0.7447807 0.8752162 1.105258
3    B    K -0.06077399 0.5184403 -0.2076024 1.0385107 1.0609706 1.153153
4    C    K -0.02326776 0.2639328  0.1352895 0.7940938 0.9025207 1.072460

Pretty nifty.
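The V1 through V6 column names are not very informative. One way around that, sketched here on a small made-up data frame (toy and named are hypothetical names, not from the example above), is to have the inline function return a named vector. As far as I can tell, ddply then uses those names for the output columns.

```r
library(plyr)  # install.packages("plyr") if you don't have it

# Toy data (hypothetical; not the data frame used in the post)
toy <- data.frame(
  v1   = c(1, 3, 5, 7, 9, 11),
  v2   = c(2, 4, 6, 8, 10, 12),
  dim1 = c("A", "A", "B", "B", "C", "C"),
  dim2 = c("J", "J", "J", "K", "K", "K")
)

# Naming the elements of the returned vector names the result columns
named <- ddply(toy, c("dim1", "dim2"), function(df) {
  c(v1mean = mean(df$v1), v2mean = mean(df$v2))
})
named
```

This keeps the same inline-function pattern, so it still works with any function you care to write.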

The author of PLYR is Hadley Wickham who is also the man behind GGPLOT2. If you like PLYR or GGPLOT2 then you should immediately buy Hadley’s GGPLOT2 book on Amazon. But be sure to use the link on this site or the link on Hadley’s site so he can get the Amazon Associates payment. The authors I have talked to told me they get more from the Associates program than they get from publishing royalties.

My father is a retired pilot turned crop farmer. He ALWAYS carries a pair of pliers in a nylon pouch on his belt. I can see that Hadley’s PLYR package is going to become my proverbial ‘belt pliers.’

Of course if I wrote an R package I’d have to name it Super RamBar, cause that’s just how I roll.


  1. dontpanic says:

    Thanks for the article, very interesting…

    I’ve made a little performance test with a quite large data frame (~500000 rows, 5 cols). It looks like sqldf() is much faster (9 secs) than ddply() (15 secs) when grouping by two columns and then taking the sum() of a third column.

  2. JD Long says:

    plyr is certainly not as fast as sqldf() which uses the sqlite engine. However, this is only a relevant comparison in situations where an operation can be done in both plyr and sqldf(). These situations are actually quite few. plyr works not only on data frames, but also on lists, matrices, arrays, etc. In addition, sqlite (and SQL in general) only supports a very limited set of aggregation functions. plyr supports any aggregation function you can write plus all the functions included in base R and the thousands of CRAN packages. So it’s great to compare speed, but be forewarned that this comparison is only relevant in a very small subset of cases!

    Thanks for reading and posting your thoughts!

  3. Lavinia says:

    Thanks for the worked example, very useful, I’ve looked at plyr but never really appreciated how it worked.

  4. Ethan Brown says:

    Dearest Sir,

    Once again you were the right search result at the right time. What a handy tool and a handy post! My night is saved.

    BTW, it was good to meet you many moons ago at useR! 2010 at the RUG session–we now have a lively and excellent local group here in Denver thanks in part to your inspiration.

  5. Wabe says:

    A neater way to achieve the same result and at the same time set appropriate column names is to write:

  6. JD Long says:

    Good point Wabe. I didn’t use summarize in the example because I wanted to illustrate a design pattern applicable to functions which are not part of summarize. But I agree that summarize is pretty dang handy!

  7. YES. That is what confused me too at first.

  8. So here is another question. Say I want to ddply with INDICES = .(dim1, dim2) on only a subset of the data. Say the positive ones for concreteness.

    What then? Do I have to do it in two steps? Because that’s the dosey do I thought we were avoiding by using plyr.

  9. Bob Muenchen says:

    If you do this example using plyr’s summarize function, you can avoid typing df$ so many times and you also get to name your columns:

    > ddply(dd, c("dim1","dim2"), summarize,
    + v1mean = mean(v1),
    + v2mean = mean(v2),
    + v3mean = mean(v3),
    + v1sd = sd(v1),
    + v2sd = sd(v2),
    + v3sd = sd(v3))
      dim1 dim2   v1mean    v2mean   v3mean    v1sd    v2sd    v3sd
    1    A    J  0.04736  0.091364 -0.30501 1.07886 0.96619 1.13664
    2    B    J  0.30301  0.227677 -0.24818 1.22907 0.98196 1.29158
    3    B    K -0.28136 -0.264688  0.35963 0.77938 0.83391 1.11255
    4    C    K -0.24694 -0.163735 -0.24561 0.79591 0.91162 0.83031

  10. Bob Muenchen says:

    Oops, I missed Wabe’s comment about summarize!
