
R wrangling - Anomalous Space Time Piping
Please be warned that I'm mostly venting about R in this post. Or, rather, venting and sharing knowledge.

As one is wont to do whilst working on a dissertation, one occasionally finds the need to create spiffy plots, and that's the position I found myself in today. Today's primary objective was to create a new plot and, as is typical with R, the initial plot took about a minute to write up ... and several hours to tweak to get Just Right. I'm epically ambivalent about today: I feel a tangible sense of accomplishment, but I also burned several hours on something that was, in a sense, trivial.

Essentially I had the following data:
> head(trainingSetCounts)
  Job Generation NumParentsHigh NumParentsLow
1   0          0             29            12
2   0          1             41            12
3   0          2             31            20
4   0          3             19            27
5   0          4             16            28
6   0          5             34             8
> tail(trainingSetCounts)
    Job Generation NumParentsHigh NumParentsLow
715   9         18             36            20
716   9         19             31            13
717   9         20             29            17
718   9         21             33            13
719   9         22             46             6
720   9         23             56             6

Without going into too much detail on the variable meanings, I wanted to create a combined box-and-whiskers plot of the "NumParentsHigh" and "NumParentsLow" by "Generation" to make side-by-side comparison easy. I knew that ggplot2 was more than capable of rendering such.

So, my first cut rendering this plot was this:

library(ggplot2)
# First cut: two boxplot layers on even generations only, each with a constant string mapped to colour
q <- ggplot(subset(trainingSetCounts, Generation %% 2 == 0), aes(factor(Generation),NumParentsHigh,color='red')) + geom_boxplot()
q + labs(y = 'Training Set Sizes', x = 'Generation', colour = 'Counts') + geom_boxplot(aes(y=NumParentsLow,color='blue'))

Which generates the following plot:


This was most of the way there and took no time to do. It does a good job of showing the relative distributions, by generation, of the two variables, NumParentsHigh and NumParentsLow. However, there are some obvious problems with this plot. First, the two box-and-whisker plots overlap one another. Second, the legend uses the strings 'red' and 'blue' as labels, which say nothing about which variable is which.
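The legend problem comes from how ggplot2 treats a constant inside aes(): the string is handled as data, so it becomes the legend label while the actual colours come from the default palette. Here is a minimal sketch of that behaviour, assuming ggplot2 is installed; the toy data frame is made up for illustration and only stands in for trainingSetCounts. Swapping in descriptive strings fixes the labels, though the boxes still overlap.

```r
library(ggplot2)

# Toy stand-in for trainingSetCounts (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  Generation     = rep(0:3, each = 5),
  NumParentsHigh = rpois(20, 30),
  NumParentsLow  = rpois(20, 15)
)

# A constant string inside aes() is treated as data: it shows up as the
# legend label, and the colours themselves come from the default palette.
p <- ggplot(toy, aes(factor(Generation))) +
  geom_boxplot(aes(y = NumParentsHigh, colour = "NumParentsHigh")) +
  geom_boxplot(aes(y = NumParentsLow,  colour = "NumParentsLow")) +
  labs(x = "Generation", y = "Training Set Sizes", colour = "Counts")

print(p)
```

The two layers are still drawn on top of each other, because each layer is positioned on its own.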

And so began my merry and hellish journey to the final solution.

First, I knew that ggplot2 had the capability of "dodging" elements between plot layers; i.e., pulling overlapping elements apart to overcome overlaps. I spent some time playing with various permutations of position="dodge" to no avail. I also tried a variety of other ggplot incantations. I was getting very frustrated, especially since my intuition was that there was a simple way to do this given my experience with R.

After a while I realized that ggplot2 would intelligently render the two variables if I could differentiate them in some way on a row by row basis. What I needed to do was to split each row into two: one for the NumParentsHigh and the other for NumParentsLow. So I proceeded to cobble together code to convert the dataframe to a format that contained "Job, Generation, Count, Type" where "Type" would be a factor for "high" and "low" values.

The R Way to do this would be to use something like tapply() or transform(), but those are good at applying a function to groups of values or adding columns, not at splitting rows. They weren't applicable to the job. (Or, if they were, it wasn't obvious to me.)

So I began to write a for() loop to grind through the original dataframe to convert it to the other format. Here's a hint: anytime you are writing a for() loop in R, you are very, very likely Doing It Wrong™. And so I was.
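For what it's worth, the row-splitting described above can be done in base R without a loop. This is only a sketch of one way to do it, not what I actually ran: the toy data frame copies the first three rows of the head() output, and the names high, low, and long are made up for the example.

```r
# Toy stand-in for trainingSetCounts (first three rows of the head() output).
toy <- data.frame(
  Job            = c(0, 0, 0),
  Generation     = c(0, 1, 2),
  NumParentsHigh = c(29, 41, 31),
  NumParentsLow  = c(12, 12, 20)
)

# Build one block of rows per measured column, then stack the two blocks.
# Columns are whole vectors, so no row-by-row loop is needed.
high <- data.frame(Job = toy$Job, Generation = toy$Generation,
                   Count = toy$NumParentsHigh, Type = "high")
low  <- data.frame(Job = toy$Job, Generation = toy$Generation,
                   Count = toy$NumParentsLow,  Type = "low")

long <- rbind(high, low)
long$Type <- factor(long$Type, levels = c("high", "low"))
print(long)
```

Each original row ends up as two rows, one per Type, which is the "Job, Generation, Count, Type" layout I was after.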

After spending an embarrassingly long time trying to coerce various rbind() calls to work correctly inside a for loop, I realized that melt(), which lives in the reshape2 package (reshape in older installations) rather than plyr, would probably do what I wanted. Sure enough, this is the output from such a call:
> head(melt(trainingSetCounts,id=c("Job","Generation")))
  Job Generation       variable value
1   0          0 NumParentsHigh    29
2   0          1 NumParentsHigh    41
3   0          2 NumParentsHigh    31
4   0          3 NumParentsHigh    19
5   0          4 NumParentsHigh    16
6   0          5 NumParentsHigh    34

That is, melt() pulled out the individual values for NumParentsHigh and NumParentsLow by Job and Generation and created two new variables, appropriately enough called "variable" and "value." I now had the pieces I needed to make the final plot.
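# Keep the melted data frame around so the plot can use it
melted.trainingSetCounts <- melt(trainingSetCounts, id=c("Job","Generation"))
p <- ggplot(subset(melted.trainingSetCounts, Generation %% 2 == 0), aes(factor(Generation),value,color=factor(variable,labels=c('Best','Worst'))))
p + labs(y = 'Training Set Sizes', x = 'Generation', colour = 'Counts') + geom_boxplot()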

Which renders the following plot:

Final version of plot

Finally, the two sets of box-and-whisker plots no longer overlap and the legend labels are sensible. And, along the way, I learned a little more about melt() and ggplot2.
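As a footnote, melt() can also name the new columns directly, which gets straight to the "Job, Generation, Count, Type" layout I originally had in mind. A small sketch, assuming the reshape2 package is installed (its melt() takes variable.name and value.name arguments); the toy data frame again copies the first three rows of the head() output and is only for illustration.

```r
library(reshape2)

# Toy stand-in for trainingSetCounts (first three rows of the head() output).
toy <- data.frame(
  Job            = c(0, 0, 0),
  Generation     = c(0, 1, 2),
  NumParentsHigh = c(29, 41, 31),
  NumParentsLow  = c(12, 12, 20)
)

# Name the id and measured columns explicitly, and rename the two
# columns that melt() creates.
long <- melt(toy,
             id.vars       = c("Job", "Generation"),
             measure.vars  = c("NumParentsHigh", "NumParentsLow"),
             variable.name = "Type",
             value.name    = "Count")

# The new factor's levels follow the order of measure.vars; relabel them.
levels(long$Type) <- c("high", "low")
print(long)
```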