Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only.  I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations.  For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

triple box plots with patterns

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!

Generating the Data

Due to confidentiality, I cannot use my co-worker’s data set on my public blog, so I generated a data set for my example of pollution in 3 cities involving 2 gases.

# set the working directory to your folder of choice
setwd("INSERT YOUR DIRECTORY PATH HERE")

# set seed for random number generation to allow you to replicate my data
set.seed(1)
# set the standard deviation for the data
sd = 3
# generate fictitious pollution data for New York, London, and Los Angeles
co2.ny = rnorm(120, 7, sd)
co2.london = rnorm(120, 9, sd)
co2.la = rnorm(120, 10.5, sd)
ch4.ny = c(rnorm(100, 12, sd), rep(NA, 20))
ch4.london = c(rnorm(100, 15, sd), rep(NA, 20))
ch4.la = c(rnorm(100, 18, sd), rep(NA, 20))

# set vectors for labelling the data with location and pollutant
location = c('New York', 'London', 'Los Angeles', 'New York', 'London', 'Los Angeles')
pollutant = c('CO2', 'CO2', 'CO2', 'CH4', 'CH4', 'CH4')

# combine the data
all.data = data.frame(rbind(co2.ny, co2.london, co2.la, ch4.ny, ch4.london, ch4.la))

# add locations and pollutants to the data
all.data$location = location
all.data$pollutant = pollutant

I used the set.seed() function to allow you to replicate my randomly generated numbers from various normal distributions.  There are 2 important notes about my code above:

  • I added 20 values of “NA” for CH4 for all 3 cities.  This was done to ensure that the resulting data frame has the same number of observations for all 6 of the vectors.  (Try to combine the data into a data frame without doing this, and look at the resulting data frame – it’s not what we want.)
  • The data frame all.data is sorted in the same order as the vectors are listed in the rbind() function when all.data was created.  As a good data manipulation practice, it is always a good idea to explore your data set; specifically, use the head() function to look at a newly generated data table or matrix and check that it is correct or conforms to the correct format.

 

Stacking the Data

I ultimately need a data table that has 3 columns:

  • the concentration of the pollutant (the value to be plotted)
  • the name of the pollutant
  • the city

Thus, I need to stack the data frame all.data while retaining the name of the pollutant and the city.  The melt() function from the “reshape2” package can do this.  (Install it first if you don’t have it yet.)

 

# open the reshape2 library
library(reshape2)
# stack the data while retaining the location and pollutant label by 
stacked.data = melt(all.data, id = c('location', 'pollutant'))
# remove the column that gives the column name of the concentration from all.data
stacked.data = stacked.data[, -3]

 

If you look at the data table stacked.data when it is first created by melt(), it has the extra, unnecessary column that I deleted in the last line of code.

 

Generating the Box Plots

Now that the data set has the needed format, let’s plot the box plots.  Note that my code below

  • creates an object called boxplots.triple for the box plots that I will use later in the code
  • uses the formula “value ~ location + pollutant” to separate the plots into the 2 groups of triplets
  • uses the “at = ” option to specify the locations of the box plots along the horizontal axis – I experimented with the horizontal positions to find the necessary separation for side-by-side box plots
  • suppresses the default horizontal axis with the option xaxt = ‘n’
  • adds my custom axis and title with the axis() and title() functions, respectively

 

png('triple box plots with patterns.png')
boxplots.triple = boxplot(value~location + pollutant, data = stacked.data, at = c(1, 1.8, 2.6, 6, 6.8, 7.6), xaxt='n', ylim = c(min(0, min(co2.ny, co2.london, co2.la)), max(ch4.ny, ch4.london, ch4.la, na.rm = T)), col = c('white', 'white', 'gray'))
axis(side=1, at=c(1.8, 6.8), labels=c('Methane (ppb)\nNumber of Collections = 100', 'Carbon Dioxide (ppb)\nNumber of Collections = 120'), line=0.5, lwd=0)
title('Comparing Pollution in London, Los Angeles, and New York')

 

Note that the resulting box plot from above gives the grey pattern to the right-most box plot (New York) for each pollutant.  That was easy with the “col = ” option in boxplot().  I now have 2 patterns: white and grey.  A natural third pattern would be stripes, and this is the (moderately) hard part.

The code below draws a rectangle with stripes over the middle (Los Angeles) box plots.  Based on the horizontal locations that I set above and the resulting inter-quartile ranges of the 2 box plots for Los Angeles, I specified where these 2 striped rectangles should be plotted.

Let’s look for the interquartile ranges for Los Angeles.  Recall that I created the object boxplots.triple from the boxplot() function.  One of the outputs from this object is the collection of the 5-number summaries underlying the box plots; another output is the name of each 5-number summary as given by the city-pollutant label.

 

Screen Shot 2014-04-10 at 8.13.50 PM

 

 

 

 

 

 

I now have all of the information needed to draw my 2 striped rectangles for Los Angeles.  (Notice that the order of the cities have been changed from the original order in rbind() to alphabetical order in box plots’ statistics.)

  • The left coordinates for the 2 rectangles are the left coordinates of the middle box plots as given in the “at =” option in boxplot() above.  In my example, they are 1.4 and 6.4.
  • The right coordinates for the 2 rectangles are the right coordinates of the middle box plots as given in the “at =” option in boxplot() above.  In my example, they are 2.2 and 7.2.
  • The bottom coordinates for the 2 rectangles are the 1st quartiles of the middle box plots as shown in boxplots.statistics above.  In my example, they are the 2nd and 5th columns in the 2nd row: 15.713810 and 9.228822.
  • The top coordinates for the 2 rectangles are the 3rd quartiles of the middle box plots as shown in boxplots.statistics above.  In my example, they are the 2nd and 5th columns in the 4th row: 20.109108 and 12.926979.

I encourage you to read the documentation of rect() to learn more about how it works.

# draw the rectangles for Los Angeles, specifying the left, bottom, right, and top positions of the box plots, the density of the lines, and the angles of the lines
rect(c(1.4, 6.4), boxplots.triple$stats[2, c(2, 5)], c(2.2, 7.2), boxplots.triple$stats[4, c(2, 5)], density=12, angle=45)
# add city labels near the box plots
text(c(1, 1.8, 2.6, 6, 6.8, 7.6), c(6, 8, 1, 17.75, 20, 15.5), c('London', 'Los\nAngeles', 'New\nYork', 'London', 'Los\nAngeles', 'New\nYork'))
dev.off()

 

For visual ease, I added the labels of the cities directly below or above each box plot.  I chose not to use a legend because

  • it’s tiring to look back and forth between the box plots and the legend to determine which pattern denotes which city.  If there are only 2 cities, then it’s OK, but 3 or more makes it hard to read.
  • The legend() function does not work with both the “fill = ” option and the “density = ” option; it seems like it can only work with 1 of them.  I searched extensively but unsuccessfully for a solution to this problem.  If anybody knows how to solve this problem, please tell me in the comments!

triple box plots with patterns

While this is a nice outcome and likely useful for many other situations with grouped box plots, this approach does seem cumbersome and inelegant.  Do you have any suggestions for how else to plot side-by-side box plots with patterns?  If so, please leave them in the comments below!

 

22 Responses to Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

  1. Not exactly the same, but your post made me go a little crazy:

    mult.boxplot<-function(list,labflip=TRUE){
    g1levels<-levels(as.factor(list[,1]))
    g2levels<-levels(as.factor(list[,2]))
    listyy<-1:(length(levels(as.factor(list[,1])))*length(levels(as.factor(list[,2]))))

    firstsplit<-(length(g1levels)*length(g2levels))/length(levels(as.factor(list[,1])))
    secondsplit<-(firstsplit/(length(levels(as.factor(list[,2])))))*1

    margins<-(listyy*secondsplit)
    margfunc<-function(l){
    if ((l-1)%%(length(g2levels))==0) {margins<<-c(margins[1:l-1],margins[l:length(margins)]+1)}
    }
    fulllist<-2:length(margins)
    bettermargins<-do.call(rbind,lapply(fulllist,margfunc))
    pick<-length(bettermargins[,1])
    finalmargins<-bettermargins[pick,]
    otherlist<-1:length(margins)
    func<-function(i){
    ( (finalmargins[i]+finalmargins[i+length(levels(as.factor(list[,2])))-1])/2)
    }
    funkl<-function(l){
    if (l==1) l
    else if (l%%length(g2levels)-1==0) l
    }
    bens<-do.call(rbind,lapply(otherlist,funkl))
    oddlist<-subset((1:(2*length(g2levels))),(1:(2*length(g2levels)))%%2==1)

    labelplacement<<-do.call(rbind,lapply(bens,func))
    #png('triple box plots with patterns.png')
    boxplots.triple = boxplot(list[,3]~as.factor(list[,1]) + as.factor(list[,2]),
    data = tester,
    at = finalmargins,
    xaxt='n',
    ylim = c(min(y),
    max(y, na.rm = T)),
    col = c('white', if (length(g2levels)==2){'gray'}
    else if (length(g2levels)==3){c('white','gray')}
    else if (length(g2levels)==4){c('white','gray', 'white')}
    else if (length(g2levels)==5){c('white','gray', 'white','white')}))
    par(las=1)
    axis(side=1, at=labelplacement,
    labels=c(g1levels),
    line=3, lwd=0)
    if (labflip==TRUE) par(las=2)
    axis(side=1, at=finalmargins,
    labels=rep(g2levels,length(g1levels)),
    line=-.5, lwd=0)

    title(paste('Comparing', deparse(substitute(y)), 'across levels of', deparse(substitute(factor1)), 'and', deparse(substitute(factor2))))
    if (length(g2levels)==4 & length(g1levels)==2){
    rect(finalmargins[c(2,6)]-.4, boxplots.triple$stats[2, c(2, 6)], finalmargins[c(2,6)]+.4, boxplots.triple$stats[4, c(2, 6)], density=12, angle=45)
    rect(finalmargins[c(4,8)]-.4, boxplots.triple$stats[2, c(4, 8)], finalmargins[c(4,8)]+.4, boxplots.triple$stats[4, c(4, 8)], density=12, angle=135)
    }
    if (length(g2levels)==4 & length(g1levels)==3){
    rect(finalmargins[c(2,6,10)]-.4, boxplots.triple$stats[2, c(2, 6,10)], finalmargins[c(2,6,10)]+.4, boxplots.triple$stats[4, c(2, 6,10)], density=12, angle=45)
    rect(finalmargins[c(4,8,12)]-.4, boxplots.triple$stats[2, c(4, 8,12)], finalmargins[c(4,8,12)]+.4, boxplots.triple$stats[4, c(4, 8,12)], density=12, angle=135)
    }
    if (length(g2levels)==4 & length(g1levels)==4){
    rect(finalmargins[c(2,6,10,14)]-.4, boxplots.triple$stats[2, c(2, 6,10,14)], finalmargins[c(2,6,10,14)]+.4, boxplots.triple$stats[4, c(2, 6,10,14)], density=12, angle=45)
    rect(finalmargins[c(4,8,12,16)]-.4, boxplots.triple$stats[2, c(4, 8,12,16)], finalmargins[c(4,8,12,16)]+.4, boxplots.triple$stats[4, c(4, 8,12,16)], density=12, angle=135)
    }
    if (length(g2levels)==4 & length(g1levels)==5){
    rect(finalmargins[c(2,6,10,14,18)]-.4, boxplots.triple$stats[2, c(2, 6,10,14,18)], finalmargins[c(2,6,10,14,18)]+.4, boxplots.triple$stats[4, c(2, 6,10,14,18)], density=12, angle=45)
    rect(finalmargins[c(4,8,12,16,20)]-.4, boxplots.triple$stats[2, c(4, 8,12,16,20)], finalmargins[c(4,8,12,16,20)]+.4, boxplots.triple$stats[4, c(4, 8,12,16,20)], density=12, angle=135)
    }
    if (length(g2levels)==3 & length(g1levels)==2){
    rect(finalmargins[c(2,5)]-.4, boxplots.triple$stats[2, c(2, 5)], finalmargins[c(2,5)]+.4, boxplots.triple$stats[4, c(2, 5)], density=12, angle=45)
    }
    if (length(g2levels)==3 & length(g1levels)==3){
    rect(finalmargins[c(2,5,8)]-.4, boxplots.triple$stats[2, c(2, 5,8)], finalmargins[c(2,5,8)]+.4, boxplots.triple$stats[4, c(2, 5,8)], density=12, angle=45)
    }
    if (length(g2levels)==3 & length(g1levels)==4){
    rect(finalmargins[c(2,5,8,11)]-.4, boxplots.triple$stats[2, c(2, 5,8,11)], finalmargins[c(2,5,8,11)]+.4, boxplots.triple$stats[4, c(2, 5,8,11)], density=12, angle=45)
    }
    if (length(g2levels)==3 & length(g1levels)==5){
    rect(finalmargins[c(2,5)]-.4, boxplots.triple$stats[2, c(2, 5,8,11,14)],finalmargins[c(2,5,8,11,14)]+.4, boxplots.triple$stats[4, c(2, 5,8,11,14)], density=12, angle=45)
    }
    if (length(g2levels)==5 & length(g1levels)==2){
    rect(finalmargins[c(2,7)]-.4, boxplots.triple$stats[2, c(2, 7)], finalmargins[c(2,7)]+.4, boxplots.triple$stats[4, c(2, 7)], density=12, angle=45)
    rect(finalmargins[c(4,9)]-.4, boxplots.triple$stats[2, c(4, 9)], finalmargins[c(4,9)]+.4, boxplots.triple$stats[4, c(4, 9)], density=12, angle=135)
    rect(finalmargins[c(5,10)]-.4, boxplots.triple$stats[2, c(5, 10)], finalmargins[c(5,10)]+.4, boxplots.triple$stats[4, c(5, 10)], density=12, angle=45)
    rect(finalmargins[c(5,10)]-.4, boxplots.triple$stats[2, c(5, 10)], finalmargins[c(5,10)]+.4, boxplots.triple$stats[4, c(5, 10)], density=12, angle=135)
    }
    if (length(g2levels)==5 & length(g1levels)==3){
    rect(finalmargins[c(2,7,11)]-.4, boxplots.triple$stats[2, c(2, 7,11)], finalmargins[c(2,7,11)]+.4, boxplots.triple$stats[4, c(2, 7,11)], density=12, angle=45)
    rect(finalmargins[c(4,9,14)]-.4, boxplots.triple$stats[2, c(4, 9,14)], finalmargins[c(4,9,14)]+.4, boxplots.triple$stats[4, c(4, 9,14)], density=12, angle=135)
    rect(finalmargins[c(5,10,15)]-.4, boxplots.triple$stats[2, c(5, 10,15)], finalmargins[c(5,10,15)]+.4, boxplots.triple$stats[4, c(5, 10,15)], density=12, angle=45)
    rect(finalmargins[c(5,10,15)]-.4, boxplots.triple$stats[2, c(5, 10,15)], finalmargins[c(5,10,15)]+.4, boxplots.triple$stats[4, c(5, 10,15)], density=12, angle=135)
    }
    if (length(g2levels)==5 & length(g1levels)==4){
    rect(finalmargins[c(2,7,11,16)]-.4, boxplots.triple$stats[2, c(2, 7,11,16)], finalmargins[c(2,7,11,16)]+.4, boxplots.triple$stats[4, c(2, 7,11,16)], density=12, angle=45)
    rect(finalmargins[c(4,9,14,19)]-.4, boxplots.triple$stats[2, c(4, 9,14,19)], finalmargins[c(4,9,14,19)]+.4, boxplots.triple$stats[4, c(4, 9,14,19)], density=12, angle=135)
    rect(finalmargins[c(5,10,15,20)]-.4, boxplots.triple$stats[2, c(5, 10,15,20)], finalmargins[c(5,10,15,20)]+.4, boxplots.triple$stats[4, c(5, 10,15,20)], density=12, angle=45)
    rect(finalmargins[c(5,10,15,20)]-.4, boxplots.triple$stats[2, c(5, 10,15,20)], finalmargins[c(5,10,15,20)]+.4, boxplots.triple$stats[4, c(5, 10,15,20)], density=12, angle=135)
    }
    if (length(g2levels)==5 & length(g1levels)==5){
    rect(finalmargins[c(2,7,11,16,21)]-.4, boxplots.triple$stats[2, c(2, 7,11,16,21)], finalmargins[c(2,7,11,16,21)]+.4, boxplots.triple$stats[4, c(2, 7,11,16,21)], density=12, angle=45)
    rect(finalmargins[c(4,9,14,19,24)]-.4, boxplots.triple$stats[2, c(4, 9,14,19,24)], finalmargins[c(4,9,14,19,24)]+.4, boxplots.triple$stats[4, c(4, 9,14,19,24)], density=12, angle=135)
    rect(finalmargins[c(5,10,15,20,25)]-.4, boxplots.triple$stats[2, c(5, 10,15,20,25)], finalmargins[c(5,10,15,20,25)]+.4, boxplots.triple$stats[4, c(5, 10,15,20,25)], density=12, angle=45)
    rect(finalmargins[c(5,10,15,20,25)]-.4, boxplots.triple$stats[2, c(5, 10,15,20,25)], finalmargins[c(5,10,15,20,25)]+.4, boxplots.triple$stats[4, c(5, 10,15,20,25)], density=12, angle=135)
    }

    }

    factor2<-rbinom(100,1,.5)
    factor21<-rbinom(100,1,.5)
    factor3<-sample(c('Male','Female','Other'),100,replace=TRUE)
    factor31<-sample(c('Catholic','Christian','Atheist'),100,replace=TRUE)
    factor4<-rep(1:4,25)
    factor41<-sample(c('U.S.','Australia','UK','Canada'),100,replace=TRUE)
    factor5<-rep(1:5,20)
    factor51<-sample(c('Caucasian','Af.Amer','Asian Am.','Latin Am.','Canadian'),100,replace=TRUE)
    y<-rnorm(100,53,19)
    tester<-data.frame(factor2,factor21,factor3,factor31,factor4,factor41,factor5,factor51,y)

    list<-data.frame(factor41,factor5,y)
    mult.boxplot(list)

    • Thanks for your helpful script, Andrew! I ran it and saw a nice plot that accomplishes the same goal.

      Could you please briefly explain what aspect of my code drove you crazy? I recognize that my code may not be very elegant, so I welcome any feedback that you can share.

      Thanks,

      Eric

  2. Oh, no. Nothing about it drove me crazy. I just hate using (and seeing) situation specific code. Whenever I go to do a task I always write a general script that can be reused on a new situation with almost zero work. So when I saw your post I went crazy and had to write a more general script. I’m just now starting to test it with ‘real’ data instead of artificial data posted with it and am fixing it up.

    So thank you for giving me the idea for it. A lot of it I felt like I was reinventing the wheel when it came to box spacing and label placement, but it was definitely a fun little puzzle to figure out how to get it all to work no matter the size of the factors you put in (up to 5 levels, because I just didn’t feel like programming anymore).

    • Hi Andrew,

      Yes, I agree that generalizable code is better than situation-specific code. For this task, I just needed to get the desired plot for my co-worker, and that was good enough for this purpose. While it would have been good to spend more time to write a generalizable script, I learned enough new things to write a blog post about it, and I also don’t want to let this one interesting idea take time away from my other blog posts in development.

      Thus, I really appreciate you taking the time to write this for all of us!

      Eric

      • I absolutely agree with all of your points. I never knew how to make patterns on boxplots, and had never investigated changing plot coordinates, etc, so your post was super informative for me. And I had a lull in the work day to start investigating how to generalize it and really enjoyed doing so, so I thought I would share it. What I posted doesn’t actually work with ‘real’ data in it’s current form. I’ve fixed most of it up and can re-post it when I have some more free time to polish it up.

  3. Andrew Taylor says:

    twoway.box<-function(list,data,group1, group2, y,labflip=TRUE){
    attach(data)
    g1levels<-levels(as.factor(group1))
    g2levels<-levels(as.factor(group2))
    numboxes<-1:(length(levels(as.factor(group1)))*length(levels(as.factor(group2))))

    firstsplit<-(length(g2levels)*length(g1levels))/length(levels(as.factor(group2)))
    secondsplit<-(firstsplit/(length(levels(as.factor(group1)))))*1

    margins<-(numboxes*secondsplit)
    margfunc<-function(l){
    if ((l-1)%%(length(g1levels))==0) {margins<<-c(margins[1:l-1],margins[l:length(margins)]+1)}
    }
    fulllist<-2:length(margins)
    bettermargins<-do.call(rbind,lapply(fulllist,margfunc))
    find.depth<-length(bettermargins[,1])
    finalmargins<-bettermargins[find.depth,]
    otherlist<-1:length(margins)
    find.middle<-function(i){
    ( (finalmargins[i]+finalmargins[i+length(levels(as.factor(group1)))-1])/2)
    }
    get.labplace<-function(l){
    if (l==1) l
    else if (l%%length(g1levels)-1==0) l
    }
    firstlabs<-do.call(rbind,lapply(otherlist,get.labplace))
    oddlist<-subset((1:(2*length(g1levels))),(1:(2*length(g1levels)))%%2==1)

    labelplacement<<-do.call(rbind,lapply(firstlabs,find.middle))
    #png('triple box plots with patterns.png')
    boxplots.triple = boxplot(y~as.factor(group1) + as.factor(group2),
    data = data,
    at = finalmargins,
    xaxt='n',
    ylim = c(min(y),
    max(y, na.rm = T)),
    col = c('white', if (length(g2levels)==2){'gray'}
    else if (length(g1levels)==3){c('white','gray')}
    else if (length(g1levels)==4){c('white','gray', 'white')}
    else if (length(g1levels)==5){c('white','gray', 'white','white')}))
    par(las=1)
    axis(side=1, at=labelplacement,
    labels=c(g2levels),
    line=3, lwd=0)
    if (labflip==TRUE) par(las=2)
    axis(side=1, at=finalmargins,
    labels=rep(g1levels,length(g2levels)),
    line=-.5, lwd=0)

    title(paste('Comparing', deparse(substitute(y)), 'across levels of', deparse(substitute(group1)), 'and', deparse(substitute(group2))))
    if (length(g1levels)==4 & length(g2levels)==2){
    rect(finalmargins[c(2,6)]-.4, boxplots.triple$stats[2, c(2, 6)], finalmargins[c(2,6)]+.4, boxplots.triple$stats[4, c(2, 6)], density=12, angle=45)
    rect(finalmargins[c(4,8)]-.4, boxplots.triple$stats[2, c(4, 8)], finalmargins[c(4,8)]+.4, boxplots.triple$stats[4, c(4, 8)], density=12, angle=135)
    }
    if (length(g1levels)==4 & length(g2levels)==3){
    rect(finalmargins[c(2,6,10)]-.4, boxplots.triple$stats[2, c(2, 6,10)], finalmargins[c(2,6,10)]+.4, boxplots.triple$stats[4, c(2, 6,10)], density=12, angle=45)
    rect(finalmargins[c(4,8,12)]-.4, boxplots.triple$stats[2, c(4, 8,12)], finalmargins[c(4,8,12)]+.4, boxplots.triple$stats[4, c(4, 8,12)], density=12, angle=135)
    }
    if (length(g1levels)==4 & length(g2levels)==4){
    rect(finalmargins[c(2,6,10,14)]-.4, boxplots.triple$stats[2, c(2, 6,10,14)], finalmargins[c(2,6,10,14)]+.4, boxplots.triple$stats[4, c(2, 6,10,14)], density=12, angle=45)
    rect(finalmargins[c(4,8,12,16)]-.4, boxplots.triple$stats[2, c(4, 8,12,16)], finalmargins[c(4,8,12,16)]+.4, boxplots.triple$stats[4, c(4, 8,12,16)], density=12, angle=135)
    }
    if (length(g1levels)==4 & length(g2levels)==5){
    rect(finalmargins[c(2,6,10,14,18)]-.4, boxplots.triple$stats[2, c(2, 6,10,14,18)], finalmargins[c(2,6,10,14,18)]+.4, boxplots.triple$stats[4, c(2, 6,10,14,18)], density=12, angle=45)
    rect(finalmargins[c(4,8,12,16,20)]-.4, boxplots.triple$stats[2, c(4, 8,12,16,20)], finalmargins[c(4,8,12,16,20)]+.4, boxplots.triple$stats[4, c(4, 8,12,16,20)], density=12, angle=135)
    }
    if (length(g1levels)==3 & length(g2levels)==2){
    rect(finalmargins[c(2,5)]-.4, boxplots.triple$stats[2, c(2, 5)], finalmargins[c(2,5)]+.4, boxplots.triple$stats[4, c(2, 5)], density=12, angle=45)
    }
    if (length(g1levels)==3 & length(g2levels)==3){
    rect(finalmargins[c(2,5,8)]-.4, boxplots.triple$stats[2, c(2, 5,8)], finalmargins[c(2,5,8)]+.4, boxplots.triple$stats[4, c(2, 5,8)], density=12, angle=45)
    }
    if (length(g1levels)==3 & length(g2levels)==4){
    rect(finalmargins[c(2,5,8,11)]-.4, boxplots.triple$stats[2, c(2, 5,8,11)], finalmargins[c(2,5,8,11)]+.4, boxplots.triple$stats[4, c(2, 5,8,11)], density=12, angle=45)
    }
    if (length(g1levels)==3 & length(g2levels)==5){
    rect(finalmargins[c(2,5)]-.4, boxplots.triple$stats[2, c(2, 5,8,11,14)],finalmargins[c(2,5,8,11,14)]+.4, boxplots.triple$stats[4, c(2, 5,8,11,14)], density=12, angle=45)
    }
    if (length(g1levels)==5 & length(g2levels)==2){
    rect(finalmargins[c(2,7)]-.4, boxplots.triple$stats[2, c(2, 7)], finalmargins[c(2,7)]+.4, boxplots.triple$stats[4, c(2, 7)], density=12, angle=45)
    rect(finalmargins[c(4,9)]-.4, boxplots.triple$stats[2, c(4, 9)], finalmargins[c(4,9)]+.4, boxplots.triple$stats[4, c(4, 9)], density=12, angle=135)
    rect(finalmargins[c(5,10)]-.4, boxplots.triple$stats[2, c(5, 10)], finalmargins[c(5,10)]+.4, boxplots.triple$stats[4, c(5, 10)], density=12, angle=45)
    rect(finalmargins[c(5,10)]-.4, boxplots.triple$stats[2, c(5, 10)], finalmargins[c(5,10)]+.4, boxplots.triple$stats[4, c(5, 10)], density=12, angle=135)
    }
    if (length(g1levels)==5 & length(g2levels)==3){
    rect(finalmargins[c(2,7,11)]-.4, boxplots.triple$stats[2, c(2, 7,11)], finalmargins[c(2,7,11)]+.4, boxplots.triple$stats[4, c(2, 7,11)], density=12, angle=45)
    rect(finalmargins[c(4,9,14)]-.4, boxplots.triple$stats[2, c(4, 9,14)], finalmargins[c(4,9,14)]+.4, boxplots.triple$stats[4, c(4, 9,14)], density=12, angle=135)
    rect(finalmargins[c(5,10,15)]-.4, boxplots.triple$stats[2, c(5, 10,15)], finalmargins[c(5,10,15)]+.4, boxplots.triple$stats[4, c(5, 10,15)], density=12, angle=45)
    rect(finalmargins[c(5,10,15)]-.4, boxplots.triple$stats[2, c(5, 10,15)], finalmargins[c(5,10,15)]+.4, boxplots.triple$stats[4, c(5, 10,15)], density=12, angle=135)
    }
    if (length(g1levels)==5 & length(g2levels)==4){
    rect(finalmargins[c(2,7,11,16)]-.4, boxplots.triple$stats[2, c(2, 7,11,16)], finalmargins[c(2,7,11,16)]+.4, boxplots.triple$stats[4, c(2, 7,11,16)], density=12, angle=45)
    rect(finalmargins[c(4,9,14,19)]-.4, boxplots.triple$stats[2, c(4, 9,14,19)], finalmargins[c(4,9,14,19)]+.4, boxplots.triple$stats[4, c(4, 9,14,19)], density=12, angle=135)
    rect(finalmargins[c(5,10,15,20)]-.4, boxplots.triple$stats[2, c(5, 10,15,20)], finalmargins[c(5,10,15,20)]+.4, boxplots.triple$stats[4, c(5, 10,15,20)], density=12, angle=45)
    rect(finalmargins[c(5,10,15,20)]-.4, boxplots.triple$stats[2, c(5, 10,15,20)], finalmargins[c(5,10,15,20)]+.4, boxplots.triple$stats[4, c(5, 10,15,20)], density=12, angle=135)
    }
    if (length(g1levels)==5 & length(g2levels)==5){
    rect(finalmargins[c(2,7,11,16,21)]-.4, boxplots.triple$stats[2, c(2, 7,11,16,21)], finalmargins[c(2,7,11,16,21)]+.4, boxplots.triple$stats[4, c(2, 7,11,16,21)], density=12, angle=45)
    rect(finalmargins[c(4,9,14,19,24)]-.4, boxplots.triple$stats[2, c(4, 9,14,19,24)], finalmargins[c(4,9,14,19,24)]+.4, boxplots.triple$stats[4, c(4, 9,14,19,24)], density=12, angle=135)
    rect(finalmargins[c(5,10,15,20,25)]-.4, boxplots.triple$stats[2, c(5, 10,15,20,25)], finalmargins[c(5,10,15,20,25)]+.4, boxplots.triple$stats[4, c(5, 10,15,20,25)], density=12, angle=45)
    rect(finalmargins[c(5,10,15,20,25)]-.4, boxplots.triple$stats[2, c(5, 10,15,20,25)], finalmargins[c(5,10,15,20,25)]+.4, boxplots.triple$stats[4, c(5, 10,15,20,25)], density=12, angle=135)
    }
    detach(data)
    }
    test.data<-function(){
    factor2<-rbinom(100,1,.5)
    factor21<-rbinom(100,1,.5)
    factor3<-sample(c('Male','Female','Other'),100,replace=TRUE)
    factor31<-sample(c('Catholic','Christian','Atheist'),100,replace=TRUE)
    factor4<-rep(1:4,25)
    factor41<-sample(c('U.S.','Australia','UK','Canada'),100,replace=TRUE)
    factor5<-rep(1:5,20)
    factor51<-sample(c('Caucasian','Af.Amer','Asian Am.','Latin Am.','Canadian'),100,replace=TRUE)
    y<-rnorm(100,53,19)
    tester<<-data.frame(factor2,factor21,factor3,factor31,factor4,factor41,factor5,factor51,y)
    }
    #test.data()
    #twoway.box(list=datas,labflip=TRUE,data=tester,group1=factor41,group2=factor4,y=y)

    • Andrew Taylor says:

      This should work, and work correctly with real data.

      It now takes your data and variables in a normal way. Also, and much more importantly, it is now plotting the data correctly. The earlier version I had mentally flipped what was getting subgrouped, and so it was blatantly wrong to disastrous effects.

    • Thanks for your dedicated efforts to getting it right, Andrew! This is much appreciated!

      • Thank you for working out how to do this in R and sharing your code. The ‘list’ should have been deleted in the arguments list and call; I wanted to send it before I left work and got sloppy.

        Eh, I got paid to do it during a lull in my work, so thank Henry Ford Hospital, I guess. I hadn’t run into your blog before, but I’m definitely subscribed now, so keep up the informative information, please 🙂

  4. Colin says:

    Hello,

    I really like the design of this plot! However, I am new to R and seem to be missing the final step where the boxplot appears in my Plots window in R Studio. Is there a final line of code that I need to show this plot? Any help is appreciated!!

    • Hi Colin,

      My code will NOT show the plot in the plot window; instead, it’s exported to my working directory by png() and dev.off(). If you don’t specify a file path for the file name within png(), then it’ll be exported to the working directory. If you want to export the image to a folder different from the working directory, then write the full directory path of the intended folder before the file name.

      If you want to see the plot in the plot window in RStudio, don’t run the lines containing png() and dev.off() – just run the lines within it.

      Does this make sense?

      • Colin says:

        Awesome!! Thanks! Incredible work with this by the way. I realize there was a comment that it is just for a specific dataset, but this setup is something I’ll be dealing with quite a bit, so you may see some very familiar boxplots in some upcoming publications!

      • I’m glad that this was useful to you, Colin! You’re welcome to use this code however you like; attribution would be appreciated whenever possible!

      • Colin says:

        Hi Eric,

        Of course! How would you like your name to appear (e.g. which credentials? full name?)?

      • Hi Colin,

        Thanks for doing this! I would much appreciate you

        – stating my name and credential as “Eric Cai, blogger at The Chemical Statistician”.

        – mentioning the name of the blog post with a link to it.

        I understand that citing blogs is unusual for formal publications, especially if they are academic ones, so please write the attribution wherever is convenient. A good place to do so could be the acknowledgments section or any supplementary web site containing the R script for your analysis.

        Thanks again!

  5. ola.hein@web.de says:

    Oh wow this is insane! I was searching for exactly this kind of boxplot, but both codes look way too complicated for me, since I am a beginner in R. But I think it’s worth to try to figure out how to use them for my data.Thank you very much for providing them!

  6. Hi Eric, worked through this today. Thank you. Still wish it was able to expand to more cities and more types of pollution. I guess that means developing a whole R package -SuperGroupedBoxPlots! http://paynito.github.io/r,/knitr/2016/09/01/grouped-boxplots.html

  7. jkdel says:

    I use following function to get a vector of positions for both the labels on the x axis and the boxplots:

    grouped_box_positions <- function(var, covar, boxwex=0.75) {
    v <- length(unique(var))
    c <- length(unique(covar))
    n <- c*v
    center <- median(1:v)
    pos_xlab <- c()
    for (i in 1:c) {
    pos_xlab <- c(pos_xlab,center+v*(i-1))
    }
    pos_group_1 <- c()
    for (j in 1:v) {
    pos_group_1 <- c(pos_group_1,center-boxwex*(center-j))
    }
    pos_bars <- c(pos_group_1)
    for (i in 2:c) {
    pos_bars <- c(pos_bars,pos_group_1+v*(i-1))
    }
    return(list(bars=pos_bars,xlab=pos_xlab))
    }

    Then you can use it like so:
    pos <- grouped_box_positions(df$Var1,df$Var2)
    boxplot(Outcome~Var1+Var2, data=df, xaxt = "n", ylab="Outcome", at=pos$bars)
    axis(1,at=pos$xlab, labels=c("Treatment A","Treatment B","Controls"))

    The boxwex argument taken by the function grouped_pox_position is the same that can be provided to the boxplot function.

  8. jkdel says:

    Here is a gist containing a full example and some corrections: https://gist.github.com/jkdel/0f4fbe409c8a6a413aba23488a29650c

Your thoughtful comments are much appreciated!