I have used figures like the one above (left) in my work at various times. It present various distributions in the form of a boxplot, and uses differing labels (in this case, the lowercase letters) to denote significant differences; i.e. levels sharing a label are not significantly different. This type of presentation is common when showing changes in organism condition indices over time (e.g. Figs 3 & 4, Bullseye puffer fish in Mexico).
In the example above, a Kruskal-Wallis rank sum test is used to test differences across all levels, followed by pairwise Mann-Whitney rank tests. The result is a matrix of p-values showing significant differences in distribution. So far so good, but it's not always clear how the grouping relationships should be labelled. In this relatively simple example, the tricky part is that level 1 should be grouped with 3 and 5, but 3 and 5 should not be grouped; Therefore, two labeling codes should be designated, with level 1 sharing both. I have wondered, for some time, if there might be some way to do this in an automated fashion using an algorithm. After many attempts on my own, I finally decided to post a question to SO.
So, my first question "Algorithm for automating pairwise significance grouping labels in R" led me to the concept of the "clique cover problem", and "graph theory" in general, via SO user "David Eisenstat". While I didn't completely understand his recommendation at first, it got me pointed in the right direction - I ultimately found the R package igraph for analyzing and plotting these types of problems.
The next questions were a bit more technical. I figured out that I could return the "cliques" of my grouping relationships network using the cliques function of the igraph package, but my original attempt was giving me a list all relationships in my matrix. It was obvious to me that I would need to identify groupings where all levels were fully connected (i.e. each node in the clique connects to all others). So, my next question "How to identify fully connected node clusters with igraph [in R]" got me a tip from SO user "majom", who showed me that these fully connected cliques could be identified by first reordering the starting nodes in my list of connections (before use in the graph.data.frame function), and then subjecting the resulting igraph object to the function maximal.cliques. So, the first suggestions from David were right on, even though they didn't include code. The result shows nicely all those groupings in the above example (right plot) with fully connected cliques [i.e. (1, 3), (1, 5), (2), (4, 6), (7)].
The final piece of the puzzle was more cosmetic - "How to order a list of vectors based on the order of values contained within [in R]". A bit vague, I know, but what I was trying to do was to label groups in a progressive way so that earlier levels received their labels first. I think this leads to more legible labeling, especially when levels represent some process of progression. At the time of this posting, I have received a single negative (-1) vote on this question... This may have to do with the clarity of the question - I seem to have confused some of the respondents based on follow up comments for clarification - or, maybe someone thought I hadn't shown enough effort on my own. There's no way to know without an accompanying comment. In any case, I got a robust approach from SO user "MrFlick", and I can safely say that I would never have come up with such an eloquent solution on my own.
In all, this solution seems to work great. I have tried it out on larger problems involving more levels and it appears to give correct results. Here is an example with 20 levels (a problem that would have been an amazing headache to do manually):
Code to reproduce the example:
library(igraph) # version 1.0.1 ### Synthetic data set set.seed(1) n <- 20 n2 <- 100 mu <- cumsum(runif(n, min=-3, max=3)) sigma <- runif(n, min=1, max=3) dat <- vector(mode="list", n) for(i in seq(dat)){ dat[[i]] <- rnorm(n2, mean=mu[i], sd=sigma[i]) } df <- data.frame(group=as.factor(rep(seq(n), each=n2)), y=unlist(dat)) ### Plot png("boxplot_levels_nolabels.png", width=6, height=4, units="in", res=400, type="cairo") par(mar=c(4,4,1,1)) boxplot(y ~ group, df, notch=TRUE, outline=FALSE) # boxplot of factor level distributions mtext("Levels", side=1, line=2.5) mtext("y", side=2, line=2.5) dev.off() ### Test for significant differences in factor level distributions kruskal.test(y ~ group, df) # Significant differences as determined by Kruskal-Wallis rank sum test pairwise.wilcox.test(df$y, df$g) # Pairwise Wilcoxon (or "Mann-Whitney") rank sum tests between all factor level combinations ### Labeling of factor level groupings mw <- pairwise.wilcox.test(df$y, df$g) # Create matrix showing factor levels that should be grouped g <- as.matrix(mw$p.value > 0.05) # TRUE signifies that pairs of not significantly different at the p < 0.05 level g <- cbind(rbind(NA, g), NA) # make square g <- replace(g, is.na(g), FALSE) # replace NAs with FALSE g <- g + t(g) # not necessary, but make matrix symmetric diag(g) <- 1 # diagonal equals 1 rownames(g) <- 1:n # change row names colnames(g) <- 1:n # change column names g # resulting matrix # Re-arrange data into an "edge list" for use in igraph (i.e. which groups are "connected") - Solution from "David Eisenstat" () same <- which(g==1) g2 <- data.frame(N1=((same-1) %% n) + 1, N2=((same-1) %/% n) + 1) g2 <- g2[order(g2[[1]]),] # Get rid of loops and ensure right naming of vertices g3 <- simplify(graph.data.frame(g2,directed = FALSE)) get.data.frame(g3) # view connections # Plot igraph png("igraph_level_groupings.png", width=5, height=5, units="in", res=400, type="cairo") par(mar=c(3,1,1,1)) V(g3)$color <- "grey" V(g3)$label.color <- "black" V(g3)$size <- 15 plot(g3) # plot all nodes are connections box() mtext("Linked levels are not significantly different \n(Mann-Whitney)", side=1, line=1) dev.off() # Calcuate the maximal cliques - these are groupings where every node is connected to all others cliq <- maximal.cliques(g3) # Solution from "majom" (http://stackoverflow.com/users/1575670/majom) # Reorder by level order - Solution from "MrFlick" (http://stackoverflow.com/users/2372064/mrflick) and "ckluss" ml<-max(sapply(cliq, length)) sf <- function(x) sort(x)[seq_len(ml)] cliq2 <- lapply(cliq, as.numeric) reord <- do.call(order, as.data.frame(do.call(rbind, lapply(cliq2, sf)))) cliq <- cliq[reord] cliq # Generate labels to factor levels lab.txt <- vector(mode="list", n) # empty list lab <- letters[seq(cliq)] # clique labels for(i in seq(cliq)){ # loop to concatenate clique labels for(j in cliq[[i]]){ lab.txt[[j]] <- paste0(lab.txt[[j]], lab[i]) } } # Boxplot with facor level grouping labels png("boxplot_levels_withlabels.png", width=6, height=4, units="in", res=400, type="cairo") par(mar=c(4,4,1,1)) ylim <- range(df$y) + c(0,2) bp <- boxplot(y ~ group, df, notch=TRUE, ylim=ylim, outline=FALSE) # boxplot of factor level distributions text(x=1:n, y=bp$stats[5,], labels=lab.txt, pos=3, col=1, cex=1, font=2) mtext("Levels", side=1, line=2.5) mtext("y", side=2, line=2.5) dev.off() # Combined plot png("combined_boxplot_igraph.png", width=8, height=4, units="in", res=400, type="cairo") layout(matrix(1:2,1,2), widths=c(5,3), heights=4) par(mar=c(4,4,1,1), ps=8) # plot 1 ylim <- range(df$y) + c(0,2) bp <- boxplot(y ~ group, df, notch=TRUE, ylim=ylim, outline=FALSE) # boxplot of factor level distributions text(x=1:n, y=bp$stats[5,], labels=lab.txt, col=1, cex=1, font=2, pos=3) mtext("Levels", side=1, line=2.5) mtext("y", side=2, line=2.5) # plot 2 par(mar=c(4,1,1,1)) V(g3)$color <- "grey" V(g3)$label.color <- "black" V(g3)$size <- 15 plot(g3) # plot all nodes are connections box() mtext("Linked levels are not significantly different \n(Mann-Whitney)", side=1, line=1) dev.off()
Also implemented in package multcompView:
ReplyDeleteAn Algorithm for a Letter-Based Representation of All-Pairwise Comparisons
Hans-Peter Piepho
Journal of Computational and Graphical Statistics, Vol. 13, No. 2 (Jun., 2004), pp. 456-466
So, re-inventing the wheel it seems. I don't have access to the publication - I would be interesting to know if the approach was similar.
DeleteNo, I found your approach very interesting, sorry, I should have said that first.
ReplyDeleteI will email you the paper.
Hmm, why not colour all boxes in the boxplot with the same colour when present in the same network?
ReplyDeleteCheers, Andrej
The complication is when more than one grouping is present per level - maybe hatching with different colors would be easy on the eye. One could also color the labels (e.g. different color per letter).
Deletewith the current igraph version reord <- ... seems not to work: Error in parse_op_args(..., what = "a vertex", is_fun = is_igraph_vs, : Not a vertex sequence
ReplyDeletedo not know if it is a good way, but can be fixed with
ReplyDeletesf <- function(x) sort(x)[seq_len(ml)]
cliq2 <- lapply(cliq, as.numeric)
reord <- do.call(order, as.data.frame(do.call(rbind, lapply(cliq2, sf))))
This looks good to me - The new igraph version give the class "igraph.vs" to the output of the function maximal.cliques. Your conversion to numeric works well. Thanks for figuring this out! Several folks have been interested in a fix to this issue. Cheers, Marc
DeleteIs this a reasonable update?
ReplyDeleteYes, the suggested change from ckluss produces the desired output with the updated igraph package. I've updated the code with this change.
Delete