me nugget: Maximal Information Coefficient (Part II)

Wednesday, September 17, 2014

Maximal Information Coefficient (Part II)

A while back, I wrote a post simply announcing a recent paper that described a new statistic called the "Maximal Information Coefficient" (MIC), which is able to describe the correlation between paired variables regardless of linear or nonlinear relationship. This turned out to be quite a popular post, and included a lively discussion as to the merits of the work and difficulties in using the software provided by the authors. Regarding the latter, I also had difficulties running the software on R and thus did not include an example. Checking back on this topic, I was pleased to see that an R package had subsequently been developed: minerva: Maximal Information-Based Nonparametric Exploration R package for Variable Analysis (Albanese et al. 2013). Further documentation of the package can be found here: http://minepy.sourceforge.net/

I tried out the package on the baseball data set used in the original paper by Reshef et al. (2011), where a suite of variables are correlated against a baseball player's salary. The author's state in their paper:

"In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of p, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. By contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary."

Here is a summary from the results computed with the function mine() of the minerva package (top 10 ranking MIC coefficients), which reproduces the same results as are shown in the Supplementary table S12 of the original paper:

For a visual representation of these results, the top figure plots MIC vs. Pearson and MIC Rank vs. Pearson Rank. Thanks to minerva author and maintainer M. Filosi for helping in reproducing the example.

References:

Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., & Furlanello, C. (2013). minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3), 407-408. [link]

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., ... & Sabeti, P. C. (2011). Detecting novel associations in large data sets. science, 334(6062), 1518-1524. [link]

Code to reproduce the example:

library(minerva)
 
# Load data
dat <- read.csv("MLB2008.csv") # from http://www.exploredata.net/Downloads/Baseball-Data-Set
x <- dat[,-c(1:3)]
excl <- which(diag(var(x)) < 1e-5) # exclude variables with low variance
x <- x[,-excl]
y <- dat$SALARY
 
# Analysis
M <- mine(x, y=y, alpha=0.7)
P <- cor(x, y=y)
res <- data.frame(MIC = c(M$MIC))
rownames(res) <- rownames(M$MIC)
res$MIC_Rank <- nrow(res) - rank(res$MIC, ties.method="first") + 1
res$Pearson <- P
res$Pearson_Rank <- nrow(res) - rank(abs(res$Pearson), ties.method="first") + 1
res <- res[order(res$MIC_Rank),]
head(res, n=10)
 
# Plot
png("MIC_vs_Pearson.png", width=7.5, height=3.5, res=400, units="in", type="cairo")
op <- par(mfrow=c(1,2), mar=c(4,4,1,1))
plot(MIC ~ abs(Pearson), res, pch=21,  col=4, bg=5)
plot(MIC_Rank ~ Pearson_Rank, res, pch=21, col=4, bg=5)
par(op)
dev.off()