Comments on me nugget: Maximal Information Coefficient (MIC)

Michael Clark has a simulation study that compares...

2014-09-19T02:57:11.867+02:00

Michael Clark has a simulation study that compares these dependence measures. It may be a little artificial, and using data with an actual application in mind would be nice, but I find it interesting nonetheless.

http://www3.nd.edu/~mclark19/learn/CorrelationComparison.pdf

Hoeffding's D has also been used successfully ...

2012-01-29T02:12:44.946+01:00

Hoeffding's D has also been used successfully to study gene expression profiles, which is one of Reshef's example applications. It will be interesting to see how MIC compares with D.

http://www.mendeley.com/research/comparing-pearson-spearman-and-hoeffdings-d-measure-for-gene-expression-association-analysis/

I had a look into the published paper. It does not...

2012-01-09T02:01:09.271+01:00

I had a look at the published paper. It does not contain much technical information about the algorithm, but the supplemental material (which is free) does. The method seems very ad hoc, with a lot of technicalities, and it would be hard to develop an independent implementation based on the published material. After reading the paper, I think the method is basically an application of the mutual information concept from information theory, combined with some complex binning and aggregation methods (e.g. quantization).
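To make the "mutual information plus binning" reading concrete, here is a minimal sketch of a plain binned mutual information estimate in R. It is not MIC and does not attempt the binning and aggregation steps the comment refers to; the function name `binned_mi`, the equal-frequency binning, and the default of 4 bins are all assumptions chosen for illustration.

```r
# Naive mutual information estimate (in bits) from equal-frequency bins.
# Illustration only: NOT the MIC algorithm from the paper.
binned_mi <- function(x, y, bins = 4) {
  n <- length(x)
  stopifnot(length(y) == n)
  # Assign each value to one of `bins` equal-frequency bins via its rank.
  to_bin <- function(v) {
    b <- ceiling(rank(v, ties.method = "first") * bins / n)
    factor(b, levels = seq_len(bins))
  }
  pxy <- table(to_bin(x), to_bin(y)) / n   # joint cell probabilities
  px <- rowSums(pxy)                       # marginal of x
  py <- colSums(pxy)                       # marginal of y
  expected <- outer(px, py)                # joint probabilities under independence
  nz <- pxy > 0                            # skip empty cells (0 * log 0 = 0)
  sum(pxy[nz] * log2(pxy[nz] / expected[nz]))
}

set.seed(1)
x <- runif(500, -1, 1)
mi_dependent   <- binned_mi(x, x^2 + rnorm(500, sd = 0.05))  # noisy parabola
mi_independent <- binned_mi(x, runif(500))                   # unrelated noise
```

With equal-frequency bins, the estimate is bounded above by log2(bins), so values are only comparable across runs that use the same number of bins.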

@Frank Harrell, on their page http://www.exploredat...

2011-12-27T16:05:21.553+01:00

@Frank Harrell, on their page http://www.exploredata.net/Technical-information, there is a link to a reprint of the article. They have some impressive names in their acknowledgment list, but a comparison to Hoeffding would have been nice.

No source code, just a non-free (CC Non-Commercial...

2011-12-24T09:51:47.866+01:00

No source code, just a non-free (CC Non-Commercial) wodge of Java byte code, and an R wrapper that calls the Java. If anyone thinks this is a useful technique, then have a go at re-implementing it from the published paper.

Is it not telling that this was published in a non...

2011-12-23T16:37:23.455+01:00

Is it not telling that this was published in a non-statistical journal whose statistical peer review we are unsure of? This problem was solved by Hoeffding in 1948 (Annals of Mathematical Statistics 19:546), who developed a straightforward algorithm requiring neither binning nor multiple steps. Hoeffding's work was not even referenced in the Science article (according to the references in the online supplement; I don't have access to the main article). This has been in the R hoeffd function in the Hmisc package for many years. Here's an example (type example(hoeffd) in R):

# Hoeffding's test can detect even one-to-many dependency
library(Hmisc)  # provides hoeffd()
set.seed(1)
x <- seq(-10, 10, length.out = 200)
y <- x * sign(runif(200, -1, 1))  # randomly flip the sign of each point
plot(x, y)  # an X
hoeffd(x, y)

D
     x    y
x 1.00 0.06
y 0.06 1.00

n= 200

P
  x y
x   0   # P-value is very small
y 0

hoeffd uses an efficient Fortran implementation of Hoeffding's method. The basic idea of his test is to consider the difference between joint ranks of X and Y and the product of the marginal rank of X and the marginal rank of Y, suitably scaled.

Frank Harrell
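The idea described in the comment above can be sketched in base R. This is a minimal O(n^2) illustration of the usual textbook form of Hoeffding's D, scaled by 30 so that a perfect monotone relationship without ties gives D = 1. It is not the Fortran code behind hoeffd, it computes only the statistic (no P-value), and the name `hoeffding_d` is chosen here purely for illustration.

```r
# Hoeffding's D from ranks (statistic only; illustrative, not Hmisc's code).
hoeffding_d <- function(x, y) {
  n <- as.numeric(length(x))
  stopifnot(length(y) == n, n >= 5)
  r <- rank(x)  # marginal ranks of x (ties get average ranks)
  s <- rank(y)  # marginal ranks of y
  # Bivariate rank: 1 + number of points below point i in both x and y.
  # A tie in one coordinate counts 1/2, a tie in both counts 1/4.
  q <- vapply(seq_along(x), function(i) {
    lx <- x < x[i]; ex <- x == x[i]
    ly <- y < y[i]; ey <- y == y[i]
    1 + sum(lx & ly) + 0.5 * sum(ex & ly) + 0.5 * sum(lx & ey) +
      0.25 * (sum(ex & ey) - 1)  # subtract 1 to exclude point i itself
  }, numeric(1))
  d1 <- sum((q - 1) * (q - 2))
  d2 <- sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
  d3 <- sum((r - 2) * (s - 2) * (q - 1))
  30 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) /
    (n * (n - 1) * (n - 2) * (n - 3) * (n - 4))
}

# Same X-shaped data as in the comment, for comparison with Pearson's r.
set.seed(1)
x <- seq(-10, 10, length.out = 200)
y <- x * sign(runif(200, -1, 1))
d_x_shape <- hoeffding_d(x, y)
r_x_shape <- cor(x, y)
```

The joint rank `q` is compared against the marginal ranks `r` and `s`, which is the "joint ranks versus product of marginal ranks" contrast the comment describes.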

Can you explain what you did to get one of the pro...

2011-12-21T02:00:33.107+01:00

Can you explain what you did to get one of the provided files (e.g. MLB2008.csv) to run? I don't see any character variable in the records. Can you explain further what is meant?

Problem solved. In data set cannot be any characte...

2011-12-20T08:19:17.296+01:00

Problem solved. The data set cannot contain any character variables.

Looks like they've posted supplemental materia...

2011-12-20T04:10:14.061+01:00

Looks like they've posted supplemental material with pseudocode here:
http://www.sciencemag.org/content/334/6062/1518/suppl/DC1

Same here, by the way. Running the code in a comm...

2011-12-20T01:02:53.417+01:00

Same here, by the way. Running the code from the command line shows a Java formatting error in how it handles the first data point, if I read it correctly. I sent a message about this but have yet to get a reply. The same error shows up under Windows Vista and Ubuntu Linux.

Yep, I've changed filenames to one of their ex...

2011-12-19T22:12:15.475+01:00

Yep, I've changed the filename to one of their examples, but it's still not working. Same problem.

Thanks for all the comments clarifying the example...

2011-12-19T20:49:15.508+01:00

Thanks for all the comments clarifying the example. The link to the Perspectives article in Science (2nd comment post) provides a nice background into the significance of the work and its future prospects. Thanks for adding that. - Marc

I had the same problem, until I realized that the ...

2011-12-19T20:03:03.006+01:00

I had the same problem, until I realized that the instructions are poorly worded. The statement:

MINE("example.csv","all.pairs")

means you need to use the PATH to YOUR CSV file. There is no "example.csv". You can download one of theirs, put it in your working directory, and then use "yourfilehere.csv" where it says "example.csv". PS: it is REALLY just a shell; it writes its output to a CSV file. It could be expanded to work with data frames or matrices and to return output in the same form.
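The data frame idea mentioned above might be sketched as follows. This is only an illustration: `write_numeric_csv` is a hypothetical helper, not part of the MINE distribution. It assumes, based on comments in this thread, that MINE() takes a path to a CSV file and that the file must not contain character variables. Whether MINE expects variables in rows or in columns is not addressed here and should be checked against its documentation. The MINE() call itself is left commented out because it needs the authors' R wrapper and Java code.

```r
# Hypothetical helper (not part of MINE): keep only the numeric columns of a
# data frame and write them to a CSV file whose path can be passed to MINE().
write_numeric_csv <- function(df, path = tempfile(fileext = ".csv")) {
  is_num <- vapply(df, is.numeric, logical(1))
  if (!any(is_num)) stop("no numeric columns to write")
  dropped <- names(df)[!is_num]
  if (length(dropped) > 0) {
    message("Dropping non-numeric columns: ", paste(dropped, collapse = ", "))
  }
  write.csv(df[, is_num, drop = FALSE], file = path, row.names = FALSE)
  path  # return the file path so it can be handed on
}

# Made-up demo data with one character column.
demo <- data.frame(team = c("A", "B", "C"),
                   wins = c(90, 81, 75),
                   runs = c(800, 720, 690))
csv_path <- write_numeric_csv(demo)

# source("MINE.r")             # assumed file name for the authors' R wrapper
# MINE(csv_path, "all.pairs")  # same call form as quoted in the comment above
```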

I don't know if you have similar problem but w...

2011-12-19T18:15:09.780+01:00

I don't know if you have a similar problem, but when I run that R code, my R session just shuts down...

...we need extensions of MIC(X,Y) to MIC(X,Y|Z). W...

2011-12-19T17:35:58.661+01:00

...we need extensions of MIC(X,Y) to MIC(X,Y|Z). We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.

http://www.sciencemag.org/content/334/6062/1502.full

I think this is the real issue to be solved!

Nice, but only works for bivariate data which I th...

2011-12-19T16:53:08.273+01:00

Nice, but it only works for bivariate data, which I think is a strong limitation.