me nugget: 2011

Monday, December 19, 2011

Maximal Information Coefficient (MIC)

Pearson r correlation coefficients for various distributions of paired data (Credit: Denis Boigelot, Wikimedia Commons)

A paper published this week in Science outlines a new statistic called the maximal information coefficient (MIC), which can describe the strength of association between paired variables equally well whether the relationship is linear or nonlinear. In other words, just as Pearson's r gives a measure of the noise surrounding a linear regression, MIC should give similar scores to equally noisy relationships regardless of type.

Maximum Covariance Analysis (MCA)

Maximum Covariance Analysis (MCA) (Mode 1; scaled) of Sea Level Pressure (SLP) and Sea Surface Temperature (SST) monthly anomalies for the region between 180°W and 70°W and between 30°N and 30°S. MCA coefficients (scaled) are below. The mode represents 94% of the squared covariance fraction (SCF).

Maximum Covariance Analysis (MCA) is similar to Empirical Orthogonal Function Analysis (EOF) in that they both deal with the decomposition of a covariance matrix. In EOF, this is a covariance matrix based on a single spatio-temporal field, while MCA is based on the decomposition of a "cross-covariance" matrix derived from two fields.

Another aspect of speeding up loops in R

Any frequent reader of R-bloggers will have come across several posts concerning the optimization of code - in particular, the avoidance of loops.

Here's another aspect of the same issue. If you have experience programming in other languages besides R, this is probably a no-brainer, but for laymen, like myself, the following example was a total surprise. Basically, every time you change the size of an object in R, R has to allocate new memory for it and copy the contents over - and this takes some time. It's not necessarily a lot of time, but if you have to do it during every iteration of a loop, it can really slow things down.

The following example shows three versions of a loop that creates random numbers and stores those numbers in a results object. The first example (Ex. 1) demonstrates the wrong approach, which is to concatenate the results onto the results object ("x"), thereby changing the size of x in every iteration. The second approach (Ex. 2) is about 150x faster - x is defined as an empty matrix containing NAs, which is gradually filled (by row) during each iteration. The third example (Ex. 3) shows another possibility if one does not know what the size of the results from each iteration will be. An empty list is created with a length equal to the number of iterations. The elements of the list are then gradually filled with the loop results. Again, this is at least 150x faster than Ex. 1 (and I'm actually surprised to see that it may even be faster than Ex. 2).
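A minimal sketch of the three approaches could look like the following. It is not the original code from the post: the number of iterations (n), the 10 random numbers drawn per iteration, and the use of rbind() for the "growing" version are all assumptions made for illustration, and the measured speed-up will depend on n and on the machine.

```r
n <- 2000  # number of loop iterations (arbitrary, kept small for a quick run)

# Ex. 1: grow the results object in every iteration (slow)
set.seed(1)
t1 <- system.time({
  x1 <- NULL
  for (i in seq_len(n)) x1 <- rbind(x1, rnorm(10))
})

# Ex. 2: fill a preallocated matrix of NAs row by row
set.seed(1)
t2 <- system.time({
  x2 <- matrix(NA_real_, nrow = n, ncol = 10)
  for (i in seq_len(n)) x2[i, ] <- rnorm(10)
})

# Ex. 3: fill a preallocated list, then combine the elements once at the end
set.seed(1)
t3 <- system.time({
  x3 <- vector("list", n)
  for (i in seq_len(n)) x3[[i]] <- rnorm(10)
  x3 <- do.call(rbind, x3)
})

# elapsed time in seconds for each version
print(c(grow = t1[["elapsed"]], matrix = t2[["elapsed"]], list = t3[["elapsed"]]))
```

All three versions produce the same n x 10 matrix; only the way memory is handled differs.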

Define intermediate color steps for colorRampPalette

The following function, color.palette(), is a wrapper for colorRampPalette() and allows some increased flexibility in defining the spacing between main color levels. One defines both the main color levels (as with colorRampPalette) and an optional vector containing the number of color levels that should be put in between at equal distances.

The above figure shows the effect on a color scale (see image.scale) containing 5 main colors (blue, cyan, white, yellow, and red). The result of colorRampPalette (upper) produces an equal number of levels between the main colors. By increasing the number of intermediate colors between blue-cyan and yellow-red (lower), the number of color levels in the near-white range is reduced. The resulting palette, for example, was better at highlighting the positive and negative values of an Empirical Orthogonal Function (EOF) mode.
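To illustrate the idea, here is a rough sketch of how such a wrapper might be written. This is not the original color.palette() function: the name color_palette_sketch() and its arguments are hypothetical, and the approach (expanding the vector of main colors with evenly spaced intermediate colors before handing it to colorRampPalette()) is just one way to get the described behavior.

```r
# Hypothetical helper, not the original color.palette():
# 'steps' are the main colors, 'n.steps.between' gives the number of
# intermediate levels to insert between each neighboring pair of main colors.
color_palette_sketch <- function(steps, n.steps.between = NULL, ...) {
  if (is.null(n.steps.between)) n.steps.between <- rep(0, length(steps) - 1)
  stopifnot(length(n.steps.between) == length(steps) - 1)
  segs <- lapply(seq_along(n.steps.between), function(i) {
    # levels from steps[i] to steps[i + 1], both end points included;
    # the first level is dropped so that main colors are not duplicated
    colorRampPalette(steps[i:(i + 1)])(n.steps.between[i] + 2)[-1]
  })
  expanded <- c(steps[1], unlist(segs))
  colorRampPalette(expanded, ...)  # returns a function, like colorRampPalette()
}

main_cols <- c("blue", "cyan", "white", "yellow", "red")
pal_even   <- colorRampPalette(main_cols)
pal_uneven <- color_palette_sketch(main_cols, n.steps.between = c(10, 1, 1, 10))

# 27 levels: most of them now fall in the blue-cyan and yellow-red ranges
cols <- pal_uneven(27)
```

Because the result is an ordinary palette function, it can be used wherever colorRampPalette() output is accepted, e.g. image(volcano, col = pal_uneven(27)).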

Empirical Orthogonal Function (EOF) Analysis for gappy data

[Updates]: The following approach has serious shortcomings, which I have recently become aware of. In a comparison of gappy EOF approaches, Taylor et al. (2013) [pdf] show that this traditional approach is not as accurate as others. Specifically, the approach of DINEOF (Data Interpolating Empirical Orthogonal Functions) proved to be the most accurate. I have outlined the DINEOF algorithm in another post [link] and show a comparison of gappy EOF methods here: http://menugget.blogspot.de/2014/09/pca-eof-for-data-with-missing-values.html. The R package "sinkr" now contains a version of the function ("eof") for easy installation: https://github.com/menugget/sinkr

-----------------

The following is a function for the calculation of Empirical Orthogonal Functions (EOF). For those who come from a more biologically-oriented background and are familiar with Principal Component Analysis (PCA), the methods are similar. In the climate sciences the method is usually used for the decomposition of a data field into dominant spatial-temporal modes.

Propagation of error

     At the outset, this was strictly an exercise of my own curiosity and I didn't imagine writing this down in any form at all. As someone who has done some modelling work in the past, I'm embarrassed to say that I had never fully grasped how one can gauge the error of a model output without having to do some sort of Monte Carlo simulation whereby the model parameters are repeatedly randomized within a given confidence interval. It's relatively easy to imagine that a model containing many parameters, each with an associated error, will tend to propagate these errors throughout. Without getting too far over my head here, I will just say that there are defined methods for calculating the error of a variable if one knows the underlying error of the functions that define it (and I have tried out only a very simple one here!).
     In the example below, I have three main variables (x, y, and z) and two functions that define the relationships y~x and z~y. The question is, given these functions, what would be the error of a predicted z value given an initial x value? The simplest rule, which holds when the two errors are independent, seems to be:
     error(z~x)^2 = error(y~x)^2 + error(z~y)^2
However, correlated errors require additional terms (see Wikipedia: Propagation of uncertainty). The following example does just that by simulating correlated error terms using the MASS package's function mvrnorm().
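As a small illustration of the extra term, the sketch below simulates two correlated error terms with mvrnorm() and compares their combined spread with both rules. It is not the example from the post: the standard deviations, the correlation, and the assumption that the two errors simply add (i.e. z responds one-to-one to an error in y) are all chosen for illustration. Under that assumption the variance of the sum is var1 + var2 + 2*cov12.

```r
library(MASS)  # for mvrnorm()

sd_yx <- 2    # assumed error (standard deviation) of the y~x relationship
sd_zy <- 3    # assumed error (standard deviation) of the z~y relationship
rho   <- 0.5  # assumed correlation between the two error terms

# covariance matrix of the two error terms
cov12 <- rho * sd_yx * sd_zy
Sigma <- matrix(c(sd_yx^2, cov12,
                  cov12,   sd_zy^2), nrow = 2, byrow = TRUE)

# empirical = TRUE forces the sample covariance to match Sigma exactly
set.seed(1)
e <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)

err_simulated       <- sd(e[, 1] + e[, 2])                  # combined error from the simulation
err_independent     <- sqrt(sd_yx^2 + sd_zy^2)              # rule that ignores the covariance
err_with_covariance <- sqrt(sd_yx^2 + sd_zy^2 + 2 * cov12)  # rule with the covariance term

print(c(simulated = err_simulated,
        independent = err_independent,
        with_covariance = err_with_covariance))
```

With a positive correlation, the independent-error rule underestimates the combined error; with a negative correlation it would overestimate it.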

example:

Converting values to color levels

Adding color to a plot is helpful in many situations for visualizing an additional dimension of the data. Again, I wrote the below function "val2col" in R after having coded this manually over and over in the past. It takes arguments similar to those of the image function in that one defines the colors to be used as well as optional break points or z-limits by which the data is binned into those colors. The typical situation where I use the function is with the mapping of climate data, yet the addition of color to an XY plot can often be easier on the eyes than adding an additional z-axis spatial dimension. In combination with the image.scale function, which I previously posted, the data can be quickly interpreted.
As an example, gridded sea level pressure is plotted above as projected polygons using the packages maps and mapproj. Values were converted to colors with the val2col function and the image.scale function plotted a corresponding scale. For those interested in using NetCDF files, the example also uses the ncdf package for reading the data files into R.
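The binning step itself is short. The following is only a sketch of the idea and not the val2col function from the post; the name val_to_col_sketch() and its defaults are assumptions. Values are cut into as many bins as there are colors, either at user-supplied breaks or at equally spaced breaks across zlim.

```r
# Hypothetical stand-in for val2col(): map numeric values to colors
val_to_col_sketch <- function(z, zlim = range(z, na.rm = TRUE),
                              col = heat.colors(12), breaks = NULL) {
  if (is.null(breaks)) {
    # equally spaced breaks across the z-limits, one more break than colors
    breaks <- seq(zlim[1], zlim[2], length.out = length(col) + 1)
  }
  stopifnot(length(breaks) == length(col) + 1)
  # bin index of each value; values outside the breaks become NA
  bin <- cut(z, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  col[bin]
}

z <- c(0, 5, 10)
z_cols <- val_to_col_sketch(z, col = c("blue", "white", "red"))
print(z_cols)
```

The returned vector can be passed directly to the col argument of points(), polygon() and similar plotting functions.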

Adding a scale to an image plot

[NOTE: new version of the image.scale function can be found here: http://menugget.blogspot.de/2013/12/new-version-of-imagescale-function.html.]

Here's a function that allows you to add a color scale legend to an image plot (or probably any plot needing a z-level scale). I found myself having to program this over and over again, and just decided to make a plotting function for future use. While I really like the look of levelplot(), the modular aspect of image() makes it much more handy to combine with other plotting commands or overlays.
For example, as far as I can tell, the simple addition of the triangle symbol to mark the highest point in the above map of Maunga Whau volcano is not possible with levelplot.
After adding this symbol, the function below - image.scale() - was used to add the accompanying color scale to another area of the device.
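For readers who just want the general idea, here is a rough base-graphics sketch of the same workflow: an image() of the volcano data, a triangle at the highest point, and a simple color scale in a second panel. It is not the image.scale() function from this post; the function name draw_map_with_scale(), the number of color levels, and the layout proportions are all assumptions.

```r
z <- volcano                                     # Maunga Whau elevation matrix (datasets package)
peak <- which(z == max(z), arr.ind = TRUE)[1, ]  # row and column of the highest point
x <- seq_len(nrow(z))
y <- seq_len(ncol(z))
breaks <- seq(min(z), max(z), length.out = 13)   # 12 color levels
cols <- terrain.colors(length(breaks) - 1)

draw_map_with_scale <- function() {
  on.exit(layout(1))                             # reset the layout afterwards
  layout(matrix(1:2, nrow = 1), widths = c(4, 1))
  # panel 1: the map, with a triangle marking the highest point
  image(x, y, z, col = cols, breaks = breaks, xlab = "", ylab = "")
  points(x[peak["row"]], y[peak["col"]], pch = 17)
  # panel 2: the scale, drawn as a one-column image of the bin midpoints
  par(mar = c(5, 1, 4, 4))
  mids <- head(breaks, -1) + diff(breaks) / 2
  image(x = c(0, 1), y = breaks, z = matrix(mids, nrow = 1),
        col = cols, breaks = breaks, xaxt = "n", yaxt = "n", xlab = "", ylab = "")
  axis(4)
  box()
}
```

Calling draw_map_with_scale() with a graphics device open draws both panels. Because everything is built from image(), points() and axis(), further overlays can be added in the usual way.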

The function...

Creating svg graphics for web publishing

Thanks to the nice post from Revolution Analytics I was finally able to get an svg device working on my Windows OS version of R. It took some additional tips from a fellow user of blogger to figure out how to embed the result on bloggers.com (due to the inability to upload svg files I had to post the file on Wikimedia Commons and then create a link).

Luckily, I didn't need to rebuild my R version with cairo support - I just installed the R package Cairo and then used its function CairoSVG().

the code for the figure...

Color reduction of an image - and Warholize?

There seem to be several methods out there for reducing the colors in an image. I became interested in this after pondering how this is done in the excellent freeware program IrfanView. Unfortunately, their method is not described anywhere that I could find, but I imagine that it is something along the lines of the tree data structure collapse method that ImageMagick employs.

The biOps package employs k-means clustering to arrive at a reduced number of colors. I find that while this method takes a bit longer, the results can actually look a lot better. Just for kicks, I embedded the imgKMeans() function in another function called warholize() that replaces these reduced color levels with another set. It's definitely not a Warhol, but I still like the effect.
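The clustering idea can be sketched in base R without biOps. The code below is not the warholize() function from the post and does not call imgKMeans(); it uses stats::kmeans() on a small synthetic image instead, and the function name reduce_colors_sketch() and the replacement colors are made up for illustration.

```r
set.seed(42)
# synthetic 20 x 20 RGB "image" (values 0-1) standing in for a photo read from file
img <- array(runif(20 * 20 * 3), dim = c(20, 20, 3))

reduce_colors_sketch <- function(img, k = 4) {
  px <- matrix(img, ncol = 3)                # one row per pixel: red, green, blue
  km <- kmeans(px, centers = k, nstart = 5)  # cluster the pixels in RGB space
  reduced <- km$centers[km$cluster, ]        # replace each pixel by its cluster mean
  list(image = array(reduced, dim = dim(img)),
       cluster = km$cluster, centers = km$centers)
}

res <- reduce_colors_sketch(img, k = 4)

# "warholize": swap the k reduced colors for an arbitrary new set
new_cols <- t(col2rgb(c("hotpink", "gold", "turquoise", "purple"))) / 255
warhol <- array(new_cols[res$cluster, ], dim = dim(img))
```

Both res$image and warhol contain at most k distinct colors; only the colors assigned to the clusters differ.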

the function...

Clarke and Ainsworth's BIOENV and BVSTEP (and BIO-BIO etc...)

Nonmetric Multidimensional Scaling (NMDS) plot of vegetation sample dissimilarities with best correlating environmental variables (left) and species (right) plotted as vectors (datasets "varespec" and "varechem" from the package vegan)

The R package "vegan" contains a version of Clarke and Ainsworth's (1993) BIOENV analysis allowing for the comparison of distance/similarity matrices between two sets of data having either samples or variables in common. The typical setup is in the exploration of environmental variables that best correlate to sample similarities of the biological community (e.g. species biomass or abundance). In this case, the similarity matrix of the community is fixed, while subsets of the environmental variables are used in the calculation of the environmental similarity matrix. A correlation coefficient (typically Spearman rank correlation coefficient) is then calculated between the two matrices and the best subset of environmental variables can then be identified and further subjected to a permutation test to determine significance.

This can be a very helpful analysis for exploring the often high-dimensional space of community samples. The method is completely non-parametric and is widely accepted by the scientific community due to its flexibility across a wide variety of data - Clarke and Ainsworth's (1993) paper describing the method has 674 citations on Google Scholar at the time of this posting.
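The core of the procedure fits in a few lines of base R. The sketch below is not vegan's implementation: it uses synthetic data, Euclidean distances for both matrices (community analyses more commonly use a dissimilarity such as Bray-Curtis), and it leaves out the permutation test. All variable names are invented for the example.

```r
set.seed(1)
n <- 30
# synthetic environmental table: 'depth' and 'temp' drive the community, 'noise' does not
env <- data.frame(depth = runif(n), temp = runif(n), noise = runif(n))
# synthetic community: abundances of 5 "species" responding to depth and temp
comm <- sapply(1:5, function(i) {
  exp(2 * env$depth * runif(1) + 2 * env$temp * runif(1) + rnorm(n, sd = 0.1))
})

comm_dist <- dist(comm)  # the community matrix stays fixed

# all non-empty subsets of the environmental variables
vars <- names(env)
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)

# Spearman rank correlation between the community matrix and the
# environmental matrix built from each subset of (scaled) variables
rho <- sapply(subsets, function(v) {
  env_dist <- dist(scale(env[, v, drop = FALSE]))
  cor(as.vector(comm_dist), as.vector(env_dist), method = "spearman")
})

result <- data.frame(variables = sapply(subsets, paste, collapse = " + "),
                     rho = rho)
print(result[order(-result$rho), ])
```

The number of subsets grows as 2^p - 1 with the number of variables p, which is why a stepwise search such as BVSTEP becomes attractive for larger variable sets.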

The R package "vegan" incorporates this routine in the function bioenv(). An example of a BIOENV exploration between vegetation community data (dataset "varespec" in the vegan package) and the environmental data (dataset "varechem" in the vegan package):

Whales, plankton migrate across Northwest Passage

Phytoplankton species not seen in the Atlantic for at least 800,000 years have been turning up in recent years due to an increase in water transport through the Northwest Passage. While this is not yet cause for alarm, increases in species invasions from one oceanic basin to another could cause larger shifts in marine ecosystems over time - read more here

Friday, June 10, 2011

Image color palette replacement

Here is an example of a function I wrote to change the color palette used in an image. The above example comes from a black and white original, although color images can also be used. The function first converts the image to grayscale in order to have levels of color intensity from 0 to 255. Using a new color palette with 256 color levels, each gray level is replaced with an RGB (red, green, blue) vector from the new palette. The results can be very strange...
The package biOps is required for reading and writing the .jpeg files.
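The lookup step can be sketched without any image package. The following is not the function from the post: it works on a tiny synthetic RGB array instead of a jpeg, uses a plain channel average for the grayscale conversion (an assumption; weighted conversions are also common), and the palette colors are arbitrary.

```r
set.seed(1)
# synthetic 4 x 4 RGB image with integer intensities 0-255 (stand-in for a jpeg)
img <- array(sample(0:255, 4 * 4 * 3, replace = TRUE), dim = c(4, 4, 3))

# 1) convert to grayscale by averaging the three channels
gray <- round((img[, , 1] + img[, , 2] + img[, , 3]) / 3)

# 2) build a new palette with 256 levels and get its red, green, blue values
pal <- colorRampPalette(c("black", "purple", "orange", "white"))(256)
pal_rgb <- t(col2rgb(pal))  # 256 x 3 matrix

# 3) replace every gray level with the palette color at that level
#    (gray level 0 maps to palette row 1, hence the + 1)
new_img <- array(pal_rgb[as.vector(gray) + 1, ], dim = dim(img))
```

The result has the same dimensions as the input, so it can be written out or displayed in the same way as the original image.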

the function...

Simulating CMYK mis-registration printing

I recently came across a poster advertising a children's production of Shakespeare's The Tempest where they purposely used an effect to mimic a mis-registration in CMYK printing. You have probably seen this before as a slight offset in one of the 4 colors (cyan, magenta, yellow, and black).
     The CMYK color model is "subtractive" in that you end up with a white color (when the paper color is white) when no colors are printed. The opposite is an "additive" color model, such as RGB (red, green, blue), which results in black when none of these three color channels are added. This is more typically used in imaging on lit screens (e.g. color creation in R using the rgb() function).
     I wanted to try simulating this type of mis-registration in R and came up with the following function. For images with white backgrounds, the "subtractive" shift will look best while an "additive" shift works best for black backgrounds. The results are essentially the same, but you can eliminate some color channel striping on the image borders by choosing one or the other.
     This is probably much easier to do in a photo editing program, but I had fun with it nonetheless. I used the excellent package biOps for some of its image reading and manipulation functions.
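A stripped-down version of the "subtractive" shift can be written with plain arrays. This sketch is not the function from the post: it ignores the black (K) channel and works with C, M and Y only, handles only a non-negative horizontal offset, and the function name shift_channel_sketch() is hypothetical.

```r
# white 6 x 6 RGB image (values 0-1) with a black 2 x 2 square in the middle
img <- array(1, dim = c(6, 6, 3))
img[3:4, 3:4, ] <- 0

shift_channel_sketch <- function(img, channel = 1, dx = 1) {
  cmy <- 1 - img                              # subtractive model: ink = 1 - light
  ink <- cmy[, , channel]
  shifted <- matrix(0, nrow(ink), ncol(ink))  # 0 = no ink, i.e. white paper
  cols <- seq_len(ncol(ink) - dx)
  shifted[, cols + dx] <- ink[, cols]         # move the ink dx pixels to the right
  cmy[, , channel] <- shifted
  1 - cmy                                     # back to RGB for display
}

# shift the cyan plate (channel 1 in CMY order) one pixel to the right
out <- shift_channel_sketch(img, channel = 1, dx = 1)
```

In the output, the left edge of the square turns red (magenta plus yellow ink without cyan) and a cyan fringe appears on its right edge, which is the typical look of a mis-registered plate.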

...the function

map.xyz(): interpolation of XYZ data and projection onto a map

I am still struggling to get a grasp of R's mapping capabilities. Part of my frustration lies in the fact that I often work on areas near the poles, which complicates interpolation across the 180 degree line. For smaller areas, interpolation can be done using the interp() function in the package akima. I have taken the results from interp and projected the image onto a map. You will need the akima, maps, and mapproj packages and the functions new.lon.lat(), earth.dist(), and pos2coord().

As an example I have mapped the distance from Mecca, Saudi Arabia:

The function...

Array position to matrix coordinates conversion

#A function that is sometimes useful for determining the
#coordinates (i.e. row and column number) of a matrix position
#(and vice versa).
#Either a vector of positions ("pos")
#OR a 2 column matrix of matrix coordinates ("coord", i.e. cbind(row,col)),
#AND the matrix dimensions must be supplied (dim.mat, i.e. c(nrow,ncol)).
pos2coord<-function(pos=NULL, coord=NULL, dim.mat=NULL){
 #exactly one of 'pos' and 'coord' is required, together with 'dim.mat'
 if(is.null(pos) == is.null(coord) || is.null(dim.mat)){
  stop("must supply either 'pos' or 'coord', and 'dim.mat'")
 }
 if(is.null(pos)){
  #coordinates -> positions (column-major order)
  pos <- ((coord[,2]-1)*dim.mat[1])+coord[,1]
  return(pos)
 }
 #positions -> coordinates
 coord <- matrix(NA, nrow=length(pos), ncol=2)
 coord[,1] <- ((pos-1) %% dim.mat[1]) +1
 coord[,2] <- ((pos-1) %/% dim.mat[1]) +1
 return(coord)
}
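For comparison, base R offers arrayInd() for the position-to-coordinate direction. The short sketch below uses an arbitrary 3 x 4 matrix to show that it agrees with the formulas used in pos2coord(); the reverse direction is written out by hand.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# position -> (row, column): position 5 sits in row 2, column 2
rc_from_pos <- arrayInd(5, dim(m))
print(rc_from_pos)

# (row, column) -> position, using the same formula as pos2coord()
rc <- cbind(row = 2, col = 2)
pos_from_rc <- (rc[, "col"] - 1) * nrow(m) + rc[, "row"]
print(pos_from_rc)
```

Both conversions rely on the fact that R stores matrices in column-major order.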


Sunday, May 29, 2011

R functions for Earth geographic coordinate calculations

Here are some functions that I regularly use for geographic data (e.g. binning, filtering, calculation of new positions etc.).

#distance in kilometers between two long/lat positions (from "fossil" package)
earth.dist <- function (long1, lat1, long2, lat2) 
{
    rad <- pi/180
    a1 <- lat1 * rad
    a2 <- long1 * rad
    b1 <- lat2 * rad
    b2 <- long2 * rad
    dlon <- b2 - a2
    dlat <- b1 - a1
    a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))
    R <- 6378.145
    d <- R * c
    return(d)
}


#degree bearing between two long/lat positions (from "fossil" package)
earth.bear <- function (long1, lat1, long2, lat2) 
{
    rad <- pi/180
    a1 <- lat1 * rad
    a2 <- long1 * rad
    b1 <- lat2 * rad
    b2 <- long2 * rad
    dlon <- b2 - a2
    bear <- atan2(sin(dlon) * cos(b1), cos(a1) * sin(b1) - sin(a1) * 
        cos(b1) * cos(dlon))
    deg <- (bear%%(2 * pi)) * (180/pi)
    return(deg)
}


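#new long/lat position (in degrees) reached from a starting position, given a bearing (degrees) and a distance (kilometers)
new.lon.lat <-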
function (lon, lat, bearing, distance) 
{
    rad <- pi/180
    a1 <- lat * rad
    a2 <- lon * rad
    tc <- bearing * rad
    d <- distance/6378.145
    nlat <- asin(sin(a1) * cos(d) + cos(a1) * sin(d) * cos(tc))
    dlon <- atan2(sin(tc) * sin(d) * cos(a1), cos(d) - sin(a1) * 
        sin(nlat))
    nlon <- ((a2 + dlon + pi)%%(2 * pi)) - pi
    npts <- cbind(nlon/rad, nlat/rad)
    return(npts)
}
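As a quick consistency check of the formulas, the sketch below moves a given distance from a start point and then measures the way back. To stay self-contained it repeats the destination formula of new.lon.lat() and the haversine formula of earth.dist() inline; the start point (roughly Mecca), bearing and distance are arbitrary values chosen for the example.

```r
R_km <- 6378.145  # Earth radius used by the functions above
rad  <- pi / 180

# start point (degrees), bearing (degrees) and distance (km)
lon1 <- 39.8; lat1 <- 21.4; bearing <- 45; dist_km <- 500

# destination point (same formulas as new.lon.lat(), angles in radians)
d    <- dist_km / R_km
lat2 <- asin(sin(lat1 * rad) * cos(d) +
             cos(lat1 * rad) * sin(d) * cos(bearing * rad))
lon2 <- lon1 * rad + atan2(sin(bearing * rad) * sin(d) * cos(lat1 * rad),
                           cos(d) - sin(lat1 * rad) * sin(lat2))
lon2 <- ((lon2 + pi) %% (2 * pi)) - pi  # wrap into -pi..pi

# haversine distance back to the start (same formula as earth.dist())
a <- sin((lat2 - lat1 * rad) / 2)^2 +
     cos(lat1 * rad) * cos(lat2) * sin((lon2 - lon1 * rad) / 2)^2
back_km <- 2 * atan2(sqrt(a), sqrt(1 - a)) * R_km

print(c(lon = lon2 / rad, lat = lat2 / rad, distance_back = back_km))
```

The distance back should match the distance travelled up to floating-point error, since both functions assume the same spherical Earth.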


#tells which lon/lat positions are within the defined limits to the west, east, north, and south
lon.lat.filter <-
function (lon_vector, lat_vector, west, east, north, south) 
{
    if (west > east) {
        #the box crosses the 180 degree line: shift negative longitudes by 360
        lon_vector_new <- replace(lon_vector, which(lon_vector < 0), lon_vector[which(lon_vector < 0)] + 360)
        east_new <- east + 360
    } else {
        lon_vector_new <- lon_vector
        east_new <- east
    }
    hits <- which(lon_vector_new < east_new & lon_vector_new > west & lat_vector < north & lat_vector > south)
    return(hits)
}


me nugget

Monday, December 19, 2011

Maximal Information Coefficient (MIC)

Tuesday, December 13, 2011

Maximum Covariance Analysis (MCA)

Monday, November 28, 2011

Another aspect of speeding up loops in R

Thursday, November 24, 2011

Define intermediate color steps for colorRampPalette

Empirical Orthogonal Function (EOF) Analysis for gappy data

Friday, November 11, 2011

Propagation of error

Monday, September 12, 2011

Converting values to color levels

Wednesday, August 31, 2011

Adding a scale to an image plot

Monday, July 25, 2011

Creating svg graphics for web publishing

Wednesday, July 6, 2011

Color reduction of an image - and Warholize?

Thursday, June 30, 2011

Clarke and Ainsworth's BIOENV and BVSTEP (and BIO-BIO etc...)

Sunday, June 26, 2011

Whales, plankton migrate across Northwest Passage

Friday, June 10, 2011

Image color palette replacement

Friday, June 3, 2011

Simulating CMYK mis-registration printing

Monday, May 30, 2011

map.xyz(): interpolation of XYZ data and projection onto a map

Array position to matrix coordinates conversion

Sunday, May 29, 2011

R functions for Earth geographic coordinate calculations
