Thursday, November 24, 2011

Empirical Orthogonal Function (EOF) Analysis for gappy data

[Updates]: The following approach has serious shortcomings, which I have recently become aware of. In a comparison of gappy EOF approaches Taylor et al. (2013) [pdf] show that this traditional approach is not as accurate as others. Specifically, the approach of DINEOF (Data Interpolating Empirical Orthogonal Functions) proved to be the most accurate. I have outlined the DINEOF algorithm in another post [link]. and show a comparison of gappoy EOF methods here: http://menugget.blogspot.de/2014/09/pca-eof-for-data-with-missing-values.html. The R package "sinkr" now contains a version of the function ("eof") for easy installation: https://github.com/menugget/sinkr

-----------------
The following is a function for the calculation of Empirical Orthogonal Functions (EOF). For those coming from a more biologically-oriented background and are familiar with Principal Component Analysis (PCA), the methods are similar. In the climate sciences the method is usually used for the decomposition of a data field into dominant spatial-temporal modes. 


     The most commonly offered example of EOF in many textbooks is in the analysis of large oscillating variables, such as sea level pressure (SLP) or temperature (e.g. SST) fields. Nevertheless, the method can also be useful for other data where it can be used to reduce a high dimensional field into a smaller number of  dominant modes (i.e. principle components). In the example above, I have performed an EOF on monthly anomalies of surface chlorophyll a (chl a) concentrations for the northern Humboldt Current ecosystem (~ 5-16 °S and 110 km offshore extension). The data field is a matrix consisting of ~ 7300+ spatial locations (~ 4.5 km resolution); each column being a time series of one of these spatial locations. An important aspect of this analysis is that the data contain gaps (i.e. 'NA' values where observations are lacking, e.g. due to cloud coverage) and requires special treatment in the calculation of the covariance matrix and the expansion of the EOF coefficients (Björnsson and Venegas, 1997; von Storch and Zwiers, 1999). Given the high variability in daily chl a, as well an increase in gaps in daily time series, the analysis is probably not as appropriate at that temporal scale.

     The figure shows the leading 2 EOF modes, which describe 82 % and 7 % of the field's squared covariance. Above are the (normalized) spatial EOF patterns and below are their respective (normalized) EOF coefficient time series. In this case only 3 modes were significant according to "North's rule of thumb" and they cumulatively explain 92 % of the squared covariance of the field. Those modes not meeting the rule of thumb criteria are usually attributable to smaller scale features.

One must be careful in the interpretation of EOFs as they will not always represent "real" and clear patterns in the field. In the chl a example, EOF 1 is pretty obviously related to El Niño Southern Oscillation (ENSO), and correlates with the Southern Oscillation Index (SOI) at R=0.51. This is a well known process wereby the strength of the atmospheric pressure gradient across the equatorial Pacific (as reflected in the SOI) affects the depth of the thermo-nutricline in Peru and subsequently, the availability of nutrients available to primary producers.  The end of the strong El Niño of 1997/98 is seen as a positive value in the EOF 1 coefficient, while the strong La Niña of 2010 (1, 2) is seen as a negative value. The interpretation of EOF 2 requires more knowledge of the local system - At first glance, there seems to be two main areas that oscillate out of phase - a main upwelling area around 15 °S and the near coast system contain primarily positive EOF values, while more offshore waters are primarily negative. This could reflect a more localized wind-forced upwelling signal yet more information is probably needed for its interpretation.

The field can be reconstructed by summing the products of each EOF with it's corresponding coefficient (a matrix multiplication). I will demonstrate this in a future post, but for the moment you can gauge the direction of each EOF's contribution by observing the sign of the signal. For example, the reconstructed field using EOF 1 should contain all positive values for the La Niña period of 2010 because the spatial pattern consists of only negative values, and this is multiplied by a negative coefficient value during that period.

The function eof.mca() (below) was used for this analysis and additional information regarding it's setup can be found in the header notes. I have set up the function to also accept two fields for a Maximum Covariance Analysis (MCA), which is a similar analysis that searches for modes of highest covariance between two different fields. Again, I will give an example of it's application in a future post, but see Björnsson and Venegas(1997) or von Storch and Zwiers (1999) for examples. Given the size of the above field (and resulting covariance matrix), I employed the "irlba" method ('implicitly restarted Lanczos bi-diagonalization') to speed up the decomposition, as a full svd ('singular value decomposition') would have taken agonizingly long (and maxed out my RAM). The only disadvantage to the method is that (as far as I can tell) one cannot accurately calculate the explain variance of each mode without knowing the sum of all singular values. However, the squared covariance fraction is calculable with the irlba method without a full decomposition.



the eof.mca function...
#This function performs either a EOF ('Empirical Orthogonal Function' analysis) 
#on a single field (F1) or an MCA ('Maximum Covariance Analysis') on two fields
#(F1 and F2).
#The most typical setup is to have fields where columns are
#spatial locations and rows are temporal measurement. For an MCA, the row 
#dimension of F1 and F2 must be identical, but the columns can be different.
 
#Arguments 'F1_cols_incl' and 'F2_cols_incl' allow for the specification of
#using a subset of columns.
 
#In both cases, a covariance matrix is computed between columns using the 
#function 'cov4gappy()'. This function gives comparable results to: 
#'cov(F1,y=F2,use="pairwise.complete.obs")'
#whereby each covariance value is divided by n number of shared values (as opposed
#to n-1 in the case of cov(). Futhermore, the function will return a 0 (zero) in
#cases where no shared values exist between columns; the advantage being that a
#covariance matrix will still be calculated in cases of very gappy data, or when
#spatial locations have accidentally been included without observations (i.e. land
#in fields of aquatic-related parameters). 
#EOF analysis could be conducted through a direct decomposition of the 
#field F1 using SVD, although, for consistency with MCA, the the decomposition is applied
#to the covariance matrix. This is also an essential step for gappy data.
 
#The default settings include the centering of columns (centered=TRUE), which is usually
#recommended for EOF. Scaling (scaled=TRUE) of columns can also be set. This may
#be desired in MCA when unit ranges differ between fields. There are many different
#viewpoints on these approaches, so take caution as to the settings you choose.
 
#'nu' and 'nv' generally allow one to specify the return of a smaller number
#of EOF or MCA modes in order to create a smaller sized results object. 
#Additionally, two 'methods' are possible for the decomposition of the covariance
#matrix - 'svd' ('singular value decomposition') and 'irlba' ('implicitly restarted
#Lanczos bi-diagonalization') from the 'irlba' package. irlba can be a great 
#time saver in the decomposition of very large matrices, whereby the nu and nv
#settings define a stopping point. To the contrary, the svd method computes the
#full set of modes regardless, and nu and nv settings will only be used to trim
#the returned result.
 
#Finally, a lag may be introduced in order to explore changes in the associated
#statistics. This really only makes sense in cases where the row dimension
#represents temporal measurements taken at regular intervals.
 
#Main statistics returned:
#1. expl_var - 'explained variance' for each mode (not calcuable with irlba method
#as it requires the sum of the full singular value vector. 
#2. sq_cov_frac - 'squared covariance fraction' for each mode
#3. n_sig - 'number of significant' modes. This is calculated accrding to 'North's
#Rule of Thumb' which compares the difference between neighboring singular values
#(Lambda) with their associated error (Lambda_err)
 
#Additional information is returned which is helpful in the reconstruction of fields
#using a truncated number of modes and their associated coefficients.
 
eof.mca <- function(F1, F2=NULL,
centered=TRUE, scaled=FALSE,
F1_cols_incl=1:length(F1[1,]), 
F2_cols_incl=1:length(F2[1,]),
nu=NULL, nv=NULL, method="svd", F2_lag=NULL
){
 
    if(method=="irlba"){
        print("Only squared variance, not variance, is calculated using irlba method")
    }
 
    ####################
    ###eof vectors######
    ####################
    if(is.null(F2)){ # EOF
        F1 <- scale(F1, center=centered, scale=scaled)
 
        F1_center=attr(F1,"scaled:center")
        F1_scale=attr(F1,"scaled:scale")
        F2_center=NULL
        F2_scale=NULL
 
        F1_ts <- rownames(F1)
        F2_ts <- NULL
 
        F1_dim <- dim(F1)
        F2_dim <- NULL
 
        C <- cov4gappy(F1[,F1_cols_incl])
        F2_cols_incl=NULL
 
        if(method=="svd") {   
            if(is.null(nu)){
                nu=F1_dim[2]
            }
            if(is.null(nv)){
                nv=F1_dim[2]
            }
            L <- svd(C)
        }
 
        if(method=="irlba") {
            if(is.null(nu)){
                nu=5
            }
            if(is.null(nv)){
                nv=5
            }
            L <- irlba(C, nu=nu, nv=nv)
        }
 
        L$v<-NULL
    }
 
    if(!is.null(F2)){ # MCA
        F1 <- scale(F1, center=centered, scale=scaled)
        F2 <- scale(F2, center=centered, scale=scaled)
 
        F1_center=attr(F1,"scaled:center")
        F1_scale=attr(F1,"scaled:scale")
        F2_center=attr(F2,"scaled:center")
        F2_scale=attr(F2,"scaled:scale")
 
        if(!is.null(F2_lag)){
            if(sign(F2_lag)==1){
                F1 <- F1[(1+F2_lag):length(F1[,1]),]
                F2 <- F2[1:(length(F2[,1])-F2_lag),]
            }
            if(sign(F2_lag)==-1){
                F1 <- F1[1:(length(F1[,1])-F2_lag),]
                F2 <- F2[(1+F2_lag):length(F2[,1]),]
            }
        }
 
        F1_ts <- rownames(F1)
        F2_ts <- rownames(F2)
 
        F1_dim <- dim(F1)
        F2_dim <- dim(F2)
 
        C <- cov4gappy(F1[,F1_cols_incl], F2=F2[,F2_cols_incl])
 
        if(method=="svd") {
            if(is.null(nu)){
                nu=min(F1_dim[2], F2_dim[2])
            }
            if(is.null(nv)){
                nv=min(F1_dim[2], F2_dim[2])
            }
            L <- svd(C)
        }
 
        if(method=="irlba") {
            if(is.null(nu)){
                nu=5
            }
            if(is.null(nv)){
                nv=5
            }           
            L <- irlba(C, nu=nu, nv=nv)
        }
 
    }
 
    ###################
    ###eof mca stats###
    ###################
    if(method=="svd"){
        expl_var=L$d/sum(L$d) #explained variance
        sq_cov_frac=L$d^2/sum(L$d^2) #squared covariance fraction
    }
    if(method=="irlba"){
        expl_var=NULL
        if(dim(C)[1] <= dim(C)[2]){
            sq_cov_frac=L$d^2/sum(diag(C%*%t(C)))
        } else {
            sq_cov_frac=L$d^2/sum(diag(t(C)%*%C))
        }
    }
 
    if(length(L$d)>1){
        Lambda_err <- sqrt(2/min(F1_dim[2], F2_dim[2]))*L$d
        upper.lim <- L$d+Lambda_err
        lower.lim <- L$d-Lambda_err
        NORTHok=0*L$d
        for(i in seq(L$d)){
            Lambdas <- L$d
            Lambdas[i] <- NaN
            nearest <- which.min(abs(L$d[i]-Lambdas))
            if(nearest > i){
                if(lower.lim[i] > upper.lim[nearest]) NORTHok[i] <- 1
            }
            if(nearest < i){
                if(upper.lim[i] < lower.lim[nearest]) NORTHok[i] <- 1
            }
        }
        n_sig <- min(which(NORTHok==0))-1
    }
 
    ##########################################################
    ###expansion of eof coefficients "principle components"###
    ##########################################################
 
    A_coeff = NULL
    A_norm = NULL
    A = NULL
    B_coeff = NULL
    B_norm = NULL
    B = NULL
 
    #trim columns of original data
    F1 <- as.matrix(F1[,F1_cols_incl])
 
    #setup for norm
    F1_val<-replace(F1, which(!is.na(F1)), 1)
    F1_val<-replace(F1_val, which(is.na(F1_val)), 0)
 
    #calc of expansion coefficient and scaling norm
    A_coeff <- replace(F1, which(is.na(F1)), 0)%*%L$u[,1:nu]
    A_norm <- F1_val%*%(L$u[,1:nu]^2)
    A=A_coeff/A_norm
 
    if(!is.null(F2)){
        #trim columns of original data then center then scale
        F2 <- F2[,F2_cols_incl]       
 
        #setup for norm
        F2_val<-replace(F2, which(!is.na(F2)), 1)
        F2_val<-replace(F2_val, which(is.na(F2_val)), 0)
 
        #calc of expansion coefficient and scaling norm
        B_coeff <- replace(F2, which(is.na(F2)), 0)%*%L$v[,1:nv]
        B_norm <- F2_val%*%(L$v[,1:nv]^2)
        B=B_coeff/B_norm
    }
 
    ############
    ###result###
    ############
    RESULT <- list(
        u=L$u[,1:nu], v=L$v[,1:nv], 
        Lambda=L$d, Lambda_err=Lambda_err,
        expl_var=expl_var, sq_cov_frac=sq_cov_frac, 
        NORTHok=NORTHok, n_sig=n_sig, nu=nu, nv=nv,
        F1_dim=F1_dim, F2_dim=F2_dim,
        F1_cols_incl=F1_cols_incl,  F2_cols_incl=F2_cols_incl,
        F1_center=F1_center, F1_scale=F1_scale, 
        F2_center=F2_center, F2_scale=F2_scale,
        A=A, B=B,
        F1_ts=F1_ts, F2_ts=F2_ts, F2_lag=F2_lag
    )
 
    RESULT
} 
Created by Pretty R at inside-R.org


the cov4gappy function...
cov4gappy <- function(F1, F2=NULL){
    if(is.null(F2)){
        F1 <- as.matrix(F1)
        F1_val<-replace(F1, which(!is.na(F1)), 1)
        F1_val<-replace(F1_val, which(is.na(F1_val)), 0) 
        n_pairs=(t(F1_val)%*%F1_val)
 
        F1<-replace(F1, which(is.na(F1)), 0)
        cov_mat <- (t(F1)%*%F1)/n_pairs
        cov_mat <- replace(cov_mat, which(is.na(cov_mat)), 0)
    }
 
    if(!is.null(F2)){
        if(dim(F1)[1] == dim(F2)[1]){
            F1 <- as.matrix(F1)
            F2 <- as.matrix(F2)
 
            F1_val<-replace(F1, which(!is.na(F1)), 1)
            F1_val<-replace(F1_val, which(is.na(F1_val)), 0) 
            F2_val<-replace(F2, which(!is.na(F2)), 1)
            F2_val<-replace(F2_val, which(is.na(F2_val)), 0) 
            n_pairs=(t(F1_val)%*%F2_val)
 
            F1<-replace(F1, which(is.na(F1)), 0)
            F2<-replace(F2, which(is.na(F2)), 0)
            cov_mat <- (t(F1)%*%F2)/n_pairs
            cov_mat <- replace(cov_mat, which(is.na(cov_mat)), 0)
 
        } else {
            print("ERROR; matrices columns not of the same lengths")
        }
    }
 
    cov_mat
}
Created by Pretty R at inside-R.org



References
 
Björnsson, H. and Venegas, S.A. (1997). "A manual for EOF and SVD analyses of climate data", McGill University, CCGCR Report No. 97-1, Montréal, Québec, 52pp.

von Storch, H, Zwiers, F.W. (1999). Statistical analysis in climate research. Cambridge University Press.

 [pdf]

6 comments:

  1. Funny, I was thinking of posting something like this myself... I really like the graphics you're using for this.

    I'm a bit curious about about your "mode's explained variance"... I'm working off the top of my head but shouldn't this be the squared singular value of the given mode (i.e. the mode's eigenvalue) divided by the the sum of the squared singular values (i.e. the sum of all modes' eigenvalues) I think you're calling this the SCF? and then you can find the delta lambda for your North test by simply multiplying this by the square root of 2 over the number of singular values...?

    In code, might it be most efficient to:

    explained.var <- L$d^2 / sum(L$d^2)

    # Confidence interval for your test based on North et al:
    delta.lambda <- (explained.var * sqrt(2/length(L$d)))

    You may be doing this...? I may not be following your code correctly.

    At any rate, thank you very much for posting this.

    ReplyDelete
  2. I forgot to mention that the package "clim.pact" offers EOF and other related functions for the analysis of climate data. I'm not sure that it handles gappy data though.

    ReplyDelete
  3. Hi Brewster,

    Thanks for your feedback and compliments regarding the graphics - I sometimes focus too much on making things look good...

    Regarding explained variance - I might be wrong about this, but I think I have come across references that sometimes present one or the other. I can't really say which is the better one to present, but SCF is basically just squaring the Lambda values before calculating each singular value's contribution to the sum. If I'm wrong about this, let me know.

    You're right about the calculation of Lambda error. I have just written it in a weird way to accommodate the irlba method - in that case, one is usually calculating a smaller subset of EOFs and thus the length of singular values is smaller than for a full decomposition. Thus the length of Lambda is no longer appropriate to determine the Lambda Errors. By taking the number of columns in your original matrix, you will arrive at the number of singular values in a full matrix decomposition. I haven't got to MCA yet, but in that case the number of singular values would be equal to the number of columns in the smaller of the two field.

    ReplyDelete
  4. Great post - can you explain what th e "leading mode" is? I would like to create a similar map and not really sure what you plotted. It "leading mode" the mode of the EOF axis 1, and the mode of EOF axis 2? Thanks, RT

    ReplyDelete
  5. Thanks for you comment - what I mean by the "leading modes" is just that these are the first ones - i.e. modes 1 & 2. The EOFs are ordered according to the amount of variance that they represent.

    ReplyDelete
  6. can you also post your code for the figures? thanks!

    ReplyDelete