## Sunday, March 25, 2012

### Canonical Correlation Analysis for finding patterns in coupled fields

 First CCA pattern of Sea Level Pressure (SLP) and Sea Surface Temperature (SST) monthly anomalies for the region between -180 °W to -70 °W and +30 °N to -30 °S.

The following post demonstrates the use of Canonical Correlation Analysis (CCA) for diagnosing coupled patterns in climate fields. The method produces similar results to that of  Maximum Covariance Analysis (MCA), but patterns reflect maximum correlation rather than maximum covariance. Furthermore, the output of the model is a combination of linear models that can be used for field prediction.

This particular method was illustrated by Barnett and Preisendorfer (1997) - it constructs a CCA model based on a truncated subset of EOF coefficients (i.e. "principle components") instead of using the original field (as with MCA). This truncation has several benefits for the fitting of the model - First, one reduces the amount of noise in the problem by eliminating the higher EOF modes, which represent poorly organized, small-scale features of the fields. Second, by using orthogonal functions, the algebra of the problem is simplified (see von Storch and Zweiers 1999 for details). Bretherton etal. (1992) reviewed several techniques for diagnosing coupled patterns and found the Barnett and Preisendorfer method (hereafter "BPCCA") and MCA to be the most robust.

I have used the same data fields (SLP and SST for the equatorial Pacific) as for my prior post on MCA. In the first step of BPCCA, the SLP monthly anomaly field (predictor, 'x') was truncated to 3 EOFs, explaining 73% of the field's variance, and the SST monthly anomaly field (predictant, 'y') was truncated to 4 EOFs, explaining 70% of the field's variance.

For the purpose of the example, I have performed an EOF on the fields for the period of 1950-2004, although only EOF coefficients from 1950-1990 will be included in the BPCCA model. The remaining part of the signal will be used for the demonstration of field prediction. The BPCCA model is then calculated using the "bp.cca" function below, which is based on the excellent recipe for CCA within the book by Wilks (2006).  Another helpful resource was the book of von Storch and Zweiers (1999).

The CCA model produces 4 hierarchical variates of decreasing correlation as revealed by the canonical correlation coefficents (rho):
> bpcca\$rho
[1] 0.7232763 0.3749655 0.2005449 0.1071221

Using the p.asym function from the CCP package, the first three canonical correlation coefficients are deemed significant at the p < 0.05 level.
Pillai-Bartlett Trace, using F-approximation:
stat    approx df1  df2      p.value
1 to 4:  0.71542120 20.648593  20 1896 0.000000e+00
2 to 4:  0.19229257  8.012806  12 1904 8.104628e-15
3 to 4:  0.05169341  4.172160   6 1912 3.568567e-04
4 to 4:  0.01147515  2.761959   2 1920 6.341890e-02

The resulting CCA variates can also be transformed into the original spatial coordinates to produce a map as in the figure at the top. You can see that the 1st CCA pattern largely resembles the 1st MCA mode.

Based on the CCA model, one can predict values for the 4 SST EOF coefficients based on the SLP EOF coefficients (plot below). Furthermore, the model can predict the values after 1990 even though they were not included in the original model. Only the 1st SST EOF coefficient is reproduced with any degree of success, but this is substantial considering the that this leading pattern was responsible for 49% of the SST field's variance.

Finally, given these predicted EOF coefficeients, one can reconstruct the entire field using the spatial EOF patterns. The figure below shows the correlation of this predicted SST field to that of the original. Another often used measure of the predicted field is to calculate a skill score that relates the sum of squared differences of the predicted field (e.g. "SSpred") to that of a reference field (e.g. "SSref") that has been reconstructed from the same truncated EOFs used in the BPCCA model. For example, SSpred / SSref  * 100 would calculate the percent increase in sum of squared differences over that of the reference field.

In order to reproduce the analysis shown here, you will need the bp.cca function below. This function is written to work with the output objects of the eof.mca function. Other functions found within this blog are needed to reproduce the example. (color.palette, anomalyval2col, lon.lat.filter, image.scale, exp.mat, cov4gappy, color.palette) .

References:

- Barnett, T.P., Preisendorfer, R., 1987. Origins and levels of monthly and seasonal forecast skill for united-states surface air temperatures determined by canonical correlation-analysis. Monthly Weather Review 115 (9), 1825-1850.
- Bretherton, C.S., Smith, C., Wallace, J.M., 1992. An intercomparison of methods for finding coupled patterns in climate data. Journal of Climate 5 (6), 541-560.
- Storch, H.V., Zwiers, F.W., 1999. Statistical analysis in climate research. Cambridge University Press, Cambridge.
- Wilks, D.S., 2006. Statistical methods in the atmospheric sciences, 2nd ed. Academic Press, Amsterdam.

the bpcca function:
```bp.cca <- function(eofx,eofy,
n_pcx_incl=eofx\$n_sig, n_pcy_incl=eofy\$n_sig,
rowx_incl=1:length(eofx\$A[,1]), rowy_incl=1:length(eofy\$A[,1]),
y_lag=0){

require(CCP)
#since these are PCs no centering is needed
x <- eofx\$A[rowx_incl,1:n_pcx_incl] %*% exp.mat(diag(eofx\$Lambda[1:n_pcx_incl]), -0.5)
y <- eofy\$A[rowy_incl,1:n_pcy_incl] %*% exp.mat(diag(eofy\$Lambda[1:n_pcy_incl]), -0.5)

if(sign(y_lag) != 0){
if(sign(y_lag)==1){
x <- x[1:(length(x[,1])-y_lag),]
y <- y[(1+y_lag):length(y[,1]),]
}
if(sign(y_lag)==-1){
x <- x[(1-y_lag):length(x[,1]),]
y <- y[1:(length(y[,1])+y_lag),]
}
}

xdim <- dim(x)
ydim <- dim(y)

mindim=min(xdim[2], ydim[2])

if(xdim[1]!=ydim[1])
stop("unequal number of rows in x and y")

xcols <- 1:xdim[2]
ycols <- (xdim[2]+1):(xdim[2]+ydim[2])

Sxx <- cov4gappy(x)
Sxy <- cov4gappy(x,F2=y)
Syy <- cov4gappy(y)

############################################
############################################
if(xdim[2] >= ydim[2]){
svd1 <- svd(exp.mat(Sxx, -0.5) %*% Sxy %*% exp.mat(Syy, -0.5))
A <- exp.mat(Sxx, -0.5) %*% svd1\$u
B <- exp.mat(Syy, -0.5) %*% svd1\$v
} else {
svd1 <- svd(exp.mat(Syy, -0.5) %*% t(Sxy) %*% exp.mat(Sxx, -0.5))
A <- exp.mat(Sxx, -0.5) %*% svd1\$v
B <- exp.mat(Syy, -0.5) %*% svd1\$u
}

####################
#Canonical variates#
####################
V <- x %*% A
W <- y %*% B

########################
#Canonical correlations#
########################

rho=NA*1:mindim
for(i in 1:mindim){
rho[i] <- cor(as.vector(W[,i]), as.vector(V[,i]), use="pairwise")
}

#A test of CCA significance
cca.sig  <- p.asym(rho, xdim[1], xdim[2], ydim[2], tstat = "Pillai")

#out
list(
rho=rho, stat=cca.sig\$stat, df1=cca.sig\$df1, df2=cca.sig\$df2,
approx=cca.sig\$approx,p.value=cca.sig\$p.value,
mindim=mindim, n_pcx_incl=n_pcx_incl, n_pcy_incl=n_pcy_incl,
rowx_incl=rowx_incl, rowy_incl=rowy_incl,
A=A, B=B,
V=V, W=W,
x_ts=rownames(x), y_ts=rownames(y), y_lag=y_lag,
x_dim=eofx\$F1_dim, y_dim=eofy\$F1_dim,
x_cols_incl=eofx\$F1_cols_incl,  y_cols_incl=eofy\$F1_cols_incl,
x_center=eofx\$F1_center, x_scale=eofx\$F1_scale,
y_center=eofy\$F1_center, y_scale=eofy\$F1_scale
)

}```
Created by Pretty R at inside-R.org

the code to produce the examples above:
```###Load libraries
library(maps)
library(mapproj)
library(ncdf)
library(CCP)

source("val2col.R")
source("lon.lat.filter.R")
source("image.scale.R")
source("eof.mca.R")
source("cov4gappy.R")
source("color.palette.R")
source("anomaly.R", sep=""))
source("bp.cca.R")
source("exp.mat.R")

#Kaplan SST monthly anomaly dataset
#http://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/kaplan_sst/catalog.html
#http://www.esrl.noaa.gov/psd/thredds/fileServer/Datasets/kaplan_sst/sst.mon.anom.nc

#slp
nc <- open.ncdf("slp.mnmean.nc")   # opens nc file
slp.lon <- get.var.ncdf( nc, "lon")
slp.lat <- get.var.ncdf( nc, "lat")
slp.t <- get.var.ncdf( nc, "time")
slp.raw <- get.var.ncdf(nc, "slp")
close.ncdf(nc)
slp.t <- as.Date(slp.t, origin="1800-01-01")
temp <- which(slp.lon>180)
slp.lon[temp] <- slp.lon[temp]-360
slp.grd <- expand.grid(slp.lon, slp.lat)
colnames(slp.grd) <- c("lon", "lat")
slp <- matrix(c(slp.raw), nrow=length(slp.t), ncol<-length(slp.lon)*length(slp.lat), byrow=TRUE)
row.names(slp) <- as.character(slp.t)
dim(slp)
slp <- anomaly(slp, as.POSIXlt(slp.t), level="monthly")

#sst
nc <- open.ncdf("sst.mon.anom.nc")   # opens nc file
sst.lon <- get.var.ncdf( nc, "lon")
sst.lat <- get.var.ncdf( nc, "lat")
sst.t <- get.var.ncdf( nc, "time")
sst.raw <- get.var.ncdf(nc, "sst")
close.ncdf(nc)
sst.t <- as.Date(sst.t, origin="1800-01-01")
temp <- which(sst.lon>180)
sst.lon[temp] <- sst.lon[temp]-360
sst.grd <- expand.grid(sst.lon, sst.lat)
colnames(sst.grd) <- c("lon", "lat")
sst<- matrix(c(sst.raw), nrow=length(sst.t), ncol<-length(sst.lon)*length(sst.lat), byrow=TRUE)
row.names(sst) <- as.character(sst.t)
dim(sst)

###Make polygons for the data grids
#SLP
spacing=5 # degrees
slp.poly <- vector(mode="list", dim(slp.grd)[1])
for(i in seq(slp.poly)){
x=c(slp.grd[i,1]-spacing/2, slp.grd[i,1]+spacing/2, slp.grd[i,1]+spacing/2, slp.grd[i,1]-spacing/2)
y=c(slp.grd[i,2]-spacing/2, slp.grd[i,2]-spacing/2, slp.grd[i,2]+spacing/2, slp.grd[i,2]+spacing/2)
slp.poly[[i]] <- data.frame(x=x, y=y)
}

#SST
spacing=5 # degrees
sst.poly <- vector(mode="list", dim(sst.grd)[1])
for(i in seq(sst.poly)){
x=c(sst.grd[i,1]-spacing/2, sst.grd[i,1]+spacing/2, sst.grd[i,1]+spacing/2, sst.grd[i,1]-spacing/2)
y=c(sst.grd[i,2]-spacing/2, sst.grd[i,2]-spacing/2, sst.grd[i,2]+spacing/2, sst.grd[i,2]+spacing/2)
sst.poly[[i]] <- data.frame(x=x, y=y)
}

###Select data for analysis
#time selection
range(slp.t)
range(sst.t)
t.min <- as.Date("1950-01-01")
t.max1 <- as.Date("1989-12-31")
t.max2 <- as.Date("2004-12-31")
slp.t.incl <- which(slp.t >= t.min & slp.t <= t.max2)
sst.t.incl <- which(sst.t >= t.min & sst.t <= t.max2)

#space selection
lon.lim <- c(-180, -70)
lat.lim <- c(-30, 30)
slp.grd.incl <- lon.lat.filter(slp.grd[,1], slp.grd[,2], lon.lim[1], lon.lim[2], lat.lim[2], lat.lim[1])
sst.grd.incl <- lon.lat.filter(sst.grd[,1], sst.grd[,2], lon.lim[1], lon.lim[2], lat.lim[2], lat.lim[1])

###BPCCA
#Calculate EOFs
eof.slp <- eof.mca(slp[slp.t.incl, slp.grd.incl], centered=TRUE, nu=20)
eof.sst <- eof.mca(sst[sst.t.incl, sst.grd.incl], centered=TRUE, nu=20)
dim(eof.slp\$A)
dim(eof.sst\$A)

#cumulative variance explained by the "significant" EOFs
heights=c(5)
widths=c(5,5)
res=200
#figure
png(filename="cum_expl_var.png", width = sum(widths), height = sum(heights), units="in", res=res)
par(mfcol=c(1,2))
plot(cumsum(eof.slp\$expl_var[1:20])*100, ylab="cum.expl.var.[%]", xlab="EOF", cex=2, main="Significant EOFs - SLP")
grid()
points(cumsum(eof.slp\$expl_var[1:eof.slp\$n_sig])*100, pch=20, col=2, cex=2)
text(x=eof.slp\$n_sig, y=sum(eof.slp\$expl_var[1:eof.slp\$n_sig])*100, labels=paste(round(sum(eof.slp\$expl_var[1:eof.slp\$n_sig])*100),"%"), pos=4)

plot(cumsum(eof.sst\$expl_var[1:20])*100, ylab="cum.expl.var.[%]", xlab="EOF", cex=2, main="Significant EOFs - SST")
grid()
points(cumsum(eof.sst\$expl_var[1:eof.sst\$n_sig])*100, pch=20, col=2, cex=2)
text(x=eof.sst\$n_sig, y=sum(eof.sst\$expl_var[1:eof.sst\$n_sig])*100, labels=paste(round(sum(eof.sst\$expl_var[1:eof.sst\$n_sig])*100),"%"), pos=4)

dev.off()

#CCA model will be built on a smaller time subset of the PCs (eof\$A)
eof.slp.match <- which(as.Date(rownames(eof.slp\$A)) >= t.min & as.Date(rownames(eof.slp\$A)) <= t.max1)
eof.sst.match <- which(as.Date(rownames(eof.sst\$A)) >= t.min & as.Date(rownames(eof.sst\$A)) <= t.max1)

#The BPCCA model based on SLP and SST EOFS
bpcca <- bp.cca(eof.slp, eof.sst,
n_pcx_incl=eof.slp\$n_sig, n_pcy_incl=eof.sst\$n_sig,
rowx_incl=eof.slp.match, rowy_incl=eof.sst.match)

###Map of CCA
slp.cca.modes <- eof.slp\$u[,1:bpcca\$n_pcx_incl] %*% bpcca\$A
sst.cca.modes <- eof.sst\$u[,1:bpcca\$n_pcy_incl] %*% bpcca\$B

MODE=1
zran <- range(slp.cca.modes[,MODE], sst.cca.modes[,MODE])
zlim <- c(-max(abs(zran)), max(abs(zran)))

heights=c(3,2)
widths=c(4,4,0.5)
pal=color.palette(c("red", "yellow", "white", "cyan", "blue"), c(10,1,1,10))
ncol=100
res=200
colorvalues1 <- val2col(slp.cca.modes[,MODE], zlim, col=pal(ncol)) #color levels for the polygons
colorvalues2 <- val2col(sst.cca.modes[,MODE], zlim, col=pal(ncol)) #color levels for the polygons

#mapproj settings
project="fisheye"
orientation=c(mean(lat.lim), mean(lon.lim), 0)
PAR=1

#figure
png(filename=paste("map_cca_mode", MODE, ".png", sep=""), width = sum(widths), height = sum(heights), units="in", res=res)
#x11(width = sum(widths), height = sum(heights))
par(omi=c(0.5, 0.5, 0.5, 0.5), ps=12)
layout(matrix(c(1,2,3,4,4,4),nrow=2,ncol=3,byrow=TRUE), widths = widths, heights = heights, respect=TRUE)
layout.show(4)

#plot of map1
par(mai=c(0.1, 0.1, 0.1, 0.1))
map("world",project=project, orientation=orientation, par=PAR, ylim=lat.lim, xlim=lon.lim)
for(i in seq(slp.grd.incl)){
polygon(mapproject(x=slp.poly[[slp.grd.incl[i]]][,1], y=slp.poly[[slp.grd.incl[i]]][,2]), col=colorvalues1[i], border=colorvalues1[i], lwd=0.3)
}
map("world",project=project, orientation=orientation, par=PAR, fill=FALSE, add=TRUE, col="black")
map.grid(c(-180, 180, -90, 90), nx=36, ny=18, labels=FALSE, col="grey", lwd=1)
box()
mtext(paste("SLP Monthly Anomaly CCA", MODE), side=3, line=1, col=3)

#plot of map2
par(mai=c(0.1, 0.1, 0.1, 0.1))
map("world",project=project, orientation=orientation, par=PAR, ylim=lat.lim, xlim=lon.lim, xaxs="i", yaxs="i")
for(i in seq(sst.grd.incl)){
polygon(mapproject(x=sst.poly[[sst.grd.incl[i]]][,1], y=sst.poly[[sst.grd.incl[i]]][,2]), col=colorvalues2[i], border=colorvalues2[i], lwd=0.3)
}
map("world",project=project, orientation=orientation, par=PAR, fill=FALSE, add=TRUE, col="black")
map.grid(c(-180, 180, -90, 90), nx=36, ny=18, labels=FALSE, col="grey", lwd=1)
box()
mtext(paste("SST Monthly Anomaly CCA", MODE), side=3, line=1, col=4)

par(mai=c(0.1, 0.1, 0.1, 0.1))
image.scale(1, zlim, col=pal(ncol), horiz=FALSE, yaxt="n")
axis(4)
box()

YLIM <- range(c(bpcca\$V[,MODE], bpcca\$W[,MODE]))
XLIM <- range(c(as.Date(bpcca\$x_ts), as.Date(bpcca\$y_ts)))
par(mai=c(0.1, 0.1, 0.5, 0.1))
plot(as.Date(bpcca\$x_ts), bpcca\$V[,MODE], t="n", ylim=YLIM, xlim=XLIM, ylab="", xlab="")
abline(h=seq(-4,4,2), col=8, lty=3)
abline(v=seq(as.Date("1900-01-01"),as.Date("2100-01-01"),by="5 years"), col=8, lty=3)
lines(as.Date(bpcca\$x_ts), bpcca\$V[,MODE],col=3)
lines(as.Date(bpcca\$y_ts), bpcca\$W[,MODE],col=4)
mtext(paste("CCA", MODE, "Variates; R =", round(bpcca\$rho[MODE],2)), side=3, line=1)

dev.off()

###CCA model prediction of EOF coefficients
#use only significant CCAs
cca_use=which(bpcca\$p.value<0.05)

# 1. create extended V using the full PCs of slp (to 2004)
V_pred <- as.matrix(eof.slp\$A[,1:bpcca\$n_pcx_incl]) %*% exp.mat(diag(eof.slp\$Lambda[1:bpcca\$n_pcx_incl],bpcca\$n_pcx_incl), -0.5) %*% as.matrix(bpcca\$A)

# 2. create predicted W
W_pred <- as.matrix(V_pred[,cca_use]) %*% diag(bpcca\$rho[cca_use], length(cca_use))

# 3. create predicted PCs
PCy_pred <- W_pred %*% as.matrix(t(bpcca\$B[,cca_use]))

# 4. add units back to predicted PCs
PCy_pred_norm <- PCy_pred %*% exp.mat(diag(eof.sst\$Lambda[1:bpcca\$n_pcy_incl],bpcca\$n_pcy_incl), 0.5)

#figure of original vs. predicted EOF coefficients
LWD=c(1,1,1)
LTY=c(1,1,1)
COL=c(1,8,rgb(0,1,0,0.7))
png(filename="cca_pred_sst_pcs.png", width = 6, height = 6, units="in", res=400)
par(mfcol=c(dim(PCy_pred_norm )[2]+1, 1), omi=c(0.2,0.2,0.2,0.2))
par(mar=c(0.5,0.5,0.5,0.5))
plot(1,1,bty="n", t="n", ylab="", xlab="", yaxt="n", xaxt="n")
legend(
"center",
inset = c(0.01,0.01),
legend=c(
paste("SST PC - portion", "used in CCA model"),
paste("SST PC - portion", "not used in CCA model"),
paste("Predicted SST PC")
),
col=COL,
lty=LTY,
lwd=LWD,
)

par(mar=c(3,5,0.5,0.5))
for(i in seq(dim(PCy_pred_norm )[2])){
MODE=i
XLIM <- range(c(as.Date(rownames(PCy_pred_norm)), as.Date(rownames(eof.sst\$A))))
YLIM <- range(c(PCy_pred_norm[,MODE],eof.sst\$A[,MODE]))
plot(as.Date(rownames(PCy_pred_norm)),PCy_pred_norm[,MODE], t="n", ylim=YLIM, xlim=XLIM, ylab="", xlab="")
abline(h=seq(par()\$yaxp[1],par()\$yaxp[2],,par()\$yaxp[3]+1), col=8, lty=3, lwd=0.5)
abline(v=seq(as.Date("1900-01-01"),as.Date("2100-01-01"),by="5 years"), col=8, lty=3, lwd=0.5)
lines(as.Date(rownames(eof.sst\$A)[eof.sst.match]),eof.sst\$A[eof.sst.match,MODE], col=COL[1], lty=LTY[1], lwd=LWD[1])
lines(as.Date(rownames(eof.sst\$A)[-eof.sst.match]),eof.sst\$A[-eof.sst.match,MODE], col=COL[2], lty=LTY[2], lwd=LWD[2])
lines(as.Date(rownames(PCy_pred_norm)),PCy_pred_norm[,MODE], col=COL[3], lty=LTY[3], lwd=LWD[3])
mtext(paste("PC ", MODE, " (", round(eof.sst\$expl_var[i]*100), " %)", sep=""), side=2, line=3, cex=0.7)
}

dev.off()

#5. reconstruct field
field.pred <- PCy_pred_norm %*% t(eof.sst\$u[,1:bpcca\$n_pcy_incl])
dim(field.pred)
dim(sst[sst.t.incl, sst.grd.incl])

#6. a possible map - correlation between original and predicted
COR <- NaN*seq(dim(field.pred)[2])
for(i in seq(dim(field.pred)[2])){
COR[i] <- cor(sst[sst.t.incl, sst.grd.incl[i]], field.pred[,i], use="pairwise")
}

zlim <- c(-1,1)
pal=color.palette(c("red", "yellow", "cyan", "blue"))
ncol=20
colorvalues1 <- val2col(COR, zlim, col=pal(ncol)) #color levels for the polygons

#device settings
heights=c(3)
widths=c(4,0.5)
res=200

#mapproj settings
project="fisheye"
orientation=c(mean(lat.lim), mean(lon.lim), 0)
PAR=1

#figure
png(filename="sst_pred_vs_orig_field.png", width = sum(widths), height = sum(heights), res=res, units="in")
#x11(width = sum(widths), height = sum(heights))
par(omi=c(0.1, 0.1, 0.5, 0.5), ps=12)
layout(matrix(c(1,2),nrow=1,ncol=2,byrow=TRUE), widths = widths, heights = heights, respect=TRUE)
layout.show(2)

#plot of map
par(mai=c(0.1, 0.1, 0.1, 0.1))
map("world",project=project, orientation=orientation, par=PAR, ylim=lat.lim, xlim=lon.lim)
for(i in seq(slp.grd.incl)){
polygon(mapproject(x=sst.poly[[sst.grd.incl[i]]][,1], y=sst.poly[[sst.grd.incl[i]]][,2]), col=colorvalues1[i], border=colorvalues1[i], lwd=0.3)
}
map("world",project=project, orientation=orientation, par=PAR, fill=FALSE, add=TRUE, col="black")
map.grid(c(-180, 180, -90, 90), nx=36, ny=18, labels=FALSE, col="grey", lwd=1)
box()
mtext("Correlation of original vs. predicted SST fields", side=3, line=1)

par(mai=c(0.1, 0.1, 0.1, 0.1))
image.scale(1, zlim, col=pal(ncol), horiz=FALSE, yaxt="n")
axis(4)
box()

dev.off()```
Created by Pretty R at inside-R.org