Tuesday, October 30, 2012

DINEOF (Data Interpolating Empirical Orthogonal Functions)


I finally got around to reproducing the DINEOF method (Beckers and Rixon, 2003) for optimizing EOF analysis on gappy data fields - it is especially useful for remote sensing data where cloud cover can result in large gaps in data. Their paper gives a nice overview of some of the various methods that have been used for such data sets. One of these approaches, which I have written about before,  involves deriving EOFs from a covariance matrix as calculated from available data. Unfortunately, as the author's point out, such covariance matrices are no longer positive-definite, which can lead to several problems. The DINEOF method seems to overcome several of these issues.

DINEOF is a procedure fills gaps by iteratively decomposing the data field via Singular Value Decomposition (SVD) until a best solution is found as compared to a subset of reference values (non-gaps). This is done by progressively including more EOFs in the reconstruction of the gappy locations until minimization of error converges. The authors first demonstrate the method using a synthetic example (e.g. above figure), where: 1) a data set is created, 2) noise is added (based on a noise / signal setting), 3) gaps are created (based on a fraction of gappiness setting), and finally the gaps are interpolated via the DINEOF method. The figure below shows a record of the iterations performed to fill in the gaps with increasing number of EOFs. An optimized solution occurred using the leading 6 EOFs - the addition of a 7th EOF did not improve the root mean squared error (RMS).

The synthetic example was based on 9 combined signals. Below we see a fit of the EOFs of the "true" field (black points) and the optimized solution of the gappy data (red lines). Again, only 5 EOFs seem to represent a real signal. This is a result of the addition of the noise - a non-noise version usually iterates until the full 9 EOF modes.




Here is an animation of the optimization process:


Below is a version of DINEOF for R and a script to reproduce the example. Additional information on DINEOF can be found here. This code is my interpretation of the DINEOF procedure - use at your own risk! Take care when using the default setting for ref.pos (positions to zero out and use as reference points in the conversion). As is, this samples 30 OR 1% of the values, whichever is bigger.


My version of the DINEOF procedure:
#This function is based on the DINEOF (Data Interpolating Empirical Orthogonal Functions)
#procedure described by Beckers and Rixon (2003).
#
#The arguments are:
#Xo - a gappy data field
#nu - a maximum number of EOFs to iterate (leave equalling "NULL" if algorithm shold proceed until convergence)
#ref.pos - a vector of non-gap reference positions by which errors will be assessed via root mean squared error ("RMS"). 
 #If ref.pos = NULL, then either 30 or 1% of the non-gap values (which ever is larger) will be sampled at random.
#delta.rms - is the threshold for RMS convergence.
#
#The results object includes:
#Xa - the data field with interpolated values (via EOF reconstruction) included.
#n.eof - the number of EOFs used in the final solution
#RMS - a vector of the RMS values from the iteration
#NEOF - a vector of the number of EOFs used at each iteration
#
#Beckers, Jean-Marie, and M. Rixen. "EOF Calculations and Data Filling from Incomplete Oceanographic Datasets."
#Journal of Atmospheric and Oceanic Technology 20.12 (2003): 1839-1856.
#
dineof <- function(Xo, nu=NULL, ref.pos=NULL, delta.rms=1e-5){
 
 if(is.null(nu)){
  nu=dim(Xo)[2]
 } 
 
 na.true <- which(is.na(Xo))
 na.false <- which(!is.na(Xo))
 if(is.null(ref.pos)) ref.pos <- sample(na.false, max(30, 0.01*length(na.false)))
 
 Xa <- replace(Xo, c(ref.pos, na.true), 0)
 rms.prev <- Inf
 rms.now <- sqrt(mean((Xa[ref.pos] - Xo[ref.pos])^2))
 n.eof <- 1
 RMS <- rms.now
 NEOF <- n.eof
 Xa.best <- Xa
 n.eof.best <- n.eof 
 while(rms.prev - rms.now > delta.rms & nu > n.eof){ #loop for increasing number of eofs
  while(rms.prev - rms.now > delta.rms){ #loop for replacement
   rms.prev <- rms.now
   SVDi <- svd(Xa) 
   RECi <- as.matrix(SVDi$u[,seq(n.eof)]) %*% as.matrix(diag(SVDi$d[seq(n.eof)], n.eof, n.eof)) %*% t(as.matrix(SVDi$v[,seq(n.eof)]))
   Xa[c(ref.pos, na.true)] <- RECi[c(ref.pos, na.true)]
   rms.now <- sqrt(mean((Xa[ref.pos] - Xo[ref.pos])^2))
   print(paste(n.eof, "EOF", "; RMS =", round(rms.now, 8)))
   RMS <- c(RMS, rms.now)
   NEOF <- c(NEOF, n.eof)
   gc()
   if(rms.now == min(RMS)) {
    Xa.best <- Xa
    n.eof.best <- n.eof
   }
  }
  n.eof <- n.eof + 1
  rms.prev <- rms.now
  SVDi <- svd(Xa) 
  RECi <- as.matrix(SVDi$u[,seq(n.eof)]) %*% as.matrix(diag(SVDi$d[seq(n.eof)], n.eof, n.eof)) %*% t(as.matrix(SVDi$v[,seq(n.eof)]))
  Xa[c(ref.pos, na.true)] <- RECi[c(ref.pos, na.true)]
  rms.now <- sqrt(mean((Xa[ref.pos] - Xo[ref.pos])^2))
  print(paste(n.eof, "EOF", "; RMS =", round(rms.now, 8)))
  RMS <- c(RMS, rms.now)
  NEOF <- c(NEOF, n.eof)
  gc()
  if(rms.now == min(RMS)) {
   Xa.best <- Xa
   n.eof.best <- n.eof
  }
 }
 
 Xa <- Xa.best
 n.eof <- n.eof.best
 rm(list=c("Xa.best", "n.eof.best", "SVDi", "RECi"))
 
 Xa[ref.pos] <- Xo[ref.pos]
 
 RESULT <- list(
  Xa=Xa, n.eof=n.eof, RMS=RMS, NEOF=NEOF
 )
 
 RESULT
}
Created by Pretty R at inside-R.org





To reproduce the example (follows that contained in the paper by Beckers and Rixon [2003])
#color palette
pal <- colorRampPalette(c("blue", "cyan", "yellow", "red"))
 
#Generate data
m=50
n=100
frac.gaps <- 0.5 # the fraction of data with NaNs
N.S.ratio <- 0.1 # the Noise to Signal ratio for adding noise to data
 
x <- (seq(m)*2*pi)/m
t <- (seq(n)*2*pi)/n
 
#True field
Xt <- 
 outer(sin(x), sin(t)) + 
 outer(sin(2.1*x), sin(2.1*t)) + 
 outer(sin(3.1*x), sin(3.1*t)) +
 outer(tanh(x), cos(t)) + 
 outer(tanh(2*x), cos(2.1*t)) + 
 outer(tanh(4*x), cos(0.1*t)) + 
 outer(tanh(2.4*x), cos(1.1*t)) + 
 tanh(outer(x, t, FUN="+")) + 
 tanh(outer(x, 2*t, FUN="+"))
 
Xt <- t(Xt)
image(Xt, col=pal(100))
 
 
#Noise field
set.seed(1)
RAND <- matrix(runif(length(Xt), min=-1, max=1), nrow=nrow(Xt), ncol=ncol(Xt))
R <- RAND * N.S.ratio * Xt
 
 
#True field + Noise field
Xp <- Xt + R
image(Xp, col=pal(100))
 
#Observed field with gaps
set.seed(1)
gaps <- sample(seq(length(Xp)), frac.gaps*length(Xp))
Xo <- replace(Xp, gaps, NaN)
image(Xo, col=pal(100))
 
 
#
set.seed(1)
RES <- dineof(Xo)
Xa <- RES$Xa
 
image(Xa, col=pal(100))
 
 
 
 
#plot of rms convergence
png("dineof_rms_convergence.png", width=5, height=4, units="in", res=400)
par(mar=c(5,5,3,5))
plot(RES$RMS, xlab="Iteration", ylab="RMS", log="y", t="l", lwd=2)
par(new=TRUE)
plot(RES$NEOF, xaxt="n", yaxt="n", xlab="", ylab="", t="l", lwd=2, col=2)
axis(4)
mtext("# EOFs", side=4, line=3, col=2)
dev.off()
 
 
#plot of fields
png("dineof_composite.png", width=8, height=8, units="in", res=400)
#x11(width=8, height=8)
ZLIM <- range(Xt, Xp, Xo, Xa, na.rm=TRUE)
par(mfrow=c(2,2), mar=c(3,3,3,1))
image(z=Xt, zlim=ZLIM, main="A) True", col=pal(100), xaxt="n", yaxt="n", xlab="", ylab="")
box()
mtext("t", side=1, line=0.5)
mtext("x", side=2, line=0.5)
image(z=Xp, zlim=ZLIM, main=paste("B) True + Noise (N/S = ", N.S.ratio, ")", sep=""), col=pal(100), xaxt="n", yaxt="n", xlab="", ylab="")
box()
mtext("t", side=1, line=0.5)
mtext("x", side=2, line=0.5)
box()
image(z=Xo, zlim=ZLIM, main=paste("C) Observed (", frac.gaps*100, " % gaps)", sep=""), col=pal(100), xaxt="n", yaxt="n", xlab="", ylab="")
mtext("t", side=1, line=0.5)
mtext("x", side=2, line=0.5)
image(z=Xa, zlim=ZLIM, main="D) Reconstruction", col=pal(100), xaxt="n", yaxt="n", xlab="", ylab="")
box()
mtext("t", side=1, line=0.5)
mtext("x", side=2, line=0.5)
 
dev.off()
 
 
#plot of EOF coefficients
St <- svd(Xt)
Sa <- svd(Xa)
png("dineof_EOF_coefficients.png", width=8, height=8, units="in", res=400)
#x11(width=8, height=8)
par(mfrow=c(3,3), mar=c(3,3,3,1))
for(i in seq(9)){
YLIM <- range(St$u[,i], Sa$u[,i])
plot(t, St$u[,i], ylim=YLIM, xlab="", ylab="")
lines(t, Sa$u[,i], col=2, lwd=2)
mtext(paste("EOF coeff.", i), side=3, line=0.5)
}
dev.off()
Created by Pretty R at inside-R.org



References

5 comments:

  1. Thanks for this - a very interesting article.

    A simple method for dealing with gappy rasters, involving voter-model interpolation, was suggested by Clint Sprott a few years ago:

    http://sprott.physics.wisc.edu/pubs/paper276.htm

    It's only useful when the data are fractal-ish.

    ReplyDelete
  2. Thanks for the interesting reference Michael - Benoit Mandelbrot would have been proud of that application.

    ReplyDelete
  3. I think you forgot to define mu_Noise ?
    but this is great stuff - thanks for sharing.

    ReplyDelete
  4. Thanks for your nice words and for noticing that missing part of the code - I have defined mu_Noise now in the example. Cheers

    ReplyDelete
  5. Correction - Noise is now implemented using a randomly generated matrix of uniform distribution (RAND), which is multiplied by the N/S ratio and the true field : R <- RAND * N.S.ratio * Xt. This is then added to the true field to derive the noisy field (Xp).

    ReplyDelete