me nugget: gappy data

Showing posts with label gappy data. Show all posts

Thursday, February 25, 2016

RSpectra::svds function now default method for sinkr::dineof

The summary blog post describing the RSpectra (formerly rARPACK) package made a convincing case for the improved decomposition speed of the "svds" function for partial SVD (Singular Value Decomposition) over several other R packages. Until now, the sinkr package has relied on irlba for it's dineof function (Data Interpolating Empirical Orthogonal Functions). Since the routine can be quite computationally intensive, I wanted to test the performance of svds as an alternative method.

In a simple example that performs SVD on a field of sea level pressure of the equatorial Pacific, svds outperforms irlba both in speed and correlation of singular vector (e.g. $u) to the output of the base svd function. One can see in the following graph, that trailing vectors break down in their correlation while svds maintains nearly perfect correlation. Interestingly, this artifact is removed by first centering the data field.

While the effect looks dramatic above, it should be noted that the trailing vectors usually carry only a small fraction of information, and thus contribute only marginally to errors in field reconstruction. Below is a figure showing the reconstruction error of svd, svds, and irlba with increasing levels of truncation.

Finally, both methods were compared in their performance within dineof. With the non-centered field both approaches arrive to a similar RMS, but svds converges with less iterations and EOFs than irlba. With the centered data both methods produce nearly identical results.

So, even though the differences are small, the rSpectra::svds method will now be the default method for both the eof and dineof functions within sinkr. For the moment, the irlba method is maintained for compatibility with previous versions.

Code to reproduce:

PCA / EOF for data with missing values - a comparison of accuracy

Not all Principal Component Analysis (PCA) (also called Empirical Orthogonal Function analysis, EOF) approaches are equal when it comes to dealing with a data field that contain missing values (i.e. "gappy"). The following post compares several methods by assessing the accuracy of the derived PCs to reconstruct the "true" data set, as was similarly conducted by Taylor et al. (2013).

The gappy EOF methods to be compared are:

LSEOF - "Least-Squares Empirical Orthogonal Functions" - The traditional approach, which modifies the covariance matrix used for the EOF decomposition by the number of paired observations, and further scales the projected PCs by these same weightings (see Björnsson and Venegas 1997, von Storch and Zweiers 1999 for details).
RSEOF - "Recursively Subtracted Empirical Orthogonal Functions" - This approach modifies the LSEOF approach by recursively solving for the leading EOF, whose reconstructed field is then subtracted from the original field. This recursive subtraction is done until a given stopping point (i.e. number of EOFs, % remaining variance, etc.) (see Taylor et al. 2013 for details)
DINEOF - "Data Interpolating Empirical Orthogonal Functions" - This approach gradually solves for EOFs by means of an iterative algorothm to fit EOFS to a given number of non-missing value reference points (small percentage of observations) via RMSE minimization (see Beckers and Rixen 2003 for details).

I have introduced both the LSEOF [link] and DINEOF [link] methods in the past, but have never directly compared them for the blog. The purpose of this post is to make this comparison and to also introduce a more general EOF function that is capable of conducting RSEOF. All analyses can be reproduced following installation of the "sinkr" package: https://github.com/marchtaylor/sinkr

The basic problem comes down to the difficulties of decomposing a matrix that is not "positive-definite", i.e. the estimated covariance matrix from a gappy data set. DINEOF entirely avoids this issue by first interpolating the values to create a full data field, while LSEOF and RSEOF rely on decomposing this estimation. A known problem is that the trailing EOFs derived from such a matrix are amplified in their singular values, which can consequently amplify errors in field reconstructions when included. The RSEOF approach thus attempts to remedy these issues by recursively solving for only leading EOFs. In the following examples, I show the performance of the three approaches in terms of reconstructing the data field (including the "true" values).

Example 1 - Synthetic data set:
The first example uses a synthetic data set used by Beckers and Rixen (2003) in their introduction of the DINEOF approach. The accuracy of the reconstruction is dependent on the number of EOFs used. In a non-gappy example, a perfect reconstruction should be possible using this full set of EOFs - In fact it only takes 9 EOFs when using the non-noisy true field, since it is a composite of 9 signals. In the case of the noisy, gappy data sets, reconstructions with trailing EOFs may increase errors. This can be seen in the figure at the top of the post showing RMSE vs the number of EOFs used in the reconstruction.

The figure shows the DINEOF approach to be the most accurate. The LSEOF approach has a clear RSME minimum with 4 EOFs, while the RSEOF approach was largely able to remedy the amplification of error when using trailing EOFs. The problem of error amplification is even more dramatic when viewed visually, as in the following where the full set of EOFs have been used:

Lomb-Scargle periodogram for unevenly sampled time series

In the natural sciences, it is common to have incomplete or unevenly sampled time series for a given variable. Determining cycles in such series is not directly possible with methods such as Fast Fourier Transform (FFT) and may require some degree of interpolation to fill in gaps. An alternative is the Lomb-Scargle method (or least-squares spectral analysis, LSSA), which estimates a frequency spectrum based on a least squares fit of sinusoid.

The above figure shows a Lomb-Scargle periodogram of a time series of sunspot activity (1749-1997) with 50% of monthly values missing. As expected (link1, link2), the periodogram displays a a highly significant maximum peak at a frequency of ~11 years.

The function comes from a nice set of functions that I found here: http://research.stowers-institute.org/efg/2005/LombScargle/R/index.htm. An accompanying paper focusing on its application to time series of gene expression can be found here.

Below is a comparison to an FFT of the full time series. For another great resource on spectral analysis, and time series-related R methods in general, see the following website: http://zoonek2.free.fr/UNIX/48_R/15.html.

To reproduce the example:

DINEOF (Data Interpolating Empirical Orthogonal Functions)

I finally got around to reproducing the DINEOF method (Beckers and Rixon, 2003) for optimizing EOF analysis on gappy data fields - it is especially useful for remote sensing data where cloud cover can result in large gaps in data. Their paper gives a nice overview of some of the various methods that have been used for such data sets. One of these approaches, which I have written about before, involves deriving EOFs from a covariance matrix as calculated from available data. Unfortunately, as the author's point out, such covariance matrices are no longer positive-definite, which can lead to several problems. The DINEOF method seems to overcome several of these issues.

Working with Globcolour data

[UPDATE: new example can be found here. Functions have also been added to the sinkr package.]

The Globcolour project (http://www.globcolour.info/) provides relatively easy access to ocean color remote sensing data. Data is provided at http://hermes.acri.fr/ and the following parameters are available:

· Chlorophyll-a (CHL1 and CHL2)

· Fully normalised water leaving radiances at 412, 443, 490, 510, 531, 550-565, 620, 665-

670, 681 and 709 nm (Lxxx)

· Coloured dissolved and detrital organic materials absorption coefficient (CDM)

· Diffuse attenuation coefficient (Kd(490))

· Particulate back-scattering coefficient (bbp)

· Total Suspended Matter (TSM)

· Relative excess of radiance at 555 nm (EL555)

· Photosynthetic Available Radiation (PAR)

· Heated layer depth (ZHL)

· Secchi disk depth (ZSD)

· Primary production (PP)

· Aerosol optical thickness over water (T865)

· Cloud Fraction (CF)

Of particular interest to ecologists are the estimates of Chlorophyll a (chla) , which combines data from several satellites for better coverage - SeaWiFS (NASA), MODIS (NASA), MERIS (ESA). Data is available at several temporal (daily, 8-days, and monthly averages) and spatial (4.63 km, 0.25°, and 1°) resolutions for the global domain. Several merged products are available: simple averaging (AV), weighted averaging (AVW), and Garver, Siegel, Maritorena Model (GSM) [for more information see the Product User Guide].

Due to the gappy nature of the data (e.g. due to land and clouds), many of the data products only provide values at grids where estimation was possible. For high resolution data, such as in 4.63 km resolution daily estimates, grids with values are often far fewer than the total number of ISIN grids (n=23,761,676) used by the product. This saves space in the files for download, but you may need to reconstruct the field (with NaN's included for grids without observations) for some analyses.

The following example shows how to retrieve Globcolour data and process it using R. Global data is available, but I have provided instructions for processing a smaller area of 4.63 km resolution chla data from the Galapagos archipelago. One can define lat/lon limits for the desired area on the http://hermes.acri.fr/ interface. An ftp address will be sent via Email as to the location of the data when finished.

Canonical Correlation Analysis for finding patterns in coupled fields

First CCA pattern of Sea Level Pressure (SLP) and Sea Surface Temperature (SST) monthly anomalies for the region between -180 °W to -70 °W and +30 °N to -30 °S.

The following post demonstrates the use of Canonical Correlation Analysis (CCA) for diagnosing coupled patterns in climate fields. The method produces similar results to that of Maximum Covariance Analysis (MCA), but patterns reflect maximum correlation rather than maximum covariance. Furthermore, the output of the model is a combination of linear models that can be used for field prediction.

This particular method was illustrated by Barnett and Preisendorfer (1997) - it constructs a CCA model based on a truncated subset of EOF coefficients (i.e. "principle components") instead of using the original field (as with MCA). This truncation has several benefits for the fitting of the model - First, one reduces the amount of noise in the problem by eliminating the higher EOF modes, which represent poorly organized, small-scale features of the fields. Second, by using orthogonal functions, the algebra of the problem is simplified (see von Storch and Zweiers 1999 for details). Bretherton etal. (1992) reviewed several techniques for diagnosing coupled patterns and found the Barnett and Preisendorfer method (hereafter "BPCCA") and MCA to be the most robust.

A ridiculous proof of concept: xyz interpolation

Ridiculous Orb

This is really the last one on this theme for a while... I had alluded to a combination of methods regarding xyz interpolation at the end of my last post and wanted to demonstrate this in a final example.

The ridiculousness that you see above involved two interpolation steps. First, a thin plate spline interpolation ("Tps" function of the fields package) is applied to the original random xyz field of distance to Mecca. This fitted model is then used to predict values at a new grid of 2° resolution. Finally, in order to avoid plotting polygons for each grid (which can be slow for fine grids), I obtain their projected coordinates with the mapproject function. Using these projected coordinates and their respective z values, a second interpolation is done with the "interp" function of the akima package onto a relatively fine grid of 1000x1000 positions. The result is a smooth field that can then be overlayed on the map using the "image" function (very fast).

So you may ask - When is this even necessary? I would say that it really only makes sense for projecting a filled.contour-type plot for relatively sparse geographic data. Be warned - for large amounts of xyz data, the interpolation algorithms can take a long time.

A couple of functions, found within this blog, are needed to reproduce the plot (earth.dist, color.palette).

the code to reproduce the figure...

Maximum Covariance Analysis (MCA)

Maximum Covariance Analysis (MCA) (Mode 1; scaled) of Sea Level Pressure (SLP) and Sea Surface Temperature (SST) monthly anomalies for the region between -180 °W to -70 °W and +30 °N to -30 °S. MCA coefficients (scaled) are below. The mode represents 94% of the squared covariance fraction (SCF).

Maximum Correlation Analysis (MCA) is similar to Empirical Orthogonal Function Analysis (EOF) in that they both deal with the decomposition of a covariance matrix. In EOF, this is a covariance matrix based on a single spatio-temporal field, while MCA is based on the decomposition of a "cross-covariance" matrix derived from two fields.

Empirical Orthogonal Function (EOF) Analysis for gappy data

[Updates]: The following approach has serious shortcomings, which I have recently become aware of. In a comparison of gappy EOF approaches Taylor et al. (2013) [pdf] show that this traditional approach is not as accurate as others. Specifically, the approach of DINEOF (Data Interpolating Empirical Orthogonal Functions) proved to be the most accurate. I have outlined the DINEOF algorithm in another post [link]. and show a comparison of gappoy EOF methods here: http://menugget.blogspot.de/2014/09/pca-eof-for-data-with-missing-values.html. The R package "sinkr" now contains a version of the function ("eof") for easy installation: https://github.com/menugget/sinkr

-----------------

The following is a function for the calculation of Empirical Orthogonal Functions (EOF). For those coming from a more biologically-oriented background and are familiar with Principal Component Analysis (PCA), the methods are similar. In the climate sciences the method is usually used for the decomposition of a data field into dominant spatial-temporal modes.

me nugget

Thursday, February 25, 2016

RSpectra::svds function now default method for sinkr::dineof

Monday, September 15, 2014

PCA / EOF for data with missing values - a comparison of accuracy

Thursday, January 10, 2013

Lomb-Scargle periodogram for unevenly sampled time series

Tuesday, October 30, 2012

DINEOF (Data Interpolating Empirical Orthogonal Functions)

Monday, April 2, 2012

Working with Globcolour data

Sunday, March 25, 2012

Canonical Correlation Analysis for finding patterns in coupled fields

Wednesday, March 14, 2012

A ridiculous proof of concept: xyz interpolation

Tuesday, December 13, 2011

Maximum Covariance Analysis (MCA)

Thursday, November 24, 2011

Empirical Orthogonal Function (EOF) Analysis for gappy data

NPR Topics: Science

WWF - Environmental News