Tuesday, May 12, 2015

Updated getcolors function

There is an updated version of the getcolors function available in the sinkr package, called getcolors2. The previous version allowed for the visual selection of colors from a palette of 216 colors. This updated version uses a more natural color palette with a rainbow gradient on the x-axis and lightness on the y-axis. Following a series of color selections via clicks, the function returns the vector of chosen colors that can later be used in a plot:
Example (from getcolors2 help):
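A rough sketch of comparable usage (assuming getcolors2() takes the number of colors to select as its argument; the example bundled with the package help may differ):

library(sinkr)  # https://github.com/marchtaylor/sinkr

n <- 4
cols <- getcolors2(n)  # opens the palette device; click n times to choose colors

# hypothetical data plotted with the chosen colors
set.seed(1)
y <- matrix(cumsum(rnorm(100 * n)), ncol = n)
matplot(y, type = "l", lty = 1, lwd = 3, col = cols)
legend("topleft", legend = paste("series", 1:n), col = cols, lwd = 3)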

Wednesday, April 8, 2015

Map projection cheat sheet


Here's a cheat sheet of map projection settings for the mapproject function (mapproj package). I've repeated the same general procedure for each map: 1. create a blank world map, 2. project colored polygons, 3. overlay country boundaries. Some individual projections needed extra specifications (e.g. a different orientation, lat/lon limits, or additional parameters). Most projections work well as long as the limits and orientation are valid; in a few cases ('square' and 'hex') the country boundaries were wrapped incorrectly. I was unable to get the two spheroid projections working due to the escape character in their names ('sp\_mercator', 'sp\_albers') (maybe someone has a suggestion here?).
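As a rough illustration of the three-step procedure, here is a minimal sketch using the Mollweide projection and three arbitrary example points (the full cheat-sheet script below differs in detail):

library(maps)     # world polygons
library(mapproj)  # mapproject()

# 1. blank world map, drawn directly in the chosen projection
map("world", projection = "mollweide", orientation = c(90, 0, 0),
    col = "grey90", fill = TRUE)

# 2. project additional lon/lat data; with no projection argument,
#    mapproject() reuses the projection set up by the last map() call
pts <- mapproject(x = c(-58.4, 13.4, 151.2), y = c(-34.6, 52.5, -33.9))
points(pts, pch = 19, col = "red")

# 3. overlay country boundaries in the same projection
map("world", projection = "mollweide", orientation = c(90, 0, 0), add = TRUE)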

In addition to the maps and mapproj packages, the sinkr package is also used for the calculation of global distances, value-to-color conversion, and polygon creation.


Script to reproduce (high resolution png and pdf versions):

Thursday, April 2, 2015

Add text using relative plot region coordinates



Here is a simple but useful function for adding text to a plot device using relative x, y coordinates (i.e. between 0 and 1). I found myself programming these few lines repeatedly and decided to finally bundle them into a function, called reltext(). The function can be found below, and is also available in the sinkr package (https://github.com/marchtaylor/sinkr).
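A minimal sketch of such a function (the sinkr version may differ in details, e.g. in how it handles log axes):

reltext <- function(x, y, labels, ...) {
  usr <- par("usr")                        # plot region limits (x1, x2, y1, y2)
  xpos <- usr[1] + x * (usr[2] - usr[1])   # relative -> user coordinates
  ypos <- usr[3] + y * (usr[4] - usr[3])
  if (par("xlog")) xpos <- 10^xpos         # back-transform for log axes
  if (par("ylog")) ypos <- 10^ypos
  text(xpos, ypos, labels, ...)
}

# usage
plot(rnorm(100), rnorm(100))
reltext(0.05, 0.95, labels = "upper-left label", adj = c(0, 1), col = "red")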

Exploring fishing harvest feedback policies in R using vector optimization





Walters and Martell (2004, Section 3.2) present an interesting example of harvest feedback policy exploration using basic spreadsheet techniques and optimization procedures (e.g. Solver in Excel, or Optimizer in Quattro Pro). For a given species, parameters describing growth, mortality, and reproduction/recruitment are used to model the population's dynamics. Given a random series of environmental effects on the recruitment parameters (survival and carrying capacity), the optimization procedure can be used to determine the best series of fishing mortality (F) values under a given objective function (e.g. maximizing the sum of harvest yield, log-transformed yield, or discounted yield):

In this example, for a hypothetical population of tilapia, yearly recruitment survival varies by ca. +/-30%, while recruitment carrying capacity has two periods of poor recruitment (-50%, e.g. due to a regime shift) during simulation years 20-35 and 60-75, as indicated by the shaded areas. The optimal F series are similar for maximizing yield or discounted yield (discount rate = 5%), likely because tilapia is a fast-growing, high-fecundity species and can rebound quickly following intensive fishing pressure. Maximizing log-transformed yield is the "risk-averse" strategy, whereby peaks in harvest are less highly valued and fluctuations in fishing pressure (and yield) are dampened:

Interesting patterns emerge, e.g. one should increase fishing pressure at the beginning of a "bad" recruitment regime (grey regions) and reduce fishing at the beginning of a "good" recruitment regime. In both cases, this speeds the trajectory of the stock to its new equilibrium and takes advantage of soon-to-be lost or gained harvestable biomass. In other words, fish hard before bad times so as not to "waste" a large spawning stock, and ease off fishing before good times so as to make full use of the spawning potential. Walters and Martell (2004) state,
These anticipatory behaviors are a bizarre reversal of our usual prescriptions about how to hedge against uncertain future changes in productivity and would almost certainly never be acceptable as management recommendations in practical decision settings.
The figure at the top of the post shows the relationship between biomass and yield under the optimal scenarios. The relationship is most strongly linear for the log(Yt) optimization, but a relationship is evident in all three scenarios. One could imagine using the coefficients of this regression as a general harvesting strategy: the x-intercept would be the minimum biomass before harvesting is allowed, and the slope gives the exploitation rate (E = harvest/biomass). One would need to program this strategy to compare it properly with the optimal harvest values, but even the simplified strategy of using a constant exploitation rate from the regression (e.g. E = 0.34, 0.23, 0.37) yields 87, 77, and 90% of the optimal solution for sum(Yt), sum(log(Yt)), and sum(discounted Yt), respectively. These levels of performance (and higher) are typical for most examples explored by the approach (Walters and Parma, 1996).
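As a hypothetical illustration of turning such a regression into a harvest rule, the sketch below uses made-up placeholder vectors for the yearly biomass (Bt) and yield (Yt) rather than the actual optimization output:

set.seed(1)
Bt <- runif(50, 200, 1000)                           # placeholder biomass series
Yt <- pmax(0, 0.3 * (Bt - 250) + rnorm(50, sd = 20)) # placeholder yield series

fit  <- lm(Yt ~ Bt)
Bmin <- -coef(fit)[1] / coef(fit)[2]  # x-intercept: minimum biomass before harvest
E    <- coef(fit)[2]                  # slope: exploitation rate (harvest / biomass)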

Finally, the authors also demonstrate the situation where all sizes are vulnerable to fishing (as opposed to a knife-edge selection of individuals above a given minimum size, as in the above example). Here the sum(Yt) optimization results in a "pulse-fishing" strategy whereby the stock is fished hard (usually E > 0.5) for 1-2 years, followed by 1-2 years of recovery where no fishing is allowed. This pattern results from "growth overfishing", whereby unselective harvesting wastes a large part of the biomass that is still growing rapidly:


The examples use the stockSim() and optim.stockSim() functions of the R package fishdynr. Instructions for direct installation from GitHub can be found here: https://github.com/marchtaylor/fishdynr. Vector optimization of the F series is computed with the optim() function (stats package).
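A self-contained sketch of the general idea follows; this is not the fishdynr code - a toy logistic (surplus-production) model and made-up parameter values are used, and optim() works on log(F) to keep the mortality rates positive:

# toy simulator: yearly yield from a logistic surplus-production stock model
simYield <- function(Fvec, B0 = 1000, r = 0.5, K = 1000) {
  B <- B0
  Y <- numeric(length(Fvec))
  for (t in seq_along(Fvec)) {
    Y[t] <- B * (1 - exp(-Fvec[t]))                 # yield taken in year t
    B <- max(B - Y[t] + r * B * (1 - B / K), 1e-6)  # surplus production
  }
  Y
}

nYears <- 50
res <- optim(
  par = rep(log(0.2), nYears),                     # initial guess, log(F) scale
  fn  = function(logF) -sum(simYield(exp(logF))),  # minimize negative total yield
  method = "BFGS", control = list(maxit = 500)
)
Fopt <- exp(res$par)  # optimized F series for the toy model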

References:
- Walters, C. J., Martell, S. J., 2004. Fisheries ecology and management. Princeton University Press.
- Walters, C., Parma, A. M., 1996. Fixed exploitation rate strategies for coping with effects of climate change. Canadian Journal of Fisheries and Aquatic Sciences, 53(1), 148-158.

Script to reproduce the example:

Sunday, February 1, 2015

R package "fishdynr"


The fishdynr package allows for the construction of some basic population dynamics models commonly used in fisheries science. Included are a single-cohort model, cohortSim, and a more complex iterative model that incorporates a stock-recruitment relationship, stockSim. The model functions take a list of parameters as their main argument, which contains information about the given population's dynamics (growth, mortality, recruitment, etc.) and fishery (e.g. selectivity function and related parameters). This allows for a great deal of flexibility in adapting the analyses to a particular species or fishery by defining the functions for growth, mortality, fishing selectivity, recruitment, etc. outside the analysis. The package (located on GitHub) can be easily installed via the install_github function of the devtools package:

library(devtools)
install_github("fishdynr", "marchtaylor")


The above figure shows the output of Beverton and Holt's (1957) yield-per-recruit model (ypr function - a wrapper for cohortSim), based on variable fishing mortality (F) and length at first capture (Lcap, knife-edge selection). Length at maturity is shown as a dashed white line for reference. In this example, the maximum yield (Ymax) is defined as the maximum possible yield without depleting the spawning biomass below 50% of its virgin, unfished level. This simple cohort model has many additional outputs that can be helpful in visualizing processes of growth and mortality:

I hope to continue to document different models used in fisheries science within the package. Any suggestions or comments are welcome.

References
Beverton, R. J. H.; Holt, S. J. (1957), On the Dynamics of Exploited Fish Populations, Fishery Investigations Series II Volume XIX, Ministry of Agriculture, Fisheries and Food

Script to reproduce the examples

Friday, December 5, 2014

Data point locator function


Here's a little function to select data points in an open graphical device (ptlocator()). The function scales the x and y axes to give them equal weighting and remove the influence of differing units or ranges. It then calculates the Euclidean distance between the selected locations (using the locator() function) and the x, y coordinates of the plotted data points. A colored point is drawn over the data point closest to each clicked location, and the function returns the vector positions (indices) of those closest points.

[NOTE: I just realized that the identify function is very similar in its usage]

The function:
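A minimal sketch of how such a function might be written (the sinkr version may differ in details):

ptlocator <- function(n = 1, x, y, col = rgb(1, 0, 0, 0.5), ...) {
  # scale x and y to 0-1 so both axes are weighted equally
  xsc <- (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
  ysc <- (y - min(y, na.rm = TRUE)) / diff(range(y, na.rm = TRUE))
  pos <- numeric(n)
  for (i in seq(n)) {
    pt  <- locator(1)  # clicked location
    ptx <- (pt$x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
    pty <- (pt$y - min(y, na.rm = TRUE)) / diff(range(y, na.rm = TRUE))
    pos[i] <- which.min(sqrt((xsc - ptx)^2 + (ysc - pty)^2))  # closest data point
    points(x[pos[i]], y[pos[i]], pch = 19, col = col, ...)    # mark chosen point
  }
  pos
}

# usage: plot(x, y); ptlocator(3, x, y)  # then click three times in the device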

Tuesday, September 30, 2014

Additional tips for structuring an individual-based model in R

I had a reader ask me recently to help understand how to modify the code of an individual-based model (IBM) that I posted a while back. It was my first attempt at an IBM in R, and I realized that I have since made some significant changes to the way I code such models. Most of the changes are structural, but they help a lot in clearly documenting the model and its underlying processes.
Basically, I follow a structure that a friend (a very experienced modeller specializing in IBMs) taught me. Granted, R isn't the best language for such models but, depending on your computational needs, it can be quite easy to implement one if you already have experience with basic R programming. The idea is to code the important processes of the IBM as functions, which accept an object of individuals as the main argument. The functions can be as simple or elaborate as needed, but the key is that when you finally set up your simulation, you only need to call these functions. This results in easily legible code, where you are less likely to get lost in the model and can concentrate more on the processes to be performed during each iteration (e.g. growth, reproduction, mortality):

inds <- grow.inds(inds)
inds <- reproduce.inds(inds)
inds <- kill.inds(inds)


Since the previous post, I have moved towards using data.frame objects to store individuals and their attributes, because I find it easier to apply functions for extracting summary statistics (e.g. a histogram of the age distribution in the final population):


The example shown here is a modification of the genetic drift example that I showed before - only this time I have included 4 color phenotypes. The model runs until either one phenotype dominates the population or a maximum number of iterations is reached. Reproduction allows individuals to have more than one offspring per time step, but the number of offspring is skewed towards zero. Death is modeled by a constant instantaneous mortality rate. I have used such a setup with more complicated models of fish genetics and found the performance to be quite fast, even with population sizes of 300,000 individuals. The key is to maintain as much vectorization in the functions as possible.
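A stripped-down sketch of this setup (not the full script linked below; the function names, parameter values, and Poisson reproduction used here are illustrative assumptions):

make.inds <- function(n, phenos = c("red", "blue", "green", "orange")) {
  data.frame(pheno = sample(phenos, n, replace = TRUE), age = 0,
             stringsAsFactors = FALSE)
}
grow.inds <- function(inds) { inds$age <- inds$age + 1; inds }
reproduce.inds <- function(inds, lambda = 0.2) {
  noff <- rpois(nrow(inds), lambda)  # offspring counts, skewed towards zero
  if (sum(noff) == 0) return(inds)
  rbind(inds, data.frame(pheno = rep(inds$pheno, noff), age = 0,
                         stringsAsFactors = FALSE))
}
kill.inds <- function(inds, M = 0.2) {
  inds[runif(nrow(inds)) < exp(-M), , drop = FALSE]  # constant mortality rate M
}

set.seed(1)
inds <- make.inds(100)
for (t in seq(200)) {
  inds <- grow.inds(inds)
  inds <- reproduce.inds(inds)
  inds <- kill.inds(inds)
  if (nrow(inds) == 0 || length(unique(inds$pheno)) == 1) break  # stopping rules
}
table(inds$pheno)  # summary of phenotypes in the final population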

To reproduce the example:

Wednesday, September 17, 2014

Maximal Information Coefficient (Part II)



A while back, I wrote a post simply announcing a recent paper that described a new statistic called the "Maximal Information Coefficient" (MIC), which is able to describe the correlation between paired variables regardless of whether the relationship is linear or nonlinear. This turned out to be quite a popular post, and it included a lively discussion as to the merits of the work and the difficulties in using the software provided by the authors. Regarding the latter, I also had difficulties running the software from R and thus did not include an example. Checking back on this topic, I was pleased to see that an R package has since been developed: minerva: Maximal Information-Based Nonparametric Exploration R package for Variable Analysis (Albanese et al. 2013). Further documentation of the package can be found here: http://minepy.sourceforge.net/
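A rough sketch of the basic workflow (the baseball data are not bundled with the package, so a small made-up data set is used here):

library(minerva)

set.seed(1)
n <- 200
salary <- rnorm(n)
hits   <- salary^2 + rnorm(n, sd = 0.3)  # nonlinear association with salary
walks  <- 0.5 * salary + rnorm(n)        # linear association with salary
x <- cbind(salary, hits, walks)

res <- mine(x)    # pairwise MINE statistics for all column pairs
res$MIC[1, ]      # MIC of each variable with salary (column 1)
cor(x)[1, ]       # Pearson correlation, for comparison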

I tried out the package on the baseball data set used in the original paper by Reshef et al. (2011), where a suite of variables is correlated against a baseball player's salary. The authors state in their paper:
"In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of p, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. By contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary."
Here is a summary of the results computed with the mine() function of the minerva package (top 10 ranking MIC coefficients), which reproduces the results shown in Supplementary table S12 of the original paper:


For a visual representation of these results, the top figure plots MIC vs. Pearson and MIC Rank vs. Pearson Rank. Thanks to minerva author and maintainer M. Filosi for helping in reproducing the example.


References:

Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., & Furlanello, C. (2013). minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3), 407-408. [link]

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., ... & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518-1524. [link]


Code to reproduce the example:

Monday, September 15, 2014

PCA / EOF for data with missing values - a comparison of accuracy

Not all Principal Component Analysis (PCA) (also called Empirical Orthogonal Function analysis, EOF) approaches are equal when it comes to dealing with a data field that contains missing values (i.e. is "gappy"). The following post compares several methods by assessing the accuracy of the derived PCs in reconstructing the "true" data set, as was similarly done by Taylor et al. (2013).

The gappy EOF methods to be compared are:
  1. LSEOF - "Least-Squares Empirical Orthogonal Functions" - The traditional approach, which weights the covariance matrix used for the EOF decomposition by the number of paired observations, and further scales the projected PCs by these same weightings (see Björnsson and Venegas 1997, von Storch and Zwiers 1999 for details).
  2. RSEOF - "Recursively Subtracted Empirical Orthogonal Functions" - This approach modifies LSEOF by recursively solving for the leading EOF, whose reconstructed field is then subtracted from the original field. The recursive subtraction continues until a given stopping point (e.g. number of EOFs, % remaining variance, etc.) (see Taylor et al. 2013 for details).
  3. DINEOF - "Data Interpolating Empirical Orthogonal Functions" - This approach gradually solves for EOFs by means of an iterative algorithm that fits EOFs to a given set of non-missing reference points (a small percentage of the observations) via RMSE minimization (see Beckers and Rixen 2003 for details).
I have introduced both the LSEOF [link] and DINEOF [link] methods in the past, but have never directly compared them for the blog. The purpose of this post is to make this comparison and to also introduce a more general EOF function that is capable of conducting RSEOF. All analyses can be reproduced following installation of the "sinkr" package: https://github.com/marchtaylor/sinkr
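A rough sketch of how the comparison might be set up follows. It assumes that eof() and dineof() are the relevant sinkr functions, that eof() switches to RSEOF via a recursive argument, and that dineof() returns the interpolated field as element Xa; check the package help pages, as the names may differ:

library(sinkr)  # https://github.com/marchtaylor/sinkr

set.seed(1)
Xtrue <- outer(sin(seq(0, 4 * pi, length.out = 100)),
               cos(seq(0, 2 * pi, length.out = 40)))  # synthetic "true" field
Xobs  <- Xtrue + rnorm(length(Xtrue), sd = 0.1)       # add noise
Xobs[sample(length(Xobs), 0.3 * length(Xobs))] <- NA  # make the field gappy

Els  <- eof(Xobs)                    # LSEOF: least-squares EOF on the gappy field
Ers  <- eof(Xobs, recursive = TRUE)  # RSEOF: recursively subtracted EOFs
din  <- dineof(Xobs)                 # DINEOF: iterative interpolation of the gaps
Edin <- eof(din$Xa)                  # EOF of the interpolated (full) field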

The basic problem comes down to the difficulty of decomposing a matrix that is not "positive-definite", i.e. the covariance matrix estimated from a gappy data set. DINEOF avoids this issue entirely by first interpolating the missing values to create a full data field, while LSEOF and RSEOF rely on decomposing the estimated matrix directly. A known problem is that the trailing EOFs derived from such a matrix have amplified singular values, which can consequently amplify errors in field reconstructions when they are included. The RSEOF approach thus attempts to remedy these issues by recursively solving for only the leading EOFs. In the following examples, I show the performance of the three approaches in terms of reconstructing the data field (including the "true" values).

Example 1 - Synthetic data set:
The first example uses the synthetic data set used by Beckers and Rixen (2003) in their introduction of the DINEOF approach. The accuracy of the reconstruction depends on the number of EOFs used. In a non-gappy example, a perfect reconstruction should be possible using the full set of EOFs - in fact, it only takes 9 EOFs when using the non-noisy true field, since it is a composite of 9 signals. In the case of the noisy, gappy data sets, reconstructions with trailing EOFs may increase errors. This can be seen in the figure at the top of the post showing RMSE vs. the number of EOFs used in the reconstruction.

The figure shows the DINEOF approach to be the most accurate. The LSEOF approach has a clear RMSE minimum with 4 EOFs, while the RSEOF approach was largely able to remedy the amplification of error when using trailing EOFs. The problem of error amplification is even more dramatic when viewed visually, as in the following, where the full set of EOFs has been used:

Tuesday, September 2, 2014

"sinkr" - a collection of functions featured on "me nugget"

The R package sinkr (version 1.0) has now been released:  https://github.com/marchtaylor/sinkr

I have finally gotten around to learning how to create an R package and decided to start by bundling functions that I have featured on the blog. Thanks to the RStudio team for making this so easy (in combination with the R packages roxygen2 and devtools). In addition to the great tips on the RStudio website, I found the following YouTube videos helpful along the way:
Being new to the world of R packages (and due to the eclectic nature of the functions in sinkr), I'm not yet confident enough to upload it to CRAN. But one can easily install the package using devtools and the following code (see https://github.com/hadley/devtools for OS-specific package updating tips):

library(devtools)

install_github("marchtaylor/sinkr")

For those using functions that were featured on this site in the past, the present versions may differ slightly, especially in the case of function names. For example, names like plot.stacked were being treated as methods of the generic plot function during the package build, so the "." characters have been removed from all function names (e.g. plotStacked).

sinkr functions include: