me nugget: 2015

Friday, October 9, 2015

A gem of a comment regarding likelihood optimization

An interesting comment on conducting likelihood optimization (frequentist approach) (from http://www.petrkeil.com/?p=393):

R code:

### The following R code does just this for a normal linear regression (eq1 above):
# Generate some data
a <- 5
b <- 2
sigma <- 0.5
x <- runif(1000,0,20)
y <- a+(b*x)+rnorm(1000)*sigma
dat <- data.frame(x=x,y=y)
plot(dat$x,dat$y)
 
# Write the likelihood function for the statistical model (in this case, normal linear regression)
loglike<- function(par,dat)
{
a <- par['a']
b <- par['b']
sigma <- par['sigma']
 
return(sum(dnorm(dat$y-a-(b*dat$x),mean=0,sd=sigma,log=T)))
}
 
# Maximise the log likelihood
res <- optim(par=c(a=1.5,b=1.5,sigma=0.6), fn=loglike, dat=dat, method='L-BFGS-B', lower=rep(0.2,3), upper=c(a=5,b=5,sigma=2), control=list(fnscale=-1))
res$value
res$par
 
# Plot the fitted result over the data to see if it looks like a good fit (which we know it will be).
x <- seq(from=min(dat$x),to=max(dat$x),by=0.1)
y <- res$par['a']+(res$par['b']*x)
plot(dat$x, dat$y, type='p')
points(x, y, type='l', col='red')
 
 
### And compare to LM
lmfit <- lm(y ~ x, dat)
summary(lmfit)$coeff
res$par

Created by Pretty R at inside-R.org

Wednesday, July 22, 2015

A simple square binning function

While I like the hexbin package and see the advantage of having more circular bins, I find that square binning allowed me to take advantage of lower level plotting functions like image, which allows for more flexibility when plotting. The following is an efficient implementation within the function sqbin, available in the sinkr package.

Example:

#library(devtools)
#install_github("marchtaylor/sinkr")
library(sinkr)
 
# Synthetic data
set.seed(1)
n <- 1e6
x <- runif(n, min=-3, max=3)
y <- 4*x^2 + rnorm(n, sd=5)
 
sqbin.res <- sqbin(x,y)
 
# Plot
op <- par(mar=c(4,4,1,1))
image(sqbin.res, col=jetPal(20))
par(op)
 
# Plot with legend
op <- par(no.readonly = TRUE)
lo <- matrix(1:2, nrow=1, ncol=2)
layout(lo, widths=c(4,1), heights=c(4), respect=TRUE)
par(cex=1)
par(mar=c(3,3,1,1))
image(sqbin.res, col=jetPal(20))
par(mar=c(3,0,1,4))
imageScale(sqbin.res$z, col=jetPal(20), axis.pos=4)
mtext("Frequency", line=2.5, side=4)
par(op)

Created by Pretty R at inside-R.org

Monday, July 20, 2015

The rise (and fall?) of R-bloggers - A web scraping of posting trends

I recently came across a great new package for web scraping called rvest by Hadley Wickham. This nice little post got me started: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/. It was really incredibly easy to use when combined with the css selector tool "selector gadget"(http://selectorgadget.com/), which helps you select elements of a webpage to scrape.

I decided to have a go at extracting some stats for R-bloggers postings (e.g. dates, authors). Aggregating by date results in the plot above. You may notice a prominent outlier on 2011-05-26, which was due to repeat postings of the same entry. I also checked to see if there is a trend according to the number of postings by day of the week:

As one might expect, there is a lull in postings on the weekend. So, if you want longer exposure on the front page of R-bloggers, then you might want to post later in the week to lessen the competition!

I fit a GAMM (family="poisson") to the data with a cyclical spline term for day of the week ("wday") and another spline term for time (i.e. numeric conversion of date) to further explore the trends (with help from a great post on using GAMs with temporal data):

Both terms are significant, and you can better visualize the long-term trend. Here is a plot of the model prediction:

The day of the week oscillations are strong enough to result in what looks like a band of predicted values - I added a line for a model that only contains the long-term trend (in red) for easier viewing. A zoom in on a smaller subset of the data reveals the weekly oscillations:

The long-term trend indicates that a decline in the number of posts started in 2013, and continues to to the present. I binned the data by year quarter and found the same:

This decline seems to be related to the number of authors that are regularly posting, which is also declining at a similar rate. The correlation is strong enough that we can estimate an average of 3.7 postings / author / quarter (or, ~1 posting / author / month). So, what's going on? My hunch is that, while the use of R (and probably the number of blogs) continues to grow (link1, link2), R-bloggers doesn't seem to be accepting all blogs that apply to have their feeds aggregated (and may even be actively removing some; e.g. this blog has unfortunately been dropped without explanation). Maybe the time invested is maxed out at ~ 7 posts / day?

All in all, I was pleasantly surprised how easy web scraping can be. The code to reproduce this exercise can be found below (be warned that the web scraping loop took about 30 min to gather data from 620 webpages).

To reproduce the example (required packages: rvest (CRAN), sinkr (GitHub)):

Updated getcolors function

There is an updated version of the getcolors function available in the sinkr package, called getcolors2. The previous version allowed for the visual selection of colors from a palette of 216 colors. This updated version uses a more natural color palette with a rainbow gradient on the x-axis and lightness on the y-axis. Following a series of color selections via clicks, the function returns the vector of chosen colors that can later be used in a plot:

Example (from getcolors2 help):

Map projection cheat sheet

Here's a cheat sheet for map projection settings for the mapproject function (mapproj package). I've repeated the same general procedure for each map: 1. create blank world map, 2. project colored polygons, 3. overlay country boundaries. Some specifications were needed for individual map projections (i.e. different orientation, lat/lon limits, additional parameters). Most projections work well as long as the limits and orientation are valid - in a few cases country boundaries ('square' and 'hex') were wrapped incorrectly. I was unable to get the two spheroid projections working due to the escape character in their names ('sp\_mercator', 'sp\_albers') (maybe someone has a suggestion here?).

In addition to the maps and mapproj packages, the sinkr package is also used for the calculation of global distance, value to color conversion, and polygon creation.

Script to reproduce (high resolution png and pdf versions):

Add text using relative plot region coordinates

Here is a simple but useful function for adding text to a plot device using the relative x, y coordinates (i.e. between 0 and 1). I found myself programming these few lines repeatedly and decided to finally bundle it into a function, called reltext(). The function can be found below, and is also available in the sinkr package (https://github.com/marchtaylor/sinkr).

Walters and Martell (2004, Section 3.2) present an interesting example of harvest feedback policy exploration using basic spreadsheet techniques and optimization procedures (e.g. Solver in Excel, or Optimizer in Quattro Pro). For a given species, parameters describing growth, mortality, and reproduction/recruitment are used to model the population's dynamics. Given a random series of environmental effects to recruitment parameters (survival and carrying capacity) the optimization procedure can be used to determine the best series of fishing mortality (F) values given a cost function (e.g. maximize the sum of: harvest yield, log-transformed yield, discounted yield):

In this example, for a hypothetical population of tilapia, yearly recruitment survival varies by ca. +/-30%, while recruitment carrying capacity has two periods of poor recruitment (-50%, e.g. due to a regime shift) during the simulation years 20-35 and 60-75, as indicated by the shaded areas. The optimal F series are similar for maximizing yield or discounted yield (discount rate = 5%) likely due to the fact that tilapia is a fast-growing, high fecundity species and can rebound quickly following intensive fishing pressure. Maximizing log-transformed biomass is the "risk-averse" strategy, whereby peaks in harvest are less highly valued, and fluctuations in fishing pressure (and yield) are dampened:

Interesting patterns emerge, e.g. one should increase fishing pressure at the beginning of a "bad" recruitment regime (grey regions) and reduce fishing at the beginning a "good" recruitment regime. In both cases, this speeds the trajectory of the stock to it's new equilibrium and takes advantage of soon to be lost or gained harvestable biomass. In other words, fish hard before bad times so as not to "waste" a large spawning stock, and lay off of fishing before good times so as to make full use of the spawning potential. Martell and Walters (2004) state,

These anticipatory behaviors are a bizarre reversal of our usual prescriptions about how to hedge against uncertain future changes in productivity and would almost certainly never be acceptable as management recommendations in practical decision settings.

The figure at the top of the post shows the relationship between biomass and yield during the optimal scenarios. The relationship is most strongly linear for the log(Yt) optimization, but a relationship is evident for all three scenarios. One could imagine using the coefficients of this regression as a general harvesting strategy; x-intercept would be the minimum biomass before harvesting is allowed, and the slope gives the exploitation rate (E = harvest/biomass). One would need to program this strategy to be able to make the comparison to the optimal harvested values, but even the simplified strategy of using a constant exploitation rate from the regression (e.g. E = 0.34, 0.23, 0.37) results in 87, 77, and 90% of the optimal solution for sum(Yt), sum(log(Yt)), and sum(discounted Yt), respectively. These levels of performance (and higher) are typical for most examples explored by the approach (Walters and Parma, 1996).

Finally, the authors also demonstrate the situation where all sizes are vulnerable to fishing (as opposed to a knife-edge selection of individuals above a given minimum size, as in the above example). Here the sum(Yt) optimization results in a "pulse-fishing" strategy whereby the stock is fished hard (usually E > 0.5) for 1-2 years, followed by 1-2 years of recovery where no fishing is allowed. This pattern results from "growth overfishing", whereby unselective harvesting wastes a large part of the biomass that is still growing rapidly:

The examples use the stockSim() and optim.stockSim() functions of the R package fishdynr. Instructions for direct installation from GitHub can be found here: https://github.com/marchtaylor/fishdynr. Vector optimization of the F series is computed with the optim() function (stats package).

References:
- Walters, C. J., Martell, S. J., 2004. Fisheries ecology and management. Princeton University Press.

- Walters, C., Parma, A. M., 1996. Fixed exploitation rate strategies for coping with effects of climate change. Canadian Journal of Fisheries and Aquatic Sciences, 53(1), 148-158.

Script to reproduce the example:

R package "fishdynr"

The fishdynr package allows for the construction of some basic population dynamics models commonly used in fisheries science. Included are models of a single cohort, cohortSim, and a more complex iterative model that incorporates a stock-recruitment relationship, stockSim. The model functions require a list of parameters as the main argument, which contains information about the given population's dynamics (growth, mortality, recruitment, etc.) and fishery (e.g. selectivity function and related parameters). This allows for a great deal of flexibility in adapting the analyses to particular species or fishery by defining functions of growth, mortality, fishing selectivity, recruitment, etc., outside the analysis. The package (located on GitHub) can be easily installed via the install_github function of the devtools package:

library(devtools)
install_github("fishdynr", "marchtaylor")

The above figure shows the output of the Beverton and Holt's (1957) yield-per-recruit model (ypr function - wrapper for cohortSim), based on variable fishing mortality (F) and length at first capture (Lcap, knife-edge selection). Length at maturity is shown as a dashed white line for reference. In this example, the maximum yield (Ymax) is defined as the maximum possible yield without depletion of the spawning biomass below 50% of it's virgin, unfished biomass. This simple cohort model has many additional outputs that can be helpful in visualizing processes of growth and mortality:

I hope to continue to document different models used in fisheries science within the package. Any suggestions or comments are welcome.

References
Beverton, R. J. H.; Holt, S. J. (1957), On the Dynamics of Exploited Fish Populations, Fishery Investigations Series II Volume XIX, Ministry of Agriculture, Fisheries and Food

Script to reproduce the examples

me nugget

Friday, October 9, 2015

A gem of a comment regarding likelihood optimization

Wednesday, July 22, 2015

A simple square binning function

Monday, July 20, 2015

The rise (and fall?) of R-bloggers - A web scraping of posting trends

Tuesday, May 12, 2015

Updated getcolors function

Wednesday, April 8, 2015

Map projection cheat sheet

Thursday, April 2, 2015

Add text using relative plot region coordinates

Exploring fishing harvest feedback policies in R using vector optimization

Sunday, February 1, 2015

R package "fishdynr"

NPR Topics: Science

WWF - Environmental News