Wednesday, July 22, 2015

A simple square binning function

While I like the hexbin package and see the advantage of more circular bins, square binning lets me take advantage of lower-level plotting functions like image, which allows for more flexibility when plotting. The following is an efficient implementation, the function sqbin, available in the sinkr package.


library(sinkr) # provides sqbin, jetPal, and imageScale

# Synthetic data
n <- 1e6
x <- runif(n, min=-3, max=3)
y <- 4*x^2 + rnorm(n, sd=5)
sqbin.res <- sqbin(x,y)
# Plot
op <- par(mar=c(4,4,1,1))
image(sqbin.res, col=jetPal(20))
# Plot with legend
op <- par(no.readonly = TRUE)
lo <- matrix(1:2, nrow=1, ncol=2)
layout(lo, widths=c(4,1), heights=c(4), respect=TRUE)
image(sqbin.res, col=jetPal(20))
imageScale(sqbin.res$z, col=jetPal(20), axis.pos=4)
mtext("Frequency", line=2.5, side=4)

Monday, July 20, 2015

The rise (and fall?) of R-bloggers - A web scraping of posting trends

I recently came across a great new package for web scraping called rvest by Hadley Wickham. This nice little post got me started. It was really incredibly easy to use when combined with the CSS selector tool SelectorGadget, which helps you select elements of a webpage to scrape.

I decided to have a go at extracting some stats for R-bloggers postings (e.g. dates, authors). Aggregating by date results in the plot above. You may notice a prominent outlier on 2011-05-26, which was due to repeat postings of the same entry. I also checked to see if there is a trend according to the number of postings by day of the week:

As one might expect, there is a lull in postings on the weekend. So, if you want longer exposure on the front page of R-bloggers, you might want to post later in the week to lessen the competition!

I fit a GAMM (family="poisson") to the data with a cyclical spline term for day of the week ("wday") and another spline term for time (i.e. numeric conversion of date) to further explore the trends (with help from a great post on using GAMs with temporal data):
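The model structure was along these lines (a minimal sketch; the data frame dat and its columns posts, wday, and time are stand-in names for the aggregated scraping results):

library(mgcv)
# cyclic cubic spline for day of the week; default spline for the long-term trend
fit <- gamm(posts ~ s(wday, bs="cc", k=7) + s(time),
  family=poisson, knots=list(wday=c(0, 7)), data=dat)
summary(fit$gam) # check significance of the smooth terms
plot(fit$gam, pages=1) # visualize the smooths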

Both terms are significant, and you can better visualize the long-term trend. Here is a plot of the model prediction:

The day of the week oscillations are strong enough to result in what looks like a band of predicted values - I added a line for a model that only contains the long-term trend (in red) for easier viewing. A zoom in on a smaller subset of the data reveals the weekly oscillations:

The long-term trend indicates that a decline in the number of posts started in 2013 and continues to the present. I binned the data by year quarter and found the same:

This decline seems to be related to the number of authors that are regularly posting, which is also declining at a similar rate. The correlation is strong enough that we can estimate an average of 3.7 postings / author / quarter (or ~1 posting / author / month). So, what's going on? My hunch is that, while the use of R (and probably the number of blogs) continues to grow (link1, link2), R-bloggers doesn't seem to be accepting all blogs that apply to have their feeds aggregated (and may even be actively removing some; e.g. this blog has unfortunately been dropped without explanation). Maybe the time that can be invested in curating the site is maxed out at ~7 posts / day?

All in all, I was pleasantly surprised by how easy web scraping can be. The code to reproduce this exercise can be found below (be warned that the web scraping loop took about 30 min to gather data from 620 webpages).

To reproduce the example (required packages: rvest (CRAN), sinkr (GitHub)):
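The core of the scraping loop looks something like this (a minimal sketch; the CSS selectors ".date" and ".meta" are assumptions of the kind SelectorGadget would suggest, and the archive URL scheme may have changed since this was written):

library(rvest)
dates <- authors <- NULL
for(i in seq(620)){
  page <- read_html(paste0("https://www.r-bloggers.com/page/", i, "/"))
  dates <- c(dates, html_text(html_nodes(page, ".date")))
  authors <- c(authors, html_text(html_nodes(page, ".meta")))
}
res <- data.frame(date=as.Date(dates, format="%B %d, %Y"), author=authors)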

Tuesday, May 12, 2015

Updated getcolors function

There is an updated version of the getcolors function available in the sinkr package, called getcolors2. The previous version allowed for the visual selection of colors from a palette of 216 colors. This updated version uses a more natural color palette with a rainbow gradient on the x-axis and lightness on the y-axis. Following a series of color selections via clicks, the function returns the vector of chosen colors that can later be used in a plot:
Example (from getcolors2 help):
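A minimal sketch of its usage (assuming, as with getcolors, that the main argument is the number of colors to select; an interactive graphics device is required):

library(sinkr)
set.seed(1)
y1 <- cumsum(rnorm(100))
y2 <- cumsum(rnorm(100))
cols <- getcolors2(2) # click twice in the palette to choose two colors
plot(y1, t="l", col=cols[1], lwd=3, ylim=range(c(y1, y2)))
lines(y2, col=cols[2], lwd=3)
legend("topleft", legend=c("y1", "y2"), col=cols, lwd=3)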

Wednesday, April 8, 2015

Map projection cheat sheet

Here's a cheat sheet for map projection settings for the mapproject function (mapproj package). I've repeated the same general procedure for each map: 1. create a blank world map, 2. project colored polygons, 3. overlay country boundaries. Some specifications were needed for individual map projections (e.g. a different orientation, lat/lon limits, or additional parameters). Most projections work well as long as the limits and orientation are valid; in a few cases country boundaries ('square' and 'hex') were wrapped incorrectly. I was unable to get the two spheroid projections working due to the escape character in their names ('sp\_mercator', 'sp\_albers') (maybe someone has a suggestion here?).

In addition to the maps and mapproj packages, the sinkr package is also used for the calculation of global distances, value-to-color conversion, and polygon creation.
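The basic three-step pattern for a single map looks like this (a minimal sketch using the mollweide projection; the full script below generalizes it and draws polygons rather than points):

library(maps)
library(mapproj)
# 1. blank world map to set up the projected coordinate system
map("world", projection="mollweide", type="n")
# 2. project lon/lat positions and add colored symbols (polygons in the full script)
pts <- mapproject(x=c(-60, 0, 60), y=c(-30, 0, 30))
points(pts, pch=21, bg=c(2, 3, 4), cex=2)
# 3. overlay country boundaries in the same projection
map("world", projection="mollweide", add=TRUE, col=8)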

Script to reproduce (high resolution png and pdf versions):

Thursday, April 2, 2015

Add text using relative plot region coordinates

Here is a simple but useful function for adding text to a plot device using relative x, y coordinates (i.e. between 0 and 1). I found myself programming these few lines repeatedly and decided to finally bundle them into a function, called reltext(). The function can be found below, and is also available in the sinkr package.
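In essence, it just rescales the plot region limits from par("usr") (a minimal sketch; the sinkr version may differ in details, e.g. handling of log-scaled axes):

reltext <- function(x, y, labels, ...){
  usr <- par("usr") # plot region extremes: x1, x2, y1, y2
  text(x=usr[1] + x*(usr[2]-usr[1]),
    y=usr[3] + y*(usr[4]-usr[3]),
    labels=labels, ...)
}
# Usage
plot(rnorm(100))
reltext(x=0.05, y=0.95, labels="A", font=2)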

Exploring fishing harvest feedback policies in R using vector optimization

Walters and Martell (2004, Section 3.2) present an interesting example of harvest feedback policy exploration using basic spreadsheet techniques and optimization procedures (e.g. Solver in Excel, or Optimizer in Quattro Pro). For a given species, parameters describing growth, mortality, and reproduction/recruitment are used to model the population's dynamics. Given a random series of environmental effects on the recruitment parameters (survival and carrying capacity), the optimization procedure can be used to determine the best series of fishing mortality (F) values under a given objective function (e.g. maximizing the sum of harvest yield, log-transformed yield, or discounted yield):

In this example, for a hypothetical population of tilapia, yearly recruitment survival varies by ca. +/-30%, while recruitment carrying capacity has two periods of poor recruitment (-50%, e.g. due to a regime shift) during the simulation years 20-35 and 60-75, as indicated by the shaded areas. The optimal F series are similar for maximizing yield or discounted yield (discount rate = 5%), likely because tilapia is a fast-growing, high-fecundity species that can rebound quickly following intensive fishing pressure. Maximizing log-transformed yield is the "risk-averse" strategy, whereby peaks in harvest are valued less highly and fluctuations in fishing pressure (and yield) are dampened:

Interesting patterns emerge, e.g. one should increase fishing pressure at the beginning of a "bad" recruitment regime (grey regions) and reduce fishing at the beginning of a "good" recruitment regime. In both cases, this speeds the trajectory of the stock to its new equilibrium and takes advantage of soon-to-be lost or gained harvestable biomass. In other words, fish hard before bad times so as not to "waste" a large spawning stock, and lay off of fishing before good times so as to make full use of the spawning potential. Walters and Martell (2004) state,
These anticipatory behaviors are a bizarre reversal of our usual prescriptions about how to hedge against uncertain future changes in productivity and would almost certainly never be acceptable as management recommendations in practical decision settings.
The figure at the top of the post shows the relationship between biomass and yield during the optimal scenarios.  The relationship is most strongly linear for the log(Yt) optimization, but a relationship is evident for all three scenarios. One could imagine using the coefficients of this regression as a general harvesting strategy; x-intercept would be the minimum biomass before harvesting is allowed, and the slope gives the exploitation rate (E = harvest/biomass). One would need to program this strategy to be able to make the comparison to the optimal harvested values, but even the simplified strategy of using a constant exploitation rate from the regression (e.g. E = 0.34, 0.23, 0.37) results in 87, 77, and 90% of the optimal solution for sum(Yt), sum(log(Yt)), and sum(discounted Yt), respectively. These levels of performance (and higher) are typical for most examples explored by the approach (Walters and Parma, 1996).

Finally, the authors also demonstrate the situation where all sizes are vulnerable to fishing (as opposed to a knife-edge selection of individuals above a given minimum size, as in the above example). Here the sum(Yt) optimization results in a "pulse-fishing" strategy whereby the stock is fished hard (usually E > 0.5) for 1-2 years, followed by 1-2 years of recovery where no fishing is allowed. This pattern results from "growth overfishing", whereby unselective harvesting wastes a large part of the biomass that is still growing rapidly:

The examples use the stockSim() and optim.stockSim() functions of the R package fishdynr. Instructions for direct installation from GitHub can be found here. Vector optimization of the F series is computed with the optim() function (stats package).
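The optimization set-up might look something like this (a minimal sketch; the stockSim() call signature, the contents of params, and the length of the F series are assumptions rather than the actual fishdynr interface):

# objective: negative summed yield (optim minimizes by default)
obj <- function(logitF, params){
  params$F <- plogis(logitF) # constrain F to (0, 1)
  res <- stockSim(params) # assumed call signature
  -sum(res$Y)
}
fit <- optim(par=rep(0, 100), fn=obj, params=params, method="BFGS")
Fopt <- plogis(fit$par) # optimized F series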

- Walters, C. J., Martell, S. J., 2004. Fisheries ecology and management. Princeton University Press.
- Walters, C., Parma, A. M., 1996. Fixed exploitation rate strategies for coping with effects of climate change. Canadian Journal of Fisheries and Aquatic Sciences, 53(1), 148-158.

Script to reproduce the example:

Sunday, February 1, 2015

R package "fishdynr"

The fishdynr package allows for the construction of some basic population dynamics models commonly used in fisheries science. Included are a single-cohort model, cohortSim, and a more complex iterative model that incorporates a stock-recruitment relationship, stockSim. The model functions require a list of parameters as the main argument, which contains information about the given population's dynamics (growth, mortality, recruitment, etc.) and fishery (e.g. selectivity function and related parameters). This allows for a great deal of flexibility in adapting the analyses to a particular species or fishery by defining functions for growth, mortality, fishing selectivity, recruitment, etc., outside the analysis. The package (located on GitHub) can be easily installed via the install_github function of the devtools package:

library(devtools)
install_github("fishdynr", "marchtaylor")
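As an illustration of the parameter-list interface, a set-up might look something like this (the element names are placeholders for illustration; see the cohortSim help page for the names the function actually expects):

library(fishdynr)
params <- list(
  Linf=80, K=0.2, t0=-0.5, # von Bertalanffy growth
  M=0.3, # natural mortality rate
  F=0.5, Lcap=30 # fishing mortality; knife-edge selection at Lcap
)
res <- cohortSim(params)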

The above figure shows the output of Beverton and Holt's (1957) yield-per-recruit model (ypr function, a wrapper for cohortSim), based on variable fishing mortality (F) and length at first capture (Lcap, knife-edge selection). Length at maturity is shown as a dashed white line for reference. In this example, the maximum yield (Ymax) is defined as the maximum possible yield without depletion of the spawning biomass below 50% of its virgin, unfished level. This simple cohort model has many additional outputs that can be helpful in visualizing processes of growth and mortality:

I hope to continue to document different models used in fisheries science within the package. Any suggestions or comments are welcome.

Beverton, R. J. H., Holt, S. J., 1957. On the Dynamics of Exploited Fish Populations. Fishery Investigations Series II, Vol. XIX. Ministry of Agriculture, Fisheries and Food.

Script to reproduce the examples

Friday, December 5, 2014

Data point locator function

Here's a little function to select data points in an open graphical device (ptlocator()). The function scales the x and y axes to give them equal weighting and to remove the influence of differing units or ranges. It then calculates the Euclidean distance between the selected locations (using the locator() function) and the x, y coordinates of the plotted data points. Colored points are filled in for the data point with the lowest distance to each clicked location, and the result gives the vector positions of the closest x, y data points.

[NOTE: I just realized that the identify function is very similar in its usage]

The function:
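A minimal sketch of the approach (the sinkr version may differ in its details):

ptlocator <- function(n=1, x, y, col=rgb(1, 0, 0, 0.5), ...){
  # scale x and y to [0,1] so both axes are weighted equally
  xs <- (x - min(x)) / diff(range(x))
  ys <- (y - min(y)) / diff(range(y))
  pos <- numeric(n)
  for(i in seq(n)){
    pt <- locator(1) # click in the open device
    pt.x <- (pt$x - min(x)) / diff(range(x))
    pt.y <- (pt$y - min(y)) / diff(range(y))
    d <- sqrt((xs - pt.x)^2 + (ys - pt.y)^2) # Euclidean distance
    pos[i] <- which.min(d)
    points(x[pos[i]], y[pos[i]], pch=19, col=col, ...)
  }
  pos # positions of the closest data points
}
# Usage: plot(x, y); ptlocator(n=3, x=x, y=y)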

Tuesday, September 30, 2014

Additional tips for structuring an individual-based model in R

I had a reader ask me recently to help understand how to modify the code of an individual-based model (IBM) that I posted a while back. It was my first attempt at an IBM in R, and I realized that I have made some significant changes to the way that I code such models nowadays. Most of the changes are structural, but seem to help a lot in clearly documenting the model and its underlying processes.
Basically, I follow a structure that a friend (a very experienced modeller specializing in IBMs) taught me. Granted, R isn't the best language for such models, but, depending on your computational needs, they can be quite easy to implement if you already have experience in basic R programming. The idea is to code the important processes of the IBM as functions, which accept an object of individuals as the main argument. The functions can be as simple or elaborate as needed, but the key is that when you finally set up your simulation, you only need to call these functions again. This results in easily legible code, where you are less likely to get lost in the model and can concentrate more on the processes to be performed during each iteration (e.g. growth, reproduction, mortality):

inds <- grow.inds(inds)
inds <- reproduce.inds(inds)
inds <- kill.inds(inds)

Since the previous post, I have moved towards using data.frame objects to store individuals and their attributes, since I find it easier to apply functions for extracting summary statistics (e.g. a histogram of the age distribution in the final population):

The example shown here is a modification of the genetic drift example that I showed before, only this time I have included 4 color phenotypes. The model runs until either one phenotype dominates the population or a maximum number of iterations is reached. Reproduction allows individuals to have more than one offspring per time step, with counts skewed towards zero. Death is modeled by a constant instantaneous mortality rate. I have used such a setup with more complicated models of fish genetics and found the performance to be quite fast, even with population sizes of 300,000 individuals. The key is to maintain as much vectorization in the functions as possible.
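A minimal sketch of this structure (parameter values and function bodies are simplified placeholders, not the code of the original example):

make.inds <- function(n){
  data.frame(age=0, pheno=sample(4, size=n, replace=TRUE))
}
grow.inds <- function(inds){
  inds$age <- inds$age + 1 # increment age
  inds
}
reproduce.inds <- function(inds, lambda=0.25){
  noff <- rpois(nrow(inds), lambda) # offspring counts, skewed towards zero
  if(sum(noff) > 0){
    inds <- rbind(inds, data.frame(age=0, pheno=rep(inds$pheno, noff)))
  }
  inds
}
kill.inds <- function(inds, M=0.2){
  inds[runif(nrow(inds)) < exp(-M), ] # constant instantaneous mortality rate
}

set.seed(1)
inds <- make.inds(100)
for(t in seq(500)){
  inds <- grow.inds(inds)
  inds <- reproduce.inds(inds)
  inds <- kill.inds(inds)
  if(length(unique(inds$pheno)) == 1) break # one phenotype dominates
}
hist(inds$age, main="Age distribution of final population")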

To reproduce the example:

Wednesday, September 17, 2014

Maximal Information Coefficient (Part II)

A while back, I wrote a post simply announcing a recent paper that described a new statistic called the "Maximal Information Coefficient" (MIC), which is able to describe the correlation between paired variables regardless of whether the relationship is linear or nonlinear. This turned out to be quite a popular post, and included a lively discussion as to the merits of the work and difficulties in using the software provided by the authors. Regarding the latter, I also had difficulties running the software in R and thus did not include an example. Checking back on this topic, I was pleased to see that an R package had subsequently been developed: minerva: Maximal Information-Based Nonparametric Exploration R package for Variable Analysis (Albanese et al. 2013). Further documentation of the package can be found here.

I tried out the package on the baseball data set used in the original paper by Reshef et al. (2011), where a suite of variables is correlated against a baseball player's salary. The authors state in their paper:
"In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of p, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. By contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary."
Here is a summary of the results computed with the function mine() of the minerva package (top 10 ranking MIC coefficients), which reproduces the results shown in Supplementary Table S12 of the original paper:
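A minimal sketch of the computation (the data frame mlb and its salary column are stand-ins for the baseball data set, which ships with the original MINE distribution):

library(minerva)
# MIC of every variable against salary, plus Pearson correlation for comparison
res <- mine(x=mlb, y=mlb$salary)
comp <- data.frame(MIC=res$MIC[,1], Pearson=cor(mlb, mlb$salary)[,1])
head(comp[order(comp$MIC, decreasing=TRUE), ], 10) # top 10 MIC coefficients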

For a visual representation of these results, the top figure plots MIC vs. Pearson and MIC Rank vs. Pearson Rank. Thanks to minerva author and maintainer M. Filosi for helping in reproducing the example.


Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., & Furlanello, C. (2013). minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3), 407-408. [link]

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., ... & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518-1524. [link]

Code to reproduce the example: