Tuesday, May 12, 2015

Updated getcolors function

There is an updated version of the getcolors function available in the sinkr package, called getcolors2. The previous version allowed for the visual selection of colors from a palette of 216 colors. This updated version uses a more natural color palette with a rainbow gradient on the x-axis and lightness on the y-axis. Following a series of color selections via clicks, the function returns the vector of chosen colors that can later be used in a plot:
Example (from getcolors2 help):
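A rough sketch of comparable usage (assuming getcolors2() takes the number of colors to select as its argument; the example bundled with the package help may differ):

library(sinkr)  # https://github.com/marchtaylor/sinkr

n <- 4
cols <- getcolors2(n)  # opens the palette device; click n times to choose colors

# hypothetical data plotted with the chosen colors
set.seed(1)
y <- matrix(cumsum(rnorm(100 * n)), ncol = n)
matplot(y, type = "l", lty = 1, lwd = 3, col = cols)
legend("topleft", legend = paste("series", 1:n), col = cols, lwd = 3)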

Wednesday, April 8, 2015

Map projection cheat sheet


Here's a cheat sheet of map projection settings for the mapproject function (mapproj package). I've repeated the same general procedure for each map: 1. create a blank world map, 2. project colored polygons, 3. overlay country boundaries. Some individual projections needed extra specifications (e.g. a different orientation, lat/lon limits, or additional parameters). Most projections work well as long as the limits and orientation are valid; in a few cases ('square' and 'hex') the country boundaries were wrapped incorrectly. I was unable to get the two spheroid projections working due to the escape character in their names ('sp\_mercator', 'sp\_albers') (maybe someone has a suggestion here?).
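As a rough illustration of the three-step procedure, here is a minimal sketch using the Mollweide projection and three arbitrary example points (the full cheat-sheet script below differs in detail):

library(maps)     # world polygons
library(mapproj)  # mapproject()

# 1. blank world map, drawn directly in the chosen projection
map("world", projection = "mollweide", orientation = c(90, 0, 0),
    col = "grey90", fill = TRUE)

# 2. project additional lon/lat data; with no projection argument,
#    mapproject() reuses the projection set up by the last map() call
pts <- mapproject(x = c(-58.4, 13.4, 151.2), y = c(-34.6, 52.5, -33.9))
points(pts, pch = 19, col = "red")

# 3. overlay country boundaries in the same projection
map("world", projection = "mollweide", orientation = c(90, 0, 0), add = TRUE)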

In addition to the maps and mapproj packages, the sinkr package is also used for the calculation of global distances, value-to-color conversion, and polygon creation.


Script to reproduce (high resolution png and pdf versions):

Thursday, April 2, 2015

Add text using relative plot region coordinates



Here is a simple but useful function for adding text to a plot device using relative x, y coordinates (i.e. between 0 and 1). I found myself programming these few lines repeatedly and decided to finally bundle them into a function, called reltext(). The function can be found below, and is also available in the sinkr package (https://github.com/marchtaylor/sinkr).
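A minimal sketch of such a function (the sinkr version may differ in details, e.g. in how it handles log axes):

reltext <- function(x, y, labels, ...) {
  usr <- par("usr")                        # plot region limits (x1, x2, y1, y2)
  xpos <- usr[1] + x * (usr[2] - usr[1])   # relative -> user coordinates
  ypos <- usr[3] + y * (usr[4] - usr[3])
  if (par("xlog")) xpos <- 10^xpos         # back-transform for log axes
  if (par("ylog")) ypos <- 10^ypos
  text(xpos, ypos, labels, ...)
}

# usage
plot(rnorm(100), rnorm(100))
reltext(0.05, 0.95, labels = "upper-left label", adj = c(0, 1), col = "red")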

Exploring fishing harvest feedback policies in R using vector optimization





Walters and Martell (2004, Section 3.2) present an interesting example of harvest feedback policy exploration using basic spreadsheet techniques and optimization procedures (e.g. Solver in Excel, or Optimizer in Quattro Pro). For a given species, parameters describing growth, mortality, and reproduction/recruitment are used to model the population's dynamics. Given a random series of environmental effects on the recruitment parameters (survival and carrying capacity), the optimization procedure can be used to determine the best series of fishing mortality (F) values under a given objective function (e.g. maximizing the sum of harvest yield, log-transformed yield, or discounted yield):

In this example, for a hypothetical population of tilapia, yearly recruitment survival varies by ca. +/-30%, while recruitment carrying capacity has two periods of poor recruitment (-50%, e.g. due to a regime shift) during simulation years 20-35 and 60-75, as indicated by the shaded areas. The optimal F series are similar for maximizing yield or discounted yield (discount rate = 5%), likely because tilapia is a fast-growing, high-fecundity species and can rebound quickly following intensive fishing pressure. Maximizing log-transformed yield is the "risk-averse" strategy, whereby peaks in harvest are less highly valued and fluctuations in fishing pressure (and yield) are dampened:

Interesting patterns emerge, e.g. one should increase fishing pressure at the beginning of a "bad" recruitment regime (grey regions) and reduce fishing at the beginning of a "good" recruitment regime. In both cases, this speeds the trajectory of the stock to its new equilibrium and takes advantage of soon-to-be lost or gained harvestable biomass. In other words, fish hard before bad times so as not to "waste" a large spawning stock, and ease off fishing before good times so as to make full use of the spawning potential. Walters and Martell (2004) state,
These anticipatory behaviors are a bizarre reversal of our usual prescriptions about how to hedge against uncertain future changes in productivity and would almost certainly never be acceptable as management recommendations in practical decision settings.
The figure at the top of the post shows the relationship between biomass and yield under the optimal scenarios. The relationship is most strongly linear for the log(Yt) optimization, but a relationship is evident in all three scenarios. One could imagine using the coefficients of this regression as a general harvesting strategy: the x-intercept would be the minimum biomass before harvesting is allowed, and the slope gives the exploitation rate (E = harvest/biomass). One would need to program this strategy to compare it properly with the optimal harvest values, but even the simplified strategy of using a constant exploitation rate from the regression (e.g. E = 0.34, 0.23, 0.37) yields 87, 77, and 90% of the optimal solution for sum(Yt), sum(log(Yt)), and sum(discounted Yt), respectively. These levels of performance (and higher) are typical for most examples explored by the approach (Walters and Parma, 1996).
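As a hypothetical illustration of turning such a regression into a harvest rule, the sketch below uses made-up placeholder vectors for the yearly biomass (Bt) and yield (Yt) rather than the actual optimization output:

set.seed(1)
Bt <- runif(50, 200, 1000)                           # placeholder biomass series
Yt <- pmax(0, 0.3 * (Bt - 250) + rnorm(50, sd = 20)) # placeholder yield series

fit  <- lm(Yt ~ Bt)
Bmin <- -coef(fit)[1] / coef(fit)[2]  # x-intercept: minimum biomass before harvest
E    <- coef(fit)[2]                  # slope: exploitation rate (harvest / biomass)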

Finally, the authors also demonstrate the situation where all sizes are vulnerable to fishing (as opposed to a knife-edge selection of individuals above a given minimum size, as in the above example). Here the sum(Yt) optimization results in a "pulse-fishing" strategy whereby the stock is fished hard (usually E > 0.5) for 1-2 years, followed by 1-2 years of recovery where no fishing is allowed. This pattern results from "growth overfishing", whereby unselective harvesting wastes a large part of the biomass that is still growing rapidly:


The examples use the stockSim() and optim.stockSim() functions of the R package fishdynr. Instructions for direct installation from GitHub can be found here: https://github.com/marchtaylor/fishdynr. Vector optimization of the F series is computed with the optim() function (stats package).
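A self-contained sketch of the general idea follows; this is not the fishdynr code - a toy logistic (surplus-production) model and made-up parameter values are used, and optim() works on log(F) to keep the mortality rates positive:

# toy simulator: yearly yield from a logistic surplus-production stock model
simYield <- function(Fvec, B0 = 1000, r = 0.5, K = 1000) {
  B <- B0
  Y <- numeric(length(Fvec))
  for (t in seq_along(Fvec)) {
    Y[t] <- B * (1 - exp(-Fvec[t]))                 # yield taken in year t
    B <- max(B - Y[t] + r * B * (1 - B / K), 1e-6)  # surplus production
  }
  Y
}

nYears <- 50
res <- optim(
  par = rep(log(0.2), nYears),                     # initial guess, log(F) scale
  fn  = function(logF) -sum(simYield(exp(logF))),  # minimize negative total yield
  method = "BFGS", control = list(maxit = 500)
)
Fopt <- exp(res$par)  # optimized F series for the toy model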

References:
- Walters, C. J., Martell, S. J., 2004. Fisheries ecology and management. Princeton University Press.
- Walters, C., Parma, A. M., 1996. Fixed exploitation rate strategies for coping with effects of climate change. Canadian Journal of Fisheries and Aquatic Sciences, 53(1), 148-158.

Script to reproduce the example:

Sunday, February 1, 2015

R package "fishdynr"


The fishdynr package allows for the construction of some basic population dynamics models commonly used in fisheries science. Included are a single-cohort model, cohortSim, and a more complex iterative model that incorporates a stock-recruitment relationship, stockSim. The model functions take a list of parameters as their main argument, which contains information about the given population's dynamics (growth, mortality, recruitment, etc.) and fishery (e.g. selectivity function and related parameters). This allows for a great deal of flexibility in adapting the analyses to a particular species or fishery by defining the functions for growth, mortality, fishing selectivity, recruitment, etc. outside the analysis. The package (located on GitHub) can be easily installed via the install_github function of the devtools package:

library(devtools)
install_github("fishdynr", "marchtaylor")


The above figure shows the output of Beverton and Holt's (1957) yield-per-recruit model (ypr function - a wrapper for cohortSim), based on variable fishing mortality (F) and length at first capture (Lcap, knife-edge selection). Length at maturity is shown as a dashed white line for reference. In this example, the maximum yield (Ymax) is defined as the maximum possible yield without depleting the spawning biomass below 50% of its virgin, unfished level. This simple cohort model has many additional outputs that can be helpful in visualizing processes of growth and mortality:

I hope to continue to document different models used in fisheries science within the package. Any suggestions or comments are welcome.

References
Beverton, R. J. H.; Holt, S. J. (1957), On the Dynamics of Exploited Fish Populations, Fishery Investigations Series II Volume XIX, Ministry of Agriculture, Fisheries and Food

Script to reproduce the examples

Friday, December 5, 2014

Data point locator function


Here's a little function to select data points in an open graphical device (ptlocator()). The function scales the x and y axes to give them equal weighting and remove the influence of differing units or ranges. It then calculates the Euclidean distance between the selected locations (using the locator() function) and the x, y coordinates of the plotted data points. A colored point is drawn over the data point closest to each clicked location, and the function returns the vector positions (indices) of those closest points.

[NOTE: I just realized that the identify function is very similar in its usage]

The function:
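A minimal sketch of how such a function might be written (the sinkr version may differ in details):

ptlocator <- function(n = 1, x, y, col = rgb(1, 0, 0, 0.5), ...) {
  # scale x and y to 0-1 so both axes are weighted equally
  xsc <- (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
  ysc <- (y - min(y, na.rm = TRUE)) / diff(range(y, na.rm = TRUE))
  pos <- numeric(n)
  for (i in seq(n)) {
    pt  <- locator(1)  # clicked location
    ptx <- (pt$x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
    pty <- (pt$y - min(y, na.rm = TRUE)) / diff(range(y, na.rm = TRUE))
    pos[i] <- which.min(sqrt((xsc - ptx)^2 + (ysc - pty)^2))  # closest data point
    points(x[pos[i]], y[pos[i]], pch = 19, col = col, ...)    # mark chosen point
  }
  pos
}

# usage: plot(x, y); ptlocator(3, x, y)  # then click three times in the device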

Tuesday, September 30, 2014

Additional tips for structuring an individual-based model in R

I had a reader ask me recently to help understand how to modify the code of an individual-based model (IBM) that I posted a while back. It was my first attempt at an IBM in R, and I realized that I have since made some significant changes to the way I code such models. Most of the changes are structural, but they help a lot in clearly documenting the model and its underlying processes.
Basically, I follow a structure that a friend (a very experienced modeller specializing in IBMs) taught me. Granted, R isn't the best language for such models but, depending on your computational needs, it can be quite easy to implement one if you already have experience with basic R programming. The idea is to code the important processes of the IBM as functions, which accept an object of individuals as the main argument. The functions can be as simple or elaborate as needed, but the key is that when you finally set up your simulation, you only need to call these functions. This results in easily legible code, where you are less likely to get lost in the model and can concentrate more on the processes to be performed during each iteration (e.g. growth, reproduction, mortality):

inds <- grow.inds(inds)
inds <- reproduce.inds(inds)
inds <- kill.inds(inds)


Since the previous post, I have moved towards using data.frame objects to store individuals and their attributes, because I find it easier to apply functions for extracting summary statistics (e.g. a histogram of the age distribution in the final population):


The example shown here is a modification of the genetic drift example that I showed before - only this time I have included 4 color phenotypes. The model runs until either one phenotype dominates the population or a maximum number of iterations is reached. Reproduction allows individuals to have more than one offspring per time step, but the number of offspring is skewed towards zero. Death is modeled by a constant instantaneous mortality rate. I have used such a setup with more complicated models of fish genetics and found the performance to be quite fast, even with population sizes of 300,000 individuals. The key is to maintain as much vectorization in the functions as possible.
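A stripped-down sketch of this setup (not the full script linked below; the function names, parameter values, and Poisson reproduction used here are illustrative assumptions):

make.inds <- function(n, phenos = c("red", "blue", "green", "orange")) {
  data.frame(pheno = sample(phenos, n, replace = TRUE), age = 0,
             stringsAsFactors = FALSE)
}
grow.inds <- function(inds) { inds$age <- inds$age + 1; inds }
reproduce.inds <- function(inds, lambda = 0.2) {
  noff <- rpois(nrow(inds), lambda)  # offspring counts, skewed towards zero
  if (sum(noff) == 0) return(inds)
  rbind(inds, data.frame(pheno = rep(inds$pheno, noff), age = 0,
                         stringsAsFactors = FALSE))
}
kill.inds <- function(inds, M = 0.2) {
  inds[runif(nrow(inds)) < exp(-M), , drop = FALSE]  # constant mortality rate M
}

set.seed(1)
inds <- make.inds(100)
for (t in seq(200)) {
  inds <- grow.inds(inds)
  inds <- reproduce.inds(inds)
  inds <- kill.inds(inds)
  if (nrow(inds) == 0 || length(unique(inds$pheno)) == 1) break  # stopping rules
}
table(inds$pheno)  # summary of phenotypes in the final population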

To reproduce the example:

Wednesday, September 17, 2014

Maximal Information Coefficient (Part II)



A while back, I wrote a post simply announcing a recent paper that described a new statistic called the "Maximal Information Coefficient" (MIC), which is able to describe the correlation between paired variables regardless of whether the relationship is linear or nonlinear. This turned out to be quite a popular post, and it included a lively discussion as to the merits of the work and the difficulties in using the software provided by the authors. Regarding the latter, I also had difficulties running the software from R and thus did not include an example. Checking back on this topic, I was pleased to see that an R package has since been developed: minerva: Maximal Information-Based Nonparametric Exploration R package for Variable Analysis (Albanese et al. 2013). Further documentation of the package can be found here: http://minepy.sourceforge.net/
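A rough sketch of the basic workflow (the baseball data are not bundled with the package, so a small made-up data set is used here):

library(minerva)

set.seed(1)
n <- 200
salary <- rnorm(n)
hits   <- salary^2 + rnorm(n, sd = 0.3)  # nonlinear association with salary
walks  <- 0.5 * salary + rnorm(n)        # linear association with salary
x <- cbind(salary, hits, walks)

res <- mine(x)    # pairwise MINE statistics for all column pairs
res$MIC[1, ]      # MIC of each variable with salary (column 1)
cor(x)[1, ]       # Pearson correlation, for comparison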

I tried out the package on the baseball data set used in the original paper by Reshef et al. (2011), where a suite of variables is correlated against a baseball player's salary. The authors state in their paper:
"In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of p, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. By contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary."
Here is a summary of the results computed with the mine() function of the minerva package (top 10 ranking MIC coefficients), which reproduces the results shown in Supplementary table S12 of the original paper:


For a visual representation of these results, the top figure plots MIC vs. Pearson and MIC Rank vs. Pearson Rank. Thanks to minerva author and maintainer M. Filosi for helping in reproducing the example.


References:

Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., & Furlanello, C. (2013). minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3), 407-408. [link]

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., ... & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518-1524. [link]


Code to reproduce the example:

Monday, September 15, 2014

PCA / EOF for data with missing values - a comparison of accuracy

Not all Principal Component Analysis (PCA) (also called Empirical Orthogonal Function analysis, EOF) approaches are equal when it comes to dealing with a data field that contains missing values (i.e. is "gappy"). The following post compares several methods by assessing the accuracy of the derived PCs in reconstructing the "true" data set, as was similarly done by Taylor et al. (2013).

The gappy EOF methods to be compared are:
  1. LSEOF - "Least-Squares Empirical Orthogonal Functions" - The traditional approach, which weights the covariance matrix used for the EOF decomposition by the number of paired observations, and further scales the projected PCs by these same weightings (see Björnsson and Venegas 1997, von Storch and Zwiers 1999 for details).
  2. RSEOF - "Recursively Subtracted Empirical Orthogonal Functions" - This approach modifies LSEOF by recursively solving for the leading EOF, whose reconstructed field is then subtracted from the original field. The recursive subtraction continues until a given stopping point (e.g. number of EOFs, % remaining variance, etc.) (see Taylor et al. 2013 for details).
  3. DINEOF - "Data Interpolating Empirical Orthogonal Functions" - This approach gradually solves for EOFs by means of an iterative algorithm that fits EOFs to a given set of non-missing reference points (a small percentage of the observations) via RMSE minimization (see Beckers and Rixen 2003 for details).
I have introduced both the LSEOF [link] and DINEOF [link] methods in the past, but have never directly compared them for the blog. The purpose of this post is to make this comparison and to also introduce a more general EOF function that is capable of conducting RSEOF. All analyses can be reproduced following installation of the "sinkr" package: https://github.com/marchtaylor/sinkr
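A rough sketch of how the comparison might be set up follows. It assumes that eof() and dineof() are the relevant sinkr functions, that eof() switches to RSEOF via a recursive argument, and that dineof() returns the interpolated field as element Xa; check the package help pages, as the names may differ:

library(sinkr)  # https://github.com/marchtaylor/sinkr

set.seed(1)
Xtrue <- outer(sin(seq(0, 4 * pi, length.out = 100)),
               cos(seq(0, 2 * pi, length.out = 40)))  # synthetic "true" field
Xobs  <- Xtrue + rnorm(length(Xtrue), sd = 0.1)       # add noise
Xobs[sample(length(Xobs), 0.3 * length(Xobs))] <- NA  # make the field gappy

Els  <- eof(Xobs)                    # LSEOF: least-squares EOF on the gappy field
Ers  <- eof(Xobs, recursive = TRUE)  # RSEOF: recursively subtracted EOFs
din  <- dineof(Xobs)                 # DINEOF: iterative interpolation of the gaps
Edin <- eof(din$Xa)                  # EOF of the interpolated (full) field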

The basic problem comes down to the difficulty of decomposing a matrix that is not "positive-definite", i.e. the covariance matrix estimated from a gappy data set. DINEOF avoids this issue entirely by first interpolating the missing values to create a full data field, while LSEOF and RSEOF rely on decomposing the estimated matrix directly. A known problem is that the trailing EOFs derived from such a matrix have amplified singular values, which can consequently amplify errors in field reconstructions when they are included. The RSEOF approach thus attempts to remedy these issues by recursively solving for only the leading EOFs. In the following examples, I show the performance of the three approaches in terms of reconstructing the data field (including the "true" values).

Example 1 - Synthetic data set:
The first example uses the synthetic data set used by Beckers and Rixen (2003) in their introduction of the DINEOF approach. The accuracy of the reconstruction depends on the number of EOFs used. In a non-gappy example, a perfect reconstruction should be possible using the full set of EOFs - in fact, it only takes 9 EOFs when using the non-noisy true field, since it is a composite of 9 signals. In the case of the noisy, gappy data sets, reconstructions with trailing EOFs may increase errors. This can be seen in the figure at the top of the post showing RMSE vs. the number of EOFs used in the reconstruction.

The figure shows the DINEOF approach to be the most accurate. The LSEOF approach has a clear RMSE minimum with 4 EOFs, while the RSEOF approach was largely able to remedy the amplification of error when using trailing EOFs. The problem of error amplification is even more dramatic when viewed visually, as in the following, where the full set of EOFs has been used:

Tuesday, September 2, 2014

"sinkr" - a collection of functions featured on "me nugget"

The R package sinkr (version 1.0) has now been released:  https://github.com/marchtaylor/sinkr

I have finally gotten around to learning how to create an R package and decided to start by bundling functions that I have featured on the blog. Thanks to the RStudio team for making this so easy (in combination with the R packages roxygen2 and devtools). In addition to the great tips on the RStudio website, I found the following YouTube videos helpful along the way:
Being new to the world of R packages (and due to the eclectic nature of the functions in sinkr), I'm not yet confident enough to upload it to CRAN. But one can easily install the package using devtools and the following code (see https://github.com/hadley/devtools for OS-specific package updating tips):

library(devtools)

install_github("marchtaylor/sinkr")

For those using functions that were featured on this site in the past, the present versions may differ slightly, especially in the case of function names. For example, names like plot.stacked were being treated as methods of the generic plot function during the package build, so the "." characters have been removed from all function names (e.g. plotStacked).

sinkr functions include: