Tuesday, September 30, 2014

Additional tips for structuring an individual-based model in R

I had a reader ask me recently to help understand how to modify the code of an individual-based model (IBM) that I posted a while back. It was my first attempt at an IBM in R, and I realized that I have made some significant changes to the way that I code such models nowadays. Most of the changes are structural, but seem to help a lot in clearly documenting the model and its underlying processes.
Basically, I follow a structure that a friend taught me (a very experienced modeller, specializing in IBMs). Granted R isn't the best language for such models but, depending on your computational needs, it can also be quite easy to implement if you already have experience in basic R programming. The idea is to code important processes of the IBM as functions, which accept an object of individuals as the main argument. The functions can be as simple or elaborate as needed, but the key is that when you finally set up your simulation, you only need to call these functions again. This results in easily legible code, where you are less likely to get lost in the model and can concentrate more on the processes to be performed during each iteration (e.g. growth, reproduction, mortality):

inds <- grow.inds(inds)
inds <- reproduce.inds(inds)
inds <- kill.inds(inds)


Since the previous post, I have gone towards using data.frame objects to store individuals and their attributes since I find it easier to apply functions for extracting summary statistics (e.g. histogram of age distribution in the final population):


The example shown here is a modification of the genetic drift example that I showed before - only this time I have included 4 color phenotypes. The model runs until either one phenotype dominates the population or a maximum number of iterations is reached. Reproduction allows for a individuals to have more than one offspring per time step, but skewed towards zero. Death is modeled by a constant instantaneous mortality rate. I have used such a setup with more complicated models of fish genetics and found the performance to be quite fast, even with population sizes of 300,000 individuals. The key is to maintain as much vectorization in the functions as possible.

To reproduce the example:

######################################
### Population dynamics parameters ###
######################################
 
Colors <- c("blue", "green", "orange", "red") # population phenotypes
tmax <- 2000 # maximum time duration of simulation
n.initial <- 50*length(Colors) # inital population size
CC <- 500 # carrying capacity of offspring
M <- 0.5 # natural mortality
 
 
#########################
### Process functions ###
#########################
 
# Creation
make.inds <- function(id=NaN, color=NaN, age=NaN){
  inds <- data.frame(id = id, color = color, age = 0) # give then starting attributes (id, color, age)
  inds
}
 
# Growth
grow.inds <- function(inds){
  inds$age <- inds$age + 1 # advance age by one
  inds
}
 
# Reproduction (breeding, inheritance)
reproduce.inds <- function(inds){
  n.eggs <- rlnorm(nrow(inds), mean=0.01, sd=0.7) # how many offspring per adult
  n.eggs <- round(n.eggs * (1 - sum(n.eggs)/CC)) # mediate by carrying capacity
  if(sum(n.eggs) > 0){
    motherRow <- rep(seq(nrow(inds)), times=n.eggs) # mother's row
    offspring <- make.inds(id=seq(max(inds$id)+1, length.out=sum(n.eggs))) # make offspring
    offspring$color <- inds$color[motherRow] # inherit color
    inds <- rbind(inds, offspring) # add new offspring to adults
  }
  inds
}
 
# Mortality
kill.inds <- function(inds){
  pDeath <- 1 - exp(-M) # prob. of death
  kill <- which(runif(nrow(inds)) < pDeath) # kill these ones
  inds <- inds[-kill,]
  inds
}
 
 
###################
### Simulation ####
###################
 
# Initial population
inds <- make.inds(id=1:n.initial, color=as.factor(array(Colors, dim=n.initial)))
head(inds, 10)
 
# Object for storing results (population size in time)
N <- array(NaN, dim=c(tmax, length(Colors)))
colnames(N) <- Colors
N[1,] <- summary(inds$color)
 
# Run 
set.seed(1) # optional starting seed
for(t in 2:tmax){ 
  # population processes
  inds <- grow.inds(inds)
  inds <- reproduce.inds(inds)
  inds <- kill.inds(inds)
 
  # store results
  N[t,] <- summary(inds$color)
  print(paste("time =", t, "; n = ", sum(N[t,])))
 
  # break when one phenotype dominates
  if(sum(N[t,] > 0) == 1) {break}
}
N <- N[1:t,] # remove excess rows of results object
 
 
##############
#### Plots ###
##############
# Population sizes
png("pop_size.png", width=5, height=5, res=300, units="in", type="cairo")
op <- par(mar=c(5,5,1,1))
ylim <- 1.2 * range(N)
Time <- seq(nrow(N))
plot(Time, N[,1], t="n", lwd=2, col=4, ylim=ylim, ylab="Population size")
for(i in seq(Colors)){
  lines(Time, N[,i], col=Colors[i], lwd=2)
}
legend("topleft", legend=paste(Colors, "pop."), lwd=2, col=Colors, bty="n")
par(op)
dev.off()
 
# Age distribution of final population
png("age_dist.png", width=5, height=5, res=300, units="in", type="cairo")
op <- par(mar=c(5,5,2,1))
hist(inds$age, freq=FALSE, col=8, xlab="Age [years]", main="Age distribution")
par(op)
dev.off()
Created by Pretty R at inside-R.org



4 comments:

  1. Dear Marc,

    I'm currently facing a very similar situation. One approach towards storing individuals' information I've considered (but haven't formally explored yet) is to use package data.table for data.frame-like operations, but at much higher speeds.

    Have you given data.table a try?

    Nice work, btw! Cheers

    ReplyDelete
    Replies
    1. Hi Ricardo - thanks for your comment! I have indeed tried data.table instead of data.frame recently. In my case there were only modest increases in speed (if any). If, however, you were able to take advantage of data.table's indexing, then this would be a great speed-up I believe.
      Cheers

      Delete
    2. Hi, Marc. Thanks for the additional information. I'm experimenting with data.table in a similar context right now. I'm planning on applying some of the summarizing functionalities of data.frame to output intermediate simulation results while it's still running. I'll come back here to report about the performance.

      Delete
  2. This comment has been removed by the author.

    ReplyDelete