Monday, July 20, 2015

The rise (and fall?) of R-bloggers - a web-scraping analysis of posting trends


I recently came across a great new package for web scraping called rvest by Hadley Wickham. This nice little post got me started: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/. The package is remarkably easy to use, especially when combined with the CSS selector tool SelectorGadget (http://selectorgadget.com/), which helps you identify the elements of a webpage to scrape.
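For a feel of the workflow, here is a minimal sketch using the same URL and the ".date" selector (found with SelectorGadget) that appear in the full code at the end of this post; html(), html_nodes(), and html_text() are the rvest functions current at the time of writing:

library(rvest)

pg  <- html("http://www.r-bloggers.com/page/1/")  # parse the first archive page
tmp <- html_nodes(pg, ".date")                    # nodes matched by the CSS selector
head(html_text(tmp))                              # post dates as text, e.g. "July 20, 2015"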

I decided to have a go at extracting some stats on R-bloggers postings (e.g. dates, authors). Aggregating by date results in the plot above. You may notice a prominent outlier on 2011-05-26, which was due to repeat postings of the same entry (a quick check of this is included in the code below). I also checked whether there is a trend in the number of postings by day of the week:

 
As one might expect, there is a lull in postings on the weekend. So, if you want longer exposure on the front page of R-bloggers, you might want to post later in the week to lessen the competition!

To further explore the trends, I fit a GAMM (family=poisson) with a cyclic spline term for day of the week ("wday") and another spline term for time (i.e. the numeric conversion of the date), with help from a great post on using GAMs with temporal data:
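The key model call, excerpted from the full code at the end of the post, looks like this; bs="cc" makes the weekday smooth cyclic (so Saturday wraps around to Sunday), and the knots place the cycle endpoints half a day outside days 1 and 7:

library(mgcv)

fit0 <- gamm(npost ~ s(wday, bs = "cc", k = 7) + s(time),
             data = daydf, family = poisson,
             knots = list(wday = c(0.5, 7.5)))
summary(fit0$gam)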

Both terms are significant, and the fitted model makes the long-term trend easier to see. Here is a plot of the model prediction:

The day-of-the-week oscillations are strong enough that the predicted values look like a band, so for easier viewing I added a line (in red) for a model that contains only the long-term trend. Zooming in on a smaller subset of the data reveals the weekly oscillations:

The long-term trend indicates that a decline in the number of posts started in 2013 and continues to the present. Binning the data by quarter shows the same pattern:

This decline seems to be related to the number of authors posting regularly, which is declining at a similar rate. The correlation is strong enough that we can estimate an average of 3.7 postings / author / quarter (or roughly 1 posting / author / month). So, what's going on? My hunch is that, while the use of R (and probably the number of R blogs) continues to grow (link1, link2), R-bloggers does not seem to be accepting all blogs that apply to have their feeds aggregated (and may even be actively removing some; e.g. this blog has unfortunately been dropped without explanation). Maybe the time invested is simply maxed out at ~7 posts / day?

All in all, I was pleasantly surprised at how easy web scraping can be. The code to reproduce this exercise can be found below (be warned that the scraping loop took about 30 min to gather data from 620 webpages).

To reproduce the example (required packages: rvest (CRAN), sinkr (GitHub)):
# library(devtools)
# install_github("marchtaylor/sinkr")
library(rvest)  # web scraping
library(sinkr)  # provides reltext(), used in the quarterly plot below
 
page1_path <- "http://www.r-bloggers.com/page/1/"
tmp <- html(page1_path)
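# the ".last" node is the final link in the pagination bar; its text gives the total number of pages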
tmp <- html_node(tmp, ".last")
npages <- as.numeric(html_text(tmp))
npages
#npages <- 10  # uncomment to test on just the first 10 pages
dates <- vector(mode="list", npages)
authors <- vector(mode="list", npages)
for(i in 1:npages){
  pagei_path <- gsub("page/1/", paste0("page/",i, "/"), page1_path)
  HTML <- html(pagei_path)
 
  # dates
  tmp <- html_nodes(HTML, ".date")
  tmp <- html_text(tmp)
  tmp <- as.POSIXct(strptime(tmp, tz="GMT", format="%B %d, %Y"))  # "%B" assumes an English locale
  dates[[i]] <- tmp
 
  # authors
  tmp <- html_nodes(HTML, ".meta")
  tmp <- html_text(tmp)
  tmp <- substring(tmp, regexpr("By ", tmp)+3)  # keep the text following "By "
  authors[[i]] <- tmp
 
  print(paste(i, "of", npages, "finished"))
}
 
#save(dates, file="dates.Rdata")
#save(authors, file="authors.Rdata")
 
#load("dates.Rdata"); load("authors.Rdata")
 
 
### By day
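# note: unlist() drops the POSIXct class (leaving seconds since 1970-01-01 GMT),
# so the values are converted back to dates when building the data frame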
df <- data.frame(
  date=as.Date(as.POSIXct(unlist(dates), origin="1970-01-01", tz="GMT")),
  author=unlist(authors),
  npost=1
)
head(df)
 
daydf <- aggregate(npost ~ date, df, sum)
daydf$time <- as.numeric(daydf$date)           # numeric date for the long-term smooth
daydf$wday <- as.POSIXlt(daydf$date)$wday + 1  # day of week: 1 = Sunday, ..., 7 = Saturday
head(daydf)
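 
# quick check of the prominent outlier mentioned above (2011-05-26),
# which was caused by repeat postings of the same entry:
daydf[which.max(daydf$npost), ]             # the single day with the most posts
subset(df, date == as.Date("2011-05-26"))   # inspect the posts from that day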
 
png("rbloggers_raw_day.png", width=5, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(npost ~ date, daydf, t="p", main="R-bloggers posts per day", ylab="Number of posts", pch=1, cex=1, col=rgb(0,0,0,0.3))
par(op)
dev.off()
 
png("rbloggers_boxplot_wday.png", width=6, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(6,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
boxplot(npost ~ wday, daydf, main="R-bloggers posts per weekday", ylab="Number of posts", col=8, outline=FALSE, xaxt="n" )
axis(1, at=1:7, labels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"), las=2)
par(op)
dev.off()
 
 
 
### Model
library(mgcv)
 
# model 0: cyclic spline for day of week (wday) + smooth long-term trend in time
fit0 <- gamm(npost ~ s(wday, bs = "cc", k = 7) + s(time), data=daydf, family=poisson,
  knots=list(wday=c(0.5,7.5))
)
summary(fit0$gam)
 
png("rbloggers_gam_terms.png", width=7, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,0.5,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(fit0$gam, pages=1)
par(op)
dev.off()
 
plot(fit0$gam, select=1)
plot(fit0$gam, select=2)
daydf$pred0 <- predict(fit0$gam, type="response")
 
# time only
fit1 <- gamm(npost ~ s(time), data=daydf, family=poisson)
summary(fit1$gam)
plot(fit1$gam, pages=1)
daydf$pred1 <- predict(fit1$gam, type="response")
AIC(fit0$lme, fit1$lme)
 
 
# plot
png("rbloggers_posts_per_day.png", width=5, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(npost ~ date, daydf, t="p", ylab="Number of posts", main="R-bloggers posts per day", pch=1, cex=1, col=rgb(0,0,0,0.3), ylim=c(0,30))
lines(pred0 ~ date, daydf, col=rgb(0,0,0,0.5))
lines(pred1 ~ date, daydf, col=2, lwd=3, lty=1)
legend("topleft", legend=c("data","s(wday)+s(time)","s(time)"), 
  pch=c(1,NA,NA), lty=c(NA,1,1), lwd=c(1,1,3),
  col=c(rgb(0,0,0,0.3), rgb(0,0,0,0.5), 2)
)
par(op)
dev.off()
 
png("rbloggers_posts_per_day_short.png", width=5, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(npost ~ date, daydf[300:600,], t="p", ylab="Number of posts", main="R-bloggers posts per day", pch=1, cex=1, col=rgb(0,0,0,0.5))
lines(pred0 ~ date, daydf, col=rgb(0,0,0,1))
lines(pred1 ~ date, daydf, col=2, lwd=3, lty=1)
legend("topleft", legend=c("data","s(wday)+s(time)","s(time)"), 
  pch=c(1,NA,NA), lty=c(NA,1,1), lwd=c(1,1,3),
  col=c(rgb(0,0,0,0.5), rgb(0,0,0,1), 2)
)
par(op)
dev.off()
 
# residuals
plot(daydf$time, resid(fit0$gam), col=rgb(0,0,0,0.5))
plot(daydf$wday, resid(fit0$gam), col=rgb(0,0,0,0.5))
 
 
### Quarters
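# bin posts into calendar quarters, then count the posts and unique authors in each bin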
xint <- seq(as.Date("2005-01-01"), as.Date(Sys.time()), by="3 months")
xmid <- xint[-length(xint)]+(diff(xint)/2)
quartdf <- data.frame(date=xmid)
quartdf$nauthor <- NaN
quartdf$npost <- NaN
xbin <- cut(df$date, xint)
rownum <- match(xbin, levels(xbin))
for(i in seq(length(quartdf$date))){
  hits <- which(rownum == i)
  quartdf$nauthor[i] <- length(unique(df$author[hits]))
  quartdf$npost[i] <- sum(df$npost[hits])
}
 
png("rbloggers_quarter_stats.png", width=8, height=3, units="in", res=400, type="cairo", family="Arial")
op <- par(mar=c(4,4,3,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
par(mfcol=c(1,3))
plot(nauthor ~ date, quartdf, xlab="", ylab="Unique authors", t="o", pch=1, cex=1, col=1)
plot(npost ~ date, quartdf, xlab="", ylab="Number of posts", t="o", pch=1, cex=1, col=1)
plot(npost ~ nauthor, quartdf, t="o", xlab="Unique authors", ylab="Number of posts")
fit <- lm(npost ~ nauthor - 1, quartdf)  # regression through the origin
abline(fit, lty=2, col=4)
reltext(0, 0.95, pos=4, col=4,
   labels=bquote(
    italic(npost) == .(sprintf("%0.2f", (coef(fit)[1])))~"*"~italic(nauthor)
   )
)
reltext(0, 0.85, pos=4, col=4,
        labels=bquote( .("adj.")~bolditalic(R)^2 == .(round(summary(fit)$adj.r.squared,2)) )
)
mtext("R-bloggers posts by quarter", font=2, cex=1, side=3, line=-1.5, outer=TRUE)
par(op)
dev.off()

