I decided to have a go at extracting some stats for R-bloggers postings (e.g. dates, authors). Aggregating by date results in the plot above. You may notice a prominent outlier on 2011-05-26, which was due to repeat postings of the same entry. I also checked to see if there is a trend according to the number of postings by day of the week:

As one might expect, there is a lull in postings on the weekend. So, if you want longer exposure on the front page of R-bloggers, then you might want to post later in the week to lessen the competition!

I fit a GAMM (family="poisson") to the data with a cyclical spline term for day of the week ("wday") and another spline term for time (i.e. numeric conversion of date) to further explore the trends (with help from a great post on using GAMs with temporal data):
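The key model call is short. Here is a minimal sketch of it on simulated daily counts (the simulated `daydf` is a stand-in; the real one is built by the full script at the end of the post):

```r
library(mgcv)

# Simulated stand-in for the real daydf (daily post counts with a weekly cycle
# and a slow trend), just to make the model call runnable on its own.
set.seed(1)
daydf <- data.frame(time = 1:700)
daydf$wday <- (daydf$time %% 7) + 1
daydf$npost <- rpois(700, lambda = exp(1 + 0.3 * sin(2 * pi * daydf$wday / 7) +
                                         0.001 * daydf$time))

# Cyclic cubic spline (bs = "cc") for day of week, smooth term for time;
# knots at 0.5 and 7.5 close the weekly cycle between Saturday and Sunday.
fit0 <- gamm(npost ~ s(wday, bs = "cc", k = 7) + s(time),
             data = daydf, family = poisson,
             knots = list(wday = c(0.5, 7.5)))
summary(fit0$gam)
```

The cyclic basis forces the smooth to take the same value (and slope) at both ends of the week, so the fitted weekday effect wraps around smoothly rather than jumping between Saturday and Sunday.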

The day-of-week oscillations are strong enough that the predicted values look like a band, so for easier viewing I added a line (in red) for a model containing only the long-term trend. Zooming in on a smaller subset of the data reveals the weekly oscillations:

The long-term trend indicates that a decline in the number of posts started in 2013 and continues to the present. I binned the data by year quarter and found the same:

This decline seems to be related to the number of authors who post regularly, which is falling at a similar rate. The correlation is strong enough that we can estimate an average of 3.7 postings / author / quarter (or ~1 posting / author / month). So, what's going on? My hunch is that, while the use of R (and probably the number of R blogs) continues to grow (link1, link2), R-bloggers doesn't seem to be accepting all blogs that apply to have their feeds aggregated (and may even be actively removing some; e.g., this blog has unfortunately been dropped without explanation). Maybe the time invested in curation is maxed out at ~7 posts / day?
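As a quick sanity check on the per-author rate: the slope of the no-intercept regression of posts on unique authors gives posts per author per quarter, and dividing by three months gives the monthly rate (the 3.7 figure is the slope reported above):

```r
# Slope from lm(npost ~ nauthor - 1, quartdf): posts per author per quarter
posts_per_author_quarter <- 3.7
# A quarter is three months, so the monthly rate is:
posts_per_author_month <- posts_per_author_quarter / 3
round(posts_per_author_month, 2)  # ~1.2, i.e. roughly 1 post / author / month
```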

All in all, I was pleasantly surprised by how easy web scraping can be. The code to reproduce this exercise is below (be warned that the scraping loop took about 30 minutes to gather data from 620 webpages).

**To reproduce the example** (required packages: rvest (CRAN), sinkr (GitHub)):

```r
#library(devtools)
#install_github("marchtaylor/sinkr")
library(rvest)
library(sinkr)

# Determine the number of pages to scrape
page1_path <- "http://www.r-bloggers.com/page/1/"
tmp <- html(page1_path)
tmp <- html_node(tmp, ".last")
npages <- as.numeric(html_text(tmp))
npages
#npages=10

# Scrape dates and authors from each page
dates <- vector(mode="list", npages)
authors <- vector(mode="list", npages)
for(i in 1:npages){
  pagei_path <- gsub("page/1/", paste0("page/", i, "/"), page1_path)
  HTML <- html(pagei_path)
  # dates
  tmp <- html_nodes(HTML, ".date")
  tmp <- html_text(tmp)
  tmp <- as.POSIXct(strptime(tmp, tz="GMT", format="%B %d, %Y"))
  dates[[i]] <- tmp
  # authors
  tmp <- html_nodes(HTML, ".meta")
  tmp <- html_text(tmp)
  tmp <- substring(tmp, regexpr("By ", tmp) + 3)
  authors[[i]] <- tmp
  print(paste(i, "of", npages, "finished"))
}
#save(dates, file="dates.Rdata")
#save(authors, file="authors.Rdata")
#load("dates.Rdata"); load("authors.Rdata")

### By day
df <- data.frame(
  date=as.Date(as.POSIXct(unlist(dates), origin="1970-01-01", tz="GMT")),
  author=unlist(authors),
  npost=1
)
head(df)

daydf <- aggregate(npost ~ date, df, sum)
daydf$time <- as.numeric(daydf$date)
daydf$wday <- as.POSIXlt(daydf$date)$wday + 1
head(daydf)

png("rbloggers_raw_day.png", width=5, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(npost ~ date, daydf, t="p", main="R-bloggers posts per day",
  ylab="Number of posts", pch=1, cex=1, col=rgb(0,0,0,0.3))
par(op)
dev.off()

png("rbloggers_boxplot_wday.png", width=6, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(6,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
boxplot(npost ~ wday, daydf, main="R-bloggers posts per weekday",
  ylab="Number of posts", col=8, outline=FALSE, xaxt="n")
axis(1, at=1:7,
  labels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),
  las=2)
par(op)
dev.off()

### Model
library(mgcv)

# wday + time
fit0 <- gamm(npost ~ s(wday, bs="cc", k=7) + s(time),
  data=daydf, family=poisson, knots=list(wday=c(0.5, 7.5)))
summary(fit0$gam)

png("rbloggers_gam_terms.png", width=7, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,0.5,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(fit0$gam, pages=1)
par(op)
dev.off()

plot(fit0$gam, select=1)
plot(fit0$gam, select=2)
daydf$pred0 <- predict(fit0$gam, type="response")

# time only
fit1 <- gamm(npost ~ s(time), data=daydf, family=poisson)
summary(fit1$gam)
plot(fit1$gam, pages=1)
daydf$pred1 <- predict(fit1$gam, type="response")

AIC(fit0$lme, fit1$lme)

# plot
png("rbloggers_posts_per_day.png", width=5, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(npost ~ date, daydf, t="p", ylab="Number of posts",
  main="R-bloggers posts per day", pch=1, cex=1, col=rgb(0,0,0,0.3), ylim=c(0,30))
lines(pred0 ~ date, daydf, col=rgb(0,0,0,0.5))
lines(pred1 ~ date, daydf, col=2, lwd=3, lty=1)
legend("topleft", legend=c("data","s(wday)+s(time)","s(time)"),
  pch=c(1,NA,NA), lty=c(NA,1,1), lwd=c(1,1,3),
  col=c(rgb(0,0,0,0.3), rgb(0,0,0,0.5), 2))
par(op)
dev.off()

png("rbloggers_posts_per_day_short.png", width=5, height=4, units="in", res=400, type="cairo")
op <- par(mar=c(4,4,2,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
plot(npost ~ date, daydf[300:600,], t="p", ylab="Number of posts",
  main="R-bloggers posts per day", pch=1, cex=1, col=rgb(0,0,0,0.5))
lines(pred0 ~ date, daydf, col=rgb(0,0,0,1))
lines(pred1 ~ date, daydf, col=2, lwd=3, lty=1)
legend("topleft", legend=c("data","s(wday)+s(time)","s(time)"),
  pch=c(1,NA,NA), lty=c(NA,1,1), lwd=c(1,1,3),
  col=c(rgb(0,0,0,0.5), rgb(0,0,0,1), 2))
par(op)
dev.off()

# residuals
plot(daydf$time, resid(fit0$gam), col=rgb(0,0,0,0.5))
plot(daydf$wday, resid(fit0$gam), col=rgb(0,0,0,0.5))

### Quarters
xint <- seq(as.Date("2005-01-01"), as.Date(Sys.time()), by="3 months")
xmid <- xint[-length(xint)] + (diff(xint)/2)
quartdf <- data.frame(date=xmid)
quartdf$nauthor <- NaN
quartdf$npost <- NaN
xbin <- cut(df$date, xint)
rownum <- match(xbin, levels(xbin))
for(i in seq(length(quartdf$date))){
  hits <- which(rownum == i)
  quartdf$nauthor[i] <- length(unique(df$author[hits]))
  quartdf$npost[i] <- sum(df$npost[hits])
}

png("rbloggers_quarter_stats.png", width=8, height=3, units="in", res=400,
  type="cairo", family="Arial")
op <- par(mar=c(4,4,3,0.5), mgp=c(2.5,0.5,0), tcl=-0.25)
par(mfcol=c(1,3))
plot(nauthor ~ date, quartdf, xlab="", ylab="Unique authors", t="o", pch=1, cex=1, col=1)
plot(npost ~ date, quartdf, xlab="", ylab="Number of posts", t="o", pch=1, cex=1, col=1)
plot(npost ~ nauthor, quartdf, t="o", xlab="Unique authors", ylab="Number of posts")
fit <- lm(npost ~ nauthor - 1, quartdf)
abline(fit, lty=2, col=4)
reltext(0, 0.95, pos=4, col=4, labels=bquote(
  italic(npost) == .(sprintf("%0.2f", (coef(fit)[1])))~"*"~italic(nauthor)))
reltext(0, 0.85, pos=4, col=4, labels=bquote(
  .("adj.")~bolditalic(R)^2 == .(round(summary(fit)$adj.r.squared, 2))))
mtext("R-bloggers posts by quarter", font=2, cex=1, side=3, line=-1.5, outer=TRUE)
par(op)
dev.off()
```
