Thursday, April 25, 2013

A plea for fewer word clouds

Word cloud of DOMA hearing transcripts

I must admit, there is something appealing about the word cloud - that is, until you try to understand what it actually means...

Word clouds are pervasive - even in the science world. I was somewhat spurred to write this by the incredibly wasteful summaries of EGU General Assembly survey results, which include several useless word clouds (link to document). Capitalization of words isn't even considered; e.g. "Nice" vs. "nice". I have been hesitant to equate word clouds with the hilariously labeled "mullets of the internet" but, on second thought, it is entirely appropriate. They were once a fad, but seem reluctant to die...

Oh, and yes, a "tag cloud" is a type of word cloud - I have fallen into the trap myself by including such a thing on this blog! I honestly didn't make the connection at first because it at least had the function of showing the relative importance of terms that I personally defined as topics - not an arbitrary puking up of all the words that I have ever written here. Nevertheless, I think it must be removed now - I can't tell you how many times I have wanted to go to a specific blog post by clicking on a tag, only to be forced to search through the nether regions of (extremely) small font sizes. A simple alphabetical arrangement probably makes more sense.

There are some attempts at making word clouds with R (most notably the "wordcloud" package), but they don't seem to be as visually appealing as those easily produced by sites such as Wordle. Nevertheless, you continue to see such things produced - just search for "word cloud" on R-bloggers for many examples.
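For reference, a minimal sketch of what the "wordcloud" package produces (the words and frequencies below are invented for illustration, not drawn from any real corpus):

  library(wordcloud)
  library(RColorBrewer)
  set.seed(1)  # word placement is random

  # invented word frequencies, purely for illustration
  words <- c("justice", "marriage", "federal", "state", "law",
             "court", "case", "question", "congress", "people")
  freq  <- c(50, 45, 30, 28, 25, 22, 20, 18, 15, 12)

  wordcloud(words, freq, min.freq = 1, colors = brewer.pal(8, "Dark2"))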

I decided to give Wordle a try, and chose the Defense of Marriage Act (DOMA) hearing transcripts as a source for text. The above word cloud shows the results (with some beautiful patriotic, colonial-looking font to boot!). It doesn't reveal much to me. An initial attempt caught me off-guard in that the dominant word was "justice" (below), which would possibly have been insightful if it hadn't been an artifact of the prevalence of the speakers' titles (i.e. "Justice Kagan"):

An even more worthless word cloud of DOMA hearing transcripts

Anyway, I'm glad I'm not alone in this thinking - I have come across many discussions along the same lines; in particular, the nice article by Jacob Harris. Unfortunately, it seems word clouds are here to stay, and I will just have to learn to better avert my eyes from their alluring power in the future...

Monday, January 28, 2013

My template for controlling publication quality figures



The following is a template that I usually start with when producing figures for publication. It allows me to control:
  1. The overall size of the figure (in inches) (WIDTH, HEIGHT)
  2. The layout of figure subplots (using the layout() function) (LO)
  3. The resolution of the figure (for a .png file) (RESO)
I define the overall dimensions of the figure in units of measurement (e.g. inches or centimeters) in order to control how the figure will look on the printed page. For example, a typical journal page might have ~8 inches of space for a 2 column figure and ~4 inches for a 1 column figure.

I define margins (mar, oma) in terms of point size (ps), since this relates to the height of text, which allows for control of axis labeling. By defining the outer margins (OMA) and point size (PS) before calling layout(), you will have these margins incorporated. Then, by running the x11() device (i.e. uncommenting it after the #), you can check your figure layout with layout.show(n):
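A minimal sketch of that checking step (the values are placeholders; only the workflow matters):

  WIDTH  <- 4                      # figure width [in]
  HEIGHT <- 6                      # figure height [in]
  PS     <- 10                     # point size [pt]
  OMA    <- c(3, 3, 1, 1)          # outer margins [lines of text]
  LO     <- matrix(1:2, nrow = 2)  # two panels, stacked vertically

  x11(width = WIDTH, height = HEIGHT, pointsize = PS)
  par(oma = OMA, ps = PS)
  layout(LO)
  layout.show(2)                   # inspect the panel arrangement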



I learned recently that the layout() function will adjust the character expansion size (par()$cex) depending on how your device is split up. For that reason, I usually include another line of code resetting par(cex=1) before proceeding with individual plots.
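For example:

  layout(matrix(1:4, nrow = 2))  # four panels...
  par()$cex                      # ...and layout() has reduced cex
  par(cex = 1)                   # reset before adding the individual plots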

Finally, the three different device types included in the template are:

  1. x11(), for initial tweaking of the layout and general functionality of the plotting code
  2. png(), for producing a compact figure useful for pasting into Word documents, and for cases where the figure contains a lot of information and would be slow to load as a .pdf
  3. pdf(), for a vector-based figure that is fully scalable / zoomable. When not too big, these figures look the best, and can also be embedded in LaTeX documents
I have been able to use this template to successfully tailor my figures to the formatting requirements of specific journals or other publications (e.g. overall size, point size, resolution, etc.).


Figure template:
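The original template is not reproduced verbatim here; the following is a minimal reconstruction along the lines described above (parameter names WIDTH, HEIGHT, RESO, PS, OMA, and LO follow the post; the values and panel contents are placeholders):

  ## settings
  WIDTH  <- 4                      # figure width [in]
  HEIGHT <- 6                      # figure height [in]
  RESO   <- 300                    # resolution for png() [dpi]
  PS     <- 10                     # point size [pt]
  OMA    <- c(4, 4, 1, 1)          # outer margins [lines of text]
  LO     <- matrix(1:2, nrow = 2)  # layout: two stacked panels

  ## device - uncomment one
  x11(width = WIDTH, height = HEIGHT, pointsize = PS)
  # png("figure.png", width = WIDTH, height = HEIGHT, units = "in",
  #     res = RESO, pointsize = PS)
  # pdf("figure.pdf", width = WIDTH, height = HEIGHT, pointsize = PS)

  ## layout
  par(oma = OMA, ps = PS)
  layout(LO)
  # layout.show(2)                 # check the panel arrangement (x11 only)
  par(cex = 1)                     # reset cex, which layout() may have changed

  ## panels (placeholders)
  par(mar = c(2, 2, 1, 1))
  plot(cumsum(rnorm(100)), type = "l", xlab = "", ylab = "")
  plot(rnorm(100), rnorm(100), xlab = "", ylab = "")
  mtext("x label", side = 1, line = 2.5, outer = TRUE)
  mtext("y label", side = 2, line = 2.5, outer = TRUE)

  # dev.off()                      # close the device when writing png()/pdf()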

Friday, January 18, 2013

Choosing colors visually with 'getcolors'


When plotting, I am constantly defaulting to the "main" colors in R - in other words, the colors that one can quickly call by number (1 = "black", 2 = "red", 3 = "green", 4 = "blue", ... etc.). In my opinion, these colors do not lend themselves well to compelling graphics. I imagine this is the reason for the inclusion of the much more pleasing color palettes used by default in the popular graphical package ggplot2. I try to choose better colors for final figure versions for publishing, but it is usually a tedious process of trial and error with functions like rgb(). There are some nice alternative color palettes out there, probably more in line with color theory, and one has a lot of flexibility with functions like colorRampPalette(), but I wanted a function where I can choose colors visually in order to speed up the process. Below is the function getcolors(), which allows for this selection with a mouse on a simplified color swatch using the locator() function (above, top plot). Following selection, a second plot opens showing how these colors look next to each other and on a background gradient of black to white. The function uses an RGB color model: Red increases on the y-axis, Green increases on the x-axis, and Blue is a repeated sequence of levels across the x-axis.

For the example, I chose 4 colors, which are saved in a vector. These colors were subsequently used to make the following line plot:


the getcolors function:
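The original function is not reproduced here; below is a minimal sketch of the same idea (a clickable swatch queried with locator(), with red increasing along y, green along x, and blue cycling across x), followed by a hypothetical version of the 4-color line plot:

  getcolors_sketch <- function(n = 1) {
    ng <- 40                                    # swatch grid resolution
    xy <- expand.grid(x = seq(0, 1, length.out = ng),
                      y = seq(0, 1, length.out = ng))
    cols <- rgb(red   = xy$y,                   # red increases with y
                green = xy$x,                   # green increases with x
                blue  = rep(seq(0, 1, length.out = 5),
                            length.out = nrow(xy)))  # blue cycles across x
    plot(xy$x, xy$y, pch = 15, cex = 1.5, col = cols, axes = FALSE,
         xlab = "", ylab = "", main = paste("Click", n, "colors"))
    picked <- character(n)
    for (i in seq_len(n)) {
      p <- locator(1)                           # one mouse click per color
      d <- (xy$x - p$x)^2 + (xy$y - p$y)^2
      picked[i] <- cols[which.min(d)]           # nearest swatch color
    }
    picked
  }

  # pick 4 colors, then use them in a line plot
  COLS <- getcolors_sketch(4)
  matplot(matrix(cumsum(rnorm(400)), 100, 4), type = "l",
          lty = 1, lwd = 2, col = COLS, ylab = "value")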

Thursday, January 10, 2013

Lomb-Scargle periodogram for unevenly sampled time series


In the natural sciences, it is common to have incomplete or unevenly sampled time series for a given variable. Determining cycles in such series is not directly possible with methods such as the Fast Fourier Transform (FFT) and may require some degree of interpolation to fill the gaps. An alternative is the Lomb-Scargle method (or least-squares spectral analysis, LSSA), which estimates a frequency spectrum based on a least-squares fit of sinusoids.

The above figure shows a Lomb-Scargle periodogram of a time series of sunspot activity (1749-1997) with 50% of monthly values missing. As expected (link1, link2), the periodogram displays a highly significant maximum peak at a period of ~11 years.

The function comes from a nice set of functions that I found here: http://research.stowers-institute.org/efg/2005/LombScargle/R/index.htm. An accompanying paper focusing on its application to time series of gene expression can be found here.

Below is a comparison to an FFT of the full time series. For another great resource on spectral analysis, and time series-related R methods in general, see the following website: http://zoonek2.free.fr/UNIX/48_R/15.html.


To reproduce the example:
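The original script is not reproduced here, but the gist can be sketched with the CRAN 'lomb' package (rather than the Stowers Institute functions linked above) and the built-in sunspot.month series:

  library(lomb)    # lsp() as a stand-in for the linked Lomb-Scargle functions

  # monthly sunspot numbers with 50% of the values removed at random
  set.seed(1)
  y <- as.numeric(sunspot.month)
  t <- as.numeric(time(sunspot.month))             # time in decimal years
  drop <- sample(length(y), floor(length(y) / 2))  # discard half the observations
  y <- y[-drop]; t <- t[-drop]

  # Lomb-Scargle periodogram of the gappy series, plotted against period [years]
  lsp(y, times = t, type = "period", ofac = 4)
  # the dominant peak should sit near a period of ~11 years

  # for comparison: a smoothed classical periodogram of the full, evenly sampled series
  spectrum(sunspot.month, spans = c(3, 5))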

Wednesday, January 2, 2013

Producing animated GIFs and Videos


It took me a while to figure out how to use the animation package on my Windows OS. In making an animated GIF, the problem turned out to be quite simple in the end (and I should have been more patient in reading the instructions!) - following installation of the program ImageMagick, one has to define where the program convert.exe is located using 'ani.options()'.
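A minimal sketch of that setup (the ImageMagick path below is a system-specific placeholder):

  library(animation)

  # point the package at ImageMagick's convert.exe
  ani.options(convert = "C:/Program Files/ImageMagick/convert.exe",
              interval = 0.1)

  # a simple animated GIF: a point moving around a circle
  saveGIF({
    theta <- seq(0, 2 * pi, length.out = 50)
    for (i in seq_along(theta)) {
      plot(cos(theta[i]), sin(theta[i]), xlim = c(-1, 1), ylim = c(-1, 1),
           pch = 19, cex = 2, xlab = "", ylab = "")
    }
  }, movie.name = "circle.gif")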

One is also able to make great videos:
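The analogous function for videos is saveVideo(), which relies on FFmpeg rather than ImageMagick; a minimal sketch (again, the FFmpeg path is a placeholder):

  library(animation)

  ani.options(ffmpeg = "C:/Program Files/ffmpeg/bin/ffmpeg.exe",  # system-specific
              interval = 0.1)

  saveVideo({
    theta <- seq(0, 2 * pi, length.out = 200)
    for (r in seq(0.1, 1, length.out = 50)) {
      plot(r * cos(theta), r * sin(theta), type = "l",
           xlim = c(-1, 1), ylim = c(-1, 1), xlab = "", ylab = "")
    }
  }, video.name = "circles.mp4")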







To reproduce the example (requires 'spirographR' function):

Sunday, December 30, 2012

Spirograph with R


Just had to figure out how to replicate this old toy of mine with R! I had no idea how long it's been around:
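The spirographR() function itself is not shown here, but the underlying curve is just a hypotrochoid; a minimal sketch of the math:

  # hypotrochoid: a circle of radius r rolls inside a circle of radius R,
  # tracing the path of a point at distance d from the small circle's center
  gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

  spiro_sketch <- function(R = 10, r = 7, d = 6, n = 5000, col = "darkblue") {
    revs <- r / gcd(R, r)               # revolutions needed for the curve to close
    t <- seq(0, 2 * pi * revs, length.out = n)
    x <- (R - r) * cos(t) + d * cos((R - r) / r * t)
    y <- (R - r) * sin(t) - d * sin((R - r) / r * t)
    plot(x, y, type = "l", asp = 1, axes = FALSE, xlab = "", ylab = "", col = col)
  }

  spiro_sketch(R = 10, r = 7, d = 6)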

Tuesday, December 4, 2012

Finding a pin in a haystack - PCA image filtering



I found the following post regarding the anomalous metal object observed in a Curiosity Rover photo to be fascinating - specifically, the clever ways that some programmers filtered the image for the object. The following answer on mathematica.stackexchange.com was especially illuminating for its use of a multivariate distribution to describe the color channels of a test region of "sand". This distribution was subsequently used to assess whether the rest of the image's colors belonged to the same distribution.
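The linked answer is written in Mathematica; the following is a rough R analogue of the idea (the image file name and the "sand" pixel indices are placeholders):

  # describe a "sand" training region by the mean and covariance of its RGB
  # values, then flag pixels that are unlikely under that distribution
  # (i.e. with a large Mahalanobis distance)
  library(jpeg)                                   # readJPEG() returns an h x w x 3 array

  img    <- readJPEG("curiosity.jpg")             # placeholder file name
  rgbmat <- apply(img, 3, as.vector)              # pixels x 3 matrix of R, G, B

  sand_idx <- 1:5000                              # placeholder: indices of known sand pixels
  mu  <- colMeans(rgbmat[sand_idx, ])
  Sig <- cov(rgbmat[sand_idx, ])

  d2   <- mahalanobis(rgbmat, center = mu, cov = Sig)           # squared distances
  mask <- matrix(d2 > qchisq(0.999, df = 3), nrow = dim(img)[1])
  image(t(mask)[, nrow(mask):1] * 1, col = c("black", "red"))   # anomalies in red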

Tuesday, October 30, 2012

DINEOF (Data Interpolating Empirical Orthogonal Functions)


I finally got around to reproducing the DINEOF method (Beckers and Rixen, 2003) for optimizing EOF analysis on gappy data fields - it is especially useful for remote sensing data, where cloud cover can result in large gaps. Their paper gives a nice overview of some of the various methods that have been used for such data sets. One of these approaches, which I have written about before, involves deriving EOFs from a covariance matrix as calculated from the available data. Unfortunately, as the authors point out, such covariance matrices are no longer positive-definite, which can lead to several problems. The DINEOF method seems to overcome several of these issues.
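As a reference point, here is a bare-bones sketch of the core DINEOF iteration (not the full method, which also uses cross-validation to choose the number of EOFs retained):

  # iteratively reconstruct missing values from a truncated SVD until the
  # filled-in values converge
  dineof_sketch <- function(X, k = 1, tol = 1e-6, max.iter = 500) {
    na <- is.na(X)
    X[na] <- 0                                        # initial guess for the gaps
    for (i in seq_len(max.iter)) {
      s  <- svd(X, nu = k, nv = k)
      Xr <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)   # rank-k reconstruction
      delta <- sqrt(mean((Xr[na] - X[na])^2))         # change in the gap values
      X[na] <- Xr[na]
      if (delta < tol) break
    }
    X
  }

  # example: a smooth low-rank field plus noise, with ~25% of values removed
  set.seed(1)
  M <- outer(sin(seq(0, 2 * pi, length.out = 50)),
             cos(seq(0, 4 * pi, length.out = 30))) +
       matrix(rnorm(50 * 30, sd = 0.1), 50, 30)
  M[sample(length(M), 400)] <- NA
  filled <- dineof_sketch(M, k = 1)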

Friday, April 27, 2012

Create polygons from a matrix


The following function, matrix.poly, allows for the addition of polygons to a plot based on a matrix and defined matrix positions. I have used this function on occasion to highlight specific matrix locations (e.g. in the above figure). You can do the same by overlaying another image (left in the above plot), but with this function you gain all the other polygon plotting possibilities (e.g. borders, etc.).
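The matrix.poly() code itself is not reproduced here; below is a minimal sketch of the idea, assuming the matrix is plotted with image(x = 1:nrow(m), y = 1:ncol(m), z = m) so that cell [i, j] is centered at (i, j):

  # corner coordinates of matrix cell [i, j] for use with polygon()
  cell_poly <- function(i, j, half = 0.5) {
    list(x = c(i - half, i + half, i + half, i - half),
         y = c(j - half, j - half, j + half, j + half))
  }

  m <- matrix(rnorm(100), 10, 10)
  image(x = 1:nrow(m), y = 1:ncol(m), z = m, xlab = "row", ylab = "column")

  # highlight the cells holding the three largest values
  top <- arrayInd(order(m, decreasing = TRUE)[1:3], dim(m))
  for (k in seq_len(nrow(top))) {
    p <- cell_poly(top[k, 1], top[k, 2])
    polygon(p$x, p$y, border = "black", lwd = 2, density = 10)
  }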

Thursday, April 19, 2012

Adding a transparent image layer to a plot


The following example shows how to add a transparent image-type layer to a plot. The add.alpha function (below) simply adds transparency to a vector of colors, which is then passed to the "col" argument of an image plot.
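The original add.alpha() code is not shown here; a minimal version of the idea (note that semi-transparency requires a device that supports it, e.g. png, pdf, or a cairo-based screen device):

  # add transparency (alpha in [0, 1]) to a vector of colors
  add.alpha_sketch <- function(cols, alpha = 0.5) {
    rgbvals <- col2rgb(cols) / 255
    rgb(rgbvals[1, ], rgbvals[2, ], rgbvals[3, ], alpha = alpha)
  }

  # example: lay a semi-transparent grey version of a field over a colored one
  z <- outer(1:50, 1:50, function(i, j) sin(i / 5) * cos(j / 5))
  image(z, col = heat.colors(20))                                 # base layer
  image(z, col = add.alpha_sketch(grey.colors(20), alpha = 0.4),  # transparent layer
        add = TRUE)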