Materials: If you have not already done so, please download the lesson materials for this bootcamp, unzip, then go to the directory reproducible
, and open (double click) on the file reproducible.Rproj
to open Rstudio.
We're now in the home stretch of the workshop - congratulations! Up to this point, we've talked about how to make your code efficient (good programming practice), accurate (testing), and maintainable (modularization + version control). Now we're going to talk about a final and very important concept known as reproducibility.
In case you hadn't already noticed, science has a growing credibility problem, mostly due to inability of researchers to reproduce published results.
For our purposes, we can summarize the goal of reproducibility in two related ways, one technical and one colloquial.
In a technical sense, your goal is to have a complete chain of custody (ie, provenance) from your raw data to your finished results and figures. That is, you should always be able to figure out precisely what data and what code were used to generate what result - there should be no "missing links". If you have ever had the experience of coming across a great figure that you made months ago and having no idea how in the world you made it, then you understand why provenance is important. Or, worse, if you've ever been unable to recreate the results that you once showed on a poster or (gasp) published in a paper…
In a colloquial sense, I should be able to sneak into your lab late at night, delete everything except for your raw data and your code, and you should be able to run a single command to regenerate EVERYTHING, including all of your results, tables, and figures in their final, polished form. Think of this as the "push button" workflow. This is your ultimate organizational goal as a computational scientist. Importantly, note that this rules out the idea of manual intervention at any step along the way - no tinkering with figure axes in a pop-up window, no deleting columns from tables, no copying data from one folder to another, etc. All of that needs to be fully automated.
As an added bonus, if you couple this with a version control system that tracks changes over time to your raw data and your code, you will be able to instantly recreate your results from any stage in your research (the lab presentation version, the dissertation version, the manuscript version, the Nobel Prize committee version, etc.). Wouldn't that be nice?
So let's look at how the things we've learnt help reproducibility. Throughout the bootcamp we've been giving you all the elements of a reproducible work flow.
Qu: What are these elements
The only one your missing now is the last one. We're going to show you two ways to bind all your work together.
The first involves simply printing output to file. This works well for final versions of papers. You want a nice short readable analysis script and some functions that produce figures. By default R produces plots and tables in the output window, so our goal is to save these to file.
Saving tables is relatively straight forward using write.csv
.
Let's say we have a table of values from
library(plyr)
source("R/functions.R")
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
# For each year, fit linear model to life expectancy vs gdp by continent
model.data <- ddply(data, .(continent,year), fit.model, x="lifeExp", y="gdpPercap")
Now we just want to write this table to file:
dir.create("output")
write.csv(model.data, file="output/table1.csv")
This is good, but there's a couple of problems:
So here's a better version:
write.csv(format(model.data, digits=2, trim=TRUE), file="output/table1.csv", row.names=FALSE, quote=FALSE)
One way of plotting to a file is to open a plotting device (such
as pdf
or png
) run a series of commands that generate plotting
output, and then close the device with dev.off()
, e.g.
pdf("my-plot.pdf", width=6, height=4)
# ...my figure here
dev.off()
Hopefully your figure code is stored as a function, so that you can type something like
library(plyr)
source("R/functions.R")
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
data.1982 <- data[data$year == 1982,]
myplot(data.1982,"gdpPercap","lifeExp", main =1982)
to make your figure. Now you can type
pdf("output/my-plot.pdf", width=6, height=4)
myplot(data.1982,"gdpPercap","lifeExp", main =1982)
dev.off()
However, this still gets a bit unweildly when you have a large number of figures to make (especially for talks where you might make 20 or 30 figures).
A better approach proposed by Rich is to use a little function called to.pdf
:
to.pdf <- function(expr, filename, ..., verbose=TRUE) {
if ( verbose )
cat(sprintf("Creating %s\n", filename))
pdf(filename, ...)
on.exit(dev.off())
eval.parent(substitute(expr))
}
Which can be used like so:
to.pdf(myplot(data.1982,"gdpPercap","lifeExp", main=1982), "output/1982.pdf", width=6, height=4)
A couple of nice things about this approach:
pdf
via ...
, so we don't need to
duplicate pdf
's argument list in our function.on.exit
call ensures that the device is always closed, even if
the figure function fails.For talks, I often build up figures piece-by-piece. This can be done by adding an option to your function (for a two-part figure)
fig.progressive <- function(with.trend=FALSE) {
set.seed(10)
x <- runif(100)
y <- rnorm(100, x)
par(mar=c(4.1, 4.1, .5, .5))
plot(y ~ x, las=1)
if ( with.trend ) {
fit <- lm(y ~ x)
abline(fit, col="red")
legend("topleft", c("Data", "Trend"),
pch=c(1, NA), lty=c(NA, 1), col=c("black", "red"), bty="n")
}
}
Now – if run with as
fig.progressive(FALSE)
just the data are plotted, and if run as
fig.progressive(TRUE)
the trend line and legend are included. Then with the to.pdf
function, we can do:
to.pdf(fig.progressive(TRUE), "output/progressive-1.pdf", width=6, height=4)
to.pdf(fig.progressive(FALSE), "output/progressive-2.pdf", width=6, height=4)
which will generate the two figures. The general idea can be expanded to more devices, such as png (see this blog post for details).
We can use a similar approach to export figures in png format
to.dev <- function(expr, dev, filename, ..., verbose=TRUE) {
if ( verbose )
cat(sprintf("Creating %s\n", filename))
dev(filename, ...)
on.exit(dev.off())
eval.parent(substitute(expr))
}
to.dev(myplot(data.1982, "gdpPercap","lifeExp", main=1982), png, "output/1982.png", width=600, height=400)
Another approach for making reproducible outputs is to combine plots, tables, code and a description of content in a single document.
The classical (sweave) has been progressively replaced by an R package called knitr
, which uses a language called Markdown. knitr
can be used for many nice features, from reports to full presentations. It can be coded in a specific language called R markdown. For example, this report that we provided to you derived from this Rmd file.
You can find out more about how knitr
works (here) and (here).
Here is what my ideal paper would look like
The work flow that we're following here is just a suggestion. Organizing code and data is an art, and a room of 100 scientists will give you 101 opinions about how to do it best. Consider this material a useful place to get started, and don't hesitate to tinker and branch out as you get a better feel for this process.
Some good examples of reproducible research we know of
Qu: What are some of the additional benefits of this workflow
Qu: What are some of barriers to achieving this workflow?
This material presented here was adapted from swc by Daniel Falster, incorporating new material from Rich FitzJohn and also views presented by a wide range of people across the twittersphere and within the swc community