Materials: If you have not already done so, please download the lesson materials for this bootcamp, unzip, then go to the folder repeating
, and open (double click) on the file repeating.Rproj
to open Rstudio.
Previously we looked at how you can use functions to simplify your code. Ideally you have a function that performs a single operation, and now you want to use it many times to do the same operation on lots of different data. The naive way to do that would be something like this:
res1 <- f(input1)
res2 <- f(input2)
...
res10 <- f(input10)
But this isn't very nice. Yes, by using a function, you have reduced a substantial amount of repetition. That is nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs. When it comes to repetition in code: Don't.
The more code I read, the more I realize the need to keep repeating: Keep it DRY! Don't. Repeat. Yourself. http://t.co/8lDtck2jZO
— Ben Simo (@QualityFrog) August 3, 2013
The nice way of repeating elements of code is to use a loop of some sort. A loop is a coding structure that reruns the same bit of code over and over, but with only small fragments differing between runs. In R there is a whole family of looping functions, each with their own strengths.
plyr
packageBy now you may have recognised that most operations that involve looping are instances of the split-apply-combine strategy (this term and idea comes from the prolific Hadley Wickham, who coined the term in this paper). You start with a bunch of data. Then you then Split it up into many smaller datasets, Apply a function to each piece, and finally Combine the results back together.
Some data arrives already in its pieces - e.g. output files from from a leaf scanner or temperature machine. Your job is then to analyse each bit, and put them together into a larger data set.
Sometimes the combine phase means making a new data frame, other times it might mean something more abstract, like combining a bunch of plots in a report.
Either way, the challenge for you is to identify the pieces that remain the same between different runs of your function, then structure your analysis around that.
R has a list of built-in functions for repeating things. This includes a range of functions that allow you to apply some function
to a series of objects (eg. vectors, matrices, dataframes or files). This is called the apply family, and includes: lapply
, sapply
, tapply
, aggregate
, mapply
, apply
.
Each repeats a function or operation on a series of elements, but they differ in the data types they accept and return.
While R's built in function do work, we're going to introduce you to another method for repeating things using the package plyr. plyr is an R Package for Split-Apply-Combine workflows. Its functional programming model encourages writing reusable functions which can be called across varied datasets and frees you from needing to manage for loop indices.
You can load plyr as
install.packages("plyr")
library(plyr)
plyr has functions for operating on lists
, data.frames
and arrays
. Each
function performs:
The functions are named based on which type of object they expect as input ([a]rray, [l]ist or [d]ata frame) and which type of data structure should be returned as output. Note here that plyr's use of "array" is different to R's, an array in ply can include a vector or matrix.
This gives us 9 core functions **ply. There are an additional three functions
which will only perform the split and apply steps, and not any combine step.
They're named by their input data type and represent null output by a _
(see
table)
Each of the xxply functions (daply
, ddply
, llply
, laply
,…) has the same structure and has 4 key features and structure:
xxply(data, variables, fun)
data
- gives the data object to be processedvariables
- identifies the splitting variablesfun
- gives the function to be called on each pieceFor an example, let's pull up gapminder dataset as before
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
And while we're at it, we'll also load into memory the functions we wrote previously
and define the variable col.table
:
source("R/functions.R")
col.table <- c(Asia="tomato",
Europe="chocolate4",
Africa="dodgerblue2",
Americas="darkgoldenrod1",
Oceania="green4")
Now, what is we want to know is the number of countries by continent. So let's make a function that takes a variable as input (e.g. list of countries) and returns the number of unique values.
Why don't you try - hint, function unique
n.unique <- function(x) length(unique(x))
n.unique(data$country)
So first do it hard way:
data.new <- data[data$continent == "Asia",]
Asia.n <- n.unique(data.new$country)
data.new <- data[data$continent == "Africa",]
Africa.n <- n.unique(data.new$country)
data.new <- data[data$continent == "Europe",]
Europe.n <- n.unique(data.new$country)
data.new <- data[data$continent == "Oceania",]
Oceania.n <- n.unique(data.new$country)
data.new <- data[data$continent == "Americas",]
Americas.n <- n.unique(data.new$country)
n.countries <- c(Africa.n, Asia.n, Americas.n, Europe.n, Oceania.n)
Now here's the equivalent in plyr:
ddply(data, "continent", summarise, n=n.unique(country))
Isn't that nice? A single line of code, easy to read.
Let's look at what happened here
ddply
function feeds in a data.frame
(function starts with d) and returns an dataframe
(2nd letter is an d)data
continent
summarise
, which tells plyr to create a new data frame.n.unique
, with argument country
being one of the variables in the data.frame
Instead of ddply
we could also use daply
of dlply
. Which to use? You need to decide which type of output is most useful to you, i.e. a list
, array
or data.frame
Finally, there's several ways we can represent the split argument:
ddply(data, "continent", summarise, n=n.unique(country))
ddply(data, .(continent), summarise, n=n.unique(country))
ddply(data, ~continent, summarise, n=n.unique(country))
.Now let's try another example.
We want to sum total population in a dataframe. Try writing this using the builtin sum
function:
ddply(data, "continent", summarise, Tot=sum(pop))
Anyone notice a problem here? Yes, the total population of the world is about 10 times to big because it's repeated every 5 years. So we need to add year
to our list of splitting criteria
ddply(data, c("continent", "year"), summarise, Tot=sum(pop))
You try
Next we want the maximum gdpPercap
on each continent.
ddply(data, c("continent", "year"), summarise, gdpPercap = max(gdpPercap))
The above example uses plyrs inbuilt summarise
function. One of the nice things about summarise is that you can apply a whole bunch of functions simultaneously. For example, what if you want to find the min
, max
, mean
, and var
of each group and the entire dataset.
For summaries of the whole dataset you can call summarise directly:
summarise(data, pop.mean=sum(pop), pop.var=var(pop), pop.max=max(pop))
But if you want to split by groups, simply combine with ddply
. All the functions you want to call are simply listed at the end as extra arguments:
ddply(data, c("continent", "year"), summarise, pop.mean=sum(pop), pop.var=var(pop), pop.max=max(pop))
Instead of summarise, we could instead use the mutate
function, to modify the existing dataframe.
tmp <- ddply(data, "continent", mutate, n.c=n.unique(country))
head(tmp)
(The output is big so doesn't print nicely. So to get something better nicer, I save the output as tmp, then used the function head
to checkout the first few rows.)
summarise
and mutate
The strategy above using either summarise
and mutate
is useful is you want to create a new or modify a dataframe. But there are other options you might like to repeat.
We can also write functions returning our own dataframes, arrays or lists and feed these into plyr. For example, the following function returns the number of unique countries
n.unique.countries <- function(x) length(unique(x$country))
Note it's like n.unique
about but now takes a dataframe and specifies the variable to use, as part of the function. Of course they should give the same result:
n.unique.countries(data) == n.unique(data$country)
daply(data, "continent", get.n.countries)
It's also possible to define the function in place as an anonymous function:
daply(data, "continent", function(x) length(unique(x$country)) )
Sometimes we want to return something that doesn't fit into a dataframe or vector; in that case you should return a list. so in this case we'll want to use dlply
because we're putting a dataframe in and getting a list out.
See if you can write a function that given a dataframe, returns a vector of countries.
get.countries <- function(x) unique(x$country)
Now let's apply it to the whole dataset
get.countries(data)
And then apply to each continent using dlpy
countries <- dlply(data, "continent", function(x) unique(x$country))
Ok, now it gets really fun.
In each year, we want to fit a model for each continent on the relationship between life expectancy and gdp per capita (as we did in functions section).
First, see if you can write a function that given a data frame x
fits a model to data
model <- function(x){
lm(lifeExp ~ log10(gdpPercap), data=x)
}
Now let's try it on a subset of data
fit <- model(data[data$year==1982 & data$continent =="Asia" ,])
Ok, so let's apply it to all continents in all years:
fitted.linear.model <- dlply(data, c("continent", "year"), model)
The output fitted.linear.model
is a list of fitted models, with same structure as fit
. We can use the coef
function to extract coefficients of a model :
coef(fitted.linear.model[[1]])
ldply(fitted.linear.model, coef)
You probably want the R2 too right?
ldply(fitted.linear.model, function(x) summary(x)$r.squared)
We could also alter our model function to return the desired output then call ddply
to get a summary for models fitted by year and continent:
model <- function(x){
fit <- lm(lifeExp ~ log10(gdpPercap), data=x)
data.frame(n=length(x$lifeExp), r2=summary(fit)$r.squared, a=coef(fit)[[1]], b=coef(fit)[[2]])
}
ddply(data, c("continent", "year"), model)
As a final extension, we could add the variables we want to fit to the function definition, so that we could fit other combinations.
model <- function(d, x, y) {
fit <- lm( d[[y]] ~ log10(d[[x]]) )
data.frame(n=length(d[[y]]), r2=summary(fit)$r.squared,a=coef(fit)[1],b=coef(fit)[2])
}
ddply(data, c("continent", "year"), model, y="lifeExp", x="gdpPercap")
ddply(data, c("continent", "year"), model, y="lifeExp", x="pop")
So there you have it - in just 6 lines we can fit about 120 linear models and return two tables summarising these models. That's why plyr rocks!
We can also exploit plyr
's ability to repeat things when generating plots. In the section on functions we used a function to help us add trend lines by continent:
add.trend.line <- function(x, y, d, ...) {
fit <- lm(d[[y]] ~ log10(d[[x]]))
abline(fit, ...)
}
But we still had to run all this code to fit lines to each continent:
data.1982 <- data[data$year == 1982,]
plot(lifeExp ~ gdpPercap, data.1982, log="x", cex=rescale(sqrt(data.1982$pop), c(0.2, 10)), col= colour.by.category(data.1982$continent, col.table), pch=21)
plot(lifeExp ~ gdpPercap, data.1982, log="x", cex=cex, col=col, pch=21)
add.trend.line("gdpPercap", "lifeExp", data.1982[data.1982$continent == "Asia",], col=col.table["Asia"])
add.trend.line("gdpPercap", "lifeExp", data.1982[data.1982$continent == "Africa",], col=col.table["Africa"])
add.trend.line("gdpPercap", "lifeExp", data.1982[data.1982$continent == "Europe",], col=col.table["Europe"])
add.trend.line("gdpPercap", "lifeExp", data.1982[data.1982$continent == "Americas",], col=col.table["Americas"])
add.trend.line("gdpPercap", "lifeExp", data.1982[data.1982$continent == "Oceania",], col=col.table["Oceania"])
That's a lot of typing that is really very similar, and the sort of thing that is (a) boring to type, (b) prone to errors, and (c) hard to change (e.g. if we wanted to run it on a different data set, or change which continents we ran it over etc).
One way to avoid repetition is to pass the add.trend.line
function into d_ply
. The underscore in d_ply
tells us that we don't want any output, we just want to run the function
plot(lifeExp ~ gdpPercap, data.1982, log="x", cex=cex, col=col, pch=21)
d_ply(data.1982, "continent", function(x) add.trend.line("gdpPercap", "lifeExp", x, col=col.table[x$continent]))
Another example Plyr really shines when there are many things to deal with at once. For example, we plotted relative population growth by country for three countries before:
pop.by.country.relative <- function(country, data, base.year=1952) {
dsub <- data[data$country == country, c("year", "pop")]
dsub$pop.rel <- dsub$pop / dsub$pop[dsub$year == base.year]
dsub
}
plot(pop.rel ~ year, pop.by.country.relative("India", dat), type="o")
lines(pop.rel ~ year, pop.by.country.relative("Australia", dat), type="o", col="green4")
lines(pop.rel ~ year, pop.by.country.relative("China", dat), type="o", col="red")
But if we wanted to do this for, say, all the countries in Asia that'd be a lot of copy and paste. With plyr this is easy. Let's make a function that plot's a growth function for all years within a dataset:
add.growth.line <- function(x, base.year,...){
lines(x$year,x$pop / x$pop[x$year == base.year], type="o",...)
}
We can then feed this into d_ply
to generate a plot for all countries
plot(NA, type="n", xlim=range(data$year), ylim=c(1, 6), xlab="Year", ylab="Relative population size")
d_ply( data[data$continent =="Asia" ,], c("country"), function(x) add.growth.line(x, 1952))
And we could use the same approach to make plots for the entire world, colouring by continent
plot(NA, type="n", xlim=range(data$year), ylim=c(1, 6), xlab="Year", ylab="Relative population size")
d_ply(data, c("country"), function(x) add.growth.line(x, 1952, col=col.table[x$continent]))
When you mention looping, many people immediately reach for for
. Perhaps
that's because they are already familiar with these other languages,
like basic, python, perl, C, C++ or matlab. While for
is definitely the most
flexible of the looping options, we suggest you avoid it wherever you can, for
the following two reasons:
The main problems with this code are that
All it's doing is making a plot! Compare that to something like this
for (i in unique(Raw$SPP))
makePlot(i, data = Raw)
That's much nicer! It's obvious what the loop does, and no new variables are
created. Of course, for the code to work, we need to define the function makePlot
which actually makes our plot, but having all that detail off in a
function has many benefits:
But as you saw above, plotting can also be handled with the d_ply
functions so we'd suggest you use that, as it enforces good habits.
In fact, we'd encourage you to avoid for
unless the order of iteration is important. A key feature of all the examples in the plyr
section is that order of iteration is not important. This is crucial. If each each iteration is independent, then you can cycle through them in whatever order you like.
One place where for
loops shine is in writing simulations; if one iteration depends on the value of a previous iteration, then a for
loop is probably the best way of repeating things.
In an (unbiased) random walk, each time step we move left or right with probability 0.5. R has lots of random number generation functions. The runif
function generates random numbers uniformly on [0,1]
so we can draw random steps directions like this:
for (i in 1:10)
print(if (runif(1) < 0.5) 1 else -1)
## [1] 1
## [1] 1
## [1] -1
## [1] -1
## [1] 1
## [1] 1
## [1] 1
## [1] -1
## [1] -1
## [1] -1
Or we could take the sign
of a standard normal:
for (i in 1:10)
print(sign(rnorm(1)))
## [1] 1
## [1] -1
## [1] 1
## [1] 1
## [1] -1
## [1] -1
## [1] -1
## [1] 1
## [1] 1
## [1] -1
The implementation does not matter.
random.step <- function() {
sign(rnorm(1))
}
We can then use this to step 20 steps:
x <- 0
for (i in 1:20)
x <- x + random.step()
x
## [1] 0
which will end up somewhere between -20 and 20, but with a mean of 0.
We want to track the entire trajectory:
nsteps <- 200
x <- numeric(nsteps + 1) # space to store things in
x[1] <- 0 # start at 0
for (i in seq_len(nsteps)) {
x[i+1] <- x[i] + random.step()
}
plot(x, type="l", xlab="Step number", ylab="Position")
We might want to run that lots of times, so put it into a function:
random.walk <- function(nsteps, x0=0) {
x <- numeric(nsteps + 1)
x[1] <- x0
for (i in seq_len(nsteps)) {
x[i+1] <- x[i] + random.step()
}
x
}
which is now much easier to use:
plot(random.walk(200), type="l", xlab="Step number", ylab="Position")
nsteps <- 200
nrep <- 40
cols <- rainbow(nrep)
plot(NA, xlim=c(1, nsteps+1), ylim=c(-30, 30), xlab="Step number", ylab="Position")
for (i in 1:nrep)
lines(random.walk(200), col=cols[i])
This material was developed by Daniel Falster and Rich FitzJohn based on material prepared by Karthik Ram and Hadley Wickam.