R is capable of producing publication-quality graphics. During this session, we will develop your R skills by introducing you to the basics of graphing.
Typing plot(1,1) does a lot by default. For example, the axes are automatically set to encapsulate the data, a box is drawn around the plotting space, and some basic labels are given as well.
Similarly, typing hist(rnorm(100)) or boxplot(rnorm(100)) does a lot of the work for you.
Plotting in R is about layering data and detail onto a canvas. So, let’s start with a blank canvas:
We are simply calling a new plotting device (plot always initiates a new device), plotting the coordinates (5,5), which we can’t see because of type="n", and we’re also telling R that we don’t want any default axes or annotations (e.g., titles or axis labels). Basically, we just want a blank slate. Finally, we set the x- and y-axis ranges, so that we have some space to work.
If you look at the helpfile for axis (?axis) you’ll see that you can pretty much control any apsect of axis creation. The only argument required for an axis is the side of the plot you want it on. Here 1 and 2 correspond with x and y; 3 and 4 also correspond with x and y, but on the other side of the plot (e.g., for secondary axes).
I prefer tick labels to align horizontally where possible. Here we use the argument las=2 (“label axis style”?). A value of 2 means “always horizontal”.
When plotting with R you are adding subsequent layers to your canvas. If something needs redoing, you need to start again from stratch. Therefore, it is very important to save a graph’s provenance using a text editor. If you change something, re-run all the lines of code to generate your graph. Saving as text is also great for when you want re-use a particular graphic style that you’ve developed. Graphing is one part of the scientific process for which you have some creative freedom.
Next, add axis labels and a title:
123456
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")
mtext stands for margin text. You need an argument for the text to be printed, one for the side of the plot (as for the axis function), and one saying how far from the axis you want to label is numbers of “lines”. Generally, the tick marks are at the zeroth line (line=0) and the tick labels in the first line. The default for plot is line=3, which I generally leave as is. You don’t generally need titles for plots in manuscripts, but they can come in handy for presentations, blogs, etc.
There is also the figure box(), which I don’t tend to use, but you may want to (?):
1234567
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()
Let’s add some data. A red point at (5,5):
12345678910
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)
Now set your random number generator seed to 11, so that our plots all look the same. We’ll simulate some data. First, some normally distribution “independent” values. Then, some values that depend on these first set of values via a prescribed relationship and error distribution:
1234567891011121314
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)set.seed(11)x <- rnorm(100,5,1.5)y <- rnorm(100,2+0.66*x)points(x, y, pch=21, col="white", bg="black")
Typing ?points will give you the common options for the points function. I have used pch=21 (presumably “plotting character”?). The character (a circle) is white and the character background is black. I like this, because it helps distinguish points when they overlap.
Fit a linear model and add the line of best-fit.
12
mod <- lm(y ~ x)summary(mod)
12345678910111213141516
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.4229 -0.5412 -0.0208 0.5534 2.6547
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0414 0.3616 5.65 1.6e-07
x 0.6768 0.0723 9.37 2.9e-15
Residual standard error: 0.986 on 98 degrees of freedom
Multiple R-squared: 0.472, Adjusted R-squared: 0.467
F-statistic: 87.7 on 1 and 98 DF, p-value: 2.86e-15
Add the regression line:
123456789101112131415
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)set.seed(11)x <- rnorm(100,5,1.5)y <- rnorm(100,2+0.66*x)points(x, y, pch=21, col="white", bg="black")abline(mod)
You see that the model estimates reflect the parameters we used to generate the data (lm appears to be working). Let’s add some elements to our graph that highlight these fitted model results. Start by generating a sequence of numbers spanning the range of x:
1
x.seq <- seq(0,10,0.1)# ss <- seq(min(x), max(x), 0.1)
Using the best-fit model, we can now predict values for each of the values in the x.seq vector:
12345678910111213141516171819
y.fit <- predict(mod, list(x = x.seq))# This is essentially what `abline(mod)` has done, and so there is no point plotting these again. However, using the `interval` argument, we can extract prediction and confidence intervals:y.pred <- predict(mod, list(x = x.seq), interval ="prediction")plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)points(x, y, pch=21, col="white", bg="black")abline(mod)lines(x.seq, y.pred[,"lwr"], lty =2)lines(x.seq, y.pred[,"upr"], lty =2)
lty=2 (presumably “line type”?) makes a dashed line. What’s the difference between confidence intervals and prediction intervals? Now, let’s calculate confidence intervals and do something a little bit more exciting: a transparent, shaded confidence band:
12345678910111213141516
y.conf <- predict(mod, list(x = x.seq), interval ="confidence")plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)points(x, y, pch=21, col="white", bg="black")lines(x.seq, y.pred[,"lwr"], lty =2)lines(x.seq, y.pred[,"upr"], lty =2)polygon(c(x.seq, rev(x.seq)), c(y.conf[,"lwr"], rev(y.conf[,"upr"])), col = rgb(0,0,0,0.2), border =NA)
Text can also be added easily, using the coordinates of the current plot:
123456789101112131415161718
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)points(x, y, pch=21, col="white", bg="black")abline(mod)lines(x.seq, y.pred[,"lwr"], lty =2)lines(x.seq, y.pred[,"upr"], lty =2)polygon(c(x.seq, rev(x.seq)), c(y.conf[,"lwr"], rev(y.conf[,"upr"])), col = rgb(0,0,0,0.2), border =NA)text(1,10,"A", cex=1.5)mtext("A", side=3, adj=0, cex=1.5)text(3,8.5, expression(Y[i]== beta[0]+beta[1]*X[i1]+epsilon[i]))# see demo(plotmath) for more examples
Legends typically take a bit of trial and error, but can do most things.
12345678910111213141516171819
plot(5,5, type="n", axes=FALSE, ann=FALSE, xlim=c(0,10), ylim = c(0,10))axis(1)axis(2, las=2)mtext("x-axis", side =1, line =3)mtext("y-axis", side =2, line =3)title("My progressive plot")box()points(5,5, col="red")points(5,7, col="orange", pch=3, cex=2)points(c(0,0,1), c(2,4,6), col="green", pch=4)points(x, y, pch=21, col="white", bg="black")abline(mod)lines(x.seq, y.pred[,"lwr"], lty =2)lines(x.seq, y.pred[,"upr"], lty =2)polygon(c(x.seq, rev(x.seq)), c(y.conf[,"lwr"], rev(y.conf[,"upr"])), col = rgb(0,0,0,0.2), border =NA)text(1,10,"A", cex=1.5)mtext("A", side=3, adj=0, cex=1.5)text(3,8.5, expression(Y[i]== beta[0]+beta[1]*X[i1]+epsilon[i]))# see demo(plotmath) for more exampleslegend("bottomright", c("green data","random data","orange point"), pch=c(4,20,3), col=c("green","black","orange"), pt.cex=c(1,1,2), bty="n")
Saving your plot
You can save your plot by simply using the “Export” functionality in the RStudio GUI. In general, plots should be saved as vector graphics files (i.e., PDF), because these “scale” (i.e., don’t lose resolution as you zoom in). However, vector graphics files can get large and slow things down if they contain a lot of information, in which case you might want to save as a raster or image file (i.e., PNG). PNGs work better on webpages and in presentations, because such software is not good at dealing with vector graphics. Do not save as a JPG.
You can also save your plot from the command line. Why would this be useful?
To do so, you need to fire-up a graphics device (e.g., PDF or PNG), write the layers to the file, and then close the device off (you won’t be able to open the file if you miss this last step).
You’ll notice above that data, labels and titles are within the plot function, rather than layering up as demostrated earlier.
Exercise
Analyse and graph the relationship between height and weight of plants in the herbivore dataset. Attempt to reproduce the following graph.
123456
data <- read.csv("data/seed_root_herbivores.csv")data$lWeight <- log10(data$Weight)mod <- lm(lWeight ~ Height, data)summary(mod)
12345678910111213141516
Call:
lm(formula = lWeight ~ Height, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.765 -0.125 0.030 0.140 0.579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.18527 0.06669 -2.78 0.0061
Height 0.01945 0.00116 16.84 <2e-16
Residual standard error: 0.237 on 167 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.627
F-statistic: 284 on 1 and 167 DF, p-value: <2e-16
12345678910
ss <- seq(min(data$Height), max(data$Height),0.5)ss.conf <- predict(mod, list(Height = ss), interval ="confidence")plot(lWeight ~ Height, data, axes=FALSE, xlab=expression(paste("Height (", m^2,")")), ylab="Weight (g), log-scale")polygon(c(ss, rev(ss)), c(ss.conf[,"lwr"], rev(ss.conf[,"upr"])), col = rgb(0,0,0,0.2), border =NA)lines(ss, ss.conf[,"fit"])axis(2, at = log10(c(seq(0.3,0.9,0.1), seq(1,9,1), seq(10,60,10))), labels =FALSE, tck =-0.015)axis(2, at = log10(c(0.3,1,3,10,30)), labels = c(0.3,1,3,10,30), las =2)axis(1)
Plotting parameters
Each device has a set of graphical parameters that can be set up before you start plotting. Most of these parameters control the “look and feel” of the plotting device. Take a look at the par helpfile for the vast array of options:
1
?par
For example, you can control the portion of your canvas taken up by each of the margins around your plot using mar (i.e., “lines” in the margin). The default is:
12
par(mar=c(5,4,4,2)+0.1)hist(x)
Widen the right axis.
1234
op <- par(mar=c(5,4,4,5)+0.1)hist(x)axis(side=4)mtext("y-axis on this side", side=4, line=3)
A graphics device can have many panels. mfrow allows you to add multiple frames by row - so each time you call a plot, it will be added to the next panel by rows first
12345678
op <- par(mfrow=c(2,2))hist(x, xlim=c(0,10))text(1,25,"A")hist(y, xlim=c(0,10))text(1,25,"B")h <- hist(x, plot=FALSE)h
Panels may have the same information and axes don’t need to be repeated.
12345678910111213141516171819202122232425
op <- par(mfrow=c(2,2), mar=c(1,1,1,1), oma=c(4,3,4,2), ann=FALSE)plot(x, y, axes=FALSE)axis(1, labels=FALSE)axis(2)mtext("A", side=3, adj =0)plot(x, y, col=rainbow(10), axes=FALSE)axis(1, labels=FALSE)axis(2, labels=FALSE)mtext("B", side=3, adj =0)plot(x, y, axes=FALSE)axis(1)axis(2)mtext("C", side=3, adj =0)plot(x, y, axes=FALSE)axis(1)axis(2, labels=FALSE)mtext("D", side=3, adj =0)mtext("x-axis", side=1, outer=TRUE, line=2)mtext("y-axis", side=2, outer=TRUE, line=2)mtext("title", side=3, outer=TRUE, line=1, cex=1.5)
1
par(op)
Barplots
Barplots are commonly used in biology, but are not as straightword as you might hope in R. Make sure the herbivore data is loaded:
1
data <- read.csv("data/seed_root_herbivores.csv")
Using what you’ve learned so far, make a new column containing “Both”, “Root only”, “Seed only” and “None” based on the four possible combinations of root and seed herbivores in the dataset. This column needs to be a factor for the analysis we will be conducting and graphing.
Let’s run an ANOVA. First, let’s look at the linear model:
12
mod.lm <- lm(Height ~ Herbivory, data = data)summary(mod.lm)
123456789101112131415161718
Call:
lm(formula = Height ~ Herbivory, data = data)
Residuals:
Min 1Q Median 3Q Max
-36.20 -11.20 -0.14 8.86 38.86
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.14 2.03 24.67 <2e-16
HerbivoryNone 18.13 3.34 5.43 2e-07
HerbivoryRoot only 2.06 2.76 0.75 0.457
HerbivorySeed only 8.79 3.41 2.57 0.011
Residual standard error: 14.5 on 165 degrees of freedom
Multiple R-squared: 0.174, Adjusted R-squared: 0.159
F-statistic: 11.6 on 3 and 165 DF, p-value: 6.08e-07
Now an analysis of variance of that linear model:
12
mod.aov <- aov(mod.lm)summary(mod.aov)
123
Df Sum Sq Mean Sq F value Pr(>F)
Herbivory 3 7339 2446 11.6 6.1e-07
Residuals 165 34769 211
Okay, so there are significant differences among the treatments. A Tukey Honest Significance Differences test will suggest where these differences occur:
filled.contour(volcano, color.palette = terrain.colors, asp =1)
The R package “ggplot” has become popular. However, you need to learn a syntax that is somewhat different to standard R syntax (hence the reason we did not cover it here).
The R package “lattice” is great for multivariate data. Here’s the volcano again using wireframe():