Nice R Code

Punning code better since 2013

As empirical biologists, you’ll generally have data to read in. You probably have your data in an Excel spreadsheet. The simplest way to load it into R is to save a copy of the data as a comma-separated values (csv) file and work with that.

It is actually possible to read directly from Excel (see the gdata package, which has a read.xls function, and see this page for other alternatives). This is usually more hassle than it’s worth, and going through a comma-separated file is easy enough.

To load the data into R:

(this doesn’t usually produce any output – the data is “just there” now).
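A minimal sketch of the load step. The file and its contents are made up here so that the example runs on its own; with your own data you would only need the read.csv line, pointed at your csv file.

```r
# Write a tiny, made-up CSV file to a temporary location so that this
# example is self-contained (the columns and values are hypothetical).
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Plot,Height,Weight",
  "plot-2,31,4.2",
  "plot-2,41,5.8",
  "plot-8,42,6.1"
), csv_file)

# Read the file; read.csv uses the first line as column names by default.
data <- read.csv(csv_file)
```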

Clicking the little table icon next to the data in the Workspace browser will view the data. Running View(data) will do the same thing.

The data variable contains a data.frame object. It is a number of columns of the same length, arranged like a matrix. That sentence is tricky, for reasons that will become apparent.

Often, looking at the first few rows is all you need to remind yourself about what is in a data set.

You can get a vector of names of columns

You can get the number of rows:

and the number of columns

The last one is surprising to most people. There is a logical (if not good) reason for this, which we will get to later.
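As a rough sketch, these are the kinds of calls being described. The toy data.frame is hypothetical and stands in for the real data set, and I am assuming that the “surprising” call is length, which counts columns rather than rows or cells.

```r
# Hypothetical toy data standing in for the tutorial's data set
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64),
  Weight = c(4.2, 5.8, 6.1, 9.3)
)

head(data, 2)   # first two rows (head(data) shows the first six by default)
names(data)     # character vector of column names
nrow(data)      # number of rows: 4
ncol(data)      # number of columns: 3
length(data)    # also 3, because a data.frame is a list of columns
```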

Aside from issues around factors and character vectors (that we’ll cover shortly) this is most of what you need to know about loading data.

However, it’s useful to know things about saving it.

• Column names should be consistent, of a sensible length, and contain no special characters.

• For missing values, either leave them blank or use NA. But be consistent and don’t use -999 or ? or your cat’s name.

• Be careful with whitespace: “x” will be treated differently to “x ”, and Excel makes it easy to accidentally do the latter. Consider the strip.white=TRUE argument to read.csv.

• Think about the type of the data. We’ll cover this more, but are you dealing with a TRUE/FALSE value, a category, a count or a measurement?

• Dates and times will cause you nothing but pain. Excel and R both have issues with dates and times, and exporting through CSV can make them worse. I had a case with two different year-zero offsets being used in one exported file. I recommend Year-Month-Day (ISO 8601) format, or different columns for different entries that you combine later.

• Watch out for dashes between numbers. Excel will convert these into dates. So if you have “Site-Plant” style numbers, 5-20 will get converted into the 20th of May 1904 or something equally useless. Similar problems happen to gene names in bioinformatics!

• Merged rows and columns will not work (or at least not in an easily predictable way).

• Spare rows at the top, or double header rows will not work without jumping through hoops.

• Equations will (should) convert to the value displayed in Excel on export.
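To make the whitespace and missing-value points concrete, here is a small sketch. The file contents and column names are invented for illustration.

```r
# A made-up CSV with a trailing space, a leading space, an NA and a blank cell
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Site,Treatment,Count",
  "A,control ,12",
  "B,control,NA",
  "C, shade,"
), csv_file)

raw   <- read.csv(csv_file)                      # whitespace kept as typed
clean <- read.csv(csv_file, strip.white = TRUE)  # whitespace trimmed

unique(raw$Treatment)    # "control " and "control" count as different values
unique(clean$Treatment)  # only "control" and "shade" remain
is.na(clean$Count)       # the NA entry and the blank cell are both missing
```

Blank cells only become NA automatically in numeric and logical columns; in text columns a blank stays an empty string unless you also pass something like na.strings = c("NA", "").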

Exercise:

The file data/seed_root_herbivores.txt has almost the same data, but in tab separated format (it does have the same number of rows and columns). Look at the ?read.table help page and work out how to load this file in.

Remember: == tests for equality, != tests for inequality

or

The point here is that many of the functions and operators in R will try to do the Right Thing, depending on what you give them.

This won’t work, because the default arguments of read.table and read.csv are different for the header.

Notice that a fake header (V1, V2, etc) has been created and the actual header is now the first row of data.
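A sketch of both the problem and one possible fix, using a made-up tab-separated file rather than the real data/seed_root_herbivores.txt:

```r
# A tiny, hypothetical tab-separated file
tsv_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Plot\tHeight\tWeight",
  "plot-2\t31\t4.2",
  "plot-8\t42\t6.1"
), tsv_file)

# read.table assumes no header here, so the header line is read as data
wrong <- read.table(tsv_file, sep = "\t")
names(wrong)   # "V1" "V2" "V3"
nrow(wrong)    # 3: the real header has become the first row

# Either of these treats the first line as column names
right  <- read.table(tsv_file, header = TRUE, sep = "\t")
right2 <- read.delim(tsv_file)
names(right)   # "Plot" "Height" "Weight"
```

read.delim is simply read.table with defaults chosen for tab-separated files, in the same way that read.csv is read.table with defaults chosen for comma-separated files.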

There are other ways of looking at your data. The summary function works with most types, and gives a by-column summary of the data set
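For instance, on a hypothetical toy data.frame (the column names are assumptions standing in for the real data):

```r
# Hypothetical toy data with a factor, a logical and a numeric column
data <- data.frame(
  Plot           = factor(c("plot-2", "plot-2", "plot-8", "plot-8")),
  Seed.herbivore = c(TRUE, FALSE, TRUE, FALSE),
  Height         = c(31, 41, 42, 64)
)

summary(data)         # one block of output per column, suited to its type
summary(data$Height)  # minimum, quartiles, mean and maximum of one column
```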

Subsetting

So, we see there is an issue in the file – how do we get to it?

There are a bunch of different ways of extracting bits of your data.

Columns of data.frames

Get the column Plot

This does almost the same thing

This is the main difference: if the column name is in a variable, then $ won’t work, while [[ will. Let’s define a variable v that has the name of the first column as its value. We can extract this column of the data set using the [[ notation, but using the $ notation won’t work, as it will look for the column called v:

It returns NULL to indicate that the column does not exist (confusingly, this value can be difficult to work with and can give cryptic error messages later on. More confusingly, [[ also quietly returns NULL for a nonexistent column of a data.frame, while single square brackets, as in data["v"], generate an error instead).

Also, data$P will “expand” to make data$Plot, but data$S will return NULL because that is ambiguous. Always use the full name!
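Here is a compact sketch of the $ versus [[ behaviour. The toy data.frame is hypothetical; I have given it two columns starting with “S” so that the ambiguity shows up.

```r
# Hypothetical toy data; the column names are assumptions
data <- data.frame(
  Plot           = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Seed.herbivore = c(TRUE, FALSE, TRUE, FALSE),
  Seed.heads     = c(83, 175, 20, 210),
  Height         = c(31, 41, 42, 64)
)

data$Plot        # the column as a vector
data[["Plot"]]   # the same thing

v <- "Plot"
data[[v]]        # uses the value of v, so this is the Plot column
data$v           # NULL: looks for a column literally called "v"

data$P           # partial matching finds Plot, the only column starting with P
data$S           # NULL: Seed.herbivore and Seed.heads both start with S
```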

Single square brackets also index the data, but do so differently. This returns a data.frame with one column:

This returns a data.frame with two columns:

(I’m just using head here to keep the output under control. If you actually wanted a data.frame like this you might do

and then continue to use the new data.sub object).

The difference between [ and [[ can be confusing.

The best explanation I have seen is to imagine that the thing you are subsetting is a train with a bunch of carriages. [x] returns a new train with the carriages represented by the variable x. So train[c(1,2)] returns a train with just the first two carriages, and train[1] returns a train with just the first carriage. The [[ operator gets the contents of a single carriage. So train[[1]] gets the contents of the first carriage, and train[[c(1,2)]] doesn’t make any sense.
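The train analogy, sketched on a hypothetical toy data.frame (a data.frame being a train whose carriages are columns):

```r
# Hypothetical toy data
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64),
  Weight = c(4.2, 5.8, 6.1, 9.3)
)

one.carriage  <- data["Plot"]                # still a train: a one-column data.frame
two.carriages <- data[c("Plot", "Weight")]   # a two-column data.frame
contents      <- data[["Plot"]]              # what is inside the carriage: a vector

class(one.carriage)   # "data.frame"
class(contents)       # not a data.frame any more
```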

Plotting is covered in the next R module, but it’s one of the best things about R so I can’t resist showing how to do it:

Here is a histogram of the height variable:

(it will appear in the bottom right of your screen)

Here is a scatter plot of Height vs weight:

The order of arguments is x-variable, y-variable.

There is an alternative interface using R’s “formulae” – you’ll see this a lot in statistical models. Read this as “Height is a function of Weight”. It makes nicer axis labels, too.

Here is a series of bivariate plots for height, weight and the number of seed heads:

The take-home is that R makes it very easy to create graphs, and most people who use it casually just make plots of whatever they’re looking at. The plots can vary from quick and dirty like this to really beautiful pieces of art.
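The plots described above could be produced with calls along these lines. The toy data is made up, and using pairs for the “series of bivariate plots” is my assumption about which function was meant.

```r
# Hypothetical toy data
data <- data.frame(
  Height     = c(31, 41, 42, 64, 25, 47),
  Weight     = c(4.2, 5.8, 6.1, 9.3, 3.9, 6.6),
  Seed.heads = c(83, 175, 20, 210, 12, 150)
)

hist(data$Height)                    # histogram of a single variable

plot(data$Weight, data$Height)       # scatter plot: x-variable, then y-variable

plot(Height ~ Weight, data = data)   # formula interface: y ~ x

# Every pairwise scatter plot of the three variables in one figure
pairs(data[c("Height", "Weight", "Seed.heads")])
```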

Rows of data.frames

Extracting a row always returns a new data.frame
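A short sketch on a hypothetical toy data.frame: the row index goes before the comma, and leaving the part after the comma empty means “all columns”.

```r
# Hypothetical toy data
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64),
  Weight = c(4.2, 5.8, 6.1, 9.3)
)

second <- data[2, ]        # the second row, with every column
some   <- data[c(1, 3), ]  # rows 1 and 3

class(second)              # "data.frame"
```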

Be careful with indexing by location

The above all index by name or by location (index). However, you generally want to avoid referencing by number in your saved code, e.g.:

This is because if you change the order of your spreadsheet (add or delete a column), everything that depends on data.height may change. You may also see people do this in their code.

This should really be avoided. Indexing by name is much more robust and easier to read later on, even if it is more typing at first.

When should you index by location?

When you are computing the indices. As an example: suppose that you wanted every other row (perhaps you’re trying to generate a nonrandom sample of the data?). Remember seq from above? We can generate a sequence of integers 1, 3, …, up to the last (or second to last) row in our data set like this:

Then subset like this:

Our new data set has half the rows of the old data set:

Because row names are preserved, you can see the odd numbers in the row names.
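A sketch of the every-other-row idea on a hypothetical six-row data.frame (the name data.oddrows is just for illustration):

```r
# Hypothetical toy data with six rows
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-4", "plot-4", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64, 25, 47)
)

# Computed indices: 1, 3, 5
idx <- seq(1, nrow(data), by = 2)

data.oddrows <- data[idx, ]

nrow(data.oddrows)       # 3, half of the original 6
rownames(data.oddrows)   # "1" "3" "5": the original row names are kept
```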

Indexing by logical vector

This is one of the most powerful ways of indexing.

Remember our data mismatch:

There is one entry in the Height column that disagrees. How can we extract the line that the mismatch is on?

We could do it by index:

But that requires us to look for the error, note the row, write it down, etc. Boring, and computers are less error prone than humans. Plus, I just said that we should not do that.

This is a logical vector that indicates where the entries in vector 1 disagree with vector 2:

We can index by this - it will return rows for which there are true values:

You can convert from a logical (TRUE/FALSE) vector to an integer vector with the which function:

This can be really useful.
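Here is a sketch of the whole sequence. Both data.frames are made up, with one mismatch planted deliberately; data2 stands in for the second copy of the data.

```r
# Two hypothetical copies of the same data
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64)
)
data2 <- data
data2$Height[3] <- 24   # plant a single disagreement in row 3

# TRUE wherever the two Height columns disagree
mismatch <- data$Height != data2$Height
mismatch           # FALSE FALSE TRUE FALSE

data[mismatch, ]   # only the rows where mismatch is TRUE

which(mismatch)    # 3: the positions of the TRUE values
```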

Exercise:

1. Return all the rows in data where both data sets have the same value for Height.
2. Return all the rows in data from plot-8.

A solution:

or:

Read !x as “not x”.
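One way the answers might look, again on made-up data with a single planted mismatch (data2 is the hypothetical second copy):

```r
# Two hypothetical copies of the same data, differing in row 3
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64)
)
data2 <- data
data2$Height[3] <- 24

# 1. Rows where both data sets agree on Height, written two ways
same.a <- data[data$Height == data2$Height, ]
same.b <- data[!(data$Height != data2$Height), ]

# 2. Rows from plot-8
plot8 <- data[data$Plot == "plot-8", ]
```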

Subsetting can be useful when you want to look at bits of your data. For example, all the rows where the Height is at least 10 and there was no seed herbivore:

The & operator here is a logical “and” (read x & y as “x and y”):

• TRUE & TRUE is TRUE
• TRUE & FALSE is FALSE
• FALSE & TRUE is FALSE
• FALSE & FALSE is FALSE

In contrast, the | operator is a logical “or” (read x | y as “x or y”):

• TRUE | TRUE is TRUE
• TRUE | FALSE is TRUE
• FALSE | TRUE is TRUE
• FALSE | FALSE is FALSE

The other, less common, operator is the exclusive or:

• xor(TRUE, TRUE) is FALSE
• xor(TRUE, FALSE) is TRUE
• xor(FALSE, TRUE) is TRUE
• xor(FALSE, FALSE) is FALSE

So you can do all sorts of crazy things like

and get all the cases in plot 2 where there were both seed herbivores and root herbivores. Or

and get all the plants that are quite tall in treatments with either a seed herbivore or a root herbivore (or both).

You can build these up if you want:

whatever you find easiest to read and write.
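A sketch of these combined subsets. The data, the column names Seed.herbivore and Root.herbivore, and the cut-off of 40 for “quite tall” are all assumptions made for illustration.

```r
# Hypothetical toy data
data <- data.frame(
  Plot           = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Seed.herbivore = c(TRUE, FALSE, TRUE, FALSE),
  Root.herbivore = c(TRUE, TRUE, FALSE, FALSE),
  Height         = c(31, 41, 42, 64)
)

# Height at least 10 and no seed herbivore
no.seed <- data[data$Height >= 10 & !data$Seed.herbivore, ]

# Plot 2, with both seed and root herbivores
both <- data[data$Plot == "plot-2" & data$Seed.herbivore & data$Root.herbivore, ]

# Quite tall, with a seed herbivore or a root herbivore (or both)
tall.a <- data[data$Height > 40 & (data$Seed.herbivore | data$Root.herbivore), ]

# The same subset, built up from named pieces
idx.tall      <- data$Height > 40
idx.herbivore <- data$Seed.herbivore | data$Root.herbivore
tall.b <- data[idx.tall & idx.herbivore, ]
```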

“Programs must be written for people to read, and only incidentally for machines to execute” (“Structure and Interpretation of Computer Programs” by Abelson and Sussman).

The subset function to simplify writing complex subsets

There is a function subset that may help you write complex subsets.

This can help, especially interactively, but it can also bite you. It is not always obvious where the “value” of the variables in the second argument are coming from. For example:

This works fine, because it found idx.tall. So when you read your code, you need to think carefully about which values are coming from the data.frame and which are coming from elsewhere.

This is an unfortunate example of a function designed to be used by beginners, but it is only really understandable once you understand more of what is going on. You’ll see it used widely, and it can simplify things. But be careful.
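To illustrate where the names come from, here is a sketch on hypothetical data, with an assumed cut-off of 40 for “tall”:

```r
# Hypothetical toy data
data <- data.frame(
  Plot           = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Seed.herbivore = c(TRUE, FALSE, TRUE, FALSE),
  Height         = c(31, 41, 42, 64)
)

# Height and Seed.herbivore are both looked up inside data
sub.a <- subset(data, Height > 40 & Seed.herbivore)

# idx.tall is not a column, so it is found in the workspace instead,
# while Seed.herbivore still comes from data
idx.tall <- data$Height > 40
sub.b <- subset(data, idx.tall & Seed.herbivore)
```

Both calls return the same single row here, but nothing in the second call tells you which name lives where.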

It is easy to add new columns, perhaps based on old ones:

You can delete a column by setting it to NULL:
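For example, on a hypothetical toy data.frame (the new column name log.Height is just for illustration):

```r
# Hypothetical toy data
data <- data.frame(
  Plot   = c("plot-2", "plot-2", "plot-8", "plot-8"),
  Height = c(31, 41, 42, 64)
)

# Add a new column computed from an existing one
data$log.Height <- log(data$Height)
names.with <- names(data)      # "Plot" "Height" "log.Height"

# Delete it again by assigning NULL
data$log.Height <- NULL
names.without <- names(data)   # "Plot" "Height"
```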

In this data set, the last column contains the number of seeds in 25 seed heads. However, there weren’t always 25 seed heads on a plant:

In these three cases, the column contains the number of seeds over all seed heads.

Question: How do we compute the mean number of seeds per seed head?

R generally offers several ways of doing things:

Use the all function to determine if all values are TRUE:
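Here are three possible approaches, sketched on invented numbers. The column names Seed.heads and Seeds.in.25.heads are assumptions. The idea in each case is to divide by 25, unless the plant had fewer than 25 seed heads, in which case divide by the number it actually had.

```r
# Hypothetical toy data: two plants have fewer than 25 seed heads
data <- data.frame(
  Seed.heads        = c(83, 20, 175, 12),
  Seeds.in.25.heads = c(1000, 600, 1250, 300)
)

# Way 1: start with 25 everywhere, then overwrite the short cases
heads.counted <- rep(25, nrow(data))
few <- data$Seed.heads < 25
heads.counted[few] <- data$Seed.heads[few]
per.head.1 <- data$Seeds.in.25.heads / heads.counted

# Way 2: choose the divisor element by element with ifelse
per.head.2 <- data$Seeds.in.25.heads /
  ifelse(data$Seed.heads < 25, data$Seed.heads, 25)

# Way 3: the divisor is the smaller of Seed.heads and 25
per.head.3 <- data$Seeds.in.25.heads / pmin(data$Seed.heads, 25)

# Check that the three approaches agree everywhere
all(per.head.1 == per.head.2)
all(per.head.2 == per.head.3)
```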

(bonus topic) Indexing need not make things smaller

Given this vector with the first five letters of the alphabet:

Repeat the first letter once, the second letter twice, etc.

Much better!

rep is incredibly useful, and can be used in many ways. See the help page ?rep
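A sketch of the idea: an index vector can mention the same position more than once, so the result can be longer than the original.

```r
x <- c("a", "b", "c", "d", "e")

# By hand: tedious and easy to get wrong
by.hand <- x[c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)]

# Let rep build the index: position i is repeated i times
by.rep <- x[rep(1:5, 1:5)]

# Or skip the indexing and repeat the elements directly
direct <- rep(x, 1:5)

length(by.rep)   # 15
```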
