Missing data
Missing data are a fact of life in biology. Individuals die,
equipment breaks, you forget to measure something, you can’t read
your writing, etc.
If you load data with blank cells in a numeric column, each blank
will appear as an NA value.
| data <- read.csv("data/seed_root_herbivores.csv")
|
Here is some data to play with: a vector x of ten values.
If the 5th element were missing, this is what it would look like:
| ## [1] 31 41 42 64 NA 52 57 27 40 33
|
Note that this is not a string “NA”; that is something different
entirely.
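To follow along without the CSV file, here is a minimal sketch. The vector is typed in by hand from the output above, so the name x and the hand-entered values are assumptions, not the original loading code. It also shows that a missing value and the string "NA" behave differently.

```r
# Ten values copied from the printed output; the 5th is missing.
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

x[5]          # a missing numeric value
is.na(x[5])   # TRUE
is.na("NA")   # FALSE: this is an ordinary two-character string
class(x)      # "numeric": the NA has not turned x into text
```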
Treat a missing value as a placeholder for a number that could be
anything. So what are the values of these?
| 1 + NA
1 * NA
NA + NA
|
These are all NA because if the input could be anything, the output
could be anything.
What is the value of mean(x)?
It’s NA too, because x[1] + x[2] + NA + ... must be NA, and then
NA/length(x) is also NA.
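This reasoning can be checked step by step. The sketch below uses a hand-typed copy of the example vector, which is an assumption since the original values come from the CSV file.

```r
# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

sum(x)              # NA: one unknown term makes the total unknown
sum(x) / length(x)  # NA: an unknown total divided by 10 is still unknown
mean(x)             # NA
```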
This is a pretty common situation for data, so the mean function
takes an na.rm argument that removes missing values before the
calculation. sum takes the same argument too.
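A short sketch of na.rm in use, again on a hand-typed copy of the example vector (an assumption):

```r
# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

mean(x, na.rm = TRUE)  # 43: the mean of the nine observed values
sum(x, na.rm = TRUE)   # 387
```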
Be careful though:
| sum(x, na.rm = TRUE)/length(x) # not the mean!
|
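The catch is that length(x) still counts the missing element, so the denominator is too large. A sketch of the difference follows. The names n_all and n_observed are made up for illustration, and is.na (used here to count observed values) is covered further down.

```r
# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

n_all <- length(x)            # 10: includes the NA
n_observed <- sum(!is.na(x))  # 9: only the values we actually have

sum(x, na.rm = TRUE) / n_all       # 38.7: too small
sum(x, na.rm = TRUE) / n_observed  # 43: matches mean(x, na.rm = TRUE)
```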
The na.omit function will strip out all NA values:
| ## [1] 31 41 42 64 52 57 27 40 33
## attr(,"na.action")
## [1] 5
## attr(,"class")
## [1] "omit"
|
So we can compute the mean from the stripped vector with mean(na.omit(x)).
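A quick check of this idiom on a hand-typed copy of the example vector (an assumption):

```r
# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

length(na.omit(x))  # 9: the NA is gone
mean(na.omit(x))    # 43: same answer as mean(x, na.rm = TRUE)
```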
You can’t test for NA-ness with ==; x == NA gives:
| ## [1] NA NA NA NA NA NA NA NA NA NA
|
(why not?)
Use is.na instead:
| ## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
|
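One way to see why == fails is that comparing against an unknown value gives an unknown answer, even when both sides are NA. The sketch below shows this, along with a few common ways to use is.na, on a hand-typed copy of the example vector (an assumption).

```r
# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

NA == NA         # NA: are two unknown values equal? Unknown.
which(is.na(x))  # 5: position of the missing value
sum(is.na(x))    # 1: number of missing values
anyNA(x)         # TRUE: is anything missing at all?
```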
So na.omit(x) is (roughly) equivalent to x[!is.na(x)]:
| ## [1] 31 41 42 64 52 57 27 40 33
|
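The "roughly" is about attributes: indexing with is.na keeps the same values but does not record which elements were dropped. A sketch, where the name kept is made up for illustration and x is a hand-typed copy of the example vector (an assumption):

```r
# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

kept <- x[!is.na(x)]           # logical indexing: keep the non-missing values

all(kept == na.omit(x))        # TRUE: the values are the same
is.null(attributes(kept))      # TRUE: plain vector, nothing attached
names(attributes(na.omit(x)))  # "na.action": na.omit remembers what it dropped
```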
Exercise
Our standard error function doesn’t deal well with missing values:
| standard.error <- function(x) {
    v <- var(x)
    n <- length(x)
    sqrt(v/n)
  }
|
Can you write one that always filters missing values?
If we get time, we’ll talk about how to write one that optionally
gets rid of missing values.
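If you want to compare notes after trying it yourself, here is one possible sketch of the optional version. The na.rm argument name simply mirrors mean and sum. That naming is a choice made here rather than anything the exercise prescribes, and x is a hand-typed copy of the example vector (an assumption).

```r
# One possible answer: drop missing values only when asked to.
standard.error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]  # filter first so that n counts observed values only
  }
  v <- var(x)
  n <- length(x)
  sqrt(v / n)
}

# Hand-typed copy of the example vector (assumed values).
x <- c(31, 41, 42, 64, NA, 52, 57, 27, 40, 33)

standard.error(x)                # NA: the missing value propagates
standard.error(x, na.rm = TRUE)  # standard error of the nine observed values
```

Filtering before computing both var and length keeps the two consistent. Passing na.rm = TRUE to var alone would leave n counting the missing value.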
Other special values:
Positive and negative infinities (Inf and -Inf).
Not a number, NaN (different from NA, but usually treatable the same
way).
NULL, which we saw before. It’s the weirdest.
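A brief sketch of how these special values arise and how they can be tested for, using only base R:

```r
1 / 0    # Inf
-1 / 0   # -Inf
0 / 0    # NaN

is.na(NaN)  # TRUE: is.na catches NaN as well as NA
is.nan(NA)  # FALSE: but NA is not NaN

is.finite(c(1, Inf, NA, NaN))  # TRUE FALSE FALSE FALSE

length(NULL)           # 0: NULL is "nothing at all", not a missing value
is.null(NULL)          # TRUE
length(c(1, NULL, 2))  # 2: NULL vanishes when combined with other values
```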