At some point, you will want to write a function, and it will
probably be sooner than you think. Functions are core to the way
that R works, and the sooner that you get comfortable writing them,
the sooner you’ll be able to leverage R’s power, and start having
fun with it.
The first function many people seem to need to write is to compute
the standard error of the mean for some variable, because curiusly
this function does not come with R’s base package. This is defined
as $\sqrt{\mathrm{var}(x)/n}$ (that is the square root of the
variance divided by the sample size.
Start by reloading our data set again.
1
data <- read.csv("data/seed_root_herbivores.csv")
We can already easily compute the mean
1
mean(data$Height)
1
## [1] 55.54
and the variance
1
var(data$Height)
1
## [1] 250.6
and the sample size
1
length(data$Height)
1
## [1] 169
so it seems easy to compute the standard error:
1
sqrt(var(data$Height)/length(data$Height))
1
## [1] 1.218
notice how data$Height is repeated there — not desirable.
Suppose we now want the standard error of the dry weight too:
1
sqrt(var(data$Weight)/length(data$Weight))
1
## [1] 0.7616
This is basically identical to the height case above. We’ve copied
and pasted the definition and replaced the variable that we are
interested in. This sort of substitution is tedious and error
prone, and the sort of things that computers are a lot better at
doing reliably than humans are.
It is also just not that clear from what is written what the
point of these lines is. Later on, you’ll be wondering what
those lines are doing.
Look more carefully at the two statements and see the similarity in
form, and what is changing between them. This pattern is the key
to writing functions.
The result of the last line is “returned” from the function.
We can call it like this:
1
standard.error(data$Height)
1
## [1] 1.218
1
standard.error(data$Weight)
1
## [1] 0.7616
Note that x has a special meaning within the curly braces. If we
do this:
12
x <-1:100standard.error(data$Height)
1
## [1] 1.218
we get the same answer. Because x appears in the “argument
list”, it will be treated specially. Note also that it is
completely unrelated to the name of what is provided as value to
the function.
You can define variables within functions
12345
standard.error <-function(x){ v <- var(x) n <- length(x) sqrt(v/n)}
This can often help you structure your function and your thoughts.
These are also treated specially — they do not affect the main
workspace (the “global environment”) and are destroyed when the
function ends. If you had some value v in the global
environment, it would be ignored in this function as soon as the
local v was defined, with the local definition used instead.
12
v <-1000standard.error(data$Height)
1
## [1] 1.218
Another example.
We used the variance function above, but let’s rewrite it.
The sample variance is defined as
This case is more compliated, so we’ll do it in pieces.
We’re going to use x for the argument, so name our first input
data x so we can use it.
1
x <- data$Height
The first term is easy:
12
n <- length(x)(1/(n -1))
1
## [1] 0.005952
The second term is harder. We want the difference between all the
x values and the mean.
Watch that you don’t do this, which is quite different
1
sum(x - m)^2
1
## [1] 1.022e-25
(this follows from the definition of the mean)
Putting both halves together, the variance is
1
(1/(n -1))* sum((x - m)^2)
1
## [1] 250.6
Which agrees with R’s variance function
1
var(x)
1
## [1] 250.6
The rm function cleans up:
1
rm(n, x, m)
We can then define our function, using the pieces that we wrote above.
12345
variance <-function(x){ n <- length(x) m <- mean(x)(1/(n -1))* sum((x - m)^2)}
And test it:
1
variance(data$Height)
1
## [1] 250.6
1
var(data$Height)
1
## [1] 250.6
1
variance(data$Weight)
1
## [1] 98.02
1
var(data$Weight)
1
## [1] 98.02
An aside on floating point comparisons:
Our function does not exactly agree with R’s function
1
variance(data$Height)== var(data$Height)
1
## [1] FALSE
This is not because one is more accurate than the other, it is
because “machine precision” is finite (that is, the number of
decimal places being kept).
1
variance(data$Height)- var(data$Height)
1
## [1] 2.842e-14
This affects all sorts of things:
1
sqrt(2)* sqrt(2)# looks like 2
1
## [1] 2
1
sqrt(2)* sqrt(2)-2# but not quite
1
## [1] 4.441e-16
So be careful with == for floating point comparisons. Usually
you have do something like:
1
abs(x - y)< eps
For some small value eps. The all.equal function can be very
helpful here.
Exercise: define a function to compute skew
Skewness is a measure of asymmetry of a probability distribution.
It can be defined as
Write a function that computes the skewness.
Hints:
Don’t try to do this in one step, but use intermediate variables
like the second version of standard.error, or like our
variance function.
1
##
The term on the top of the fraction is a lot like the variance
function.
Remember that parentheses can help with order-of-operation
control.
Get the pieces of your function working before putting it all
together.
123456789
skewness <-function(x){ n <- length(x) v <- var(x) m <- mean(x) third.moment <-(1/(n -2))* sum((x - m)^3) third.moment/(var(x)^(3/2))}skewness(data$Height)# should be 0.301