WELCOME TO SOFTWARE CARPENTRY UTS 2014.02.18

Please download this file, put it somewhere on your computer (e.g. the desktop) and unzip it. We'll be working in various subdirectories of it today.
https://github.com/nicercode/2014-02-18-UTS/raw/gh-pages/data/lessons.zip

Is there a function to clear the screen in R? Yes - Control-L (Command-L on Mac). Hooray! Thanks.

#### Functions part 1

# We need a function because we'll be doubling a lot of numbers
double <- function(value) {
  value + value
}

# The mean of a vector of numbers
average <- function(numbers) {
  sum(numbers) / length(numbers)
}

# The sample variance, written as a single expression
variance <- function(some.data) {
  1 / (length(some.data) - 1) * sum((some.data - mean(some.data))^2)
}

# The same calculation, broken into named intermediate steps
variance <- function(numbers) {
  dx2 <- (numbers - mean(numbers))^2
  nm1 <- length(numbers) - 1
  1 / nm1 * sum(dx2)
}

# Try the functions on some random numbers
set.seed(1)
x <- rnorm(10)
x
average(x)
mean(x)

# Sample skewness: third moment divided by variance^(3/2)
skew <- function(x) {
  n <- length(x)
  moment.3 <- (1 / (n - 2)) * sum((x - mean(x))^3)
  moment.3 / (var(x)^(3/2))
}
skew(x)  # 0.3161183

#### Functions part 2 - gapminder

data <- read.csv("gapminder-FiveYearData.csv", stringsAsFactors=FALSE)  # load data
data.1982 <- data[data$year == 1982, ]  # subset to year 1982

# Two equivalent ways of plotting life expectancy against (log) GDP per capita
plot(lifeExp ~ gdpPercap, data.1982, log="x")
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x")

# Scale point sizes by population: rescale sqrt(pop) onto the range [0.2, 10]
tmp <- sqrt(data.1982$pop)
p <- (tmp - min(tmp)) / (max(tmp) - min(tmp))
range(p)
cex <- 0.2 + p * (10 - 0.2)
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex)

# The same rescaling, wrapped up as a reusable function
linear.rescale <- function(x, r.out) {
  p <- (x - min(x)) / (max(x) - min(x))
  r.out[[1]] + p * (r.out[[2]] - r.out[[1]])
}
cex <- linear.rescale(sqrt(data.1982$pop), c(0.2, 10))
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex)

# Your challenge: write a function that fits a linear model for a given continent
# using `data`, and add trend lines for all continents to your plot.
# Continents: "Asia" "Europe" "Africa" "Americas" "Oceania"
# Here are the arguments:
add.trend.line <- function(continent.name, data, col) {
  dsub <- data[data$continent == continent.name, ]
  fit <- lm(lifeExp ~ log10(gdpPercap), data=dsub)
  abline(fit, col=col, lwd=2)
}

plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex)
add.trend.line("Europe", data.1982, "red")
add.trend.line("Americas", data.1982, "blue")
add.trend.line("Africa", data.1982, "green")
add.trend.line("Oceania", data.1982, "green")

# Use a lookup table so each continent gets a consistent colour for points and lines
col.table <- c(Asia="tomato", Europe="chocolate4", Africa="dodgerblue2",
               Americas="darkgoldenrod1", Oceania="green4")
col <- col.table[data.1982$continent]
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex, col=col)
add.trend.line("Europe", data.1982, col.table["Europe"])
add.trend.line("Americas", data.1982, col.table["Americas"])
add.trend.line("Africa", data.1982, col.table["Africa"])
add.trend.line("Oceania", data.1982, col.table["Oceania"])
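A possible shortcut (a sketch, not part of the lesson as given; `continent` is just a loop variable name chosen for illustration): since col.table is named by continent, you can loop over its names instead of writing one add.trend.line() call per continent.

plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex, col=col)
for (continent in names(col.table)) {
  add.trend.line(continent, data.1982, col.table[[continent]])
}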
GENERAL COMMENTS:
- The R code is hard to follow without any comments explaining what the different lines do (I mean the code being shown on the projector screen).
- OK. We aim for the examples to be reasonably self-documenting. If you are still unsure, please ask more questions - to the whole class, to one of the tutors, or to your neighbour!

##### PROJECTS ###########################

1. In your project lesson folder, create 6 new folders: R, data, docs, and output (with subfolders data and figures).
2. Drag your gapminder-FiveYearData.csv file into the data folder.
3. In the project folder, create a new empty file: figures.R.
4. In the R folder, create two new files: functions.R and figures_functions.R.
5. From your analysis.R file, move all the 'analytical functions' into R/functions.R.
6. From your analysis.R file, move all the 'plotting functions' into R/figures_functions.R.
7. At the very beginning of your analysis.R file, source the new function files you just created in the R folder, and load the libraries needed for them to work (in this case only plyr).
8. Remove any other library() lines from your analysis.R file.
9. Make sure you adjust the file path so the data is still read correctly: data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
10. Move the last part of your analysis.R file (making the plot) into figures.R.
11. In figures.R, source the function files the same way you just did in analysis.R. Also clear the workspace on the first line with: rm(list=ls())
12. Between the source() lines and the plotting code, add two lines to read and subset your data: data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE) and data.1982 <- data[data$year == 1982, ]
13. Finally, quit RStudio and reopen it from the projects.Rproj file in your project root directory.
14. Open analysis.R and figures.R and run each script in its entirety.

Alternative ways of organising your project:
* Carl Boettiger's project setup: http://carlboettiger.info/2012/05/06/research-workflow.html
* PLoS Computational Biology article: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424

##### PLYR ############

# Exercise 1: write a function that takes a data frame as input and returns the
# number of countries, assuming there is a column called `country`
get.n.countries <- function(data) {
  length(unique(data$country))
}

# You should find 142 countries in your data
get.n.countries(data)

# Number of countries per continent, using the plyr package
library(plyr)
daply(data, .(continent), get.n.countries)
daply(data, c("continent"), get.n.countries)

# You can split by multiple variables, e.g. year and continent
daply(data, c("continent", "year"), get.n.countries)

Note: plyr uses "array" to mean "vector" or "matrix"; basically any object containing just one type. If that's still confusing, sing out and we'll talk about it more.

# These are all equivalent ways of specifying the splitting variables
ddply(data, .(continent, year), get.n.countries)
ddply(data, c("continent", "year"), get.n.countries)
ddply(data, ~continent*year, get.n.countries)

# Exercise 2: write a function that takes a data frame as input and returns the
# sum of the total population (variable "pop"), then apply it to the data,
# splitting by year and continent
get.total.pop <- function(x) {
  sum(x$pop)
}
get.total.pop(data)  # total number of sampled individuals in data: 50440465802

out <- ddply(data, .(continent, year), get.total.pop)
# or as an anonymous function
out <- ddply(data, .(continent, year), function(x) sum(x$pop))
names(out)[3] <- "total_pop"  # rename the output column

Naming the output: http://stackoverflow.com/questions/1395271/renaming-the-output-column-with-the-plyr-package-in-r

# Alternatively, return a one-column data frame with the name you want
get.total.pop <- function(x) {
  data.frame(pop=sum(x$pop))
}

This works by creating a *named* result (here a one-column data frame) in the function that we repeatedly apply. When ddply puts the pieces back together for us, it notices the name and creates the output column name appropriately. Sweeeeet.
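A quick check of that behaviour (a sketch; `pop.by.group` is just an illustrative name): with the data.frame-returning version of get.total.pop, ddply names the summary column itself, so no manual renaming is needed.

pop.by.group <- ddply(data, .(continent, year), get.total.pop)
head(pop.by.group)  # columns: continent, year, pop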
# Exercise 3: write a function that takes a data frame as input and returns the
# maximum of the variable gdpPercap, then apply it to the data, splitting by
# year and continent

# Solution with named columns, returning both the maximum and the minimum
max.gdp <- ddply(data, .(continent, year), function(x)
  data.frame(gdp.max=max(x$gdpPercap), gdp.min=min(x$gdpPercap)))

Overview of the plyr functions: http://nicercode.github.io/2014-02-13-UNSW/lessons/40-repeating/full_apply_suite.png

# A simpler solution, returning just the maximum
get.max.gdp <- function(x) {
  max(x$gdpPercap)
}
ddply(data, .(continent, year), get.max.gdp)

ddply(data, .(year, continent), function(x) data.frame(tot_pop=sum(x$pop)))
# This also works here - just named elements in a vector
ddply(data, .(year, continent), function(x) c(tot_pop=sum(x$pop)))

# Example of something returning a list
fit <- lm(lifeExp ~ log10(gdpPercap), data=data)
str(fit)  # use str to see what's in the object called fit

# Exercise 4: get the unique names of countries in the dataset; then, using
# dlply, write a function that gets the names of the countries for each continent
unique.country.names <- dlply(data, .(continent), function(x) unique(x$country))

# Example of a simple list
ex_list <- list(a_vector=c(1, 2, 3, 4, 5, 6),
                a_matrix=matrix(1:9, 3, 3),
                a_dataframe=head(data))
names(ex_list)  # the names of the objects contained in our list
str(ex_list)    # shows the full structure of our list

#### Fitting a model: slope, intercept, R-squared and p-value

model <- function(x) {
  lm(lifeExp ~ log10(gdpPercap), data=x)
}
fitted.linear.models <- dlply(data, .(continent, year), model)

# P-value extraction. It's almost like they don't want you to do it:
anova(fit)[1, "Pr(>F)"]
# R-squared:
summary(fit)$r.squared
# Coefficient table for the first fitted model, and the p-value for its slope:
tab <- coef(summary(fitted.linear.models[[1]]))
tab[2, "Pr(>|t|)"]

# Return the fit summaries as a data frame so ddply can combine them into a table
model <- function(x) {
  fit <- lm(lifeExp ~ log10(gdpPercap), data=x)
  data.frame(n=length(x$lifeExp),
             r2=summary(fit)$r.squared,
             a=coef(fit)[[1]],
             b=coef(fit)[[2]],
             p=anova(fit)[1, "Pr(>F)"])
}
fitted.linear.models <- ddply(data, .(continent, year), model)

# A more general version: pass the x and y variable names as arguments
model <- function(d, x, y) {
  fit <- lm(d[[y]] ~ log10(d[[x]]), data=d)
  data.frame(n=length(d[[x]]),
             r2=summary(fit)$r.squared,
             a=coef(fit)[[1]],
             b=coef(fit)[[2]],
             p=anova(fit)[1, "Pr(>F)"])
}
fitted.linear.models <- ddply(data, .(continent, year), model, x="gdpPercap", y="lifeExp")
fitted.linear.models2 <- ddply(data, .(continent, year), model, x="gdpPercap", y="pop")
format(fitted.linear.models2, digits=2)
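A related sketch (not covered in the lesson as given; `fits` and `slopes` are illustrative names): if you want to keep the full lm objects, dlply stores them in a list, and ldply can then collapse per-fit summaries back into a data frame, with the split labels (continent, year) coming along as columns.

fits <- dlply(data, .(continent, year),
              function(x) lm(lifeExp ~ log10(gdpPercap), data=x))
slopes <- ldply(fits, function(fit) data.frame(slope=coef(fit)[[2]]))
head(slopes)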
R cheatsheets:
* http://www.amaynard.ca/computing/R_Cheatsheet.pdf
* http://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf
* http://cran.r-project.org/doc/contrib/Short-refcard.pdf

R search engine: http://Rseek.org
Search the built-in help: help.search("summary")
Stack Overflow's R section: http://stackoverflow.com/questions/tagged/r
e.g., linear models in R: http://stackoverflow.com/search?q=%5Br%5D+linear+model

https://etherpad.mozilla.org/ep/pad/view/ro.VPU-xAZD/latest

###################### Testing #####################################

source("R/functions.R")
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
library(testthat)
data.1982 <- data[data$year == 1982, ]

# This should throw an error
expect_that(variance(data.1982$gdpPercap) + 100, equals(var(data.1982$gdpPercap)))

# Contents of the test-variance.R file
x <- runif(100)
expect_that(variance(x), equals(var(x)))
expect_that(length(variance(x)) == 1, is_true())
expect_that(is.nan(variance(1)), is_true())

x.with.nas <- runif(10)
x.with.nas[c(2, 6, 8)] <- NA
x.without.nas <- na.omit(x.with.nas)
expect_that(variance(x.with.nas), equals(variance(x.without.nas)))
# END OF THE test-variance.R file

# Definitions of the functions (in R/functions.R)
average <- function(x) {
  sum(x) / length(x)
}
variance <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before computing the variance
  (1 / (length(x) - 1)) * sum((x - mean(x))^2)
}

# Run the tests
source("R/functions.R")
library(testthat)
test_file("test-variance.R")

# Exercise: test the 'average' function. Test it against R's mean() function, and against as many tricky cases as you can think of. Think in particular about how 'average' should deal with missing values - it should probably do the same thing as the 'variance' function, which will require you to modify it a little.

# Links from the chat window
11:44 Pascal: some people might find this helpful, i did: http://yangfeng.wordpress.com/2010/05/19/abbreviations-of-r-commands/
11:44 Pascal: common abbreviations for R commands
14:45 Aaron: for the person who was squinting to count whether the populations were millions or billions:
14:46 Aaron: prettyNum(x, big.mark=",") will group 000's with commas
15:09 Pascal: choosing your beer using R: http://blog.yhathq.com/posts/recommender-system-in-r.html
16:25 Tim: dummy data used in tests are known as "test fixtures"
17:04 Tim: Apart from writing unit tests for each function, putting integration tests in your main analysis code is also useful - these check that all your functions are working together as expected to produce known-good results from smaller subsets of your data.
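Following up on Tim's last point, a rough sketch of what such an integration test could look like (assuming the project layout from the PROJECTS section; the expected values - 142 countries and a total population of 50440465802 - come from earlier in these notes):

library(testthat)
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
expect_that(length(unique(data$country)), equals(142))  # number of countries
expect_that(sum(data$pop), equals(50440465802))         # total sampled population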