WELCOME TO SOFTWARE CARPENTRY UTS 2014.02.18

Please download this file, put it somewhere on your computer (e.g. the desktop) and unzip it. We'll be working in various subdirectories of it today.
https://github.com/nicercode/2014-02-18-UTS/raw/gh-pages/data/lessons.zip

Is there a function to clear the screen in R? Yes - Control-L (Command-L on Mac). Hooray! Thanks.

#### Functions part 1

# We need a function because we'll be doubling a lot of numbers
double <- function(value) {
  value + value
}

# The mean of a vector of numbers
average <- function(numbers) {
  sum(numbers) / length(numbers)
}

# The sample variance, written as a single expression
variance <- function(some.data) {
  1 / (length(some.data) - 1) * sum((some.data - mean(some.data))^2)
}

# The same calculation, broken into named intermediate steps
variance <- function(numbers) {
  dx2 <- (numbers - mean(numbers))^2
  nm1 <- length(numbers) - 1
  1 / nm1 * sum(dx2)
}

# Try the functions on some random numbers
set.seed(1)
x <- rnorm(10)
x
average(x)
mean(x)

# Sample skewness: third moment divided by variance^(3/2)
skew <- function(x) {
  n <- length(x)
  moment.3 <- (1 / (n - 2)) * sum((x - mean(x))^3)
  moment.3 / (var(x)^(3/2))
}
skew(x)  # 0.3161183

#### Functions part 2 - gapminder

data <- read.csv("gapminder-FiveYearData.csv", stringsAsFactors=FALSE)  # load data
data.1982 <- data[data$year == 1982, ]  # subset to year 1982

# Two equivalent ways of plotting life expectancy against (log) GDP per capita
plot(lifeExp ~ gdpPercap, data.1982, log="x")
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x")

# Scale point sizes by population: rescale sqrt(pop) onto the range [0.2, 10]
tmp <- sqrt(data.1982$pop)
p <- (tmp - min(tmp)) / (max(tmp) - min(tmp))
range(p)
cex <- 0.2 + p * (10 - 0.2)
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex)

# The same rescaling, wrapped up as a reusable function
linear.rescale <- function(x, r.out) {
  p <- (x - min(x)) / (max(x) - min(x))
  r.out[[1]] + p * (r.out[[2]] - r.out[[1]])
}
cex <- linear.rescale(sqrt(data.1982$pop), c(0.2, 10))
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex)

# Your challenge: write a function that fits a linear model for a given continent
# using `data`, and add trend lines for all continents to your plot.
# Continents: "Asia" "Europe" "Africa" "Americas" "Oceania"
# Here are the arguments:
add.trend.line <- function(continent.name, data, col) {
  dsub <- data[data$continent == continent.name, ]
  fit <- lm(lifeExp ~ log10(gdpPercap), data=dsub)
  abline(fit, col=col, lwd=2)
}

plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex)
add.trend.line("Europe", data.1982, "red")
add.trend.line("Americas", data.1982, "blue")
add.trend.line("Africa", data.1982, "green")
add.trend.line("Oceania", data.1982, "green")

# Use a lookup table so each continent gets a consistent colour for points and lines
col.table <- c(Asia="tomato", Europe="chocolate4", Africa="dodgerblue2",
               Americas="darkgoldenrod1", Oceania="green4")
col <- col.table[data.1982$continent]
plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex, col=col)
add.trend.line("Europe", data.1982, col.table["Europe"])
add.trend.line("Americas", data.1982, col.table["Americas"])
add.trend.line("Africa", data.1982, col.table["Africa"])
add.trend.line("Oceania", data.1982, col.table["Oceania"])
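A possible shortcut (a sketch, not part of the lesson as given; `continent` is just a loop variable name chosen for illustration): since col.table is named by continent, you can loop over its names instead of writing one add.trend.line() call per continent.

plot(data.1982$gdpPercap, data.1982$lifeExp, log="x", cex=cex, col=col)
for (continent in names(col.table)) {
  add.trend.line(continent, data.1982, col.table[[continent]])
}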
GENERAL COMMENTS:
- The R code is hard to follow without any comments explaining what the different lines do (I mean the code being shown on the projector screen).
- OK. We aim for the examples to be reasonably self-documenting. If you are still unsure, please ask more questions - to the whole class, to one of the tutors, or to your neighbour!

##### PROJECTS ###########################

1. In your project lesson folder, create 6 new folders: R, data, docs, and output (with subfolders data and figures).
2. Drag your gapminder-FiveYearData.csv file into the data folder.
3. In the project folder, create a new empty file: figures.R.
4. In the R folder, create two new files: functions.R and figures_functions.R.
5. From your analysis.R file, move all the 'analytical functions' into R/functions.R.
6. From your analysis.R file, move all the 'plotting functions' into R/figures_functions.R.
7. At the very beginning of your analysis.R file, source the new function files you just created in the R folder, and load the libraries needed for them to work (in this case only plyr).
8. Remove any other library() lines from your analysis.R file.
9. Make sure you adjust the file path so the data is still read correctly: data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
10. Move the last part of your analysis.R file (making the plot) into figures.R.
11. In figures.R, source the function files the same way you just did in analysis.R. Also clear the workspace on the first line with: rm(list=ls())
12. Between the source() lines and the plotting code, add two lines to read and subset your data: data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE) and data.1982 <- data[data$year == 1982, ]
13. Finally, quit RStudio and reopen it from the projects.Rproj file in your project root directory.
14. Open analysis.R and figures.R and run each script in its entirety.

Alternative ways of organising your project:
* Carl Boettiger's project setup: http://carlboettiger.info/2012/05/06/research-workflow.html
* PLoS Computational Biology article: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424

##### PLYR ############

# Exercise 1: write a function that takes a data frame as input and returns the
# number of countries, assuming there is a column called `country`
get.n.countries <- function(data) {
  length(unique(data$country))
}

# You should find 142 countries in your data
get.n.countries(data)

# Number of countries per continent, using the plyr package
library(plyr)
daply(data, .(continent), get.n.countries)
daply(data, c("continent"), get.n.countries)

# You can split by multiple variables, e.g. year and continent
daply(data, c("continent", "year"), get.n.countries)

Note: plyr uses "array" to mean "vector" or "matrix"; basically any object containing just one type. If that's still confusing, sing out and we'll talk about it more.

# These are all equivalent ways of specifying the splitting variables
ddply(data, .(continent, year), get.n.countries)
ddply(data, c("continent", "year"), get.n.countries)
ddply(data, ~continent*year, get.n.countries)

# Exercise 2: write a function that takes a data frame as input and returns the
# sum of the total population (variable "pop"), then apply it to the data,
# splitting by year and continent
get.total.pop <- function(x) {
  sum(x$pop)
}
get.total.pop(data)  # total number of sampled individuals in data: 50440465802

out <- ddply(data, .(continent, year), get.total.pop)
# or as an anonymous function
out <- ddply(data, .(continent, year), function(x) sum(x$pop))
names(out)[3] <- "total_pop"  # rename the output column

Naming the output: http://stackoverflow.com/questions/1395271/renaming-the-output-column-with-the-plyr-package-in-r

# Alternatively, return a one-column data frame with the name you want
get.total.pop <- function(x) {
  data.frame(pop=sum(x$pop))
}

This works by creating a *named* result (here a one-column data frame) in the function that we repeatedly apply. When ddply puts the pieces back together for us, it notices the name and creates the output column name appropriately. Sweeeeet.
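A quick check of that behaviour (a sketch; `pop.by.group` is just an illustrative name): with the data.frame-returning version of get.total.pop, ddply names the summary column itself, so no manual renaming is needed.

pop.by.group <- ddply(data, .(continent, year), get.total.pop)
head(pop.by.group)  # columns: continent, year, pop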
# Exercise 3: write a function that takes a data frame as input and returns the
# maximum of the variable gdpPercap, then apply it to the data, splitting by
# year and continent

# Solution with named columns, returning both the maximum and the minimum
max.gdp <- ddply(data, .(continent, year), function(x)
  data.frame(gdp.max=max(x$gdpPercap), gdp.min=min(x$gdpPercap)))

Overview of the plyr functions: http://nicercode.github.io/2014-02-13-UNSW/lessons/40-repeating/full_apply_suite.png

# A simpler solution, returning just the maximum
get.max.gdp <- function(x) {
  max(x$gdpPercap)
}
ddply(data, .(continent, year), get.max.gdp)

ddply(data, .(year, continent), function(x) data.frame(tot_pop=sum(x$pop)))
# This also works here - just named elements in a vector
ddply(data, .(year, continent), function(x) c(tot_pop=sum(x$pop)))

# Example of something returning a list
fit <- lm(lifeExp ~ log10(gdpPercap), data=data)
str(fit)  # use str to see what's in the object called fit

# Exercise 4: get the unique names of countries in the dataset; then, using
# dlply, write a function that gets the names of the countries for each continent
unique.country.names <- dlply(data, .(continent), function(x) unique(x$country))

# Example of a simple list
ex_list <- list(a_vector=c(1, 2, 3, 4, 5, 6),
                a_matrix=matrix(1:9, 3, 3),
                a_dataframe=head(data))
names(ex_list)  # the names of the objects contained in our list
str(ex_list)    # shows the full structure of our list

#### Fitting a model: slope, intercept, R-squared and p-value

model <- function(x) {
  lm(lifeExp ~ log10(gdpPercap), data=x)
}
fitted.linear.models <- dlply(data, .(continent, year), model)

# P-value extraction. It's almost like they don't want you to do it:
anova(fit)[1, "Pr(>F)"]
# R-squared:
summary(fit)$r.squared
# Coefficient table for the first fitted model, and the p-value for its slope:
tab <- coef(summary(fitted.linear.models[[1]]))
tab[2, "Pr(>|t|)"]

# Return the fit summaries as a data frame so ddply can combine them into a table
model <- function(x) {
  fit <- lm(lifeExp ~ log10(gdpPercap), data=x)
  data.frame(n=length(x$lifeExp),
             r2=summary(fit)$r.squared,
             a=coef(fit)[[1]],
             b=coef(fit)[[2]],
             p=anova(fit)[1, "Pr(>F)"])
}
fitted.linear.models <- ddply(data, .(continent, year), model)

# A more general version: pass the x and y variable names as arguments
model <- function(d, x, y) {
  fit <- lm(d[[y]] ~ log10(d[[x]]), data=d)
  data.frame(n=length(d[[x]]),
             r2=summary(fit)$r.squared,
             a=coef(fit)[[1]],
             b=coef(fit)[[2]],
             p=anova(fit)[1, "Pr(>F)"])
}
fitted.linear.models <- ddply(data, .(continent, year), model, x="gdpPercap", y="lifeExp")
fitted.linear.models2 <- ddply(data, .(continent, year), model, x="gdpPercap", y="pop")
format(fitted.linear.models2, digits=2)
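A related sketch (not covered in the lesson as given; `fits` and `slopes` are illustrative names): if you want to keep the full lm objects, dlply stores them in a list, and ldply can then collapse per-fit summaries back into a data frame, with the split labels (continent, year) coming along as columns.

fits <- dlply(data, .(continent, year),
              function(x) lm(lifeExp ~ log10(gdpPercap), data=x))
slopes <- ldply(fits, function(fit) data.frame(slope=coef(fit)[[2]]))
head(slopes)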
R cheatsheets:
* http://www.amaynard.ca/computing/R_Cheatsheet.pdf
* http://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf
* http://cran.r-project.org/doc/contrib/Short-refcard.pdf

R search engine: http://Rseek.org
Search the built-in help: help.search("summary")
Stack Overflow's R section: http://stackoverflow.com/questions/tagged/r
e.g., linear models in R: http://stackoverflow.com/search?q=%5Br%5D+linear+model

https://etherpad.mozilla.org/ep/pad/view/ro.VPU-xAZD/latest

###################### Testing #####################################

source("R/functions.R")
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
library(testthat)
data.1982 <- data[data$year == 1982, ]

# This should throw an error
expect_that(variance(data.1982$gdpPercap) + 100, equals(var(data.1982$gdpPercap)))

# Contents of the test-variance.R file
x <- runif(100)
expect_that(variance(x), equals(var(x)))
expect_that(length(variance(x)) == 1, is_true())
expect_that(is.nan(variance(1)), is_true())

x.with.nas <- runif(10)
x.with.nas[c(2, 6, 8)] <- NA
x.without.nas <- na.omit(x.with.nas)
expect_that(variance(x.with.nas), equals(variance(x.without.nas)))
# END OF THE test-variance.R file

# Definitions of the functions (in R/functions.R)
average <- function(x) {
  sum(x) / length(x)
}
variance <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before computing the variance
  (1 / (length(x) - 1)) * sum((x - mean(x))^2)
}

# Run the tests
source("R/functions.R")
library(testthat)
test_file("test-variance.R")

# Exercise: test the 'average' function. Test it against R's mean() function, and against as many tricky cases as you can think of. Think in particular about how 'average' should deal with missing values - it should probably do the same thing as the 'variance' function, which will require you to modify it a little.

# Links from the chat window
11:44 Pascal: some people might find this helpful, i did: http://yangfeng.wordpress.com/2010/05/19/abbreviations-of-r-commands/
11:44 Pascal: common abbreviations for R commands
14:45 Aaron: for the person who was squinting to count whether the populations were millions or billions:
14:46 Aaron: prettyNum(x, big.mark=",") will group 000's with commas
15:09 Pascal: choosing your beer using R: http://blog.yhathq.com/posts/recommender-system-in-r.html
16:25 Tim: dummy data used in tests are known as "test fixtures"
17:04 Tim: Apart from writing unit tests for each function, putting integration tests in your main analysis code is also useful - these check that all your functions are working together as expected to produce known-good results from smaller subsets of your data.
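Following up on Tim's last point, a rough sketch of what such an integration test could look like (assuming the project layout from the PROJECTS section; the expected values - 142 countries and a total population of 50440465802 - come from earlier in these notes):

library(testthat)
data <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors=FALSE)
expect_that(length(unique(data$country)), equals(142))  # number of countries
expect_that(sum(data$pop), equals(50440465802))         # total sampled population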