1  Refresher

1.1 Goal

Introduce the statistical software R, and show how it can be used to analyze biological data in an automated, replicable way. Showcase the RStudio development environment, illustrate the notion of assignment, present the main data structures available in R. Show how to read and write data, how to execute simple programs, and how to modify the stream of execution of a program through conditional branching and looping. Introduce the use of packages and user-defined functions.

1.2 Motivation

When it comes to analyzing data, there are two competing paradigms. First, one could use point-and-click software with a graphical user interface, such as Excel, to perform calculations and draw graphs; second, one could write programs that can be run to perform the analysis of the data, the generation of tables and statistics, and the production of figures automatically.

This latter approach is to be preferred, because it allows for the automation of analysis, it requires a good documentation of the procedures, and is completely replicable.

A few motivating examples:

  • You have written code to analyze your data. You receive from your collaborators a new batch of data. With simple modifications of your code, you can update your results, tables and figures automatically.

  • A new student joins the laboratory. The new student can read the code and understand the analysis without the need of a lab mate showing the procedure step-by-step.

  • The reviewers of your manuscript ask you to slightly alter the analysis. Rather than having to start over, you can modify a few lines of code and satisfy the reviewers.

Here we introduce R, which can help you write simple programs to analyze your data, perform statistical analysis, and draw beautiful figures.

1.3 Before we start

To follow this tutorial, you will need to install R and RStudio

  • Install R: download and install R from this page. Choose the right architecture (Windows, Mac, Linux). If possible, install the latest release.
  • Install RStudio: go to this page and download the “RStudio Desktop Open Source License”.
  • Install R packages: launch RStudio. Click on “Packages” in the bottom-right panel. Click on “Install”: a dialog window will open. Type tidyverse in the field “Packages” and click on “Install”. This might take a few minutes, and ask you to download further packages.

1.4 What is R?

R is a statistical software that is completely programmable. This means that one can write a program (script) containing a series of commands for the analysis of data, and execute them automatically. This approach is especially good as it makes the analysis of data well-documented, and completely replicable.

R is free software: anyone can download its source code, modify it, and improve it. The R community of users is vast and very active. In particular, scientists have enthusiastically embraced the program, creating thousands of packages to perform specific types of analysis, and adding many new capabilities. You can find a list of official packages (which have been vetted by R core developers) here; many more are available on GitHub and other websites.

The main hurdle new users face when approaching R is that it is based on a command line interface: when you launch R, you simply open a console with the character > signaling that R is ready to accept an input. When you write a command and press Enter, the command is interpreted by R, and the result is printed immediately after the command. For example,

1 + 1
[1] 2

A little history: R was modeled after the commercial statistical software S by Robert Gentleman and Ross Ihaka. The project was started in 1992, first released in 1994, and the first stable version appeared in 2000. Today, R is managed by the R Core Team.

1.5 RStudio

For this introduction, we’re going to use RStudio, an Integrated Development Environment (IDE) for R. The main advantage is that the environment will look identical irrespective of your computer architecture (Linux, Windows, Mac). Also, RStudio makes writing code much easier by automatically completing commands and file names (simply type the beginning of the name and press Tab), and allowing you to easily inspect data and code.

Typically, an RStudio window contains four panels:

  • Console This is a panel containing an instance of R. For this tutorial, we will work mainly in this panel.
  • Source code In this panel, you can write a program, save it to a file pressing Ctrl + S and then execute it by pressing Ctrl + Shift + S.
  • Environment This panel lists all the variables you created (more on this later); another tab shows you the history of the commands you typed.
  • Plots This panel shows you all the plots you drew. Other tabs allow you to access the list of packages you have loaded, and the help page for commands (just type help(name_of_command) in the Console) and packages.

1.6 How to write a simple program

An R program is simply a list of commands, which are executed one after the other. The commands are written in a text file (with extension .R). When R executes the program, it will start from the beginning of the file and proceed toward the end of the file. Every time R encounters a command, it will execute it. Special commands can modify this basic flow of the program by, for example, executing a series of commands only when a condition is met, or repeating the execution of a series of commands multiple times.

Note that if you were to copy and paste (or type) the code into the Console you would obtain exactly the same result. Writing a program is advantageous, however, because the analysis can be automated, and the code shared with other researchers. Moreover, after a while you will have a large code base, so that you can recycle much of your code.

We start by working on the console, and then start writing simple scripts.

1.6.1 The most basic operation: assignment

The most basic operation in any programming language is the assignment. In R, assignment is marked by the operator <- (can be typed quickly using Alt -). When you type a command in R, it is executed, and the output is printed in the Console. For example:

sqrt(9)
[1] 3

If we want to save the result of this operation, we can assign it to a variable. For example:

x <- sqrt(9)
x
[1] 3

What has happened? We wrote a command containing an assignment operator (<-). R has evaluated the right-hand-side of the command (sqrt(9)), and has stored the result (3) in a newly created variable called x. Now we can use x in our commands: every time the command needs to be evaluated, the program will look up which value is associated with the variable x, and substitute it. For example:

x * 2 
[1] 6

1.6.2 Data types

R provides different types of data that can be used in your programs. For each variable x, calling class(x) prints the type of the variable. The basic data types are:

  • logical, taking only two possible values: TRUE and FALSE
v <- TRUE
class(v)
[1] "logical"
  • numeric, storing real numbers (actually, their approximations, as computers have limited memory and thus cannot store numbers like π, or even 0.2)
v <- 3.77
class(v)
[1] "numeric"
  • Real numbers can also be specified using scientific notation:
v <- 6.022e23 # 6.022⋅10^23 (Avogadro's number)
class(v)
[1] "numeric"
  • integer, storing whole numbers
v <- 23L # the L signals that this should be stored as integer
class(v)
[1] "integer"
  • complex, storing complex numbers (i.e., with a real and an imaginary part)
v <- 23 + 5i # the i marks the imaginary part
class(v)
[1] "complex"
  • character, for strings, characters and text
v <- 'a string' # you can use single or double quotes
class(v)
[1] "character"

In R, the value and type of a variable are evaluated at run-time. This means that you can recycle the names of variables. This is very handy, but can make your programs more difficult to read and to debug (i.e., find mistakes). For example:

x <- '2.3' # this is a string
x
[1] "2.3"
x <- 2.3 # this is numeric
x
[1] 2.3

1.6.3 Operators and functions

Each data type supports a certain number of operators and functions. For example, numeric variables can be combined with + (addition), - (subtraction), * (multiplication), / (division), and ^ (or **, exponentiation). A possibly unfamiliar operator is the modulo (%%), calculating the remainder of an integer division:

5 %% 3
[1] 2

meaning that 5 %/% 3 (5 integer divided by 3) is 1 with a remainder of 2

The modulo operator is useful to determine whether a number is divisible for another: if y is divisible by x, then y %% x is 0.

R provides many built-in functions: each functions has a name, followed by round parentheses surrounding the (possibly optional) function arguments. For example, these functions operate on numeric variables:

  • abs(x) absolute value
  • sqrt(x) square root
  • round(x, digits = 3) round x to three decimal digits
  • cos(x) cosine (also supported are all the usual trigonometric functions)
  • log(x) natural logarithm (use log10 for base 10 logarithms)
  • exp(x) calculating \(e^x\)

Similarly, character variables have their own set of functions, such as:

  • toupper(x) make uppercase
  • nchar(x) count the number of characters in the string
  • paste(x, y, sep = "_") concatenate strings, joining them using the separator _
  • strsplit(x, "_") separate the string using the separator _

Calling a function meant for a certain data type on another will cause errors. If sensible, you can convert a type into another. For example:

v <- "2.13"
class(v)
[1] "character"
# if we call v * 2, we get an error.
# to avoid it, we can convert v to numeric:
as.numeric(v) * 2 
[1] 4.26

If sensible, you can use the comparison operators > (greater), < (lower), == (equals), != (differs), >= and <=, returning a logical value:

2 == sqrt(4)
[1] TRUE
2 < sqrt(4)
[1] FALSE
2 <= sqrt(4)
[1] TRUE

Exercise:

Why are two equal signs (==) used to check that two values are equal? What happens if you use only one = sign?

Similarly, you can concatenate several comparison and logical variables using & (and), | (or), and ! (not):

(2 > 3) & (3 > 1)
[1] FALSE
(2 > 3) | (3 > 1)
[1] TRUE

1.6.4 Getting help

If you want to know more about a function, type ?my_function_name in the console (e.g., ?abs). This will open the help page in one of the panels on the right. The same can be accomplished calling help(abs). For more complex questions, check out stackoverflow.

1.6.5 Data structures

Besides these simple types, R provides structured data types, meant to collect and organize multiple values.

1.6.5.1 Vectors

The most basic data structure in R is the vector, which is an ordered collection of values of the same type. Vectors can be created by concatenating different values with the function c() (“combine”):

x <- c(2, 3, 5, 27, 31, 13, 17, 19) 
x
[1]  2  3  5 27 31 13 17 19

You can access the elements of a vector by their index: the first element is indexed at 1, the second at 2, etc.

x[3]
[1] 5
x[8]
[1] 19
x[9] # what if the element does not exist?
[1] NA

NA stands for “Not Available”. Other special values are NaN (Not a Number, e.g., 0/0), Inf (Infinity, e.g., 1/0), and NULL (variable undefined). You can test for special values using is.na(x), is.infinite(x), is.null(x), etc.

Note that in R a single number (string, logical) is a vector of length 1 by default. That’s why if you type 3 in the console you see [1] 3 in the output.

You can extract several elements at once (i.e., create another vector), using the colon (:) command, or by concatenating the indices:

x[1:3]
[1] 2 3 5
x[4:7]
[1] 27 31 13 17
x[c(1,3,5)]
[1]  2  5 31

You can also use a vector of logical variables to extract values from vectors. For example, suppose we have two vectors:

sex <- c("M", "M", "F", "M", "F") # sex of Drosophila
weight <- c(0.230, 0.281, 0.228, 0.260, 0.231) # weight in mg

and that we want to extract only the weights for the males.

sex == "M"
[1]  TRUE  TRUE FALSE  TRUE FALSE

returns a vector of logical values, which we can use to subset the data:

weight[sex == "M"]
[1] 0.230 0.281 0.260

Given that R was born for statistics, there are many statistical functions you can perform on vectors:

length(x)
[1] 8
min(x)
[1] 2
max(x)
[1] 31
sum(x) # sum all elements
[1] 117
prod(x) # multiply all elements
[1] 105436890
median(x) # median value
[1] 15
mean(x) # arithmetic mean
[1] 14.625
var(x) # unbiased sample variance
[1] 119.4107
mean(x ^ 2) - mean(x) ^ 2 # population variance
[1] 104.4844
summary(x) # print a summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    4.50   15.00   14.62   21.00   31.00 

You can generate vectors of sequential numbers using the colon command:

x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10

For more complex sequences, use seq:

seq(from = 1, to = 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

To repeat a value or a sequence several times, use rep:

rep("abc", 3)
[1] "abc" "abc" "abc"
rep(c(1, 2, 3), 3)
[1] 1 2 3 1 2 3 1 2 3

Exercise:

  • Create a vector containing all the even numbers between 2 and 100 (inclusive) and store it in variable z.
  • Extract all the elements of z that are divisible by 12. How many elements match this criterion?
  • What is the sum of all the elements of z?
  • Is it equal to \(51 \cdot 50\)?
  • What is the product of elements 5, 10 and 15 of z?
  • Does seq(2, 100, by = 2) produce the same vector as (1:50) * 2?
  • What happens if you type z ^ 2?

1.6.5.2 Matrices

A matrix is a two-dimensional table of values. In case of numeric values, you can perform the usual operations on matrices (product, inverse, decomposition, etc.):

A <- matrix(c(1, 2, 3, 4), 2, 2) # values, nrows, ncols
A 
     [,1] [,2]
[1,]    1    3
[2,]    2    4
A %*% A # matrix product
     [,1] [,2]
[1,]    7   15
[2,]   10   22
solve(A) # matrix inverse
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
A %*% solve(A) # this should return the identity matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1
B <- matrix(1, 3, 2) # you can fill the whole matrix with a single number (1)
B
     [,1] [,2]
[1,]    1    1
[2,]    1    1
[3,]    1    1
B %*% t(B) # transpose
     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    2    2    2
[3,]    2    2    2
Z <- matrix(1:9, 3, 3) # by default, matrices are filled by column
Z
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

To determine the dimensions of a matrix, use dim:

dim(B)
[1] 3 2
dim(B)[1]
[1] 3
nrow(B) 
[1] 3
dim(B)[2]
[1] 2
ncol(B)
[1] 2
nrow(B)
[1] 3

Use indices to access a particular row/column of a matrix:

Z
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Z[1, ] # first row
[1] 1 4 7
Z[, 2] # second column
[1] 4 5 6
Z [1:2, 2:3] # submatrix with coefficients in first two rows, and second and third column
     [,1] [,2]
[1,]    4    7
[2,]    5    8
Z[c(1, 3), c(1, 3)] # indexing non-adjacent rows/columns
     [,1] [,2]
[1,]    1    7
[2,]    3    9

Some functions use all the elements of the matrix:

sum(Z)
[1] 45
mean(Z)
[1] 5

Some functions apply the operation across a given dimension (e.g., columns) of the matrix:

rowSums(Z) # returns a vector of the sums of the values in each row
[1] 12 15 18
colSums(Z) # does the same for columns
[1]  6 15 24
rowMeans(Z) # returns a vector of the means of the values in each row
[1] 4 5 6
colMeans(Z) # does the same for columns
[1] 2 5 8

1.6.5.3 Arrays

If you need tables with more than two dimensions, use arrays:

M <- array(1:24, c(4, 3, 2))
M 
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

You can still determine the dimensions using:

dim(M)
[1] 4 3 2

and access the elements as done for matrices. One thing you should be paying attention to: R drops dimensions that are not needed. So, if you access a “slice” of a 3-dimensional array:

M[, , 1]
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

you obtain a matrix:

dim(M[, , 1])
[1] 4 3

This can be problematic, for example, when your code expects an array and R turns your data into a matrix (or you expect a matrix but find a vector). To avoid this behavior, add drop = FALSE when subsetting:

dim(M[, , 1, drop = FALSE])
[1] 4 3 1

1.6.5.4 Lists

Vectors are good if each element is of the same type (e.g., numbers, strings). Lists are used when we want to store elements of different types, or more complex objects (e.g., vectors, matrices, even lists of lists). Each element of the list can be referenced either by its index, or by a label:

mylist <- list(Names = c("a", "b", "c", "d"), Values = c(1, 2, 3))
mylist
$Names
[1] "a" "b" "c" "d"

$Values
[1] 1 2 3
mylist[[1]] # access first element using index
[1] "a" "b" "c" "d"
mylist[[2]] # access second element by index
[1] 1 2 3
mylist$Names # access second element by label
[1] "a" "b" "c" "d"
mylist[["Names"]] # another way to access by label
[1] "a" "b" "c" "d"
mylist[["Values"]][3]  # access third element in second vector
[1] 3

1.6.5.5 Data frames

Data frames contain data organized like in a spreadsheet. The columns (typically representing different measurements) can be of different types (e.g., a column could be the date of measurement, another the weight of the individual, or the volume of the cell, or the treatment of the sample), while the rows typically represent different samples.

When you read a spreadsheet file in R, it is automatically stored as a data frame. The difference between a matrix and a data frame is that in a matrix all the values are of the same type (e.g., all numeric), while in a data frame each column can be of a different type.

Because typing a data frame by hand would be tedious, let’s use a data set that is already available in R:

data(trees) # girth, height and volume of cherry trees
str(trees) # structure of data frame
'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
ncol(trees)
[1] 3
nrow(trees)
[1] 31
head(trees) # print the first few rows
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
summary(trees) # Quickly get an overview of the data frame.
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  
trees$Girth # select column by name
 [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
[16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
[31] 20.6
trees$Height[1:5] # select column by name; return first five elements
[1] 70 65 63 72 81
trees[1:3, ] #select rows 1 through 3
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
trees[1:3, ]$Volume # select rows 1 through 3; return column Volume
[1] 10.3 10.3 10.2
trees <- rbind(trees, c(13.25, 76, 30.17)) # add a row
trees_double <- cbind(trees, trees) # combine columns
colnames(trees) <- c("Circumference", "Height", "Volume") # change column names

Exercise:

  • What is the average height of the cherry trees?
  • What is the average girth of those that are more than 75 ft tall?
  • What is the maximum height of trees with a volume between 15 and 35 ft\(^3\)?

1.7 Reading and writing data

In most cases, you will not generate your data in R, but import it from a file. By far, the best option is to have your data in a comma separated value text file or in a tab separated file. Then, you can use the function read.csv (or read.table) to import your data. The syntax of the functions is as follows:

read.csv("MyFile.csv") # read the file MyFile.csv
read.csv("MyFile.csv", header = TRUE) # the file has a header
read.csv("MyFile.csv", sep = ';') # specify the column separator
read.csv("MyFile.csv", skip = 5) # skip the first 5 lines

Note that columns containing strings are typically converted to factors (categorical values, useful when performing regressions). To avoid this behavior, you can specify stringsAsFactors = FALSE when calling the function.

Similarly, you can save your data frames using write.table or write.csv. Suppose you want to save the data frame MyDF:

write.csv(MyDF, "MyFile.csv") 
write.csv(MyDF, "MyFile.csv", append = TRUE) # append to the end of the file 
write.csv(MyDF, "MyFile.csv", row.names = TRUE) # include the row names
write.csv(MyDF, "MyFile.csv", col.names = FALSE) # do not include column names

Let’s look at an example: Read a file containing data on the 6th chromosome for a number of Europeans (Data adapted from Stanford HGDP SNP Genotyping Data by John Novembre). This example shows that you can read data directly from the internet!

# The actual URL is
# https://github.com/StefanoAllesina/BSD-QBio4/raw/master/tutorials/basic_computing_1/data/H938_Euro_chr6.geno
ch6 <- read.table("https://tinyurl.com/y7vctq3v", 
                  header = TRUE, stringsAsFactors = FALSE)

where header = TRUE means that we want to take the first line to be a header containing the column names. How big is this table?

dim(ch6)
[1] 43141     7

we have 7 columns, but more than 40k rows! Let’s see the first few:

head(ch6)
  CHR        SNP A1 A2 nA1A1 nA1A2 nA2A2
1   6  rs4959515  A  G     0    17   107
2   6   rs719065  A  G     0    26    98
3   6  rs6596790  C  T     0     4   119
4   6  rs6596796  A  G     0    22   102
5   6  rs1535053  G  A     5    39    80
6   6 rs12660307  C  T     0     3   121

and the last few:

tail(ch6)
      CHR        SNP A1 A2 nA1A1 nA1A2 nA2A2
43136   6 rs10946282  C  T     0    16   108
43137   6  rs3734763  C  T    19    56    48
43138   6   rs960744  T  C    32    60    32
43139   6  rs4428484  A  G     1    11   112
43140   6  rs7775031  T  C    26    56    42
43141   6 rs12213906  C  T     1    11   112

The data contains the number of homozygotes (nA1A1, nA2A2) and heterozygotes (nA1A2), for 43,141 single nucleotide polymorphisms (SNPs) obtained by sequencing European individuals:

  • CHR The chromosome (6 in this case)
  • SNP The identifier of the Single Nucleotide Polymorphism
  • A1 One of the alleles
  • A2 The other allele
  • nA1A1 The number of individuals with the particular combination of alleles.

Exercise:

  • How many individuals were sampled? Find the maximum of the sum nA1A1 + nA1A2 + nA2A2. Note: you can access the columns by index (e.g., ch6[,5]), or by name (e.g., ch6$nA1A1, or also ch6[,"nA1A1"]).
  • Try using the function rowSums to obtain the same result.
  • For how many SNPs do we have that all sampled individuals are homozygotes (i.e., all A1A1 or all A2A2)?
  • For how many SNPs, are more than 99% of the sampled individuals homozygous?

1.8 Conditional branching

Now we turn to writing actual programs in the Source code panel. To start a new R program, press Ctrl + Shift + N. This will open an Untitled script. Save the script by pressing Ctrl + S: save it as conditional.R in the directory programming_skills/sandbox/. To make sure you’re working in the directory where the script is contained, on the menu on the top choose Session -> Set Working Directory -> To Source File Location.

Now type the following script:

print("Hello world!")
x <- 4
print(x)

and execute the script by pressing Ctrl + Shift + S. You should see Hello World! and 4 printed in your console.

As you saw in this simple example, when R executes the program, it starts from the top and proceeds toward the end of the file. Every time it encounters a command (for example, print(x), printing the value of x into the console), it executes it.

When we want a certain block of code to be executed only when a certain condition is met, we can write a conditional branching point. The syntax is as follows:

if (condition is met){
  # execute this block of code
} else {
  # execute this other block of code
}

For example, add these lines to the script conditional.R, and run it again:

print("Hello world!")
x <- 4
print(x)
if (x %% 2 == 0){
  my_message <- paste(x, "is even")
} else {
  my_message <- paste(x, "is odd")
}
print(my_message)

We have created a conditional branching point, so that the value of my_message changes depending on whether x is even (and thus the remainder of the integer division by 2 is 0), or odd. Change the line x <- 4 to x <- 131 and run it again.

Exercise: What does this do?

x <- 36
if (x > 20){
  x <- sqrt(x)
} else {
  x <- x ^ 2
}
if (x > 7) {
  print(x)
} else if (x %% 2 == 1){
  print(x + 1)
}

1.9 Looping

Another way to change the flow of the program is to write a loop. A loop is simply a series of commands that are repeated a number of times. For example, you want to run the same analysis on different data sets that you collected; you want to plot the results contained in a set of files; you want to test your simulation over a number of parameter sets; etc.

R provides you with two ways to loop over blocks of commands: the for loop, and the while loop. Let’s start with the for loop, which is used to iterate over a vector (or a list): for each value of the vector, a series of commands will be run, as shown by the following example, which you can type in a new script called forloop.R.

myvec <- 1:10 # vector with numbers from 1 to 10

for (i in myvec) {
  a <- i ^ 2
  print(a)
}

In the code above, the variable i takes the value of each element of myvec in sequence. Inside the block defined by the for loop, you can use the variable i to perform operations.

The anatomy of the for statement:

for (variable in list_or_vector) {
  execute these commands
} # automatically moves to the next value

For loops are used when you know that you want to perform the analysis using a given set of values (e.g., run over all files of a directory, all samples in your data, all sequences of a fasta file, etc.).

The while loop is used when the commands need to be repeated while a certain condition is true, as shown by the following example, which you can type in a script called whileloop.R:

i <- 1

while (i <= 10) {
  a <- i ^ 2
  print(a)
  i <- i + 1 
}

The script performs exactly the same operations we wrote for the for loop above. Note that you need to update the value of i, (using i <- i + 1), otherwise the loop will run forever (infinite loop—to terminate click on the stop button in the top-right corner of the console). The anatomy of the while statement:

while (condition is met) {
  execute these commands
} # beware of infinite loops: remember to update the condition!

You can break a loop using the command break. For example:

i <- 1

while (i <= 10) {
  if (i > 5) {
    break
  }
  a <- i ^ 2
  print(a)
  i <- i + 1
}

Exercise: What does this do? Try to guess what each loop does, and then create and run a script to confirm your intuition.

z <- seq(1, 1000, by = 3)
for (k in z) {
  if (k %% 4 == 0) {
     print(k)
  }
}
z <- readline(prompt = "Enter a number: ")
z <- as.numeric(z)
isthisspecial <- TRUE
i <- 2
while (i < z) {
  if (z %% i == 0) {
     isthisspecial <- FALSE
     break
  }
  i <- i + 1
}
if (isthisspecial == TRUE) {
  print(z)
}

1.10 Useful Functions

Here’s a short list of useful functions that will help you write your programs:

  • range(x): minimum and maximum of a vector x
  • sort(x): sort a vector x
  • unique(x): remove duplicate entries from vector x
  • which(x == a): returns a vector of the indices of x having value a
  • list.files("path_to_directory"): list the files in a directory (current directory if not specified)
  • table(x) build a table of frequencies

Exercises: What does this code do? For each snippet of code, first try to guess what will happen. Then, write a script and run it to confirm your intuition.

v <- c(1, 3, 5, 5, 3, 1, 2, 4, 6, 4, 2)
v <- sort(unique(v))
for (i in v){
  if (i > 2){
    print(i)
  }
  if (i > 4){
    break
  }
}
x <- 1:100
x <- x[which(x %% 7 == 0)]
my_amount <- 10
while (my_amount > 0){
  my_color <- NA
  while(is.na(my_color)){
    tmp <- readline(prompt="Do you want to bet on black or red? ")
    tmp <- tolower(tmp)
    if (tmp == "black") my_color <- "black"
    if (tmp == "red") my_color <- "red"
    if (is.na(my_color)) print("Please enter either red or black")
  }
  my_bet <- NA
  while(is.na(my_bet)){
    tmp <- readline(prompt="How much do you want to bet? ")
    tmp <- as.numeric(tmp)
    if (is.numeric(tmp) == FALSE){
      print("Please enter a number")
    } else {
      if (tmp > my_amount){
        print("You don't have enough money!")
      } else {
        my_bet <- tmp
        my_amount <- my_amount - tmp
      }
    }
  }
  lady_luck <- sample(c("red", "black"), 1)
  if (lady_luck == my_color){
    my_amount <- my_amount + 2 * my_bet
    print(paste("You won!! Now you have", my_amount, "gold doubloons"))
  } else {
    print(paste("You lost!! Now you have", my_amount, "gold doubloons"))
  }
}

1.11 Packages

R is the most popular statistical computing software among biologists due to its highly specialized packages, often written by biologists for biologists. You can contribute a package too! The RStudio support (goo.gl/harVqF) provides guidance on how to start developing R packages and Hadley Wickham’s free online book (r-pkgs.had.co.nz) will make you a pro.

You can find highly specialized packages to address your research questions. Here are some suggestions for finding an appropriate package. The Comprehensive R Archive Network (CRAN) offers several ways to find specific packages for your task. You can either browse packages (goo.gl/7oVyKC) and their short description or select a scientific field of interest (goo.gl/0WdIcu) to browse through a compilation of packages related to each discipline.

From within your R terminal or RStudio you can also call the function RSiteSearch("KEYWORD"), which submits a search query to the website search.r-project.org. The website rseek.org casts an even wider net, as it not only includes package names and their documentation but also blogs and mailing lists related to R. If your research interests relate to high-throughput genomic data, you should have a look the packages provided by Bioconductor (goo.gl/7dwQlq).

1.11.1 Installing a package

To install a package type

install.packages("name_of_package")

in the Console, or choose the panel Packages and then click on Install in RStudio.

1.11.2 Loading a package

To load a package type

library(name_of_package)

or call the command into your script. If you want your script to automatically install a package in case it’s missing, use this boilerplate:

if (!require(needed_package, character.only = TRUE, quietly = TRUE)) {
    install.packages(needed_package)
    library(needed_package, character.only = TRUE)
}

1.11.3 Example

For example, say we want to access the dataset bacteria, which reports the incidence of H. influenzae in Australian children. The dataset is contained in the package MASS.

First, we need to load the package:

library(MASS)

Now we can load the data:

data(bacteria)
bacteria[1:3,]
  y ap hilo week  ID     trt
1 y  p   hi    0 X01 placebo
2 y  p   hi    2 X01 placebo
3 y  p   hi    4 X01 placebo

1.12 Random numbers

To perform randomization, or any simulation, we typically need to draw random numbers. R has functions to sample random numbers from very many different statistical distributions. For example:

runif(5) # sample 5 numbers from the uniform distribution between 0 and 1
[1] 0.6823943 0.4607084 0.3336380 0.8251196 0.5060820
runif(5, min = 1, max = 9) # set the limits of the uniform distribution
[1] 2.504311 6.776124 6.677304 7.029095 6.074845
rnorm(3) # three values from standard normal
[1] -0.5799389  1.1177498 -2.7340737
rnorm(3, mean = 5, sd = 4) # specify mean and standard deviation
[1]  4.366846  1.548894 16.205117

To sample from a set of values, use sample:

v <- c("a", "b", "c", "d")
sample(v, 2) # without replacement
[1] "d" "c"
sample(v, 6, replace = TRUE) # with replacement
[1] "a" "d" "d" "a" "d" "a"
sample(v) # simply shuffle the elements
[1] "a" "d" "b" "c"

1.13 Writing functions

The R community provides about 7,000 packages. Still, sometimes there isn’t an already made function capable of doing what you need. In these cases, you can write your own functions. In fact, it is generally a good idea to always divide your analysis into functions, and then write a small “master” program that calls the functions and performs the analysis. In this way, the code will be much more legible, and you will be able to recycle the functions for your other projects.

A function in R has this form:

my_function_name <- function(optional, arguments, separated, by_commas){
  # Body of the function
  # ...
  # 
  return(return_value) # this is optional
}

A few examples:

sum_two_numbers <- function(a, b){
  apb <- a + b  
  return(apb)
}
sum_two_numbers(5, 7.2)
[1] 12.2

You can set a default value for some of the arguments: if not specified by the user, the function will use these defaults:

sum_two_numbers <- function(a = 1, b = 2){
  apb <- a + b  
  return(apb)
}
sum_two_numbers()
[1] 3
sum_two_numbers(3)
[1] 5
sum_two_numbers(b = 9)
[1] 10

The return value is optional:

my_factorial <- function(a = 6){
  if (as.integer(a) != a) {
    print("Please enter an integer!")
  } else {
    tmp <- 1
    for (i in 2:a){
      tmp <- tmp * i
    }
    print(paste(a, "! = ", tmp, sep = ""))
  }
}
my_factorial()
[1] "6! = 720"
my_factorial(10)
[1] "10! = 3628800"

You can return only one object. If you need to return multiple values, organize them into a vector/matrix/list and return that.

order_two_numbers <- function(a, b){
  if (a > b) return(c(a, b)) #nothing after the first return is executed
  return(c(b,a))
}

order_two_numbers(runif(1), runif(1))
[1] 0.34457391 0.08024138

1.14 Organizing and running code

During the class, we will write a lot of code, of increasing complexity. Here is what you should do to ensure that your programs are well-organized, easy to understand, and easy to debug.

  1. Take the problem, and divide it into its basic building blocks. Each block should be its own function.
  2. Write the code for each building block separately, and test it thoroughly.
  3. Extensively document the code, so that you can understand what you did, how you did it, and why.
  4. Combine the building blocks into a master program.

For example, let’s write code that takes the data on Chromosome 6 we have seen above, and tries to identify which SNPs deviate the most from Hardy-Weinberg equilibrium. Remember that in an infinite population, where mating is random, there is no selection and no mutations, the proportion of people carrying the alleles \(A1A1\) should be approximately \(p_{11} = p^2\) (where \(p\) is the frequency of the first allele in the population \(p = p_{11} + \frac{1}{2} p_{12}\)), those carrying \(A1A2\) should be \(p_{12} = 2 p q\) (where \(q = 1-p\)) and finally those carrying \(A2A2\) should be \(p_{22} = q^2\). This is called the Hardy-Weinberg equilibrium.

We want to test this on a number of different SNPs. First, we write a function that takes as input the data and a given SNP, and computes the probability \(p\) of carrying the first allele.

compute_probabilities_HW <- function(my_data, my_SNP = "rs1535053"){
  # Take a SNP and compute the probabilities
  # p = frequency of first allele
  # q = frequency of second allele (1 - p)
  # p11 = proportion homozygous first allele
  # p12 = proportion heterozygous
  # p22 = proportion homozygous second allele
  my_SNP_data <- my_data[my_data$"SNP" == my_SNP,]
  AA <- my_SNP_data$nA1A1
  AB <- my_SNP_data$nA1A2
  BB <- my_SNP_data$nA2A2
  tot_observations <- AA + AB + BB
  p11 <- AA / tot_observations
  p12 <- AB / tot_observations
  p22 <- BB / tot_observations
  p <- p11 + p12 / 2
  q <- 1 - p
  return(list(SNP = my_SNP,
              p11 = p11,
              p12 = p12,
              p22 = p22,
              p = p,
              q = q,
              tot = tot_observations,
              AA = AA,
              AB = AB,
              BB = BB))
}

Now we can test our function:

compute_probabilities_HW(ch6)
$SNP
[1] "rs1535053"

$p11
[1] 0.04032258

$p12
[1] 0.3145161

$p22
[1] 0.6451613

$p
[1] 0.1975806

$q
[1] 0.8024194

$tot
[1] 124

$AA
[1] 5

$AB
[1] 39

$BB
[1] 80

If the allele conformed to Hardy-Weinberg, we should find approximately \(p^2 \cdot n\) people with \(A1A1\), where \(n\) is the number of people sampled. Let’s see whether these assumptions are met by the data:

observed_vs_expected_HW <- function(SNP_data){
  # compute expectations under Hardy-Weinberg equilibrium
  # organize expected and observed in a table
  observed <- c("AA" = SNP_data$AA, "AB" = SNP_data$AB, "BB" = SNP_data$BB)
  expected <- c("AA" = SNP_data$p^2 * SNP_data$tot, 
                "AB" = 2 * SNP_data$p * SNP_data$q * SNP_data$tot, 
                "BB" = SNP_data$q^2 * SNP_data$tot)
  return(rbind(observed, expected))
}

And test it:

my_SNP_data <- compute_probabilities_HW(ch6)
observed_vs_expected_HW(my_SNP_data)
               AA       AB       BB
observed 5.000000 39.00000 80.00000
expected 4.840726 39.31855 79.84073

Pretty good! This SNP seems very close to the theoretical expectation.

Let’s try another one

observed_vs_expected_HW(compute_probabilities_HW(ch6, "rs1316662"))
               AA       AB       BB
observed 26.00000 62.00000 36.00000
expected 26.20161 61.59677 36.20161

Because we have so many SNPs, we will surely find some that do not comply with the expectation. For example:

my_SNP_data <- compute_probabilities_HW(ch6, "rs6596835")
observed_vs_expected_HW(my_SNP_data)
                AA      AB      BB
observed 17.000000 24.0000 82.0000
expected  6.837398 44.3252 71.8374

To find those with the largest deviations, we can compute for the statistic:

\[ \sum_i \frac{(e_i - o_i)^2}{e_i} \] In genetics, this is called \(\chi^2\) statistics, because if the data were to follow the assumptions, these quantities would follow the \(\chi^2\) distribution.

compute_chi_sq_stat <- function(my_obs_vs_expected){
  observed <- my_obs_vs_expected["observed",]
  expected <- my_obs_vs_expected["expected",]
  return(sum((expected - observed)^2 / expected))
}

Now let’s compute the statistic for each SNPs:

# because this might take a while, we're going to only analyze the first 1000 SNPs
all_SNPs <- ch6$SNP[1:1000]
results <- data.frame(SNP = all_SNPs, ChiSq = 0)
for (i in 1:nrow(results)){
  results[i, 2] <- compute_chi_sq_stat(observed_vs_expected_HW(compute_probabilities_HW(ch6, results[i, 1])))
}

To find the ones with the largest discrepancy, run

results <- results[order(results$ChiSq, decreasing = TRUE),]
head(results)
          SNP     ChiSq
10  rs2281351 53.993853
221 rs1933650 27.724832
36  rs6596835 25.862675
681  rs689035  9.802277
178 rs6930805  9.491511
179 rs1737539  9.491511

This example showed how a seemingly difficult problem can be decomposed in smaller problems that are easier to solve.

1.15 Documenting the code using knitr

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.

Donald E. Knuth, Literate Programming, 1984

When doing experiments, we typically keep track of everything we do in a laboratory notebook, so that when writing the manuscript, or responding to queries, we can go back to our documentation to find exactly what we did, how we did it, and possibly why we did it. The same should be true for computational work.

RStudio makes it very easy to build a computational laboratory notebook. First, create a new R Markdown file (choose File -> New File -> R Markdown from the menu).

The gist of it is that you write a text file (.Rmd). The file is then read by an interpreter that transforms it into an .html or .pdf file, or even into a Word document. You can use special syntax to render the text in different ways. For example, type

***********

*Test* **Test2**

# Very large header

## Large header

### Smaller header

## Unordered lists

* First
* Second
    + Second 1
    + Second 2

1. This is
2. A numbered list

You can insert `inline code`

-----------

The most important feature of R Markdown, however, is that you can include blocks of code, and they will be interpreted and executed by R. You can therefore combine effectively the code itself with the description of what you are doing.

For example, including

{{r, eval=FALSE}} print("hello world!")

will become

print("hello world!")  
[1] "hello world!"

If you don’t want to run the R code, but just display it, use {r, eval = FALSE}; if you want to show the output but not the code, use {r, echo = FALSE}.

You can include plots, tables, and even render equations using LaTeX. In summary, when exploring your data or writing the methods of your paper, give R Markdown a try!

You can find inspiration in the notes for this class: all are written in R Markdown.

1.16 Resources

There are very many excellent books and tutorials you can read to become a proficient programmer in R. For example: