1 + 1
[1] 2
R
efresherIntroduce the statistical software R
, and show how it can be used to analyze biological data in an automated, replicable way. Showcase the RStudio
development environment, illustrate the notion of assignment, present the main data structures available in R
. Show how to read and write data, how to execute simple programs, and how to modify the stream of execution of a program through conditional branching and looping. Introduce the use of packages and user-defined functions.
When it comes to analyzing data, there are two competing paradigms. First, one could use point-and-click software with a graphical user interface, such as Excel, to perform calculations and draw graphs; second, one could write programs that can be run to perform the analysis of the data, the generation of tables and statistics, and the production of figures automatically.
This latter approach is to be preferred, because it allows for the automation of analysis, it requires a good documentation of the procedures, and is completely replicable.
A few motivating examples:
You have written code to analyze your data. You receive from your collaborators a new batch of data. With simple modifications of your code, you can update your results, tables and figures automatically.
A new student joins the laboratory. The new student can read the code and understand the analysis without the need of a lab mate showing the procedure step-by-step.
The reviewers of your manuscript ask you to slightly alter the analysis. Rather than having to start over, you can modify a few lines of code and satisfy the reviewers.
Here we introduce R
, which can help you write simple programs to analyze your data, perform statistical analysis, and draw beautiful figures.
To follow this tutorial, you will need to install R
and RStudio
R
: download and install R
from this page. Choose the right architecture (Windows, Mac, Linux). If possible, install the latest release.RStudio
: go to this page and download the “RStudio Desktop Open Source License”.R
packages: launch RStudio
. Click on “Packages” in the bottom-right panel. Click on “Install”: a dialog window will open. Type tidyverse
in the field “Packages” and click on “Install”. This might take a few minutes, and ask you to download further packages.R
is a statistical software that is completely programmable. This means that one can write a program (script) containing a series of commands for the analysis of data, and execute them automatically. This approach is especially good as it makes the analysis of data well-documented, and completely replicable.
R
is free software: anyone can download its source code, modify it, and improve it. The R
community of users is vast and very active. In particular, scientists have enthusiastically embraced the program, creating thousands of packages to perform specific types of analysis, and adding many new capabilities. You can find a list of official packages (which have been vetted by R
core developers) here; many more are available on GitHub and other websites.
The main hurdle new users face when approaching R
is that it is based on a command line interface: when you launch R
, you simply open a console with the character >
signaling that R
is ready to accept an input. When you write a command and press Enter
, the command is interpreted by R
, and the result is printed immediately after the command. For example,
1 + 1
[1] 2
A little history: R
was modeled after the commercial statistical software S
by Robert Gentleman and Ross Ihaka. The project was started in 1992, first released in 1994, and the first stable version appeared in 2000. Today, R
is managed by the R Core Team.
For this introduction, we’re going to use RStudio
, an Integrated Development Environment (IDE) for R
. The main advantage is that the environment will look identical irrespective of your computer architecture (Linux, Windows, Mac). Also, RStudio
makes writing code much easier by automatically completing commands and file names (simply type the beginning of the name and press Tab
), and allowing you to easily inspect data and code.
Typically, an RStudio
window contains four panels:
R
. For this tutorial, we will work mainly in this panel.Ctrl + S
and then execute it by pressing Ctrl + Shift + S
.help(name_of_command)
in the Console) and packages.An R
program is simply a list of commands, which are executed one after the other. The commands are written in a text file (with extension .R
). When R
executes the program, it will start from the beginning of the file and proceed toward the end of the file. Every time R
encounters a command, it will execute it. Special commands can modify this basic flow of the program by, for example, executing a series of commands only when a condition is met, or repeating the execution of a series of commands multiple times.
Note that if you were to copy and paste (or type) the code into the Console you would obtain exactly the same result. Writing a program is advantageous, however, because the analysis can be automated, and the code shared with other researchers. Moreover, after a while you will have a large code base, so that you can recycle much of your code.
We start by working on the console, and then start writing simple scripts.
The most basic operation in any programming language is the assignment. In R
, assignment is marked by the operator <-
(can be typed quickly using Alt -
). When you type a command in R
, it is executed, and the output is printed in the Console. For example:
sqrt(9)
[1] 3
If we want to save the result of this operation, we can assign it to a variable. For example:
<- sqrt(9)
x x
[1] 3
What has happened? We wrote a command containing an assignment operator (<-
). R
has evaluated the right-hand-side of the command (sqrt(9)
), and has stored the result (3
) in a newly created variable called x
. Now we can use x
in our commands: every time the command needs to be evaluated, the program will look up which value is associated with the variable x
, and substitute it. For example:
* 2 x
[1] 6
R
provides different types of data that can be used in your programs. For each variable x
, calling class(x)
prints the type of the variable. The basic data types are:
logical
, taking only two possible values: TRUE
and FALSE
<- TRUE
v class(v)
[1] "logical"
numeric
, storing real numbers (actually, their approximations, as computers have limited memory and thus cannot store numbers like π, or even 0.2
)<- 3.77
v class(v)
[1] "numeric"
<- 6.022e23 # 6.022⋅10^23 (Avogadro's number)
v class(v)
[1] "numeric"
integer
, storing whole numbers<- 23L # the L signals that this should be stored as integer
v class(v)
[1] "integer"
complex
, storing complex numbers (i.e., with a real and an imaginary part)<- 23 + 5i # the i marks the imaginary part
v class(v)
[1] "complex"
character
, for strings, characters and text<- 'a string' # you can use single or double quotes
v class(v)
[1] "character"
In R
, the value and type of a variable are evaluated at run-time. This means that you can recycle the names of variables. This is very handy, but can make your programs more difficult to read and to debug (i.e., find mistakes). For example:
<- '2.3' # this is a string
x x
[1] "2.3"
<- 2.3 # this is numeric
x x
[1] 2.3
Each data type supports a certain number of operators and functions. For example, numeric variables can be combined with +
(addition), -
(subtraction), *
(multiplication), /
(division), and ^
(or **
, exponentiation). A possibly unfamiliar operator is the modulo (%%
), calculating the remainder of an integer division:
5 %% 3
[1] 2
meaning that 5 %/% 3
(5 integer divided by 3) is 1 with a remainder of 2
The modulo operator is useful to determine whether a number is divisible for another: if y
is divisible by x
, then y %% x
is 0.
R
provides many built-in functions: each functions has a name, followed by round parentheses surrounding the (possibly optional) function arguments. For example, these functions operate on numeric
variables:
abs(x)
absolute valuesqrt(x)
square rootround(x, digits = 3)
round x
to three decimal digitscos(x)
cosine (also supported are all the usual trigonometric functions)log(x)
natural logarithm (use log10
for base 10 logarithms)exp(x)
calculating \(e^x\)Similarly, character
variables have their own set of functions, such as:
toupper(x)
make uppercasenchar(x)
count the number of characters in the stringpaste(x, y, sep = "_")
concatenate strings, joining them using the separator _
strsplit(x, "_")
separate the string using the separator _
Calling a function meant for a certain data type on another will cause errors. If sensible, you can convert a type into another. For example:
<- "2.13"
v class(v)
[1] "character"
# if we call v * 2, we get an error.
# to avoid it, we can convert v to numeric:
as.numeric(v) * 2
[1] 4.26
If sensible, you can use the comparison operators >
(greater), <
(lower), ==
(equals), !=
(differs), >=
and <=
, returning a logical value:
2 == sqrt(4)
[1] TRUE
2 < sqrt(4)
[1] FALSE
2 <= sqrt(4)
[1] TRUE
Exercise:
Why are two equal signs (
==
) used to check that two values are equal? What happens if you use only one=
sign?
Similarly, you can concatenate several comparison and logical variables using &
(and), |
(or), and !
(not):
2 > 3) & (3 > 1) (
[1] FALSE
2 > 3) | (3 > 1) (
[1] TRUE
If you want to know more about a function, type ?my_function_name
in the console (e.g., ?abs
). This will open the help page in one of the panels on the right. The same can be accomplished calling help(abs)
. For more complex questions, check out stackoverflow.
Besides these simple types, R
provides structured data types, meant to collect and organize multiple values.
The most basic data structure in R
is the vector, which is an ordered collection of values of the same type. Vectors can be created by concatenating different values with the function c()
(“combine”):
<- c(2, 3, 5, 27, 31, 13, 17, 19)
x x
[1] 2 3 5 27 31 13 17 19
You can access the elements of a vector by their index: the first element is indexed at 1, the second at 2, etc.
3] x[
[1] 5
8] x[
[1] 19
9] # what if the element does not exist? x[
[1] NA
NA
stands for “Not Available”. Other special values are NaN
(Not a Number, e.g., 0/0
), Inf
(Infinity, e.g., 1/0
), and NULL
(variable undefined). You can test for special values using is.na(x)
, is.infinite(x)
, is.null(x)
, etc.
Note that in R
a single number (string, logical) is a vector of length 1 by default. That’s why if you type 3
in the console you see [1] 3
in the output.
You can extract several elements at once (i.e., create another vector), using the colon (:
) command, or by concatenating the indices:
1:3] x[
[1] 2 3 5
4:7] x[
[1] 27 31 13 17
c(1,3,5)] x[
[1] 2 5 31
You can also use a vector of logical variables to extract values from vectors. For example, suppose we have two vectors:
<- c("M", "M", "F", "M", "F") # sex of Drosophila
sex <- c(0.230, 0.281, 0.228, 0.260, 0.231) # weight in mg weight
and that we want to extract only the weights for the males.
== "M" sex
[1] TRUE TRUE FALSE TRUE FALSE
returns a vector of logical values, which we can use to subset the data:
== "M"] weight[sex
[1] 0.230 0.281 0.260
Given that R
was born for statistics, there are many statistical functions you can perform on vectors:
length(x)
[1] 8
min(x)
[1] 2
max(x)
[1] 31
sum(x) # sum all elements
[1] 117
prod(x) # multiply all elements
[1] 105436890
median(x) # median value
[1] 15
mean(x) # arithmetic mean
[1] 14.625
var(x) # unbiased sample variance
[1] 119.4107
mean(x ^ 2) - mean(x) ^ 2 # population variance
[1] 104.4844
summary(x) # print a summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 4.50 15.00 14.62 21.00 31.00
You can generate vectors of sequential numbers using the colon command:
<- 1:10
x x
[1] 1 2 3 4 5 6 7 8 9 10
For more complex sequences, use seq
:
seq(from = 1, to = 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
To repeat a value or a sequence several times, use rep
:
rep("abc", 3)
[1] "abc" "abc" "abc"
rep(c(1, 2, 3), 3)
[1] 1 2 3 1 2 3 1 2 3
Exercise:
- Create a vector containing all the even numbers between 2 and 100 (inclusive) and store it in variable
z
.- Extract all the elements of
z
that are divisible by 12. How many elements match this criterion?- What is the sum of all the elements of
z
?- Is it equal to \(51 \cdot 50\)?
- What is the product of elements 5, 10 and 15 of
z
?- Does
seq(2, 100, by = 2)
produce the same vector as(1:50) * 2
?- What happens if you type
z ^ 2
?
A matrix is a two-dimensional table of values. In case of numeric values, you can perform the usual operations on matrices (product, inverse, decomposition, etc.):
<- matrix(c(1, 2, 3, 4), 2, 2) # values, nrows, ncols
A A
[,1] [,2]
[1,] 1 3
[2,] 2 4
%*% A # matrix product A
[,1] [,2]
[1,] 7 15
[2,] 10 22
solve(A) # matrix inverse
[,1] [,2]
[1,] -2 1.5
[2,] 1 -0.5
%*% solve(A) # this should return the identity matrix A
[,1] [,2]
[1,] 1 0
[2,] 0 1
<- matrix(1, 3, 2) # you can fill the whole matrix with a single number (1)
B B
[,1] [,2]
[1,] 1 1
[2,] 1 1
[3,] 1 1
%*% t(B) # transpose B
[,1] [,2] [,3]
[1,] 2 2 2
[2,] 2 2 2
[3,] 2 2 2
<- matrix(1:9, 3, 3) # by default, matrices are filled by column
Z Z
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
To determine the dimensions of a matrix, use dim
:
dim(B)
[1] 3 2
dim(B)[1]
[1] 3
nrow(B)
[1] 3
dim(B)[2]
[1] 2
ncol(B)
[1] 2
nrow(B)
[1] 3
Use indices to access a particular row/column of a matrix:
Z
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
1, ] # first row Z[
[1] 1 4 7
2] # second column Z[,
[1] 4 5 6
1:2, 2:3] # submatrix with coefficients in first two rows, and second and third column Z [
[,1] [,2]
[1,] 4 7
[2,] 5 8
c(1, 3), c(1, 3)] # indexing non-adjacent rows/columns Z[
[,1] [,2]
[1,] 1 7
[2,] 3 9
Some functions use all the elements of the matrix:
sum(Z)
[1] 45
mean(Z)
[1] 5
Some functions apply the operation across a given dimension (e.g., columns) of the matrix:
rowSums(Z) # returns a vector of the sums of the values in each row
[1] 12 15 18
colSums(Z) # does the same for columns
[1] 6 15 24
rowMeans(Z) # returns a vector of the means of the values in each row
[1] 4 5 6
colMeans(Z) # does the same for columns
[1] 2 5 8
If you need tables with more than two dimensions, use arrays:
<- array(1:24, c(4, 3, 2))
M M
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
You can still determine the dimensions using:
dim(M)
[1] 4 3 2
and access the elements as done for matrices. One thing you should be paying attention to: R
drops dimensions that are not needed. So, if you access a “slice” of a 3-dimensional array:
1] M[, ,
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
you obtain a matrix:
dim(M[, , 1])
[1] 4 3
This can be problematic, for example, when your code expects an array and R
turns your data into a matrix (or you expect a matrix but find a vector). To avoid this behavior, add drop = FALSE
when subsetting:
dim(M[, , 1, drop = FALSE])
[1] 4 3 1
Vectors are good if each element is of the same type (e.g., numbers, strings). Lists are used when we want to store elements of different types, or more complex objects (e.g., vectors, matrices, even lists of lists). Each element of the list can be referenced either by its index, or by a label:
<- list(Names = c("a", "b", "c", "d"), Values = c(1, 2, 3))
mylist mylist
$Names
[1] "a" "b" "c" "d"
$Values
[1] 1 2 3
1]] # access first element using index mylist[[
[1] "a" "b" "c" "d"
2]] # access second element by index mylist[[
[1] 1 2 3
$Names # access second element by label mylist
[1] "a" "b" "c" "d"
"Names"]] # another way to access by label mylist[[
[1] "a" "b" "c" "d"
"Values"]][3] # access third element in second vector mylist[[
[1] 3
Data frames contain data organized like in a spreadsheet. The columns (typically representing different measurements) can be of different types (e.g., a column could be the date of measurement, another the weight of the individual, or the volume of the cell, or the treatment of the sample), while the rows typically represent different samples.
When you read a spreadsheet file in R
, it is automatically stored as a data frame. The difference between a matrix and a data frame is that in a matrix all the values are of the same type (e.g., all numeric), while in a data frame each column can be of a different type.
Because typing a data frame by hand would be tedious, let’s use a data set that is already available in R
:
data(trees) # girth, height and volume of cherry trees
str(trees) # structure of data frame
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
ncol(trees)
[1] 3
nrow(trees)
[1] 31
head(trees) # print the first few rows
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
summary(trees) # Quickly get an overview of the data frame.
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
$Girth # select column by name trees
[1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
[16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
[31] 20.6
$Height[1:5] # select column by name; return first five elements trees
[1] 70 65 63 72 81
1:3, ] #select rows 1 through 3 trees[
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
1:3, ]$Volume # select rows 1 through 3; return column Volume trees[
[1] 10.3 10.3 10.2
<- rbind(trees, c(13.25, 76, 30.17)) # add a row
trees <- cbind(trees, trees) # combine columns
trees_double colnames(trees) <- c("Circumference", "Height", "Volume") # change column names
Exercise:
- What is the average height of the cherry trees?
- What is the average girth of those that are more than 75 ft tall?
- What is the maximum height of trees with a volume between 15 and 35 ft\(^3\)?
In most cases, you will not generate your data in R
, but import it from a file. By far, the best option is to have your data in a comma separated value text file or in a tab separated file. Then, you can use the function read.csv
(or read.table
) to import your data. The syntax of the functions is as follows:
read.csv("MyFile.csv") # read the file MyFile.csv
read.csv("MyFile.csv", header = TRUE) # the file has a header
read.csv("MyFile.csv", sep = ';') # specify the column separator
read.csv("MyFile.csv", skip = 5) # skip the first 5 lines
Note that columns containing strings are typically converted to factors (categorical values, useful when performing regressions). To avoid this behavior, you can specify stringsAsFactors = FALSE
when calling the function.
Similarly, you can save your data frames using write.table
or write.csv
. Suppose you want to save the data frame MyDF
:
write.csv(MyDF, "MyFile.csv")
write.csv(MyDF, "MyFile.csv", append = TRUE) # append to the end of the file
write.csv(MyDF, "MyFile.csv", row.names = TRUE) # include the row names
write.csv(MyDF, "MyFile.csv", col.names = FALSE) # do not include column names
Let’s look at an example: Read a file containing data on the 6th chromosome for a number of Europeans (Data adapted from Stanford HGDP SNP Genotyping Data by John Novembre). This example shows that you can read data directly from the internet!
# The actual URL is
# https://github.com/StefanoAllesina/BSD-QBio4/raw/master/tutorials/basic_computing_1/data/H938_Euro_chr6.geno
<- read.table("https://tinyurl.com/y7vctq3v",
ch6 header = TRUE, stringsAsFactors = FALSE)
where header = TRUE
means that we want to take the first line to be a header containing the column names. How big is this table?
dim(ch6)
[1] 43141 7
we have 7 columns, but more than 40k rows! Let’s see the first few:
head(ch6)
CHR SNP A1 A2 nA1A1 nA1A2 nA2A2
1 6 rs4959515 A G 0 17 107
2 6 rs719065 A G 0 26 98
3 6 rs6596790 C T 0 4 119
4 6 rs6596796 A G 0 22 102
5 6 rs1535053 G A 5 39 80
6 6 rs12660307 C T 0 3 121
and the last few:
tail(ch6)
CHR SNP A1 A2 nA1A1 nA1A2 nA2A2
43136 6 rs10946282 C T 0 16 108
43137 6 rs3734763 C T 19 56 48
43138 6 rs960744 T C 32 60 32
43139 6 rs4428484 A G 1 11 112
43140 6 rs7775031 T C 26 56 42
43141 6 rs12213906 C T 1 11 112
The data contains the number of homozygotes (nA1A1
, nA2A2
) and heterozygotes (nA1A2
), for 43,141 single nucleotide polymorphisms (SNPs) obtained by sequencing European individuals:
CHR
The chromosome (6 in this case)SNP
The identifier of the Single Nucleotide PolymorphismA1
One of the allelesA2
The other allelenA1A1
The number of individuals with the particular combination of alleles.Exercise:
- How many individuals were sampled? Find the maximum of the sum
nA1A1 + nA1A2 + nA2A2
. Note: you can access the columns by index (e.g.,ch6[,5]
), or by name (e.g.,ch6$nA1A1
, or alsoch6[,"nA1A1"]
).- Try using the function
rowSums
to obtain the same result.- For how many SNPs do we have that all sampled individuals are homozygotes (i.e., all
A1A1
or allA2A2
)?- For how many SNPs, are more than 99% of the sampled individuals homozygous?
Now we turn to writing actual programs in the Source code panel. To start a new R
program, press Ctrl + Shift + N
. This will open an Untitled
script. Save the script by pressing Ctrl + S
: save it as conditional.R
in the directory programming_skills/sandbox/
. To make sure you’re working in the directory where the script is contained, on the menu on the top choose Session -> Set Working Directory -> To Source File Location
.
Now type the following script:
print("Hello world!")
<- 4
x print(x)
and execute the script by pressing Ctrl + Shift + S
. You should see Hello World!
and 4
printed in your console.
As you saw in this simple example, when R
executes the program, it starts from the top and proceeds toward the end of the file. Every time it encounters a command (for example, print(x)
, printing the value of x
into the console), it executes it.
When we want a certain block of code to be executed only when a certain condition is met, we can write a conditional branching point. The syntax is as follows:
if (condition is met){
# execute this block of code
else {
} # execute this other block of code
}
For example, add these lines to the script conditional.R
, and run it again:
print("Hello world!")
<- 4
x print(x)
if (x %% 2 == 0){
<- paste(x, "is even")
my_message else {
} <- paste(x, "is odd")
my_message
}print(my_message)
We have created a conditional branching point, so that the value of my_message
changes depending on whether x
is even (and thus the remainder of the integer division by 2 is 0), or odd. Change the line x <- 4
to x <- 131
and run it again.
Exercise: What does this do?
<- 36 x if (x > 20){ <- sqrt(x) x else { } <- x ^ 2 x }if (x > 7) { print(x) else if (x %% 2 == 1){ } print(x + 1) }
Another way to change the flow of the program is to write a loop. A loop is simply a series of commands that are repeated a number of times. For example, you want to run the same analysis on different data sets that you collected; you want to plot the results contained in a set of files; you want to test your simulation over a number of parameter sets; etc.
R
provides you with two ways to loop over blocks of commands: the for
loop, and the while
loop. Let’s start with the for
loop, which is used to iterate over a vector (or a list): for each value of the vector, a series of commands will be run, as shown by the following example, which you can type in a new script called forloop.R
.
<- 1:10 # vector with numbers from 1 to 10
myvec
for (i in myvec) {
<- i ^ 2
a print(a)
}
In the code above, the variable i
takes the value of each element of myvec
in sequence. Inside the block defined by the for
loop, you can use the variable i
to perform operations.
The anatomy of the for
statement:
for (variable in list_or_vector) {
execute these commands# automatically moves to the next value }
For loops are used when you know that you want to perform the analysis using a given set of values (e.g., run over all files of a directory, all samples in your data, all sequences of a fasta file, etc.).
The while
loop is used when the commands need to be repeated while a certain condition is true, as shown by the following example, which you can type in a script called whileloop.R
:
<- 1
i
while (i <= 10) {
<- i ^ 2
a print(a)
<- i + 1
i }
The script performs exactly the same operations we wrote for the for
loop above. Note that you need to update the value of i
, (using i <- i + 1
), otherwise the loop will run forever (infinite loop—to terminate click on the stop button in the top-right corner of the console). The anatomy of the while
statement:
while (condition is met) {
execute these commands# beware of infinite loops: remember to update the condition! }
You can break a loop using the command break
. For example:
<- 1
i
while (i <= 10) {
if (i > 5) {
break
}<- i ^ 2
a print(a)
<- i + 1
i }
Exercise: What does this do? Try to guess what each loop does, and then create and run a script to confirm your intuition.
<- seq(1, 1000, by = 3) z for (k in z) { if (k %% 4 == 0) { print(k) } }
<- readline(prompt = "Enter a number: ") z <- as.numeric(z) z <- TRUE isthisspecial <- 2 i while (i < z) { if (z %% i == 0) { <- FALSE isthisspecial break }<- i + 1 i }if (isthisspecial == TRUE) { print(z) }
Here’s a short list of useful functions that will help you write your programs:
range(x)
: minimum and maximum of a vector x
sort(x)
: sort a vector x
unique(x)
: remove duplicate entries from vector x
which(x == a)
: returns a vector of the indices of x
having value a
list.files("path_to_directory")
: list the files in a directory (current directory if not specified)table(x)
build a table of frequenciesExercises: What does this code do? For each snippet of code, first try to guess what will happen. Then, write a script and run it to confirm your intuition.
<- c(1, 3, 5, 5, 3, 1, 2, 4, 6, 4, 2) v <- sort(unique(v)) v for (i in v){ if (i > 2){ print(i) }if (i > 4){ break } }
<- 1:100 x <- x[which(x %% 7 == 0)] x
<- 10 my_amount while (my_amount > 0){ <- NA my_color while(is.na(my_color)){ <- readline(prompt="Do you want to bet on black or red? ") tmp <- tolower(tmp) tmp if (tmp == "black") my_color <- "black" if (tmp == "red") my_color <- "red" if (is.na(my_color)) print("Please enter either red or black") }<- NA my_bet while(is.na(my_bet)){ <- readline(prompt="How much do you want to bet? ") tmp <- as.numeric(tmp) tmp if (is.numeric(tmp) == FALSE){ print("Please enter a number") else { } if (tmp > my_amount){ print("You don't have enough money!") else { } <- tmp my_bet <- my_amount - tmp my_amount } } }<- sample(c("red", "black"), 1) lady_luck if (lady_luck == my_color){ <- my_amount + 2 * my_bet my_amount print(paste("You won!! Now you have", my_amount, "gold doubloons")) else { } print(paste("You lost!! Now you have", my_amount, "gold doubloons")) } }
R
is the most popular statistical computing software among biologists due to its highly specialized packages, often written by biologists for biologists. You can contribute a package too! The RStudio
support (goo.gl/harVqF
) provides guidance on how to start developing R
packages and Hadley Wickham’s free online book (r-pkgs.had.co.nz
) will make you a pro.
You can find highly specialized packages to address your research questions. Here are some suggestions for finding an appropriate package. The Comprehensive R Archive Network (CRAN) offers several ways to find specific packages for your task. You can either browse packages (goo.gl/7oVyKC
) and their short description or select a scientific field of interest (goo.gl/0WdIcu
) to browse through a compilation of packages related to each discipline.
From within your R
terminal or RStudio
you can also call the function RSiteSearch("KEYWORD")
, which submits a search query to the website search.r-project.org
. The website rseek.org
casts an even wider net, as it not only includes package names and their documentation but also blogs and mailing lists related to R
. If your research interests relate to high-throughput genomic data, you should have a look the packages provided by Bioconductor (goo.gl/7dwQlq
).
To install a package type
install.packages("name_of_package")
in the Console, or choose the panel Packages and then click on Install in RStudio
.
To load a package type
library(name_of_package)
or call the command into your script. If you want your script to automatically install a package in case it’s missing, use this boilerplate:
if (!require(needed_package, character.only = TRUE, quietly = TRUE)) {
install.packages(needed_package)
library(needed_package, character.only = TRUE)
}
For example, say we want to access the dataset bacteria
, which reports the incidence of H. influenzae in Australian children. The dataset is contained in the package MASS
.
First, we need to load the package:
library(MASS)
Now we can load the data:
data(bacteria)
1:3,] bacteria[
y ap hilo week ID trt
1 y p hi 0 X01 placebo
2 y p hi 2 X01 placebo
3 y p hi 4 X01 placebo
To perform randomization, or any simulation, we typically need to draw random numbers. R
has functions to sample random numbers from very many different statistical distributions. For example:
runif(5) # sample 5 numbers from the uniform distribution between 0 and 1
[1] 0.2693485 0.9740900 0.5165642 0.6801005 0.4315497
runif(5, min = 1, max = 9) # set the limits of the uniform distribution
[1] 4.659539 4.119449 4.398275 1.888714 6.093033
rnorm(3) # three values from standard normal
[1] -1.2730339 0.3212495 -0.9539879
rnorm(3, mean = 5, sd = 4) # specify mean and standard deviation
[1] 7.776821 7.388011 1.902903
To sample from a set of values, use sample
:
<- c("a", "b", "c", "d")
v sample(v, 2) # without replacement
[1] "d" "a"
sample(v, 6, replace = TRUE) # with replacement
[1] "d" "d" "a" "d" "d" "b"
sample(v) # simply shuffle the elements
[1] "d" "a" "b" "c"
The R
community provides about 7,000 packages. Still, sometimes there isn’t an already made function capable of doing what you need. In these cases, you can write your own functions. In fact, it is generally a good idea to always divide your analysis into functions, and then write a small “master” program that calls the functions and performs the analysis. In this way, the code will be much more legible, and you will be able to recycle the functions for your other projects.
A function in R
has this form:
<- function(optional, arguments, separated, by_commas){
my_function_name # Body of the function
# ...
#
return(return_value) # this is optional
}
A few examples:
<- function(a, b){
sum_two_numbers <- a + b
apb return(apb)
}sum_two_numbers(5, 7.2)
[1] 12.2
You can set a default value for some of the arguments: if not specified by the user, the function will use these defaults:
<- function(a = 1, b = 2){
sum_two_numbers <- a + b
apb return(apb)
}sum_two_numbers()
[1] 3
sum_two_numbers(3)
[1] 5
sum_two_numbers(b = 9)
[1] 10
The return value is optional:
<- function(a = 6){
my_factorial if (as.integer(a) != a) {
print("Please enter an integer!")
else {
} <- 1
tmp for (i in 2:a){
<- tmp * i
tmp
}print(paste(a, "! = ", tmp, sep = ""))
}
}my_factorial()
[1] "6! = 720"
my_factorial(10)
[1] "10! = 3628800"
You can return only one object. If you need to return multiple values, organize them into a vector/matrix/list and return that.
<- function(a, b){
order_two_numbers if (a > b) return(c(a, b)) #nothing after the first return is executed
return(c(b,a))
}
order_two_numbers(runif(1), runif(1))
[1] 0.6234127 0.5273124
During the class, we will write a lot of code, of increasing complexity. Here is what you should do to ensure that your programs are well-organized, easy to understand, and easy to debug.
For example, let’s write code that takes the data on Chromosome 6 we have seen above, and tries to identify which SNPs deviate the most from Hardy-Weinberg equilibrium. Remember that in an infinite population, where mating is random, there is no selection and no mutations, the proportion of people carrying the alleles \(A1A1\) should be approximately \(p_{11} = p^2\) (where \(p\) is the frequency of the first allele in the population \(p = p_{11} + \frac{1}{2} p_{12}\)), those carrying \(A1A2\) should be \(p_{12} = 2 p q\) (where \(q = 1-p\)) and finally those carrying \(A2A2\) should be \(p_{22} = q^2\). This is called the Hardy-Weinberg equilibrium.
We want to test this on a number of different SNPs. First, we write a function that takes as input the data and a given SNP, and computes the probability \(p\) of carrying the first allele.
<- function(my_data, my_SNP = "rs1535053"){
compute_probabilities_HW # Take a SNP and compute the probabilities
# p = frequency of first allele
# q = frequency of second allele (1 - p)
# p11 = proportion homozygous first allele
# p12 = proportion heterozygous
# p22 = proportion homozygous second allele
<- my_data[my_data$"SNP" == my_SNP,]
my_SNP_data <- my_SNP_data$nA1A1
AA <- my_SNP_data$nA1A2
AB <- my_SNP_data$nA2A2
BB <- AA + AB + BB
tot_observations <- AA / tot_observations
p11 <- AB / tot_observations
p12 <- BB / tot_observations
p22 <- p11 + p12 / 2
p <- 1 - p
q return(list(SNP = my_SNP,
p11 = p11,
p12 = p12,
p22 = p22,
p = p,
q = q,
tot = tot_observations,
AA = AA,
AB = AB,
BB = BB))
}
Now we can test our function:
compute_probabilities_HW(ch6)
$SNP
[1] "rs1535053"
$p11
[1] 0.04032258
$p12
[1] 0.3145161
$p22
[1] 0.6451613
$p
[1] 0.1975806
$q
[1] 0.8024194
$tot
[1] 124
$AA
[1] 5
$AB
[1] 39
$BB
[1] 80
If the allele conformed to Hardy-Weinberg, we should find approximately \(p^2 \cdot n\) people with \(A1A1\), where \(n\) is the number of people sampled. Let’s see whether these assumptions are met by the data:
<- function(SNP_data){
observed_vs_expected_HW # compute expectations under Hardy-Weinberg equilibrium
# organize expected and observed in a table
<- c("AA" = SNP_data$AA, "AB" = SNP_data$AB, "BB" = SNP_data$BB)
observed <- c("AA" = SNP_data$p^2 * SNP_data$tot,
expected "AB" = 2 * SNP_data$p * SNP_data$q * SNP_data$tot,
"BB" = SNP_data$q^2 * SNP_data$tot)
return(rbind(observed, expected))
}
And test it:
<- compute_probabilities_HW(ch6)
my_SNP_data observed_vs_expected_HW(my_SNP_data)
AA AB BB
observed 5.000000 39.00000 80.00000
expected 4.840726 39.31855 79.84073
Pretty good! This SNP seems very close to the theoretical expectation.
Let’s try another one
observed_vs_expected_HW(compute_probabilities_HW(ch6, "rs1316662"))
AA AB BB
observed 26.00000 62.00000 36.00000
expected 26.20161 61.59677 36.20161
Because we have so many SNPs, we will surely find some that do not comply with the expectation. For example:
<- compute_probabilities_HW(ch6, "rs6596835")
my_SNP_data observed_vs_expected_HW(my_SNP_data)
AA AB BB
observed 17.000000 24.0000 82.0000
expected 6.837398 44.3252 71.8374
To find those with the largest deviations, we can compute for the statistic:
\[ \sum_i \frac{(e_i - o_i)^2}{e_i} \] In genetics, this is called \(\chi^2\) statistics, because if the data were to follow the assumptions, these quantities would follow the \(\chi^2\) distribution.
<- function(my_obs_vs_expected){
compute_chi_sq_stat <- my_obs_vs_expected["observed",]
observed <- my_obs_vs_expected["expected",]
expected return(sum((expected - observed)^2 / expected))
}
Now let’s compute the statistic for each SNPs:
# because this might take a while, we're going to only analyze the first 1000 SNPs
<- ch6$SNP[1:1000]
all_SNPs <- data.frame(SNP = all_SNPs, ChiSq = 0)
results for (i in 1:nrow(results)){
2] <- compute_chi_sq_stat(observed_vs_expected_HW(compute_probabilities_HW(ch6, results[i, 1])))
results[i, }
To find the ones with the largest discrepancy, run
<- results[order(results$ChiSq, decreasing = TRUE),]
results head(results)
SNP ChiSq
10 rs2281351 53.993853
221 rs1933650 27.724832
36 rs6596835 25.862675
681 rs689035 9.802277
178 rs6930805 9.491511
179 rs1737539 9.491511
This example showed how a seemingly difficult problem can be decomposed in smaller problems that are easier to solve.
knitr
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.
Donald E. Knuth, Literate Programming, 1984
When doing experiments, we typically keep track of everything we do in a laboratory notebook, so that when writing the manuscript, or responding to queries, we can go back to our documentation to find exactly what we did, how we did it, and possibly why we did it. The same should be true for computational work.
RStudio
makes it very easy to build a computational laboratory notebook. First, create a new R Markdown
file (choose File
-> New File
-> R Markdown
from the menu).
The gist of it is that you write a text file (.Rmd
). The file is then read by an interpreter that transforms it into an .html
or .pdf
file, or even into a Word document. You can use special syntax to render the text in different ways. For example, type
***********
*Test* **Test2**
# Very large header
## Large header
### Smaller header
## Unordered lists
* First
* Second
+ Second 1
+ Second 2
1. This is
2. A numbered list
You can insert `inline code`
-----------
The most important feature of R Markdown
, however, is that you can include blocks of code, and they will be interpreted and executed by R
. You can therefore combine effectively the code itself with the description of what you are doing.
For example, including
{{r, eval=FALSE}} print("hello world!")
will become
print("hello world!")
[1] "hello world!"
If you don’t want to run the R
code, but just display it, use {r, eval = FALSE}
; if you want to show the output but not the code, use {r, echo = FALSE}
.
You can include plots, tables, and even render equations using LaTeX. In summary, when exploring your data or writing the methods of your paper, give R Markdown
a try!
You can find inspiration in the notes for this class: all are written in R Markdown
.
There are very many excellent books and tutorials you can read to become a proficient programmer in R
. For example: