read.csv("MyFile.csv") # read the file MyFile.csv
read.csv("MyFile.csv", header = TRUE) # the file has a header
read.csv("MyFile.csv", sep = ';') # specify the column separator
read.csv("MyFile.csv", skip = 5) # skip the first 5 lines
R
efresher
Goal
Introduce the statistical software R
, and show how it can be used to analyze biological data in an automated, replicable way. Showcase the RStudio
development environment, illustrate the notion of assignment, present the main data structures available in R
. Show how to read and write data, how to execute simple programs, and how to modify the stream of execution of a program through conditional branching and looping. Introduce the use of packages and user-defined functions.
Motivation
When it comes to analyzing data, there are two competing paradigms. First, one could use point-and-click software with a graphical user interface, such as Excel, to perform calculations and draw graphs; second, one could write programs that can be run to perform the analysis of the data, the generation of tables and statistics, and the production of figures automatically.
This latter approach is to be preferred, because it allows for the automation of analysis, it requires a good documentation of the procedures, and is completely replicable.
A few motivating examples:
You have written code to analyze your data. You receive from your collaborators a new batch of data. With simple modifications of your code, you can update your results, tables and figures automatically.
A new student joins the laboratory. The new student can read the code and understand the analysis without the need of a lab mate showing the procedure step-by-step.
The reviewers of your manuscript ask you to slightly alter the analysis. Rather than having to start over, you can modify a few lines of code and satisfy the reviewers.
Here we introduce R
, which can help you write simple programs to analyze your data, perform statistical analysis, and draw beautiful figures.
Before we start
To follow this tutorial, you will need to install R
and RStudio
- Install
R
: download and installR
from this page. Choose the right architecture (Windows, Mac, Linux). If possible, install the latest release. - Install
RStudio
: go to this page and download the “RStudio Desktop Open Source License”. - Install
R
packages: launchRStudio
. Click on “Packages” in the bottom-right panel. Click on “Install”: a dialog window will open. Typetidyverse
in the field “Packages” and click on “Install”. This might take a few minutes, and ask you to download further packages.
What is R?
R
is a statistical software that is completely programmable. This means that one can write a program (script) containing a series of commands for the analysis of data, and execute them automatically. This approach is especially good as it makes the analysis of data well-documented, and completely replicable.
R
is free software: anyone can download its source code, modify it, and improve it. The R
community of users is vast and very active. In particular, scientists have enthusiastically embraced the program, creating thousands of packages to perform specific types of analysis, and adding many new capabilities. You can find a list of official packages (which have been vetted by R
core developers) here; many more are available on GitHub and other websites.
The main hurdle new users face when approaching R
is that it is based on a command line interface: when you launch R
, you simply open a console with the character >
signaling that R
is ready to accept an input. When you write a command and press Enter
, the command is interpreted by R
, and the result is printed immediately after the command. For example,
A little history: R
was modeled after the commercial statistical software S
by Robert Gentleman and Ross Ihaka. The project was started in 1992, first released in 1994, and the first stable version appeared in 2000. Today, R
is managed by the R Core Team.
RStudio
For this introduction, we’re going to use RStudio
, an Integrated Development Environment (IDE) for R
. The main advantage is that the environment will look identical irrespective of your computer architecture (Linux, Windows, Mac). Also, RStudio
makes writing code much easier by automatically completing commands and file names (simply type the beginning of the name and press Tab
), and allowing you to easily inspect data and code.
Typically, an RStudio
window contains four panels:
- Console This is a panel containing an instance of
R
. For this tutorial, we will work mainly in this panel. - Source code In this panel, you can write a program, save it to a file pressing
Ctrl + S
and then execute it by pressingCtrl + Shift + S
. - Environment This panel lists all the variables you created (more on this later); another tab shows you the history of the commands you typed.
- Plots This panel shows you all the plots you drew. Other tabs allow you to access the list of packages you have loaded, and the help page for commands (just type
help(name_of_command)
in the Console) and packages.
How to write a simple program
An R
program is simply a list of commands, which are executed one after the other. The commands are written in a text file (with extension .R
). When R
executes the program, it will start from the beginning of the file and proceed toward the end of the file. Every time R
encounters a command, it will execute it. Special commands can modify this basic flow of the program by, for example, executing a series of commands only when a condition is met, or repeating the execution of a series of commands multiple times.
Note that if you were to copy and paste (or type) the code into the Console you would obtain exactly the same result. Writing a program is advantageous, however, because the analysis can be automated, and the code shared with other researchers. Moreover, after a while you will have a large code base, so that you can recycle much of your code.
We start by working on the console, and then start writing simple scripts.
The most basic operation: assignment
The most basic operation in any programming language is the assignment. In R
, assignment is marked by the operator <-
(can be typed quickly using Alt -
). When you type a command in R
, it is executed, and the output is printed in the Console. For example:
If we want to save the result of this operation, we can assign it to a variable. For example:
What has happened? We wrote a command containing an assignment operator (<-
). R
has evaluated the right-hand-side of the command (sqrt(9)
), and has stored the result (3
) in a newly created variable called x
. Now we can use x
in our commands: every time the command needs to be evaluated, the program will look up which value is associated with the variable x
, and substitute it. For example:
Data types
R
provides different types of data that can be used in your programs. For each variable x
, calling class(x)
prints the type of the variable. The basic data types are:
logical
, taking only two possible values:TRUE
andFALSE
numeric
, storing real numbers (actually, their approximations, as computers have limited memory and thus cannot store numbers like π, or even0.2
)
- Real numbers can also be specified using scientific notation:
integer
, storing whole numbers
complex
, storing complex numbers (i.e., with a real and an imaginary part)
character
, for strings, characters and text
In R
, the value and type of a variable are evaluated at run-time. This means that you can recycle the names of variables. This is very handy, but can make your programs more difficult to read and to debug (i.e., find mistakes). For example:
Operators and functions
Each data type supports a certain number of operators and functions. For example, numeric variables can be combined with +
(addition), -
(subtraction), *
(multiplication), /
(division), and ^
(or **
, exponentiation). A possibly unfamiliar operator is the modulo (%%
), calculating the remainder of an integer division:
meaning that 5 %/% 3
(5 integer divided by 3) is 1 with a remainder of 2
The modulo operator is useful to determine whether a number is divisible for another: if y
is divisible by x
, then y %% x
is 0.
R
provides many built-in functions: each functions has a name, followed by round parentheses surrounding the (possibly optional) function arguments. For example, these functions operate on numeric
variables:
abs(x)
absolute valuesqrt(x)
square rootround(x, digits = 3)
roundx
to three decimal digitscos(x)
cosine (also supported are all the usual trigonometric functions)log(x)
natural logarithm (uselog10
for base 10 logarithms)exp(x)
calculating \(e^x\)
Similarly, character
variables have their own set of functions, such as:
toupper(x)
make uppercasenchar(x)
count the number of characters in the stringpaste(x, y, sep = "_")
concatenate strings, joining them using the separator_
strsplit(x, "_")
separate the string using the separator_
Calling a function meant for a certain data type on another will cause errors. If sensible, you can convert a type into another. For example:
If sensible, you can use the comparison operators >
(greater), <
(lower), ==
(equals), !=
(differs), >=
and <=
, returning a logical value:
Exercise:
Why are two equal signs (
==
) used to check that two values are equal? What happens if you use only one=
sign?
Similarly, you can concatenate several comparison and logical variables using &
(and), |
(or), and !
(not):
Getting help
If you want to know more about a function, type ?my_function_name
in the console (e.g., ?abs
). This will open the help page in one of the panels on the right. The same can be accomplished calling help(abs)
. For more complex questions, check out stackoverflow.
Data structures
Besides these simple types, R
provides structured data types, meant to collect and organize multiple values.
Vectors
The most basic data structure in R
is the vector, which is an ordered collection of values of the same type. Vectors can be created by concatenating different values with the function c()
(“combine”):
You can access the elements of a vector by their index: the first element is indexed at 1, the second at 2, etc.
NA
stands for “Not Available”. Other special values are NaN
(Not a Number, e.g., 0/0
), Inf
(Infinity, e.g., 1/0
), and NULL
(variable undefined). You can test for special values using is.na(x)
, is.infinite(x)
, is.null(x)
, etc.
Note that in R
a single number (string, logical) is a vector of length 1 by default. That’s why if you type 3
in the console you see [1] 3
in the output.
You can extract several elements at once (i.e., create another vector), using the colon (:
) command, or by concatenating the indices:
You can also use a vector of logical variables to extract values from vectors. For example, suppose we have two vectors:
and that we want to extract only the weights for the males.
returns a vector of logical values, which we can use to subset the data:
Given that R
was born for statistics, there are many statistical functions you can perform on vectors:
You can generate vectors of sequential numbers using the colon command:
For more complex sequences, use seq
:
To repeat a value or a sequence several times, use rep
:
Exercise:
- Create a vector containing all the even numbers between 2 and 100 (inclusive) and store it in variable
z
.- Extract all the elements of
z
that are divisible by 12. How many elements match this criterion?- What is the sum of all the elements of
z
?- Is it equal to \(51 \cdot 50\)?
- What is the product of elements 5, 10 and 15 of
z
?- Does
seq(2, 100, by = 2)
produce the same vector as(1:50) * 2
?- What happens if you type
z ^ 2
?
Matrices
A matrix is a two-dimensional table of values. In case of numeric values, you can perform the usual operations on matrices (product, inverse, decomposition, etc.):
To determine the dimensions of a matrix, use dim
:
Use indices to access a particular row/column of a matrix:
Some functions use all the elements of the matrix:
Some functions apply the operation across a given dimension (e.g., columns) of the matrix:
Arrays
If you need tables with more than two dimensions, use arrays:
You can still determine the dimensions using:
and access the elements as done for matrices. One thing you should be paying attention to: R
drops dimensions that are not needed. So, if you access a “slice” of a 3-dimensional array:
you obtain a matrix:
This can be problematic, for example, when your code expects an array and R
turns your data into a matrix (or you expect a matrix but find a vector). To avoid this behavior, add drop = FALSE
when subsetting:
Lists
Vectors are good if each element is of the same type (e.g., numbers, strings). Lists are used when we want to store elements of different types, or more complex objects (e.g., vectors, matrices, even lists of lists). Each element of the list can be referenced either by its index, or by a label:
Data frames
Data frames contain data organized like in a spreadsheet. The columns (typically representing different measurements) can be of different types (e.g., a column could be the date of measurement, another the weight of the individual, or the volume of the cell, or the treatment of the sample), while the rows typically represent different samples.
When you read a spreadsheet file in R
, it is automatically stored as a data frame. The difference between a matrix and a data frame is that in a matrix all the values are of the same type (e.g., all numeric), while in a data frame each column can be of a different type.
Because typing a data frame by hand would be tedious, let’s use a data set that is already available in R
:
Exercise:
- What is the average height of the cherry trees?
- What is the average girth of those that are more than 75 ft tall?
- What is the maximum height of trees with a volume between 15 and 35 ft\(^3\)?
Reading and writing data
In most cases, you will not generate your data in R
, but import it from a file. By far, the best option is to have your data in a comma separated value text file or in a tab separated file. Then, you can use the function read.csv
(or read.table
) to import your data. The syntax of the functions is as follows:
Note that columns containing strings are typically converted to factors (categorical values, useful when performing regressions). To avoid this behavior, you can specify stringsAsFactors = FALSE
when calling the function.
Similarly, you can save your data frames using write.table
or write.csv
. Suppose you want to save the data frame MyDF
:
write.csv(MyDF, "MyFile.csv")
write.csv(MyDF, "MyFile.csv", append = TRUE) # append to the end of the file
write.csv(MyDF, "MyFile.csv", row.names = TRUE) # include the row names
write.csv(MyDF, "MyFile.csv", col.names = FALSE) # do not include column names
Let’s look at an example: Read a file containing data on the 6th chromosome for a number of Europeans (Data adapted from Stanford HGDP SNP Genotyping Data by John Novembre). This example shows that you can read data directly from the internet!
where header = TRUE
means that we want to take the first line to be a header containing the column names. How big is this table?
we have 7 columns, but more than 40k rows! Let’s see the first few:
and the last few:
The data contains the number of homozygotes (nA1A1
, nA2A2
) and heterozygotes (nA1A2
), for 43,141 single nucleotide polymorphisms (SNPs) obtained by sequencing European individuals:
CHR
The chromosome (6 in this case)SNP
The identifier of the Single Nucleotide PolymorphismA1
One of the allelesA2
The other allelenA1A1
The number of individuals with the particular combination of alleles.
Exercise:
- How many individuals were sampled? Find the maximum of the sum
nA1A1 + nA1A2 + nA2A2
. Note: you can access the columns by index (e.g.,ch6[,5]
), or by name (e.g.,ch6$nA1A1
, or alsoch6[,"nA1A1"]
).- Try using the function
rowSums
to obtain the same result.- For how many SNPs do we have that all sampled individuals are homozygotes (i.e., all
A1A1
or allA2A2
)?- For how many SNPs, are more than 99% of the sampled individuals homozygous?
Conditional branching
Now we turn to writing actual programs in the Source code panel. To start a new R
program, press Ctrl + Shift + N
. This will open an Untitled
script. Save the script by pressing Ctrl + S
: save it as conditional.R
in the directory programming_skills/sandbox/
. To make sure you’re working in the directory where the script is contained, on the menu on the top choose Session -> Set Working Directory -> To Source File Location
.
Now type the following script:
print("Hello world!")
<- 4
x print(x)
and execute the script by pressing Ctrl + Shift + S
. You should see Hello World!
and 4
printed in your console.
As you saw in this simple example, when R
executes the program, it starts from the top and proceeds toward the end of the file. Every time it encounters a command (for example, print(x)
, printing the value of x
into the console), it executes it.
When we want a certain block of code to be executed only when a certain condition is met, we can write a conditional branching point. The syntax is as follows:
if (condition is met){
# execute this block of code
else {
} # execute this other block of code
}
For example, add these lines to the script conditional.R
, and run it again:
print("Hello world!")
<- 4
x print(x)
if (x %% 2 == 0){
<- paste(x, "is even")
my_message else {
} <- paste(x, "is odd")
my_message
}print(my_message)
We have created a conditional branching point, so that the value of my_message
changes depending on whether x
is even (and thus the remainder of the integer division by 2 is 0), or odd. Change the line x <- 4
to x <- 131
and run it again.
Exercise: What does this do?
{webr-r} x <- 36 if (x > 20){ x <- sqrt(x) } else { x <- x ^ 2 } if (x > 7) { print(x) } else if (x %% 2 == 1){ print(x + 1) }
Looping
Another way to change the flow of the program is to write a loop. A loop is simply a series of commands that are repeated a number of times. For example, you want to run the same analysis on different data sets that you collected; you want to plot the results contained in a set of files; you want to test your simulation over a number of parameter sets; etc.
R
provides you with two ways to loop over blocks of commands: the for
loop, and the while
loop. Let’s start with the for
loop, which is used to iterate over a vector (or a list): for each value of the vector, a series of commands will be run, as shown by the following example, which you can type in a new script called forloop.R
.
<- 1:10 # vector with numbers from 1 to 10
myvec
for (i in myvec) {
<- i ^ 2
a print(a)
}
In the code above, the variable i
takes the value of each element of myvec
in sequence. Inside the block defined by the for
loop, you can use the variable i
to perform operations.
The anatomy of the for
statement:
for (variable in list_or_vector) {
execute these commands# automatically moves to the next value }
For loops are used when you know that you want to perform the analysis using a given set of values (e.g., run over all files of a directory, all samples in your data, all sequences of a fasta file, etc.).
The while
loop is used when the commands need to be repeated while a certain condition is true, as shown by the following example, which you can type in a script called whileloop.R
:
<- 1
i
while (i <= 10) {
<- i ^ 2
a print(a)
<- i + 1
i }
The script performs exactly the same operations we wrote for the for
loop above. Note that you need to update the value of i
, (using i <- i + 1
), otherwise the loop will run forever (infinite loop—to terminate click on the stop button in the top-right corner of the console). The anatomy of the while
statement:
while (condition is met) {
execute these commands# beware of infinite loops: remember to update the condition! }
You can break a loop using the command break
. For example:
<- 1
i
while (i <= 10) {
if (i > 5) {
break
}<- i ^ 2
a print(a)
<- i + 1
i }
Exercise: What does this do? Try to guess what each loop does, and then create and run a script to confirm your intuition.
<- seq(1, 1000, by = 3) z for (k in z) { if (k %% 4 == 0) { print(k) } }
<- readline(prompt = "Enter a number: ") z <- as.numeric(z) z <- TRUE isthisspecial <- 2 i while (i < z) { if (z %% i == 0) { <- FALSE isthisspecial break }<- i + 1 i }if (isthisspecial == TRUE) { print(z) }
Useful Functions
Here’s a short list of useful functions that will help you write your programs:
range(x)
: minimum and maximum of a vectorx
sort(x)
: sort a vectorx
unique(x)
: remove duplicate entries from vectorx
which(x == a)
: returns a vector of the indices ofx
having valuea
list.files("path_to_directory")
: list the files in a directory (current directory if not specified)table(x)
build a table of frequencies
Exercises: What does this code do? For each snippet of code, first try to guess what will happen. Then, write a script and run it to confirm your intuition.
<- c(1, 3, 5, 5, 3, 1, 2, 4, 6, 4, 2) v <- sort(unique(v)) v for (i in v){ if (i > 2){ print(i) }if (i > 4){ break } }
<- 1:100 x <- x[which(x %% 7 == 0)] x
<- 10 my_amount while (my_amount > 0){ <- NA my_color while(is.na(my_color)){ <- readline(prompt="Do you want to bet on black or red? ") tmp <- tolower(tmp) tmp if (tmp == "black") my_color <- "black" if (tmp == "red") my_color <- "red" if (is.na(my_color)) print("Please enter either red or black") }<- NA my_bet while(is.na(my_bet)){ <- readline(prompt="How much do you want to bet? ") tmp <- as.numeric(tmp) tmp if (is.numeric(tmp) == FALSE){ print("Please enter a number") else { } if (tmp > my_amount){ print("You don't have enough money!") else { } <- tmp my_bet <- my_amount - tmp my_amount } } }<- sample(c("red", "black"), 1) lady_luck if (lady_luck == my_color){ <- my_amount + 2 * my_bet my_amount print(paste("You won!! Now you have", my_amount, "gold doubloons")) else { } print(paste("You lost!! Now you have", my_amount, "gold doubloons")) } }
Packages
R
is the most popular statistical computing software among biologists due to its highly specialized packages, often written by biologists for biologists. You can contribute a package too! The RStudio
support (goo.gl/harVqF
) provides guidance on how to start developing R
packages and Hadley Wickham’s free online book (r-pkgs.had.co.nz
) will make you a pro.
You can find highly specialized packages to address your research questions. Here are some suggestions for finding an appropriate package. The Comprehensive R Archive Network (CRAN) offers several ways to find specific packages for your task. You can either browse packages (goo.gl/7oVyKC
) and their short description or select a scientific field of interest (goo.gl/0WdIcu
) to browse through a compilation of packages related to each discipline.
From within your R
terminal or RStudio
you can also call the function RSiteSearch("KEYWORD")
, which submits a search query to the website search.r-project.org
. The website rseek.org
casts an even wider net, as it not only includes package names and their documentation but also blogs and mailing lists related to R
. If your research interests relate to high-throughput genomic data, you should have a look the packages provided by Bioconductor (goo.gl/7dwQlq
).
Installing a package
To install a package type
install.packages("name_of_package")
in the Console, or choose the panel Packages and then click on Install in RStudio
.
Loading a package
To load a package type
library(name_of_package)
or call the command into your script. If you want your script to automatically install a package in case it’s missing, use this boilerplate:
if (!require(needed_package, character.only = TRUE, quietly = TRUE)) {
install.packages(needed_package)
library(needed_package, character.only = TRUE)
}
Example
For example, say we want to access the dataset bacteria
, which reports the incidence of H. influenzae in Australian children. The dataset is contained in the package MASS
.
First, we need to load the package:
Now we can load the data:
Random numbers
To perform randomization, or any simulation, we typically need to draw random numbers. R
has functions to sample random numbers from very many different statistical distributions. For example:
To sample from a set of values, use sample
:
Writing functions
The R
community provides about 7,000 packages. Still, sometimes there isn’t an already made function capable of doing what you need. In these cases, you can write your own functions. In fact, it is generally a good idea to always divide your analysis into functions, and then write a small “master” program that calls the functions and performs the analysis. In this way, the code will be much more legible, and you will be able to recycle the functions for your other projects.
A function in R
has this form:
<- function(optional, arguments, separated, by_commas){
my_function_name # Body of the function
# ...
#
return(return_value) # this is optional
}
A few examples:
You can set a default value for some of the arguments: if not specified by the user, the function will use these defaults:
The return value is optional:
You can return only one object. If you need to return multiple values, organize them into a vector/matrix/list and return that.
Organizing and running code
During the class, we will write a lot of code, of increasing complexity. Here is what you should do to ensure that your programs are well-organized, easy to understand, and easy to debug.
- Take the problem, and divide it into its basic building blocks. Each block should be its own function.
- Write the code for each building block separately, and test it thoroughly.
- Extensively document the code, so that you can understand what you did, how you did it, and why.
- Combine the building blocks into a master program.
For example, let’s write code that takes the data on Chromosome 6 we have seen above, and tries to identify which SNPs deviate the most from Hardy-Weinberg equilibrium. Remember that in an infinite population, where mating is random, there is no selection and no mutations, the proportion of people carrying the alleles \(A1A1\) should be approximately \(p_{11} = p^2\) (where \(p\) is the frequency of the first allele in the population \(p = p_{11} + \frac{1}{2} p_{12}\)), those carrying \(A1A2\) should be \(p_{12} = 2 p q\) (where \(q = 1-p\)) and finally those carrying \(A2A2\) should be \(p_{22} = q^2\). This is called the Hardy-Weinberg equilibrium.
We want to test this on a number of different SNPs. First, we write a function that takes as input the data and a given SNP, and computes the probability \(p\) of carrying the first allele.
Now we can test our function:
If the allele conformed to Hardy-Weinberg, we should find approximately \(p^2 \cdot n\) people with \(A1A1\), where \(n\) is the number of people sampled. Let’s see whether these assumptions are met by the data:
And test it:
Pretty good! This SNP seems very close to the theoretical expectation.
Let’s try another one
Because we have so many SNPs, we will surely find some that do not comply with the expectation. For example:
To find those with the largest deviations, we can compute for the statistic:
\[ \sum_i \frac{(e_i - o_i)^2}{e_i} \] In genetics, this is called \(\chi^2\) statistics, because if the data were to follow the assumptions, these quantities would follow the \(\chi^2\) distribution.
Now let’s compute the statistic for each SNPs:
To find the ones with the largest discrepancy, run
This example showed how a seemingly difficult problem can be decomposed in smaller problems that are easier to solve.
Documenting the code using knitr
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.
Donald E. Knuth, Literate Programming, 1984
When doing experiments, we typically keep track of everything we do in a laboratory notebook, so that when writing the manuscript, or responding to queries, we can go back to our documentation to find exactly what we did, how we did it, and possibly why we did it. The same should be true for computational work.
RStudio
makes it very easy to build a computational laboratory notebook. First, create a new R Markdown
file (choose File
-> New File
-> R Markdown
from the menu).
The gist of it is that you write a text file (.Rmd
). The file is then read by an interpreter that transforms it into an .html
or .pdf
file, or even into a Word document. You can use special syntax to render the text in different ways. For example, type
***********
*Test* **Test2**
# Very large header
## Large header
### Smaller header
## Unordered lists
* First
* Second
+ Second 1
+ Second 2
1. This is
2. A numbered list
You can insert `inline code`
-----------
The most important feature of R Markdown
, however, is that you can include blocks of code, and they will be interpreted and executed by R
. You can therefore combine effectively the code itself with the description of what you are doing.
For example, including
{r, eval=FALSE} print("hello world!")
will become
If you don’t want to run the R
code, but just display it, use {r, eval = FALSE}
; if you want to show the output but not the code, use {r, echo = FALSE}
.
You can include plots, tables, and even render equations using LaTeX. In summary, when exploring your data or writing the methods of your paper, give R Markdown
a try!
You can find inspiration in the notes for this class: all are written in R Markdown
.
Resources
There are very many excellent books and tutorials you can read to become a proficient programmer in R
. For example: