CSB_2019

Lecture notes and exercises for Computing Skills for Biologists --- Winter 2019

Review of Chapter 8

Assignment and data types:

Use the assignment operator <- (Alt + - in RStudio); the equal sign = also works for assignment, but most R style guides discourage it.

a <- 5 # create a variable called a and assign 5 to it
print(a)

To determine the type of a variable, use class(a):

class(a)
a <- "ciao" # note dynamic typing
class(a)

Basic data types in R:

  • character (strings)
  • numeric (real numbers)
  • integer (integer numbers)
  • complex (complex numbers)
  • logical (TRUE, FALSE)
  • factor (categorical values)
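
As a quick check of the list above, class() can be called on one value of each type. This is a minimal sketch; the values are arbitrary illustrations. Note that a factor is stored internally as integer codes plus labels, so it is a class built on top of integer rather than a separate storage type.

```r
class("ciao")     # "character"
class(3.14)       # "numeric"
class(5L)         # "integer" (the L suffix marks an integer literal)
class(2 + 3i)     # "complex"
class(TRUE)       # "logical"
f <- factor(c("low", "high", "low"))
class(f)          # "factor"
levels(f)         # "high" "low" (levels are sorted alphabetically by default)
typeof(f)         # "integer" (the underlying storage of a factor)
```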

Operators

  • + - * / ^ work as expected
  • x %% y modulus
  • x %/% y integer division
  • x %in% y test for membership
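
The less familiar operators are easiest to understand from a few concrete values. The numbers below are arbitrary examples, not part of the course data.

```r
7 %% 3                    # 1: remainder of 7 divided by 3
7 %/% 3                   # 2: integer part of 7 / 3
2 ^ 10                    # 1024
-7 %% 3                   # 2: the result takes the sign of the divisor
3 %in% c(1, 2, 3)         # TRUE
c(1, 5) %in% c(1, 2, 3)   # TRUE FALSE: one result per element on the left
```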

Data structures

Vectors

A vector contains a one-dimensional sequence of values of the same type:

v <- c(1, 2, 3, 4) # combine
# R starts counting at 1 (different from Python)
v[1]
v[1:2]
v[-1] # without the first element (different from Python)
v[c(1, 3)] # non-adjacent
v[c(TRUE, FALSE, TRUE, TRUE)] # using logical
v[v %in% c(1, 3)] # using membership
v[v < 3] # using logical condition

You can operate on whole vectors:

sum(v)
prod(v)
v ^ 2
sqrt(v)

Length of a vector:

length(v)

Matrices and arrays

Two- or multi-dimensional arrays containing data of the same type:

m <- matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)
m

Accessing rows/cols:

m[1,] # first row
m[,1] # first col
# note that R drops dimensions automatically
is.matrix(m[1,])
# to prevent it
is.matrix(m[1, , drop = FALSE])

Multidimensional:

aa <- array(data = 1:27, dim = c(3,3,3))
aa
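
Indexing a multidimensional array works like indexing a matrix, with one index per dimension. The sketch below rebuilds the same array so that it runs on its own; R fills arrays column by column, which explains the value found at position [1, 2, 3].

```r
aa <- array(data = 1:27, dim = c(3, 3, 3))
dim(aa)        # 3 3 3
aa[1, 2, 3]    # 22: row 1, column 2, third slice
aa[, , 1]      # the first 3 x 3 slice, returned as a matrix
```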

Lists

Collection of objects indexed by position or name:

ll <- list(v = c(1,2,3), n = c("a", "b", "c"))
ll
ll$v # $ like . in Python
ll$n
ll[[1]] # note the double brackets
ll[[2]][c(1,3)]

Data frames

Data frames are possibly the most used data structure in R. They store spreadsheet-like data:

df <- read.csv("Goldberg2010_data.csv",
               stringsAsFactors = FALSE, # before R 4.0, strings were read as factors (categorical values) by default
               quote = "")
# first few rows --- tail(df) for the last few
head(df)
# structure
str(df)
# extract column
df$Species[1:2]
df[,"Species"][1:2]
df[1:2, 1]
# extract row by index
df[3:4,]
# extract row using logical operators
df[df$Species == "Acnistus_arborescens",]
df[df$Status == 2,]

Reading and writing data

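For delimited text files, use read.table (whitespace-separated by default; set sep for other delimiters), read.csv (comma-separated), or read.csv2 (semicolon-separated, with a comma as the decimal mark). The matching functions write.table, write.csv, and write.csv2 write these formats.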

Important options:

  • reading: stringsAsFactors = FALSE (read strings as character instead of factors; this is the default since R 4.0)
  • writing: row.names = FALSE (do not write row numbers)
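
A round trip through a temporary file shows both options at work. This is a minimal sketch: the data frame toy and its column names are made up for illustration and are not part of the course data.

```r
# Hypothetical toy data; the column names are invented for this example
toy <- data.frame(species = c("A_one", "B_two"),
                  status = c(1, 2),
                  stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")        # temporary file, so nothing is overwritten
write.csv(toy, tmp, row.names = FALSE)   # do not write row numbers
back <- read.csv(tmp, stringsAsFactors = FALSE)
str(back)                                # species is chr, status is int
unlink(tmp)                              # clean up the temporary file
```

Without row.names = FALSE, the file would gain an extra unnamed first column holding the row numbers, which then shows up as a spurious column when the file is read back.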

Conditional branching

if (condition){
  # this is executed when the condition is true
} else {
  # this when the condition is false
}
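
A concrete instance of the template above, using an arbitrary value for x. The last line shows ifelse(), which applies the same kind of test to every element of a vector.

```r
x <- 7   # arbitrary example value
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
print(parity)   # "odd"
# ifelse() is the vectorized counterpart of if / else
parities <- ifelse(c(1, 2, 3) %% 2 == 0, "even", "odd")
parities        # "odd" "even" "odd"
```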

Looping

for loop:

for (i in a_vector_or_list){
  do_something(i)
}

Example:

for (i in 2:10){
  print(c(i, i * (i - 1) / 2 ))
}

while loop:

while (a_condition_is_true){
  do_something()
  # update condition!
}
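
As an illustration of the while template, the sketch below counts how many times an arbitrary starting value can be halved before it drops below 1. The variable names are chosen for this example only.

```r
x <- 100      # arbitrary starting value
halvings <- 0
while (x >= 1) {
  x <- x / 2                 # this update eventually makes the condition false
  halvings <- halvings + 1
}
halvings   # 7
```

Forgetting the update inside the body is the usual cause of an infinite loop.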

User-defined functions

Anatomy:

my_func <- function(arg1 = "default_value", arg2){
  # ...
  # body of the function
  # ...
  # return statement
  return(my_result)
}
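
A small function following this anatomy, written as a sketch. The name coef_var and its argument as_percent are hypothetical and were chosen for this example.

```r
# Hypothetical example: coefficient of variation of a numeric vector
coef_var <- function(x, as_percent = FALSE) {
  cv <- sd(x) / mean(x)      # standard deviation relative to the mean
  if (as_percent) {
    cv <- cv * 100
  }
  return(cv)
}
coef_var(c(2, 4, 6))                      # 0.5
coef_var(c(2, 4, 6), as_percent = TRUE)   # 50
```

Arguments with a default value, such as as_percent, can be omitted in the call; arguments without a default must be supplied.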

Warmup exercise: TED Talks

For our warmup, we are going to use a spreadsheet with information on 992 TED talks. The data were adapted from

Kinnaird, Katherine M. and John Laudun. 2018. TED Talks Data Set.

  • Plot a histogram for the number of views. Is the distribution approximately log-normal?

  • Transform the duration to seconds

Hint: Look here

  • Plot duration in seconds vs. log number of views: does duration correlate with views?

  • Count the number of days since publication, and plot against log views

Hint: Look here

  • Find the top 10 tags

  • For each of the top tags, add a column to the data frame specifying if the tag is present

Hint: you could use the function grepl

  • Build a linear model with
    • Response variable = log(views)
    • Explanatory variables = published_days, seconds, technology, science, culture, etc.
    • Which tags significantly increase views?

Hint: Look here

  • Add to the model the effect of the top 10 speakers by number of talks. Does this improve the fit?

Here’s a possible solution
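
For the step that adds one column per tag, the grepl hint can be sketched on toy data. This is only an illustration under stated assumptions: the data frame talks, its tags column, and the comma-separated tag format are all hypothetical, and the real spreadsheet may be laid out differently.

```r
# Hypothetical toy data; column names and tag format are assumptions
talks <- data.frame(tags = c("technology,science", "culture", "science,culture"),
                    stringsAsFactors = FALSE)
for (tag in c("science", "culture")) {
  # fixed = TRUE matches the tag as a literal string, not a regular expression
  talks[[tag]] <- grepl(tag, talks$tags, fixed = TRUE)
}
talks
```

Because grepl matches substrings, a tag such as science would also match a longer tag that contains it (for example neuroscience), so the result may need checking on the real data.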