# CSB_2019

## Review of Chapter 8

Assignment and data types:

Use the assignment operator `<-` (`Alt + -` in RStudio); the equal sign `=` also works for assignment, but most style guides discourage it.

``````
a <- 5 # create a variable called a and assign 5 to it
print(a)
``````

To determine the type of a variable, use `class(a)`:

``````
class(a)
a <- "ciao" # note dynamic typing
class(a)
``````

Basic data types in `R`:

• `character` (strings)
• `numeric` (real numbers)
• `integer` (integer numbers)
• `complex` (complex numbers)
• `logical` (`TRUE`, `FALSE`)
• `factor` (categorical values)
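As a quick check of the list above, the following sketch creates one value of each type and asks `class()` for its type. The variable names are only illustrative.

```r
# One value of each basic type (variable names are illustrative)
x_chr <- "ciao"
x_num <- 3.14
x_int <- 5L                                # the L suffix requests an integer
x_cpx <- 1 + 2i
x_lgl <- TRUE
x_fct <- factor(c("low", "high", "low"))   # categorical values

class(x_chr)
class(x_num)
class(x_int)
class(x_cpx)
class(x_lgl)
class(x_fct)
levels(x_fct)   # levels are sorted alphabetically by default
```

Note that a plain `5` is stored as `numeric`; only `5L` (or `as.integer(5)`) gives an `integer`.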

Operators

• `+ - * / ^` work as expected
• `x %% y` modulus
• `x %/% y` integer division
• `x %in% y` test for membership
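A small sketch of the less familiar operators, using arbitrary numbers chosen for illustration:

```r
remainder <- 7 %% 3              # what is left after dividing 7 by 3
quotient  <- 7 %/% 3             # how many whole times 3 fits into 7
power     <- 2 ^ 10
found     <- c(2, 5) %in% c(1, 2, 3)   # one logical value per element on the left

remainder
quotient
power
found

# With a negative left operand, the result of %% takes the sign of the divisor
neg_remainder <- -7 %% 3
neg_remainder
```

All of these operators are vectorized, so they apply element by element when given vectors.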

### Data structures

Vectors

Contain a one-dimensional array of values of the same type:

``````
v <- c(1, 2, 3, 4) # combine
# R starts counting at 1 (different from Python)
v
v[1:2]
v[-1] # without the first element (different from Python)
v[c(TRUE, FALSE, TRUE, TRUE)] # using logical
v[v %in% c(1, 3)] # using membership
v[v < 3] # using logical condition
``````

You can operate on whole vectors:

``````
sum(v)
prod(v)
v ^ 2
sqrt(v)
``````

Length of a vector:

``````
length(v)
``````

Matrices and arrays

Two- or multi-dimensional arrays containing data of the same type:

``````
m <- matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)
m
``````

Accessing rows/cols:

``````
m[1,] # first row
m[,1] # first col
# note that R drops dimensions automatically
is.matrix(m[1,])
# to prevent it
is.matrix(m[1, , drop = FALSE])
``````

Multidimensional:

``````
aa <- array(data = 1:27, dim = c(3,3,3))
aa
``````
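Indexing an array works like indexing a matrix, with one index per dimension. The sketch below also shows that `array` (like `matrix` without `byrow = TRUE`) fills values column by column; the names `first_slice` and `one_value` are only illustrative.

```r
aa <- array(data = 1:27, dim = c(3, 3, 3))
dim(aa)

first_slice <- aa[, , 1]   # fixing the third index leaves a 3 x 3 matrix
first_slice

one_value <- aa[1, 2, 3]   # row 1, column 2, slice 3
one_value

# Values are filled column by column, so the first row of the slice is 1, 4, 7
first_slice[1, ]

# Compare with a matrix filled row by row
m <- matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)
m[1, ]
```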

Lists

Collection of objects indexed by position or name:

``````
ll <- list(v = c(1,2,3), n = c("a", "b", "c"))
ll
ll$v # $ like . in Python
ll$n
ll[[1]] # note the double brackets
ll[[1]][c(1,3)]
``````
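The difference between single and double brackets is a common source of confusion, so here is a minimal sketch: single brackets return a smaller list, while double brackets (or `$`) return the element itself. The names `sub_list` and `element` are only illustrative.

```r
ll <- list(v = c(1, 2, 3), n = c("a", "b", "c"))

sub_list <- ll["v"]     # single brackets: still a list, with one element
element  <- ll[["v"]]   # double brackets: the numeric vector itself

class(sub_list)
class(element)

picked <- ll[["v"]][c(1, 3)]   # extract the vector, then index into it
picked

names(ll)   # the names available for $ and [[ ]]
```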

Data frames

Possibly, the most used data structure in `R`. Store spreadsheet-like data:

``````
df <- read.csv("Goldberg2010_data.csv",
               stringsAsFactors = FALSE, # before R 4.0, strings were converted to factors by default
               quote = "")
# first few rows; use tail(df) for the last few
head(df)
# structure
str(df)
# extract column
df$Species[1:2]
df[,"Species"][1:2]
df[1:2, 1]
# extract row by index
df[3:4,]
# extract row using logical operators
df[df$Species == "Acnistus_arborescens",]
df[df$Status == 2,]
``````

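To read delimited text files, use `read.table` (whitespace-separated by default), `read.csv` (comma-separated), or `read.csv2` (semicolon-separated, with a comma as the decimal mark). The matching functions `write.table`, `write.csv`, and `write.csv2` write such files.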

Important options:

• reading: `stringsAsFactors = FALSE` reads strings as `character` instead of `factor` (this is the default since R 4.0)
• writing: `row.names = FALSE` (do not write row numbers)
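A minimal round trip illustrating both options. The data frame `toy` and its values are made up for this sketch, and the file is written to a temporary location rather than the working directory.

```r
# A made-up data frame, used only for illustration
toy <- data.frame(Species = c("sp_a", "sp_b"),
                  Status = c(1, 2),
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra, unnamed column of row numbers
write.csv(toy, tmp, row.names = FALSE)

# read the file back, keeping strings as character
back <- read.csv(tmp, stringsAsFactors = FALSE)
str(back)

unlink(tmp)   # remove the temporary file
```

Without `row.names = FALSE`, the file gains a first column of row numbers, which comes back as an extra column when the file is read again.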

### Conditional branching

``````
if (condition == TRUE){
  # this is executed when the condition is true
} else {
  # this when the condition is false
}
``````
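A concrete instance of the template, with an arbitrary value of `x`. Because `if` tests a single condition, the sketch also shows `ifelse`, which applies the same test to every element of a vector.

```r
x <- 7

if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
print(parity)

# ifelse() is the vectorized counterpart: one answer per element
labels <- ifelse(c(1, 2, 3, 4) %% 2 == 0, "even", "odd")
labels
```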

### Looping

`for` loop:

``````
for (i in a_vector_or_list){
  do_something(i)
}
``````

Example:

``````
for (i in 2:10){
  print(c(i, i * (i - 1) / 2 ))
}
``````

`while` loop:

``````
while (a_condition_is_true){
  do_something()
  # update condition!
}
``````
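A small runnable example of the pattern, with arbitrary starting values: halve a number until it drops below 1, counting the steps. The update of `x` inside the body is what eventually makes the condition false.

```r
x <- 100
n_steps <- 0

while (x >= 1) {
  x <- x / 2               # update the value tested by the condition
  n_steps <- n_steps + 1
}

n_steps
x
```

If the body never changed `x`, the condition would stay true and the loop would never end.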

### User-defined functions

Anatomy:

``````
my_func <- function(arg1 = "default_value", arg2){
  # ...
  # body of the function
  # ...
  # return statement
  return(my_result)
}
``````
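As an illustration of this anatomy, the sketch below wraps the expression from the `for` loop example, `i * (i - 1) / 2`, in a function. The name `count_pairs` and the default value are assumptions made for this example.

```r
# Number of distinct pairs that can be formed from n items
count_pairs <- function(n = 2){
  result <- n * (n - 1) / 2
  return(result)
}

count_pairs()      # uses the default, n = 2
count_pairs(10)    # argument matched by position
count_pairs(n = 4) # argument matched by name
count_pairs(2:4)   # arithmetic is vectorized, so vectors work too
```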

### Warmup exercise: TED Talks

For our warmup, we are going to use a spreadsheet with information on 992 TED talks. The data were adapted from

Kinnaird, Katherine M. and John Laudun. 2018. TED Talks Data Set.

• Plot a histogram for the number of views. Is the distribution approximately log-normal?

• Transform the `duration` to seconds

Hint: Look here

• Plot duration in seconds vs. log number of views: does duration correlate with views?

• Count the number of days since publication, and plot against log views

Hint: Look here

• Find the top 10 tags

• For each of the top tags, add a column to the data frame specifying whether the tag is present

Hint: you could use the function `grepl`
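To show how `grepl` behaves, here is a sketch on a made-up data frame. The column name `tags`, the comma-separated format, and the three talks are assumptions for illustration; the real data may be laid out differently.

```r
# Made-up stand-in for the TED data (column name and format are assumptions)
talks <- data.frame(
  tags = c("technology,science", "culture", "science,culture"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE for each row; fixed = TRUE matches the text literally
for (tag in c("technology", "science", "culture")) {
  talks[[tag]] <- grepl(tag, talks$tags, fixed = TRUE)
}

talks
```

Note that `grepl` matches substrings, so a short tag would also match inside a longer one that contains it.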

• Build a linear model with:
  • Response variable: log(views)
  • Explanatory variables: published_days, seconds, technology, science, culture, etc.
• Which tags significantly increase views?

Hint: Look here

• Add to the model the effect of the top 10 speakers by number of talks. Does this improve the fit?

Here’s a possible solution