Review of Chapter 8
Assignment and data types:
Use the assignment operator <- (Alt + - in RStudio); the equal sign = works as well, but is discouraged.
a <- 5 # create a variable called a and assign 5 to it
print(a)
To determine the type of a variable, use class(a):
class(a)
a <- "ciao" # note dynamic typing
class(a)
Basic data types in R:
character (strings)
numeric (real numbers)
integer (integer numbers)
complex (complex numbers)
logical (TRUE, FALSE)
factor (categorical values)
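As a quick check of the types listed above, this short sketch calls class() on one value of each type. The variable names are made up for illustration.

```r
# one example value per basic type (variable names are illustrative only)
x_chr <- "ciao"
x_num <- 3.14
x_int <- 5L                                 # the L suffix creates an integer
x_cpx <- 1 + 2i
x_lgl <- TRUE
x_fct <- factor(c("low", "high", "low"))

class(x_chr)  # "character"
class(x_num)  # "numeric"
class(x_int)  # "integer"
class(x_cpx)  # "complex"
class(x_lgl)  # "logical"
class(x_fct)  # "factor"
class(5)      # "numeric": a number typed without L is not an integer
```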
Operators
+ - * / ^ work as expected
x %% y modulus
x %/% y integer division
x %in% y test for membership
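A few concrete values may help to fix the less familiar operators. This is only a small sketch; the numbers are arbitrary.

```r
r1 <- 7 %% 3                  # modulus: remainder of 7 / 3
r2 <- 7 %/% 3                 # integer division: quotient rounded down
r3 <- (-7) %/% 3              # rounds toward minus infinity, not toward zero
r4 <- (-7) %% 3               # the result takes the sign of the divisor
r5 <- c(2, 3) %in% c(1, 3, 5) # one logical value per element on the left
r6 <- 2 ^ 3                   # exponentiation

r1  # 1
r2  # 2
r3  # -3
r4  # 2
r5  # FALSE TRUE
r6  # 8
```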
Data structures
Vectors
Contain a one-dimensional array of values of the same type:
v <- c(1, 2, 3, 4) # combine
# R starts counting at 1 (different from Python)
v[1]
v[1:2]
v[-1] # without the first element (different from Python)
v[c(1, 3)] # non-adjacent
v[c(TRUE, FALSE, TRUE, TRUE)] # using logical
v[v %in% c(1, 3)] # using membership
v[v < 3] # using logical condition
You can operate on whole vectors:
sum(v)
prod(v)
v ^ 2
sqrt(v)
Length of a vector:
length(v)
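The sketch below collects a few more whole-vector operations in one place: negative indices drop elements, arithmetic is applied element by element, and logical vectors can be summed to count matches. It redefines the same v as above so that it runs on its own.

```r
v <- c(1, 2, 3, 4)

dropped <- v[-c(1, 2)]             # drop the first two elements
doubled <- v * 2                   # arithmetic is element-wise
picked  <- v[v > 2 & v %% 2 == 0]  # combine conditions with &
n_large <- sum(v > 2)              # TRUE counts as 1, FALSE as 0

dropped  # 3 4
doubled  # 2 4 6 8
picked   # 4
n_large  # 2
```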
Matrices and arrays
Two- or multi-dimensional arrays containing data of the same type:
m <- matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)
m
Accessing rows/cols:
m[1,] # first row
m[,1] # first col
# note that R drops dimensions automatically
is.matrix(m[1,])
# to prevent it
is.matrix(m[1, , drop = FALSE])
Multidimensional:
aa <- array(data = 1:27, dim = c(3,3,3))
aa
Lists
Collection of objects indexed by position or name:
ll <- list(v = c(1,2,3), n = c("a", "b", "c"))
ll
ll$v # $ like . in Python
ll$n
ll[[1]] # note the double brackets
ll[[2]][c(1,3)]
Data frames
Possibly the most used data structure in R. Data frames store spreadsheet-like data:
df <- read.csv("Goldberg2010_data.csv",
               stringsAsFactors = FALSE, # before R 4.0.0, strings were converted to factors by default
               quote = "")
# first few rows; use tail(df) for the last few
head(df)
# structure
str(df)
# extract column
df$Species[1:2]
df[,"Species"][1:2]
df[1:2, 1]
# extract row by index
df[3:4,]
# extract row using logical operators
df[df$Species == "Acnistus_arborescens",]
df[df$Status == 2,]
Reading and writing data
For delimited text files, use read.table (space/tab separated), read.csv (comma-separated), or read.csv2 (semicolon-separated). The corresponding functions write.table, write.csv, and write.csv2 write such files.
Important options:
- reading: stringsAsFactors = FALSE (read strings as character instead of factors; this is the default since R 4.0.0)
- writing: row.names = FALSE (do not write row numbers)
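A minimal round trip shows both options at work. The toy data frame and the temporary file are assumptions made for this sketch; they are not part of the course data.

```r
# toy data frame (hypothetical values, for illustration only)
toy <- data.frame(Species = c("sp_a", "sp_b"),
                  Status = c(1, 2),
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")        # temporary file, removed below
write.csv(toy, tmp, row.names = FALSE)   # without this option, row names become an extra first column

back <- read.csv(tmp, stringsAsFactors = FALSE)
str(back)                                # Species is read back as character, not factor

unlink(tmp)                              # clean up
```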
Conditional branching
if (condition == TRUE){
  # this is executed when the condition is true
} else {
  # this when the condition is false
}
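A concrete instance of the template above, as a small sketch with arbitrary values. Note that if expects a single logical value; for a whole vector, the vectorized ifelse() is the usual choice.

```r
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}
parity  # "odd"

# vectorized version: one answer per element
parities <- ifelse(c(1, 2, 3) %% 2 == 0, "even", "odd")
parities  # "odd" "even" "odd"
```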
Looping
for loop:
for (i in a_vector_or_list){
  do_something(i)
}
Example:
for (i in 2:10){
  print(c(i, i * (i - 1) / 2))
}
while loop:
while (a_condition_is_true){
  do_something()
  # update condition!
}
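As a runnable sketch of the while template, the loop below counts how many times 100 can be halved (with integer division) before reaching 1. The starting value is arbitrary.

```r
n <- 100
steps <- 0
while (n > 1) {
  n <- n %/% 2        # update the value tested in the condition
  steps <- steps + 1
}
steps  # 6 (100 -> 50 -> 25 -> 12 -> 6 -> 3 -> 1)
```

Forgetting the update inside the body is the classic way to write an infinite loop.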
User-defined functions
Anatomy:
my_func <- function(arg1 = "default_value", arg2){
  # ...
  # body of the function
  # ...
  # return statement
  return(my_result)
}
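A small concrete function following this anatomy. The name n_pairs and its arguments are invented for this sketch; it returns the number of distinct pairs among n items, n * (n - 1) / 2.

```r
# n_pairs is a hypothetical example function, not part of any package
n_pairs <- function(n, verbose = FALSE){
  result <- n * (n - 1) / 2
  if (verbose) {
    print(paste("pairs among", n, "items:", result))
  }
  return(result)
}

n_pairs(4)                # 6
n_pairs(n = 10)           # 45; arguments can be passed by name
sapply(2:4, n_pairs)      # 1 3 6; apply the function to each element
```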
Warm-up exercise: TED Talks
For our warm-up, we are going to use a spreadsheet with information on 992 TED talks. The data were adapted from
Kinnaird, Katherine M. and John Laudun. 2018. TED Talks Data Set.

Plot a histogram for the number of views. Is the distribution approximately log-normal?

Transform the duration to seconds
Hint: Look here

Plot duration in seconds vs. log number of views: does duration correlate with views?

Count the number of days since publication, and plot against log views
Hint: Look here

Find the top 10 tags

For each of the top tags, add a column to the data frame specifying whether the tag is present
Hint: you could use the function grepl
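To illustrate the hint, here is a minimal sketch of grepl on a toy data frame. The column names (headline, tags), the tag strings, and the comma-separated layout are assumptions; the real spreadsheet may be organized differently.

```r
# toy data (hypothetical; not the actual TED data set)
talks <- data.frame(
  headline = c("talk A", "talk B", "talk C"),
  tags = c("science, technology", "culture, art", "technology, design"),
  stringsAsFactors = FALSE
)

top_tags <- c("technology", "science")   # assumed list of top tags

for (tag in top_tags){
  # one logical column per tag: TRUE where the tag string occurs
  talks[[tag]] <- grepl(tag, talks$tags, fixed = TRUE)
}

talks$technology  # TRUE FALSE TRUE
talks$science     # TRUE FALSE FALSE
```

Note that grepl matches substrings, so a short tag such as "art" would also match inside a longer word; a stricter pattern may be needed for real data.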
- Build a linear model with
  - Response variable = log(views)
  - Explanatory variables = published_days, seconds, technology, science, culture, etc.
- Which tags significantly increase views?
Hint: Look here
- Add to the model the effect of the top 10 speakers by number of talks. Does this improve the fit?
Here’s a possible solution