3 Data wrangling
3.1 Goal
Learn how to manipulate large data sets by writing efficient, consistent, and compact code. Introduce the use of dplyr, tidyr, and the “pipe” operator |>. Effortlessly produce statistics for grouped data. Massage data into “tidy” form.
3.2 What is data wrangling?
As biologists living in the 21st century, we are often faced with tons of data, possibly replicated over several organisms, treatments, or locations. We would like to streamline and automate our analysis as much as possible, writing scripts that are easy to read, fast to run, and easy to debug. Base R can get the job done, but often the code contains complicated operations, and a lot of $ signs and brackets.
We’re going to learn about the packages dplyr and tidyr, which are part of the tidyverse and can be used to manipulate large data frames in a simple and straightforward way. These tools are also much faster than the corresponding base R commands, are very compact, and can be concatenated into “pipelines”.
We are going to use the data set penguins from the package palmerpenguins, which we already saw last week.
3.3 A new data type, tibble
The data is stored in a “tibble”:
In fact, dplyr ships with a new data type, called a tibble. To convert a data.frame into a tibble, use as_tibble:
The nice feature of tbl objects is that they will print only what fits on the screen, and also give you useful information on the size of the data, as well as the type of data in each column. Other than that, a tbl object behaves very much like a data.frame. In some rare cases, you want to transform the tbl back into a data.frame. For this, use the function as.data.frame(tbl_object).
We can take a look at the data using one of several functions:
- head(dt) shows the first few rows
- tail(dt) shows the last few rows
- glimpse(dt) gives a summary of the data (similar to str in base R)
- View(dt) opens the data in a spreadsheet-like window
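For example, assuming palmerpenguins and dplyr are installed, a quick sketch of inspecting the data:

```r
library(palmerpenguins) # provides the penguins tibble
library(dplyr)          # provides glimpse()

dt <- penguins
head(dt)    # first six rows
glimpse(dt) # compact summary: one line per column, with types
```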
3.4 Selecting rows and columns
There are many ways to subset the data, either by row (subsetting the observations), or by column (subsetting the variables). For example, let’s select only the rows with observations from the island Torgersen:
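A minimal sketch of the corresponding call, assuming dplyr and palmerpenguins are loaded:

```r
library(dplyr)
library(palmerpenguins)

# keep only rows recorded on Torgersen island
filter(penguins, island == "Torgersen")
```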
We have 52 observations. We have used the command filter(tbl, conditions) to select certain observations. We can combine several conditions by listing them side by side, possibly using logical operators.
Exercise: what does this do?
filter(penguins, bill_length_mm > 40, bill_depth_mm > 20, sex == male)
We can also select particular variables (columns) using the function select(tbl, cols to select). For example, select species and island:
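A sketch of the call, assuming the packages are loaded:

```r
library(dplyr)
library(palmerpenguins)

# keep only the species and island columns
select(penguins, species, island)
```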
How many species are represented in the data set? We can use the function distinct(tbl, cols to select) to retain only the rows that differ from each other:
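One way to write this (a sketch, using the nested style we have used so far):

```r
library(dplyr)
library(palmerpenguins)

# unique values of species
distinct(select(penguins, species))
```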
This shows that there are three species, once we remove the duplicates. There are many other ways to subset observations:
- slice_sample(tbl, n = howmany, replace = TRUE) samples howmany rows at random (with replacement)
- slice_sample(tbl, prop = proportion, replace = FALSE) samples a certain proportion (e.g., 0.2 for 20%) of rows at random, without replacement
- slice(tbl, 5:20) extracts rows 5 to 20
- slice_max(penguins, body_mass_g, n = 10) extracts the top 10 rows, once ordered by body_mass_g
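A few of these in action (a sketch, assuming dplyr and palmerpenguins are loaded):

```r
library(dplyr)
library(palmerpenguins)

slice_sample(penguins, n = 5)            # 5 random rows
slice(penguins, 5:20)                    # rows 5 to 20
slice_max(penguins, body_mass_g, n = 10) # the 10 heaviest penguins
```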
More ways to select columns:
- select(penguins, contains("mm")) selects all columns containing the string mm
- select(penguins, -year, -body_mass_g) excludes the columns year and body_mass_g
- select(penguins, matches("length|bill")) selects all columns whose names match a regular expression
3.5 Creating pipelines using |>
We’ve been calling nested functions, such as distinct(select(penguins, species)). If you had to add another layer or two, the code would become unreadable. dplyr allows you to “un-nest” these functions and create a “pipeline” in which you concatenate commands separated by a special operator, |>. For example:
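A sketch of the piped version of the nested call above:

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  select(species) |>
  distinct()
```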
does exactly the same operations as the command above, but is much more readable. By concatenating many commands, you can create incredibly complex pipelines while retaining readability. It is also quite easy to add another piece of the pipeline in between commands, or to comment some of the pipeline out.
Another advantage of pipelines is that they help with name completion. In fact, RStudio runs your pipeline in the background while you type it. Try typing dt |> filter( and then start typing bill and press Tab: you will see the options to complete the column name; choose one with your arrows and hit Return. Back ticks will be added automatically if needed (e.g., for column names containing spaces, or starting with a digit).
3.6 Producing summaries
Sometimes we need to calculate statistics on certain columns. For example, let’s calculate the average body mass of the penguins. We can do this using summarise (you can use British or American spelling):
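For instance (a sketch; the column name avg_body_mass is our choice):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
```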
where we used na.rm = TRUE to ignore missing values. This command returns a tbl object with just the average body mass. You can combine multiple statistics (use first, last, min, max, n [count the number of rows], n_distinct [count the number of distinct values], mean, median, var, sd, etc.):
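For example, combining several statistics in one call (a sketch; the column names are our choices):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  summarise(
    n_rows   = n(),
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE),
    sd_mass  = sd(body_mass_g, na.rm = TRUE)
  )
```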
3.7 Summaries by group
One of the most useful features of dplyr is the ability to produce statistics for the data once subsetted by groups. For example, we would like to compute the average body mass by species and sex:
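One way to write this (a sketch, combining group_by with summarise):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(species, sex) |>
  summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
```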
showing that male penguins are heavier in all three species considered.
Exercise: find the average bill_depth_mm and bill_length_mm by species and sex. Filter the data to consider only observations for the year 2008.
3.8 Ordering the data
To order the data according to one or more variables, use arrange():
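For example, ordering by decreasing body mass (a sketch):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  arrange(desc(body_mass_g)) # heaviest penguins first; NAs go last
```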
3.9 Renaming columns
To rename one or more columns, use rename():
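For example (a sketch; the new name goes on the left of the =):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  rename(mass = body_mass_g)
```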
3.10 Adding new variables using mutate
If you want to add one or more new columns, with the content being a function of other columns, use the function mutate. For example, we are going to add a new column showing the z-score for the body mass of each individual:
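A sketch, computing the z-score over the whole data set (the column name z_mass is our choice):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(z_mass = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) /
                  sd(body_mass_g, na.rm = TRUE))
```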
We can pipe the results to ggplot for plotting!
You can use the function transmute() to create a new column and drop the original columns.
Most importantly, you can use mutate and transmute on grouped data. For example, let’s recompute the z-score of body_mass_g once the data is grouped by species and sex:
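Now the mean and standard deviation are computed within each group (a sketch):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(species, sex) |>
  mutate(z_mass = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) /
                  sd(body_mass_g, na.rm = TRUE)) |>
  ungroup()
```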
3.11 Data wrangling
Data is rarely in a format that is good for computing, and much effort goes into reading the data and wrestling with it to make it into a good format. As the name implies, the tidyverse strongly advocates for the use of data in tidy form. What does this mean?
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
This is often called narrow (or long) table format. Any other form of data (e.g., wide table format) is considered messy. However, often data are not organized in tidy form, or we want to produce tables for human consumption rather than computer consumption. The package tidyr allows us to accomplish just that. It contains only a few, very powerful functions. To explore this issue, we build a data set containing the average body mass by species and sex:
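A possible construction (a sketch; the name pen_stats and the column name body_mass are our choices):

```r
library(dplyr)
library(palmerpenguins)

pen_stats <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(body_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop")
pen_stats
```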
3.12 From narrow to wide
Our data is in tidy form. For a paper, we want to show the difference between males and females in a table:
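This is a job for pivot_wider (a sketch; we rebuild the summary table from scratch so the chunk runs on its own, with pen_stats a name we chose):

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

pen_stats <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(body_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop")

# one column per sex, one row per species
pen_stats |>
  pivot_wider(names_from = sex, values_from = body_mass)
```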
where we have created new column names using the values found in sex (hence, names_from), and filled each cell with the corresponding value found in body_mass (hence, values_from). Similarly, if we want to show the data with species as column names, and sex as rows, we can use:
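For instance (a sketch, again rebuilding the summary first so the chunk is self-contained):

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

pen_stats <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(body_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop")

# one column per species, one row per sex
pen_stats |>
  pivot_wider(names_from = species, values_from = body_mass)
```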
3.13 From wide to narrow
For a real-world example, we will make data from:
Tree-ring analysis for sustainable harvest of Millettia stuhlmannii in Mozambique, I.A.D. Remane and M.D. Therrell, South African Journal of Botany, Volume 125, September 2019, Pages 120–125.
You can read a tab-separated file from:
Each column besides YEAR represents a single tree, and each cell contains the diameter (in cm) of the tree when it was at a given age. To put this in tidy form, we first create the columns tree and diameter:
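This is a job for pivot_longer. A sketch using a small made-up stand-in for the tree-ring table (the real data would come from the tab-separated file above):

```r
library(tidyr)

# made-up stand-in for the tree-ring table: one column per tree
dt <- data.frame(YEAR = c(1980, 1981, 1982),
                 T1   = c(1.1, 2.0, 2.8),
                 T2   = c(0.9, 1.7, NA))

dt_long <- dt |>
  pivot_longer(-YEAR, names_to = "tree", values_to = "diameter")
dt_long
```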
and then remove the NAs:
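For example with drop_na (a sketch, reusing a small made-up table):

```r
library(tidyr)

dt_long <- data.frame(YEAR     = c(1980, 1980, 1981),
                      tree     = c("T1", "T2", "T2"),
                      diameter = c(1.1, 0.9, NA))

# keep only rows where the diameter was recorded
dt_long |>
  drop_na(diameter)
```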
Now it is easy to plot the growth trajectory of each tree (as in Fig. 3 of the original paper):
3.14 Separate: split a column into two or more
The complement of separate is called unite.
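A sketch of both functions, on a hypothetical column holding two values at once:

```r
library(tidyr)

d <- data.frame(sample = c("A_2010", "B_2011"))

# split one column into two
d2 <- d |>
  separate(sample, into = c("site", "year"), sep = "_")
d2

# unite() does the reverse, gluing the columns back together
d2 |>
  unite("sample", site, year, sep = "_")
```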
3.15 Separate rows: from one row to many
To make it into tidy form, only one record per row:
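For example with separate_rows, on a hypothetical column storing comma-separated values:

```r
library(tidyr)

d <- data.frame(species = c("A", "B"),
                years   = c("2010,2012", "2011"))

# one row per value in the years column
d |>
  separate_rows(years, sep = ",")
```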
3.16 Example: brown bear, brown bear, what do you see?
This exercise uses a dataset from GBIF, the Global Biodiversity Information Facility. You can download the latest version yourself by doing the following (but just skip ahead if you want to use the data provided by us).
- Go to GBIF and click on Occurrences.
- Under Scientific Name type in Ursus arctos (brown bear), and hit enter.
- To download the data, create an account on GBIF
- Then click on Download, and select Simple (which should have a tab-delimited .csv file)
- Save to the data folder in your working folder.
If you don’t want to go through all this, you can load a previously downloaded file called Ursus_GBIF.csv from our GitHub repository. The code in the following chunk loads and displays the contents of the tibble:
You see there are 50 variables in the data set, so it may be useful to remove the ones we don’t need. For this exercise, our objective is to plot the occurrences of this species on the world map, so we need two variables for certain: decimalLatitude and decimalLongitude, as well as basisOfRecord for additional information. Use your tidyverse skills to create a new tibble with only those variables. In addition, remove duplicate records from the tibble.
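A possible solution (a sketch; ursus is a hypothetical name for the tibble loaded above, here replaced by a made-up stand-in so the chunk runs on its own):

```r
library(dplyr)

# made-up stand-in for the GBIF download
ursus <- tibble(
  decimalLatitude  = c(61.2, 61.2, 45.5),
  decimalLongitude = c(10.4, 10.4, -93.1),
  basisOfRecord    = c("HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
                       "PRESERVED_SPECIMEN"),
  countryCode      = c("NO", "NO", "US")
)

ursus |>
  select(decimalLatitude, decimalLongitude, basisOfRecord) |>
  distinct()
```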
Now we can plot this data set on the world map, using the useful package maps. To plot, use the ggplot() syntax with the following addition:
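The addition is borders("world") (a ggplot2 helper that draws the map outline via the maps package). A sketch with made-up coordinates:

```r
library(ggplot2)

# made-up occurrence coordinates
occ <- data.frame(decimalLongitude = c(10.4, -93.1),
                  decimalLatitude  = c(61.2, 45.5))

p <- ggplot(occ) +
  borders("world") +  # world map outline; requires the maps package
  geom_point(aes(x = decimalLongitude, y = decimalLatitude)) +
  coord_quickmap()
p
```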
Note the warning message generated by ggplot. Then consider the map with the locations of the brown bear specimens. Do any of them seem strange to you? What may be the explanation behind these strange data points? Now filter out the points that you identified as suspicious and print out their basisOfRecord. Does this suggest an explanation for the strangeness?
3.17 Resources
- R for Data Science
- A cool class at U of C in Social Sciences
- Data transformation cheat sheet
- Dealing with dates cheat sheet
- Data import cheat sheet