3  Data wrangling

3.1 Goal

Learn how to manipulate large data sets by writing efficient, consistent, and compact code. Introduce the use of dplyr, tidyr, and the “pipe” operator |>. Effortlessly produce statistics for grouped data. Massage data into “tidy” form.

3.2 What is data wrangling?

As biologists living in the 21st century, we are often faced with large amounts of data, possibly replicated over several organisms, treatments, or locations. We would like to streamline and automate our analyses as much as possible, writing scripts that are easy to read, fast to run, and easy to debug. Base R can get the job done, but the code often involves convoluted operations and a profusion of $ signs and brackets.

We’re going to learn about the packages dplyr and tidyr, which are part of the tidyverse and can be used to manipulate large data frames in a simple and straightforward way. These tools are also much faster than the corresponding base R commands, are very compact, and can be concatenated into “pipelines”.

We are going to use the data set penguins from the package palmerpenguins, which we saw last week.

3.3 A new data type, tibble

The data is stored in a “tibble”:
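
A minimal sketch of loading the packages and printing the data (assuming tidyverse and palmerpenguins are installed):

```r
library(tidyverse)       # loads dplyr, tidyr, ggplot2, and friends
library(palmerpenguins)  # provides the penguins data set
penguins                 # printing shows only the rows/columns that fit
```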

In fact, the tidyverse provides a new data type, called a tibble. To convert a data.frame into a tibble, use as_tibble:
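
For example, using the built-in iris data.frame:

```r
iris_tbl <- as_tibble(iris)  # iris is a plain data.frame in base R
iris_tbl                     # now prints as a tibble
```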

The nice feature of tbl objects is that they print only what fits on the screen, and also give you useful information on the size of the data, as well as the type of data in each column. Other than that, a tbl object behaves very much like a data.frame. In the rare cases in which you want to transform a tbl back into a data.frame, use the function as.data.frame(tbl_object).

We can take a look at the data using one of several functions:

  • head(dt) shows the first few rows
  • tail(dt) shows the last few rows
  • glimpse(dt) shows a summary of the data (similar to str in base R)
  • View(dt) opens the data in a spreadsheet-like window
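
For example, glimpse() lists each column along with its type and its first few values:

```r
glimpse(penguins)
```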

3.4 Selecting rows and columns

There are many ways to subset the data, either by row (subsetting the observations), or by column (subsetting the variables). For example, let’s select only the rows with observations from the island Torgersen:
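
A sketch of the command (island names in penguins are capitalized):

```r
filter(penguins, island == "Torgersen")
```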

We have 52 observations. We have used the command filter(tbl, conditions) to select certain observations. We can combine several conditions by listing them side by side (they are combined with a logical AND), or by using logical operators explicitly.

Exercise: what does this do? filter(penguins, bill_length_mm > 40, bill_depth_mm > 20, sex == "male")

We can also select particular variables (columns) using the function select(tbl, cols to select). For example, select species and island:
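
A sketch of the call:

```r
select(penguins, species, island)
```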

How many species are represented in the data set? We can use the function distinct(tbl, cols to select) to retain only the rows that differ from each other:
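
Nesting distinct() around the select() call above:

```r
distinct(select(penguins, species))
```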

This shows that there are three species, once duplicates are removed. There are many other ways to subset observations:

  • slice_sample(tbl, n = howmany, replace = TRUE) samples howmany rows at random, with replacement
  • slice_sample(tbl, prop = proportion, replace = FALSE) samples a certain proportion (e.g., 0.2 for 20%) of the rows at random, without replacement
  • slice(tbl, 5:20) extracts rows 5 to 20
  • slice_max(penguins, body_mass_g, n = 10) extracts the 10 rows with the largest values of body_mass_g

More ways to select columns:

  • select(penguins, contains("mm")) selects all columns whose names contain the string mm
  • select(penguins, -year, -body_mass_g) excludes the columns year and body_mass_g
  • select(penguins, matches("length|bill")) selects all columns whose names match a regular expression

3.5 Creating pipelines using |>

We’ve been calling nested functions, such as distinct(select(penguins, species)). If you were to add another layer or two, the code would become unreadable. dplyr allows you to “un-nest” these functions and create a “pipeline”, in which you concatenate commands separated by a special operator, |>. For example:
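
```r
penguins |>
  select(species) |>
  distinct()
```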

does exactly the same operations as the command above, but is much more readable. By concatenating many commands, you can create incredibly complex pipelines while retaining readability. It is also quite easy to add another piece of the pipeline in between commands, or to comment some of the pipeline out.

Another advantage of pipelines is that they help with name completion. In fact, RStudio runs your pipeline in the background while you type it. Try typing dt |> filter( and then start typing bill and press Tab: you will see the options for completing the column name; choose one with your arrow keys and hit Return. Backticks will be added automatically if needed (e.g., for column names containing spaces, or starting with a digit).

3.6 Producing summaries

Sometimes we need to calculate statistics on certain columns. For example, suppose we want to calculate the average body mass of the penguins. We can do this using summarise (you can use either the British spelling or the American summarize):
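
A sketch, where avg_body_mass is an arbitrary name for the resulting column:

```r
summarise(penguins, avg_body_mass = mean(body_mass_g, na.rm = TRUE))
```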

where we used na.rm = TRUE to ignore missing values. This command returns a tbl object with just the average body mass. You can combine multiple statistics (use first, last, min, max, n [count the number of rows], n_distinct [count the number of distinct rows], mean, median, var, sd, etc.):
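
For example (the column names are arbitrary):

```r
penguins |>
  summarise(
    num_rows = n(),
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE),
    sd_mass  = sd(body_mass_g, na.rm = TRUE)
  )
```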

3.7 Summaries by group

One of the most useful features of dplyr is the ability to produce statistics for the data once subsetted by groups. For example, we would like to compute the average body mass by species and sex:
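
A sketch, dropping the records with missing sex first:

```r
penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))
```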

showing that male penguins are heavier than females in all three species considered.

Exercise: find the average bill_depth_mm and bill_length_mm by species and sex. Filter the data to consider only observations for the year 2008.

3.8 Ordering the data

To order the data according to one or more variables, use arrange():
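
For example, ordering by body mass (use desc() for descending order):

```r
penguins |> arrange(body_mass_g)        # lightest first
penguins |> arrange(desc(body_mass_g))  # heaviest first
```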

3.9 Renaming columns

To rename one or more columns, use rename():
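
For example (new names on the left, old names on the right):

```r
penguins |> rename(bill_length = bill_length_mm,
                   bill_depth  = bill_depth_mm)
```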

3.10 Adding new variables using mutate

If you want to add one or more new columns whose content is a function of other columns, use the function mutate. For example, we are going to add a new column showing the z-score for the body mass of each individual:
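
A sketch (body_mass_z is an arbitrary name):

```r
penguins |>
  mutate(body_mass_z = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) /
                       sd(body_mass_g, na.rm = TRUE))
```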

We can pipe the results to ggplot for plotting!
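
For example, a sketch of a boxplot of the z-scores by species:

```r
penguins |>
  mutate(body_mass_z = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) /
                       sd(body_mass_g, na.rm = TRUE)) |>
  ggplot(aes(x = species, y = body_mass_z)) +
  geom_boxplot()
```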

You can use the function transmute() to create new columns while dropping all the columns that are not explicitly mentioned.
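
A sketch (body_mass_kg is our name; species is listed so that it is retained):

```r
penguins |> transmute(species, body_mass_kg = body_mass_g / 1000)
```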

Most importantly, you can use mutate and transmute on grouped data. For example, let’s recompute the z-score of the body_mass_g once the data is grouped by species and sex:
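
A sketch; now the mean and standard deviation are computed within each group:

```r
penguins |>
  group_by(species, sex) |>
  mutate(body_mass_z = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) /
                       sd(body_mass_g, na.rm = TRUE)) |>
  ungroup()
```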

3.11 Tidying up the data

Data are rarely in a format that is ready for computing, and much effort goes into reading the data and wrestling them into a good format. As the name implies, the tidyverse strongly advocates for the use of data in tidy form. What does this mean?

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

This is often called the narrow table format. Any other form of data (e.g., the wide table format) is considered messy. However, often data are not organized in tidy form, or we want to produce tables for human consumption rather than computer consumption. The package tidyr allows you to accomplish just that. It contains only a few, very powerful functions. To explore this issue, we build a data set containing the average body mass by species and sex:
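
A sketch, storing the summary in a tibble we call pbm, with the average in a column called body_mass (the name referenced below):

```r
pbm <- penguins |>
  filter(!is.na(sex)) |>
  group_by(species, sex) |>
  summarise(body_mass = mean(body_mass_g, na.rm = TRUE)) |>
  ungroup()
pbm
```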

3.12 From narrow to wide

Our data is in tidy form. For a paper, we want to show the difference between males and females in a table:
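
Using pivot_wider (pbm is the tibble built above):

```r
pbm |> pivot_wider(names_from = sex, values_from = body_mass)
```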

where we have created new column names using the values found in sex (hence, names_from), and filled each cell with the corresponding value found in body_mass (hence, values_from). Similarly, if we want to show the data with species as column names, and sex as rows, we can use:
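
Swapping the roles of the two grouping variables:

```r
pbm |> pivot_wider(names_from = species, values_from = body_mass)
```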

3.13 From wide to narrow

For a real-world example, we will use data from:

I.A.D. Remane and M.D. Therrell, “Tree-ring analysis for sustainable harvest of Millettia stuhlmannii in Mozambique”, South African Journal of Botany, Volume 125, September 2019, Pages 120–125.

You can read a tab-separated file from:
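
A sketch; the placeholder below stands in for the actual URL, which is not reproduced here:

```r
# hypothetical location --- substitute the actual URL of the tab-separated file
dt <- read_tsv("<url of the tree-ring data>")
```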

Each column besides YEAR represents a single tree, and each cell contains the diameter (in cm) of the tree when it was at a given age. To put these data into tidy form, we first create the columns tree and diameter:
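
A sketch using pivot_longer, assuming the tibble is called dt as above:

```r
dt_long <- dt |>
  pivot_longer(cols = -YEAR, names_to = "tree", values_to = "diameter")
```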

and then remove the NAs:
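
One way, keeping only rows with a recorded diameter:

```r
dt_long <- dt_long |> filter(!is.na(diameter))
```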

Now it is easy to plot the growth trajectory of each tree (as in Fig. 3 of the original paper):
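
A sketch of the plot, with one line per tree:

```r
dt_long |>
  ggplot(aes(x = YEAR, y = diameter, colour = tree)) +
  geom_line() +
  theme(legend.position = "none")  # many trees: omit the legend
```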

3.14 Separate: split a column into two or more
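
The function separate() splits one column into two or more, given a separator. A minimal sketch with made-up data:

```r
dt <- tibble(sample_id = c("Adelie_2008", "Gentoo_2009"))
dt |> separate(sample_id, into = c("species", "year"), sep = "_")
```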

The complement of separate is called unite.
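
For example, undoing the split above (a sketch):

```r
dt |>
  separate(sample_id, into = c("species", "year"), sep = "_") |>
  unite("sample_id", species, year, sep = "_")
```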

3.15 Separate rows: from one row to many

Sometimes a single cell contains several records, separated by a delimiter. To make the data tidy, with only one record per row, use separate_rows():
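
A sketch with made-up data in which one cell holds several records:

```r
dt <- tibble(species = "Adelie", island = "Torgersen;Biscoe;Dream")
dt |> separate_rows(island, sep = ";")
```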

3.16 Example: brown bear, brown bear, what do you see?

This exercise uses a dataset from GBIF, the Global Biodiversity Information Facility. You can download the latest version yourself by doing the following (but just skip ahead if you want to use the data provided by us).

  1. Go to GBIF and click on Occurrences.
  2. Under Scientific Name, type in Ursus arctos (brown bear), and hit enter.
  3. To download the data, create an account on GBIF.
  4. Then click on Download, and select Simple (which should be a tab-delimited .csv file).
  5. Save the file to the data folder in your working directory.

If you don’t want to go through all this, you can load this previously downloaded file called Ursus_GBIF.csv from our GitHub repository. The code in the following chunk loads and displays the contents of the tibble:
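
A sketch, assuming the file is saved in the data folder (despite the .csv extension, the GBIF “Simple” download is tab-separated, hence read_tsv):

```r
ursus <- read_tsv("data/Ursus_GBIF.csv")
ursus
```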

You see there are 50 variables in the data set, so it may be useful to remove the ones we don’t need. For this exercise, our objective is to plot the occurrences of this species on the world map, so we certainly need two variables: decimalLatitude and decimalLongitude; we also keep basisOfRecord for additional information. Use your tidyverse skills to create a new tibble with only those variables. In addition, remove duplicate records from the tibble.
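
One possible solution (ursus_small is an arbitrary name):

```r
ursus_small <- ursus |>
  select(decimalLatitude, decimalLongitude, basisOfRecord) |>
  distinct()
```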

Now we can plot this data set on the world map, using the useful package maps. To plot, use the ggplot() syntax with the following addition:
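
A sketch using ggplot2’s borders() helper, which draws the world map provided by the maps package:

```r
library(maps)
ursus_small |>
  ggplot(aes(x = decimalLongitude, y = decimalLatitude)) +
  borders("world", fill = "grey90") +
  geom_point(size = 0.5, alpha = 0.5)
```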

Note the warning message generated by ggplot. Then consider the map with the locations of the brown bear specimens. Do any of them seem strange to you? What may be the explanation behind these strange data points? Now filter out the points that you identified as suspicious and print out their basisOfRecord. Does this suggest an explanation for the strangeness?

3.17 Resources