Welcome to Day 2 of our Introduction to R workshop! If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here :octicons-download-24:
Tidyverse is a collection of R packages designed to be used together to clean up data sets and make data exploration and visualization easier, and to make data “tidy”. Because all of the R commands and exercises below will involve functions available in tidyverse, you need to have tidyverse installed.
R packages are extensions to the base R environment. They typically contain code consisting of functions available in the package, documentation, and in some cases data sets. R packages are built following a standard format so R will interact with them in a reliable, consistent way.
Some packages are included with the base R installation. You can see the packages that come with R by default (or that you may have installed) if you click over to the Packages tab in your Rstudio window. However, many packages you will want to use for data analysis need to be downloaded. R packages generally come from one of two places:
How you install packages will depend on where you are getting the
package from. For CRAN, you can use the R function
install.packages() or the Rstudio Tools -> Install
Packages menu. For example, if you wanted to install the abc
package for doing Approximate Bayesian Computation, in the R console you
would simply type: install.packages(“abc”).
To install Bioconductor packages, you will need to first
install the Bioconductor package manager (BUT DON’T DO THIS NOW!):
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.16")
Once the package manager is installed, you can install a Bioconductor
project packages with the following syntax:
BiocManager::install(<name of Bioconductor R package>).
We won’t discuss Bioconductor more in this workshop, but it is a useful
resource to be aware of.
Packages frequently have other R packages as dependencies, and so you will often get the prompt from R asking if you want to update (the dependencies) to their current version, and if you want to compile packages that require compilation from source. Packages compiled from source can be particularly prone to installation failures, especially on M1/M2 Macs, and often require some searching to decode how to proceed. But today we will only use packages that are simple and easy to install.
So we mentioned that tidyverse is a collection of R packages. You can conveniently install all these related packages with one install command.
If you haven’t already installed tidyverse and the palmerpenguins dataset, do that now by executing the following code block.
install.packages("tidyverse", "palmerpenguins")
Note that installing a package is not the same thing as loading it
into your current R session. Once the package is installed, you have the
functions on your computer, but in order to use them in your R session,
you need to load the package, which we usually do with the
library function.
Then load tidyverse by running the following code block
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
##
## Attaching package: 'palmerpenguins'
##
## The following objects are masked from 'package:datasets':
##
## penguins, penguins_raw
For more information, go to tidyverse :octicons-link-external-24:{:target=“_blank”}.
Packages usually come with vignettes, which give you short introductions and tutorials for their use. You can find these on the web, but you can also browse them in R.
Run the following code block to see the vignettes for the dplyr package, which is part of the tidyverse
browseVignettes(package = "dplyr")
Today’s data set comes from… wait for it … penguins! In particular, we are using the Palmer penguins :octicons-link-external-24:{:target=“_blank”} data set contains measurement data for 3 species of penguins collected on three islands at the LTER site in the Palmer Archipelago, Antarctica.
You can view the data by typing
View(penguins)
This should open up a new window tab that will allow you to browse the penguins dataset. We chose this dataset because it is a well organized, comprised of both numerical and categorical observations, and of manageable size. It is also a great exemplar of tidy data.
The notion of tidy data has been around in various incarnations for a while but has more recently been formalized in this paper :octicons-download-24:{:target=“_blank”} by Hadley Wickham, the developer of tidyverse. Tidy data are, in short, data that have been organized in a consistent way that makes data exploration and visualization easier. Two key principles are that:
Organized this way, for analysis with software such as R, values for different variables can be linked within observations, e.g. rows of our example data are observations of individual penguins, and columns contain information about each individual (species, sex, island where recorded, body mass, etc.) such that one could explore how the weight or flipper length varies by sex and species. Often times data that one pulls down from a repository (or that is provided by a collaborator) are not tidy! For example, multiple variables may be stored in the same column, such as when two different treatment types are concatenated into a single string. Tidying such data would require separating each treatment into a separate column.
In most cases, tabular data are represented as a “data.frame”, where
the expectation is that, in line with the notion of “tidy” data, rows
are observations and columns are variables for which there are values
recorded for each observation, including the absence of an observation,
i.e. missing data. Historically in R and prior to tidyverse,one would
load a data frame from a file with a file reading function such as
read.csv() or read.table(), and one would have
to explicitly specify whether or not the first row represented the
variable names for each column, whether there were row names, etc.
Tidyverse has its own data load functions, read_csv() and
read_table() - note the underscore rather than the
period - which load tabular data as “tibbles”. tibbles are
effectively just data.frames with some improvements to how they
are displayed, how variables can be accessed and a few other minor,
subtle differences. But you can think of them as pretty data.frames.
Because the data set we are using today already comes built into the
palmerpenguins package, the penguins dataframe is
immediately available after loading the palmerpenguins R
package. In most use cases, your data are stored as a text file in some
sort of tabular format–space, tab or comma separated–and tidyverse has
file loading functions. To show how loading works, we will write the
penguins dataframe from palmerpenguins to a csv
file, the use the tidyverse read_csv function to load
it.
Run the following code block to write the penguins data frame to a file then load it as a tibble
library(palmerpenguins)
write.csv(penguins,file = "penguins.csv")
mypenguins <- read_csv("penguins.csv")
## New names:
## Rows: 344 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): species, island, sex dbl (6): ...1, bill_length_mm, bill_depth_mm,
## flipper_length_mm, body_mass_g...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Tidyverse file loading functions assume the first row of your data
contains column names, while the base R functions
(e.g. read.csv and read.table) do not.
Tidyverse file loaders return a tibble rather than a standard dataframe.
One of the first things you will notice is that upon loading you will
get a brief summary of the tibble contents, such as the tibble
dimensions, the column separator, and the names of variables (columns)
belonging to each type: dbl for numeric and chr for
character. We can also take a quick peek at the first few rows by simply
calling the tibble.
Run the following code block to see some basic information about this data.
mypenguins
## # A tibble: 344 × 9
## ...1 species island bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Adelie Torgersen 39.1 18.7 181
## 2 2 Adelie Torgersen 39.5 17.4 186
## 3 3 Adelie Torgersen 40.3 18 195
## 4 4 Adelie Torgersen NA NA NA
## 5 5 Adelie Torgersen 36.7 19.3 193
## 6 6 Adelie Torgersen 39.3 20.6 190
## 7 7 Adelie Torgersen 38.9 17.8 181
## 8 8 Adelie Torgersen 39.2 19.6 195
## 9 9 Adelie Torgersen 34.1 18.1 193
## 10 10 Adelie Torgersen 42 20.2 190
## # ℹ 334 more rows
## # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
In the R markdown, you will notice that in this view there are tabs at the bottom right for advancing to additional sets of rows. If there were too many columns to be displayed in the window, there would also be an arrow at the top right allowing you to advance to the “right” to see additional columns.
One theme of tidyverse is that it provides replacements for a lot of
base R functions with better formatting. A tibble is one example of
this; the glimpse() function is another; it is basically a
pretty version of str().
Run the following code block to get a “glimpse” of the data:
glimpse(mypenguins)
## Rows: 344
## Columns: 9
## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
We can of course use all our non-tidyverse (base R) functions to look
at data, like head and View, since tibbles are
just pretty data frames.
To get the first 10 lines of mypenguins, run the following code block:
head(mypenguins, 10)
## # A tibble: 10 × 9
## ...1 species island bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Adelie Torgersen 39.1 18.7 181
## 2 2 Adelie Torgersen 39.5 17.4 186
## 3 3 Adelie Torgersen 40.3 18 195
## 4 4 Adelie Torgersen NA NA NA
## 5 5 Adelie Torgersen 36.7 19.3 193
## 6 6 Adelie Torgersen 39.3 20.6 190
## 7 7 Adelie Torgersen 38.9 17.8 181
## 8 8 Adelie Torgersen 39.2 19.6 195
## 9 9 Adelie Torgersen 34.1 18.1 193
## 10 10 Adelie Torgersen 42 20.2 190
## # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
For the rest of the workshop, we will just use the
penguins tibble provided with palmerpenguins
rather than our newly created mypenguins tibble, as they
are identical. It is also worth noting that, if you are using other
tools in R that produce data frames, you can use the
as_tibble function to convert the dataframe to a tibble on
the fly. Similarly,if your collaborator has provided you with messy data
that is not immediately amenable to loading as a tibble, you can use a
standard data load, do some relevant manipulations, then use the
as_tibble function.
Today we’ll talk about a bunch of functions that are part of tidyverse and provide convenient ways to manipulate tibbles. While we’ll discuss this in the context of tibbles, everything here works on data frames too - remember a tibble is just a pretty data frame.
Tidyverse functions can get very complex, but they share a common grammar. We’ll see the real power of this on day 3 when we talk about the grammar of graphics using ggplot, but getting a handle on the basic grammar of the tidyverse can help you a lot with data manipulation as well.
We’ll start with something really simple: sorting a data frame. We do
this with the arrange() function.
To sort the tibble by species, run the following code block:
arrange(penguins, species)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
All tidyverse functions take the tibble or data frame to operate on as their first argument. The second argument (‘species’) is what makes tidyverse special. Note that we are using the syntax we’ve used before to refer to objects, here: a simple unquoted word or variable name.
What happens if we just type species in the
Console?
Tidyverse allows you to treat the variables in your current data frame as if they were objects. That is, within the arrange function (or almost any tidyverse function), you can reference the names of columns in your data frame and magically the function understands what you mean.
What if we want to sort by multiple values? For example, first sort by section, and then sort by year?
Many tidyverse functions have a special second argument that you will
see listed in the help pages as .... You can look at
?arrange as an example. This means that the second argument
to these functions can be any number of columns, and the function will
work on those columns in order. So to sort by year and then sex
(i.e. sex within year):
To sort the tibble by year, then by sex, run the following code block:
arrange(penguins, year, sex)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.5 17.4 186 3800
## 2 Adelie Torgersen 40.3 18 195 3250
## 3 Adelie Torgersen 36.7 19.3 193 3450
## 4 Adelie Torgersen 38.9 17.8 181 3625
## 5 Adelie Torgersen 41.1 17.6 182 3200
## 6 Adelie Torgersen 36.6 17.8 185 3700
## 7 Adelie Torgersen 38.7 19 195 3450
## 8 Adelie Torgersen 34.4 18.4 184 3325
## 9 Adelie Biscoe 37.8 18.3 174 3400
## 10 Adelie Biscoe 35.9 19.2 189 3800
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
In this case, sorting is done first by year, and then sex. Alternately, you could sort year in descending order and sex in ascending (alphabetic) order:
To sort the tibble in descending order by year, and ascending order by sex, run the following code block:
arrange(penguins, desc(year),sex)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 35 17.9 192 3725
## 2 Adelie Biscoe 37.7 16 183 3075
## 3 Adelie Biscoe 37.9 18.6 193 2925
## 4 Adelie Biscoe 38.6 17.2 199 3750
## 5 Adelie Biscoe 38.1 17 181 3175
## 6 Adelie Biscoe 38.1 16.5 198 3825
## 7 Adelie Biscoe 39.7 17.7 193 3200
## 8 Adelie Biscoe 39.6 20.7 191 3900
## 9 Adelie Torgersen 38.6 17 188 2900
## 10 Adelie Torgersen 35.7 17 189 3350
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
desc() here is a helper function that turns year into a
descending sort instead of ascending. Tidyverse has a lot of these,
although not many are applicable to arrange(). We’ll
introduce more as we work through today.
Note that so far we haven’t stored the sorted data. Let’s try that now.
Sorting exercise. Sort your
penguinstibble by species, then by year, then by body mass, and store the output in a new tibble calledsorted_penguins.
sorted_penguins <- arrange(penguins, species, year, body_mass_g)
Sometimes in a big data set, there are a lot of columns that are not
germane to the analysis at hand, such that one can simply select columns
that represent the variables you want to work with. To do this we use
the select() function.
For example, let’s say you are only interested in looking at the relationship between flipper length and body mass, irrespective of year and island, such that you only want to retain those two morphological measurements, sex, and species
To select these variables from the penguins tibble, run the following code block:
select(penguins, species, sex, flipper_length_mm, body_mass_g)
## # A tibble: 344 × 4
## species sex flipper_length_mm body_mass_g
## <fct> <fct> <int> <int>
## 1 Adelie male 181 3750
## 2 Adelie female 186 3800
## 3 Adelie female 195 3250
## 4 Adelie <NA> NA NA
## 5 Adelie female 193 3450
## 6 Adelie male 190 3650
## 7 Adelie female 181 3625
## 8 Adelie male 195 4675
## 9 Adelie <NA> 193 3475
## 10 Adelie <NA> 190 4250
## # ℹ 334 more rows
The first argument to select() is the tibble one is
subsetting from, and the rest of the arguments are the columns we wanted
to keep. We see that select produced a tibble with the same
number of rows as the penguins tibble, but only with the
columns we chose to include. Note that, the above implementation of
select() DOES NOT produce a new tibble: it
simply prints the result to the screen. If we want to store the output,
we need to redirect to a new tibble.
To produce a new tibbble consisting of species, sex, flipper_length_mm, body_mass_g, run the following code block:
flipper_mass_data <- select(penguins, species, sex, flipper_length_mm, body_mass_g)
flipper_mass_data
## # A tibble: 344 × 4
## species sex flipper_length_mm body_mass_g
## <fct> <fct> <int> <int>
## 1 Adelie male 181 3750
## 2 Adelie female 186 3800
## 3 Adelie female 195 3250
## 4 Adelie <NA> NA NA
## 5 Adelie female 193 3450
## 6 Adelie male 190 3650
## 7 Adelie female 181 3625
## 8 Adelie male 195 4675
## 9 Adelie <NA> 193 3475
## 10 Adelie <NA> 190 4250
## # ℹ 334 more rows
The select function, you should notice, works kind of
like indexes we talked about yesterday. But it is a lot easier to use,
because you can refer to the column names as variables in your command.
Tidyverse magic at work!
Alternatively, and this gets more useful when we have a lot of columns we want to keep and only a few we want to discard, we can dump columns rather than select for them. We can do this by applying a logical “NOT” by placing a ! in front of a vector of variables we want to filter out.
To exclude the island, bill_length_mm, and bill_depth_mm variables, run the following code block:
select(penguins, !c(island, bill_length_mm, bill_depth_mm))
## # A tibble: 344 × 5
## species flipper_length_mm body_mass_g sex year
## <fct> <int> <int> <fct> <int>
## 1 Adelie 181 3750 male 2007
## 2 Adelie 186 3800 female 2007
## 3 Adelie 195 3250 female 2007
## 4 Adelie NA NA <NA> 2007
## 5 Adelie 193 3450 female 2007
## 6 Adelie 190 3650 male 2007
## 7 Adelie 181 3625 female 2007
## 8 Adelie 195 4675 male 2007
## 9 Adelie 193 3475 <NA> 2007
## 10 Adelie 190 4250 <NA> 2007
## # ℹ 334 more rows
To convince yourself those variables actually got filtered out, run
glimpse()in the code block below:
glimpse(select(penguins, !c(island, bill_length_mm, bill_depth_mm)))
## Rows: 344
## Columns: 5
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Again, in this example we have NOT yet created a new tibble for
downstream data analysis, we have simply printed the result of
select() to the screen. We would have to redirect the
output of this function to a new tibble to save the result.
To generate the new tibble, run the following code block:
noyear_bill_island <- select(penguins, !c(island, bill_length_mm, bill_depth_mm))
Column selection exercise:
- As a prelude to quantifying the relationship within species between bill traits, construct a new tibble called bill_data that includes the following variables: species, sex, bill_length_mm, and bill_depth_mm:
bill_data <- select(penguins, species, sex, bill_length_mm, bill_depth_mm)
With large data frames with many columns, it can often be difficult
to type out a list of everything you want. Conveniently, select comes
with its own helper functions to allow you to pick columns. For example,
the starts_with() function generates a list of all the
columns that start with a particular word in the current data frame.
To see how
starts_with, works, run the following code block:
select(penguins, species, sex, starts_with("bill"))
## # A tibble: 344 × 4
## species sex bill_length_mm bill_depth_mm
## <fct> <fct> <dbl> <dbl>
## 1 Adelie male 39.1 18.7
## 2 Adelie female 39.5 17.4
## 3 Adelie female 40.3 18
## 4 Adelie <NA> NA NA
## 5 Adelie female 36.7 19.3
## 6 Adelie male 39.3 20.6
## 7 Adelie female 38.9 17.8
## 8 Adelie male 39.2 19.6
## 9 Adelie <NA> 34.1 18.1
## 10 Adelie <NA> 42 20.2
## # ℹ 334 more rows
There are lots more of these: ?select is your friend
here.
Up until now, we have been selecting columns to remove or keep. Often
times, one wants to restrict analysis and visualization to a subset of
the observations, that is, to select particular rows.
In the case of our data set, perhaps you are only interested in Adelie
penguins, or perhaps only males. The filter() function
allows you to select rows based upon values in multiple
columns, similar to how we talked about logical vectors yesterday. This
function works by only returning values where the condition is true. It
not only excludes those for which it is FALSE, it also filters out rows
where there is missing data (i.e. NA) for the column evaluated (because
any logical test evaluates to false when given NA).
Let us try creating a new tibble, with data restricted to Adelie penguins.
Create a new tibble, with data restricted to Adelie penguins by running the code block below:
adelie <- filter(penguins, species=="Adelie")
adelie
## # A tibble: 152 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Note the ==, as we discussed yesterday for logical
operations. We can add additional logical conditions for other variables
if we want to make more complex selections. For example, we might only
want to look at male Adelie penguins. Let’s do this:
To select rows for male Adelie penguins, run the following code block:
adelie_males <- filter(penguins, species=="Adelie", sex =="male")
adelie_males
## # A tibble: 73 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.3 20.6 190 3650
## 3 Adelie Torgersen 39.2 19.6 195 4675
## 4 Adelie Torgersen 38.6 21.2 191 3800
## 5 Adelie Torgersen 34.6 21.1 198 4400
## 6 Adelie Torgersen 42.5 20.7 197 4500
## 7 Adelie Torgersen 46 21.5 194 4200
## 8 Adelie Biscoe 37.7 18.7 180 3600
## 9 Adelie Biscoe 38.2 18.1 185 3950
## 10 Adelie Biscoe 38.8 17.2 180 3800
## # ℹ 63 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
When we provide more than one filtering criterion, the commas between those criteria are equivalent of AND in a logical statement. The filter command above effectively says: return rows where species == “Adelie” AND sex == “male”.
Similarly, we can use | to specify either in a logical
filtering step. For example if we wanted records for either Adelie OR
Gentoo penguins, we can specify two acceptable values for
species.
To make this type of OR selection, run the code block below:
adelie_gentoo <- filter(penguins, species=="Adelie" | species == "Gentoo")
adelie_gentoo
## # A tibble: 276 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 266 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Row selection exercise: Create a new tibble that only contains Chinstrap penguins, that are females, for years after 2007)
chinstrap_females_after2007 <- filter(penguins, species=="Chinstrap", sex =="female", year > 2007)
chinstrap_females_after2007
## # A tibble: 21 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Chinstrap Dream 50.5 18.4 200 3400
## 2 Chinstrap Dream 46.4 17.8 191 3700
## 3 Chinstrap Dream 40.9 16.6 187 3200
## 4 Chinstrap Dream 42.5 16.7 187 3350
## 5 Chinstrap Dream 47.5 16.8 199 3900
## 6 Chinstrap Dream 47.6 18.3 195 3850
## 7 Chinstrap Dream 46.9 16.6 192 2700
## 8 Chinstrap Dream 46.2 17.5 187 3650
## 9 Chinstrap Dream 45.5 17 196 3500
## 10 Chinstrap Dream 50.9 17.9 196 3675
## # ℹ 11 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Remember that in order to select a set of rows, one is performing a logical operation, effectively seeing if the values of a column or a set of columns in a row meets the specified criterion, constructing a logical vector of TRUEs and FALSEs depending upon whether the criterion is met, and then printing (or redirecting to a new tibble) all columns but only the rows that are TRUE, i.e. where the criterion is met.
Let’s try another, more complex example where we only want Adelie penguins, but only males from Biscoe or females from Torgersen. There is not a clear research question behind this sort of selection…but … it is a useful example for how to string together logical/conditional statements to subset data.
To do this, run the following code block:
filter(penguins, species %in% c("Adelie"), (island == "Biscoe" & sex == "male") | (island == "Torgersen" & sex=="female"))
## # A tibble: 46 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.5 17.4 186 3800
## 2 Adelie Torgersen 40.3 18 195 3250
## 3 Adelie Torgersen 36.7 19.3 193 3450
## 4 Adelie Torgersen 38.9 17.8 181 3625
## 5 Adelie Torgersen 41.1 17.6 182 3200
## 6 Adelie Torgersen 36.6 17.8 185 3700
## 7 Adelie Torgersen 38.7 19 195 3450
## 8 Adelie Torgersen 34.4 18.4 184 3325
## 9 Adelie Biscoe 37.7 18.7 180 3600
## 10 Adelie Biscoe 38.2 18.1 185 3950
## # ℹ 36 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
While we could simply have filtered on species==Adelie,
we demonstrate the %in% as a way to construct a logical
statement meaning “IN the following vector”, such that if there were
multiple values for species, rather than stringing together a series of
OR statements, we evaluate membership in the vector. In this
case, in the construction of a logical criterion for rows,
& is used to specify AND and | is used to
specify OR. It is VERY IMPORTANT to understand how R
interprets these logical operators. In general, AND is
stronger than OR, meaning the ANDs get interpreted
first.
To clarify what we mean by this, imagine the following vector:
test <- c(1, 2, 3)
Now look at the result of the following two logical operations:
3 %in% test | 5 %in% test & 4 %in% test
## [1] TRUE
3 %in% test & 5 %in% test | 4 %in% test
## [1] FALSE
In the first statement, the part after the | , i.e the
AND gets done first, and it is FALSE. The part before,
evaluting whether 3 is in test is TRUE. So, the logical
statement is saying TRUE | FALSE, and in logical statments,
if one of the options in an OR statement is TRUE, the statement returns
TRUE. In the second statement, the condition before the |
has an AND, so it is evaluated first. Both numbers aren’t in
test so it is FALSE. The part to the right of the
| is also FALSE, thus FALSE OR FALSE returns FALSE.
For purposes of making our R code clearer – and it is a good idea to do this in your R code – when writing complex tibble filtering operations, we put the criteria joined by AND inside parentheses. Thus the statement is saying: “select rows for which species is in the vector of names that only include Adelie, and for which island equals BISCO AND sex equals male… OR … island equals Torgersen and sex equals female.
of course, with a simple dataset such as the penguins data, we could
use OR instead of the %in%.
To use the OR statement to select multiple species with two different year-sex combinations, run the code block below:
filter(penguins, (species == "Adelie" | species == "Gentoo"), (sex == "male" & year == "2008") | (sex == "female" & year==2009))
## # A tibble: 94 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 40.1 18.9 188 4300
## 2 Adelie Biscoe 42 19.5 200 4050
## 3 Adelie Biscoe 41.4 18.6 191 3700
## 4 Adelie Biscoe 40.6 18.8 193 3800
## 5 Adelie Biscoe 37.6 19.1 194 3750
## 6 Adelie Biscoe 41.3 21.1 195 4400
## 7 Adelie Biscoe 41.1 18.2 192 4050
## 8 Adelie Biscoe 41.6 18 192 3950
## 9 Adelie Biscoe 41.1 19.1 188 4100
## 10 Adelie Torgersen 41.8 19.4 198 4450
## # ℹ 84 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Complex row selection exercise Select only observations for males from before 2009, including Adelie penguins with a body mass greater than 4043 and Gentoo penguins with a body mass greater than 5485. These values are the mean mass for males of each species.
filter(penguins,year< 2009, sex == "male", (species == "Adelie" & body_mass_g > 4043) | (species == "Gentoo" & body_mass_g > 5485))
## # A tibble: 44 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.2 19.6 195 4675
## 2 Adelie Torgersen 34.6 21.1 198 4400
## 3 Adelie Torgersen 42.5 20.7 197 4500
## 4 Adelie Torgersen 46 21.5 194 4200
## 5 Adelie Dream 39.2 21.1 196 4150
## 6 Adelie Dream 39.8 19.1 184 4650
## 7 Adelie Dream 44.1 19.7 196 4400
## 8 Adelie Dream 39.6 18.8 190 4600
## 9 Adelie Dream 42.3 21.2 191 4150
## 10 Adelie Biscoe 40.1 18.9 188 4300
## # ℹ 34 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Often times, we want to perform analyses on synthetic variables that are derived from the original variables in a data set. For example, one might want to look at one variable normalized by the mean value of that variable, or one variable divided by another.
In this data set, we have bill length (bill_length_mm) and bill depth
(bill_depth_mm). Let’s add a column variable to our tibble that is depth
divided by length, which might be viewed as a bill size-adjusted depth.
We’ll do this with the mutate() function.
To add a depth_sizeadj variable to penguins, run the code block below:
penguins <- mutate(penguins, depth_sizeadj = bill_depth_mm/bill_length_mm)
In redirecting back to the penguins tibble, we are changing the input data rather than writing a new tibble with this variable added. One should be aware (and careful!) when redirecting back to the original tibble, because if you create a new variable or transform an old variable, and assign a name to the new variable that already exists, it will overwrite the old variable.
New variables creation exercise 1: For male adelie penguins, we want to flag outlier observations where body mass is unusually large or small relative to the mean value. To do this, you will need to use the
mean()andsd()functions to calculate mean and standard deviation of body mass. Once you’ve done that, create a new column that is the variable mass_outlier which is a boolean variable that indicates whether body mass is greater than or less than 2 standard deviations away from the mean, and add this to the adelie_males tibble. Finally, creat e a new tibble called outliers that only contains male Adelie penguins with outlier-classified body mass values.
This is a complicated exercise, but there are hints in the comments below. Note there are many ways to do this exercise, and the hints are just one way. You might find ways to combined steps, for example.
If you missed the creation of the adelie_males table,
here’s the code to reproduce it:
adelie_males <- filter(penguins, species=="Adelie", sex =="male")
Create a new variable called mass_mean that stores the mean of the body_mass_g column; remember that you need to include the argument
na.rm = TRUEto tell the mean function to remove missing values first
mass_mean <- mean(adelie_males$body_mass_g, na.rm = TRUE)
Create a new variable called mass_sd that stores the standard deviation of the mass column. The function for this is sd(); like mean() you need to include the argument na.rm = TRUE to tell the sd function to remove missing values first
mass_sd <- sd(adelie_males$body_mass_g, na.rm = TRUE)
Create a new column in the adelie_males tibble that is TRUE if body mass is more than two standard deviations from the mean
adelie_males <- mutate(adelie_males, mass_outlier = ((body_mass_g > (mass_mean + 2 * mass_sd)) | (body_mass_g < (mass_mean-2 * mass_sd))))
Filter the adelie_males tibble to keep only rows where mass_outlier is TRUE
outliers <- filter(adelie_males, mass_outlier == TRUE)
You should have two observations in your outliers tibble.
mutate() to reformat missing valuesAnother important use of mutate() is to convert missing
values. Data sets are frequently not constructed with R in mind, such
that missing values might be formatted differently, e.g. -999 or “N/A”.
Of course, one wouldn’t want to generate plots that included the value
-999. While our example data already have missing data formatted in an
R-compliant way, here is an example of how one would convert -999 to NA.
We use mutate() combined with the na_if()
function for this.
The na_if() function is a tidyverse function that
converts the specified value to NA.
As an example of
na_if(), run the code block below:
penguins <- mutate(mypenguins, bill_length_mm = na_if(bill_length_mm, -999))
In other circumstances, one might want to replace a missing data
point with an expected value of that data point (assuming the dispersion
of that variable is not too large). For example, if we ignore for the
moment that morphological measurements are only missing for Adelie
penguins for which the value for sex is also missing, we can
take the adelie tibble and use the
replace_na() function to replace missing values for
body_mass_g with its median value, which is basically the
inverse of na_if().
To replace NAs for body_mass_mm with the median value in a new tibble, run the following code block:
nolengthNAs <- mutate(adelie, body_mass_g = replace_na(body_mass_g, median(body_mass_g, na.rm = TRUE)))
head(nolengthNAs)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA 3700
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
In this case, we have created a new tibble, as we probably don’t want
to overwrite the original data, in case we want to undo the conversion
of missing data to the median value. Notice, as with earlier use of
summary statistics functions, we include the na.rm=TRUE
statement. Otherwise, we would be replacing NA with NA, as applying
median() to a vector containing NAs returns NA as its
value! Of course, the calculation we just performed doesn’t make much
actual biological sense, because we are taking the median of
body_mass_g across both sexes in the tibble. In other
contexts where other group-labeling variables such as sex aren’t
missing, one can perform more complex operations, e.g. if sex wasn’t
missing we could replace missing values with sex-specific median values.
We will get into such more complex operations a bit later in the
workshop.
Let’s try to create a new variable that represents a Z score, i.e. a
transformation of a set of measurements that represents the number of
standard deviations it is away from the mean (and in what direction).
Above, we have seen the mean() and sd()
functions. A Z-score for an observation is simply (observation - mean of
observations) / standard deviation of observations, with the
distribution of Z being standard normal, i.e. mean == 0 and standard
deviation == 1. Using the adelie tibble, we can create a new
variable body_mass_Z that represents a Z-transformation of
body mass for male Adelie penguins. We will write the output to a new
tibble adelie_Z. There are really three ways to generate
the new Z-score variable. The first, is to write the calculation of the
Z-score from inside of the mutate() function. To do this
…
Run the code block below to create the Z-score variable:
adelie_Z <- mutate(adelie, body_mass_Z = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE))
So, we are calculating the mean and sd of body_mass_g once,
and using those values to apply the Z-score equation to each row of the
body_mass_g vector and storing the output in the new variable
body_mass_Z. Remember, unless you are certain there are no
missing values in the variable vectors being used for calculations, when
the option is available, it is best to invoke
na.rm=TRUE.
To convince yourself that Z is in fact a standard normal distribution (mean=0, sd=1), check the mean and standard deviation of this new variable. Also, remember, statistical operations need to filter out missing values otherwise they will return NA.
To get those summary statistics, run the code block below. Note that due to floating point errors, 0 may instead show up as a very small number.
mean(adelie_Z$body_mass_Z, na.rm = TRUE)
## [1] 2.845406e-16
sd(adelie_Z$body_mass_Z, na.rm = TRUE)
## [1] 1
New variables creation exercise 2: Can you think of another way to generate the Z-score column? Hint: you can create new variables for the constituent parts of the Z-score function, and then perform the calculation using those constituent parts. Try doing that, updating adelie_Z with the new variables. It should take three separate mutate commands. Write the Z-score to the new variable z2
adelie_Z <- mutate(adelie_Z, meanmass = mean(body_mass_g, na.rm = TRUE))
adelie_Z <- mutate(adelie_Z, sdmass = sd(body_mass_g, na.rm = TRUE))
adelie_Z <- mutate(adelie_Z, z2 = (body_mass_g - meanmass) / sdmass)
To confirm this produces the same output as the first implementation, calculate the mean and standard deviation of z2 (these should also be 0 and 1, respectively).
mean(adelie_Z$z2, na.rm = TRUE)
## [1] 2.845406e-16
sd(adelie_Z$z2, na.rm = TRUE)
## [1] 1
The distinction between these two approaches is that, in the first case, one is recycling the mean and standard deviation values (effectively vectors of size 1 each) in applying them to every element of body_mass_g in calculating the Z-score values in the respective rows. In the second case, because we have constructed mean and standard deviation variables of equal length to body_mass_g, we are doing mathematical operations on same-sized vectors. Of course, one might not want to store the same (e.g.mean) value over thousands of rows. The point here is to reinforce the idea that the same result can be achieved by constructing vectors of different lengths.
As a side note to this, it is important to know that mutate requires that the object passed to the new column name argument is a vector of length 1 or a vector of length equal to the number of rows in the tibble. If you try to recycle a vector of length 2 in a mutate command, you’ll get an error:
mutate(adelie, error = c(1, 2))
Often at the beginning of a project, you have data from multiple sources that you want to merge, but are stored in different tables. For example, you might have visual measurements like color for a group of animals in one table and quantitative measurements like weight for the same animals in another table. In the best case, every animal is represented in both tables, i.e. each row in one table is associated with a row in the other table. Of course, it is also possible that one table has missing data for some observations - maybe the weight measurements wasn’t taken for some animals - and that breaks the one-to-one relationship. In either case, we can merge tables by performing “mutating joins”. In tidyverse, joins are performed one pair of tables at a time, such that, if you have tables x,y, and z, you must first join x and y to produce a new table xy, then join xy and z to create xyz. As we shall see below, a requirement of such joins is that any two tables to be joined contain unique observations on the same variable.
Left joins are a common operation, where given tables x and y, you only want to retain all rows in x, regardless if there is matching data in y: missing data in y will get returned as NAs in the new table. This sort of join is called left join, because the 1st table (the one on your left) is having data added to it from the table on the right.
penguins <- mutate(penguins, obs_number = row.names(penguins))
Now, we’ll create some sub-tables with select
x <- select(penguins, obs_number, body_mass_g)
y <- select(penguins, obs_number, species, sex)
xy <- left_join(x, y)
## Joining with `by = join_by(obs_number)`
glimpse(xy)
## Rows: 344
## Columns: 4
## $ obs_number <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ sex <fct> male, female, female, NA, female, male, female, male, NA, …
Tidyverse determines on the fly that the two tables share a column name, and assumes that they contain the same data. If your collaborator used a different column name for observation_number, one will need to tell the join function which columns match. So, if we change the column name for the observation in table x, we can still perform the join as follows:
x <- rename(x, obs = obs_number)
xy <- left_join(x, y, by = join_by(obs == obs_number))
glimpse(xy)
## Rows: 344
## Columns: 4
## $ obs <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
## $ sex <fct> male, female, female, NA, female, male, female, male, NA, …
Other useful flavors of join are inner_join, which only keep observations observed in both tables, and full_join, which keeps all observations in both tables: data in x, but missing in y; data in y, but missing in x, data in both x and y.
While converting from wide to long is often a step for tidying untidy
data. While the penguins data.frame object was already
tidy, it can often be the case that a data table was created with
different column variables that represent different types of the same
class of measurement. For example, repeated weight measurements of a
laboratory organism undergoing an experimental treatment might be
recorded with a separate column for each measurement date. From a
statistical and plotting perspective, date is a factor, and it can be
more convenient to have all the weights in a single column, and a
separate column for measurement date. In addition, it can make it easier
to inspect data if there are fewer columns to look at … otherwise, you
are panning back and forth.
As we shall see in Part 3 of this workshop, having the actual measurements aggregated into a single column can make for easier plotting, if we also have an additional column that consists of a factor that describes which measurement it is. So, we are going to combine the three morphological measurements on bill length, bill depth, and flipper length into one column, making the table less wide and more long. Admittedly, with three columns converted to two, it isn’t a big gain in ease of viewing, but it demonstrates the principles that can be applied when far more columns need to be merged.
# First, we need to make each row have a unique ID. This is another way to do it
penguins <- mutate(penguins, id = row_number())
# Then, we use the function pivot_longer() to merge columns together
penguins_long <- pivot_longer(penguins, c(bill_length_mm, bill_depth_mm, flipper_length_mm), names_to = "morphtype", values_to = "mm")
In this operation, we specify the data frame being transformed, as well as a vector of column names for the columns we wish to merge. names_to is the new column that we are creating that will contain the old column name (of those column names merged), while values_to is the name of the new column that will store the actual measurement data.
Again, let’s take a look and see that it worked!
glimpse(penguins_long)
## Rows: 1,032
## Columns: 10
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, …
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, T…
## $ body_mass_g <int> 3750, 3750, 3750, 3800, 3800, 3800, 3250, 3250, 3250, NA…
## $ sex <fct> male, male, male, female, female, female, female, female…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
## $ depth_sizeadj <dbl> 0.4782609, 0.4782609, 0.4782609, 0.4405063, 0.4405063, 0…
## $ obs_number <chr> "1", "1", "1", "2", "2", "2", "3", "3", "3", "4", "4", "…
## $ id <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7,…
## $ morphtype <chr> "bill_length_mm", "bill_depth_mm", "flipper_length_mm", …
## $ mm <dbl> 39.1, 18.7, 181.0, 39.5, 17.4, 186.0, 40.3, 18.0, 195.0,…
One thing we can see is that the new values in morphtype are
unnecessarily long, because they all contain “_mm” embedded in the label
… but we know the data are measured in millimeters, hence the name of
the new column mm. So, we can use mutate() and
str_replace() to strip that suffix off of the values:
penguins_long <- mutate(penguins_long, morphtype = str_replace(morphtype, "_mm", ""))
In other scenarios, converting to long format is important for
tidying data. A good example is the USPersonalExpenditure
base R dataset.
head(USPersonalExpenditure)
## 1940 1945 1950 1955 1960
## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64
This dataset is stored as a matrix object, in which the type of household expense is a row name, and there are columns for each year. Ideally, all the expense amounts would be in the same column and year would be a separate variable. To clean up this data, we can do this:
myUSPersonalExpenditure <- as.data.frame(USPersonalExpenditure)
myUSPersonalExpenditure$expenditure <- row.names(myUSPersonalExpenditure)
myUSPersonalExpenditure <- as_tibble(myUSPersonalExpenditure)
USPersonalExpenditure_long <- pivot_longer(myUSPersonalExpenditure,cols = starts_with("19"), names_to = "year", values_to = "amount")
USPersonalExpenditure_long
## # A tibble: 25 × 3
## expenditure year amount
## <chr> <chr> <dbl>
## 1 Food and Tobacco 1940 22.2
## 2 Food and Tobacco 1945 44.5
## 3 Food and Tobacco 1950 59.6
## 4 Food and Tobacco 1955 73.2
## 5 Food and Tobacco 1960 86.8
## 6 Household Operation 1940 10.5
## 7 Household Operation 1945 15.5
## 8 Household Operation 1950 29
## 9 Household Operation 1955 36.5
## 10 Household Operation 1960 46.2
## # ℹ 15 more rows
It should be noted, that there are other, more complicated ways of performing wide-to-long operations. For more information, see the tidyverse documentation for pivot_longer :octicons-link-external-24:{:target=“_blank”}, or use R help to get more information displayed inside Rstudio.
Thus far, we have demonstrated how to execute particular
tidyverse functions as distinct data-processing steps. However,
most of the time one wants to perform a sequence of data-processing
operations that require more than one R function. Without a means to
chain these steps together, one would have to deploy a particular R
function on a tibble, redirect the output to a new (or the same input)
tibble, then deploy the next R function on the new tibble, etc. Or, one
could try to nest sequential function executions within other function
executions,
e.g. new_tibble <-(function2(function1(original_tibble)))
which can get unwieldy and error-prone very quickly. The good news is,
tidyverse had a way of chaining together data processing steps
that is analogous to pipes in the bash shell … and guess what? They are
also called pipes. The general syntax of using pipes is:
tidy_data <- function1(data, args) %>% function2(args)
The %>% is the pipe, and it redirects the output of
function1 to function2.
The pipe implicitly fills the first unnamed argument of the right hand function with the output of the left hand function. So in this case, the output of function1 is implicitly assigned to the first argument of function2. Because tidyverse functions always take the input data as their first argument, this is very convenient.
In the above command structure the final output of data processing
with function1 and function2 gets directed to
tidy_data, a new cleaned tibble.
Now, as a more concrete example, let’s see how we can use
select() and filter() together with a
pipe.
To combine
select()andfilter()operations to create a new tibble callednew_data, run the code block below:
new_data <- select(penguins, year, species, sex, body_mass_g) %>% filter(species == "Adelie", year == 2008)
In this example, we pipe together the two function calls we executed earlier, but all in one command line.
pipe exercise:
- Try creating a more complex piped workflow, calling in sequence the
filter()andselect()commands above, followed by usingmutate()to extract Adelie penguins and then calculate a body size (i.e. mass) normalized flipper_length.
flipper_norm <- select(penguins, species, flipper_length_mm, body_mass_g) %>% filter(species == "Adelie") %>% mutate(norm_flipper_length = flipper_length_mm /body_mass_g)
- Try experimenting with different row and column selections, and new variable creations with these three functions. You can even switch the order, e.g. create a new synthetic variable then applying
filter()to it.
As a side note, you sometimes want to pipe the output of one function
to the input of another function when the second function DOESN’T use
the first argument for data. A common example of this is the function
lm(), which is used to fit a linear model to data. The
first argument to lm is the formula you want to fit, and the second
argument is the data. There are two common options to still use a pipe
here:
Because the output of the left hand function is piped to the first
unnamed argument, if you name every argument before data,
you’ll be okay:
penguins %>% filter_command() %>% lm(formula = x ~ y)
You can reference the piped output via the special variable
. in the right hand function:
penguins %>% filter_command() %>% lm(x ~ y, data = .)
Note that the way we have implemented pipes with %>%
is specific to tidyverse. If there are cases where you need to use pipes
in R outside of tidyverse, you would use |>. A trivial
example of such a pipe would be to calculate the sum of a vector of
numbers:
1:10 |> sum()
## [1] 55
In this case the vector of numbers 1 through 10 is piped to the sum function.
There are tidyverse flavors of standard R functions. For
example, to get the unique values for a column, instead of using
unique(), we can use the dplyr function
distinct().
To get the unique values for year, run the command below:
distinct(penguins, year)
## # A tibble: 3 × 1
## year
## <int>
## 1 2007
## 2 2008
## 3 2009
This function requires that you supply a tibble and column name for
which you want to get the unique values. It does not behave as you would
like if you simply supply the vector penguins$year.
Yesterday, we showed you how the summary function will
print out basic summary statistics for all columns in a data frame.
Today, we introduce the dyplr function
summarize() for generating specific statistics on one or
more variables. For example, to get the number of individuals for which
unique, individual ids are available (derived from pit tagging that
started only in 2007).
To generate this data summary, run the command below:
summarize(penguins, n = n_distinct(species))
## # A tibble: 1 × 1
## n
## <int>
## 1 3
This function call returns a 1x1 tibble n.
Ignoring the fact for the moment that the data set is comprised of three different species, one could also obtain a summary statistic on variables, for example, mean size and weight.
To generate a summary of mean bill length and body mass for Adelie penguins, run the command below:
summarize(adelie, meanbill = mean(bill_length_mm, na.rm = TRUE), meanmass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 1 × 2
## meanbill meanmass
## <dbl> <dbl>
## 1 38.8 3701.
Perhaps we are getting tired of always typing
na.rm = TRUE, and instead want to first filter our dataset
to remove rows that contain missing values. Tidyverse contains a nice
function, called drop_na(), that does just this. By default
it looks for missing values in every column in in our tibble,
but if you look at ?drop_na you’ll see it has one of those
... arguments, so we can pass it column names to check.
Exercise: Rewrite the summarize function above, but remove missing values before piping to summarize. Leave off the
na.rm=TRUEto make sure it worked!
drop_na(adelie) %>% summarize(meanbill = mean(bill_length_mm), meanmass = mean(body_mass_g))
## # A tibble: 1 × 2
## meanbill meanmass
## <dbl> <dbl>
## 1 38.8 3706.
You may have noticed that these two ways of handling NAs lead to different results from summarize. This is due to the fact that the first approach will compute statistics from all values that aren’t NAs, regardless if other variables for that observation have missing data. In the second, we drop all observations that have missing data for any variable. Depending upon the particular data set and planned downstream analyses, there may be good reasons for choosing one approach over the other.
group_by() with summarize()In most data sets, summarizing across all observations (rows) may not
be particularly informative, because what one usually wants to do is to
get information specific to different groups of observations, e.g. those
in different habitats, exposed to different experimental treatments,
etc. Thus, in most circumstances one will want to define those groups
for a tibble. One can do this using the group_by()
function, which takes as it’s first argument the name of the data frame,
and the following arguments are variables to group by; these can then be
followed by other keyword arguments. An obvious grouping for the
penguins tibble is species.
To obtain this grouping, run the command block below:
by_species <- group_by(penguins, species)
by_species
## # A tibble: 344 × 11
## # Groups: species [3]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 5 more variables: sex <fct>, year <int>, depth_sizeadj <dbl>,
## # obs_number <chr>, id <int>
Notice that in the header of the by_species tibble there is
now an indication that there are four groups. It is important to note
that group_by() does not change the data in the tibble:
there is no column that specifies the groups. But wait, we know there
are only three species in the data set. Let’s check to see what the
values are for species.
A feature of the penguins data is that there are a number of entries
for which the sex and other measurements were not taken, i.e. are
missing. While in this data set, if sex is missing the other
measurements tend to be missing as well, there are cases where
individual measurements might be missing and using the omnibus
drop_na function might discard useful data. In which case
one can filter on a particular missing column.
So, if we look at the sex variable:
distinct(adelie, sex)
## # A tibble: 3 × 1
## sex
## <fct>
## 1 male
## 2 female
## 3 <NA>
we see that NA is one of the values. We can fix this, so that, were we to group by sex, we won’t create a group for NA.
To generate grouped data for Adelie penguins by sex after removing rows with no sex information, run the following code block:
adelie_by_sex <- filter(adelie, is.na(sex)==FALSE) %>% group_by(sex)
adelie_by_sex
## # A tibble: 146 × 8
## # Groups: sex [2]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # ℹ 136 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
… and, as a sanity check:
distinct(adelie_by_sex, sex)
## # A tibble: 2 × 1
## # Groups: sex [2]
## sex
## <fct>
## 1 male
## 2 female
Now, let’s calculate the mean by species for body_mass_g and
bill_length_mm. If you’re unsure whether there are NA values in
these two columns, you can use the drop_na() function to
remove rows which contain ANY NA values. To see how you might do this in
one command using pipes, we’ll start from penguins.
Run the following code block:
# starting with the penguins table
penguins %>%
# select only the columns we want
select(species, body_mass_g, bill_length_mm) %>%
# drop the rows with NA in ANY selected column
drop_na() %>%
# group table by species
group_by(species) %>%
# summarize the mean body mass, bill length, and number of observations in each group
summarize(mean_weight = mean(body_mass_g), mean_bill_length = mean(bill_length_mm), num=n())
## # A tibble: 3 × 4
## species mean_weight mean_bill_length num
## <fct> <dbl> <dbl> <int>
## 1 Adelie 3701. 38.8 151
## 2 Chinstrap 3733. 48.8 68
## 3 Gentoo 5076. 47.5 123
We have now produced mean statistic by species.
group_by()exercise:Try building a tibble grouped by species and sex, then use summarize to get the mean and standard deviation of body mass. But first, get rid of missing values for both of these variables (hint: use
drop_na()) and DON’T make changes to the penguins tibble.
# start with the penguins table
penguins %>%
# we only want the columns species, section, and weight_g
select(species, sex, body_mass_g) %>%
# remove NAs
drop_na() %>%
# group by species and section
group_by(species, sex) %>%
# summarize mean weight, standard deviation of weight, and number of observations in each group
summarize(mean_mass = mean(body_mass_g), std_weight = sd(body_mass_g), num=n())
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## # Groups: species [3]
## species sex mean_mass std_weight num
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie female 3369. 269. 73
## 2 Adelie male 4043. 347. 73
## 3 Chinstrap female 3527. 285. 34
## 4 Chinstrap male 3939. 362. 34
## 5 Gentoo female 4680. 282. 58
## 6 Gentoo male 5485. 313. 61
You will notice that this has produced a warning message :
summarize() has grouped output by ‘species’. You can
override using the .groups argument. The output of
summarize is itself a tibble, which is grouped. However, we can’t group
it by both species and sex, since these are now unique combinations with
only one row per pair. So summarize() is telling us it has chosen to
group the output tibble by the species variable not the
section variable.