The goal of this workshop is to teach the grammar of graphics in R, with a focus on ggplot2. The consistent grammar implemented in ggplot2 is advantageous because it is easily extensible: you can produce simple plots quickly and then develop them into complex, publication-ready figures. In addition to the core ggplot2 package, many extensions for different types of data have been written using the same standardized grammar.

ggplot2 is part of the tidyverse, a collection of packages that share a common design. To make it easier to load our dataset and manipulate it prior to plotting, we will load the entire tidyverse.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

To demonstrate how the different plots work, let’s import a dataset into our environment. It contains plumage color information for a family of birds called the tanagers.

setwd("~/Dropbox/Informatics/Workshops/VisualizationR/")
tanager_color <- read_csv("tanagercolordata.csv")
## Parsed with column specification:
## cols(
##   species = col_character(),
##   sex = col_character(),
##   subfamily = col_character(),
##   habitat = col_character(),
##   foraging_stratum = col_character(),
##   avg_span = col_double(),
##   volume = col_double(),
##   avg_brill = col_double(),
##   avg_chroma = col_double(),
##   avg_sex_dich = col_double()
## )
tanager_color
## # A tibble: 688 x 10
##                     species    sex   subfamily habitat foraging_stratum
##                       <chr>  <chr>       <chr>   <chr>            <chr>
##  1      Acanthidops_bairdii Female Diglossinae       F                U
##  2      Acanthidops_bairdii   Male Diglossinae       F                U
##  3 Anisognathus_igniventris Female  Thraupinae       F              U/C
##  4 Anisognathus_igniventris   Male  Thraupinae       F              U/C
##  5  Anisognathus_lacrymosus Female  Thraupinae       F                C
##  6  Anisognathus_lacrymosus   Male  Thraupinae       F                C
##  7 Anisognathus_melanogenys Female  Thraupinae       F              U/C
##  8 Anisognathus_melanogenys   Male  Thraupinae       F              U/C
##  9   Anisognathus_notabilis Female  Thraupinae       F                C
## 10   Anisognathus_notabilis   Male  Thraupinae       F                C
## # ... with 678 more rows, and 5 more variables: avg_span <dbl>,
## #   volume <dbl>, avg_brill <dbl>, avg_chroma <dbl>, avg_sex_dich <dbl>

Notice that our dataset is tidy and in long format. Each row represents one observation: in this case, the plumage coloration measurements for one sex of one species. As ggplot2 is part of the tidyverse, it works best with long-format data. The color variables are avg_span (average contrast among color patches), volume (the volume occupied in avian tetrahedral color space; larger values are more colorful), avg_brill (average brilliance across color patches), avg_chroma (average chroma, or saturation, of each color patch), and avg_sex_dich (a measure of sexual dichromatism, the difference in coloration between males and females of that species). In addition to these color measurements, we also have information on which subfamily each species belongs to, habitat preference (F = closed habitat, or forest; N = open habitat), and where birds forage (U = understory, C = canopy, U/C = both). If you would like to know more, this dataset comes from this 2017 Evolution paper: http://onlinelibrary.wiley.com/doi/10.1111/evo.13196/full.

Basic grammar of ggplot

There are three key components that make up every ggplot:

1. data
2. aesthetic mappings (which variables in your data map to which visual properties)
3. a geometric object (geom) function (a layer describing how to render each observation)

There are other optional components that control the appearance of a plot, but for now, we will focus on getting these three key elements down. The basic template that combines them is:

ggplot(data=<dataset>, aes(<mappings>)) + <geom_function>()

Let’s make a basic plot using this grammar. You can see that this is really not much more complicated than the base R plot function.

ggplot(data=tanager_color, aes(x=avg_brill, y=avg_chroma)) + 
  geom_point()

Notice that, just as we can make dplyr code easier to read by putting each step of a pipe (%>%) on a new line, we can do the same with ggplot2 by breaking the line after each +. The + specifies that we want to add a layer to the plot. As you will see going forward, we can add more than one layer to include other data, statistical summaries, or metadata and build more complex plots.
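As a minimal sketch of adding more than one layer, the code below draws a fitted line on top of the scatterplot. Because the workshop file may not be at hand, it uses a simulated stand-in called tanager_sim (a hypothetical name): the column names match tanager_color, but the values are random and carry no biological meaning. The choice of geom_smooth() with a linear model is only one example of a statistical summary layer.

```r
library(ggplot2)

# Simulated stand-in for tanager_color (assumption: random values,
# only the column names match the workshop data).
set.seed(42)
tanager_sim <- data.frame(
  avg_brill  = runif(100),
  avg_chroma = runif(100)
)

# Each + adds one layer; later layers are drawn on top of earlier ones.
p_layers <- ggplot(data = tanager_sim, aes(x = avg_brill, y = avg_chroma)) +
  geom_point() +                # layer 1: one point per row
  geom_smooth(method = "lm")    # layer 2: fitted straight line with confidence band

p_layers
```

With the real data, replacing tanager_sim with tanager_color should give the same two-layer plot structure.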

Also, here we named x and y explicitly, but they are used so commonly that ggplot2 treats the first unnamed aes() argument as x and the second as y, so we will not name them moving forward.
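A quick way to convince yourself of this is to compare the two forms of aes() directly. The sketch below needs no data, because aes() only records which variables are mapped; the object names explicit and positional are made up for illustration.

```r
library(ggplot2)

explicit   <- aes(x = avg_brill, y = avg_chroma)
positional <- aes(avg_brill, avg_chroma)

# Both mappings contain the same two aesthetics, x and y.
names(explicit)
names(positional)
```

Only x and y can be matched by position; any other aesthetic, such as colour, still has to be named.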

Geoms

Geoms are the building blocks of ggplot2 and specify what type of plot you will draw. For example, you saw above that geom_point draws a scatterplot by placing a point at each pair of x and y values.

We can divide geoms into several categories. Individual geoms map each observation you provide to one element of the plot, while collective geoms group observations to summarize aspects of your data (e.g. a boxplot). We can also categorize geoms by how many variables they display: one (1-D), two (2-D), or even three (3-D).
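To make the difference between individual and collective geoms concrete, the sketch below plots the same simulated data twice and uses layer_data() to count how many elements each geom draws. The data frame tanager_sim is a hypothetical stand-in with random values, not the workshop dataset.

```r
library(ggplot2)

# Simulated stand-in (assumption: random values, 50 rows per sex).
set.seed(42)
tanager_sim <- data.frame(
  sex       = rep(c("Female", "Male"), each = 50),
  avg_brill = runif(100)
)

# Individual geom: one point per observation.
individual <- ggplot(tanager_sim, aes(sex, avg_brill)) + geom_point()

# Collective geom: one box summarizing all observations of each sex.
collective <- ggplot(tanager_sim, aes(sex, avg_brill)) + geom_boxplot()

# layer_data() returns the data frame a layer actually draws from.
nrow(layer_data(individual))   # 100 rows, one per point
nrow(layer_data(collective))   # 2 rows, one per box
```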

Another nice feature of ggplot2 is that we can save the data and aesthetic mappings to an object, and then add different geoms or other layers to this object. This can be very useful if you want to explore how different geoms visualize your data, or change subtle aspects of the plot.
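One detail worth seeing in code is that adding a geom returns a new plot and leaves the saved object unchanged, which is what makes it safe to reuse. The sketch below assumes a simulated stand-in data frame (tanager_sim, random values) and hypothetical object names; geom_rug() is used only as a second geom to contrast with geom_point().

```r
library(ggplot2)

# Simulated stand-in (assumption: random values, not the workshop data).
set.seed(42)
tanager_sim <- data.frame(
  avg_brill  = runif(100),
  avg_chroma = runif(100)
)

# Data and aesthetic mappings only; no layers yet.
base_plot <- ggplot(tanager_sim, aes(avg_brill, avg_chroma))

# Each addition creates a separate plot from the same base.
with_points <- base_plot + geom_point()
with_rug    <- base_plot + geom_rug()

length(base_plot$layers)     # the base still has no layers
length(with_points$layers)   # each derived plot has one
```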

1-D geoms

First, we will explore some of the geoms available for single variables. To make it easier to compare different plot types, let’s create a base ggplot2 object called brill whose aesthetic mapping assigns the avg_brill values to x.

brill <- ggplot(tanager_color, aes(avg_brill))

A simple way to explore the distribution of this variable is with a histogram.

brill + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We will likely want to change the default number of bins, or set the bin width directly.

brill + geom_histogram(bins=15)

brill + geom_histogram(binwidth=0.005)
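The two arguments control the same thing from different directions: bins fixes how many bars are drawn and lets ggplot2 work out their width, while binwidth fixes the width and lets the number of bars follow from the range of the data. The sketch below inspects the computed bins with layer_data(). It uses simulated values between 0 and 0.15 as a stand-in for avg_brill; that range is an arbitrary assumption, not taken from the real data.

```r
library(ggplot2)

# Simulated stand-in for avg_brill (assumption: arbitrary range).
set.seed(42)
tanager_sim <- data.frame(avg_brill = runif(200, min = 0, max = 0.15))

brill_sim <- ggplot(tanager_sim, aes(avg_brill))

# Extract the bins that each histogram would draw.
by_bins  <- layer_data(brill_sim + geom_histogram(bins = 15))
by_width <- layer_data(brill_sim + geom_histogram(binwidth = 0.005))

# Bar widths under each setting.
unique(round(by_bins$xmax - by_bins$xmin, 4))
unique(round(by_width$xmax - by_width$xmin, 4))

# Either way, every observation is counted exactly once.
sum(by_bins$count)
sum(by_width$count)
```

Setting binwidth is often easier to report in a figure legend because it is expressed in the units of the variable itself.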