I. Preliminaries

The goal of this workshop is to provide an introduction to differential expression analyses using RNA-seq data. While there are now many published methods for tackling specific steps, as well as full-blown pipelines, we will focus on two different approaches that have been show to be top performers with respect to controlling the false discovery rate. The first is using kallisto to estimate expression, in conjuction with sleuth to assess evidence for differential expression. This pipeline is based upon a recently developed approach that doesn’t rely on traditional read alignment. The second relies on read mapping with bowtie2, implemented by the gold standard expression estimation tool RSEM, followed by differential expression analysis with limma voom.

For today’s workshop, the expression estimates fed into sleuth and limma voom were created with an annotated genome. Time permitting, at the end of this workshop I will talk about how one would carry out such analyses without a genome (i.e. with a de novo transcriptome), and some of the considerable challenges one faces when conducting such analyses.

Finally, a broader goal of this workshop is to demonstrate how to analyze complex designs that are not the standard one treatment, two condition analyses that are prevalent in the literature.

Sample data

Our sample data comprises 12 paired-end RNA-seq libraries for whole body samples of Drosophila melanogaster from two geographic regions (Panama and Maine), with two temperature treatments (“low” ane “high”) for each region, featuring three biological replicates for each region x treatment combination. Previously, these data were used to look for parallel gene expression patterns between high and low latitude populations (Zhao et al, 2015, PLoS Genetics)

Loading required R libraries

First, load all the R libraries that will be used for today’s analyses:

library(limma)
library(sleuth)
## Loading required package: ggplot2
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(edgeR)
library("biomaRt")

II. Implementing the kallisto/sleuth pipeline

Kallisto is an extremely fast tool that uses pseudo-alignments to determine the transcript from which a read originated. Sleuth is the differential expression tool that can accept directly the outputs of kallisto. Both tools came out of Lior Pachter’s lab at CalTech. While kallisto is “fast”, it is not fast enough to make it efficient to run it to completion for our workshop. Thus, you should have copied the kallisto directory that I pointed you to to a directory you have created on your notebook called kallisto_sleuth.

  1. Set your R working directory to DEworkshop:
    Note: this assumes you are already in a directory called DEworkshop to which you have copied the workshop data.
setwd("/Users/adamfreedman/Dropbox/informatics/DEworkshop")

File and data management

  1. Grab sample ids, i.e. names of kallisto run folders for each of the 12 samples:
sample_id <- dir(file.path("kallisto_sleuth/kallisto"))
  1. Get the directories where the kallisto runs live:
kal_dirs <- file.path("kallisto_sleuth/kallisto", sample_id)
  1. Make sure that the table that associates sample IDs and treatments (dme_elev_samples.csv) is in your kallisto_sleuth directory, then load that table:
s2c<-read.table("kallisto_sleuth/dme_elev_samples.tab",header = TRUE, stringsAsFactors=FALSE)

If you are using RStudio, then this should pop up as a table pane in your top left window.

  1. Add file paths to s2c table:
s2c <- mutate(s2c, path = kal_dirs)

Create the sleuth object

  1. Run sleuth_prep. This will take a few minutes …
so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE,read_bootstrap_tpm=TRUE)
## Warning in check_num_cores(num_cores): It appears that you are running Sleuth from within Rstudio.
## Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
## If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
## reading in kallisto results
## dropping unused factor levels
## ............
## normalizing est_counts
## 28095 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
## ............

Note, we do boostrap_summary = TRUE so we can do plots of bootstrap replicates to see how robust our differential expression calls are, and set read_bootstrap_tpm=TRUE to allow for later visualizatons in TPM units to be available.

  1. At this point, you can do some quick and dirty, hard-coded visualzations of the expression data to make sure nothing looks wonky, although I’ll show you shortly how to do something fancier by launching a local Shiny server to interactively query your data. You can do a PCA of your samples, and look at count densities by treatment, which ideally should not vary by treatment!
plot_pca(so, color_by = 'temp')

plot_group_density(so, use_filtered = TRUE, units = "est_counts",trans = "log", grouping = "temp", offset = 1)  

A vanilla two-treatment DE analysis (ignoring the effects of population)

  1. We first fit a full model that includes a paramter for the condition (in this case temp)
so <- sleuth_fit(so, ~temp, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
  1. Next, we fit a reduced model that only includes the intercept
so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
  1. For each transcript, we perform a likelihood ratio test to determine whether the full model that includes a parameter for temperature fits the data significantly better than the reduced model. Significantly better fit is indicative of differential expression.
so <- sleuth_lrt(so, 'reduced', 'full')
  1. You can look in more detail at the models by doing:
models(so)
## [  full  ]
## formula:  ~temp 
## coefficients:
##  (Intercept)
##      templow
## [  reduced  ]
## formula:  ~1 
## coefficients:
##  (Intercept)
  1. Make a table of the results via:
sleuth_table <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
  1. And make a table that only includes the significantly DE transcripts
sleuth_significant <- dplyr::filter(sleuth_table, qval <= 0.05)
head(sleuth_significant)
##     target_id test_stat         pval         qval       rss   sigma_sq
## 1 FBtr0304667  45.20728 1.772446e-11 2.079999e-07 107.06048  9.6624276
## 2 FBtr0310675  46.63708 8.542836e-12 2.079999e-07 129.37903 11.6354121
## 3 FBtr0347171  44.76493 2.221668e-11 2.079999e-07  23.25386  2.0799216
## 4 FBtr0083919  44.08045 3.151521e-11 2.212919e-07   8.56118  0.7776422
## 5 FBtr0331205  43.56226 4.106835e-11 2.306973e-07 107.07949  9.6299604
## 6 FBtr0072815  41.91419 9.536802e-11 2.932643e-07 123.40663 11.0885041
##       tech_var mean_obs    var_obs sigma_sq_pmax smooth_sigma_sq
## 1 0.0703430864 2.292547  9.7327707     9.6624276      0.52771050
## 2 0.1263175923 2.573383 11.7617297    11.6354121      0.39125124
## 3 0.0340659677 4.080669  2.1139876     2.0799216      0.06979544
## 4 0.0006469001 7.600675  0.7782891     0.7776422      0.01915183
## 5 0.1045384609 2.285655  9.7344988     9.6299604      0.53155374
## 6 0.1302805575 2.473686 11.2187846    11.0885041      0.43543260
##   final_sigma_sq degrees_free
## 1      9.6624276            1
## 2     11.6354121            1
## 3      2.0799216            1
## 4      0.7776422            1
## 5      9.6299604            1
## 6     11.0885041            1

If one does dim(sleuth_table) and dim(sleuth_significant), you will see that 2237/28095 transcripts (7.96%) are significantly DE.From here, you could write the table to a file using write.table or write.csv. E.g.

write.csv(sleuth_table,"sleuth_vanilla.csv",row.names=FALSE,quote=FALSE)

You can also generate plots, based upon the bootstraps, to examine the variation among samples.
14. Plot variation in units of estimated counts

plot_bootstrap(so, "FBtr0304667", units = "est_counts", color_by = "temp")  
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")