The goal of this workshop is to provide an introduction to differential expression analyses using RNA-seq data. While there are now many published methods for tackling specific steps, as well as full-blown pipelines, we will focus on two different approaches that have been show to be top performers with respect to controlling the false discovery rate. The first is using kallisto to estimate expression, in conjuction with sleuth to assess evidence for differential expression. This pipeline is based upon a recently developed approach that doesn’t rely on traditional read alignment. The second relies on read mapping with bowtie2, implemented by the gold standard expression estimation tool RSEM, followed by differential expression analysis with limma voom.
For today’s workshop, the expression estimates fed into sleuth and limma voom were created with an annotated genome. Time permitting, at the end of this workshop I will talk about how one would carry out such analyses without a genome (i.e. with a de novo transcriptome), and some of the considerable challenges one faces when conducting such analyses.
Finally, a broader goal of this workshop is to demonstrate how to analyze complex designs that are not the standard one treatment, two condition analyses that are prevalent in the literature.
Our sample data comprises 12 paired-end RNA-seq libraries for whole body samples of Drosophila melanogaster from two geographic regions (Panama and Maine), with two temperature treatments (“low” ane “high”) for each region, featuring three biological replicates for each region x treatment combination. Previously, these data were used to look for parallel gene expression patterns between high and low latitude populations (Zhao et al, 2015, PLoS Genetics)
First, load all the R libraries that will be used for today’s analyses:
library(limma)
library(sleuth)
## Loading required package: ggplot2
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(edgeR)
library("biomaRt")
Kallisto is an extremely fast tool that uses pseudo-alignments to determine the transcript from which a read originated. Sleuth is the differential expression tool that can accept directly the outputs of kallisto. Both tools came out of Lior Pachter’s lab at CalTech. While kallisto is “fast”, it is not fast enough to make it efficient to run it to completion for our workshop. Thus, you should have copied the kallisto directory that I pointed you to to a directory you have created on your notebook called kallisto_sleuth.
setwd("/Users/adamfreedman/Dropbox/informatics/DEworkshop")
sample_id <- dir(file.path("kallisto_sleuth/kallisto"))
kal_dirs <- file.path("kallisto_sleuth/kallisto", sample_id)
s2c<-read.table("kallisto_sleuth/dme_elev_samples.tab",header = TRUE, stringsAsFactors=FALSE)
If you are using RStudio, then this should pop up as a table pane in your top left window.
s2c <- mutate(s2c, path = kal_dirs)
so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE,read_bootstrap_tpm=TRUE)
## Warning in check_num_cores(num_cores): It appears that you are running Sleuth from within Rstudio.
## Because of concerns with forking processes from a GUI, 'num_cores' is being set to 1.
## If you wish to take advantage of multiple cores, please consider running sleuth from the command line.
## reading in kallisto results
## dropping unused factor levels
## ............
## normalizing est_counts
## 28095 targets passed the filter
## normalizing tpm
## merging in metadata
## summarizing bootstraps
## ............
Note, we do boostrap_summary = TRUE so we can do plots of bootstrap replicates to see how robust our differential expression calls are, and set read_bootstrap_tpm=TRUE to allow for later visualizatons in TPM units to be available.
plot_pca(so, color_by = 'temp')
plot_group_density(so, use_filtered = TRUE, units = "est_counts",trans = "log", grouping = "temp", offset = 1)
so <- sleuth_fit(so, ~temp, 'full')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_fit(so, ~1, 'reduced')
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
so <- sleuth_lrt(so, 'reduced', 'full')
models(so)
## [ full ]
## formula: ~temp
## coefficients:
## (Intercept)
## templow
## [ reduced ]
## formula: ~1
## coefficients:
## (Intercept)
sleuth_table <- sleuth_results(so, 'reduced:full', 'lrt', show_all = FALSE)
sleuth_significant <- dplyr::filter(sleuth_table, qval <= 0.05)
head(sleuth_significant)
## target_id test_stat pval qval rss sigma_sq
## 1 FBtr0304667 45.20728 1.772446e-11 2.079999e-07 107.06048 9.6624276
## 2 FBtr0310675 46.63708 8.542836e-12 2.079999e-07 129.37903 11.6354121
## 3 FBtr0347171 44.76493 2.221668e-11 2.079999e-07 23.25386 2.0799216
## 4 FBtr0083919 44.08045 3.151521e-11 2.212919e-07 8.56118 0.7776422
## 5 FBtr0331205 43.56226 4.106835e-11 2.306973e-07 107.07949 9.6299604
## 6 FBtr0072815 41.91419 9.536802e-11 2.932643e-07 123.40663 11.0885041
## tech_var mean_obs var_obs sigma_sq_pmax smooth_sigma_sq
## 1 0.0703430864 2.292547 9.7327707 9.6624276 0.52771050
## 2 0.1263175923 2.573383 11.7617297 11.6354121 0.39125124
## 3 0.0340659677 4.080669 2.1139876 2.0799216 0.06979544
## 4 0.0006469001 7.600675 0.7782891 0.7776422 0.01915183
## 5 0.1045384609 2.285655 9.7344988 9.6299604 0.53155374
## 6 0.1302805575 2.473686 11.2187846 11.0885041 0.43543260
## final_sigma_sq degrees_free
## 1 9.6624276 1
## 2 11.6354121 1
## 3 2.0799216 1
## 4 0.7776422 1
## 5 9.6299604 1
## 6 11.0885041 1
If one does dim(sleuth_table) and dim(sleuth_significant), you will see that 2237/28095 transcripts (7.96%) are significantly DE.From here, you could write the table to a file using write.table or write.csv. E.g.
write.csv(sleuth_table,"sleuth_vanilla.csv",row.names=FALSE,quote=FALSE)
You can also generate plots, based upon the bootstraps, to examine the variation among samples.
14. Plot variation in units of estimated counts
plot_bootstrap(so, "FBtr0304667", units = "est_counts", color_by = "temp")
## Warning: 'switch' is deprecated.
## Use 'strip.position' instead.
## See help("Deprecated")