Welcome to today’s workshop on single-cell RNA-seq. Single cell genomics is nothing if not a field moving at warp speed: papers on new data analysis methods and tools are being published all the times, not the mention the myriad empirical papers. In short, don’t pay attention for a few months, and you may find yourself racing to play catch up. This workshop is not intended–nor would it be possible in an hour–to cover current methodology. Instead, it is intended to be a gentle introduction to scRNA-seq technology and to impart a broad understanding of the foundations for scRNA-seq analysis. Specifically we will cover:
Single-cell RNA-seq (scRNAseq) is a powerful new approach that has arisen and matured over the last decade. In contrast to bulk RNAseq– in which sequencing libraries are constructed from tissues or cell populations in a manner that is entirely agnostic to the identities of and expression variation among the constituent cells–scRNAseq involved isolating and labelling individual cells and their constituent RNA molecules. This enables the association of inidiviual cDNAs or RNAs with their cell of origin, such that cell level expression profiles can be constructed. Early scRNAseq technologies were relatively low throughput, producing expression profiles for dozens to hundreds of cell. In contrast, contemprary methodologies now enable the sequencing of thousands to hundreds of thousands of cells.
A standard scRNAseq workflow involves several data cleaning steps, that then enable a variety of downstream analyses. A broad view of the steps in an analysis is exemplified by:
Adapted from Heumos et al. 2023, *Nature Biotechnology”
scRNA-seq has the potential to offer deeper insights into a number of research areas including:
While these examples highlight the potential value of scRNAseq in biomedical research and developmental/cell biology, scRNA-seq also has considerable potential for organismal biologists who have historically used bulk RNA-seq to understand expression variation across enviornments and populations, and even among species. Phenotypes observed in laboratory organisms and natural populations are, with the exception of single-celled organism, the product of an assortment of cell types. As a result, both the statistical power of bulk RNA-seq studies and the generality of biological inference are constrained by the inability to account for cell composition variation among samples, even if they are derived from the same tissue. Contrasting differential expression (DE) analysis for bulk and scRNAseq is a useful demonstration of this principle:
The 10x Genomics 3’assay has become an industry standard, using a gel-droplet technology that allows isolation of indiviual cells in droplets, so that cDNA fragments can be associated with a cell barcode reflecting the cell of origin, and individual RNAs can be associated with unique molecular barcodes (UMIs), which enable computational deduplication to avoid PCR amplification bias.
Adapted from “10x 3’ library construction user guide”
Because the cDNA sequence readout is only from the 3’ end, a workflow based upon this technology is only able to produce expression estimates (per cell) at the gene-level. In other words, there is not enough unique sequence to enable alignment of a particular sequencing fragment to a particular alternatively spliced isoform.
Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have recently released competing technologies to achieve high throughput full-length isoform sequencing at single cell resolution. Generally speaking, both approaches take 10x 3’ scRNA-seq libraries as input, concatenate the sequencing fragments into larger library fragments, then sequence those fragments on their respective instruments. These approaches embed sequences between the ligated 10x reads, such that after sequencing, the long reads can be computationally segmented into the original (putatively full length) transcript fragments. * The PacBio MAS-Seq method is described in Al’Khafaji et al. 2024, Nature Biotechnology, with workflow instructions for generating expression matrices can be found here. * Similarly, ONT has a nextflow workflow that we have implemented successfully on the Cannon cluster that is detailed here.
Despite major advances in scRNA-seq library preparation and sequencing, there remain technical aspects of the technology that generate artefacts and noise that need to be cleaned up prior to conducting any downstream analyses. In an ideal world, each gel droplet from which a sequencing library is constructed will contain a single intact cell and no contaminant RNA. In reality, there are common deviations from that ideal world:
In summary one can observe:
An ideal workflow would thus:
These filtering steps must be conducted before proceeding with any downstream normalization, clustering, marker gene identification or other biologically informative analyses.
Typical metrics reported as part of cellranger count or other instrument-related software include:
Because sequencing effort is spread across thousands of cells for which expression is estimated for tens of thousands of features, scRNAseq matrices are sparse, meaning that there are lots of small counts, and more importantly, lots of zero counts. The resulting count matrices are often referred to as “zero-inflated”, leading to debate about the sources of zero-inflation, and whether the counts are in fact zero-inflated. Relevant papers to look at are Jiang et al. 2022, Genome Biology, and Svensson 2022, Nature Biotechnology. Wherever the consensus ultimately lands on the statistical front, pratically speaking, the count matrices are comprised of rows that represent features (either gene ids or isoform ids), and columns that represent cells. Additional data files that typically accompany the matrix are:
We will walk through the workflow represented by the diagram below
that takes 10x scRNA-seq raw and “filtered” count
matrices output by CellRanger count and does further
filtering and count correction.
library(Seurat)
library(patchwork)
library(sctransform)
library(tidyverse)
library(SoupX)
library(cowplot)
library(glmGamPoi)
library(scDblFinder)
library(scater)
data_dir=getwd()
sc_data_raw <- Seurat::Read10X(file.path(data.dir = data_dir,"raw_gene_bc_matrices","GRCh38"))
sc_data_filtered <- Seurat::Read10X(file.path(data.dir = data_dir,"filtered_gene_bc_matrices","GRCh38"))
If we take a quick look at the dimensions of the filtered matrix, we see that:
dim(sc_data_filtered)
## [1] 33694 4340
there are 33694 genes an 4340 cells in the data set.
Tools to correct counts for droplets deemed to contain cells involve two steps: * estimating the ambient “soup” from putatively empty droplets, and * using the estimate of the soup to subtract it from droplets containing cells.
We are using SoupX to detect and correct counts for contamination for ambient RNA. SoupX corrects contamination as follows:
NOTE: Another option for ambient RNA removal is cellbender, which takes a different approach from SoupX in that it uses a generative model of the process by which scRNA-seq is generated. In side-by-side comparisons this method has been shown to work quite well, although we have some questions about how it will perform in cases where the fit between the model and the data is not optimal. Cellbender is run as a command line tool that we have successfully run on the Cannon cluster. Because it take a good bit of memory and time to run, and because it cannot be run from within R, we do not use that method in today’s workshop.
Because SoupX requires information on clusters, we must perform a provisional clustering analysis, knowing that we will re-cluster the data once all filtering and cleanup of the data has been finished. Another important thing to note is that rather than the traditional way of normalizing count data, were are using a new and improved normalization method called SCtransform, described in Hafemeister and Satija 2019, Genome Biology, conveniently available in Seurat.
seurat_obj <- CreateSeuratObject(counts = sc_data_filtered)
seurat_obj <- PercentageFeatureSet(seurat_obj, pattern = "^MT-", col.name = "percent.mt")
seurat_obj <- SCTransform(seurat_obj, vars.to.regress = "percent.mt", verbose = FALSE)
seurat_obj <- RunPCA(seurat_obj, verbose = FALSE)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
## 13:12:37 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:12:37 Read 4340 rows and found 30 numeric columns
## 13:12:37 Using Annoy for neighbor search, n_neighbors = 30
## 13:12:37 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:12:37 Writing NN index file to temp file /var/folders/y4/5qbzd1h11vdb2lww93vpl_pc0000gq/T//RtmpcYlTwp/file10bb37fc29078
## 13:12:37 Searching Annoy index using 1 thread, search_k = 3000
## 13:12:37 Annoy recall = 100%
## 13:12:38 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:12:39 Initializing from normalized Laplacian + noise (using RSpectra)
## 13:12:39 Commencing optimization for 500 epochs, with 182084 positive edges
## 13:12:39 Using rng type: pcg
## 13:12:41 Optimization finished
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
seurat_obj <- FindClusters(seurat_obj)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4340
## Number of edges: 156541
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8698
## Number of communities: 15
## Elapsed time: 0 seconds
unfiltered_umap_plot<-DimPlot(seurat_obj, label = TRUE) + ggtitle("Unfiltered")
unfiltered_umap_plot
Metadata in a Seurat object comprise information regarding the cells,
i.e. barcodes recovered in the sequencing experiment. Default metadata
are:
1. nCount_RNA, the number of unique RNAs in the cell (equivalent to the
number of UMIs), and
2. nFeature_RNA, which is the number of genes for 3’ 10x sequencing.
One can add additional metadata such as the percent of UMIs that originate from mtDNA, which we will do later, but there is no need since this isn’t the final version of the data. But, notice that normalization and clustering has led to additional metadata being added:
head(seurat_obj@meta.data)
## orig.ident nCount_RNA nFeature_RNA percent.mt nCount_SCT
## AAACCTGAGAAGGCCT-1 SeuratProject 1738 748 6.386651 3461
## AAACCTGAGACAGACC-1 SeuratProject 3240 1052 5.462963 3627
## AAACCTGAGATAGTCA-1 SeuratProject 1683 739 7.367796 3451
## AAACCTGAGCGCCTCA-1 SeuratProject 2319 875 3.837861 3442
## AAACCTGAGGCATGGT-1 SeuratProject 2983 951 2.246061 3587
## AAACCTGCAAGGTTCT-1 SeuratProject 4181 1248 2.224348 3967
## nFeature_SCT SCT_snn_res.0.8 seurat_clusters
## AAACCTGAGAAGGCCT-1 758 13 13
## AAACCTGAGACAGACC-1 1040 0 0
## AAACCTGAGATAGTCA-1 767 0 0
## AAACCTGAGCGCCTCA-1 862 2 2
## AAACCTGAGGCATGGT-1 939 1 1
## AAACCTGCAAGGTTCT-1 1232 3 3
Note: one can also access the metadata with
seurat_obj[[]].
# create soup channel
soup_channel <- SoupX::SoupChannel(tod = sc_data_raw,toc=sc_data_filtered,
is10X = TRUE)
# add sc_data_raw as an accessible data frame in the soup channel
soup_channel$tod<-sc_data_raw
# specify clusters from the seurat_obj value for Idents
soup_channel <- SoupX::setClusters(soup_channel,
clusters = as.factor(Idents(seurat_obj)))
# manually specify dimension reduction
soup_channel <- setDR(soup_channel,
DR=Seurat::Embeddings(seurat_obj, "umap"))
# estimate contamination
soup_channel <- autoEstCont(soup_channel)
## 822 genes passed tf-idf cut-off and 428 soup quantile filter. Taking the top 100.
## Using 998 independent estimates of rho.
## Estimated global rho of 0.06
The labelling on the above plot is slighty confusing because Bayesian methods are not used to estimate rho, the contamination fraction. The “prior” is actually the combined density of all the non-expressed gene x cell cluster estimates of the contaminant fraction, while the “posterior” is the joint likelihood function fitted to the data.
We can update the counts to the corrected ones, then save the raw counts to a raw_counts attribute.
corrected_counts <- adjustCounts(soup_channel,roundToInt=TRUE)
## Expanding counts from 15 clusters to 4340 cells.
seurat_obj_filt <- CreateSeuratObject(counts = corrected_counts)
seurat_obj_filt <- PercentageFeatureSet(seurat_obj_filt, pattern = "^MT-", col.name = "percent.mt")
From a recent review, the top-performing doublet detection method was found to be scDblFinder. Like most doublet detection methods, scDblFinder simulates doublets, performs dimensionality reduction on the joint (simulated doublets + real data), and classifies droplets as singlets or multiplets based upon their neighborhood relationships with simulated doublets in the low-dimensional embedding space: cells with many simulated doublet neighbors are classified as doublets, those distant from the simulated doublets in that space are classified as singlets.
scDblFinder cannot take a seruat object as input.
The object must first be converted to a
SingleCellExperiment object.
sce_obj_filt <- as.SingleCellExperiment(seurat_obj_filt)
scDblFinder has parameter options you can change, and
particularly relevant is dbr, which is the expected doublet
rate. However, the developers indicate “For 10x data, it is usually
safe to leave the dbr empty, and it will be automatically estimated. (If
using a chip other than the standard 10X, you might have to adjust it or
the related dbr.per1k argument.”
sce_obj_filt <- scDblFinder(sce_obj_filt)
## Creating ~3472 artificial doublets...
## Dimensional reduction
## Evaluating kNN...
## Training model...
## iter=0, 544 cells excluded from training.
## iter=1, 547 cells excluded from training.
## iter=2, 528 cells excluded from training.
## Threshold found:0.433
## 221 (5.1%) doublets called
The doublet classification information is now stored in
scDblFinder.class in the SCE object.
head(colData(sce_obj_filt))
## DataFrame with 6 rows and 9 columns
## orig.ident nCount_RNA nFeature_RNA percent.mt
## <factor> <numeric> <integer> <numeric>
## AAACCTGAGAAGGCCT-1 SeuratProject 1653 732 6.53358
## AAACCTGAGACAGACC-1 SeuratProject 3062 1004 5.55193
## AAACCTGAGATAGTCA-1 SeuratProject 1609 721 7.39590
## AAACCTGAGCGCCTCA-1 SeuratProject 2189 837 3.88305
## AAACCTGAGGCATGGT-1 SeuratProject 2821 902 2.19780
## AAACCTGCAAGGTTCT-1 SeuratProject 3930 1168 2.18830
## ident scDblFinder.class scDblFinder.score
## <factor> <factor> <numeric>
## AAACCTGAGAAGGCCT-1 SeuratProject singlet 0.000338863
## AAACCTGAGACAGACC-1 SeuratProject singlet 0.001126437
## AAACCTGAGATAGTCA-1 SeuratProject singlet 0.000406255
## AAACCTGAGCGCCTCA-1 SeuratProject singlet 0.021840787
## AAACCTGAGGCATGGT-1 SeuratProject singlet 0.000286932
## AAACCTGCAAGGTTCT-1 SeuratProject singlet 0.011666683
## scDblFinder.weighted scDblFinder.cxds_score
## <numeric> <numeric>
## AAACCTGAGAAGGCCT-1 0.1089213 0.038705186
## AAACCTGAGACAGACC-1 0.1576065 0.088725122
## AAACCTGAGATAGTCA-1 0.0000000 0.000629113
## AAACCTGAGCGCCTCA-1 0.3870337 0.019500514
## AAACCTGAGGCATGGT-1 0.0793885 0.026678087
## AAACCTGCAAGGTTCT-1 0.4515635 0.025072564
Three types of information are provided on droplets:
If desired, the scores faciliate fine-grained tuning of thresholds after iterations of data exploration. For the purposes of this workshop, we will simply add the scDblFinder.class data to the seurat object and filter in on that classification.
seurat_obj_filt$scDblFinder.class <- colData(sce_obj_filt)$scDblFinder.class
seurat_obj_filt_singlets <- subset(seurat_obj_filt, subset = scDblFinder.class == "singlet")
cat(paste("The number of cells before doublet removal: ",dim(seurat_obj_filt)[2],"\n"))
## The number of cells before doublet removal: 4340
cat(paste("The number of cells after doublet removal: ",dim(seurat_obj_filt_singlets)[2]))
## The number of cells after doublet removal: 4119
While computational methods to remove empty droplets and detect doublets enable flagging and filter out of barcodes that can negatively impact downstream analysis, it is likely that a number of droplets will remain in the data set that should probably be removed, in particular:
Low quality cells are often indicated by both low UMI (and feature) counts, particularly after ambient RNA removal, and a high percentage of UMIs assigned to mtDNA genes.
Prior to filtering, it is straightforward (and useful!) to visualize UMI and feature (gene) counts and the percentage of mtDNA UMIs in droplets. For example, a large fraction of cells with high mtDNA% would indicate that a high fraction of cells may have been dying during at the commencement of library construction.
Because clustering information is now embedded in the seurat object, it is not possible using default Seurat functions to produce a global violin plot without it being grouped by cluster id. Therefore, we can just use tidyverse functions to make some prettier plots!
violindata<-tibble(nCount_RNA = seurat_obj_filt@meta.data$nCount_RNA, nFeature_RNA = seurat_obj_filt@meta.data$nFeature_RNA, percent.mt = seurat_obj_filt@meta.data$percent.mt)
colnames(violindata)<-c("nCount_RNA","nFeature_RNA","percent.mt")
ncount_violin<- violindata %>% ggplot(aes(x=1,y=nCount_RNA)) + geom_violin() + xlab("")
nfeat_violin<- violindata %>% ggplot(aes(x=1,y=nFeature_RNA)) + geom_violin() + xlab("")
mt_violin<- violindata %>% ggplot(aes(x=1,y=percent.mt)) + geom_violin() + xlab("")
plot_grid(ncount_violin,nfeat_violin,mt_violin, ncol=3, nrow=1)
We can also make biplots of the metrics:
mtvsnc<- violindata %>% ggplot(aes(x=log10(nCount_RNA),y=percent.mt)) + geom_point(alpha=0.2,size=1)
nfvsnc<- violindata %>% ggplot(aes(x=log10(nCount_RNA),y=nFeature_RNA)) + geom_point(alpha=0.2,size=1)
plot_grid(mtvsnc,nfvsnc, ncol=2, nrow=1)
There are two approaches for post hoc filtering:
Following the Orchestrating Single-Cell Analysis with Bioconductor, aka OSCA, for outlier detection we perform adaptive filtering using the median absolute deviation (MAD) statistic, filtering out:
For the seurat object, we have to:
perCellQCMetricsisOutlier functionisOutlieras.Seurat function that
converts an SCE looks for these counts in the data slot and will throw
an error if it can’t find themsce_filt_singlets <-as.SingleCellExperiment(seurat_obj_filt_singlets)
## Warning: Layer 'data' is empty
## Warning: Layer 'scale.data' is empty
mito_genes <- grep("^mt-", rownames(seurat_obj_filt_singlets), value = TRUE)
qc <- perCellQCMetrics(sce_filt_singlets, subsets = list(Mito = mito_genes))
# 1. High %mtDNA outliers
high_mito <- isOutlier(qc$subsets_Mito_percent, type = "higher") # TRUE for outliers (high %mtDNA)
cat(paste("The number of cells with high mtDNA %: ",sum(high_mito)))
## The number of cells with high mtDNA %: 0
# 2. Low UMI count outliers (library size is 'sum')
low_counts <- isOutlier(qc$sum, type = "lower") # TRUE for outliers (low total UMI count)
cat(paste("The number of cells with low UMI counts %: ",sum(low_counts)))
## The number of cells with low UMI counts %: 0
# 3. Low feature count outliers ('detected')
low_features <- isOutlier(qc$detected, type = "lower") # TRUE for outliers (low detected features)
cat(paste("The number of cells with low # of genes recovered %: ",sum(low_features)))
## The number of cells with low # of genes recovered %: 0
While this data set did not contain any statistically detectable outliers based upon MAD, if they had been detected, one would proceed by:
cells_to_keep from
the discard listsubset function to specify which cells
to keepdiscard <- high_mito | low_counts | low_features
cells_to_keep <- colnames(seurat_obj_filt_singlets)[!discard]
seurat_obj_filt_singlets <- subset(seurat_obj_filt_singlets, cells = cells_to_keep)
Beyond what outliers may have been removed with the MAD approach, one can then apply additional manual filters that typically used.
seurat_obj_filt_singlets_posthocfilt <- subset(seurat_obj_filt_singlets,subset = nFeature_RNA > 200 & nCount_RNA > 500 & percent.mt < 5)
We reduced the data set to 3612 cells,filtering out approximately 17% of the input cells. We can now perform clustering on the filtered data.
seurat_obj_filt_singlets_posthocfilt <- SCTransform(seurat_obj_filt_singlets_posthocfilt, vars.to.regress = "percent.mt", verbose = FALSE)
seurat_obj_filt_singlets_posthocfilt <- RunPCA(seurat_obj_filt_singlets_posthocfilt, verbose = FALSE)
seurat_obj_filt_singlets_posthocfilt <- RunUMAP(seurat_obj_filt_singlets_posthocfilt, dims = 1:30)
## 13:13:10 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:13:10 Read 3612 rows and found 30 numeric columns
## 13:13:10 Using Annoy for neighbor search, n_neighbors = 30
## 13:13:10 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:13:11 Writing NN index file to temp file /var/folders/y4/5qbzd1h11vdb2lww93vpl_pc0000gq/T//RtmpcYlTwp/file10bb33c811450
## 13:13:11 Searching Annoy index using 1 thread, search_k = 3000
## 13:13:11 Annoy recall = 100%
## 13:13:11 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:13:12 Initializing from normalized Laplacian + noise (using RSpectra)
## 13:13:12 Commencing optimization for 500 epochs, with 147830 positive edges
## 13:13:12 Using rng type: pcg
## 13:13:15 Optimization finished
seurat_obj_filt_singlets_posthocfilt <- FindNeighbors(seurat_obj_filt_singlets_posthocfilt, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
seurat_obj_filt_singlets_posthocfilt <- FindClusters(seurat_obj_filt_singlets_posthocfilt)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3612
## Number of edges: 126825
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8755
## Number of communities: 16
## Elapsed time: 0 seconds
saveRDS(seurat_obj_filt_singlets_posthocfilt, file = "pbmc4k_seurat_obj_filt_singlets_posthocfilt.rds")
filtered_umap_plot<-DimPlot(seurat_obj_filt_singlets_posthocfilt, label = TRUE) + ggtitle("Filtered")
plot_grid(unfiltered_umap_plot,filtered_umap_plot,ncol=2,nrow=1)
# Define the Jaccard similarity function
jaccard_similarity <- function(set1, set2) {
intersect_length <- length(intersect(set1, set2))
union_length <- length(set1) + length(set2) - intersect_length
return((intersect_length / union_length))
}
clusters_unfilt<-tibble(cellbarcode=row.names(seurat_obj@meta.data),clusterid_unfilt=seurat_obj@meta.data$seurat_clusters)
clusters_filt<-tibble(cellbarcode=row.names(seurat_obj_filt_singlets_posthocfilt@meta.data),clusterid_filt=seurat_obj_filt_singlets_posthocfilt@meta.data$seurat_clusters)
clusters_merged <- full_join(clusters_unfilt,clusters_filt,by="cellbarcode")
unique_unfilt <- unique(clusters_merged$clusterid_unfilt)
unique_filt <- unique(clusters_merged$clusterid_filt)
# Precompute indices for each cluster to avoid recalculating
unfilt_indices <- split(1:nrow(clusters_merged), clusters_merged$clusterid_unfilt)
filt_indices <- split(1:nrow(clusters_merged), clusters_merged$clusterid_filt)
# Create an empty list to store results
jaccard_list <- list()
# Calculate Jaccard distances and store them directly in a list
for (i in names(unfilt_indices)) {
for (j in names(filt_indices)) {
set1 <- unfilt_indices[[i]]
set2 <- filt_indices[[j]]
similarity <- jaccard_similarity(set1, set2)
# Append result as a named list
jaccard_list[[length(jaccard_list) + 1]] <- list(
unfilt = i,
filt = j,
jaccard_similarity = similarity
)
}
}
# Convert the list to a data frame
jaccard_df <- do.call(rbind, lapply(jaccard_list, as.data.frame))
# Convert columns to appropriate types
jaccard_df <- type.convert(jaccard_df, as.is = TRUE)
unfiltVsfilt_heatmap_plot <- ggplot(data = jaccard_df, aes(unfilt, filt, fill = jaccard_similarity)) +
geom_tile(color="black") +
#theme_classic() +
scale_fill_gradient(low ="white",high = "firebrick",breaks=seq(0,1,0.2)) +
scale_y_continuous(breaks=seq(0,sum(is.na(unique_filt)==FALSE)-1,1)) +
scale_x_continuous(breaks=seq(0,sum(is.na(unique_unfilt)==FALSE)-1,1)) +
labs(title = "Filtered vs. unfiltered cluster similarity",fill = "Jaccard Similarity") +
xlab("Unfiltered clusters") +
ylab("Filtered clusters") +
theme_minimal() +
theme(
panel.grid.major = element_blank(), # Remove major grid lines
panel.grid.minor = element_blank(), # Remove minor grid lines
panel.background = element_blank() # Remove any panel background color
) +
theme(axis.text.x = element_text(size=10),axis.text.y = element_text(size=10)) +
theme(
axis.text.x = element_text(margin = margin(t = -8)), # Move x-axis labels closer
axis.text.y = element_text(margin = margin(r = -12)) # Move y-axis labels closer
) +
theme(axis.title.x=element_text(size=14),axis.title.y=element_text(size=14))
unfiltVsfilt_heatmap_plot
A few notable observations can be made about the similarity plot:
In summary, filtering matters!
In our subsequent scRNA-seq workshop next week, we will dive into downstream analysis including: