Here we present an example analysis of 65k peripheral blood mononuclear cells (PBMCs) using the R package Seurat. This tutorial is meant to give a general overview of each step involved in analyzing a digital gene expression (DGE) matrix generated from a Parse Biosciences single cell whole transcriptome experiment.
Most of the steps in this tutorial mirror those found in the official Seurat PBMC tutorial, as it provides an excellent explanation of each step, but we've also changed or expanded sections where we feel additional information is helpful.
Text taken verbatim from the official Seurat tutorial is indented in a block quote like this.
Please refer to the following resources for more information about Seurat and single cell analysis in general:
- Seurat homepage
- Seurat PBMC tutorial
- Current best practices in single‐cell RNA‐seq analysis: a tutorial
Loading libraries and setting location paths
library(Seurat)
library(dplyr)
library(Matrix)
library(ggplot2)
rm(list = ls())
data_path <- "/volume-general/analysis/data/pbmc/"
fig_path <- "/volume-general/analysis/figures/pbmc/seurat/"
Below we've included a few convenience functions for saving images and reading/writing your Seurat object to disk. When you're working with larger datasets, it's usually a good idea to save your progress after computationally intensive steps so you can backtrack if you need to.
# Convenience functions
SaveFigure <- function(plots, name, type = "png", width, height, res = 200){
  if (type == "png") {
    png(paste0(fig_path, name, ".", type),
        width = width, height = height, units = "in", res = res)
  } else {
    pdf(paste0(fig_path, name, ".", type),
        width = width, height = height)
  }
  print(plots)
  dev.off()
}
SaveObject <- function(object, name){
  saveRDS(object, paste0(data_path, name, ".RDS"))
}
ReadObject <- function(name){
  readRDS(paste0(data_path, name, ".RDS"))
}
Reading in data
The Seurat function ReadParseBio() provides a convenient way to read your expression matrix into R using the DGE folder path as input. In Seurat versions >= 4.1, ReadParseBio() assumes the gene list in your DGE directory is named "all_genes.csv" (Parse pipeline versions >= 0.9.6). For Parse pipeline versions <= 0.9.3, you'll need to rename the file to "genes.csv".
mat_path <- "/volume-general/analysis/combined/all-sample/DGE_filtered"
mat <- ReadParseBio(mat_path)
# Check to see if empty gene names are present, add a name if so.
table(rownames(mat) == "")
rownames(mat)[rownames(mat) == ""] <- "unknown"
# Read in cell meta data
cell_meta <- read.csv(paste0(mat_path, "/cell_metadata.csv"), row.names = 1)
# Create object
pbmc <- CreateSeuratObject(mat, min.features = 100, min.cells = 100,
                           names.field = 0, meta.data = cell_meta)
When we create our Seurat object, the plate well numbers (column names in the expression matrix) from the experiment will automatically be assigned to the cell identity slot. In other words, the program assumes this is how we want to initially classify our cells. In general, we would like to avoid this behavior so there isn't a bias towards a particular cell class when removing outliers.
# Setting our initial cell class to a single type; this will change after clustering.
pbmc@meta.data$orig.ident <- factor(rep("pbmc", nrow(pbmc@meta.data)))
Idents(pbmc) <- pbmc@meta.data$orig.ident
SaveObject(pbmc, "seurat_obj_before_QC")
pbmc <- ReadObject("seurat_obj_before_QC")
Cell quality control
In this step we'll perform cell quality control to prevent outlier cells from influencing downstream analyses. Cells that have unusually high transcript or gene counts are very likely to be multiplets, which is a term for two or more cells that share the same barcode. Conversely, cells with very low transcript or gene counts are a consequence of barcoding cells with damaged membranes, or of barcoding ambient RNA.
Filtering outliers can be accomplished by generating a violin plot for each cell feature and manually selecting the thresholds beyond which we wish to remove cells. We'll also add another important cell feature that shows the percentage of mitochondrial genes expressed in each cell. Cells with high mitochondrial gene percentages should be removed, as they are likely to have lost cytoplasmic RNA through lysis or may have increased apoptosis (Luecken and Theis 2019).
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
plot <- VlnPlot(pbmc, pt.size = 0.10,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
SaveFigure(plot, "vln_QC", width = 12, height = 6)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
SaveFigure((plot1 + plot2), "scatter_QC", width = 12, height = 6, res = 200)
# Perform the filtering
pbmc <- subset(pbmc, subset = nFeature_RNA < 5000 & nCount_RNA < 20000 & percent.mt < 15)
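If you'd like to confirm how many cells the filtering removed, one optional check is to compare cell counts against the pre-QC object saved earlier (the variable name below is just for illustration):
# Compare the number of cells before and after filtering
pbmc_before <- ReadObject("seurat_obj_before_QC")
ncol(pbmc_before)
ncol(pbmc)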
Normalizing the data
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
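To make the description above concrete, here is a minimal sketch of what LogNormalize computes for a single cell, using a made-up three-gene count vector rather than real data:
# Toy example of LogNormalize for one cell's raw counts
toy_counts <- c(geneA = 3, geneB = 0, geneC = 7)
log1p(toy_counts / sum(toy_counts) * 10000)
# For real cells, the corresponding values end up in pbmc[["RNA"]]@data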
Identification of highly variable features
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets. Our procedure in Seurat v3 is described in detail here, and improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
SaveFigure((plot1 + plot2), "var_features", width = 12, height = 6)
Scaling the data
Next, we apply a linear transformation ('scaling') that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function:
- Shifts the expression of each gene, so that the mean expression across cells is 0
- Scales the expression of each gene, so that the variance across cells is 1
- This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
- The results of this are stored in pbmc[["RNA"]]@scale.data
Warning! Generating a scaled data matrix can require a significant amount of RAM for large datasets and can potentially crash your session. We therefore chose the default setting, which only scales the variable gene set identified in the previous step. It's important to note that scaling only the variable genes does not affect clustering; it only limits the number of genes available for plotting in the DoHeatmap function.
pbmc <- ScaleData(pbmc)
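If your machine has enough memory and you want the full gene set available for DoHeatmap later, you can instead pass every gene explicitly, as in the official Seurat tutorial. This is optional and not required for the rest of this analysis:
# Optional: scale all genes rather than only the 2,000 variable features
# (memory-intensive for a dataset of this size)
all_genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all_genes)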
Perform linear dimensional reduction
Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but a different subset can be defined using the features argument.
Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimReduction, DimPlot, and DimHeatmap.
pbmc <- RunPCA(pbmc)
# SaveObject(pbmc, "seurat_obj_after_PCA")
pbmc <- ReadObject("seurat_obj_after_PCA")
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
PC_ 1
Positive: SLC8A1, LRP1, SLC11A1, IFI30, SPI1
Negative: RORA, CD247, BCL11B, CDC14A, TGFBR3
PC_ 2
Positive: BANK1, PAX5, EBF1, AFF3, FCRL1
Negative: CD247, RORA, AOAH, ARHGAP26, TGFBR3
PC_ 3
Positive: GNLY, LINC00298, NCAM1, C1orf21, AOAH
Negative: CAMK4, INPP4B, IL7R, SERINC5, ANK3
PC_ 4
Positive: AC104809.2, LINC02085, TCF7L2, FMNL2, MS4A6E
Negative: FLT3, CD36, AC023590.1, DACH1, SCN9A
PC_ 5
Positive: FCAR, VCAN, STAB1, CSF3R, FCN1
Negative: AC023590.1, EPHB1, CUX2, COL24A1, PTPRS
plot <- VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
SaveFigure(plot, "viz_PCA_loadings", width = 10, height = 8)
plot <- DimPlot(pbmc, reduction = "pca", group.by = "orig.ident")
SaveFigure(plot, "pc1_2_scatter", width = 8, height = 6)
In particular, DimHeatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the 'extreme' cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.
# Image doesn't save as png unless fast = FALSE
plot <- DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE, fast = FALSE)
SaveFigure(plot, "dim_heatmap1", width = 8, height = 6)
plot <- DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE, fast = FALSE)
SaveFigure(plot, "dim_heatmap1_15", width = 12, height = 18)
Determine the 'dimensionality' of the dataset
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a 'metafeature' that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? 10? 20? 100?
In Macosko et al, we implemented a resampling test inspired by the JackStraw procedure. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a 'null distribution' of feature scores, and repeat this procedure. We identify 'significant' PCs as those who have a strong enrichment of low p-value features.
An alternative heuristic method generates an 'Elbow plot': a ranking of principal components based on the percentage of variance explained by each one (ElbowPlot function). In this example, we can observe an 'elbow' around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.
Below we've included the code for estimating principal components using both the "Elbow" and "JackStraw" plots.
Option 1: Estimate PCs using the "Elbow Plot"
For this particular dataset, we chose not to use the JackStraw procedure for PC selection because of the time it takes to run on 65k cells, and instead used the elbow plot, from which we selected PCs 1-30. The JackStraw code is still included below for reference.
plot <- ElbowPlot(pbmc, ndims = 50)
SaveFigure(plot, "PC_elbow_plot", width = 8, height = 10)
Option 2: Estimate PCs using the "JackStraw Plot"
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
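If you do run the JackStraw procedure, the results can be visualized with JackStrawPlot; the figure name below is only a suggestion:
plot <- JackStrawPlot(pbmc, dims = 1:20)
SaveFigure(plot, "PC_jackstraw", width = 10, height = 6)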
Identifying the true dimensionality of a dataset can be challenging/uncertain for the user. We therefore suggest these three approaches to consider. The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.
We chose 10 here, but encourage users to consider the following:
- Dendritic cell and NK aficionados may recognize that genes strongly associated with PCs 12 and 13 define rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). However, these groups are so rare, they are difficult to distinguish from background noise for a dataset of this size without prior knowledge.
- We encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
- We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.
Cluster the cells
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected 'quasi-cliques' or 'communities'.
As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters function implements this procedure, and contains a resolution parameter that sets the 'granularity' of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents function.
pbmc <- FindNeighbors(pbmc, dims = 1:30)
pbmc <- FindClusters(pbmc, resolution = 0.30)
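A quick optional check after clustering is to count how many cells fall into each cluster at this resolution:
# Number of cells per cluster at resolution 0.30
table(Idents(pbmc))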
Reorder clusters according to their similarity
This step isn't explicitly required, but it can ease the burden of merging cell clusters (discussed further in the section "Merging clusters and labeling cell types") by renumbering the clusters according to their position on a phylogenetic tree.
pbmc <- BuildClusterTree(pbmc, reorder = TRUE, reorder.numeric = TRUE)
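If you'd like to inspect the tree used for this reordering, PlotClusterTree draws it with base graphics, so below we write the PNG directly rather than going through SaveFigure (the file name is just a suggestion):
# Draw the cluster dendrogram produced by BuildClusterTree
png(paste0(fig_path, "cluster_tree.png"), width = 8, height = 6, units = "in", res = 200)
PlotClusterTree(pbmc)
dev.off()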
Run non-linear dimensional reduction (UMAP/tSNE)
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
pbmc <- RunUMAP(pbmc, dims = 1:30)
plot <- DimPlot(pbmc, reduction = "umap")
SaveFigure(plot, "umap_louvain_res_p3", width = 9, height = 8)
You can save the object at this point so that it can easily be loaded back in without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.
# SaveObject(pbmc, "seurat_obj_clustered")
pbmc <- ReadObject("seurat_obj_clustered")
Differential gene expression (finding cluster markers)
Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a feature to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of features that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed features will likely still rise to the top.
pbmc_markers <- FindAllMarkers(pbmc, min.pct = 0.25, logfc.threshold = 0.25)
pbmc_markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
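The paragraph above also mentions testing one cluster against specific others; a brief sketch of that usage with FindMarkers is shown here (the cluster IDs are arbitrary placeholders, substitute ones of interest from your own clustering):
# Markers of cluster 2 versus all other cells
cluster2_markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
# Markers distinguishing cluster 5 from clusters 1 and 3
cluster5_vs_1_3 <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(1, 3), min.pct = 0.25)
head(cluster5_vs_1_3)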
We include several tools for visualizing marker expression. VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring RidgePlot, CellScatter, and DotPlot as additional methods to view your dataset.
plot <- VlnPlot(pbmc, features = c("MS4A1", "CD79A"), group.by = "tree.ident")
SaveFigure(plot, "vln_exp1", width = 16, height = 8)
# you can plot raw counts as well
plot <- VlnPlot(pbmc, features = c("NKG7", "GNLY"), slot = "counts", log = TRUE)
SaveFigure(plot, "vln_exp2", width = 16, height = 8)
Visualizing the top n genes per cluster
The Seurat PBMC tutorial makes use of the function DoHeatmap for visualizing the top n genes per cluster in a single figure. This method works well for a few thousand cells, but it loses resolution as the number of cells increases because individual columns have to be interpolated. The DotPlot function addresses this by displaying the average expression for all cells in a cluster.
top5 <- pbmc_markers %>% group_by(cluster) %>% top_n(n = 5, wt = avg_log2FC)
to_plot <- unique(top5$gene)
plot <- DotPlot(pbmc, features = to_plot, group.by = "tree.ident") + coord_flip()
SaveFigure(plot, "dplot_top5", width = 9, height = 20)
Merging clusters and labeling cell types
A common goal of single cell RNA-seq analysis is to eventually classify all clusters into a cell "type". But what constitutes a cell type? Does a cluster represent a cell type, or a cell in a temporary state as it transitions from one type to the next? Because the appropriate number of clusters depends on the nature of the dataset and specific analysis goals, it's usually a good idea to experiment with models that vary in cluster number.
However, there may be situations where one can't seem to get the "right" number of clusters. That is, you may have a cluster that perfectly describes cells in one population, while another population may have too many subdivisions. In the following example we'll demonstrate how to merge multiple clusters into a single unit to classify basic PBMC cell types, but it's important to note that this strategy can be used in any scenario where one wishes to customize cell classification.
First we'll take a look at the markers from the official tutorial and see which genes correspond to cell type, and then plot each gene on a UMAP plot to see which cluster they localize with.
Cluster ID | Markers | Cell Types |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | IL7R, S100A4 | Memory CD4+ |
2 | CD14, LYZ | CD14+ Mono |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
Note that the platelet marker PPBP isn't present in this dataset. We also have a couple unclassified clusters that aren't present in the official tutorial. We'll name these two clusters after their top marker genes, IGHA2 and ZNF385D.
# Plotting PBMC markers
markers <- c("IL7R", "CCR7", "S100A4", "CD14", "LYZ", "MS4A1", "CD8A",
"FCGR3A", "MS4A7", "GNLY", "NKG7", "FCER1A", "CST3", "PPBP", "IGHA2", "ZNF385D")
marker_plot <- FeaturePlot(pbmc, features = markers)
SaveFigure(marker_plot, "pbmc_markers", width = 30, height = 30)
Next, let's generate a UMAP plot with the clusters explicitly labeled so it's easier to see which cluster corresponds with our PBMC markers. You'll notice that the cluster numbers tend to follow an ordered pattern (e.g. the green clusters on the bottom left contain 5, 6, and 7). This is a consequence of running the BuildClusterTree function and should make it easier to assign them to a single group (in the next step).
plot <- DimPlot(pbmc, reduction = "umap", label = TRUE) + NoAxes() + NoLegend()
SaveFigure(plot, "umap_louvain_nolegend", width = 8, height = 7)
The last step involves generating a list where each cell type is assigned multiple clusters so they can be merged. You may wonder why the cell type names in the list look different from the label names: it's because unquoted R list names can't contain spaces or special characters like "+".
# New IDs from official tutorial
new_ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T",
"FCGR3A+ Mono", "NK", "DC", "IGHA", "ZNF385D")
new_id_list <- list(Naive.CD4.T = c(17,19,20), Memory.CD4.T = c(15,16,18),
CD14.Mono = c(2), B = c(5:7), CD8.T = 13, FCGR3A.Mono = 1, NK = c(10:12,14),
DC = c(3,8), IGHA = 4, ZNF385D = 9)
# Initialize the new metadata column, then fill it in cluster by cluster
pbmc@meta.data$collapsed <- NA_character_
for (i in 1:length(new_id_list)) {
  ind <- which(pbmc@meta.data$tree.ident %in% new_id_list[[i]])
  pbmc@meta.data$collapsed[ind] <- names(new_id_list)[i]
}
pbmc@meta.data$collapsed <- factor(
pbmc@meta.data$collapsed, levels = names(new_id_list), ordered = TRUE)
Idents(pbmc) <- pbmc@meta.data$collapsed
names(new_ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new_ids)
plot <- DimPlot(pbmc, reduction = "umap", label = TRUE) + NoAxes() + NoLegend()
SaveFigure(plot, "umap_louvain_nolegend_names", width = 8, height = 7)
Our final output