The Data Processing page of Trailmaker™ is where the necessary filtering and cleanup are performed. During steps 1-5, background, dead cells, poor-quality cells, and doublets are excluded. In step 6, multi-sample datasets are integrated to remove batch effects, and dimensionality reduction is performed. Finally, in step 7, the embedding is configured (e.g. UMAP or t-SNE) and clustering is applied. Together, these essential data processing steps ensure that the data are high quality, support accurate downstream analysis, and make complex data easy to visualize and interpret.
A detailed explanation of each individual step is provided in the user guide. The goal of this article is to provide a brief guided walkthrough of a good quality dataset, and to direct you towards additional resources when you want to adjust settings or troubleshoot failures. This walkthrough uses the “Human PBMCs - Evercode v3” dataset that’s available in Trailmaker’s datasets repository.
Within each filtering step (steps 1-5, accessed via the dropdown menu [A in screenshot]), the samples are listed vertically. Samples can be shown or hidden using the selection box [B in screenshot]. Note that hiding samples from the data processing filter view does not exclude them from the analysis - all samples present in the project are included in the analysis. A plot is shown for each sample [C in screenshot], with the filtering parameters and options for that sample available in the Filtering Settings menu [D in screenshot]. Below each sample plot is a statistics table that reports metrics before and after the currently viewed filter step is applied [E in screenshot]. A status indicator displays the current status of the data processing pipeline [F in screenshot].
This first run of data processing uses automatically determined default settings for filtering individual samples. These filtering settings follow best practices in the field, and may either be constant across all samples (as in filter 1) or set dynamically per sample, in which case they will differ between samples (filters 2-5). Standard settings are used for integration and clustering (steps 6-7).
All settings can be manually adjusted using the Filtering Settings menu, and filtering steps can be disabled. In either case, you will be prompted to re-run data processing to apply your changes.
Note that automatic and recommended adjustments to data processing steps and parameters based on technology type, data type (cells or nuclei), and species are detailed in the following article: How to adjust data processing settings to fit your dataset and troubleshoot data processing failures.
Let’s take a brief look at what happens in each step of data processing, together with example plots:
Step 1, Classifier filter: The Classifier filter removes background and/or empty droplets from the dataset and is specifically developed for droplet-based technologies. This filter uses the ‘emptyDrops’ method to calculate the False Discovery Rate (FDR), a statistical value which represents the probability that a droplet is empty. The default filtering threshold is an FDR value of 0.01 for all samples. Note that this filter is disabled by default for Parse Biosciences data.
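In effect, the filter keeps only barcodes whose emptyDrops FDR is at or below the threshold. A minimal sketch with hypothetical FDR values (illustrative only - not actual emptyDrops output, and not Trailmaker's internal code):

```python
# Hypothetical emptyDrops-style FDR values per barcode (illustrative only).
fdr_by_barcode = {
    "AAACCTG": 0.0005,  # confidently a cell
    "AAACGGG": 0.008,   # still below the default threshold
    "AAAGATC": 0.20,    # likely an empty droplet
    "AAATGCC": 0.75,    # almost certainly background
}

FDR_THRESHOLD = 0.01  # Trailmaker's default for the Classifier filter

def classifier_filter(fdr, threshold=FDR_THRESHOLD):
    """Keep barcodes whose FDR is at or below the threshold."""
    return {bc: v for bc, v in fdr.items() if v <= threshold}

kept = classifier_filter(fdr_by_barcode)
print(sorted(kept))  # the two low-FDR barcodes pass
```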
Step 2, Cell size distribution filter: The Cell size distribution filter can be used to further remove background from your dataset. For Parse Biosciences data, this is the main filter that is used to exclude background. For other technologies, this filter is disabled by default though you can choose to enable it. This filter sets a hard threshold on the minimum number of transcripts, based on the steepest point (or knee) of the barcode rank plot. The filtering threshold is dynamically set per sample according to the spread of the data, and will differ across samples within your dataset.
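The knee-based cutoff can be illustrated with a toy sketch: sort barcodes by transcript count, find the steepest drop on the log scale, and discard everything below it. The counts and the simple knee-finding heuristic here are hypothetical simplifications, not the exact method Trailmaker applies:

```python
import math

# Hypothetical per-barcode transcript counts (illustrative only).
counts = [5200, 4800, 4500, 4100, 3900, 350, 120, 80, 40, 15]

# Sort descending to form the barcode rank curve.
ranked = sorted(counts, reverse=True)

# Crude knee estimate: the steepest drop on the log-count curve.
log_counts = [math.log10(c) for c in ranked]
drops = [log_counts[i] - log_counts[i + 1] for i in range(len(log_counts) - 1)]
knee_index = drops.index(max(drops))

# Barcodes at or below the knee are treated as background.
min_transcripts = ranked[knee_index + 1]
kept = [c for c in counts if c > min_transcripts]
print(min_transcripts, len(kept))
```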
Step 3, Mitochondrial content filter: The mitochondrial content filter removes dead and dying cells based on the percentage of mitochondrial reads. The default threshold is determined as 3 median absolute deviations above the median, and is calculated per sample.
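The per-sample threshold can be reproduced conceptually in a few lines. The mitochondrial percentages below are hypothetical, and this simplified sketch omits any scaling factor that a production MAD implementation might apply:

```python
import statistics

# Hypothetical per-cell mitochondrial percentages for one sample.
mito_pct = [2.1, 3.4, 1.8, 2.9, 4.0, 2.5, 3.1, 18.7, 2.2, 25.3]

median = statistics.median(mito_pct)
# Median absolute deviation: the median of |x - median|.
mad = statistics.median([abs(x - median) for x in mito_pct])

# Default threshold: 3 MADs above the median, computed per sample.
threshold = median + 3 * mad

kept = [x for x in mito_pct if x <= threshold]
print(round(threshold, 2), len(kept))  # the two high-mito cells are removed
```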
Step 4, Number of genes vs transcripts filter: This filter works on the principle that the number of unique transcripts increases with the number of genes. Outliers are typically either (i) cells that contain many genes but few transcripts, which suggests the transcripts were not amplified well; or (ii) cells that contain few genes but many transcripts, which indicates that the few transcripts that exist are over-amplified. For Parse data, a third order spline (cubic polynomial) model is applied by default ('spline' option in the menu controls), whereas for other technology types a linear fit model is applied by default. The 'prediction interval' (calculated with the R function 'predict') sets the stringency for defining outliers, which fall outside the two dotted red lines and are excluded.
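The idea can be illustrated with a simplified sketch: fit a line to (log-scale) genes vs transcripts and flag cells whose residual sits far from the fit. A plain least-squares fit with a residual cutoff stands in here for the spline/linear models and prediction interval described above; the data points are hypothetical:

```python
import math

# Hypothetical log10(genes) vs log10(transcripts) per cell (illustrative only).
inliers = [(x / 10, x / 10 + 0.3) for x in range(20, 40, 2)]  # follow the trend
flagged = [
    (2.9, 1.7),  # far fewer transcripts than expected: poor amplification
    (2.9, 4.7),  # far more transcripts than expected: over-amplification
]
points = inliers + flagged

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least-squares linear fit: y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / \
        sum((x - mean_x) ** 2 for x, _ in points)
intercept = mean_y - slope * mean_x

# Points beyond ~2 residual SDs of the fitted line stand in for
# points outside the prediction-interval bands (the dotted red lines).
residuals = [y - (intercept + slope * x) for x, y in points]
sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
outliers = [points[i] for i, r in enumerate(residuals) if abs(r) > 2 * sd]
print(outliers)  # only the two flagged cells fall outside the bands
```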
Step 5, Doublet filter: The doublet filter calculates the doublet probability for all cells using the scDblFinder algorithm, and filters out cells with a high probability of being a doublet. The filtering threshold is dynamically set on a per sample basis, and represents the maximum doublet score assigned to cells that are classified as singlets.
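Conceptually, the threshold is the highest doublet score among cells the algorithm calls singlets, and any cell scoring above it is removed. A toy sketch with hypothetical scores and classifications (not real scDblFinder output):

```python
# Hypothetical scDblFinder-style scores and classifications (illustrative only).
cells = [
    ("cell_1", 0.02, "singlet"),
    ("cell_2", 0.10, "singlet"),
    ("cell_3", 0.35, "singlet"),
    ("cell_4", 0.80, "doublet"),
    ("cell_5", 0.95, "doublet"),
]

# Per-sample threshold: the maximum score among cells classified as singlets.
threshold = max(score for _, score, label in cells if label == "singlet")

# Cells scoring above the threshold are filtered out as doublets.
kept = [name for name, score, _ in cells if score <= threshold]
print(threshold, kept)
```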
Step 6, Data integration: The Data integration step removes batch effects and reduces the dimensionality of the data. The Harmony method of integration is applied to all datasets by default. You can choose to select a different integration method, or indeed no integration, using the ‘method’ dropdown within the Data integration settings menu.
Well-integrated data would usually show all samples distributed evenly across all clusters, which can be visualized by coloring the UMAP embedding by sample (A) or by viewing the frequency plot colored by sample (B). The elbow plot view (C) is also available, which shows the percentage contribution of each Principal Component (PC) to the total variation in the dataset and can be used to determine the optimal number of PCs to use. The default number of PCs is the number needed to explain 90% of the variation, capped at 30.
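The default PC selection rule can be sketched as a small function: accumulate the explained variance per PC until 90% is reached, and cap the result at 30 PCs. The percentages below are hypothetical:

```python
# Hypothetical per-PC explained variance percentages, descending (illustrative).
explained_pct = [22.0, 15.0, 11.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0,
                 2.5, 2.0, 1.5, 1.2, 1.0, 0.8] + [0.25] * 4  # 20 PCs total

def default_num_pcs(explained, target=90.0, cap=30):
    """Smallest number of PCs reaching the target cumulative %, capped."""
    cumulative = 0.0
    for i, pct in enumerate(explained, start=1):
        cumulative += pct
        if cumulative >= target:
            return min(i, cap)
    return cap  # target never reached: fall back to the cap

print(default_num_pcs(explained_pct))  # 90% is reached after 10 PCs here
```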
Step 7, Configure embedding: In step 7 the integrated data is further reduced into a 2-dimensional embedding, either UMAP or t-SNE. Clustering is then applied, using either the Leiden (optimal for Parse Biosciences technology) or Louvain (optimal for all other technologies) method. The default view of this step is a UMAP embedding showing either Leiden or Louvain clustering:
The violin plot option within step 7 of data processing is useful for visualizing quality control metrics across all samples in your dataset, and can help to identify outliers. Below, the number of genes is plotted in a violin plot, showing similar distribution across samples:
Examples of poorer-quality data processing plots (so-called 'red flags'), together with guidance on how to adjust data processing parameters to fit your dataset and how to troubleshoot data processing failures, are available in the following article:
How to adjust data processing settings to fit your dataset and troubleshoot data processing failures
Finally, all data processing settings can be downloaded in a text file (.txt) from the Insights Project Details page. This file can be useful when preparing the methods section for a publication.
Key links
- Access Trailmaker
- Trailmaker user guide
- Introduction to Trailmaker video
- Free Mastering Single Cell RNA-seq Data Analysis course