After single-cell RNA-seq (scRNA-seq) data has been processed, filtered, and integrated, cells are typically visualized in a 2-dimensional plot such as a UMAP or t-SNE embedding plot. Cells with similar gene expression profiles cluster together when unsupervised clustering methods like Leiden or Louvain are applied. These plotting and clustering steps can be performed and customized in Trailmaker™, within the Data Processing tab of the Insights module. The next step of data analysis is then to assign cell type annotations to those clusters. This article provides guidance on performing cell type annotation within Trailmaker, covering the two main methods: automatic and manual annotation. We’ll discuss how best to utilize and combine these methods to generate accurate cell type annotations for your dataset.
Table of Contents
- Automatic annotation
  - Background on the ScType method of annotation that’s available in Trailmaker
  - Running ScType automatic annotation on a dataset in Trailmaker
  - What to do if the tissue type or species you require is not available
  - Dealing with “unknown” cells
  - Checking the ScType annotations
  - How accurate are the ScType annotations?
- Manual annotation
- Combining automatic and manual annotation methods
- Conclusion
Automatic annotation
Background on the ScType method of annotation that’s available in Trailmaker
Trailmaker offers an automatic annotation feature which uses the ScType marker gene-based method. Running ScType on your dataset is usually a good starting point for understanding which cell types are present within your dataset.
ScType works by leveraging a curated, open-source database of known cell type markers (the database can be found in the ScTypeDB_full.xlsx file within the ScType GitHub page). Each cell type in the database is associated with a specific set of marker genes that serve as its unique "signature." The ScType marker genes database contains cell type-specific markers for human and mouse cells. ScType operates by transforming the gene expression data of each cell in a way that emphasizes the expression of cell type-specific marker genes. For each cluster, an ScType score then reflects how well the cells in that cluster match the expression profile of each cell type, and the cell type with the highest ScType score is assigned to the cluster. Further explanation of the steps involved in ScType annotation is provided in our free “Mastering Single Cell RNA-seq Data Analysis with Trailmaker” course under the lesson on “Data Exploration - part 4 - Cell type annotation.”
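The scoring idea described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the ScType implementation: the marker lists, expression values, and cluster assignments are toy data, and the sketch leaves out any marker weighting or negative markers that the real method may apply. It scales each gene across cells, sums the scaled marker expression per cell type, adds up those scores over the cells in each cluster, and assigns the highest-scoring cell type.

```python
import math
from statistics import mean, pstdev

# Toy marker database (hypothetical; the real one is ScTypeDB_full.xlsx).
MARKERS = {
    "T cells": ["CD3E", "CD3D"],
    "B cells": ["CD19", "MS4A1"],
}

# Toy normalized expression values: cell -> gene -> expression.
EXPR = {
    "c1": {"CD3E": 5.0, "CD3D": 4.0, "CD19": 0.0, "MS4A1": 0.0},
    "c2": {"CD3E": 4.0, "CD3D": 5.0, "CD19": 0.0, "MS4A1": 1.0},
    "c3": {"CD3E": 0.0, "CD3D": 0.0, "CD19": 5.0, "MS4A1": 4.0},
    "c4": {"CD3E": 0.0, "CD3D": 1.0, "CD19": 4.0, "MS4A1": 5.0},
}

# Toy cluster membership, standing in for Leiden or Louvain clusters.
CLUSTERS = {"0": ["c1", "c2"], "1": ["c3", "c4"]}


def zscore_genes(expr):
    """Scale each gene across all cells to mean 0 and unit variance."""
    genes = sorted({gene for cell in expr.values() for gene in cell})
    scaled = {cell: {} for cell in expr}
    for gene in genes:
        values = [expr[cell].get(gene, 0.0) for cell in expr]
        mu, sd = mean(values), pstdev(values)
        for cell in expr:
            raw = expr[cell].get(gene, 0.0)
            scaled[cell][gene] = (raw - mu) / sd if sd > 0 else 0.0
    return scaled


def cell_scores(scaled, markers):
    """Per cell, sum scaled marker expression for each cell type."""
    scores = {}
    for cell, genes in scaled.items():
        scores[cell] = {
            cell_type: sum(genes.get(g, 0.0) for g in gene_list)
            / math.sqrt(len(gene_list))
            for cell_type, gene_list in markers.items()
        }
    return scores


def annotate_clusters(expr, markers, clusters):
    """Sum per-cell scores within each cluster and pick the top cell type."""
    per_cell = cell_scores(zscore_genes(expr), markers)
    labels = {}
    for cluster, cells in clusters.items():
        totals = {ct: sum(per_cell[c][ct] for c in cells) for ct in markers}
        labels[cluster] = max(totals, key=totals.get)
    return labels


LABELS = annotate_clusters(EXPR, MARKERS, CLUSTERS)
print(LABELS)
```

In this toy dataset, cluster "0" is labeled as T cells and cluster "1" as B cells, because each cluster's cells express the corresponding markers above the dataset average.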
Running ScType automatic annotation on a dataset in Trailmaker
To run ScType on your dataset in Trailmaker, select ‘Annotate clusters’ from the ‘Cell sets and Metadata’ tile within the Data Exploration tab, then input your tissue type and species, and click ‘Compute’. The ScType annotations will appear in the Cell sets and Metadata tile.
It’s worth noting that ScType annotates the clusters produced by the default clustering method in Trailmaker (Leiden or Louvain). Therefore, changing the clustering resolution of the Leiden/Louvain clusters in step 7 of the Data Processing tab will increase or decrease the granularity of your ScType annotations. This allows you to select a resolution that makes sense for your dataset, based on biological knowledge of the cell types and states present and the research question you have in mind.
Note that any time that you re-run the Data Processing pipeline with changed parameters (such as filtering settings), we recommend that you also re-run the ScType annotation as the number of cells, clustering and UMAP embedding may have changed.
What to do if the tissue type or species you require is not available
The ScType options for tissue and species selection reflect the supported tissues and species in the ScType database. The authors of the ScType method publication suggest that the research community can propose new marker genes for cell type annotation. These can potentially be included in future versions of the ScType database. They recommend using GitHub’s “Pull Request” feature or directly emailing the authors. As Trailmaker utilizes the latest version of the ScType marker gene database, any updates to the database would also become available within Trailmaker.
As ScType's marker database is primarily based on human and mouse data, its performance may be less reliable when annotating clusters from other species. If you are working with a different species in Trailmaker, you can still try the “human” or “mouse” option, but the quality of the resulting annotation depends on the degree of gene similarity with human or mouse cell type markers.
Dealing with “unknown” cells
If a cluster has a low ScType score (less than a quarter of the number of cells in the cluster) or a negative ScType score, it is labeled as an "unknown" cell type. A cluster could be labeled as “unknown” for several reasons, such as the presence of transitional or intermediate cell states, rare or novel cell types not yet characterized, or even technical artifacts. By labeling these cells as "unknown", ScType avoids forcing a potentially inaccurate label on these cells.
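The labeling rule described above can be written as a short Python sketch. The function name and the example scores are hypothetical; only the rule itself (a top score that is negative, or lower than a quarter of the number of cells in the cluster, gives an "unknown" label) comes from the text.

```python
def assign_label(scores, n_cells):
    """Return the best-scoring cell type, or "unknown" if the score is too low.

    scores: mapping of cell type -> cluster-level score (toy values here).
    n_cells: number of cells in the cluster.
    """
    best_type = max(scores, key=scores.get)
    best_score = scores[best_type]
    # A negative score, or one below n_cells / 4, is treated as low confidence.
    if best_score < 0 or best_score < n_cells / 4:
        return "unknown"
    return best_type


# A 200-cell cluster needs a top score of at least 50 to keep its label.
CONFIDENT = assign_label({"T cells": 180.0, "B cells": -40.0}, n_cells=200)
AMBIGUOUS = assign_label({"T cells": 30.0, "B cells": 12.0}, n_cells=200)
print(CONFIDENT, AMBIGUOUS)
```

Because the threshold scales with cluster size, a large cluster needs a proportionally higher score than a small one before it receives a cell type label.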
If some cells or clusters within your dataset have been labeled as “unknown” by the ScType annotation, they can be manually investigated and labeled. You should first check on the embedding plot to see if the unknown cells cluster together, and therefore could contain a single cell type, or whether they are dispersed and could contain multiple cell types. If the unknown cells cluster together, one option would be to perform Differential Expression, comparing the unknown cluster to all other cells in the dataset. This will generate a list of genes that distinguish the unknown cell population from other cells in the dataset.
When ordered by descending order of log fold change (logFC), which is the default order in Trailmaker, the differential expression results provide marker genes for the unknown cluster. These marker genes can be investigated by clicking on the gene names to open the GeneCard page where you can find comprehensive information about the gene's function, associated cell types, and more. Alternatively, you can compare the differentially expressed gene list to relevant literature in your field/tissue of interest, or use the pathway analysis tool to gain insight into the function of those cells.
It’s also possible that the “unknown” cluster represents low quality cells. Indications of this would be a lack of strong marker genes for the cluster (check the differential expression results or marker heatmap in the Data Exploration tab) and/or a low number of genes or transcripts, a high proportion of mitochondrial reads, or a high doublet score (check the embedding plot in step 7 of the Data Processing tab). If low quality cells are the explanation for the unknown cluster, we recommend that you adjust the relevant filter(s) to exclude those cells from the downstream analysis.
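If you have per-cell quality metrics available outside the platform, the same check can be sketched in Python. Everything here is an assumption made for illustration: the field names, the toy values, and especially the cutoffs, which are arbitrary and should be chosen from your own dataset's distributions.

```python
from statistics import median

# Hypothetical per-cell QC values; field names are illustrative only.
CELLS = [
    {"cluster": "T cells", "n_genes": 2100, "pct_mito": 4.0, "doublet_score": 0.05},
    {"cluster": "T cells", "n_genes": 1900, "pct_mito": 5.0, "doublet_score": 0.10},
    {"cluster": "T cells", "n_genes": 2300, "pct_mito": 3.0, "doublet_score": 0.08},
    {"cluster": "unknown", "n_genes": 300, "pct_mito": 35.0, "doublet_score": 0.12},
    {"cluster": "unknown", "n_genes": 250, "pct_mito": 28.0, "doublet_score": 0.09},
    {"cluster": "unknown", "n_genes": 400, "pct_mito": 40.0, "doublet_score": 0.15},
]

METRICS = ("n_genes", "pct_mito", "doublet_score")


def qc_summary(cells):
    """Compute the median of each QC metric per cluster."""
    by_cluster = {}
    for cell in cells:
        by_cluster.setdefault(cell["cluster"], []).append(cell)
    return {
        cluster: {m: median(c[m] for c in members) for m in METRICS}
        for cluster, members in by_cluster.items()
    }


def flag_low_quality(summary, min_genes=500, max_mito=20.0, max_doublet=0.5):
    """List clusters whose medians cross any of the (arbitrary) cutoffs."""
    flagged = []
    for cluster, stats in summary.items():
        if (
            stats["n_genes"] < min_genes
            or stats["pct_mito"] > max_mito
            or stats["doublet_score"] > max_doublet
        ):
            flagged.append(cluster)
    return sorted(flagged)


SUMMARY = qc_summary(CELLS)
FLAGGED = flag_low_quality(SUMMARY)
print(FLAGGED)
```

A flagged cluster is only a prompt to look more closely; the filter settings themselves are adjusted in the Data Processing tab.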
Checking the ScType annotations
Even when the ScType method does assign a cell type annotation to your clusters, it is strongly recommended that you perform your own sanity check to ensure that the annotations applied correctly reflect the cell types present. The fastest way to do this is to use the batch differential expression tool within the Plots and Tables tab of Trailmaker’s Insights module to generate and download a full list of marker genes for the ScType assigned cell sets.
After the batch differential expression is completed and the files are downloaded and unzipped, the marker genes for each ScType cluster can be explored by opening the csv files and ordering by descending logFC.
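When there are many clusters, sorting each file by hand becomes tedious. The following Python sketch shows one way to pull the top marker genes from a results file. The column names `gene_names` and `logFC` are assumptions, so check the header row of your downloaded files; the CSV content is toy data standing in for one file.

```python
import csv
import io

# Toy stand-in for one downloaded results file; column names are assumed.
CSV_TEXT = """gene_names,logFC
CD19,-1.2
CD3E,3.4
IL7R,2.1
MALAT1,0.3
"""


def top_markers(handle, n=2, gene_col="gene_names", logfc_col="logFC"):
    """Return the n gene names with the highest logFC from an open CSV file."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row[logfc_col]), reverse=True)
    return [row[gene_col] for row in rows[:n]]


TOP = top_markers(io.StringIO(CSV_TEXT))
print(TOP)
```

For a real file, replace the `io.StringIO` object with a handle from `open(path, newline="")`, where `path` points to one of the unzipped csv files.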
Another way to check the ScType annotations is to search for known marker genes in the Data Exploration tab and then color the embedding plot with each gene. An example is shown below for CD3E, a known marker for T cells. By hovering over the cells in the embedding plot, we can see that the unknown cells in this example express CD3E so are likely to be T cells.
How accurate are the ScType annotations?
The authors of the ScType method carried out a systematic benchmarking of ScType together with other related annotation methods across 6 scRNA-seq datasets. In their benchmarking study, ScType showed a 98.6% accuracy rate.
The question of ScType accuracy within your specific dataset is a little harder to answer. The accuracy of any automatic annotation method can depend on data quality as well as the annotation method and, in the case of ScType, the gene marker database itself. Therefore, we always recommend that you perform cell type annotation of your dataset in multiple ways, such as using the automatic and manual methods in Trailmaker, and then compare and check the results.
Manual annotation
Manual cell type annotation is the process by which known marker genes are used to assign cell clusters to a specific cell type or state. Marker genes are genes that are known to be specifically expressed in certain cell types or states. For example, the CD3E gene is a well-known marker for T cells, while the CD19 gene is a marker for B cells. Manual annotation typically requires a good understanding of the cell types and states that are expected to be present in the dataset, which can depend on the species, tissue, and disease or treatment status.
In Trailmaker, the process of manual cell type annotation using marker genes is facilitated by several features in the Data Exploration tab within the Insights module of the platform.
Identify marker genes for each cluster using the heatmap
On the Data Exploration page, a heatmap shows the marker genes for the Leiden or Louvain clusters by default (depending on your clustering method selection in the Data Processing tab). Marker genes are identified using wilcoxauc, a function from the presto package in R that identifies genes that are differentially expressed between groups of cells. You can zoom in to a specific cluster of interest on the heatmap to see the gene symbols on the left side. Genes shown in yellow are highly expressed, and these highly expressed genes are marker genes for the corresponding cluster.
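To give an intuition for rank-based marker detection, the Python sketch below computes an AUC from Wilcoxon rank sums for one gene, comparing cells inside a cluster with all other cells. It is an illustration of the general technique with toy values, not a reimplementation of presto, and it does not attempt to reproduce that package's output.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_auc(in_group, out_group):
    """AUC that a random in-group cell has higher expression than an out-group cell."""
    ranks = average_ranks(list(in_group) + list(out_group))
    n_in, n_out = len(in_group), len(out_group)
    rank_sum = sum(ranks[:n_in])
    u_statistic = rank_sum - n_in * (n_in + 1) / 2
    return u_statistic / (n_in * n_out)


# Toy expression of one gene: cells in the cluster vs. all other cells.
STRONG = rank_auc([5.0, 4.0, 6.0], [0.0, 1.0, 0.0])
NEUTRAL = rank_auc([1.0, 2.0], [1.0, 2.0])
print(STRONG, NEUTRAL)
```

An AUC near 1 means the gene is consistently higher in the cluster than elsewhere, which is the behavior expected of a good marker, while an AUC near 0.5 means the gene does not separate the two groups.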
Visualize marker gene expression on the UMAP or t-SNE embedding plot
The embedding plot on the Data Exploration page shows a UMAP or t-SNE plot and the clustering according to the parameters selected in the Data Processing tab. You can search for a known marker gene of a particular cell type or state using the gene list, and then plot that gene on the embedding using the ‘eye’ icon. This can help to identify clusters that highly express the given gene.
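Outside the platform, the same visual check can be approximated numerically by computing the fraction of cells in each cluster that express a given marker. The sketch below uses toy CD3E values and hypothetical cell and cluster names; the expression threshold of zero is also an assumption.

```python
# Toy CD3E expression per cell (hypothetical values).
CD3E = {"c1": 2.1, "c2": 1.7, "c3": 0.0, "c4": 3.0, "c5": 0.0, "c6": 0.0}

# Toy cluster membership.
CLUSTERS = {"unknown": ["c1", "c2", "c3", "c4"], "B cells": ["c5", "c6"]}


def fraction_expressing(expression, clusters, threshold=0.0):
    """Fraction of cells per cluster with expression above the threshold."""
    return {
        name: sum(1 for c in cells if expression.get(c, 0.0) > threshold) / len(cells)
        for name, cells in clusters.items()
        if cells  # skip empty clusters to avoid dividing by zero
    }


FRACTIONS = fraction_expressing(CD3E, CLUSTERS)
print(FRACTIONS)
```

Here three of the four "unknown" cells express CD3E and none of the B cells do, which mirrors what coloring the embedding by CD3E would show.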
Perform differential expression analysis
Another approach to identifying relevant marker genes for a specific cluster is to perform a differential expression analysis between the cluster of interest and all the other cells in the dataset. The differential expression option within the Data Exploration tab or the batch differential expression tool within the Plots and Tables tab can be used, as outlined in the sections above.
Performing cell type annotation outside of Trailmaker
The flexibility of Trailmaker’s import and export options means that it is possible to perform cell type annotation outside of the platform. To do this, simply download the Seurat object from the Insights project details page, perform your annotation in R, and then re-import the Seurat object to the Trailmaker Insights module. This option allows you to take advantage of other automatic or manual methods of annotation.
Combining automatic and manual annotation methods
For some datasets where the automatic annotation has worked well for some but not all clusters, you might want to use a combination of the automatic and manual methods. To do this, we recommend that you make use of the ‘Custom cell sets’ family.
Copying ScType annotated cell sets to ‘Custom cell sets’
For clusters where the ScType annotation has correctly assigned the cell type, those clusters can be copied over to the ‘Custom cell sets’ family of clusters. To do this, select one ScType cluster that you want to copy, and then click the second of the button options that appears, which can combine or, in this case, copy the selected cluster to the Custom cell sets family. This process can be repeated to copy over multiple ScType annotated clusters.
Adding further Custom cell sets
More Custom cell sets can be created using the lasso selection tool on the embedding plot (A), by copying or combining from Leiden (or Louvain) clusters (B), creating ‘complement’ clusters (C), or based on the expression of one or more genes (D).
These different approaches for generating Custom cell sets can be combined to create an optimal family of clusters. The goal is a set of annotations that correctly identifies the cell types and states within your dataset, at a level of detail that makes sense for the tissue type and supports the biological questions you want to answer.
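Conceptually, combining these approaches amounts to set operations on cell identifiers. The Python sketch below illustrates that logic with toy data. It does not describe how Trailmaker stores cell sets, and the cell names, cluster contents, and helper functions are all hypothetical.

```python
ALL_CELLS = {"c1", "c2", "c3", "c4", "c5", "c6"}

# Toy ScType result: two annotated clusters and one "unknown" cluster.
SCTYPE = {
    "T cells": {"c1", "c2"},
    "B cells": {"c3"},
    "unknown": {"c4", "c5", "c6"},
}

# Toy CD3E expression per cell.
CD3E = {"c1": 3.2, "c2": 2.8, "c3": 0.0, "c4": 2.5, "c5": 0.0, "c6": 1.9}


def cells_expressing(expression, threshold=0.0):
    """Cell set based on the expression of one gene."""
    return {cell for cell, value in expression.items() if value > threshold}


def complement(all_cells, *cell_sets):
    """Cells that belong to none of the given cell sets."""
    return all_cells - set().union(*cell_sets)


# Copy the trusted ScType sets, then add unknown cells that express CD3E.
CUSTOM = {
    "T cells": SCTYPE["T cells"] | (SCTYPE["unknown"] & cells_expressing(CD3E)),
    "B cells": set(SCTYPE["B cells"]),
}
# Whatever is left over becomes its own set for further investigation.
CUSTOM["Unassigned"] = complement(ALL_CELLS, CUSTOM["T cells"], CUSTOM["B cells"])
print({name: sorted(cells) for name, cells in CUSTOM.items()})
```

In this toy example, two of the three unknown cells join the T cell set based on CD3E expression, and the remaining cell ends up in the complement set.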
Conclusion
Trailmaker offers flexibility in methods for performing cell type annotation. To dive deeper into different annotation methods and to understand the advantages and disadvantages of each, we recommend signing up to our free data analysis course, and in particular exploring the lesson entitled “Data Exploration - part 4 - Cell type annotation.”