How to annotate cell types in Trailmaker – Support Suite - Parse Biosciences

After single cell RNA-seq (scRNA-seq) data has been processed, filtered and integrated, cells are typically visualized in a 2-dimensional plot such as a UMAP or t-SNE embedding plot. Cells with similar gene expression profiles cluster together when unsupervised clustering methods like Leiden or Louvain are applied. These plotting and clustering steps can be performed and customized in Trailmaker™, within the Data Processing tab of the Insights module. The next step of data analysis is then to assign cell type annotations to those clusters. This article provides guidance on performing cell type annotation within Trailmaker, covering the two main approaches of automatic and manual annotation. We’ll discuss the available methods in Trailmaker and how to best utilize and combine automatic and manual approaches to generate accurate cell type annotations for your dataset.

Automatic annotation
Manual annotation
Combining automatic and manual annotation methods
- Copying cell sets from automatic annotations to Custom cell sets
- Adding further Custom cell sets
Conclusion

Automatic annotation

Background on automatic annotation methods in Trailmaker

Trailmaker offers an automatic annotation feature to help identify the cell types present in your dataset. The automatic annotation methods on offer depend on whether your project uses the Seurat or Scanpy workflow (see Data Processing step 6 for details). Seurat projects offer the ScType marker gene-based method, while Scanpy projects offer the Decoupler signature enrichment-based method and the CellTypist machine learning-based method. Running automatic annotation is usually a good starting point for understanding which cell types are present within your dataset.

ScType works by leveraging a curated open source database of known cell type markers (the database can be found in the ScTypeDB_full.xlsx file on the ScType GitHub page). Each cell type in the database is associated with a specific set of marker genes that serve as its unique "signature." The ScType marker genes database contains cell-specific markers for human and mouse cells. ScType operates by transforming the gene expression data of each cell in a way that emphasizes the expression of cell type-specific marker genes. An ScType score then reflects how well the cells in a cluster match the expression profile of each cell type, and the cell type with the highest ScType score is assigned to that cluster. Further explanation of the steps involved in ScType annotation is provided in our free “Mastering Single Cell RNA-seq Data Analysis with Trailmaker” course under the lesson on “Data Exploration - part 4 - Cell type annotation.”
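The idea of marker-based scoring can be illustrated with a short Python sketch. This is a simplified illustration, not the ScType implementation: it z-scores each gene across cells, adds up the scaled marker expression per cell, sums those values per cluster, and assigns the highest-scoring cell type. The real method also weights markers by specificity and handles negative markers, which are omitted here. The expression values, cluster labels and marker sets are made-up toy data.

```python
from math import sqrt
from statistics import mean, pstdev

# Toy data (hypothetical): gene -> one expression value per cell.
expression = {
    "CD3E": [5.0, 4.0, 6.0, 0.0, 0.0, 1.0],
    "CD19": [0.0, 1.0, 0.0, 5.0, 6.0, 4.0],
    "MS4A1": [0.0, 0.0, 1.0, 4.0, 5.0, 6.0],
}
clusters = ["0", "0", "0", "1", "1", "1"]  # cluster label of each cell
markers = {"T cells": ["CD3E"], "B cells": ["CD19", "MS4A1"]}


def zscores(values):
    """Scale one gene across all cells so that marker genes are comparable."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


scaled = {gene: zscores(values) for gene, values in expression.items()}


def cell_scores(genes):
    """Per-cell score: sum of scaled marker expression / sqrt(number of markers)."""
    present = [g for g in genes if g in scaled]
    if not present:
        return [0.0] * len(clusters)
    return [
        sum(scaled[g][i] for g in present) / sqrt(len(present))
        for i in range(len(clusters))
    ]


def annotate(marker_sets):
    """Sum per-cell scores within each cluster and keep the best cell type."""
    per_type = {ct: cell_scores(genes) for ct, genes in marker_sets.items()}
    result = {}
    for cluster in sorted(set(clusters)):
        cells = [i for i, c in enumerate(clusters) if c == cluster]
        totals = {ct: sum(s[i] for i in cells) for ct, s in per_type.items()}
        result[cluster] = max(totals, key=totals.get)
    return result


labels = annotate(markers)
print(labels)
```

Because scores are summed per cluster, the result changes when the clustering changes, which is why the clustering resolution affects the granularity of cluster-based annotations.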

Decoupler infers functional enrichment scores using Over-Representation Analysis (ORA) on scRNA-seq data given cell-type marker gene sets from PanglaoDB. For each cluster, Decoupler tests enrichment of those markers and calculates Activity-by-Cluster (ACT) scores based on the ORA results. ACT scores are then scaled for visualization and used to rank cell-type enrichments, so that each cluster can be annotated with its most significantly enriched cell type.
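As a rough illustration of over-representation analysis, the sketch below computes a one-sided hypergeometric p-value for the overlap between a cluster's top genes and each marker set, then ranks cell types by that p-value. It is a minimal sketch of the general ORA idea under assumed toy inputs, not Decoupler's implementation or its ACT scoring. The gene universe, the marker sets and the list of top genes are hypothetical.

```python
from math import comb


def ora_pvalue(n_universe, n_markers, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from a universe that contains n_markers marker genes."""
    total = comb(n_universe, n_selected)
    upper = min(n_markers, n_selected)
    tail = sum(
        comb(n_markers, k) * comb(n_universe - n_markers, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical inputs: 20 measured genes, 4 top genes for one cluster.
universe = {f"GENE{i}" for i in range(14)} | {
    "CD3E", "CD3D", "CD2", "CD19", "MS4A1", "CD79A"
}
top_genes = {"CD3E", "CD3D", "CD2", "GENE0"}
marker_sets = {
    "T cells": {"CD3E", "CD3D", "CD2"},
    "B cells": {"CD19", "MS4A1", "CD79A"},
}

pvalues = {}
for cell_type, genes in marker_sets.items():
    measured = genes & universe  # only markers that were actually measured
    overlap = len(measured & top_genes)
    pvalues[cell_type] = ora_pvalue(
        len(universe), len(measured), len(top_genes), overlap
    )

# The most significantly enriched cell type has the smallest p-value.
best = min(pvalues, key=pvalues.get)
print(best, pvalues)
```

Note that a ranking of this kind always produces a best-scoring cell type, even when no marker set fits well, so the top hit still needs checking against marker expression.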

CellTypist uses logistic-regression classifiers trained on curated reference datasets to predict cell types from gene expression. It offers pre-trained models and returns predicted labels for each cell. In Trailmaker, CellTypist returns a single predicted label per cell, while probabilities are not shown. As with any reference-based method, accuracy depends on how well your data match the training reference, so it’s good practice to sanity-check labels with marker-gene expression or differential expression.

Running automatic annotation on a dataset in Trailmaker

To run automatic annotation on your dataset in Trailmaker, select ‘Annotate clusters’ from the ‘Cell sets and Metadata’ tile within the Data Exploration tab:

The available options will depend on whether your project is Seurat or Scanpy. Seurat projects will have the ScType option available (left), while Scanpy projects will have the Decoupler and CellTypist options available (right):

To compute the automatic annotations, complete the species and tissue selection and, if relevant, the model dataset (CellTypist only), and click ‘Compute’. The automatic annotations will be computed, and will then appear in the Cell sets and Metadata tile.

It’s worth noting that ScType and Decoupler are based on the default clustering method in Trailmaker (Leiden or Louvain). Therefore, changing the clustering resolution of Leiden/Louvain clusters in step 7 of the Data Processing tab will enable you to increase or decrease the granularity of your ScType or Decoupler annotations. This should allow you to select a resolution that makes sense for your dataset, based on biological knowledge of the cell types and states present and the research question you have in mind. Changing the default clustering resolution will not impact CellTypist annotation granularity.

Note that any time that you re-run the Data Processing pipeline with changed parameters (such as filtering settings), we recommend that you also re-run the automatic annotation(s) as the number of cells, clustering and UMAP embedding may have changed.

What to do if the tissue type or species you require is not available

ScType and Decoupler

The ScType options for tissue and species selection reflect the supported tissues and species in the ScType database. The authors of the ScType method publication suggest that the research community can propose new marker genes for cell type annotation. These can potentially be included in future versions of the ScType database. They recommend using GitHub’s “Pull Request” feature or directly emailing the authors. As Trailmaker utilizes the latest version of the ScType marker gene database, any updates to the database would also become available within Trailmaker in due course.

Similarly, Decoupler utilizes the Panglao database of cell type markers. New markers can be added to the list through this submission form.

As both the ScType and Decoupler marker databases are based on human and mouse data, the performance of these methods may be less reliable when annotating clusters from other species. If you are working with a different species in Trailmaker, you can still try the “human” or “mouse” option. The quality of the resulting annotation depends on the degree of gene similarity with human or mouse cell type markers.

CellTypist

Details of the CellTypist models are available on the CellTypist Model list. Instructions on how to train and contribute your own model as reference for the CellTypist tool are available on the CellTypist website.

Dealing with “unknown” cells

In some cases, the automatic annotation methods can return “unknown” cells:

- If a cluster has a low ScType score (less than a quarter of the number of cells in the cluster) or a negative ScType score, it is labeled as an "unknown" cell type.
- CellTypist in Trailmaker uses the ‘best match’ approach, which can in some cases return “heterogeneous” cells, which are equivalent to unknown. If the model itself contains "unknown" cells, the "unknown" label can also be returned.
- Decoupler does not automatically return “unknown” cell types. Therefore, it’s especially important to check that the annotations are correct.
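The ScType rule described above can be written as a small Python function. This is a sketch of the stated rule only; the function name and the example scores are hypothetical.

```python
def final_label(best_type, score, n_cells):
    """Return "unknown" when the winning score is negative or lower than
    a quarter of the number of cells in the cluster."""
    if score < 0 or score < n_cells / 4:
        return "unknown"
    return best_type


confident = final_label("T cells", 120.0, 200)  # 120 >= 200 / 4, label kept
low_score = final_label("B cells", 30.0, 200)   # 30 < 200 / 4
negative = final_label("NK cells", -5.0, 10)    # negative score
print(confident, low_score, negative)
```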

A cluster could be labeled as “unknown” for several reasons, such as the presence of transitional or intermediate cell states, rare or novel cell types not yet characterized, or even technical artifacts. By labeling these cells as "unknown", the automatic annotation tool avoids forcing a potentially inaccurate label on these cells.

If some cells or clusters within your dataset have been labeled as “unknown” by an automatic annotation method, they can be manually investigated and labeled. You should first check on the embedding plot to see if the unknown cells cluster together, and therefore could contain a single cell type, or whether they are dispersed and could contain multiple cell types. If the unknown cells cluster together, one option would be to perform Differential Expression, comparing the unknown cluster to all other cells in the dataset. This will generate a list of genes that distinguish the unknown cell population from other cells in the dataset.

When sorted by descending log fold change (logFC), which is the default order in Trailmaker, the differential expression results provide marker genes for the unknown cluster. These marker genes can be investigated by clicking on a gene name to open its GeneCards page, where you can find comprehensive information about the gene's function, associated cell types, and more. Alternatively, you can compare the differentially expressed gene list to relevant literature in your field/tissue of interest, or use the pathway analysis tool to gain insight into the function of those cells.

It’s also possible that the “unknown” cluster represents low quality cells. Indications of this would be a lack of strong marker genes for the cluster (check the differential expression results or marker heatmap in the Data Exploration tab) and/or a low number of genes or transcripts, a high proportion of mitochondrial reads or a high doublet score (check the embedding plot in step 7 of the Data Processing tab). If low quality cells are the explanation for the unknown cluster, we recommend that you adjust the relevant filter(s) in order to exclude those cells from the downstream analysis.

Checking the automatic annotations

Even when an automatic annotation method does assign a cell type annotation to your clusters, it is strongly recommended that you perform your own sanity check to ensure that the annotations applied correctly reflect the cell types present. The fastest way to do this is to use the batch differential expression tool within the Plots and Tables tab of Trailmaker’s Insights module to generate and download a full list of marker genes for the cell types assigned by your selected automatic annotation method.

After the batch differential expression is completed and the files are downloaded and unzipped, the marker genes for each annotated cluster can be explored by opening the csv files and ordering by descending logFC.
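If you prefer to inspect the downloaded files programmatically, a short Python sketch along these lines could sort a results file by descending logFC. The column names "gene" and "logFC" and the example rows are assumptions for illustration; check the header of your own csv files and adjust the names accordingly.

```python
import csv
import io


def top_markers(csv_file, n=10, logfc_column="logFC", gene_column="gene"):
    """Return the n gene names with the highest log fold change.

    Column names are assumptions; pass the ones used in your own files.
    """
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row[logfc_column]), reverse=True)
    return [row[gene_column] for row in rows[:n]]


# In-memory stand-in for one unzipped results file (made-up values).
example = io.StringIO(
    "gene,logFC,p_val_adj\n"
    "CD19,-1.2,0.01\n"
    "CD3E,2.5,0.0001\n"
    "IL7R,1.8,0.001\n"
    "LYZ,-2.0,0.002\n"
)
top = top_markers(example, n=2)
print(top)
```

For a real file, the same function can be called with an open file handle, for example `with open(path, newline="") as handle: top_markers(handle)`, where `path` points to one of the unzipped csv files.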

Another way to check the automatic annotations is to search for known marker genes in the Data Exploration tab and then color the embedding plot with each gene. An example is shown below for CD3E, a known marker for T cells. By hovering over the cells in the embedding plot, we can see that the unknown cells in this example express CD3E so are likely to be T cells.

How accurate are the automatic annotations?

How accurate an automatic annotation method will be on your specific dataset is hard to state in general. It depends on the quality of your uploaded data and on the quality of the annotation method itself, including the marker gene databases (for ScType and Decoupler) and the reference models (for CellTypist). Therefore, we always recommend that you perform cell type annotation of your dataset in multiple ways, such as using the automatic and manual methods in Trailmaker, and then compare and check the results.

Manual annotation

Manual cell type annotation is the process by which known marker genes are used to assign cell clusters to a specific cell type or state. Marker genes are genes that are known to be specifically expressed in certain cell types or states. For example, the CD3E gene is a well-known marker for T cells, while the CD19 gene is a marker for B cells. Manual annotation typically requires a good understanding of the cell types and states that are expected to be present in the dataset, which can depend on the species, tissue, and disease or treatment status.

In Trailmaker, the process of manual cell type annotation using marker genes is facilitated by several features in the Data Exploration tab within the Insights module of the platform.

Identify marker genes for each cluster using the heatmap

On the Data Exploration page, a heatmap shows the marker genes for Leiden or Louvain clustering by default (depending on your clustering method selection in the Data Processing tab). Marker genes are calculated using a Wilcoxon rank-sum test (wilcoxauc from the presto package for Seurat projects, or sc.tl.rank_genes_groups for Scanpy projects), which identifies genes that are differentially expressed between different groups of cells. You can zoom in to a specific cluster of interest on the heatmap to see the gene symbols on the left side. Genes shown in yellow are highly expressed, and these highly expressed genes are marker genes for the corresponding cluster.
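To give an intuition for the rank-sum approach, the sketch below ranks all cells by the expression of one gene and rescales the rank-sum statistic to an AUC between 0 and 1. A value near 1 means the gene is almost always higher inside the cluster than outside, and a value near 0.5 means no separation. This is a plain Python illustration with made-up values, not the presto or Scanpy code, and it omits p-values and multiple-testing correction.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def marker_auc(values, in_cluster):
    """Mann-Whitney U statistic for the cluster, rescaled to an AUC."""
    ranks = average_ranks(values)
    n_in = sum(in_cluster)
    n_out = len(values) - n_in
    rank_sum = sum(r for r, flag in zip(ranks, in_cluster) if flag)
    u = rank_sum - n_in * (n_in + 1) / 2
    return u / (n_in * n_out)


in_cluster = [True, True, True, False, False, False]
auc_marker = marker_auc([5.0, 4.0, 6.0, 0.0, 0.0, 1.0], in_cluster)
auc_flat = marker_auc([1.0, 2.0, 3.0, 1.0, 2.0, 3.0], in_cluster)
print(auc_marker, auc_flat)
```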

Visualize marker gene expression on the UMAP or t-SNE embedding plot

The embedding plot on the Data Exploration page shows a UMAP or t-SNE plot and the clustering according to the parameters selected in the Data Processing tab. You can search for a known marker gene of a particular cell type or state using the gene list, and then plot that gene on the embedding using the ‘eye’ icon. This can help to identify clusters that highly express the given gene.

Perform differential expression analysis

Another approach to identifying relevant marker genes for a specific cluster is to perform a differential expression analysis between a cluster of interest and all the other cells in the dataset. Either the differential expression option within the Data Exploration tab or the batch differential expression tool within the Plots and Tables tab can be used, as outlined in the sections above.

Performing cell type annotation outside of Trailmaker

The flexibility of Trailmaker's import and export options means that it is possible to perform cell type annotation outside of the platform. To do this, simply download the Seurat (for R projects) or AnnData (for Scanpy projects) object from the Insights project details page and perform your annotation in R or Python. If using Seurat in R, the Seurat object can then be re-uploaded to the Trailmaker Insights module for further exploration. This option allows you to take advantage of other automatic or manual methods of annotation.

Combining automatic and manual annotation methods

For some datasets where the automatic annotation has worked well for some but not all clusters, you might want to use a combination of the automatic and manual methods. To do this, we recommend that you make use of the ‘Custom cell sets’ family.

Copying cell sets from automatic annotations to Custom cell sets

For clusters where the automatic annotation has correctly assigned the cell type, those clusters can be copied over to the ‘Custom cell sets’ family of clusters. To do this, select one annotated cluster that you want to copy, and then click the second of the button options that appears, which can combine or, in this case, copy the selected cluster to the Custom cell sets family. This process can be repeated to copy over multiple annotated clusters.

Adding further Custom cell sets

More Custom cell sets can be created using the lasso selection tool on the embedding plot (A), by copying or combining from Leiden (or Louvain) clusters (B), creating ‘complement’ clusters (C), or based on the expression of one or more genes (D).

These different approaches for generating Custom cell sets can be combined to create a family of clusters that correctly identifies and annotates the cell types and states within your dataset, at a level of detail that makes sense for the tissue type and supports the biological questions you want to answer.
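Conceptually, the combine, complement and gene-based options for Custom cell sets behave like set operations on cell identifiers. The Python sketch below illustrates that idea with made-up cell IDs, cluster memberships and expression values. It is not a description of how Trailmaker stores cell sets internally, and the expression threshold is an arbitrary example.

```python
# Hypothetical cell identifiers and Leiden cluster memberships.
all_cells = {"c1", "c2", "c3", "c4", "c5", "c6"}
leiden = {
    "0": {"c1", "c2"},
    "1": {"c3", "c4"},
    "2": {"c5", "c6"},
}
# Hypothetical CD3E expression value for each cell.
cd3e = {"c1": 3.2, "c2": 0.0, "c3": 2.1, "c4": 0.0, "c5": 0.0, "c6": 1.5}

# Combining two clusters corresponds to a set union.
combined = leiden["0"] | leiden["1"]

# A complement cell set holds every cell that is not in the selection.
complement = all_cells - combined

# A gene-based cell set keeps cells above a chosen expression threshold.
cd3e_positive = {cell for cell, value in cd3e.items() if value > 1.0}

print(sorted(combined), sorted(complement), sorted(cd3e_positive))
```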

Conclusion

Trailmaker offers flexibility in approaches for performing cell type annotation as well as three specific methods of automatic annotation. With the guidance outlined in this article, you should be able to accurately annotate your cell types in Trailmaker.
