Guided walkthrough: Insights Data Exploration – Support Suite - Parse Biosciences

The Data Exploration page of Trailmaker™ has a wide variety of features for in-depth exploration of your data. Using this module, users can identify which cell types are represented by their cell sets, fully customize cell set selection, and generate insight into the dataset using gene expression visualization, differential expression and pathway analysis.

A full explanation of all features and options within Data Exploration is available in the user guide. The goal of this article is to provide a guided walkthrough of how you might use features within the Data Exploration page to gain biological insights into your dataset. This walkthrough uses the “Human PBMCs - Evercode v3” dataset that’s available in Trailmaker’s datasets repository.

The default view of the Data Exploration page is shown below:

This view consists of four tiles:

1. The UMAP or t-SNE embedding is colored by the default clustering method (either Leiden or Louvain), as determined in Data Processing.

2. The Cell sets and Metadata tile lists the default clusters (either Leiden or Louvain), samples, and metadata groups, together with any custom cell sets that have been created and any automatic annotation that has been performed.

3. The gene list shows the full list of genes present in the dataset ordered by dispersion, which is a measure of variability in gene expression across the dataset.

4. The heatmap shows marker genes for your selected clustering method (Leiden or Louvain), which have been calculated using a Wilcoxon rank-sum test (wilcoxauc from the presto package in Seurat or sc.tl.rank_genes_groups in Scanpy). The number of genes shown per cluster varies depending on how many clusters you have in your dataset.
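As a rough illustration of tiles 3 and 4, the sketch below ranks genes by a dispersion measure and scores a candidate marker gene with a rank-sum statistic. It rests on several assumptions: dispersion is taken to be the variance-to-mean ratio (this article does not give Trailmaker's exact formula), the marker score is the AUC derived from the Wilcoxon rank-sum (Mann-Whitney U) statistic, and all function names and toy values are hypothetical. It is not the presto or Scanpy implementation.

```python
from statistics import mean, pvariance


def dispersion(values):
    """Variance-to-mean ratio (an assumed definition); 0.0 for a silent gene."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


def rank_genes_by_dispersion(expression):
    """Return gene names ordered from most to least variable."""
    return sorted(expression, key=lambda g: dispersion(expression[g]), reverse=True)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank for positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_auc(in_cluster, out_cluster):
    """AUC = U / (n1 * n2), where U comes from the rank sum of the cluster."""
    n1, n2 = len(in_cluster), len(out_cluster)
    ranks = average_ranks(list(in_cluster) + list(out_cluster))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    return u1 / (n1 * n2)


# Toy expression values for four cells (hypothetical numbers).
expression = {
    "LYZ": [9.0, 8.0, 0.0, 1.0],
    "ACTB": [5.0, 5.0, 5.0, 5.0],
    "CD3E": [0.0, 1.0, 4.0, 3.0],
}
gene_order = rank_genes_by_dispersion(expression)

# Cells 1-2 form the cluster of interest; cells 3-4 are everything else.
lyz_auc = rank_sum_auc(in_cluster=[9.0, 8.0], out_cluster=[0.0, 1.0])
```

An AUC close to 1 means the gene is consistently higher inside the cluster than outside it, an AUC near 0.5 means no separation, and an AUC close to 0 means the gene is consistently lower inside the cluster.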

Cells with similar gene expression profiles cluster together when unsupervised clustering methods like Leiden or Louvain are applied. One of the first tasks you are likely to want to perform in the Data Exploration page is to understand what cell types are represented by the clusters. This step is typically referred to as cell type annotation. Our support article How to annotate cell types in Trailmaker provides detailed information on how to annotate cell sets using the different methods available in Trailmaker. In this brief overview, we’ll discuss only two annotation methods in Trailmaker.

Trailmaker offers automatic annotation within the Cell sets and Metadata tile. The automatic annotation methods on offer depend on whether your project uses the Seurat or Scanpy workflow (see Data Processing step 6 for details). Seurat projects offer the ScType marker gene-based method, while Scanpy projects offer the Decoupler signature enrichment-based method and the CellTypist machine learning-based method.

Running automatic annotation on your dataset is usually a good starting point for understanding which cell types are present within your dataset. Automatic annotation can be computed within the ‘Annotate clusters’ tab of the Cell sets and Metadata tile. In the Seurat project example below, ScType is performed by selecting the relevant tissue and species. The automatic annotation results appear in the Cell sets tab of the same tile:
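To give a feel for what a marker gene-based method does, here is a deliberately simplified sketch. It is not the ScType algorithm: it only averages the expression of each cell type's markers per cluster and picks the highest-scoring type. The marker lists, cluster names, expression values, and the function `annotate_clusters` are all illustrative assumptions.

```python
def annotate_clusters(cluster_means, marker_sets):
    """Label each cluster with the cell type whose markers score highest.

    cluster_means: {cluster: {gene: mean expression in that cluster}}
    marker_sets:   {cell type: [marker genes]}
    """
    labels = {}
    for cluster, means in cluster_means.items():
        scores = {
            cell_type: sum(means.get(gene, 0.0) for gene in genes) / len(genes)
            for cell_type, genes in marker_sets.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels


# Illustrative marker sets and made-up per-cluster mean expression values.
marker_sets = {
    "Classical Monocytes": ["LYZ", "CD14"],
    "T cells": ["CD3E", "CD3D"],
    "B cells": ["MS4A1", "CD79A"],
}
cluster_means = {
    "Cluster 0": {"LYZ": 6.1, "CD14": 3.2, "CD3E": 0.1, "MS4A1": 0.0},
    "Cluster 1": {"LYZ": 0.4, "CD3E": 2.9, "CD3D": 2.5},
    "Cluster 2": {"MS4A1": 2.2, "CD79A": 3.0, "LYZ": 0.3},
}
labels = annotate_clusters(cluster_means, marker_sets)
```

Real methods also weigh marker specificity and negative markers, which is one reason automatic labels are best treated as a starting point and then validated.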

Cell types can be validated using the marker heatmap, for example by zooming in and/or hovering over a specific cell set to find the names of the marker genes for that cell set. In the example below, zooming into the marker heatmap reveals LYZ as a marker gene for the pink cluster. ScType automatic annotation labeled this cluster as Classical Monocytes, which makes sense because LYZ is a Classical Monocyte marker.

Alternatively, cell types can be validated using the differential expression tool in the Genes tile by selecting the first toggle option, which compares cell sets within a sample/group. This selection generates a full list of marker genes for the selected cell set, as shown here for the population that was annotated using ScType automatic annotation as Classical Monocytes:

Note that the Batch Differential Expression Table in the Plots and Tables page is useful for computing differentially expressed gene lists in bulk, for example, for all cell sets of a given family (Leiden, Louvain or automatic annotation cell sets).

Expression levels of individual genes can be viewed on the UMAP plot, such as the Classical Monocyte marker, LYZ:

The Custom cell sets family is another useful way to store and potentially finalize your annotated cell types. There are multiple ways to generate a Custom cell set, including:

Using the lasso tool to select a population of cells on the UMAP (A)
Copying or combining cell sets from other families such as the default Leiden or Louvain, or any of the automatic annotation cell sets (B)
Creating ‘complement’ clusters (C)
Selecting cells based on the expression of one or more genes (D)

Further instructions on how to use these tools are provided in the user guide and in the cell type annotation article.
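Conceptually, the four routes (A to D) listed above all produce a set of cell identifiers. The sketch below mirrors each one with plain Python sets. It assumes that a lasso is a polygon tested with a standard ray-casting check and that expression-based selection is a simple threshold; the cell IDs, coordinates, threshold, and function names are hypothetical and do not describe how Trailmaker implements these tools.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lasso_select(embedding, polygon):
    """(A) Cells whose embedding coordinates fall inside the lasso polygon."""
    return {c for c, (x, y) in embedding.items() if point_in_polygon(x, y, polygon)}


def complement(all_cells, cell_set):
    """(C) Every cell that is not in the given cell set."""
    return set(all_cells) - set(cell_set)


def select_by_expression(expression, gene, threshold):
    """(D) Cells expressing a gene above a threshold."""
    return {c for c, value in expression[gene].items() if value > threshold}


# Toy embedding coordinates and expression values (made up).
embedding = {"c1": (1.0, 1.0), "c2": (1.5, 0.5), "c3": (5.0, 5.0), "c4": (6.0, 4.0)}
expression = {"LYZ": {"c1": 7.2, "c2": 6.8, "c3": 0.0, "c4": 0.3}}
lasso = [(0, 0), (2, 0), (2, 2), (0, 2)]

lassoed = lasso_select(embedding, lasso)                 # (A)
combined = {"c1"} | {"c2"}                               # (B) union of two sets
rest = complement(embedding, lassoed)                    # (C)
lyz_high = select_by_expression(expression, "LYZ", 1.0)  # (D)
```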

Once your cell types are annotated and validated within Trailmaker, the next step is usually to explore gene expression differences in populations of interest. In the example below, we’re comparing the Classical Monocyte population between our metadata groups F (female) and M (male). In a real study, the metadata groups of interest might be Control versus Treated or Healthy versus Disease. Note that the resulting list of differentially expressed genes can be re-ordered by ascending or descending log fold change (logFC) or adjusted p-value (adj p-value).
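For readers unfamiliar with adjusted p-values, the sketch below shows one common multiple-testing correction, Benjamini-Hochberg, together with the kind of re-ordering described above. This article does not state which correction Trailmaker applies, so the choice of Benjamini-Hochberg is an assumption, and the gene names, column names, and values are placeholders.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def sort_de_table(rows, key="logFC", descending=True):
    """Re-order a differential expression table by one of its columns."""
    return sorted(rows, key=lambda row: row[key], reverse=descending)


# Placeholder results for three genes (not real data).
rows = [
    {"gene": "geneA", "logFC": 2.0, "p": 0.01},
    {"gene": "geneB", "logFC": -1.5, "p": 0.04},
    {"gene": "geneC", "logFC": 0.2, "p": 0.03},
]
adj = benjamini_hochberg([row["p"] for row in rows])
for row, value in zip(rows, adj):
    row["adj_p"] = value

by_logfc = sort_de_table(rows, key="logFC", descending=True)
by_adj_p = sort_de_table(rows, key="adj_p", descending=False)
```

Sorting by descending logFC surfaces the genes most up-regulated in the first group, while sorting by ascending adjusted p-value surfaces the most statistically robust differences regardless of direction.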

Multiple differentially expressed genes can be selected and plotted on the heatmap. In the example below, the top 12 differentially expressed genes comparing F to M in Classical Monocytes have been plotted. The heatmap metadata tracks have been updated to show Sex and automatic annotations, and the heatmap has been grouped by Sex. This can be a very effective way of visualizing differentially expressed genes:

Finally, lists of differentially expressed genes can be sent for pathway enrichment analysis. Pathway analysis identifies biological pathways that appear in the differentially expressed gene list more often than would be expected by chance. The goal is to give the genes that differ between phenotypes a biological context by condensing a potentially long gene list into a few select biological pathways.
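The idea of "more than would be expected by chance" can be made concrete with a generic over-representation test. The sketch below uses a one-sided hypergeometric test, which is a textbook approach and an assumption here: external services apply their own statistics and gene set libraries. The gene identifiers and the function `enrichment_p_value` are hypothetical.

```python
from math import comb


def enrichment_p_value(de_genes, pathway_genes, background_genes):
    """P(overlap >= observed) when drawing the DE genes at random
    from the background without replacement (hypergeometric tail)."""
    background = set(background_genes)
    de = set(de_genes) & background
    pathway = set(pathway_genes) & background
    observed = len(de & pathway)
    n_bg, n_path, n_de = len(background), len(pathway), len(de)
    tail = sum(
        comb(n_path, k) * comb(n_bg - n_path, n_de - k)
        for k in range(observed, min(n_path, n_de) + 1)
    )
    return tail / comb(n_bg, n_de)


# Toy universe of ten genes; the pathway contains five of them.
background = [f"g{i}" for i in range(1, 11)]
pathway = ["g1", "g2", "g3", "g4", "g5"]

p_enriched = enrichment_p_value(["g1", "g2"], pathway, background)
p_not_enriched = enrichment_p_value(["g6", "g7"], pathway, background)
```

A small p-value indicates that the gene list overlaps the pathway more than random sampling from the background would typically produce. When many pathways are tested, the resulting p-values also need a multiple-testing correction.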

To perform pathway enrichment analysis in Trailmaker, we recommend first using the Advanced Filtering function to select a subset of differentially expressed genes, such as only the most significant genes or only the up- or down-regulated genes. The filtered gene list can then be sent to the external services PantherDB or Enrichr for enrichment analysis.
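The filtering step can be pictured as a few simple conditions on the differential expression table. The sketch below is only an analogy for what Advanced Filtering does in the interface: the column names (`logFC`, `adj_p`), the thresholds, and the function `filter_de_genes` are assumptions chosen for illustration, and the rows are placeholder values.

```python
def filter_de_genes(rows, max_adj_p=0.05, min_abs_logfc=1.0, direction="both"):
    """Keep gene names passing significance, effect size, and direction filters."""
    if direction not in {"up", "down", "both"}:
        raise ValueError("direction must be 'up', 'down', or 'both'")
    kept = []
    for row in rows:
        if row["adj_p"] > max_adj_p or abs(row["logFC"]) < min_abs_logfc:
            continue
        if direction == "up" and row["logFC"] <= 0:
            continue
        if direction == "down" and row["logFC"] >= 0:
            continue
        kept.append(row["gene"])
    return kept


# Placeholder differential expression results (not real data).
rows = [
    {"gene": "geneA", "logFC": 2.0, "adj_p": 0.001},
    {"gene": "geneB", "logFC": -1.5, "adj_p": 0.010},
    {"gene": "geneC", "logFC": 0.2, "adj_p": 0.020},
    {"gene": "geneD", "logFC": 3.1, "adj_p": 0.300},
]
up_genes = filter_de_genes(rows, direction="up")
down_genes = filter_de_genes(rows, direction="down")

# One gene per line is a convenient plain-text form for a gene list.
gene_list_text = "\n".join(up_genes)
```

Analyzing up- and down-regulated genes separately often gives cleaner enrichment results than mixing both directions in a single list.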

By the end of your work in the Data Exploration page, you should have identified and annotated the cell types, and hopefully gained some interesting biological insight into your dataset!

Key links
