Understanding the Seurat object that is downloadable from Trailmaker Insights module – Support Suite - Parse Biosciences

How to download a Seurat object from Trailmaker Insights module

Downloading a Seurat object from the Trailmaker™ Insights module is a simple process that allows you to further analyze your Seurat projects using R. Here's how you can do it:

1. Log in to your Trailmaker account and navigate to the Insights module.

2. From the Projects list, select the project from which you want to download the object.

3. On the Project Details page, click on the ‘Download’ button. If the Project uses Seurat, you'll be able to download the Seurat object by selecting the ‘Processed Seurat object (.rds)’ option within the Download menu.

This initiates the download of the Seurat object in a format that is compatible with R. Once downloaded, you can load this Seurat object into your R environment for advanced analysis and visualization.

Note that Projects that use Scanpy instead offer the 'Processed Anndata object (.h5ad)' option, which downloads the AnnData (Scanpy) object. Seurat objects cannot be downloaded from Scanpy Projects, and AnnData objects cannot be downloaded from Seurat Projects.

What is an .rds file and how do you use it?

An .rds file is a file format used in R to save a single R object, allowing for easy storage and retrieval. The .rds extension stands for "R Data Serialization." It is used to serialize R objects - such as data frames, lists, or custom objects - into a binary format that can be saved to disk and reloaded later.
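
As a minimal illustration of that round trip, the sketch below serializes a small list to a temporary .rds file and reads it back using only base R. The object `toy` and the temporary path are placeholders made up for this example; they are not produced by Trailmaker.

```r
# Toy object standing in for any single R object (hypothetical example data)
toy <- list(
  genes  = c("ENSG00000141510", "ENSG00000012048"),
  counts = matrix(1:4, nrow = 2)
)

# saveRDS() writes exactly one object; the variable name is not stored
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, file = rds_path)

# readRDS() returns the object, so it can be assigned to any name
restored <- readRDS(rds_path)
identical(toy, restored)
```

Because the variable name is not stored in the file, you choose the name when you assign the result of readRDS().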

In the context of Trailmaker, the .rds file you download contains a processed Seurat object, which is a comprehensive representation of your single-cell RNA sequencing data. This file encapsulates all the processed data and analyses you've performed on the dataset within Trailmaker. This includes normalized data, variable features, dimensionality reductions, clustering results, cell type annotations and more.

By loading this file into RStudio, you can pick up right where you left off, allowing for further customization, additional analyses, or integration with other datasets. To work with your Seurat object in RStudio, you can load the .rds file using the readRDS() function. For example:

scdata <- readRDS("your_seurat_object.rds")

Note: Trailmaker currently uses Seurat v4 for data processing and analysis. Since the structure of Seurat v5 objects differs slightly, we recommend using Seurat v4 for compatibility when working with the Seurat object downloaded from Trailmaker.
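
If you are unsure which Seurat version is installed, a quick check before loading the object can help. The following is a minimal sketch; the helper name `is_seurat_v4` is hypothetical and introduced only for this example, and it simply tests whether a version string falls in the 4.x series.

```r
# Hypothetical helper: TRUE if a version string falls in the Seurat 4.x series
is_seurat_v4 <- function(version) {
  v <- package_version(version)
  v >= "4.0.0" && v < "5.0.0"
}

# Only query the installed version if Seurat is actually available
if (requireNamespace("Seurat", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("Seurat"))
  if (!is_seurat_v4(installed)) {
    message("Seurat ", installed, " is installed; Trailmaker objects are created with Seurat v4.")
  }
}
```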

What does the Seurat object downloaded from Trailmaker contain?

The Seurat object you download from Trailmaker is a processed dataset that contains only the cells retained after all filtering steps have been applied during Data Processing. Additionally, cells with fewer than 10 transcripts are removed from the Seurat object. This ensures that the data you're working with is clean, high-quality, and ready for downstream analysis. The cells that were filtered out during Data Processing in Trailmaker are not contained within the object.

Structure of the Seurat object

A Seurat object is a specialized data structure in R designed for single-cell RNA sequencing data. It efficiently organizes data and results, allowing you to perform complex analyses and visualizations. The Seurat object comprises several key components, which are called slots. The most commonly used ones are:

assays: Contains the raw (“counts” slot) and processed (“data” and “scale.data” slots) expression data.
meta.data: Stores cell-level metadata, such as cell identifiers and experimental conditions.
reductions: Holds dimensionality reduction results like PCA, UMAP, or t-SNE.
commands: Records the processing steps and parameters used during analysis.
misc: Can contain custom additional data. The Seurat object downloaded from Trailmaker includes gene annotations; we will talk more about this in the section “How to retrieve gene symbols from the Seurat object” of this article.
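
Slots are accessed with the `@` operator. To show the access pattern without needing a downloaded file, the sketch below defines a toy S4 class that only mimics the slot names listed above. `ToySeurat` and its contents are assumptions for illustration and are not the real Seurat class.

```r
library(methods)

# Toy S4 class that only mimics the slot names listed above (NOT the real Seurat class)
setClass("ToySeurat", slots = c(
  assays     = "list",
  meta.data  = "data.frame",
  reductions = "list",
  commands   = "list",
  misc       = "list"
))

toy <- new("ToySeurat",
  assays    = list(RNA = list(counts = matrix(0L, nrow = 2, ncol = 3))),
  meta.data = data.frame(samples = c("A", "A", "B")),
  misc      = list(gene_annotations = data.frame(name = c("TP53", "BRCA1")))
)

slotNames(toy)                   # names of all slots in the object
toy@meta.data$samples            # '@' selects a slot, '$' a column inside it
toy@misc$gene_annotations$name   # nested access into the misc slot
```

With the real object loaded as scdata, the same pattern gives scdata@meta.data or scdata@misc$gene_annotations.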

The Seurat object from Trailmaker contains specific metadata columns within the meta.data slot, providing detailed information about each cell:

barcode: The unique identifier for each cell detected in the dataset.
orig.ident: Indicates the original identity or source of each cell. This is particularly useful when data from multiple samples, conditions, or experimental groups are combined, as it helps trace each cell back to its origin.
nCount_RNA: The total number of RNA molecules (counts) detected in each cell. This reflects the cellular RNA content and can be used for quality control.
nFeature_RNA: The number of genes with non-zero counts observed in each cell. This metric indicates the complexity of the transcriptome captured in each cell.
percent.mt: The proportion of transcripts mapping to mitochondrial genes. High percentages may indicate cell stress or damage, serving as a quality control metric.
cells_id: An internal identifier assigned to each cell within Trailmaker for tracking purposes.
samples: Indicates the sample to which each cell belongs.
seurat_clusters: The cluster assignments for each cell, calculated using the Leiden or Louvain algorithms during the Configure Embedding step of Data Processing in Trailmaker.
custom_cellset-*: If any custom cell sets were created in Trailmaker, they will be present in the meta.data slot of the Seurat object as columns named “custom_cellset-” followed by the name you defined in Trailmaker (e.g., custom_cellset-MyCellSet).
ScType-*: If cell type annotation was performed in Trailmaker, the annotations will be present in the meta.data slot of the Seurat object as columns named “ScType-Tissue-Species” (e.g., ScType-Liver-human).
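
The prefixed columns described above can be located by pattern matching on the column names. The sketch below uses a toy data frame as a stand-in for scdata@meta.data; all values, and in particular the value type of the custom cell set column, are assumptions made up for illustration.

```r
# Toy stand-in for scdata@meta.data (values are made up for illustration)
meta <- data.frame(
  samples = c("S1", "S1", "S2", "S2"),
  seurat_clusters = c("0", "1", "0", "0"),
  `custom_cellset-MyCellSet` = c(TRUE, FALSE, TRUE, FALSE),
  `ScType-Liver-human` = c("Hepatocytes", "Kupffer cells", "Hepatocytes", "Hepatocytes"),
  check.names = FALSE   # keep the hyphens in the column names
)

# Find the custom cell set and ScType columns by their prefixes
custom_cols <- grep("^custom_cellset-", colnames(meta), value = TRUE)
sctype_cols <- grep("^ScType-", colnames(meta), value = TRUE)

# Hyphenated names need [[ ]] (or backticks) instead of plain $ access
table(meta$samples, meta[[sctype_cols[1]]])

# Cells per sample and cluster
table(sample = meta$samples, cluster = meta$seurat_clusters)
```

Because these column names contain hyphens, meta$custom_cellset-MyCellSet would be parsed as a subtraction; use meta[["custom_cellset-MyCellSet"]] instead.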

Columns derived from EmptyDrops (Classifier filter - step 1 of Data Processing in Trailmaker)
- emptyDrops_Total: Total UMI count for each barcode.
- emptyDrops_LogProb: Log-probability of observing the barcode's count vector under the null (ambient) model.
- emptyDrops_PValue: Monte Carlo p-value for each barcode against the null model.
- emptyDrops_Limited: True/false values that indicate whether a lower p-value could be obtained by increasing the number of Monte Carlo iterations.
- emptyDrops_FDR: False discovery rate, i.e. the p-value adjusted for multiple testing. Droplets that deviate significantly from the ambient profile are detected at a specified FDR threshold.

The EmptyDrops algorithm is designed to distinguish between empty droplets and true cells by testing each barcode's expression profile against the ambient RNA profile. It identifies droplets that show significant deviations from the ambient RNA, suggesting the presence of actual cells. For more details, refer to the EmptyDrops vignette or check out the Classifier Filter lesson in our free online data analysis course.

Important note: The EmptyDrops columns will still appear in the Seurat object even if the Classifier Filter (which uses EmptyDrops) is disabled during Data Processing. This is because the EmptyDrops algorithm is always executed in the background to provide you with comprehensive data, but its results are not used to filter out cells when the filter is disabled. This allows you to access all relevant metrics for your analysis, regardless of the filtering options selected.
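
For example, if the Classifier Filter was disabled, you can still inspect how many of the retained cells would pass a given FDR threshold. The sketch below uses a toy data frame in place of scdata@meta.data; the values and the threshold of 0.01 are assumptions chosen for this example only, and missing FDR values are handled defensively in case they occur.

```r
# Toy stand-in for the emptyDrops_* columns of scdata@meta.data (made-up values)
meta <- data.frame(
  emptyDrops_Total = c(1500, 220, 90, 4000),
  emptyDrops_FDR   = c(0.0001, 0.03, NA, 0.0002)
)

# Hypothetical threshold chosen for this example only
fdr_threshold <- 0.01

# NA FDR values are treated as "not called"; %in% TRUE makes that explicit
is_called_cell <- (meta$emptyDrops_FDR <= fdr_threshold) %in% TRUE

table(is_called_cell)
meta[!is_called_cell, ]   # cells that would not pass at this threshold
```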

Columns derived from the Doublet Filter - step 5 of Data Processing
- doublet_scores: Quantifies the likelihood that a cell is a doublet - a cell that may contain RNA from two or more cells.
- doublet_class: Categorizes each cell as "singlet" or "doublet" based on the doublet score.

These fields represent the score and classification assigned to each cell by the scDblFinder algorithm. If the Doublet filter in Data Processing was enabled, you should only see cells classified as "singlet" since doublets are removed during this step. For more information, refer to the scDblFinder documentation or to the Doublet Filter lesson in our online course.

Important note: You might notice that some cells have NA values in the doublet_scores and doublet_class columns. This occurs because the scDblFinder algorithm sometimes excludes certain cells from the doublet scoring process. This exclusion can happen if cells have low total counts (determined by a dynamic threshold) or don't meet specific criteria, even if their counts are high. These cells are considered less likely to be doublets, and thus, they receive NA values. In Trailmaker, these cells are retained to ensure that potentially valid data is not discarded without justification.
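
When summarizing or filtering on these columns, the NA values need explicit handling, because table() hides them by default and comparisons with NA return NA. A minimal sketch on a toy data frame (made-up values standing in for scdata@meta.data):

```r
# Toy stand-in for the doublet columns of scdata@meta.data (made-up values)
meta <- data.frame(
  doublet_scores = c(0.02, 0.91, NA, 0.10),
  doublet_class  = c("singlet", "doublet", NA, "singlet")
)

# useNA = "ifany" shows the NA group that table() hides by default
table(meta$doublet_class, useNA = "ifany")

# Keep singlets and unscored (NA) cells, mirroring how Trailmaker retains NA cells
keep <- is.na(meta$doublet_class) | meta$doublet_class == "singlet"
meta_kept <- meta[keep, ]
```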

How to retrieve gene symbols from the Seurat object

When working with the Seurat object downloaded from Trailmaker, you might notice that the raw and processed count matrices use Ensembl IDs instead of gene symbols. While Ensembl IDs are precise and unique identifiers for genes, many downstream analysis and visualization functions - such as FindAllMarkers(), FeaturePlot(), or DotPlot() - often require gene symbols to function properly or for easier interpretation.

Within your Seurat object, a mapping between Ensembl IDs and gene symbols is stored in the “misc” slot, specifically in scdata@misc$gene_annotations. You can use this mapping to replace the Ensembl IDs with gene symbols in the row names of your count matrices.

Below is an R script that you can use to replace Ensembl IDs with gene symbols in your Seurat object's count matrices.

# Load the Seurat object
scdata <- readRDS("/your/path/to/your_seurat_object.rds")

# Access gene annotations from the misc slot
gene_annotations <- scdata@misc$gene_annotations

# Create a mapping between Ensembl IDs and gene symbols
gene_name_mapping <- setNames(gene_annotations$name, rownames(gene_annotations))

# Extract current row names (Ensembl IDs) from the counts matrix
current_features <- rownames(scdata@assays$RNA@counts)

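# Map Ensembl IDs to gene symbols (unname() drops the names carried over from the lookup)
new_features <- unname(gene_name_mapping[current_features])

# Replace missing or empty symbols with the original Ensembl IDs
missing_symbol <- is.na(new_features) | new_features == ""
new_features[missing_symbol] <- current_features[missing_symbol]

# Gene symbols are not guaranteed to be unique; make duplicates distinct
new_features <- make.unique(new_features)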

# Update the row names for the counts matrix
rownames(scdata@assays$RNA@counts) <- new_features

# Update the row names for the data matrix (log-normalized data)
rownames(scdata@assays$RNA@data) <- new_features

# If scale.data is populated, update its row names as well. An empty scale.data
# slot is a 0 x 0 matrix rather than NULL, so test the number of rows.
if (nrow(scdata@assays$RNA@scale.data) > 0) {
  current_scale_features <- rownames(scdata@assays$RNA@scale.data)
  # Reuse the renaming applied to counts/data so all matrices stay consistent
  new_scale_features <- new_features[match(current_scale_features, current_features)]
  rownames(scdata@assays$RNA@scale.data) <- unname(new_scale_features)
}

# Verify that the row names have been updated
head(rownames(scdata@assays$RNA@counts))

# Save the updated Seurat object
saveRDS(scdata, file="/your/path/to/updated_seurat_object.rds")

Remember to replace "/your/path/to/your_seurat_object.rds" with the path to your downloaded .rds file.

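Note that some Ensembl IDs may not have corresponding gene symbols. The script retains the original Ensembl IDs in such cases to prevent data loss. Also note that the script renames only the expression matrices; other parts of the object that store feature names, such as the list of variable features, keep the Ensembl IDs.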
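
If you prefer to leave the object untouched, an alternative is to keep the Ensembl IDs and look up the ID for a given gene symbol only when you need it. The sketch below uses a toy data frame in place of scdata@misc$gene_annotations, assuming the same layout the script above relies on (row names are Ensembl IDs, the "name" column holds gene symbols). The helper `symbol_to_ensembl` is hypothetical and returns only the first matching ID per symbol.

```r
# Toy stand-in for scdata@misc$gene_annotations: row names are Ensembl IDs and
# the "name" column holds gene symbols
gene_annotations <- data.frame(
  name = c("TP53", "BRCA1"),
  row.names = c("ENSG00000141510", "ENSG00000012048"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first Ensembl ID annotated with each symbol
# (NA if the symbol is not present in the annotations)
symbol_to_ensembl <- function(symbols, annotations) {
  ids <- rownames(annotations)[match(symbols, annotations$name)]
  setNames(ids, symbols)
}

ids <- symbol_to_ensembl(c("TP53", "BRCA1", "NOT_A_GENE"), gene_annotations)
ids
```

The non-missing IDs can then be passed to functions that take a features argument, such as FeaturePlot().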

Summary

In this article, we've guided you through the process of downloading and using a Seurat object from the Trailmaker Insights module. We explained how to obtain the Seurat .rds file and load it into RStudio, and provided a detailed explanation of the Seurat object structure.

Additionally, we addressed how to retrieve gene symbols by mapping Ensembl IDs using the provided gene annotations within the Seurat object.

By following these steps, you can leverage the rich data contained in the Seurat object and perform advanced downstream analysis outside of Trailmaker.