Introduction to Trailmaker™
Getting started with your single cell RNA-sequencing (scRNA-seq) data analysis doesn’t have to be daunting! Trailmaker guides you through the end-to-end analysis of your Evercode™ data, taking you from FASTQ files to figures in just a few simple clicks. You can gain unprecedented insight into the cellular heterogeneity of complex biological systems and drive forward your research programme.
Visit our website or watch an in-depth demo of Trailmaker in action.
Trailmaker is available for free to all Parse customers and to all academic users.
Table of Contents
- Getting started
- Sample Loading Table module
- Pipeline module
- Automatic integration of Pipeline & Insights modules
- Insights module
- Insights module - Data Processing tab
- How it works
- Automated data processing
- Data processing status indicator
- Navigating through the data processing steps
- Data processing plots and statistics
- Data Processing steps
- Data Processing for immune repertoire analysis
- Adjusting a data processing setting
- Data Processing failures
- Saving a processed project
- Exporting the data processing plots
- Downloading the data processing settings
- Summary of Insights Data Processing tab
- Data Exploration
- Navigation
- Cell sets and Metadata tile
- Performing automatic annotation
- Creating custom cell sets
- Subset selected cell sets to a new project
- UMAP or t-SNE embedding tile
- Heatmap
- Gene list
- Differential expression analysis
- Pathway enrichment analysis
- PantherDB
- Enrichr
- Data Exploration for immune repertoire analysis
- Plots and Tables
- Citing Trailmaker
Getting started
Creating an account and logging in
Access Trailmaker at: https://app.trailmaker.parsebiosciences.com/
If you already have an account, simply input your email address and password to log in.
If you do not have an account already, click “Sign Up” to create one. Creating an account takes only a few minutes. Provide your email address and name and set a password. You’ll need to verify your email address by clicking the link in the email that is automatically sent to you during the signup process.
Navigation
On your first visit to Trailmaker, you’ll be directed to a landing page that outlines the major use cases of the platform and directs users to the relevant module:
- The Sample Loading Table module is where you can complete the sample loading table to help accurately load your round 1 barcoding plate(s). This module is currently in beta mode and supports Evercode WT Mini, WT, and WT Mega kits with standard fixation and/or Integra automation capabilities. Sample loading tables for WT Mega 384, WT Penta and WT Penta 384 kits, and for all low input fixation experiments, are available to customers on our support suite.
- The Dataset Repository is where you can start exploring the platform features and functionality with public data.
- The Pipeline module is where you can upload your Parse Biosciences Evercode FASTQ files for alignment to your selected genome. The Pipeline module outputs include reports, downloadable count matrices and automated integration with the downstream Insights module.
- The Insights module is where you can upload files that have already been pre-processed in order to conduct downstream analysis and visualization. Supported file types include: output files from the Pipeline module, count matrices of multiple technology types including Parse Biosciences, 10x Chromium™ and BD Rhapsody™, H5 files, and Seurat objects. Alternatively, you can choose to explore one of the demo datasets from the datasets repository.
The first time you access the Pipeline or Insights modules within Trailmaker, you’ll be prompted to agree to the Trailmaker privacy policy and terms of use. Acceptance of these terms is mandatory to use the Pipeline and Insights modules in Trailmaker. Note that the Sample Loading Table module can be used without agreeing to the terms. You can view the policies and terms of use at any time by accessing your Account Settings.
When you’re logged into Trailmaker, you can navigate between the modules using the navigation bar on the left side. Note that the tabs available within the Pipeline and Insights modules collapse when those modules are not selected. The images below show the navigation panel view when the Sample Loading Table (left), Pipeline (middle) and Insights (right) modules are selected:
Navigation to some modules may be restricted by the current state of your analysis. For example, navigation to the Pipeline Output tab of the Pipeline module is dependent on a Pipeline Run being triggered. Similarly, navigation to the tabs within the Insights module (Data Processing, Data Exploration and Plots & Tables) is dependent on successful data upload to this module.
Account settings
The account settings menu can be found in the bottom left corner. In the account settings menu, you can change your name and password, and access the Trailmaker terms of use agreements.
Note that it is not possible for the Parse Biosciences team to change the email address associated with your Trailmaker account. If your email address changes, we recommend that you sign up for a Trailmaker account with your new email address and then transfer the ownership of all Runs and Projects in your existing Trailmaker account to the account registered with your new email address.
TrailGuide chatbot
On the bottom right corner of Trailmaker, clicking the compass icon allows you to access the TrailGuide chatbot to support your data analysis.
TrailGuide is an AI-powered chatbot that is trained on the public-facing Trailmaker support resources, including the user guide. The chatbot can assist with how to use Trailmaker, what features are available, and technical questions about how the platform works. However, the chatbot does not have access to your data and it cannot help with finding biological insights. None of your scientific or personal data is used to train TrailGuide.
Raw sequencing data compatibility with Trailmaker
Raw sequencing output files are usually in the form of FASTQ or raw binary base call (BCL) files. BCL files require conversion to FASTQ format for most downstream analysis protocols. FASTQ files are large data files containing raw sequence data and quality scores.
Which raw sequencing files are compatible with Trailmaker?
- BCL files are not directly compatible with Trailmaker.
- FASTQ files generated using Parse Biosciences’ Evercode kits can be directly uploaded to the Pipeline module of Trailmaker. Specifically, Trailmaker supports FASTQ files generated using the following Parse Biosciences Evercode kits: Whole Transcriptome (WT) Mini, WT, WT Mega, WT Mega 384, WT FFPE Mini, WT FFPE, WT FFPE Mega, TCR Mini, TCR, TCR Mega, BCR Mini, BCR and BCR Mega.
- FASTQ files generated using other single cell technologies need processing to count matrices in order to be compatible with Trailmaker. Contact your single cell sequencing technology provider for more information on the relevant FASTQ file processing pipelines.
- Trailmaker only supports FASTQ files from short read libraries. Long reads are not currently supported.
These data files should be uploaded to Trailmaker’s Pipeline module.
Uploaded data are stored on Amazon Web Services (AWS) web servers located in Ireland, within the European Union (EU).
Processed data compatibility with Trailmaker
Trailmaker Insights module supports processed data files in a variety of formats and from several single cell sequencing technologies. This includes:
- Count matrices generated using Parse Biosciences’ Evercode Whole Transcriptome technology that have been processed using the Parse Biosciences Pipeline. For WT data, you should have 3 data files per sample: all_genes.csv, cell_metadata.csv, and count_matrix.mtx or DGE.mtx. For immune profiling (TCR or BCR) data, you should have 3 files: clonotype_frequency.tsv, barcode_report.tsv and either bcr_annotation_airr.tsv or tcr_annotation_airr.tsv. Immune profiling data can be uploaded with or without WT parent files.
- Count matrices generated using 10x Chromium technology that have been processed using Cell Ranger. You should have 3 data files per sample: barcodes.tsv, features.tsv or genes.tsv, and matrix.mtx.
- Data generated using BD Rhapsody in the expression_data.st file format.
- Seurat v4 or v5 objects in the .rds file format.
- H5 files in the matrix.h5 file format, such as those output from Cell Ranger.
These data files should be uploaded to Trailmaker’s Insights module.
Uploaded data are stored on Amazon Web Services (AWS) web servers located in Ireland, within the European Union (EU).
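Before uploading to the Insights module, it can help to confirm that each sample folder contains the expected files. The following is a minimal sketch only, assuming one folder per sample and uncompressed files with exactly the names listed above. The function name `missing_files` and the technology keys are hypothetical and are not part of Trailmaker.

```python
from pathlib import Path

# File sets transcribed from the list above.
# Each inner tuple lists the accepted alternatives for one required file.
# Assumption: exact, uncompressed file names (compressed variants are not checked).
REQUIRED_FILES = {
    "parse_wt": [
        ("all_genes.csv",),
        ("cell_metadata.csv",),
        ("count_matrix.mtx", "DGE.mtx"),
    ],
    "parse_immune": [
        ("clonotype_frequency.tsv",),
        ("barcode_report.tsv",),
        ("bcr_annotation_airr.tsv", "tcr_annotation_airr.tsv"),
    ],
    "10x": [
        ("barcodes.tsv",),
        ("features.tsv", "genes.tsv"),
        ("matrix.mtx",),
    ],
}


def missing_files(sample_dir, technology):
    """Return a description of each required file that is absent from sample_dir."""
    present = {p.name for p in Path(sample_dir).iterdir() if p.is_file()}
    missing = []
    for alternatives in REQUIRED_FILES[technology]:
        if not any(name in present for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

An empty list from `missing_files("path/to/sample", "parse_wt")` suggests the folder holds the three files described for WT data. It does not check the file contents.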
Unsupported data formats
Trailmaker's Pipeline module does not process BAM or CRAM files. These file formats would need to be converted into FASTQ files in order to be processed in Trailmaker. As a reminder, the Trailmaker Pipeline module only supports the processing of FASTQ files generated using Parse Biosciences Evercode technology.
Trailmaker's Insights module does not currently support the upload of AnnData objects in the h5ad file format. However, resources are available online (see here) for converting an AnnData object (.h5ad file format) to a Seurat object (.rds file format), which enables you to proceed with your analysis in Trailmaker™ via the upload of a Seurat object.
Trailmaker's Insights module does not currently support the upload of a combined matrix file in the CSV or TSV format (note that this is different to the standard Parse or 10x formats, which have 3 files per sample and are supported by Trailmaker). Guidance on converting a combined matrix file in the CSV or TSV format is available here.
Guidance on extracting the unfiltered count matrices in a Trailmaker-compatible format from a Seurat object is available here.
Note that Trailmaker is specifically designed for single cell RNA-seq data and does not support bulk RNA-seq data.
Data limits
The only file size limits for upload to Trailmaker are:
- Individual FASTQ files uploaded to the Pipeline module must be less than 5 TB in size.
- Seurat objects uploaded to the Insights module must be less than 15 GB in size.
There are no limits to the number of samples, Pipeline module runs or Insights module projects that a user can upload to Trailmaker.
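As a rough local pre-check of these size limits, a sketch along the following lines could be used. It assumes decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes), which this guide does not specify, and the function name `within_limit` is hypothetical.

```python
import os

# Limits quoted in this guide.
# Assumption: decimal units; the guide does not say whether TB/GB are decimal or binary.
LIMIT_BYTES = {
    "fastq": 5 * 10**12,   # individual FASTQ file, Pipeline module
    "seurat": 15 * 10**9,  # Seurat object, Insights module
}


def within_limit(size_bytes, kind):
    """Return True if size_bytes is strictly below the limit for kind."""
    return size_bytes < LIMIT_BYTES[kind]
```

For a file on disk, the size can be obtained with `os.path.getsize(path)` and passed to `within_limit`.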
For Insights module projects that contain more than 1 million cells post-processing (i.e. in the Data Exploration and Plots & Tables pages), it is advisable to select Scanpy. See Step 6: Data integration for further details.
Sample Loading Table module
The Sample Loading Table module supports the generation and export of sample loading tables for the following Parse Biosciences Evercode kits:
- Whole Transcriptome (WT) Mini, WT, WT Mega
- TCR Mini, TCR, TCR Mega
- BCR Mini, BCR, BCR Mega
Sample loading tables for WT Mega 384, WT Penta, WT Penta 384 and WT FFPE kits are available to customers on our support suite.
Note: The new Sample Loading Table module is currently available in beta mode. The functionality mimics the functionality provided in the Excel file versions that are available to customers on our support suite. To provide feedback on this module, email support@parsebiosciences.com.
On the first visit to the Sample Loading Table module, click 'Create Sample Loading Table' to get started.
Input a name for the new sample loading table and click 'Create'.
The new sample loading table will appear in your list of tables and will be selected by default. The Sample Loading Table Details tile on the right side of the page shows section 1 'Samples & Cells' of the selected sample loading table.
Section 1 'Samples & Cells': Input the experiment information including kit type, kit chemistry, number of samples and target number of barcoded cells. When the kit type is input, further fields appear automatically. Note that the default target number of barcoded cells is preset as the recommended maximum cell load for each kit size.
When you are ready to continue, click ‘Next Step’.
Section 2 'Sample Information': Input the sample names together with the percent of library allocation and stock concentration (in cells/µL) for each sample. If you are running samples of unequal proportions, input samples into the sample loading table from highest to lowest percentage allocation.
When you are ready to continue, click ‘Next Step’. Two buttons will appear:
- The 'Save to Trailmaker' button allows you to save the sample loading table within Trailmaker. Saved sample loading tables can be used as inputs to the FASTQ file processing pipeline in the Pipeline module of Trailmaker.
- The 'Download Spec List File' button allows you to download the Sample Specification List file. This file specifies which samples are in which wells and can serve as an input for running the FASTQ processing pipeline outside of Trailmaker.
Section 3 ‘Table Data, Volumes, and Plate Configurations’: The Sample Loading Table and Plate Configuration can be viewed and downloaded.
- The Sample Loading Table provides the required volumes of sample stock and dilution buffer to prepare the sample working dilutions for loading into the Round 1 barcoding plate(s) for each selected kit. Use the indicated volumes to prepare the appropriate working dilution for each sample prior to loading.
- The Plate Configuration illustrates which samples should be added to which wells in the round 1 plate. At the top is a 96-well plate layout, with the color-coordinated sample reference table shown underneath. The sample names automatically populate the sample reference table. The number of wells used per sample is automatically populated, with each sample assigned a color. The colored plate configuration has been arranged to correspond to the appropriate samples and number of wells.
These files are easy to download and print in order to take to the bench for reference when loading the round 1 barcoding plate.
Section 4 'Integra Tables & Downloads': If required, the Integra tables can be generated and downloaded in Section 4 by clicking ‘Generate Tables and Files’.
The Integra Worklist File can be downloaded using the button. This file serves as an input for running an Integra workflow.
The Integra Loading Table can be viewed by clicking the relevant arrow to maximize the table.
The Diluent Volumes can be viewed by clicking the relevant arrow to maximize the diluent volumes table.
If you have unsaved changes in your sample loading table, the following warning message will appear if you navigate away:
To save your sample loading table, click the 'Save to Trailmaker' button after section 2:
Note that volumes below 2 µL trigger a warning.
Pipeline module
Overview
The Pipeline module is available for the processing of FASTQ files generated using Parse Biosciences’ Evercode technology.
This module supports FASTQ files generated using the following Parse Biosciences Evercode kits:
- Whole Transcriptome (WT) Mini, WT, WT Mega, WT Mega 384
- TCR Mini, TCR, TCR Mega
- BCR Mini, BCR, BCR Mega
This FASTQ file processing module handles essential tasks such as barcode correction, read alignment, read deduplication, and transcript quantification. These quantified transcripts are then used to generate a cell-by-gene count matrix used for downstream analyses.
Further details about the Parse Biosciences pipeline are available to customers on the support suite.
Pipeline Run Details page
See also: Guided walkthrough: Pipeline module set-up
Create a new run
When you first navigate to the Pipeline module of Trailmaker, your list of Pipeline Runs will be empty. To start your first Pipeline Run with Parse Biosciences data, select the ‘Create New Run’ button:
This action opens the Pipeline module wizard, which guides you through Run creation, experimental information input, and data upload.
In the first step of the wizard, provide the new Run with a name and a description (optional).
Experimental setup
Next, provide the details of the experimental setup using the dropdown menus. Specifically, select the Parse Biosciences technology that you used (WT Mini, WT, WT Mega, WT Mega 384, TCR Mini, TCR, TCR Mega, BCR Mini, BCR, BCR Mega), and the chemistry version of your kit (v1, v2, v3 or v4). Note that the chemistry field is dynamic, with the options tailored to the available options for the selected kit.
If your dataset contains FFPE samples processed using an Evercode WT FFPE kit, change the toggle selection to 'on'. Note that the FFPE option is only available when v4 chemistry is selected.
When you have input the kit type, another field will appear for you to select the number of sublibraries that you would like to process in the current pipeline run. Note that this field is dynamic, with the range determined by the kit choice. In the example below, the WT Mini kit is selected, and therefore the number of sublibraries can be 1 or 2.
If you selected an immune profiling (TCR or BCR) kit, an additional toggle will appear in this step of the wizard. Set the toggle to "on" to run the pipeline with paired whole transcriptome data. If you only have immune profiling data with no parent WT FASTQ files, set the toggle to "off".
When you are ready to continue, click ‘Next’.
Sample loading table
In the next step, specify your sample loading table by importing from the Sample Loading Table module or by uploading a file from your local storage.
The first tab, 'Select from cloud', enables import of a sample loading table from the Sample Loading Table module. Simply select a saved sample loading table from the dropdown menu and click 'Import'.
Alternatively, the second tab 'Upload from local storage' allows you to upload your sample loading table from your computer. Supported file formats include the Sample Specification List (.txt) file that can be downloaded from the Sample Loading Table module, and the Excel (.xlsm) file format that is available to download from our support suite. In both cases, the official Parse Biosciences sample loading table template for the relevant kit is required. Files in .txt or .xlsm formats that are not generated from the approved template will not work.
Simply drag and drop the file into the box, and click ‘Upload’. The file will upload in just a few seconds. Once uploaded, you can view the file name, upload date/time, as well as the number of samples and the sample names.
Note that if your selected sample loading table contains duplicate sample names, the following warning will appear. Duplicate sample names should be reviewed before continuing. If the duplicated sample names are biological replicates, you should edit your sample loading table to assign unique sample names, and re-upload or re-import it.
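To spot duplicate sample names before uploading or importing a sample loading table, a simple check like the one below may be useful. It is a sketch that assumes names are compared exactly (case-sensitive, no trimming), which may differ from how Trailmaker compares them. The function name `duplicate_names` is hypothetical.

```python
from collections import Counter


def duplicate_names(sample_names):
    """Return the sample names that occur more than once, sorted alphabetically."""
    counts = Counter(sample_names)
    return sorted(name for name, n in counts.items() if n > 1)
```

For biological replicates, any name returned here would need to be made unique (for example by adding a replicate suffix) before continuing.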
When you are ready to continue, click ‘Next’.
Reference genome
In the next step, select the reference genome for aligning your whole transcriptome data using the dropdown menu.
If the genome you require is not available in the dropdown menu list, select the ‘Create custom genome’ tab. Provide a name and description for your custom genome, adhering to the character limitations provided in the information tooltips. These will be used to create the genome name that will then appear in the reference genome dropdown menu list, as “name: description”.
Then, select or drop your FASTA and annotation files. You must drop two files together: one FASTA file (*.fa/.fasta/.fna[.gz]) and one annotation file (*.gtf/.gff3[.gz]). Upload one matched FASTA/annotation pair per drop. You can add multiple pairs by dropping again.
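The file extension rules above can be sketched as a small check that two files form one FASTA/annotation pair. This is an illustration only: it tests file names, not file contents, and it assumes lowercase extensions. The function names `classify` and `is_valid_pair` are hypothetical.

```python
import re

# Extensions taken from the text above: *.fa/.fasta/.fna[.gz] and *.gtf/.gff3[.gz].
# Assumption: extensions are lowercase.
FASTA_RE = re.compile(r"\.(fa|fasta|fna)(\.gz)?$")
ANNOTATION_RE = re.compile(r"\.(gtf|gff3)(\.gz)?$")


def classify(filename):
    """Return 'fasta', 'annotation', or None based on the file extension."""
    if FASTA_RE.search(filename):
        return "fasta"
    if ANNOTATION_RE.search(filename):
        return "annotation"
    return None


def is_valid_pair(file_a, file_b):
    """Return True if the two files are one FASTA file and one annotation file."""
    return {classify(file_a), classify(file_b)} == {"fasta", "annotation"}
```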
Further guidance on appending reporter genes, viral genes, or custom gene sequences to your species of interest is available to Parse Biosciences customers in our article Adding Custom Sequences and Gene Annotation File Formatting.
If you are working with mixed species, reach out to support@parsebiosciences.com for help.
Once your matched FASTA/annotation file pair(s) are dropped, click ‘Upload’ to upload the files. Uploaded files are listed at the bottom of the modal.
Custom genomes are built when the Pipeline Run is initiated. Once built successfully, your custom genome will be available to select in other Pipeline Runs within your Trailmaker account.
Your custom genome is only available to you, except in cases where you share a Pipeline Run with another Trailmaker user.
- If you share a successfully completed Pipeline Run with a custom genome, then the custom genome will be available to both the original Run owner and the newly added user.
- If you transfer ownership of a Pipeline Run with an unbuilt custom genome (before the Pipeline Run has run successfully), then the custom genome will be transferred to the new owner together with the Pipeline Run.
Immune database
If you selected an immune profiling (TCR or BCR) kit, the next step of the wizard is where you select the immune database. For TCR kits, the options are Human or Mouse, and for BCR kits the options are Human, Mouse or Transgenic mouse.
When you are ready to continue, click ‘Next’.
FASTQ file upload
In the final step of the wizard, FASTQ files are uploaded. Trailmaker offers two options for FASTQ file upload: by drag and drop into the current step of the wizard via your web browser or via console (command line) upload. The instructions for FASTQ file upload via the web browser are shown by default.
Note that the on-screen instructions for FASTQ file upload depend on the kit type selected:
- For Whole Transcriptome kits, you'll see the following instructions.
When uploading FASTQ files, you must provide paired R1 and R2 files. You can provide one or multiple pairs of FASTQ files per sublibrary. In cases where you have multiple pairs of FASTQ files per sublibrary, such as where sublibraries were split over multiple sequencing lanes, concatenation is NOT required. All FASTQ file pairs can be uploaded to this modal.
Drag and drop the FASTQ files to the box, then click Upload.
- For immune profiling (TCR or BCR) kits with paired whole transcriptome data, you'll see the following instructions.
Upload your whole transcriptome (WT) FASTQ file pairs (R1 and R2) to the WT box and your immune profiling (TCR or BCR) FASTQ file pairs (R1 and R2) to the Immune box, then click 'Upload'. You can upload one or more pairs of FASTQ files per sublibrary. In cases where you have multiple pairs of FASTQ files per sublibrary, such as where sublibraries were split over multiple sequencing lanes, concatenation is NOT required. All FASTQ file pairs can be uploaded to this modal.
Drag and drop the WT FASTQ files to the left box, and the immune profiling FASTQ files to the right box, then click Upload.
- For immune profiling (TCR or BCR) kits without paired whole transcriptome data, you'll see the following instructions.
You can upload one or more pairs of FASTQ files per sublibrary. These should be paired (R1 and R2) files corresponding to the immune profiling data.
Drag and drop the immune profiling FASTQ files to the box below, then click Upload.
Alternatively, to upload FASTQ files via the command line, select the ‘Console upload’ option. Start by downloading the ‘parse-upload.py’ script. Then, click the ‘Generate token’ button.
Once your token is generated, click the ‘Copy to clipboard’ button at the bottom of the script box. Note that the script is different depending on whether you have whole transcriptome only data, immune profiling (TCR or BCR) data with paired whole transcriptome data, or immune profiling (TCR or BCR) data only.
Open your command line tool (for example, Terminal for Mac users or PowerShell for Windows users) and paste the copied script. There are two changes that you will need to make before running the script:
- Define the path to the parse-upload.py script that you downloaded from Trailmaker.
- Define the path(s) to the FASTQ files that you want to upload. You can specify a single or multiple file paths regardless of your kit type. For immune profiling (TCR or BCR) runs with paired WT data, you need to specify the WT and immune files separately.
When you run the script, you will be prompted to confirm the correct files for upload.
Data upload progress is then shown in the console.
The file upload progress is also reported in Trailmaker, indicating that the file is being uploaded from the console.
When upload is complete, both the console and Trailmaker FASTQ file upload modal report this.
Note that FASTQ file requirements in Trailmaker are as follows:
- FASTQ files from the same Parse Biosciences experiment that have different Illumina indexes should not be concatenated. These files are separate sublibraries.
- FASTQ files from the same Parse Biosciences experiment that share identical Illumina indexes do not need to be concatenated before uploading to Trailmaker - all FASTQ file pairs can be uploaded.
- When uploading FASTQ files, you must provide paired R1 and R2 files.
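The paired R1/R2 requirement can be checked locally before upload. The sketch below assumes Illumina-style file names in which the read is marked by `_R1` or `_R2` immediately before an optional numeric suffix and the extension (for example `sample_S1_L001_R1_001.fastq.gz`). That naming pattern and the function name `unpaired_fastqs` are assumptions, not Trailmaker requirements.

```python
import re
from collections import defaultdict

# Assumption: names look like <prefix>_R1<suffix> / <prefix>_R2<suffix>,
# where <suffix> is an optional "_<digits>" followed by .fastq or .fq, optionally .gz.
READ_RE = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])(?P<suffix>(_\d+)?\.(fastq|fq)(\.gz)?)$"
)


def unpaired_fastqs(filenames):
    """Return (incomplete pairs, unrecognised names) for a list of file names."""
    reads_by_key = defaultdict(set)
    unrecognised = []
    for name in filenames:
        match = READ_RE.match(name)
        if match is None:
            unrecognised.append(name)
            continue
        reads_by_key[(match["prefix"], match["suffix"])].add(match["read"])
    incomplete = sorted(
        prefix + "_R?" + suffix
        for (prefix, suffix), reads in reads_by_key.items()
        if reads != {"1", "2"}
    )
    return incomplete, unrecognised
```

Both returned lists being empty suggests that every recognised file has its mate. Multiple pairs per sublibrary (for example from different lanes) are treated as separate pairs, in line with the note that concatenation is not required.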
Note the following details about the FASTQ file upload process in Trailmaker:
- Uploading large FASTQ files can take multiple hours or even days. You must keep your computer running and your browser tab open for the duration of the upload.
- If your internet connection fails, file upload will resume from the last checkpoint. Checkpoints are created every 128 MB.
- The FASTQ file size limit for upload to Trailmaker is 5 TB per file.
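To illustrate what a 128 MB checkpoint interval means after a dropped connection, the arithmetic can be sketched as follows. This is a simplified model rather than Trailmaker's actual upload logic, and it assumes binary megabytes (1 MB = 1024**2 bytes). The function names are hypothetical.

```python
# Assumption: 128 MB is interpreted as 128 * 1024**2 bytes.
CHECKPOINT_BYTES = 128 * 1024**2


def resume_offset(bytes_sent):
    """Return the largest checkpoint boundary at or below bytes_sent."""
    return (bytes_sent // CHECKPOINT_BYTES) * CHECKPOINT_BYTES


def bytes_to_resend(bytes_sent):
    """Return how many already-sent bytes fall after the last checkpoint."""
    return bytes_sent - resume_offset(bytes_sent)
```

Under this model, a connection lost after 300 MB would resume from the 256 MB checkpoint, so at most one checkpoint interval of data is sent again.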
Note: From 26th February 2026 onwards, FASTQ file pairs are internally ordered based on the sublibrary name extracted from the FASTQ filenames. This ensures consistent sublibrary ordering across repeated runs, independent of the upload order.
FASTQ files are deleted from Trailmaker 30 days after upload. After this time, your Pipeline Run Details and any Outputs will continue to be available but the FASTQ files will be marked as 'Expired'.
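The 30-day retention period can be turned into a concrete date with a short calculation. The sketch below assumes the period is counted in whole calendar days from the upload date. The exact cut-off time used by Trailmaker is not stated here, and the function names are hypothetical.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention period stated in this guide


def fastq_expiry(upload_date):
    """Return the date on which uploaded FASTQ files are expected to expire."""
    return upload_date + timedelta(days=RETENTION_DAYS)


def days_remaining(upload_date, today):
    """Return the number of days left before expiry, never below zero."""
    return max((fastq_expiry(upload_date) - today).days, 0)
```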
For further instructions and support on command line upload of FASTQ files to Trailmaker, see: How to upload FASTQ files to Trailmaker using command line.
For immune profiling (TCR or BCR) runs with paired WT data, the WT and immune FASTQ file pairs need to be matched before the pipeline run can be initiated. This is done in the Run Details page after the final step of the wizard. The WT FASTQ file pairs are listed, with a row per sublibrary. In the case of multiple FASTQ file pairs per sublibrary, these may need to be assigned. The matching immune FASTQ file pair(s) should be selected using the dropdown menus in the final column of the FASTQ pair matcher table.
Running the Pipeline
Running the pipeline is blocked when any of the required fields in the Run Details page are incomplete. In this case, the ‘Run the pipeline’ button is disabled.
When all required fields are complete and the required data files have been successfully uploaded, all sections will be marked with a green tick and the ‘Run the pipeline’ button becomes enabled.
Clicking ‘Run the pipeline’ starts your pipeline run. For the first few minutes, the pipeline launches and does some initial checks. You can select to cancel the pipeline run.
Then, when the pipeline is fully running, the progress is shown, together with the option to view the current logs by selecting the sublibrary. Each sublibrary has its own log stream. Note that the pipeline steps, and therefore logs, will differ depending on whether you have whole transcriptome only data, immune profiling (TCR or BCR) data with paired whole transcriptome data, or immune profiling (TCR or BCR) data only.
The duration of your pipeline run depends on the kit type, the number of cells in your experiment as well as the sequencing depth. A typical WT Mini pipeline run time is 6-12 hours; for a WT kit it’s 12-24 hours; and for a WT Mega or WT Mega 384 kit it could take 24+ hours. Immune profiling (TCR or BCR) runs with paired WT data take longer than WT only runs.
Whilst your pipeline is running, you can navigate away from Trailmaker and shut down your computer - the pipeline will continue to run. You can choose to receive an email notification when your run is finished.
Pipeline Version
The Pipeline module in Trailmaker operates the Parse pipeline. The current and previous versions of the pipeline used in Trailmaker are reported below:
- From 15th June 2026 to date: v1.8.1
- From 5th May 2026 to 15th June 2026: v1.7.3
- From 27th April 2026 to 5th May 2026: v1.7.2
- From 27th March 2026 to 27th April 2026: v1.7.1
- From 16th March 2026 to 27th March 2026: v1.7.0
- From 10th December 2025 to 16th March 2026: v1.6.3
- From 6th November 2025 to 10th December 2025: v1.6.2
- From 29th August 2025 to 6th November 2025: v1.6.1
- From 15th July 2025 to 29th August 2025: v1.6.0
- From 3rd April 2025 to 15th July 2025: v1.5.1
- From 3rd March 2025 to 3rd April 2025: v1.5.0
- From 13th December 2024 to 3rd March 2025: v1.4.1
- From 7th November 2024 to 13th December 2024: v1.4.0
- From 26th March 2024 to 7th November 2024: v1.2.1
- From 21st March 2024 to 26th March 2024: v1.2.0
The pipeline version used to process your run in Trailmaker is stated at the bottom of the Pipeline Outputs page. We recommend that you report the pipeline version when publishing your data analysis.
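If you need to cross-check which pipeline version was in use on a given date, the list above can be expressed as a lookup. This is a sketch only: it assumes each version applies from its start date inclusive, which the list does not state for the changeover days. The version shown on the Pipeline Outputs page remains the authoritative value, and the function name `pipeline_version_on` is hypothetical.

```python
from bisect import bisect_right
from datetime import date

# (first day in use, pipeline version), oldest first, transcribed from the list above.
VERSION_STARTS = [
    (date(2024, 3, 21), "v1.2.0"),
    (date(2024, 3, 26), "v1.2.1"),
    (date(2024, 11, 7), "v1.4.0"),
    (date(2024, 12, 13), "v1.4.1"),
    (date(2025, 3, 3), "v1.5.0"),
    (date(2025, 4, 3), "v1.5.1"),
    (date(2025, 7, 15), "v1.6.0"),
    (date(2025, 8, 29), "v1.6.1"),
    (date(2025, 11, 6), "v1.6.2"),
    (date(2025, 12, 10), "v1.6.3"),
    (date(2026, 3, 16), "v1.7.0"),
    (date(2026, 3, 27), "v1.7.1"),
    (date(2026, 4, 27), "v1.7.2"),
    (date(2026, 5, 5), "v1.7.3"),
    (date(2026, 6, 15), "v1.8.1"),
]


def pipeline_version_on(run_date):
    """Return the version in use on run_date, or None if it predates the list."""
    starts = [start for start, _ in VERSION_STARTS]
    # Assumption: a version applies from its start date inclusive.
    index = bisect_right(starts, run_date) - 1
    return VERSION_STARTS[index][1] if index >= 0 else None
```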
Pipeline Outputs
See also: Guided walkthrough: Pipeline Outputs
Successful pipeline runs will display the reports in the Pipeline Outputs tab for you to explore. The “all samples” report is shown by default. You can choose to view the reports for individual samples using the dropdown menu at the top of the page.
In the barcode rank plot, you’re looking for a clearly defined ‘knee’ with the threshold in the steepest part of the drop. This threshold is set dynamically and is likely to differ between samples. The different shades denote the minimum, mean and maximum cutoff values. If you hover over the legend box (top right), you can see the actual values.
The QC metrics include the estimated number of cells as well as the median number of genes and transcripts per cell. These metrics can be compared across samples, and can be considered in the context of published data or your previous experiments.
Further metrics are available in the csv file that can be downloaded in the "Combined reports" option.
The plate heatmaps underneath the plots display transcripts and cells per well and are useful for catching pipetting and plate loading errors. Ideally, you'd like to see a homogeneous distribution across the plates with no streaks or outliers.
Pipeline Outputs from immune profiling (BCR or TCR) runs also contain a BCR or TCR tab which provides statistics and graphical reports of the immune run:
At the bottom of the Pipeline Outputs page, the pipeline version used to process your FASTQ files is stated. Further details of the pipeline versions used in Trailmaker can be found in the Pipeline Version section.
Downloading the Pipeline Outputs
The pipeline outputs are available to download from the Pipeline Outputs page.
The available download options for WT or paired immune + WT pipeline runs are:
- The count matrices can be found in the “Unfiltered matrices” and “Filtered matrices” options. These are useful if you choose to perform downstream analysis outside of Trailmaker. Note that the filtered matrices expire 30 days after creation at which point they are no longer available to download.
- Unfiltered matrices: This matrix provides a more inclusive dataset for analysis with minimal initial filtering. Barcodes with fewer than 10 transcripts are filtered out, but no other filtering parameters are applied. The Trailmaker Insights module automatically uses these unfiltered matrices for downstream analysis.
- Filtered matrices: These are further filtered based on the threshold determined from the barcode rank plot in the Pipeline Outputs tab of the Pipeline module. Note that the same threshold is applied in the Data Processing tab of the Trailmaker Insights module. For this reason, it is not recommended to use the filtered matrices for analyses in Trailmaker Insights.
- The “Combined reports” option allows you to download the all_summaries.zip file which contains the html reports, QC metrics (as a csv file) and log files that are output from Parse’s pipeline combine mode.
- The “Sublibrary reports” option allows you to download the html reports, QC metrics (as a csv file) and log files for each independent sublibrary within your pipeline Run.
- The “All files” option contains the full pipeline output, including the alignment BAM files. Downloading the “All files” option might take a long time for large datasets. For users who are comfortable with the command line, the “All files” option can be downloaded by copying the download command. Note that the “All files” download option expires 30 days after the creation of the pipeline outputs following a successful pipeline run, after which this option will no longer appear and the files are no longer available to download.
The available download options for immune profiling (TCR or BCR) pipeline runs are:
- "Unfiltered files" from the immune run.
- "Filtered files" from the immune run, which are filtered based on the barcode rank plot threshold that is selected by the pipeline.
- "Combined reports" contains the all_summaries.zip file with html reports, QC metrics and log files for immune pipeline run.
- "Sublibrary reports" contains html reports, QC metrics and log files for each independent sublibrary from the immune run.
- The "All files" option contains the full immune pipeline output.
Detailed explanation of the pipeline output files is available in this article on our support suite. Our article on Content of FASTQ files and BAM files may also be useful. Ensure you are logged into the support suite to access these articles.
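For users who analyze the downloaded count matrices outside of Trailmaker, the "fewer than 10 transcripts" rule described for the unfiltered matrices can be reproduced with a few lines of Python. The sketch below is illustrative only and is not the pipeline's own code. It assumes the matrix is a Matrix Market coordinate file (optionally gzipped) in which each row is a barcode; check the orientation of your own files before relying on it. The function names are hypothetical.

```python
import gzip
from collections import defaultdict


def transcripts_per_barcode(mtx_path):
    """Sum the counts in each row of a Matrix Market coordinate file.

    Assumption: rows are barcodes and columns are genes.
    """
    opener = gzip.open if str(mtx_path).endswith(".gz") else open
    totals = defaultdict(int)
    with opener(mtx_path, "rt") as handle:
        # Skip the banner and comment lines (they start with '%') and blanks.
        lines = (ln for ln in handle if ln.strip() and not ln.startswith("%"))
        next(lines)  # size line: n_rows n_cols n_entries
        for line in lines:
            row, _col, value = line.split()
            totals[int(row)] += int(float(value))
    return dict(totals)


def keep_barcodes(totals, min_transcripts=10):
    """Return the 1-based row indices with at least `min_transcripts` counts."""
    return sorted(row for row, total in totals.items() if total >= min_transcripts)
```

The returned indices are 1-based, as in the Matrix Market format, so subtract 1 before using them to index rows of a metadata table loaded in Python.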
Failed Pipeline Runs
Failed pipeline runs give the option to download the logs for troubleshooting purposes:
To troubleshoot failed pipeline runs, consult the article on How to troubleshoot Pipeline failures in Trailmaker in the first instance. If your pipeline failure error message is not covered in this article or you need further support, contact us at support@parsebiosciences.com.
Note that Pipeline Runs in Trailmaker that have experienced 2 consecutive failures cannot be re-triggered (i.e. the "Run the Pipeline" button becomes disabled). To initiate a Run that has failed twice, contact us at support@parsebiosciences.com.
Share Pipeline Run Details and Outputs
Within the Pipeline module Run Details page, the 'Share' button enables data sharing between users.
Clicking the 'Share' button opens a modal where you can input the email address of the colleague, collaborator or Parse Biosciences team member with whom you want to share your Pipeline Run. Once the email address is inserted, you can assign the level of permission you are granting to that person, as either owner or explorer.
- Explorers can view the Pipeline Details and Pipeline Outputs, but they cannot make any changes to the files or parameters in the pipeline Run. Explorers cannot initiate a pipeline Run.
- There can be only one owner per Run. The owner has full control over the Run details, data file upload/deletion, as well as running the Pipeline. If you select another user as owner, you will be transferring ownership of that Run to the selected user. In doing so, you will lose all access to the Run.
Note that you can share with multiple other users at once by clicking ‘Enter’ after each email address.
When all email addresses have been inserted and the level of permissions assigned, click ‘Done’.
Owners can revoke access within the same 'Share' modal.
The collaborator(s) will then receive an email indicating that a Pipeline Run has been shared with them. If they already have a Trailmaker account, the Pipeline Run will automatically appear in their account. If they do not already have a Trailmaker account, they will receive an email with the link to sign up. If they sign up using the same email address the Run was shared with, then the Pipeline Run will automatically appear in their account once created.
Note that any downstream analysis linked to this Pipeline Run (the related project in the Insights module) needs to be shared separately. To do this, navigate to the Insights module Project Details page.
See also: How to Share Data in Trailmaker.
Automatic integration of Pipeline & Insights modules
The outputs of successful pipeline runs are automatically sent to the Insights module for downstream analysis and visualization. Simply click the “Go to Insights downstream analysis” button to navigate to the Data Processing tab of the Insights module where you can begin to deep dive into your dataset.
After release of support for paired WT and immune profiling data in Trailmaker Insights module (22nd January 2026) and immune only Insights support (9th March 2026), successful immune pipeline Runs will automatically generate an Insights module project containing both the WT and immune data that will be linked from the Pipeline Outputs page. Note that paired WT+immune Runs that completed prior to this release date will have a linked Insights module project containing ONLY the WT data.
From the Insights module, when a project has been generated automatically from a Pipeline run, you can navigate back to view the pipeline run outputs using the ‘Go to Pipeline Outputs’ button. The details of the related Pipeline Run are provided in the project description.
Note that poor quality samples that contain few cells (< 10 cells with ≥ 30 transcripts) are not sent from the Pipeline Outputs page to the Insights module.
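The threshold above can be expressed as a simple check. This is a minimal sketch of the stated rule (a sample is held back when it has fewer than 10 cells with at least 30 transcripts), not Trailmaker's implementation; the function name and input format are hypothetical.

```python
def sample_is_sent_to_insights(transcripts_per_cell, min_transcripts=30, min_cells=10):
    """Return True if the sample has enough qualifying cells to be sent.

    `transcripts_per_cell` is a list with one transcript count per cell.
    """
    qualifying = sum(1 for count in transcripts_per_cell if count >= min_transcripts)
    return qualifying >= min_cells
```

For example, a sample with nine cells at 30 transcripts and any number of cells below 30 would not be sent, while a sample with ten cells at 30 transcripts would.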
Combining multiple Pipeline Runs into a single Insights Project
The outputs from multiple Pipeline Runs containing WT data can be combined into a single Insights Project by downloading the unfiltered count matrices from the Pipeline Outputs pages for each relevant Run, and then uploading all relevant count matrices to a single Project within the Insights module. For immune Runs without WT parent data, the filtered count matrices should be used.
Insights module
Overview
The Insights module is where processed data can be filtered, integrated and visualized for deep dive exploration and plotting. This module offers advanced filtering and data cleanup, integration of multi-sample datasets, customization of data visualization and clustering, cluster annotation, and plot customization for the generation of publication-ready figures.
Processed data can enter the Insights module directly from a successful Pipeline Run (in the Pipeline module). Alternatively, processed data files such as count matrices, Seurat objects, etc. can be directly uploaded to this module.
The module offers both Seurat and Scanpy data analysis workflows. Note that Scanpy is the default for all Parse Biosciences Evercode WT Mega, WT Mega 384 and immune profiling (TCR or BCR) projects, while Seurat is the default for WT Mini and WT projects. Seurat/Scanpy selection is controlled in Step 6: Data integration.
In the case of immune (either TCR or BCR) datasets with or without a WT parent, Immune Insights projects are generated automatically following a successful run in Trailmaker’s Pipeline module. Alternatively, immune data with or without WT parent data, can be uploaded directly to the Insights module. All paired WT+immune projects that are processed in the Insights module automatically integrate the WT and immune data, giving researchers a complete view of clonal structure and diversity, and gene expression signatures. Clonotypes and cell types can be visualized on an integrated UMAP, dominant clones can be identified quickly in the frequency and honeycomb plots, and conserved patterns can be detected using motif analysis.
Insights Project Details page
See also: Guided walkthrough: Insights module set-up
Selecting ‘Insights’ in the left side navigation sidebar takes you to the Insights module Project Details page.
A list of all available projects in your account is displayed on this page, along with details about the selected project. Among these details are project name, description, sample list, and associated data and metadata. You can edit existing projects and create new projects on this page.
New users will see an empty project list.
Clicking 'Select from Data Repository' takes you to the data repository where you can select to explore one of the public datasets.
Clicking ‘Create new project’, allows you to upload your own data.
Note that successful Pipeline Runs from the Pipeline module automatically send data to the Insights module for downstream analysis and visualization - see Automatic integration of Pipeline and Insights modules.
Exploring a public dataset from the datasets repository
Trailmaker provides a repository of publicly available datasets that you can use to get started with the platform: https://app.trailmaker.parsebiosciences.com/repository
To access the public dataset repository, use the link above or navigate to the Insights module Project Details page (accessible by clicking 'Insights' on the navigation bar), then click ‘Select from Dataset Repository’.
The dataset repository contains ~65 publicly available datasets, totalling >11.5 million cells! You can use these datasets to quickly explore Trailmaker features, validate your findings in an independent single cell dataset, or even integrate them with your own data to increase the power of your analysis.
Use the search bar at the top of the repository to search for key words or specific tissues, fields, species or technologies of interest.
Use the 'Action' buttons to explore or download the selected dataset:
1. Clicking ‘Explore’ on any dataset within the repository opens up a drop-down with two options:
  - View: Viewers can explore all aspects of the project including data processing settings and plots, clusters and UMAPs, differential expression and a variety of plot visualizations. However, Viewers cannot change settings or clusters. You should select this option if you want to quickly explore the platform features and/or project but don't need to make any changes to the project. It's fast to get started but has restricted access.
  - Copy: By creating a copy, you will become the project Owner. Owners have full control over data processing settings, clusters including the generation of custom clusters, and all changes are saved. Note that copying large projects can take some time. You should select this option if you want to make changes to the project, such as to Data Processing parameters or clusters. It's slower to get started (due to files being copied) but has access to all functionality in the platform.
2. Clicking 'Download' initiates the download of a zip file containing the count matrices and any associated cell-level metadata for the selected dataset.
3. Clicking 'Pipeline Summary Report', where present, allows you to explore the Pipeline Summary Report for the selected dataset.
See also: How to explore demo datasets from Trailmaker’s dataset repository.
Uploading your own data
Click ‘Create New Project’ to begin uploading your own dataset to Trailmaker:
Now, you can name your project and add a project description (optional). Note that the project name must be different from other projects in your account.
Projects can be easily renamed in the list of projects by clicking on the Edit button next to the project name. Input the new project name and click save.
To begin uploading data to your new project, select the ‘Add data’ button. Samples that you want to analyze together should be uploaded to a single project. Samples generated using different technologies can be analyzed together in a single project.
The file format requirements for data upload to the Trailmaker Insights module vary depending on the Technology selection.
- For Parse Biosciences Evercode WT data that has already been processed using the Pipeline, you should upload the unfiltered count matrices. These are stored in the pipeline output folder entitled ‘output_combined’. You can drag and drop a single folder. For each sample, Trailmaker will select the unfiltered count matrices stored in ‘DGE_unfiltered’. For each sample, you should have the following 3 files: all_genes.csv; cell_metadata.csv and count_matrix.mtx or DGE.mtx. The files are usually in gzip format i.e. ending in .gz. Note that the all-samples unfiltered count matrices are not currently supported.
- For samples processed using a Parse Biosciences Evercode WT FFPE kit, the file format instructions detailed in the previous bullet point apply. In the case of FFPE samples, ensure the FFPE toggle is set to 'on' when uploading the data files.
- For Parse Biosciences Evercode TCR or BCR data, the required files depend on whether the immune data is being processed with or without WT parent data.
  - For paired WT+immune data, you should upload the unfiltered count matrices for both WT and immune. The WT file requirements are listed in the Evercode WT bullet point above. The immune files are stored in the pipeline outputs folder entitled ‘immune_output_combined’. You can drag and drop a single folder. For each sample, Trailmaker will select the unfiltered count matrices stored in ‘DGE_unfiltered’. For each sample, you should have the following 3 files: clonotype_frequency.tsv, barcode_report.tsv and either bcr_annotation_airr.tsv or tcr_annotation_airr.tsv. The bcr_contigs.fa or tcr_contigs.fa files are not required and will be ignored. Note that the all-samples unfiltered count matrices are not currently supported.
  - For immune data without WT parent data, you should upload the filtered immune data. The immune files are stored in the pipeline outputs folder entitled ‘immune_output_combined’. You can drag and drop a single folder. For each sample, Trailmaker will select the filtered count matrices stored in ‘DGE_filtered’. For each sample, you should have the following 3 files: clonotype_frequency.tsv, barcode_report.tsv and either bcr_annotation_airr.tsv or tcr_annotation_airr.tsv. The bcr_contigs.fa or tcr_contigs.fa files are not required and will be ignored. Note that the all-samples filtered count matrices are not currently supported.
- For 10x Chromium count matrices that are output from Cell Ranger, we recommend uploading the unfiltered count matrices that can be found in the folders entitled ‘raw_bc_matrix’. You should have 3 data files per sample: barcodes.tsv; features.tsv or genes.tsv and matrix.mtx. The files are usually in gzip format i.e. ending in .gz. The files should be uploaded within folders that are named with the sample names.
- For data generated using BD Rhapsody, upload files in the expression_data.st file format. The zip files that are output by the primary processing pipeline contain the .st files that should be uploaded; they must be unzipped first. The folder with Multiplet and Undetermined cells should not be uploaded since it would distort the analysis. Note that AbSeq data is filtered out by default. After uploading your data, you can elect to include AbSeq data by checking the box in the Project Details page. Support for AbSeq is currently for visualization purposes only, as experiment-wide normalization will be slightly skewed. If there is AbSeq data in your experiment, we suggest you create two projects, one including AbSeq data and one without, and compare the results.
- Seurat objects in the .rds file format. There is a size limit of 15GB. If file size is over 15GB, try removing any assays not indicated in the list of requirements below. Ensure the default dimensionality reduction in your Seurat object is named exactly umap or tsne. If the default reduction name includes umap or tsne (e.g., ref.umap), it will be automatically renamed. If the default reduction is different and does not contain these names, the upload will not be successful. The Seurat object must contain the following slots and metadata:
  - scdata$samples: sample assignment. If absent, it will be treated as a single-sample experiment.
  - scdata[['RNA']]@counts: raw feature counts
  - scdata@reductions: contains the embeddings for pca, as well as either umap or tsne
  - Note that cluster metadata in scdata@meta.data is auto-detected
  - Note that sample level metadata in scdata@meta.data that groups samples in scdata$samples is auto-detected for downstream analysis
  - Note that some cell-level metadata columns may not be available in Trailmaker: columns with very high cardinality and where more than one-third of the entries occur fewer than four times are filtered out
- H5 files in the matrix.h5 file format, such as those output from Cell Ranger. [Note that H5AD files are different to H5 files, and are not currently compatible with Trailmaker. Files in .h5ad format can be converted into a Seurat-compatible format, for example following tutorials such as this: https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html]
- For guidance on how to handle other data formats not listed here, see the Unsupported data formats section.
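Before uploading, it can be helpful to confirm that each sample folder holds the expected files. The sketch below encodes the file names listed above for three of the formats and reports anything missing; gzipped versions (ending in .gz) are accepted. It is an illustrative helper, not part of Trailmaker, and the technology keys (`parse_wt`, `parse_immune`, `10x`) are hypothetical labels chosen for this example.

```python
from pathlib import Path

# Each requirement is a tuple of acceptable alternative file names,
# taken from the upload requirements described in this section.
REQUIRED_FILES = {
    "parse_wt": [
        ("all_genes.csv",),
        ("cell_metadata.csv",),
        ("count_matrix.mtx", "DGE.mtx"),
    ],
    "parse_immune": [
        ("clonotype_frequency.tsv",),
        ("barcode_report.tsv",),
        ("bcr_annotation_airr.tsv", "tcr_annotation_airr.tsv"),
    ],
    "10x": [
        ("barcodes.tsv",),
        ("features.tsv", "genes.tsv"),
        ("matrix.mtx",),
    ],
}


def missing_files(sample_dir, technology):
    """List the requirements that no file in `sample_dir` satisfies."""
    present = set()
    for path in Path(sample_dir).iterdir():
        if path.is_file():
            name = path.name
            # Treat 'x.csv.gz' the same as 'x.csv'.
            present.add(name[:-3] if name.endswith(".gz") else name)
    return [
        " or ".join(options)
        for options in REQUIRED_FILES[technology]
        if not any(name in present for name in options)
    ]
```

An empty list means the folder has all three required files for the chosen format.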
For Parse Biosciences Evercode Technology, it is necessary to specify the kit type in the Parse Kit Type section:
When kit type is selected, the relevant instructions for file upload are displayed on the modal.
The kit type is used to optimize some Data Processing settings within the Insights module, such as the calculation of the knee threshold in the cell size distribution filter (filter 2) and the calculation of the doublet scores (filter 5). See Step 5: Doublet filter for more details.
An example screenshot is shown below, with Parse Kit Type selected as 'Evercode WT'. You can proceed with file upload. The required files are listed on screen.
When an Evercode WT kit is selected, the FFPE option is available. The toggle should be set to 'on' if the samples have been processed using an Evercode WT FFPE kit.
A second example screenshot is shown below, with Parse Kit Type selected as 'Evercode TCR'. For immune data, you must select whether you have WT parent data using the toggle button in the Immune Data Type section. Then, you can proceed with file upload. The required files are listed on screen.
To upload data, simply drag and drop it into the box(es) in the File Upload section. You can remove unwanted files using the delete icon that appears next to each file. Select ‘Upload’ to begin uploading the files. Note that multiple samples can easily be uploaded at the same time.
The files will be compressed (if not yet so) before uploading. You can see the status of the upload for each file from the upload bar. Files that are getting compressed appear in orange. Successfully uploaded files appear in green. Files that fail to upload will show in red. Examples of these file upload statuses are shown below:
You can click on the “Uploaded” or “Upload error” text of the specific sample to see the details of the file. In the case of a successful upload, you will be able to download or replace the file. In case of a failed upload, the modal will show options to retry the upload or replace the file.
For WT data uploaded as count matrices, multiple technologies (specifically Parse Biosciences Evercode and 10x Chromium) can be uploaded and integrated in a single project.
To upload data generated using multiple technologies, start by selecting one technology type in the dropdown menu and upload the relevant data files. Then, click 'Add data' from the Project details page again.
On the second visit to the data upload modal, select the second technology type from the dropdown menu and upload the second set of data files. The Project Details page will then display all samples in the 'All' tab of the samples table:
By selecting an individual technology tab, e.g. Parse Evercode WT, you can view the data files and upload status for all samples of the selected technology:
Note that technology is automatically added as sample-level metadata for multi-technology projects. This means that data can be explored and plotted per technology in the other pages within Trailmaker Insights module.
Further information and considerations for analyzing multi-technology projects is available here: How to integrate Parse Evercode and droplet-based data for analysis in Trailmaker.
Once sample files have been uploaded to your project, you can re-order samples in the sample list. Drag the sample to the desired position by using the button (3 lines) on the left of the sample name. The sample order on this page determines the order that samples appear in the other modules of Trailmaker. Sample order can also be changed later in Data Exploration. Sample names can also be changed after upload.
Adding metadata
The addition of metadata is important for multi-sample experiments in order to assign samples to groups. For example, samples within a dataset could be assigned as “control” and “treated”; or “healthy” and “disease”. Assigning metadata will then allow the comparison of groups to determine differentially expressed genes (e.g. to calculate differentially expressed genes in a cluster of interest comparing two groups) and visualization of groups (e.g. a dot plot showing the expression of multiple genes of interest across two or more groups) further downstream in the platform. Samples can be assigned to multiple metadata groups.
Sample-level metadata
Sample-level metadata provides context about the sample as a whole, with details like the biological source (organism, tissue, cell types), experimental conditions (treatment, disease, etc.) or collection datetimes (in time-course studies). Sample-level metadata is usually defined during the experimental design, and known a priori.
Once the samples are uploaded, you can add metadata to the samples by clicking the “Metadata” button, followed by “Sample level” and “Create track”. You will be asked to name the “metadata track”. This results in a column being added to the sample information table. Metadata can be assigned to each sample.
For example, you might label the metadata track “Treatment” and assign each sample as “control” or “treated”. Alternatively, you might label the metadata track “Tissue” and assign each sample as “blood” or “skin”. There is no limit to the number of metadata tracks that you can add to a project.
When a metadata track is created in the Project Details page, the paint roller icon can be used to insert or change the metadata value for multiple samples at the same time. Clicking the paint roller icon (left screenshot below) opens the Fill metadata popup (right screenshot below). In the Fill metadata popup, the group name can be inserted into the first box, and the samples selected in the second box. This tool is particularly useful for assigning sample level metadata to projects with many samples.
Alternatively, clicking “Metadata”, then “Sample level” and “Upload file” allows you to upload sample-level metadata in bulk in the form of a tab-separated (.tsv) file. Note that a .tsv file can easily be exported from Excel or Google Sheets. Note that Mac users of Excel may need to export as .txt and then manually change the extension to .tsv.
Automatic sample-level metadata: In the following cases, Trailmaker will automatically create a column of sample-level metadata:
- Where a project contains a mixture of samples generated using multiple technologies, a "Technology" column is created automatically.
- Where a project contains a mixture of FFPE and non-FFPE samples, an "FFPE" column is created automatically.
Once all the metadata has been inserted, click on “Process project” and confirm by clicking “Yes”. This will launch your data analysis.
Cell-level metadata
Cell-level metadata is specific to individual cells, and usually includes information that is variable from cell to cell, such as the cell type and subtypes, measures for different quality metrics (read counts per cell, number of genes, proportion of mitochondrial genes, etc), information about cell cycle stage, clustering assignment, among others. Cell-level metadata is usually generated by processing the data, by running different types of analyses whose output is a different value for each cell, irrespective of the sample that they belong to.
You can upload cell-level metadata to Trailmaker by clicking on the “Metadata” button followed by “Cell-level”. A modal will open where you can drag and drop your cell-level metadata .tsv file. Note that .tsv file format can easily be exported from Excel.
The file must be structured as a tidy, long format table. That is, each row should contain the information for a single cell, with each variable (cell type, cell subtype, proportion of mitochondrial content) as a column, as shown in the following screenshot.
The ‘barcode’ column is mandatory: it identifies each cell, so that values in the table can be matched to the corresponding cells. In addition, a ‘sample’ column is highly recommended, in case there are duplicate barcodes between samples. If there is no sample column in the uploaded table, and there are duplicate barcodes, no metadata will be added to those cells (because there is no way to determine which cell the data belongs to).
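If you generate cell-level metadata in a script rather than a spreadsheet, the table structure described above can be written with Python's standard csv module. The following is a minimal sketch: only the ‘barcode’ and ‘sample’ column names come from this guide, while the function names, the `cell_type` column and the placeholder barcode values are hypothetical.

```python
import csv
from collections import Counter


def write_cell_metadata(rows, path):
    """Write one row per cell to a tab-separated file.

    `rows` is a list of dicts that all share the same keys.
    """
    fieldnames = list(rows[0].keys())
    if "barcode" not in fieldnames:
        raise ValueError("the 'barcode' column is mandatory")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def ambiguous_barcodes(rows):
    """Return barcodes that repeat when the table has no 'sample' column."""
    if rows and "sample" in rows[0]:
        return []
    counts = Counter(row["barcode"] for row in rows)
    return sorted(barcode for barcode, n in counts.items() if n > 1)


# Placeholder values for illustration only.
example_rows = [
    {"barcode": "bc_0001", "sample": "sample_A", "cell_type": "T cell"},
    {"barcode": "bc_0001", "sample": "sample_B", "cell_type": "B cell"},
]
```

Running `ambiguous_barcodes` on a table without a ‘sample’ column before upload highlights the cells that would otherwise receive no metadata.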
Note that Trailmaker automatically filters out metadata columns that aren't good candidates for cell-level cell sets, meaning that some columns may not appear in downstream analysis modules in the platform. Cell-level metadata columns with very high cardinality (too many unique values) are excluded, as are columns where more than one-third of the entries occur fewer than four times (i.e., too many rare values). If a column is filtered out, the file upload still succeeds, but that column won't be available in the Data Exploration or Plots and Tables modules for creating cell sets, visualising plots or performing differential expression comparisons.
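To anticipate whether a column might be dropped by the rare-values rule, you can estimate the share of rare entries yourself. The sketch below is one possible reading of the rule, not Trailmaker's implementation: it assumes the "one-third" is measured as the fraction of entries (cells) whose value occurs fewer than four times in the column. The cardinality limit is not modeled because its threshold is not stated here. Function names are hypothetical.

```python
from collections import Counter


def rare_value_fraction(values, min_occurrences=4):
    """Fraction of entries whose value occurs fewer than `min_occurrences` times."""
    if not values:
        return 0.0
    counts = Counter(values)
    rare_entries = sum(n for n in counts.values() if n < min_occurrences)
    return rare_entries / len(values)


def fails_rare_value_rule(values):
    """True if more than one-third of entries are rare (assumed interpretation)."""
    return rare_value_fraction(values) > 1 / 3
```

A column with eight "T" entries, four "B" entries and three values that each appear once has a rare fraction of 3/15 and would pass under this reading.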
Note: it is not possible to add cell-level metadata to immune (TCR or BCR) projects without WT parent data.
Launching an Insights project analysis
Clicking on the “Process project” button initiates Data Processing.
The first step of this process is that count matrices are converted into an individual Seurat object for each sample in the project. Sample names and metadata that were input in the Insights module Project Details page are inserted into the Seurat object. The progress of this conversion process is displayed to the user. This step might take some time for large datasets, so you can opt to get notified via email once this step is completed and leave the screen.
If this conversion step fails, you will see an error screen like the one below. You can try to re-run the process, or return to Insights module project details page (click 'Insights') where you can edit the samples or data files in this project or you can choose to launch another analysis.
If this data conversion step completes successfully, Insights module Data Processing will be triggered automatically to run using our automatically determined settings. For more information, see the chapter of this user guide that’s dedicated to the Data Processing tab.
Share a project
Insights module Projects can be shared with colleagues or collaborators using the “Share” button in the Insights Project Details page (accessible by clicking 'Insights').
In the ‘Share’ modal, you can input the email address(es) of the individuals you want to share your Project with. Once the email address is inserted, you can assign the level of permission you are granting to that person, as either owner or explorer.
- Explorers can use Data Exploration and Plots and Tables pages, but will not be able to make any changes to samples or metadata or re-run Data Processing.
- There can be only one owner per Project. The owner has full control over the Project details, data files, samples and metadata, as well as running Data Processing. If you select another user as owner, you will be transferring ownership of that Project to the selected user. In doing so, you will lose all access to the Project.
Note that you can share with multiple other users at once by clicking ‘Enter’ after each email address. When all email addresses have been inserted and the level of permissions assigned, click ‘Done’.
Additionally, you can revoke access to the Project for specific collaborators in the same modal.
If the Explorer wants to create an independent copy of the Project so that they can control the Data Processing settings, they can do this using the ‘Copy’ button in the Insights Project Details page. The Explorer then becomes the Owner of the copied Project. Project owners should keep this functionality in mind when sharing Projects.
Note that any upstream analysis linked to this Insights Project (the related Run in the Pipeline module) needs to be shared separately. To do this, navigate to the Pipeline module Run Details page.
See also: How to Share Data in Trailmaker.
Copy a project
The 'Copy' button in the Insights module Project Details page allows you to quickly and easily create a copy of an existing project. In doing so, you can create multiple versions of analysis of a dataset, for example, to compare different data processing settings side-by-side.
Click ‘Copy’ and a copy of your project automatically appears in your project list.
Downloading data from a project
All uploaded data files can be downloaded from the sample list view by clicking on the green ‘Uploaded’ text for each file.
Additionally, the processed Seurat or Anndata (Scanpy) object can be downloaded using the ‘Download’ button at the top of the Project Details panel. For detailed explanation of the contents and format of the objects, see the following articles:
- Understanding the Seurat object that is downloadable from Trailmaker Insights module
- Understanding the AnnData (Scanpy) object that is downloadable from Trailmaker Insights module
The Insights module Data Processing settings (which contain the values for every parameter) can also be downloaded from here as a text (.txt) file.
Insights module - Data Processing
Overview
Data generated from a single cell RNA-sequencing experiment always requires filtering and cleanup. During data processing, background, dead cells, doublets and poor quality cells are excluded from the downstream analysis. These steps ensure that the processed data are high quality and return accurate results during downstream analysis.
After successfully launching an Insights module project, Trailmaker applies default downstream processing settings in the Data Processing tab to prepare it for analysis and visualization. The automatic default settings that are applied to your dataset enable you to immediately access and explore the first pass of the analysis. All data processing settings can be adjusted to your preferences.
The Insights module Data Processing tab consists of 7 sequential steps. The output of each step in this module becomes the input for the next step. Steps 1-5 consist of filters to remove unwanted and poor quality data from each individual sample. In step 6, multiple sample datasets are integrated to remove batch effects, and dimensionality reduction is performed. Finally, in step 7, the embedding is configured (e.g. UMAP or t-SNE) and clustering is applied.
The filtered, integrated data with clustering is then available for downstream exploration and visualization in the Data Exploration and Plots and Tables tabs within the Insights module.
See also: Guided walkthrough: Insights Data Processing
How it works
For typical projects that contain count matrices generated by the Trailmaker Pipeline module or uploaded directly to the Insights module (e.g. in Parse Biosciences or 10x Genomics format), the data files for each sample are converted into separate Seurat objects. These Seurat objects are processed separately and in parallel for filtering steps 1-5 of Data Processing.
Then, in step 6 (data integration), the multiple Seurat objects are integrated into a single object. Depending on the parameters selected in step 6, the project will have either a Seurat object or an Anndata (Scanpy) object. All downstream analysis, from steps 6 and 7 in Data Processing through to the Data Exploration and Plots and Tables modules, then uses the Seurat or Anndata object. Immune projects utilize Scanpy and a MuData object - see Data Processing for immune repertoire analysis for more details.
See the 'Step 6: Data integration' section below for more details on selecting Seurat or Scanpy options in Trailmaker.
Note that Scanpy is the default for all Parse Biosciences Evercode WT Mega and WT Mega 384 projects as well as all immune projects (TCR or BCR), while Seurat is the default for all other projects.
Automated data processing
Insights module projects undergo downstream processing in the Data Processing tab. Processing is triggered either automatically after a successful Pipeline run, or manually after upload via the ‘Process Project’ button on the Insights module Project Details page. The first run of Data Processing uses default settings to dynamically estimate appropriate thresholds for filtering, and standard settings for integration and clustering.
Some minimal filtering parameters are applied automatically to all projects. Specifically:
- Cells with fewer than 10 features are excluded
- Features that have zero counts are excluded
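The two minimal filters above can be sketched as follows. This is an illustrative sketch only, not Trailmaker's implementation: the function name `minimal_filter`, the nested-dictionary layout, and the order in which the two filters are applied are all assumptions made for the example.

```python
def minimal_filter(counts, min_features=10):
    """Apply the two minimal filters to a toy count table.

    counts: dict mapping barcode -> dict mapping gene -> count
    (hypothetical layout chosen for this example).
    """
    # Keep cells that have at least `min_features` genes with non-zero counts.
    kept_cells = {
        barcode: genes
        for barcode, genes in counts.items()
        if sum(1 for c in genes.values() if c > 0) >= min_features
    }

    # Sum counts per gene across the remaining cells.
    totals = {}
    for genes in kept_cells.values():
        for gene, c in genes.items():
            totals[gene] = totals.get(gene, 0) + c

    # Drop features whose total count is zero (assumed to be evaluated
    # after the cell filter; the guide does not state the order).
    expressed = {gene for gene, total in totals.items() if total > 0}
    return {
        barcode: {g: c for g, c in genes.items() if g in expressed}
        for barcode, genes in kept_cells.items()
    }
```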
The automated default Data Processing values are established according to either current best practice in the field or the spread of the data in each sample. Specific details on the default values for each step in the Data Processing tab are explained fully in the Data Processing Steps section below.
Data Processing status indicator
At the top right of the page in the Data Processing tab, there is a status indicator. When data processing is complete, the status indicator will appear green (screenshot A, below), whilst steps that are in progress appear gray (screenshot B). If data processing fails, the indicator will appear incomplete and marked as failed (screenshot C). The step that is currently being viewed is marked in orange (screenshots A-C).
For more information on what to do if data processing fails for your project, see the Data Processing Failures section below.
Navigating through the data processing steps
Trailmaker has the following steps in the Data Processing module:
1. Classifier filter
2. Cell size distribution filter
3. Mitochondrial content filter
4. Number of genes vs transcripts filter
5. Doublet filter
6. Data integration
7. Configure embedding
You can navigate between these filters using the dropdown menu on the top left of the page or the navigation arrows on the top right of the page.
The dropdown menu and the status bar also show if the step is completed or not. Steps with a check mark (✔) to the left are complete; steps with a cross mark (❌) to the left have failed.
Filtering steps (1-5) can be disabled using the ‘Disable’ button at the top of the page. Filtering steps that are disabled are shown in the dropdown menu with the step name in strikethrough.
In filtering steps 1-5, the samples available within the project are listed vertically with one plot for each sample. You can scroll through the samples easily. Individual sample plots can be minimized by clicking on the sample name above the plot.
Data processing plots and statistics
For the filtering steps (steps 1-5), data is filtered on a per sample basis. A plot is shown for each individual sample within each filter. Samples can be selected to view/hide using the selection box at the top left. Note that hiding samples from the data processing filter view does not exclude them from the analysis - all samples present in the project are included in the analysis.
Below each sample plot, there is a table that describes the filtering statistics for each sample: ‘# before’ describes the number of barcodes present in the sample before the current filtering step; ‘# after’ describes the number of barcodes present in the sample after the current filtering step; ‘% changed’ describes the proportion change in barcode number as a result of the current filtering step.
The number of barcodes shown in the first step refers to the number of barcodes present in each sample after initial filtering in Insights Data Processing where cells with fewer than 10 transcripts are removed. After the first active filter, the number of barcodes refers to the filter currently being applied.
The total number of genes is calculated as the number of genes with non-zero counts across all cells.
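As a small worked example of the statistics table described above, the three values could be derived as in the sketch below. The function name is hypothetical, and the sign convention for ‘% changed’ (negative when barcodes are removed) is an assumption; the guide does not state how the value is signed or rounded.

```python
def filter_stats(n_before, n_after):
    """Summarise one filtering step for one sample."""
    if n_before == 0:
        pct_changed = 0.0
    else:
        # Relative change in barcode number, expressed as a percentage.
        pct_changed = 100.0 * (n_after - n_before) / n_before
    return {
        "# before": n_before,
        "# after": n_after,
        "% changed": round(pct_changed, 2),
    }


# A sample that goes from 5000 to 4500 barcodes loses 10% of its barcodes.
stats = filter_stats(5000, 4500)
```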
An example filtering plot and associated statistics table for a single sample in filtering step 4 is shown below:
Each plot can be fully customized to your design preferences using the ‘Plot styling’ menu.
Data Processing steps
Step 1: Classifier filter
The classifier filter aims to exclude background and retain cells. To achieve this, the filter uses the ‘emptyDrops’ method to calculate the False Discovery Rate (FDR), a statistical value which represents the probability that a droplet is empty (read more about this method here). The default FDR value is 0.01 for all samples. Only data points with FDR < 0.01 are retained. Therefore, in this step, barcodes with low FDR are retained for downstream analysis, whilst barcodes with a high FDR are removed from downstream analysis.
Note that step 1, classifier filter, is not needed for Parse Biosciences data, as the emptyDrops method is only needed for droplet based technologies. You can choose to enable this filter using the Enable button at the top of the page. For 10x Chromium and BD Rhapsody data, this filter is enabled by default. Note that for datasets that have been pre-filtered (e.g. 10x Chromium data that has been filtered in Cell Ranger) the Classifier filter is disabled.
Two plots are provided to visualize the data in the classifier filter. The first plot is a knee plot that is used to determine the FDR threshold for considering cells valid for analysis. The knee plot ranks cells by the number of distinct transcripts for each barcode on a logarithmic scale. Using the log value of the transcripts exposes a “knee” on the graph curve where the number of transcripts decreases sharply. The turning point in the “knee” is usually used as the point to set the FDR threshold. Barcodes with low transcript counts have a higher probability of being background or empty droplets. Therefore, barcodes above the FDR threshold (orange) are filtered out. The cells in the green region have an FDR < 0.01 and are retained. The gray “mixed” region contains some cells that are retained and some cells that are filtered out.
The second plot is an “empty drops plot”. This is an alternative visualization of the data which plots the number of transcripts against the probability that the cell is a real data point or background. The red line shows the threshold value that is set to filter the cells. Cells below this red line are retained, while cells above the red line are excluded from downstream analysis.
The default FDR value is set to 0.01 in this filtering step. This is the standard threshold used for the emptyDrops method. Although it is possible to override this filter threshold, we do not recommend that you do so.
Step 2: Cell size distribution filter
The cell size distribution filter can be used to fine-tune the classifier filter, by further discarding background from your dataset. For Parse Biosciences data, this is the main filter that is used to exclude background. For other data types (including 10x Chromium and BD Rhapsody), this filter is disabled by default though you can choose to enable it. You will then need to re-run Data Processing in order to apply the changes.
Unlike the classifier filter which works on probability, this filter sets a hard threshold on the minimum number of transcripts. Cells with transcript counts lower than this threshold are filtered out.
The data for this filter is visualized as a knee plot. The plot ranks cells according to the number of transcripts on a logarithmic scale. The inflection point around the “knee” signifies the threshold at which the number of transcripts in a cell changes drastically. Note that the cell rank on the x-axis is on a logarithmic scale, which means the area under the curve does not proportionally represent the number of cells that are filtered / unfiltered.
The second plot view in this filter is a histogram that shows the number of cells that are affected by the cell size distribution filter. This histogram visualizes cells below (orange) and above (green) the set threshold. The orange cells are filtered out of the dataset whereas the green cells are retained. If the histogram plot shows a bimodal distribution then consider switching on this filter. For example, in the histogram plot below the cells identified in orange may in fact be background and you should consider filtering them out.
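The hard threshold and the knee-plot ranking described in this step can be sketched as below. The function names and the example threshold are hypothetical; the way Trailmaker estimates the default threshold is not shown here.

```python
def knee_curve(transcripts_per_cell):
    """Return (rank, transcript count) pairs, largest cell first.

    Plotting both values on logarithmic axes gives a knee plot.
    """
    ordered = sorted(transcripts_per_cell, reverse=True)
    return list(enumerate(ordered, start=1))


def cell_size_filter(transcripts_per_cell, min_transcripts):
    """Flag cells to keep: those at or above the minimum transcript count."""
    return [n >= min_transcripts for n in transcripts_per_cell]


counts = [12000, 30, 8000, 45, 9500, 20]
# The threshold of 1000 is an arbitrary value chosen for illustration.
keep = cell_size_filter(counts, min_transcripts=1000)
```

In this toy sample the three cells with thousands of transcripts are retained and the three barcodes with a few dozen transcripts are treated as background.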
Step 3: Mitochondrial content filter
Cells may be alive, dead or dying. The mitochondria of dead and dying cells rupture, spilling out transcripts of mitochondrial genes into the cell. The presence of these mitochondrial gene sequences can skew the analysis results, as transcripts from live cells are often of interest rather than transcripts of dead cells. Thus, for most datasets it is advisable to filter out dead cells from the analysis.
The mitochondrial content filter removes dead and dying cells by looking at the percentage of mitochondrial transcripts and setting an appropriate threshold. Cells with mitochondrial content higher than the threshold are removed from downstream analysis.
The default threshold for the proportion of mitochondrial genes is calculated per sample. The typical cut-off range is 5-30% of mitochondrial reads per cell, with the default cut-off in Trailmaker determined as 3 median absolute deviations above the median.
Two plot views are available in this filter. The first plot is a histogram which shows percentages of mitochondrial reads and their corresponding proportions of cells. The percentage of mitochondrial reads is the percentage of transcripts mapped to mitochondrial genes out of the total number of transcripts. Dead cells (blue) are filtered out and live cells (green) are retained.
The second plot is a scatter plot which shows the total number of transcripts in each cell plotted against the percentage of mitochondrial reads. Each dot in this plot is an individual cell. As in the previous plot, the dead cells are filtered out (blue) and live cells (green) are retained.
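The per-sample default described above (3 median absolute deviations above the median) can be illustrated with the sketch below. The function names are hypothetical. The `scale` parameter is an assumption: R's `mad` function multiplies by a constant of 1.4826 by default, and this guide does not state whether Trailmaker uses the scaled or the raw value, so the raw value is used here.

```python
from statistics import median


def mito_threshold(pct_mito, n_mads=3.0, scale=1.0):
    """Median + n_mads * MAD of the per-cell mitochondrial percentages."""
    med = median(pct_mito)
    mad = scale * median(abs(x - med) for x in pct_mito)
    return med + n_mads * mad


def mito_filter(pct_mito, threshold):
    """Keep cells whose mitochondrial percentage does not exceed the threshold."""
    return [x <= threshold for x in pct_mito]


pct = [2, 3, 4, 5, 6, 40]
threshold = mito_threshold(pct)  # median 4.5, MAD 1.5, so 4.5 + 3 * 1.5 = 9.0
keep = mito_filter(pct, threshold)
```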
Step 4: Number of genes vs transcripts filter
The number of genes vs transcripts filter works on the principle that the number of unique transcripts increases with the number of genes. Outliers typically fall into one of two categories:
(1) Cells contain a lot of genes but few transcripts. This means transcripts are not amplified well.
(2) Cells contain few genes but a lot of transcripts. This means that the few transcripts that exist are over-amplified.
This filter visualizes the data using a scatter plot to map the number of gene counts on a logarithmic scale against the number of transcripts on a logarithmic scale. The range of acceptable data points is defined with 2 thresholds, signified by red lines. Cells not located between the two red lines are considered outliers and are filtered out.
For Parse data, a third order spline (a cubic polynomial) model `splines::bs` is applied to the data by default ('spline' option in the menu controls), whereas for other technology types a linear fit model `MASS::rlm` is applied by default.
The ‘prediction interval’ setting controls the stringency for defining outliers: it is passed as the `level` argument of the R `predict` function when the prediction intervals are calculated (`level = prediction interval`). A prediction interval represents the probability that a predicted value will fall between its upper and lower limits. Prediction intervals are similar to confidence intervals, but on top of the sampling uncertainty they also express the uncertainty around a single value: they must account for the uncertainty in estimating the population mean, plus the random variation of the individual values. A higher prediction interval means a higher probability that a value lies inside the range, so the interval is wider. The higher the prediction level, the less stringent the filtering. Conversely, the lower the prediction level, the more stringent the filtering, and more cells that deviate from the fitted relationship between the number of genes and the number of transcripts are excluded.
The scatter plot is interactive - moving the red prediction interval lines will help you to choose the most appropriate value to filter cells in your samples. To do so, override the automatic settings as described in the ‘Adjusting a data processing setting’ section, and use the prediction interval slider to choose your preferred values.
If one or more samples in your dataset contains a separate population of cells in this filter plot, such as in the example plot above, then we recommend further investigating the population to determine if it should be excluded or retained. One way to do this is to disable the filter (using the ‘Disable’ button at the top of the page) which will retain all cells for downstream analysis, and allow you to further investigate the secondary population in downstream modules in Trailmaker.
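To make the effect of the prediction interval concrete, the sketch below fits a straight line to log10(genes) against log10(transcripts) and keeps cells that fall inside the prediction band. It deliberately simplifies the approach described in this step: it uses ordinary least squares rather than `MASS::rlm` or a spline, and a normal quantile rather than the t-distribution used by R's `predict`. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean


def genes_vs_transcripts_filter(n_transcripts, n_genes, level):
    """Return keep flags for cells inside a two-sided prediction band."""
    x = [math.log10(t) for t in n_transcripts]
    y = [math.log10(g) for g in n_genes]
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)

    # Ordinary least squares fit of y on x (assumption: not a robust fit).
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))

    # Two-sided quantile for the requested level (normal approximation).
    z = NormalDist().inv_cdf(0.5 + level / 2)

    keep = []
    for xi, ri in zip(x, residuals):
        # Prediction band: residual noise plus uncertainty in the fitted line.
        half_width = z * sigma * math.sqrt(1 + 1 / n + (xi - x_bar) ** 2 / sxx)
        keep.append(abs(ri) <= half_width)
    return keep
```

Raising `level` widens the band, so fewer cells are flagged as outliers; lowering it narrows the band and removes more cells.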
Step 5: Doublet filter
Doublets contain the content of multiple cells which can lead to skewed data and false conclusions, especially concerning cellular heterogeneity and identity. There are several reasons for doublet occurrence, which can vary according to the technology type.
The doublet filter calculates the doublet probability for all cells and filters out cells with a high probability of being a doublet. Calculation of the probability is carried out using the scDblFinder algorithm (a detailed explanation of this method can be found here).
The doublet score calculation incorporates the expected doublet rate. Trailmaker uses the following expected doublet rate values for Parse Biosciences samples: 0.046 for WT Mini, TCR Mini and BCR Mini kits; 0.034 for WT, TCR and BCR kits; 0.064 for WT Mega, WT Mega 384, TCR Mega and BCR Mega kits. For 10x Chromium samples in Trailmaker, the doublet filter uses the default scDblFinder expected doublet rate of 1% per 1000 cell captures.
This filter sets a hard threshold above which cells are filtered out. This threshold is marked by the red line in the provided plot for this filter. The plot shows the proportions of cells and their corresponding probabilities of being doublets.
For samples that contain few cells, the calculation of doublet score probabilities has less power and, therefore, tends to show more cells with an intermediate score between 0.2 and 0.8. Care should be taken to check the doublet filter threshold for samples with few cells.
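The expected doublet rates listed above, and the hard threshold on the doublet score, can be summarised in a short sketch. The lookup table copies the values stated in this section; the function names, the `technology` labels and the example threshold are hypothetical, and the doublet scores themselves would come from scDblFinder rather than from this code.

```python
# Expected doublet rates for Parse Biosciences kits, as listed in this guide.
PARSE_EXPECTED_DOUBLET_RATE = {
    "WT Mini": 0.046, "TCR Mini": 0.046, "BCR Mini": 0.046,
    "WT": 0.034, "TCR": 0.034, "BCR": 0.034,
    "WT Mega": 0.064, "WT Mega 384": 0.064,
    "TCR Mega": 0.064, "BCR Mega": 0.064,
}


def expected_doublet_rate(technology, kit=None, n_cells=None):
    """Return the expected doublet rate for a sample."""
    if technology == "parse":
        return PARSE_EXPECTED_DOUBLET_RATE[kit]
    # 10x Chromium: 1% per 1000 captured cells.
    return 0.01 * n_cells / 1000


def doublet_filter(scores, threshold):
    """Keep cells whose doublet score is not above the threshold."""
    return [s <= threshold for s in scores]


rate_mega = expected_doublet_rate("parse", kit="WT Mega")
rate_10x = expected_doublet_rate("10x", n_cells=5000)  # about 5%
keep = doublet_filter([0.1, 0.5, 0.9], threshold=0.5)
```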
Step 6: Data integration
The Data Integration step removes batch effects and reduces the dimensionality of the data.
Batch effects are variations caused by differences in experimental conditions, introducing noise which skews the true variation for a sample. Runs of different samples have different values of noise. Hence, comparing these samples directly without addressing batch effects would compound the noise. Removing batch effects enables comparison and composition of samples analyzed in different runs with minimized error. In essence, batch effect correction ensures that downstream analysis focuses on real biological differences between samples, rather than irrelevant sample-to-sample or batch-to-batch variation.
At this step, in some cases you can select either Seurat or Scanpy methods for data integration. Note the following:
- Scanpy is the default for all Parse Biosciences Evercode WT Mega and WT Mega 384 projects, while Seurat is the default for WT Mini and WT projects
- Scanpy is both the default and the only option for immune (TCR or BCR) projects
If you select Seurat, a Seurat object will be generated and utilized for all downstream analysis in Trailmaker. If you select Scanpy, an Anndata object will be generated and utilized for all downstream analysis in Trailmaker. For immune (TCR or BCR) projects, only the Scanpy option is available, with use of MuData (plus Anndata in the case of paired WT+immune) objects on the backend for downstream analysis. Your selection also dictates which type of object is available to download from the Insights Project Details page.
The integration methods available are as follows:
- In Seurat, three data integration methods are available – Harmony, Fast MNN, and Seurat v4. You can also select ‘No integration’.
- In Scanpy, only the Harmony integration method is currently available. You can also select ‘No integration’.
Harmony is selected as default. However, you can select the integration method and set the controls based on your requirements.
RPCA vs CCA (Seurat v4 only): When using Seurat v4 for data integration, anchors are identified using a dimensionality reduction approach. Two methods are available: RPCA (Reciprocal PCA) and CCA (Canonical Correlation Analysis).
- RPCA is selected by default. This method projects each dataset into the PCA space of the other datasets and is designed to be robust to noise and outliers. RPCA provides a more conservative integration, where cells in different biological states are less likely to be incorrectly aligned.
- CCA identifies shared patterns of variation across datasets by finding correlated gene expression structures. This approach can improve alignment when datasets share similar cell types and biological signals.
Normalization is applied to each sample before integration. There are several methods to achieve normalization; the default method in Trailmaker is LogNormalize. The SCTransform method of normalization claims to recover sharper biological distinction compared to log-normalization. SCTransform can only be applied when the integration method is set to Seurat v4. SCTransform applies additional filtering of genes that is more strict than using the LogNormalize method. For example, when using SCTransform, only the genes that are expressed in at least 5 cells are retained, and further genes may be filtered out based on their variance. These filters are applied internally using the sctransform::vst function, which Seurat calls under the hood. As a result, the number of genes that are shown in the Gene list in the Data Exploration page may be lower than the total number of genes originally detected in your dataset. Further information on SCTransform is available here.
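For readers unfamiliar with log-normalization, the sketch below shows the calculation that Seurat's LogNormalize method performs for a single cell: each count is divided by the cell's total counts, multiplied by a scale factor, and transformed with log1p. The scale factor of 10,000 is Seurat's default; whether Trailmaker uses the same value is an assumption, and the function name is hypothetical.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Log-normalize the counts of one cell (dict mapping gene -> count)."""
    total = sum(cell_counts.values())
    if total == 0:
        # A cell with no counts has nothing to normalize.
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }


# Gene A has 1 of 4 counts: log1p(0.25 * 10000) = ln(2501)
normalized = log_normalize({"A": 1, "B": 3})
```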
Dimensionality reduction reduces the complexity of the dataset while preserving variation. In essence, dimensionality reduction “compresses” the data to enable visualization in 2 dimensions. There are many methods of dimensionality reduction, but one of the most popular in the field is Principal Component Analysis (PCA). This method introduces principal components (PCs) - a linear combination of variables in the data that better explain variations. The largest variance is accounted for by the first PC, the second largest variance by the second PC, and so on.
PCA is great for high dimensional data, but it is not optimized to generate 2-dimensional embedding. In practice, PCA is used to reduce the raw data into a lower dimension, acting as a pre-processing step. The resulting data is fed into other dimensional reduction algorithms, such as UMAP or t-SNE, to reduce the data into 2 dimensions.
Normalization can be biased by certain gene categories, such as ribosomal, mitochondrial and cell cycle genes. In the data integration step, these three gene categories can be excluded from the analysis if you are working with human or mouse data. For example, cell cycle genes should be removed if sampling timepoints occurred throughout the day. Those genes can otherwise introduce within-cell-type heterogeneity that can obscure the differences in expression between cell types. To mitigate this, cell cycle genes can be excluded from the analysis of human and mouse species under ‘Dimensionality reduction settings’.
- Ribosomal genes are excluded based on the selection of genes that contain “rps”, “rpl”, “mrps” or “mrpl”, as well as the following three specific genes: FAU, UBA52 and DAP3. If this gene nomenclature is not true for your species, this feature will not work and should not be used.
- Mitochondrial genes are excluded based on the selection of genes that start with "mt-" or "MT-" (case insensitive). If this gene nomenclature is not true for your species, this feature will not work and should not be used.
- Cell cycle genes can be excluded from human and mouse datasets only. Trailmaker uses the list of cell cycle genes reported in the following article: Tirosh et al. “Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq.” Science (New York, N.Y.) vol. 352,6282 (2016): 189-96. doi:10.1126/science.aad0501. If you are using a species other than human or mouse, this feature will not work and should not be used.
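The name-based rules for ribosomal and mitochondrial genes above can be expressed as a short sketch. The function names are hypothetical, and treating the ribosomal substrings as case insensitive is an assumption; the guide only states case insensitivity explicitly for the mitochondrial prefix.

```python
RIBO_SUBSTRINGS = ("rps", "rpl", "mrps", "mrpl")
RIBO_EXTRA = {"FAU", "UBA52", "DAP3"}


def is_ribosomal(gene):
    """True if the name contains a ribosomal substring or is a listed gene."""
    name = gene.lower()
    return gene.upper() in RIBO_EXTRA or any(s in name for s in RIBO_SUBSTRINGS)


def is_mitochondrial(gene):
    """True if the name starts with "mt-" in any letter case."""
    return gene.lower().startswith("mt-")


genes = ["RPS6", "Rpl13a", "MRPL12", "FAU", "MT-CO1", "mt-Nd1", "ACTB", "MTOR"]
kept = [g for g in genes if not (is_ribosomal(g) or is_mitochondrial(g))]
```

Only ACTB and MTOR remain in `kept`. Note that under a literal "contains" reading, a non-ribosomal gene such as TRPS1 would also match the "rps" pattern; whether Trailmaker anchors its patterns to avoid this is not stated in this guide.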
There are 3 plot views available in the data integration step. To change the plot type, select the desired plot under the plot view menu.
The first plot is a preview of the embedding generated after the dimensional reduction. This plot is only available for multi-sample datasets, and is not available in projects with only one sample.
The second plot is a frequency plot which shows the contribution of each sample to each cluster. This plot is also only available in projects with multiple samples.
These first two plot views allow you to assess the quality of the integration of multi-sample datasets. Well integrated datasets will display good distribution of each sample across all clusters.
The third plot is an elbow plot which maps the percentage contribution of each Principal Component (PC) to the total variation in the dataset. The default number of PCs is defined by Trailmaker as the number of PCs that explains 85% of the variation if that number is less than 30, or 30 PCs otherwise.
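The 85% / 30 PC rule described above can be sketched as a cumulative sum with a cap. The function name and argument names are hypothetical, and the input is assumed to be the per-PC percentage of variation in decreasing order.

```python
def default_n_pcs(explained_pct, target_pct=85.0, max_pcs=30):
    """Smallest number of PCs reaching the target, capped at max_pcs."""
    cumulative = 0.0
    for n, pct in enumerate(explained_pct[:max_pcs], start=1):
        cumulative += pct
        if cumulative >= target_pct:
            return n
    # The target was not reached within the cap: fall back to the cap
    # (or to the number of available PCs, if smaller).
    return min(max_pcs, len(explained_pct))


n_steep = default_n_pcs([50, 20, 10, 6, 4, 3])  # 50 + 20 + 10 + 6 = 86, so 4 PCs
n_flat = default_n_pcs([2.0] * 50)              # 30 PCs explain only 60%, so 30
```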
Note that the reported percentage of variation explained may appear lower than what you may have seen with Seurat-based workflows. This is expected, and applies to both Scanpy and Seurat projects in Trailmaker, as we use the Scanpy approach consistently across all pipelines for variance calculation.
Scanpy (which uses scikit-learn under the hood) calculates the proportion of variance explained relative to the total variance in the input data (not just the selected PCs), which is typically the variance of the top selected highly variable genes. Seurat, by contrast, typically calculates the proportion of variance explained relative to the total variance captured by the selected PCs themselves, making Seurat-style values typically higher, because their denominator is the sum of eigenvalues of the computed PCs rather than the total variance of the full input matrix. In Trailmaker, we use the Scanpy approach across both Scanpy and Seurat projects. Therefore, the per-PC percentages reported here will be lower than you may expect from prior Seurat experience, even when the underlying PCA is capturing the same structure in the data.
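The difference between the two denominators described above is easiest to see with numbers. The sketch below uses made-up variances for four PCs and a made-up total input variance; the function name is hypothetical and the values are for illustration only.

```python
def variance_ratios(pc_variances, total_input_variance):
    """Return per-PC variance ratios under the two conventions.

    First list: relative to the total variance of the input matrix
    (the convention this guide attributes to Scanpy / scikit-learn).
    Second list: relative to the sum over the computed PCs only
    (the convention this guide attributes to Seurat-style workflows).
    """
    relative_to_input = [v / total_input_variance for v in pc_variances]
    pcs_total = sum(pc_variances)
    relative_to_pcs = [v / pcs_total for v in pc_variances]
    return relative_to_input, relative_to_pcs


# Four computed PCs capture 16 of 40 units of total input variance.
to_input, to_pcs = variance_ratios([8.0, 4.0, 2.0, 2.0], 40.0)
# First PC: 8 / 40 = 20% of the input variance, but 8 / 16 = 50% of the
# variance captured by the computed PCs.
```

The underlying PCA is identical in both cases; only the denominator changes, which is why the percentages reported in Trailmaker look lower.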
To determine the initial number of PCs to use for downstream analysis, we apply the elbow method, which identifies the point at which adding more PCs yields diminishing returns in explained variance. For Scanpy projects, this is done using KneeLocator from the kneed library; for Seurat projects, we use findElbowPoint from the PCAtools package. For further reference on variance explained, see the scikit-learn PCA documentation, particularly the explanation of the n_components parameter and the explained_variance_ratio_ output.
Downsampling
Large datasets (e.g. >100,000 cells) can be downsampled specifically for the integration step. This speeds up the time it takes to integrate large datasets using some methods (especially Seurat v4 and Fast MNN) and enables very large datasets to be processed successfully. Once the data are integrated, the full data are available for downstream analysis and visualization.
Geometric sketching finds random subsamples of a dataset that preserve the underlying geometry, which is described in this paper: Geometric sketching compactly summarizes the single-cell transcriptomic landscape. In short, geometric sketching divides the transcriptional space into equal-sized hypercubes and then randomly samples the same number of cells from each of the cubes; the resulting sketches preserve the data structure and put more emphasis on small and underrepresented cell types, leading to improvements even over using the whole dataset.
You can downsample your data under Downsampling Options.
Then change the Method to Geometric sketching. If you wish, you can also change the percentage of cells to keep.
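The idea behind geometric sketching can be illustrated with a much simplified grid-based sampler. This is not the published algorithm and not Trailmaker's implementation: it bins cells into fixed-size boxes and draws from the occupied boxes in turn, so that dense regions do not dominate the sample. The function name, the `box_size` parameter and the toy coordinates are all assumptions made for the example.

```python
import random
from collections import defaultdict


def grid_sketch(points, n_keep, box_size, seed=0):
    """Return sorted indices of a sample that covers occupied boxes evenly."""
    rng = random.Random(seed)

    # Assign every cell to the grid box that contains it.
    boxes = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(coord // box_size) for coord in point)
        boxes[key].append(idx)
    for members in boxes.values():
        rng.shuffle(members)

    # Take one random cell per occupied box per pass until enough are chosen.
    chosen = []
    while len(chosen) < n_keep and boxes:
        for key in list(boxes):
            chosen.append(boxes[key].pop())
            if not boxes[key]:
                del boxes[key]
            if len(chosen) == n_keep:
                break
    return sorted(chosen)


# 1000 cells in one dense region and 5 cells of a rare type elsewhere.
dense = [(0.1 + 0.0005 * i, 0.2 + 0.0005 * i) for i in range(1000)]
rare = [(5.5, 5.5 + 0.01 * j) for j in range(5)]
sample = grid_sketch(dense + rare, n_keep=10, box_size=1.0)
```

In this toy example all 5 rare cells are retained in a sample of 10, whereas uniform random sampling of 10 out of 1005 cells would most likely miss them entirely.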
Step 7: Configure embedding
In the last step of the Data Processing module, integrated data is further reduced into a 2-dimensional embedding. An embedding is a space which allows for the translation of data of a high dimension into a low dimensional space. High dimensional data represents a data set where the number of features is higher than the number of samples. The low dimensional space should represent the meaningful properties of the high dimensional data.
Trailmaker provides two methods to visualize embedding: UMAP and t-SNE. UMAP is a more recent technique with an algorithm that is more readily adjustable to parallelization and works faster than t-SNE. Hence, UMAP scales better for large datasets compared to t-SNE. Generally, it is recommended to use UMAP embedding to visualize your data.
After creating the embedding, the embedded data points are clustered and colored according to those cluster annotations. Clustering is the process of grouping cells of high similarity. There are several clustering methods available, but the most widely used are the Louvain and Leiden methods. Trailmaker uses the Leiden clustering method by default. Clusters are color-coded and numbered so they can be identified and explored in downstream analysis in Trailmaker.
The clustering result can be modified by adjusting the clustering resolution in the clustering settings menu in step 7 of the Data Processing tab within the Insights module. The desired clustering resolution will depend on your sample preparation and research question. The embedding and clustering results that are produced in this step propagate to the Data Exploration and Plots & Tables modules of Trailmaker.
There are two plot views available in step 7 where the embedding and clustering settings are configured, as well as multiple metrics that can be viewed on the plots. To change the type of the embedding plot, select the desired plot under the plot view menu.
Selecting ‘Embedding’ shows a UMAP embedding by default, showing cells from all samples clustered and colored according to the Leiden clustering algorithm. The plot view can be changed to any of the options in the ‘Colour plot by’ menu, such as Number of genes.
This plot can be useful for identifying clusters or areas in the embedding plot that have particularly high or low metrics. For example, if high mitochondrial content is concentrated in one cluster, this would suggest that there is a population of dead cells that is clustered together. In this case, we recommend returning to the mitochondrial content filter (step 3) and reducing the threshold of the percentage of mitochondrial reads to try to remove the cluster.
Selecting ‘Violin’ switches the plot to a violin plot. This plot is particularly useful for visualizing the quality control metrics, such as number of genes, across the samples in your dataset:
Data Processing for immune repertoire analysis
Paired WT+immune projects
The Data Processing page is enabled for paired WT and immune (WT+TCR or WT+BCR) projects.
Within Data Processing steps 1-5, the WT data are filtered according to the default or user-selected filtering thresholds. After WT data filtering is complete (at the end of step 5), the immune data are filtered to match the barcodes present in the WT data. This immune filtering step is not visible on the user interface. Any failures of Data Processing at the immune filtering step will appear on the user interface as a failure in step 6 (Data integration) or step 7 (Configure embedding). In this case, reach out to support@parsebiosciences.com for help.
The functionality of Step 6: Data integration and Step 7: Configure embedding is unaffected by the inclusion of immune data. Note that Scanpy is the default and only option for immune repertoire analysis in Trailmaker.
At the end of Data Processing, paired WT+immune projects contain a MuData object with matched WT+immune data, for downstream analysis and visualization in Data Exploration and Plots & Tables pages.
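The immune filtering described above, where immune data are restricted to the barcodes that survive WT filtering, amounts to a barcode intersection. The sketch below illustrates the idea with plain dictionaries; the function name, the record layout and the barcode strings are hypothetical, and Trailmaker performs this step on MuData objects on the backend.

```python
def match_immune_to_wt(wt_barcodes_after_qc, immune_records):
    """Keep only immune records whose barcode passed WT filtering.

    immune_records: dict mapping barcode -> immune annotation
    (hypothetical layout chosen for this example).
    """
    kept = set(wt_barcodes_after_qc)
    return {bc: rec for bc, rec in immune_records.items() if bc in kept}


wt_kept = ["bc_01", "bc_02", "bc_04"]
immune = {"bc_01": "clonotype_a", "bc_03": "clonotype_b", "bc_04": "clonotype_a"}
matched = match_immune_to_wt(wt_kept, immune)
```

Here `bc_03` is dropped from the immune data because it was removed during WT filtering, while `bc_02` simply has no immune record.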
Immune only projects
Insights Projects that contain immune only (TCR or BCR) data without a WT parent contain the filtered count matrices that are output from Trailmaker’s Pipeline module. No further filtering is performed on the immune data within the Insights module, and the Data Processing page is disabled.
Adjusting a data processing setting
The default settings for each data processing step can be overridden using the ‘manual’ button in the Filtering Settings menu that is available within each step. For filtering steps 1-5, the setting can be altered for one specific sample or the adjusted setting can be applied to all samples using the ‘Copy to all samples’ button.
Specific filtering steps (steps 1-5) can be disabled using the ‘Disable’ button at the top of the filter. Using this button will switch off the selected filter so that no cells are filtered out at this step, and all cells are taken forward to the next step of data processing.
When a data processing setting is adjusted or a filter is disabled or enabled, you will be prompted to re-run Data Processing. You can elect to ‘Run’ data processing or ‘Discard’ the changes. When the re-run is initiated, only the steps that have adjusted settings will be re-run. Note that re-running Data Processing is likely to take several minutes, with the exact time dependent on the size of your dataset.
Data Processing failures
It’s important to note that the data processing steps may fail for a variety of reasons. One major reason for data processing failures is if there is not enough data to be processed. Data Processing steps 1-5 can fail if the number of cells is very low, and the data integration step (step 6) can fail if the number of cells is lower than 100.
Guidance on how to troubleshoot and resolve data processing failures is available in the following article: How to adjust data processing settings to fit your dataset and troubleshoot data processing failures.
Briefly, there are several ways that you can address data processing failures:
- Reduce the number of cells being filtered out: You can alter the filtering settings in steps 1-5 (e.g. by lowering thresholds) or disable specific filters to reduce the number of cells that are filtered out which will, therefore, increase the number of cells proceeding to downstream processing and analysis. However, this can lead to poor quality cells being included in the downstream analysis.
- Elect to use no integration: You can select ‘No integration’ in the data integration settings in step 6. However, selecting this option may result in suboptimal analysis due to batch effects that have not been removed from the analysis. If you choose to select ‘No integration’ we recommend that integration quality is checked using the embedding and frequency plots in step 6.
- Exclude the sample(s) with too few cells: It is possible to exclude the problem sample(s) that have too few cells. This can be done in the existing project by setting a high threshold in one of the filtering steps in order to filter out 100% of the cells in that sample. Alternatively, you could delete the problem sample(s) in the Insights module Project Details page in the existing project, or create a new project and upload only the other samples, excluding the sample(s) with few cells.
If data processing fails for your project and cannot be resolved by following the guidance in this support article, contact us at support@parsebiosciences.com.
Saving a processed project
Whilst the project is being processed, you can leave the screen, log out of Trailmaker and close your web browser without affecting the processing - Data Processing will continue to run. You can elect to receive an email confirmation when your Data Processing completes using the ‘Receive email notifications’ toggle button when you first process a project.
The processed project is saved automatically by Trailmaker. When you log out and then return to the platform, you can immediately view the processed Insights project.
Exporting the data processing plots
All plots that are available in the data processing module can be fully customized using the ‘Plot styling’ menu. You may want to consider including these quality control plots in your manuscript as evidence of sample quality.
To download a plot from the data processing module, select the ‘...’ menu on the top right of the plot. Download options include SVG (high resolution) and PNG (lower resolution).
Downloading the data processing settings
All data processing settings can be downloaded as a text file (.txt) using the ‘Download’ button in the Project Details page of the Insights module.
We recommend that you report these data processing settings (filtering thresholds, integration method, etc.) in your manuscripts. We can assist with writing this paragraph, if needed - contact us at support@parsebiosciences.com.
Summary of Insights Data Processing tab
The Data Processing tab within the Insights module of Trailmaker filters out background, dead cells, doublets and low-quality cells, removes batch effects and reduces the dimensionality of the data. Data processing is an essential prerequisite for downstream data analysis and visualization.
Data Exploration
Overview
The Data Exploration tab within the Insights module of Trailmaker has a wide variety of features for in-depth exploration of your data. Using this module, users can identify which cell types are represented by their cell sets, fully customize cell set selection, and generate insight into the dataset using gene expression visualization and differential expression.
Custom cell sets can be created using selection tools, based on the expression of one or more genes, or by manipulating the default Leiden or Louvain clusters. It's easy to rename clusters or recolor by sample, metadata, or gene. Standard analysis actions such as marker heatmap and UMAP are pre-loaded. Cell set annotation can be done automatically, or manually using the marker heatmap and differential expression features.
Users can calculate differential expression between cell sets within a sample/group or compare a cell set between samples and groups. Differential expression results can be filtered further, for example, by selecting only upregulated genes. Users can perform pathway analysis on the list of differentially expressed genes using external services - Pantherdb or Enrichr.
See also: Guided walkthrough: Insights Data Exploration
Navigation
The Data Exploration tab consists of several tiles. On the top left, we have the UMAP embedding that was created and customized in step 7 of the Data Processing (1). In the middle, we have the list of default clusters (Leiden or Louvain), any custom cell sets that have been created, as well as the list of samples and metadata (2). On the right, the gene list shows the full list of genes present in the dataset ordered by dispersion (3). Dispersion is a measure of variability, so some of the most variable genes in the dataset are listed at the top of the gene list. At the bottom, the heatmap shows marker genes for your selected clustering method (Leiden or Louvain) (4).
The width and height of different tiles can be changed to suit your preference. The tiles can also be moved around using the moving arrows, and closed using the X button on the top right corner of each tile. To restore the default layout of the Data Exploration module, refresh the page or click on another module and then return to Data Exploration.
Cell sets and Metadata tile
In the Cell sets and Metadata tile, there are two tabs: 'Cell sets' where you find the lists of all cluster families, samples and metadata groups; and the 'Annotate clusters' tab where you can perform automatic annotation.
Viewing cell sets
The ‘Cell sets’ tab shows the list of the default clusters (Leiden or Louvain) and any custom cell sets that have been created, as well as the list of samples and metadata. To expand any of the lists, click on the arrow on the left of the list name.
Changing names and colors of cell sets
In Leiden/Louvain clusters, Custom cell sets lists, ScType annotated cell sets, samples and metadata in the Cell sets and Metadata tile, you can change the names and colors of cell sets.
To change the name, click the edit button next to the cell set name. After inputting the new name, click the checkmark to save the name or the cross to cancel. Any changes to cell set names made in this tile propagate to all other modules of the platform.
To change the color of a cluster, click on the colored circle next to the cell set. In the popup, choose the new color and it will be applied automatically. Any changes to cell set colors made in this tile propagate to all other modules of the platform.
Reordering cell sets
It’s possible to rearrange the order of cell set and group lists. To reorder a cell set or group list, drag an item to the desired position using the button (3 lines) on the left of the item name. The new order of cell sets or groups in the Cell sets and Metadata tile will then be reflected in the heatmap in the Data Exploration module, as well as in all plots in the Plots and Tables module. This is a useful feature for ensuring that your samples, metadata groups and annotated cell sets are plotted in the order that you want them to be.
Performing automatic annotation
The ‘Annotate clusters’ tab within the Cell sets and Metadata tile is where you can perform automatic cluster annotation. For Seurat projects the automatic annotation option is ScType, while for Scanpy projects the two automatic annotation options are Decoupler and CellTypist.
- The ScType method (Seurat) of automatic annotation uses a marker genes database which was built using CellMarker, PanglaoDB, and 15 novel cell types with corresponding marker genes added by manual curation of more than 10 papers. The current version of the ScType database contains a total of 3,980 cell markers for 194 cell types in 17 human tissues and 4,212 cell markers for 194 cell types in 17 mouse tissues. More details can be found in the original paper and in the ScType github repository.
- The Decoupler method (Scanpy) of automatic annotation uses the Overrepresentation Analysis (ORA) method implemented in decoupler-py. ORA measures the overlap between a target feature set (the marker genes in the PanglaoDB database) and the marker genes of a given experiment. With these, a contingency table is built and a one-tailed Fisher's exact test is computed to determine if a cell type's set of features are over-represented in the selected features from the data. More details can be found in the decoupler documentation, and in the original paper.
- The CellTypist method (Scanpy) of automatic annotation uses an automated scRNA-seq annotation tool based on regularised logistic regression classifiers. More details can be found on the CellTypist website or in the original paper.
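The overrepresentation test described for the Decoupler method can be illustrated with a minimal sketch. This is not the decoupler-py implementation: it is a standard-library approximation of a one-tailed Fisher's exact test, and the gene sets, the `background_size` parameter and the function name `ora_p_value` are all hypothetical.

```python
from math import comb


def ora_p_value(selected_genes, marker_genes, background_size):
    """One-tailed Fisher's exact test for over-representation.

    Assumes every gene in both sets belongs to a background of
    `background_size` genes (a hypothetical parameter for this sketch).
    """
    selected = set(selected_genes)
    markers = set(marker_genes)
    overlap = len(selected & markers)

    # Probability of drawing at least `overlap` marker genes when picking
    # len(selected) genes at random from the background (hypergeometric tail).
    total = comb(background_size, len(selected))
    tail = sum(
        comb(len(markers), k)
        * comb(background_size - len(markers), len(selected) - k)
        for k in range(overlap, min(len(selected), len(markers)) + 1)
    )
    return tail / total


# Hypothetical marker set for one cell type and hypothetical cluster markers.
t_cell_markers = {"CD3D", "CD3E", "CD2", "IL7R", "TRAC"}
cluster_genes = {"CD3D", "CD3E", "CD2", "MS4A1"}

p_value = ora_p_value(cluster_genes, t_cell_markers, background_size=20)
print(f"Over-representation p-value: {p_value:.4f}")
```

A small p-value suggests that the cell type's marker genes overlap with the cluster's marker genes more than would be expected by chance.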
To perform automatic annotation, select the method (if relevant) and input the species and tissue type using the dropdown menus. Then click ‘Compute’.
Note that the method selection options are enabled/disabled based on whether the project uses Seurat or Scanpy (see Step 6: Data integration), and the ‘Compute’ button is disabled until the dropdown menu selections are complete.
The annotated clusters appear in a new cell set within the Cell sets tab, labelled according to the annotation method used.
For further guidance on cell type annotation, see: How to annotate cell types in Trailmaker.
Creating custom cell sets
Creating custom cell sets using the lasso tool in the UMAP
New cell sets can be created using the lasso selection tool in the UMAP (or t-SNE) embedding plot. The lasso tool allows for a precise selection of an area of cells in your UMAP embedding plot. You can name the new cell set. The new cell set will appear in the ‘Custom cell sets’ list in the Cell sets and Metadata tile. To see the new cell set colored in UMAP, click on the eye icon next to ‘Custom cell sets’.
Custom cell sets based on gene expression
You may want to create a new custom cell set based on the expression (or lack of expression) of one or more genes. In the gene list on the right-hand side of the Data Exploration tab, you can select one or more genes of interest using the checkboxes next to the genes.
By selecting genes and clicking the 'Cellset' button, you can generate a new cell set based on the raw expression of the selected genes. In the Cellset modal, set the thresholds of expression for each selected gene. For example, you can select only the cells that express a particular gene at very high levels; or you can select only the cells that lack expression of selected genes. Note that raw gene expression values can vary a lot depending on the gene, so the thresholds should be selected carefully for each gene based on viewing the violin plot.
Then click ‘Create’. Your new cell set will appear in the Custom cell sets list in the Cell sets and Metadata tile. To see the new cell set colored in UMAP, click on the eye icon next to ‘Custom cell sets’.
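As a rough illustration of threshold-based selection, the sketch below keeps only the cells that satisfy every gene threshold. The expression values, gene choices and the rule that all thresholds must hold at once are assumptions made for this example, not a description of Trailmaker's internals.

```python
# Hypothetical raw expression values: cell ID -> {gene: value}
expression = {
    "cell_1": {"CD3E": 4.0, "MS4A1": 0.0},
    "cell_2": {"CD3E": 0.5, "MS4A1": 3.0},
    "cell_3": {"CD3E": 2.5, "MS4A1": 0.0},
    "cell_4": {"CD3E": 3.0, "MS4A1": 1.5},
}

# Hypothetical thresholds: gene -> (comparison, value)
thresholds = {"CD3E": ("greater", 2.0), "MS4A1": ("less", 1.0)}


def cells_matching(expression, thresholds):
    """Return the IDs of cells that satisfy every threshold (assumed AND logic)."""
    selected = []
    for cell_id, genes in expression.items():
        keep = True
        for gene, (comparison, value) in thresholds.items():
            level = genes.get(gene, 0.0)
            if comparison == "greater" and not level > value:
                keep = False
            elif comparison == "less" and not level < value:
                keep = False
        if keep:
            selected.append(cell_id)
    return selected


new_cell_set = cells_matching(expression, thresholds)
print(new_cell_set)
```

Here only the cells with high CD3E and low MS4A1 are kept, which mirrors selecting cells that express one gene while lacking expression of another.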
Custom cell sets of combined Leiden or Louvain clusters
Locate the list of Leiden or Louvain clusters in the Cell sets and Metadata tile. Select two or more clusters that you would like to combine using the checkboxes next to cluster names. Then click the ‘Combine’ button. In the popup, name the new cluster and click on the tick button to save.
The new cluster will appear in the ‘Custom cell sets’ list.
Note that if you want to copy over all your other Leiden or Louvain clusters to the Custom cell sets list, you can do so using the 'Combine' button with only one cluster selected at a time. This essentially copies the selected Leiden or Louvain cluster to the Custom cell sets list.
Intersect selected cell sets
Intersecting selected cell sets can be very useful when working with non-mutually exclusive cell sets. For example, you’ve created two new cell sets based on gene expression. Cell set 1 contains cells with gene expression of Gene 1 greater than 0.10, and cell set 2 contains cells with gene expression of Gene 2 less than 1. Now, there might be some cells in both of these new cell sets that are the same - with gene expression of Gene 1 greater than 0.1 and Gene 2 less than 1. Intersecting cell set 1 and cell set 2 will highlight cells present in both cell sets and combine them in a new cluster.
To use this function, locate the list of clusters in the Cell sets and Metadata tile. To create an intersection of cells, select clusters using the checkboxes next to cluster names. Then click the ‘Intersection’ button. In the popup, name the new cluster and click on the save button. The new cluster will appear in the ‘Custom cell sets’ list.
Create a new custom cell set from the complement of selected cell sets
Using this function, you can create a new custom cell set that contains all cells that are not in the selected cluster(s). This can be useful and time-saving when you have many clusters and want to create a cell set with all cells outside of these clusters.
Select the cell set(s) that you want to create a complement of, and click the “Complement” button. In the popup, name the new cluster and click on the save button. The new cluster will appear in the ‘Custom cell sets’ list. To see the new cell set colored in the UMAP, click on the eye icon next to ‘Custom cell sets’.
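The combine, intersect and complement functions described above correspond to ordinary set operations on cell IDs. The sketch below uses hypothetical cell IDs and cell set names to show the idea.

```python
# Hypothetical dataset of ten cells, identified by integer IDs.
all_cells = set(range(1, 11))

# Two hypothetical, non-mutually exclusive custom cell sets.
cell_set_1 = {1, 2, 3, 4}
cell_set_2 = {4, 5, 6}

# 'Combine': every cell that is in at least one selected cell set.
combined = cell_set_1 | cell_set_2

# 'Intersection': only the cells present in all selected cell sets.
intersection = cell_set_1 & cell_set_2

# 'Complement': every cell in the dataset that is in none of the selected sets.
complement = all_cells - combined

print(sorted(combined), sorted(intersection), sorted(complement))
```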
Subset selected cell sets to a new project
You can create a new project by subsetting (also known as sub-clustering) a cell selection from your project. This allows for a further deep dive into part of your data, or for removal of contamination from your project.
When you have made a selection of a group of cells, a subset button appears in the Cell sets and Metadata tile.
When you click on the subset button, a pop-up appears to start a new project from your current cell selection. You can change the name of the new project, if you wish. Then click ‘Create’, to make a new project containing your selection.
Data Processing is run for this subset project, after which you can start your deep dive in the Data Exploration module for the subset of cells.
UMAP or t-SNE embedding tile
The tile on the top left of the Data Exploration module shows the embedding - UMAP or t-SNE - that was customized in step 7 of Data Processing. UMAP is shown by default. To change between UMAP and t-SNE, go back to step 7 of Data Processing to change your selection. The embedding plot in the Data Exploration module is interactive, allowing you to zoom in and out to focus on a particular area of interest, move, and hover over single cells.
Hovering over a single cell gives you information about the cell ID and the cluster the cell belongs to, with the selected cell simultaneously highlighted in the marker heatmap.
The embedding is colored by Leiden or Louvain clusters by default. The coloring of the embedding can be changed using the ‘eye’ icons throughout the Data Exploration tab, for example to visualize samples or metadata, or the expression of a single gene from the gene list. Note that when cell sets are viewed on the embedding, cells that are not assigned to a cell set appear in gray.
The UMAP plot also allows the creation of new custom cell sets by using the lasso tool.
Heatmap
The heatmap shows marker genes for the Leiden or Louvain clusters by default. Marker genes have been calculated using a Wilcoxon rank-sum test (wilcoxauc from the presto package in Seurat or sc.tl.rank_genes_groups in Scanpy). The heatmap displays log-normalized expression values.
The number of genes shown per cluster varies depending on how many clusters you have in your dataset. You can zoom in on a specific cell set of interest in the heatmap, and hover over marker genes to identify the gene name which will help to identify the represented cell type.
The heatmap settings menu is accessed by clicking on the gear/cog icon. In this menu, you can add sample/metadata tracks to the heatmap view or reorder the heatmap, as explained below.
Adding sample/metadata track to the heatmap view
To add sample or metadata tracks to the heatmap view, hover over Metadata tracks in the heatmap settings menu. In the sub-menu, toggle the eye icon to add a metadata track. The toggled selections appear as colored tracks above the heatmap view. The order of the metadata tracks can be changed by clicking on the up and down arrows. The item on top of the list is also going to be shown at the top of the heatmap tile. Note that this doesn’t reorder the cells within the heatmap itself - this is done using the ‘Group by’ function (see the next section).
Reordering the cells on the heatmap using ‘Group by’ parameter
To reorder the cells viewed in the heatmap, hover over ‘Group by’ in the settings menu. In the sub-menu, hover over ‘Select the parameters to group by’ dropdown menu. Click + to add a parameter you want to order cells by. To exclude a parameter, click - on the left of the parameter.
Then, in the ‘Group by’ sub-menu, arrange the parameters in descending order of priority, so that cells are grouped by the first parameter and then by each subsequent one.
In the example below, the heatmap is ordered first by sample and then by Louvain clusters:
Viewing genes in the heatmap
You can search for specific genes of interest in the gene list. If you want to look at these genes in the heatmap, you can select them using the checkbox and click ‘Heatmap’.
This gives you an option to add or remove the selected genes from the heatmap or overwrite the heatmap with the selected gene(s).
Clicking remove will remove the selected gene(s) from the heatmap. Clicking add will add the selected gene(s) to the heatmap.
Using overwrite, the heatmap only shows the expression of the selected gene(s). If at any point you want to reload the default heatmap view showing marker genes, simply reload the page.
Hiding Cell Sets
You can hide one or more clusters, samples, or metadata groups from the embedding plot and heatmap.
To hide a particular cluster from the embedding plot and heatmap, click the Hide button on the right side of the cluster name in the Cell sets and Metadata tile. To unhide a cluster or clusters, click the ‘Unhide’ button or use ‘Unhide all’ to unhide all hidden clusters. Metadata groups and Samples can also be hidden/unhidden in this way.
Gene list
You can find the full Gene list for your dataset in the ‘Genes’ tile on the right-hand side of the Data Exploration module. By default, genes are presented in descending order by dispersion. Dispersion describes how much the variance deviates from the mean. Genes with high dispersion have a high level of variation between cells in the dataset. You can rearrange the gene list based on the gene name or dispersion by clicking on the column names (Gene and Dispersion).
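To make the idea of dispersion concrete, the sketch below ranks genes by a variance-to-mean ratio. This is one common definition of dispersion and is an assumption here: the exact formula used by Trailmaker is not stated in this guide, and the expression values are hypothetical.

```python
from statistics import mean, variance

# Hypothetical expression values for three genes across four cells.
expression_by_gene = {
    "GeneA": [0.0, 0.0, 0.0, 8.0],  # expressed in only one cell
    "GeneB": [2.0, 2.0, 2.0, 2.0],  # identical in every cell
    "GeneC": [1.0, 3.0, 1.0, 3.0],  # moderate variation
}


def dispersion(values):
    """Variance-to-mean ratio (assumed definition for this sketch)."""
    average = mean(values)
    if average == 0:
        return 0.0
    return variance(values) / average


# Order genes from most to least variable, as in the default gene list.
ranked_genes = sorted(
    expression_by_gene,
    key=lambda gene: dispersion(expression_by_gene[gene]),
    reverse=True,
)
print(ranked_genes)
```

All three genes have the same mean expression, but the gene whose expression differs most between cells is ranked first.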
Note that if you use SCTransform, the number of genes displayed in the Gene List may be lower than the total number of genes originally detected in your dataset. Further information on gene filtering when using SCTransform is provided in Step 6: Data Integration.
Search for genes in the gene list
You can search for genes that contain, start with or end with certain letter/s or possible subunits. Your search is applied automatically to the gene list as you type.
To clear a gene search, delete your input in the search box or click the cross button (✖) on the left of the search box.
Viewing gene information
If you want to view information on a particular gene in the Gene List, click on the gene name. This action opens a new window showing the selected gene in GeneCards.
Note that the GeneCards database is used primarily for human genes and may not provide useful information if your dataset is from a species other than human.
Differential expression analysis
Differential expression analysis allows you to determine which genes are expressed at different levels between experimental groups. Differentially expressed genes can then be used in pathway analysis to offer insight into the biological processes affected by the condition of interest.
Using Trailmaker, you can find the differentially expressed genes between two groups of cells, where each group must have at least 3 cells. Differential expression can be calculated using the differential expression tab on the right side in the ‘Genes’ block.
Differential expression results in Trailmaker are computed on log-normalized expression values.
Reported log fold changes (logFC) reflect differences in log-normalized counts between the groups being compared, rather than raw transcript counts.
You can compare cell sets within a sample/group, which allows you to find marker genes that distinguish clusters from one another.
Alternatively, you can compare a selected cell set between samples/groups to find genes that are differentially expressed between two experimental groups.
Compare cell sets within a sample/group
The differential expression calculation to compare cell sets within a sample or group uses the presto implementation of the Wilcoxon rank sum test and auROC analysis. For more information see the presto vignette.
To perform this analysis, choose a cell set you want to compare in the first drop-down menu. Choose another cell set, the option ‘Rest of Louvain clusters’ or ‘All other cells’ in the second drop-down menu. [Note that in the case of Louvain clusters, ‘Rest of Louvain clusters’ and ‘All other cells’ are the same because all cells are assigned to a Louvain cluster; whereas for Custom cell sets, these two options will be different if not all cells in the dataset are assigned to a Custom cell set.] Lastly, select the sample/group within which you want to compare cell sets or choose the option ‘All’. Then click ‘Compute’.
You will be presented with the differential expression (DE) results table: a list of genes in descending order of log fold change (logFC). The table returns the following results:
- LogFC: The fold change is the ratio of the expression of a gene between the two groups being compared. It is then log-transformed in base 2. Genes with a positive logFC that appear at the top of the list are expressed at higher levels in comparison group A than in group B. Given that logFC = log2(A) - log2(B), the logFC is positive whenever log2(A) is greater than log2(B), even if both values are negative.
- Adj p-value: The probability of observing the difference in expression for a given gene under the assumption that said gene is not differentially expressed. In addition, the value is adjusted using the Benjamini–Hochberg correction for multiple hypothesis testing, to account for the fact that when testing thousands of genes, some might have a small p-value due to random chance. The smaller it is, the higher the chance the gene is actually differentially expressed.
- Pct1: The percentage of cells where the gene is expressed in the first group (A).
- Pct2: The percentage of cells where the gene is expressed in the second group (B).
- AUC: Area under the receiver operating characteristic (ROC) curve. It is proportional to the Wilcoxon U statistic calculated by the rank-sum test. The larger it is, the more likely it is that the corresponding gene is differentially expressed.
- Note that average expression is not output from the Presto vignette.
- Note that in Scanpy projects the Wilcoxon rank sum test (used for "within" sample/group comparisons) is run on the top 5,000 highly variable genes. Genes outside this set are not tested and will not appear in the results.
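Two of the values in the results table can be illustrated with a short sketch: the logFC formula given above, and the relationship between the AUC and the Wilcoxon U statistic (AUC = U / (n1 × n2)). This is not the presto or Scanpy implementation, and the expression values and function names are hypothetical.

```python
import math


def log_fold_change(expr_a, expr_b):
    """logFC = log2(A) - log2(B), as defined in the results table."""
    return math.log2(expr_a) - math.log2(expr_b)


def auc_from_groups(values_a, values_b):
    """AUC as the Wilcoxon U statistic divided by the number of cell pairs.

    U counts the pairs where a cell from group A has higher expression than
    a cell from group B; tied pairs count as one half.
    """
    u_statistic = 0.0
    for a in values_a:
        for b in values_b:
            if a > b:
                u_statistic += 1.0
            elif a == b:
                u_statistic += 0.5
    return u_statistic / (len(values_a) * len(values_b))


# Both log2 values are negative (-1 and -3), yet the logFC is positive.
log_fc = log_fold_change(0.5, 0.125)

# Hypothetical expression of one gene in three cells per group.
auc = auc_from_groups([3.0, 2.0, 4.0], [1.0, 2.0, 0.0])

print(f"logFC = {log_fc}, AUC = {auc:.3f}")
```

An AUC of 0.5 means the gene does not separate the two groups, while values close to 1 (or 0) mean that expression is consistently higher (or lower) in group A.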
The DE gene list can be reordered in the table by other calculated parameters - adjusted p-value, PCT 1 (the percentage of cells where the feature is detected in the first group), PCT 2 (the percentage of cells where the feature is detected in the second group), and AUC (area under the receiver operating characteristic curve). Both ascending and descending options are available to view.
Clicking on ‘Show settings’ will show your chosen cell sets and samples/groups that have been compared in this DE calculation.
Note that to download the DE results, you must visit the Volcano plot in the Plots and Tables module. Unfortunately, the DE results table cannot be downloaded from the Data Exploration module.
Compare a selected cell set between samples/groups
The differential expression comparison of a selected cell set between samples or groups uses a pseudobulk limma-voom workflow. This is considered best practice for between sample comparisons. Pseudo-bulk differential expression sums the counts for all cells within a cluster for each sample and then uses standard differential expression methods designed for bulk RNA-seq. One major benefit to doing this is that it treats the sample as the level of replication, instead of falsely assuming that each cell is independent.
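The aggregation step of a pseudobulk workflow can be sketched as follows. The example only shows how counts are summed per sample for one cell set; it does not reproduce the limma-voom analysis, and the samples, cell sets and counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells: (sample, cell set, raw counts per gene)
cells = [
    ("sample_1", "T cells", {"GeneA": 2, "GeneB": 0}),
    ("sample_1", "T cells", {"GeneA": 1, "GeneB": 3}),
    ("sample_1", "B cells", {"GeneA": 5, "GeneB": 5}),
    ("sample_2", "T cells", {"GeneA": 0, "GeneB": 4}),
]


def pseudobulk(cells, cell_set):
    """Sum raw counts over all cells of `cell_set`, separately for each sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cells_cell_set, counts in cells:
        if cells_cell_set != cell_set:
            continue
        for gene, count in counts.items():
            totals[sample][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(genes) for sample, genes in totals.items()}


t_cell_pseudobulk = pseudobulk(cells, "T cells")
print(t_cell_pseudobulk)
```

Each sample contributes a single summed profile, so the sample, rather than the individual cell, becomes the unit of replication in the downstream test.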
To perform this DE analysis, choose a cell set you want to compare in the first drop-down menu. Choose the first sample/group to compare in the second drop-down menu. Lastly, select the second sample/group you want to compare with the first sample/group or choose the option ‘Rest of Samples’ or ‘All other cells’. When you have made your selections for the DE calculation parameters, click ‘Compute’.
Note that in some comparison selections, this warning message will appear:
The message explains that in your selected comparison, there are fewer than 3 samples with the minimum number of cells that’s required to perform the DE calculation. The most likely explanation is that you are comparing 1 sample to 1 other sample. An alternative explanation is that you are comparing 3 or more samples, but that there are too few cells (<10) in one or more of the comparison groups, which is resulting in only 2 ‘valid’ comparison groups that contain enough cells to perform the DE calculation.
In this case, you can still go ahead and perform the DE calculation, but the DE results table will only display the list of DE genes and logFC value. No adjusted p-value will be calculated as it is not considered statistically sound to calculate such a p-value on a 1 versus 1 comparison.
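The condition behind this warning can be expressed as a simple check. The thresholds (at least 3 samples, each with at least 10 cells) are taken from the explanation above; the function name and the way the check is structured are assumptions for illustration and may differ from Trailmaker's own logic.

```python
MIN_CELLS_PER_SAMPLE = 10  # samples with fewer cells are treated as not valid
MIN_VALID_SAMPLES = 3      # fewer valid samples means logFC only


def adjusted_p_value_available(cell_counts_per_sample):
    """Return True if enough samples contain enough cells of the chosen cell set."""
    valid_samples = [
        sample
        for sample, n_cells in cell_counts_per_sample.items()
        if n_cells >= MIN_CELLS_PER_SAMPLE
    ]
    return len(valid_samples) >= MIN_VALID_SAMPLES


# Hypothetical numbers of cells from the selected cell set in each sample.
one_vs_one = {"control_1": 120, "treated_1": 95}
two_vs_one = {"control_1": 120, "control_2": 80, "treated_1": 95}
sparse_sample = {"control_1": 120, "control_2": 4, "treated_1": 95}

for name, counts in [("1 vs 1", one_vs_one), ("2 vs 1", two_vs_one),
                     ("2 vs 1, one sparse sample", sparse_sample)]:
    print(name, adjusted_p_value_available(counts))
```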
You will be presented with the differential expression (DE) results table: a list of genes in descending order of log fold change (logFC).
The table returns the following results:
- If the comparison contains 3 or more samples (e.g. 2 control vs 1 treated) then the DE results table presents both the logFC and the adj p-value. In this case, the p-values generated from pseudo-bulk comparisons are statistically accurate and can be used to determine biological significance. The p-value is adjusted using the Benjamini–Hochberg correction for multiple hypothesis testing, to account for the fact that when testing thousands of genes, some might have a small p-value due to random chance. The smaller it is, the higher the chance the gene is actually differentially expressed.
- For 1 vs 1 comparisons (e.g. 1 control vs 1 treated), only the logFC is returned in the results table, because p-values are not appropriate with this small N. LogFC estimates can be used to ascertain the magnitude of the difference between the two samples but not to draw any statistical inferences.
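The Benjamini–Hochberg correction mentioned above can be sketched in a few lines. This is a generic implementation of the standard procedure with hypothetical p-values, not the code used by Trailmaker.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by n / rank and
    # enforcing that adjusted values never increase as the rank decreases.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p_values = [0.01, 0.04, 0.03, 0.005]
adjusted_p_values = benjamini_hochberg(raw_p_values)
print([round(p, 4) for p in adjusted_p_values])
```

Each adjusted value is at least as large as the raw p-value, which reflects the penalty for testing many genes at once.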
For further guidance, see Performing differential expression between samples or groups in Trailmaker.
The gene list can be reordered in the table by other calculated parameters by clicking on the column titles.
Clicking on ‘Show settings’ will show your chosen cell sets and samples/groups that have been compared.
Advanced filtering
To filter the DE gene list, click ‘Advanced filtering’.
In the popup menu, you can select advanced filtering options. There are three pre-set filtering options which allow you to quickly filter for only the up-regulated genes (with a positive logFC), only the down-regulated genes (with a negative logFC) or only the significant genes (with an adjusted p-value of <0.05):
Alternatively, you can add your own custom filter using the ‘Add custom filter’ option. Here, you can select to filter by any of the DE results parameters and set a filtering threshold of your choice.
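The three pre-set filters are equivalent to simple conditions on the results table. The sketch below applies them to a hypothetical results table; the column names `logFC` and `p_val_adj` are chosen for illustration and may not match the names used in an exported table.

```python
# Hypothetical DE results table.
de_results = [
    {"gene": "GeneA", "logFC": 2.1, "p_val_adj": 0.001},
    {"gene": "GeneB", "logFC": -1.4, "p_val_adj": 0.03},
    {"gene": "GeneC", "logFC": 0.3, "p_val_adj": 0.4},
    {"gene": "GeneD", "logFC": -0.2, "p_val_adj": 0.8},
]


def gene_names(rows):
    return [row["gene"] for row in rows]


# Pre-set filters described above.
up_regulated = [row for row in de_results if row["logFC"] > 0]
down_regulated = [row for row in de_results if row["logFC"] < 0]
significant = [row for row in de_results if row["p_val_adj"] < 0.05]

# Filters can be combined, e.g. significant and up-regulated genes only.
significant_up = [row for row in significant if row["logFC"] > 0]

print(gene_names(up_regulated), gene_names(down_regulated),
      gene_names(significant), gene_names(significant_up))
```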
Pathway enrichment analysis
Pathway analysis identifies biological pathways that are enriched in the differentially expressed gene list more than would be expected by chance. The goal is to give the list of genes across different phenotypes a biological context by condensing down a potentially long list of genes into a few select biological pathways.
Click ‘Pathway analysis’ after performing differential expression, to start your pathway analysis. We strongly recommend using Advanced filtering to filter your list of DE genes before performing pathway analysis. This is because the list of differentially expressed genes is very long and contains both up- and down-regulated genes with varying levels of significance. So, further filtering will lead to more consistent and clear pathway analysis results.
Once your list of DE genes has been filtered using the ‘Advanced filtering’ tool, select ‘Pathway analysis’ to begin:
Note that if you have not already filtered your gene list, you will be prompted to do so.
Pathway analysis can be performed on a list of differentially expressed genes using the external service providers PantherDB or Enrichr. The list of genes and species will be submitted to the external service, and no other information will be sent.
We recommend running your pathway analysis using both PantherDB and Enrichr and then comparing the results. Your final choice for the pathway analysis service might depend on the databases in the platforms, user interface, and, ultimately, on your personal preference.
For help using these external pathway analysis services, we recommend visiting the Help pages for PantherDB and Enrichr.
PantherDB
Select the ‘pantherdb’ toggle at the top of the pathway analysis modal:
In the pathway analysis modal, you can confirm the species of your dataset. You can also select the number of differentially expressed genes that will be included in the pathway analysis by clicking ‘Top’ and inputting the desired number. To send all the genes in your filtered list, select ‘All’.
Then, initiate your pathway analysis by clicking ‘Launch’.
PantherDB is hosted on an unsecured server (HTTP), so you will see a warning upon launch. Click “Send anyway” to continue. The list of genes and species will be submitted to the external service, and no other information will be sent. See the example below.
You will be redirected to the PantherDB website in a new tab.
We recommend setting the reference list of genes under "Reference List" on the PantherDB results page and re-running the pathway analysis. If gene names in Trailmaker differ from those in the reference list of genes on PantherDB (for instance, lowercase vs. uppercase gene names), the results of the pathway analysis will be incorrect.
For further help using PantherDB, please visit the relevant help pages on the PantherDB website: http://pantherdb.org/help/PANTHERhelp.jsp.
Enrichr
Select ‘enrichr’ at the top of the pathway analysis modal.
In the pathway analysis modal, you can confirm the species of your dataset. You can also select the number of differentially expressed genes that will be included in the pathway analysis by clicking ‘Top’ and inputting the desired number. To send all the genes in your filtered list, select ‘All’.
Then, initiate your pathway analysis by clicking ‘Launch’.
You will be redirected to the maayanlab.cloud Enrichr page in a new tab.
For further help using the Enrichr pathway analysis tool, please visit the relevant help pages on the Enrichr website: https://maayanlab.cloud/Enrichr/help.
Data Exploration for immune repertoire analysis
Paired WT+immune projects
The Data Exploration page is enabled for paired WT and immune (WT+TCR or WT+BCR) projects.
For immune projects in Trailmaker, a Clonotype List is displayed as a separate tab next to the Gene List on the right side of the Data Exploration view.
The clonotype list displays all identified clonotypes in the dataset, in descending order of frequency across all samples. The clonotype list includes the following columns:
- Clonotype ID (column 1) is assigned to clonotypes according to frequency across the whole dataset. The most abundant clonotype is assigned to clonotype ID 1.
- The clonotype’s primary chain sequences are available in columns 2 and 3. For TCR data, the chains are TRA and TRB, while for BCR data the chains are IGK/L and IGH.
- Count indicates the number of cells identified with each clonotype across all samples in the dataset.
- Frequency illustrates the abundance of each clonotype across all samples in the dataset. The clonotype list is ordered in descending order of frequency.
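The structure of the clonotype list can be sketched as follows. The chain labels are placeholders rather than real sequences, and computing frequency as the clonotype's count divided by the total number of cells with a clonotype is an assumption made for this example.

```python
from collections import Counter

# Hypothetical clonotype of each cell, written as "chain 1 | chain 2".
clonotype_per_cell = [
    "TRA_seq_1|TRB_seq_1",
    "TRA_seq_1|TRB_seq_1",
    "TRA_seq_2|TRB_seq_2",
    "TRA_seq_1|TRB_seq_1",
    "TRA_seq_3|TRB_seq_3",
    "TRA_seq_2|TRB_seq_2",
]

counts = Counter(clonotype_per_cell)
total_cells = len(clonotype_per_cell)

# The most abundant clonotype receives clonotype ID 1, the next ID 2, and so on.
clonotype_list = [
    {
        "clonotype_id": clonotype_id,
        "chains": chains,
        "count": count,
        "frequency": count / total_cells,  # assumed definition of frequency
    }
    for clonotype_id, (chains, count) in enumerate(counts.most_common(), start=1)
]

for row in clonotype_list:
    print(row)
```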
From the clonotype list, individual clonotypes can be overlaid on the UMAP using the eye icon.
The selected clonotype will be shown in the UMAP in navy. The non-focus cells will remain colored by Leiden clusters (if selected) or will appear in grey. Other cell set coloring of the UMAP, such as Leiden clusters, can be cleared from the UMAP by clicking the eye icon next to the relevant cell set family in the Cell sets and Metadata tile.
To overlay multiple clonotypes on the UMAP, select the relevant clonotypes of interest using the check boxes and click ‘Color UMAP’.
In the modal that appears, the default color settings can be set as a gradient or categorical (1.) and the colors of individual clonotypes can be customized (2.). Click ‘Color’ to color the UMAP (3.).
To clear multiple clonotype colors from the UMAP, select ‘Clear UMAP’ or use the paintbrush in the UMAP tile.
To create a custom clonotype cell set, select the relevant clonotypes of interest from the clonotypes list, then click ‘CellSet’. The new cell set will appear in the ‘Custom clonotype cell set’ family in the Cell sets and Metadata tile. The Custom clonotype cell set can be renamed, recolored or deleted as per the behaviour of the Custom cell sets.
With Custom clonotype cell sets, it’s then possible to utilize the other functionality within the Cell sets and Metadata tile, including hide, combine/intersect/complement, and subset to a new analysis. See this section of the user guide for more details. Note that the clonotype IDs will be regenerated during subsetting of immune projects, which will potentially result in different clonotype IDs for a given clonotype with specific chains of interest.
In the Cell sets and Metadata tile, the ‘TCR Detected’ (for TCR datasets) or ‘BCR Detected’ (for BCR datasets) cell set family is included automatically in all paired WT+immune projects. All cells in the dataset are included in this cell set family, grouped as Yes or No. Cells in the ‘Yes’ group have an identified paired TCR or BCR clonotype.
Immune only projects
The Data Exploration page is disabled for Immune only (TCR or BCR data without a WT parent) projects.
Plots and Tables
Overview
The Plots and Tables module of Trailmaker provides a wide range of pre-loaded data visualization options to quickly and easily get insights from your data. It also allows users to customize the plots and export them in a variety of formats.
The module is organized into three sections to make finding the right plot easy and intuitive. The Cell sets & metadata section contains plots that graphically represent cell set properties - categorical embedding, frequency plot, and a trajectory plot. The Gene expression section contains plots that represent the expression of individual genes across cell sets, such as violin plots, dot plots, and more. The Differential expression section includes a volcano plot that visually represents differences between and within groups.
See also: Guided walkthrough: Insights Plots and Tables
General options
All the plots have general customization options!
Main schema
Under the main schema control, you can change the dimensions of the plot - customize the plot’s height and width using the slider scale.
In the title menu, you can define the plot's title, change the title's font size, and indicate the location of the title.
In the font menu, you can change the text font in the plot from Sans Serif to Sans or Monospace.
Axes and margins
Under the Axes and margins control, you can customize the y-axis and x-axis. You can also customize the margins and grid lines.
- You can change the titles of the x- and y-axis, as well as the size of the axis titles, using the slider. Just slide the dot to your preferred value. The changes to axes titles will be applied to the plot automatically.
- You can also rotate the labels on the x-axis. To do this, toggle the “Rotate X-Axis Labels” button.
- You can change the size of axes labels using a slider scale. Just slide the dot to your preferred value.
- To change the margins in the plot, use the slider scale to change the margins from 0 to your preferred value. This will move the plot off-center by offsetting automatic margins.
- To add grid lines to the plot, use the slider scale to change the grid line weight from 0 to your preferred value.
In this menu, you can also override the automatic axes ranges. To manually input values for axes ranges, deselect the Auto control under X-axis and/or Y-axis. Then input your preferred minimum and maximum values, and click Save.
Color inversion
The Color inversion control lets you invert the color of the plot's background. If the standard background color of the plot is white, this control turns the background black.
Markers
This menu applies to embedding and volcano plots. Here, you can change the style and shape of markers.
The point (marker dot) size can be changed from 1 to 100 using a slider scale. Examples of small point size of 1 and large point size of 10 are shown below on the left and right, respectively.
Point opacity can also be changed using a slider scale for the embeddings. The default opacity is at 5, but it can be customized on a scale from 1 to 10. The examples below show opacity settings of 1 (left) and 10 (right).
There are two options for point shape - diamond and round. To change the shape, select your preferred point shape.
Legend
Under the Legend control, you can decide whether to show or hide the plot legend. To hide the legend, toggle the Hide option. You can also choose the position of the legend by clicking Top, Bottom, or Right.
Labels
The label control applies to the categorical and continuous embedding plots. You can use this control to show or hide the cell set labels.
You can also change the size of the labels if you choose to show them on the plot. This might be particularly helpful if you have a lot of clusters in the embedding and their names overlap. To change the size, use the size slider to choose your preferred value.
Additionally, the volcano plot has a control called "Add labels", which displays gene names on the plot. This option lets you specify a threshold for the negative log10 of the adjusted p-value. Labels (names) are displayed for the upregulated and downregulated genes above your chosen value.
Reset plots
All the plots have a reset button that appears after you make any changes to the default plot.
Click the blue reset button on top of the plot to return to the default plot and undo all changes.
Cell sets & metadata
Categorical Embedding
The default categorical embedding plot shows a UMAP embedding of cells from all samples clustered and colored according to the Louvain clustering algorithm. You can read more about how this plot is generated in Step 7: Configure embedding section.
Categorical embedding allows the coloring of the UMAP according to categorical variables. These variables are discrete and used to split data based on specific characteristics, such as samples. The default embedding plot displays Louvain clusters.
Group by
You can use the Group by control to change the cell set category by which you would like to group cells.
Select data
Using the Select data control, you can select a sample of interest. The embedding will then show only cells from the selected sample instead of all samples.
Frequency Plot
A frequency plot shows the distributional information of a variable. Simply put, it summarizes the data by plotting how frequently a specific value occurs. In Trailmaker, the default frequency plot shows the proportions of cells from each cluster in every sample. The y-axis represents the proportions (the frequency values), while the x-axis represents the samples by which the cells are grouped. You can use a frequency plot to see if there is a significant shift in the proportions of cells between samples.
Select data
You can change the metadata and cell sets used for this plot using the Select data control. Using the first drop-down menu, you can specify whether sample or metadata groups are presented on the x-axis. Using the second drop-down menu, you can select the data that are presented on the plot, such as the default Leiden or Louvain clusters, custom cell sets or scType annotations.
Note that depending on the set-up of your custom cell sets, it may not be appropriate to visualize custom cell sets in a frequency plot. For example, in cases where individual cells belong to multiple custom cell sets or in cases where not all cells are assigned to a custom cell set, you should carefully consider whether a frequency plot, particularly showing proportion, is appropriate.
Plot type
You can change the plot type to a frequency plot of absolute counts. To do this, use the Plot type control and click on Count. Absolute counts reflect the number of cells in that cluster in a sample, while proportions reflect the proportion of cells in the cluster compared to all other clusters.
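The difference between the Count and proportion views can be sketched as follows. This is a minimal illustration with made-up sample and cluster names. The helper `cluster_frequencies` is hypothetical and assumes each cell belongs to exactly one cluster.

```python
from collections import Counter, defaultdict

def cluster_frequencies(cells):
    """cells: iterable of (sample, cluster) pairs, one pair per cell."""
    counts = defaultdict(Counter)
    for sample, cluster in cells:
        counts[sample][cluster] += 1
    # Proportion = cells of one cluster divided by all cells in that sample.
    proportions = {
        sample: {cluster: n / sum(c.values()) for cluster, n in c.items()}
        for sample, c in counts.items()
    }
    return counts, proportions

# Toy data: two samples with four cells each.
cells = (
    [("S1", "c0")] * 3 + [("S1", "c1")]
    + [("S2", "c0")] + [("S2", "c1")] * 3
)
counts, proportions = cluster_frequencies(cells)
print(dict(counts["S1"]))   # absolute counts for sample S1
print(proportions["S1"])    # proportions for sample S1
```

If a cell can belong to several cell sets, or to none, the proportions within a sample no longer sum to 1, which is why custom cell sets need care in this plot.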
Trajectory Analysis
Trajectory analysis allows the user to determine a pattern of a dynamic biological process experienced by cells - a "trajectory" of gene expression changes. Then the cells are arranged according to their progression through that process, which means they are placed at their proper position in the trajectory. This progression can be quantitatively measured using pseudotime. Pseudotime has been defined as “an abstract unit of progress: simply the distance between a cell and the start of the trajectory, measured along the shortest path.” [1].
The method used to perform Trajectory analysis in Trailmaker is dependent on the type of project:
- For Seurat projects, trajectory analysis is calculated using the Monocle3 method.
- For Scanpy projects, trajectory analysis is calculated using the Partition-based graph abstraction (PAGA) method. In PAGA, when nodes are disconnected from the trajectory, their pseudotime value is calculated as infinite. In Trailmaker those nodes are displayed with the maximum pseudotime value in the analysis.
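The quoted definition of pseudotime, and the display rule for disconnected nodes, can be illustrated with a shortest-path calculation on a toy graph. This sketch is neither Monocle3 nor PAGA. The graph, edge weights and function names are hypothetical, and it only shows the idea of "distance from the root along the shortest path".

```python
import heapq
import math

def pseudotime(graph, roots):
    """Shortest-path distance from the nearest root node (Dijkstra).

    graph: {node: [(neighbor, edge_length), ...]}
    """
    dist = {node: math.inf for node in graph}
    heap = []
    for root in roots:
        dist[root] = 0.0
        heapq.heappush(heap, (0.0, root))
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for neighbor, length in graph[node]:
            nd = d + length
            if nd < dist[neighbor]:
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

def cap_infinite(dist):
    """Show unreachable nodes at the maximum finite pseudotime."""
    top = max(d for d in dist.values() if math.isfinite(d))
    return {n: (d if math.isfinite(d) else top) for n, d in dist.items()}

# Toy trajectory: A - B - C is connected, D is disconnected.
graph = {
    "A": [("B", 1.0)],
    "B": [("A", 1.0), ("C", 2.0)],
    "C": [("B", 2.0)],
    "D": [],
}
raw = pseudotime(graph, roots=["A"])
shown = cap_infinite(raw)
print(raw)
print(shown)
```

Node D has infinite pseudotime because no path leads to it from the root, so it is displayed at the largest finite value in the analysis.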
Calculate root nodes
To get started, select cell sets to use for trajectory analysis. By default, all Louvain clusters are included. However, you can choose to perform trajectory analysis for specific clusters, samples, or metadata. Click on the box above “Calculate root nodes”, and a drop-down menu will appear with all the cell sets in your dataset. Once you’ve made your selection, click “Calculate root nodes”.
Select root nodes
After the calculation is done, select root nodes by clicking on the white points. The root nodes signify where you want the trajectory to start. You can select multiple nodes at once by drawing a selection. Hold down the Shift key to do this, and then click and drag. Nodes inside the selection will be added as root nodes and appear in red.
Deselect nodes by clicking on a selected node or by clicking Clear selection.
When you have made your selection, click “Calculate pseudotime”. The trajectory plot will be colored by pseudotime. Move around the plot by panning (click and drag) and zooming (pinch and zoom/scroll).
If you have made changes to your selection (e.g., cleared the selection or added new nodes to the selection) after clicking calculate, you can recalculate the pseudotime.
Check out “How to reproduce a published trajectory analysis plot with Trailmaker” for a guided walkthrough of the trajectory analysis plot using a Seurat project.
Display
Under the display control, you can change plot values from pseudotime to cell sets. Before selecting root nodes and calculating pseudotime, the default plot values will be based on cell sets. After calculating pseudotime, the default plot will show pseudotime.
You can hide the starting nodes (white points) from the plot using the Show/Hide Trajectory controls.
Gene expression
Continuous Embedding
The continuous embedding plot allows you to see the expression of a particular gene.
Gene selection
Type the gene name in the search box to select a gene of interest. You can find the search box under the Gene selection control.
Select data
You can also select the data to view on the embedding. For example, you can choose to see the gene expression in cells from a specific sample. To do this, use the Select data control.
Expression values
You can choose to have capped or uncapped values under the “Expression values” control, where the default is set to capped. With capped values, any expression value of a gene in a given cell that lies above a predetermined threshold, determined by the 95th percentile, is artificially set to that threshold, even though the true expression level may be higher.
Uncapped values, by contrast, are reported as their actual expression values.
Capping can be done to manage the potential high variability often found in scRNA-seq data. By capping the expression values at this threshold, one can mitigate the impact of extremely high outliers which may not be biologically relevant but rather artefacts or noise. However, a limitation of this approach is that it could potentially result in the loss of meaningful information about genes that are naturally expressed at extremely high levels or in specific cellular conditions.
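A minimal sketch of capping at the 95th percentile is shown below. It assumes the percentile is taken over one gene's values across cells and uses linear interpolation, which may differ from how Trailmaker computes the threshold. The function names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile, with q between 0 and 100."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def cap_expression(values, q=95):
    """Set every value above the q-th percentile to that percentile."""
    threshold = percentile(values, q)
    return [min(v, threshold) for v in values], threshold

# Toy data: 20 ordinary cells plus one extreme outlier.
values = [float(v) for v in range(20)] + [100.0]
capped, threshold = cap_expression(values)
print(threshold)
print(max(values), "->", max(capped))
```

Only the outlier changes. Values at or below the threshold are left as they are, so the color scale is no longer dominated by a single extreme cell.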
Heatmap
The heatmap shows marker genes for the Leiden or Louvain clusters by default. You can choose to see custom genes or marker genes in the heatmap.
Gene selection
By default, three marker genes per cluster are shown. To view custom genes on the Heatmap, select the “custom genes” option, type in a gene name and select it to add it to the plot. You will see automatic suggestions for the genes when you are typing out the gene name. Click on the gene in the suggestion box to add the gene, or click on the Add button.
To add multiple genes, separate them with a space or comma. Gene lists can be pasted into the gene search box from the Data Exploration module or from a document or spreadsheet.
The genes can be reordered on the y-axis of the heatmap by dragging and dropping the dots next to the gene name. To deselect a gene click on the X on the right of the gene name.
To see the expression of marker genes, click on the “Marker genes” option. Type the number of marker genes per cluster that you want to plot, and click “Run”. You can also choose to show or hide gene labels in this menu.
Metadata tracks
To add metadata tracks, click on the Metadata tracks control. Toggle the eye icon to add a metadata track to the heatmap. The toggled selections appear as colored tracks above the heatmap.
To change the order of metadata tracks or Louvain cluster tracks in the heatmap, click on the arrow icon to move the track up or down in the plot. The item on top of the list will also be shown at the top of the heatmap block. Note that this doesn’t reorder the default ordering of cells as it’s still grouped by Louvain clusters. Cells can be reordered within the heatmap using the ‘Group by’ control (see the next section).
Group by
To reorder the cell ordering in the heatmap, click on the ‘Group by’ control. In the popup, hover over the ‘Select the parameters to group by’ drop-down menu. Click + to add a parameter you want to order cells by. To exclude the parameter, click - on the left of the parameter. Then click on the up arrow to change the ordering of the cells. The parameter on the top of the list will be used as a grouping parameter.
Expression values
You can change the type and capping of the expression values under the Expression values control. You can choose to use raw values or Z-scores. You can also choose to have capped or uncapped values. Capped values for the expression level of a gene in a given cell refer to genes that are expressed at a level below a predetermined threshold, typically set to be the detection limit of the scRNA-seq experiment. These genes are said to be capped because their expression values are artificially set to this threshold, even though their true expression level may be lower. Uncapped values, by contrast, are reported as their actual expression values.
Violin Plot
The violin plot allows you to look at the distribution of normalized expression of a gene of interest across Leiden or Louvain clusters by default. The black dots represent cells.
Sometimes you can see black horizontal lines at the bottom of kernels. These are points that signify the cells where the gene is not expressed, and visually can look like a line on the plot.
Gene selection
Under the Gene selection control, you can select your gene of interest. Type the gene into the search box. You will see automatic suggestions of the gene when you are typing out the gene name. You can click on the suggested gene to autocomplete the gene name. Click search to plot the violin plot.
View multiple plots
You can view multiple violin plots for the expression of different genes in a grid view in one window.
Type the gene name into the search box to plot multiple violin plots. You will see automatic suggestions of the gene when you are typing out the gene name. You can click on the suggested gene to autocomplete the gene name. To add multiple genes, separate them with a space or comma. Click add to plot the expression of your selected genes. The selected genes are going to appear at the bottom of the controls menu.
Drag and drop the genes in the gene list to rearrange the order of plots in the grid. To deselect a gene and remove a plot from the grid, click on the X on the right of the gene name.
You can also change the dimensions of the grid. The grid dimensions are represented as Rows x Columns. For example, to view four plots you could choose a 1x4 grid or a 2x2 grid.
You can also find the options to select a specific plot and update the controls. If you have selected “Controls update: All plots”, then changes in other controls such as Select data and Data transformation are going to be applied to all the plots in the grid.
If you select a plot and choose “Controls update: Selected plot”, changes in controls are going to be applied only to the selected plot.
Note that each plot needs to be saved individually.
Select data
You can change the metadata and cell sets used for this plot using the first dropdown menu in the select data control. The selection in the first dropdown menu controls the x-axis of the plot.
In the second dropdown menu in the select data controls, you can change the cell set or metadata to be used as data. The default option is to show ‘All’. However, you can choose to display only a part of the data such as an individual sample or metadata group.
Data transformation
Under the Data transformation control, you can change the type of gene expression values from normalized to raw values. Note the change in the values on the y-axis in the screenshots below.
You can also adjust the bandwidth, which impacts the density fit of the kernels. To change the bandwidth, move the slider to your preferred value. Values range from 0 to 1 in 0.05 intervals.
Dot Plot
In Trailmaker, the dot plot shows the percentage of cells expressing the genes of your choice. The size of each dot represents the percentage of cells in a specific cluster that express the gene: the smaller the dot, the smaller the percentage of expressing cells. A bigger dot in a specific cluster means the gene is expressed in a larger fraction of that cluster's cells. The color reflects the level of expression of the gene.
In Seurat projects, the DotPlot function from the Seurat R package is used, which calculates the average expression of each gene across a specific cluster or group of cells. This average is computed using scaled data, meaning the expression values are standardized to have a mean of 0 and a standard deviation of 1. As a result, the “Average Expression” values depend on the subset of cells or datasets that have been selected for comparison. Also note that scaling is automatically disabled when only two groups are present, to avoid misleading results, as discussed here.
By default, three genes with the highest dispersion across all cells are shown.
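The two quantities encoded in a dot (size and color) can be sketched for a single gene as below. This is a simplified illustration, not the Seurat DotPlot code: the function `dot_plot_stats` is hypothetical, a cell counts as "expressing" when its value is above zero, and the scaling is a plain z-score of the group averages that is skipped when fewer than three groups are present.

```python
import math

def dot_plot_stats(expr_by_group):
    """expr_by_group: {group: [expression of one gene in each cell]}."""
    # Dot size: percentage of cells in the group with expression above zero.
    pct = {
        g: 100 * sum(v > 0 for v in vals) / len(vals)
        for g, vals in expr_by_group.items()
    }
    avg = {g: sum(vals) / len(vals) for g, vals in expr_by_group.items()}
    if len(avg) < 3:
        # With only two groups, scaling is skipped and raw averages are used.
        return pct, dict(avg)
    mean = sum(avg.values()) / len(avg)
    sd = math.sqrt(sum((a - mean) ** 2 for a in avg.values()) / (len(avg) - 1))
    # Dot color: group average standardized to mean 0 and standard deviation 1.
    scaled = {g: (a - mean) / sd if sd > 0 else 0.0 for g, a in avg.items()}
    return pct, scaled

# Toy data: one gene measured in three clusters of four cells each.
groups = {
    "cluster 0": [0.0, 0.0, 2.0, 4.0],
    "cluster 1": [1.0, 1.0, 1.0, 1.0],
    "cluster 2": [0.0, 0.0, 0.0, 0.0],
}
pct, scaled = dot_plot_stats(groups)
print(pct)
print(scaled)
```

Because the scaled values are standardized across the groups being compared, adding or removing a group changes the color of every dot for that gene.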
Gene selection
You can look at the expression of custom genes of your choice or marker genes.
To select custom genes, type in a gene name in the gene search box. You will see automatic suggestions for the genes when you are typing out the gene name. To add multiple genes, separate them with a space or comma. Gene lists can be pasted in from the Data Exploration module or from a list in a document or spreadsheet. Click Add to plot your selected gene/s. The gene/s you have selected will appear below the search bar.
To rearrange the order of the genes on the x-axis, drag and drop these genes in the gene list below the search box. To deselect a gene click on the X on the right of the gene name.
To see the expression of marker genes, click on the “Marker genes” option. Type the number of marker genes per cluster that you want to plot. Click Run to plot the marker gene dot plot.
Select data
In the dot plot, you can also change the cell sets or metadata that cells are grouped by, using the first dropdown menu in the “select data” controls which determines the y-axis.
You can also select the cell sets or metadata to be shown as data. For example, the cells can be grouped by Louvain clusters (y-axis), and you can select to view data only from one sample using the second dropdown menu in the select data controls.
Size scale
You can change the size scale of the dot plot. There are two available options - relative and absolute scale. Absolute scale will show total expression, while relative scale will be relative to what you select in the "Select data" control. So, if you select Louvain clusters, the size scale will be relative to all clusters, but if you select samples, the size scale will be relative to all samples.
Normalized expression matrix
In the Plots and Tables module, you can download the normalized expression matrix for specific samples, metadata groups, clusters, and custom cell sets. The normalized expression matrix contains genes as rows and cells in columns, where for each gene you have a normalized expression value for each cell. The normalized values allow us to see biological variability more clearly. The Seurat object is subsetted before exporting the matrix.
To export the full normalized expression matrix, just click download. The matrix is going to be exported as CSV.
To subset the matrix, click on the “All” box below the parameter. Note that you can also subset using multiple parameters. For example, let's subset the normalized expression matrix based on clusters.
Click on the cluster(s) you want to subset the matrix by. You can choose multiple clusters, and the selected cluster(s) will appear in the box. To deselect a cluster, click on X.
When you have selected your preferred parameters, click download.
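Once downloaded, the matrix can be read with any CSV tool. The sketch below uses Python's standard library and assumes a layout with cell identifiers in the header row and the gene name in the first column. The exact header text and file name of the real export are not described here, so inline toy data stands in for the downloaded file.

```python
import csv
import io

def read_expression_csv(handle):
    """Read a genes-by-cells CSV into {gene: [value per cell]}."""
    reader = csv.reader(handle)
    header = next(reader)
    cell_ids = header[1:]  # assumed: first header field labels the gene column
    matrix = {row[0]: [float(x) for x in row[1:]] for row in reader}
    return cell_ids, matrix

# Toy stand-in for the downloaded file (gene and cell names are made up).
example = io.StringIO(
    "gene,cell_1,cell_2,cell_3\n"
    "CD3E,1.2,0,2.4\n"
    "MS4A1,0,3.1,0\n"
)
cell_ids, matrix = read_expression_csv(example)
for gene, values in matrix.items():
    print(gene, sum(values) / len(values))  # mean normalized expression
```

For a real export, replace the `io.StringIO` object with `open(path, newline="")`, where `path` points to the downloaded CSV.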
Differential expression
Volcano Plot
A volcano plot is a type of scatter plot that represents the differential expression of genes. This plot allows you to identify possible biologically significant genes. The x-axis represents the log fold change, and the y-axis represents the negative log10 of the adjusted p-value.
If you performed differential expression analysis in the Data Exploration module, your most recent selections for the analysis are reflected in the Volcano plot controls menu. You only need to click compute if you want to plot a volcano plot with the selection from Data Exploration!
Compare cell sets within a sample/group
Here, you can perform differential expression analysis to find marker genes that distinguish one cluster from another cluster or from all other clusters and plot a volcano plot that represents these genes. The calculation uses the presto implementation of the Wilcoxon rank sum test and auROC analysis. For more information, see the presto vignette.
To perform the differential expression analysis and plot the volcano plot, go to the Differential expression control. Choose a cell set you want to compare in the first drop-down menu (1). Choose another cell set, the option ‘Rest of Louvain clusters’ or ‘All other cells’ in the second drop-down menu (2). [Note that in the case of Louvain clusters, ‘Rest of Louvain clusters’ and ‘All other cells’ are the same because all cells are assigned to a Louvain cluster, whereas for Custom cell sets, these two options will be different if not all cells in the dataset are assigned to a Custom cell set.]
Lastly, select the sample/group within which you want to compare cell sets or choose the option ‘All’ (3). Then click compute.
The volcano plot based on the computed differential expression analysis will appear in the plot tile.
The most upregulated genes in the plot are toward the right (in blue), while the most downregulated genes are toward the left (in red). Other genes with a small magnitude of change are colored in gray. The most statistically significant genes are toward the top of the plot.
You can also export the results of the differential expression analysis in a CSV format. Read more in the Export to CSV section.
Compare a selected cell set between samples/groups
With this differential expression analysis, you can find differentially expressed genes between two experimental groups and plot the respective volcano plot. This analysis uses a pseudobulk limma-voom workflow.
To perform this differential expression analysis and plot the volcano plot go to the Differential expression control. Then choose a cell set you want to compare in the first drop-down menu (1). Choose the first sample/group to compare in the second drop-down menu (2). Lastly, select the second sample/group you want to compare with the first sample/group or choose the option ‘Rest of Samples’ or ‘All other cells’ (3).
When you have made your selections, click compute.
The volcano plot based on the computed differential expression analysis will appear in the plot tile.
The most upregulated genes in the plot are toward the right (in blue), while the most downregulated genes are toward the left (in red). Other genes with a small magnitude of change are colored in gray. The most statistically significant genes are toward the top of the plot.
You can also export the results of the differential expression analysis in a CSV format. Read more in the Export to CSV section.
Data thresholding
Under the Data thresholding control, you can modify the significance thresholds and the design of the guidelines.
Note that in some cases you may observe horizontal bands of points in the volcano plot, where many genes share the same y-axis value (−log10 adjusted p-value). This behavior is expected and not an error.
Trailmaker reports adjusted p-values using the Benjamini–Hochberg procedure, which is a method that enforces monotonicity of adjusted p-values across all tested genes. Many distinct raw p-values can collapse to the same adjusted p-value, resulting in horizontal banding in the plot.
In such cases, volcano plots based on adjusted p-values can become visually uninformative, even though the underlying differential expression results are valid. For exploratory visualization, users may find it helpful to export the Seurat/AnnData object or the differential expression results (CSV) and generate a custom volcano plot using raw p-values (e.g. log fold change vs −log10 p-value), while still relying on adjusted p-values to assess statistical significance.
This behavior is a known property of multiple-testing correction and does not indicate rounding, loss of precision, or a processing issue in Trailmaker.
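The banding effect can be reproduced with a small, self-contained version of the Benjamini–Hochberg adjustment. The function below is a textbook sketch written for illustration (not Trailmaker's code), and the raw p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.02, 0.021, 0.022, 0.5]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(p, "->", q)
```

In this toy input, the three distinct raw p-values 0.02, 0.021 and 0.022 all receive the same adjusted value (about 0.0275), so they would sit on one horizontal band in a volcano plot of adjusted p-values.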
Significance Thresholds
You can change the -log10 (p-value) threshold on the y-axis. To do this, input a new threshold into the input box. This can change which genes are going to be considered upregulated or downregulated. The equivalent p-value is displayed below the input box. See the example below.
You can also change the fold change (log) value on the x-axis. The fold change is the ratio of gene expression between the two groups being compared, and this ratio is then log-transformed. Adjusting the log fold change threshold can change which genes will be considered upregulated or downregulated. The new value will be represented as the negative and positive log fold change thresholds, as shown in the example below.
You can also choose to deselect the option ‘Show guideline.’
For both thresholds, deselecting this option will color the guideline black instead of the default red in the plot tile. When you download the plot, the guidelines will disappear completely. See the example below.
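How the two thresholds partition the genes can be sketched as follows. The default threshold values, the category names, and the treatment of points lying exactly on a guideline are assumptions made for illustration, not Trailmaker's documented behavior.

```python
import math

def equivalent_p_value(neg_log10_threshold):
    """Convert a -log10(p) threshold back to a p-value."""
    return 10 ** (-neg_log10_threshold)

def classify_gene(log_fc, adj_p, neg_log10_p_threshold=2.0, log_fc_threshold=1.0):
    """Label one gene using the two volcano plot thresholds (hypothetical defaults)."""
    # Guard against log10(0) for p-values reported as exactly zero.
    neg_log10_p = math.inf if adj_p <= 0 else -math.log10(adj_p)
    if neg_log10_p < neg_log10_p_threshold:
        return "not significant"
    if log_fc >= log_fc_threshold:
        return "upregulated"
    if log_fc <= -log_fc_threshold:
        return "downregulated"
    return "not significant"

print(equivalent_p_value(2.0))
for log_fc, adj_p in [(1.5, 0.001), (-2.0, 0.001), (3.0, 0.5), (0.2, 0.0001)]:
    print(log_fc, adj_p, classify_gene(log_fc, adj_p))
```

Raising either threshold can only move genes out of the upregulated and downregulated categories, which is why the colored regions of the plot shrink as the thresholds become stricter.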
Guideline design
Under the guideline design, you can adjust the width and color of the guidelines.
To change the width of the lines, just input the new width value.
You can also change the colors of the guidelines. To do so click on Browse next to “Colors”. Then click on the colored circle on the right of the guideline whose color you want to change.
Then simply select a new color. The change is going to be applied automatically.
Colors
Besides color inversion, under the “Colors” control you are able to change the appearance of markers. Click on Browse next to “Markers”. To change the color of gene markers, click on the colored circle on the right of the marker type.
Then choose a new color. The change is going to be applied automatically.
Comparing fewer than 3 samples
Note that for some selections, an error message will appear:
The message explains that there are fewer than 3 samples with the minimum number of cells required to perform the differential expression calculation in your selected comparison. The most likely explanation is that you are comparing 1 sample to 1 other sample. An alternative explanation is that you are comparing 3 or more samples but that there are too few cells (<10) in one or more of the comparison groups, resulting in only 2 ‘valid’ comparison groups that contain enough cells to perform the calculation.
As the error message explains, for a comparison between only two ‘valid’ samples it is not possible to compute the differential expression and obtain statistically meaningful results. For such a comparison it is therefore not possible to create a volcano plot in Trailmaker.
In this case, you can still perform the differential expression calculation, but the results will only consist of DE genes and logFC values and should be interpreted with caution and only used for exploratory purposes. No adjusted p-value will be calculated, which means that you will not be able to plot a volcano plot. The plot will look like the image below. However, you can download your results in a CSV format.
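The condition that triggers this message can be expressed as a simple check. The sketch below only restates the rule described above (at least 3 samples, each with at least 10 cells). The function name and sample labels are hypothetical.

```python
MIN_CELLS = 10    # a sample with fewer cells than this is not 'valid'
MIN_SAMPLES = 3   # fewer valid samples than this means no adjusted p-values

def can_compute_p_values(cells_per_sample):
    """cells_per_sample: {sample: number of cells in the selected cell set}."""
    valid = [s for s, n in cells_per_sample.items() if n >= MIN_CELLS]
    return len(valid) >= MIN_SAMPLES, valid

# Three samples are compared, but one has too few cells in the selected cell set.
ok, valid = can_compute_p_values({"WT_1": 250, "WT_2": 7, "KO_1": 310})
print(ok, valid)
```

Here only two samples are valid, so the results would contain logFC values but no adjusted p-values.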
Export to CSV
When you have performed the differential expression analysis, alongside the volcano plot an ‘Export to CSV’ button is going to appear at the top of the plot window.
Click on the button, and your differential expression results will be automatically downloaded in a CSV file.
Batch Differential Expression Table
Here you can perform DE analysis and download batch DE tables. There are three comparison options.
Generate a full list of marker genes for all cell sets
First, you can download a full list of marker genes for all cell sets, where each cell set will be compared to all other cells, using all samples. You can select which cell sets you want to use for analysis.
When you have selected the cell sets for which marker genes are to be computed in batch, click “Compute and Download”.
A DE table will be created for each cell set.
Compare two selected samples/groups within a cell set for all cell sets
You can also compare two selected samples/groups within a cell set in a batch for all cell sets.
Choose a sample/group you want to compare in the first drop-down menu (1). Choose another sample/group, or the option ‘Rest of Samples’ or ‘All other cells’ in the second drop-down menu (2). Lastly, select the cell sets within which you want to compare samples/groups (3).
Then click “Compute and Download”. Note that a warning may appear.
Compare between two cell sets for all samples/groups
Lastly, you can compare two cell sets for all samples/groups.
Choose a cell set you want to compare in the first drop-down menu (1). Choose another cell set, or the option ‘Rest of Louvain clusters’ or ‘All other cells’ in the second drop-down menu (2). Lastly, select the sample/group for which you want to batch compare cell sets (3).
Then click “Compute and Download”. Note that you might get a warning message, depending on the selection you made.
The message explains that there are fewer than 3 samples with the minimum number of cells required to perform the differential expression calculation in your selected comparison. The most likely explanation is that you are comparing 1 sample to 1 other sample. An alternative explanation is that you are comparing 3 or more samples but that there are too few cells (<10) in one or more of the comparison groups, resulting in only 2 ‘valid’ comparison groups that contain enough cells to perform the calculation.
As the warning message explains, for a comparison between only two ‘valid’ samples it is not possible to compute the differential expression and obtain statistically meaningful results. For such a comparison it is therefore not possible to create a volcano plot in Trailmaker.
In this case, you can still perform the differential expression calculation, but the results will only consist of DE genes and logFC values and should be interpreted with caution and only used for exploratory purposes. No adjusted p-value will be calculated.
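The rule behind the warning can be illustrated with a minimal Python sketch. This is not Trailmaker's implementation: the function names are hypothetical, and the thresholds (at least 10 cells per sample, at least 3 valid samples) are assumptions taken from the description above.

```python
from collections import Counter

# Assumed thresholds, taken from the wording of the warning message above.
MIN_CELLS_PER_SAMPLE = 10
MIN_VALID_SAMPLES = 3


def count_valid_samples(cell_sample_labels, min_cells=MIN_CELLS_PER_SAMPLE):
    """Count samples that contribute at least `min_cells` cells.

    cell_sample_labels: one sample name per cell included in the comparison.
    """
    cells_per_sample = Counter(cell_sample_labels)
    return sum(1 for n in cells_per_sample.values() if n >= min_cells)


def comparison_triggers_warning(cell_sample_labels):
    """True when fewer than 3 samples have enough cells (hypothetical helper)."""
    return count_valid_samples(cell_sample_labels) < MIN_VALID_SAMPLES
```

For example, a comparison of sample A (50 cells) against sample B (40 cells) has only 2 valid samples and would trigger the warning. Adding a third sample with 5 cells would not change this, because that sample falls below the assumed cell threshold.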
If the downloaded file contains an "AveExpr" column, this value represents the average log2 expression (log2-CPM) of each gene within the selected cell set (if one is selected), averaged across all samples in the dataset. Note that this value does not depend on the samples or groups selected for comparison.
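To make the meaning of an average log2-CPM value concrete, here is a rough Python sketch. It assumes per-sample gene counts and a voom-style log2-CPM formula with a small offset; the exact calculation used by Trailmaker is not described here, so the formula, function names and input structure are illustrative assumptions only.

```python
import math


def log2_cpm(counts, prior_count=0.5):
    """Log2 counts-per-million for one sample.

    counts: dict mapping gene name to raw count for that sample.
    The offsets (prior_count and +1 on the library size) follow the
    voom-style formula and are an assumption, not Trailmaker's documented method.
    """
    lib_size = sum(counts.values())
    return {
        gene: math.log2((c + prior_count) / (lib_size + 1) * 1e6)
        for gene, c in counts.items()
    }


def ave_expr(per_sample_counts, gene):
    """Average the log2-CPM of `gene` across all samples (hypothetical helper)."""
    values = [log2_cpm(counts)[gene] for counts in per_sample_counts]
    return sum(values) / len(values)
```

Because the average is taken over every sample in the dataset, the value for a gene stays the same whichever samples or groups are chosen for the comparison.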
Plots & Tables for immune repertoire analysis
The plots available in the Plots and Tables page depend on the data type your project contains. For immune profiling projects (TCR or BCR data with or without a WT parent), there are 3 additional plots at the bottom of the Plots and Tables homepage view, in a section entitled “TCR analysis” or “BCR analysis”.
Clonotype Frequency Plot
The clonotype frequency plot displays a bar chart of the most abundant clonotypes in the dataset.
By default, the plot shows the top 10 identified clonotypes in the dataset.
In the ‘Select data’ menu, individual samples or metadata groups can be selected. In this case, the clonotype frequency is recalculated according to the selection and the clonotypes are reordered such that the plot (and associated legend) displays clonotypes in descending order of frequency.
In the ‘Clonotypes’ menu, you can select the number of top clonotypes to plot. The default is 10. The maximum is 100.
In the ‘Plot type’ menu, the plot view can be changed from ‘Proportional’, where the x-axis displays the clonotype frequency, to ‘Count’, where the x-axis displays the number of cells.
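The ranking described above can be sketched in a few lines of Python. This is an illustration rather than Trailmaker's code: the function name and input format are hypothetical, and the sketch assumes that frequency is calculated relative to the cells with an identified clonotype in the current selection.

```python
from collections import Counter


def top_clonotypes(clonotype_ids, top_n=10, plot_type="Proportional"):
    """Rank clonotypes in descending order of abundance.

    clonotype_ids: one entry per selected cell; None marks a cell
    without an identified clonotype (assumed input format).
    """
    if not 1 <= top_n <= 100:
        raise ValueError("top_n must be between 1 and 100")
    counts = Counter(c for c in clonotype_ids if c is not None)
    total = sum(counts.values())
    ranked = counts.most_common(top_n)  # sorted by count, highest first
    if plot_type == "Count":
        return ranked
    # 'Proportional': divide by the cells with a clonotype (an assumption).
    return [(clonotype, n / total) for clonotype, n in ranked]
```

Passing a different subset of cells, as happens when a sample or metadata group is chosen in the ‘Select data’ menu, recalculates both the values and the order.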
Other plot control menus behave in the same way as other plots in the Plots and Tables page, as described in the Plots and Tables: General Options section.
Downloading the Clonotype Frequency Plot results in the download of 2 files: an image file (.png file) of the frequency bar chart and a table (.csv file) of the legend.
Honeycomb Plot
The honeycomb plot displays groupings of cells by clonotype, with the most abundant clonotype cluster plotted in the centre of the plot. By default, the plot displays all cells in the dataset with an identified clonotype and is colored by clonotype frequency.
In the ‘Select data’ menu, individual samples or metadata groups can be selected to view on the plot. The default selection is ‘All’, which shows all cells with an identified clonotype.
In the ‘Labels’ menu, the clonotype cluster labels can be shown or hidden, and the number of top clonotypes to be labelled can be selected. The default is that labels are hidden.
The selection in the ‘Color by’ menu dictates the plot coloring. The default selection is Frequency, with the default color scheme. When Frequency is selected, the Viridis, Inferno and Spectral color schemes are available. Using the dropdown menu, the plot can instead be colored by cell set families, such as Leiden (or Louvain), samples, metadata groups, custom cell sets, or custom clonotype cell sets. Note that the color selections for cell set families are determined in the Data Exploration page, within the Cell sets and Metadata tile.
Example of Honeycomb plot colored by sample using the ‘Color by’ menu:
Other plot control menus behave in the same way as other plots in the Plots and Tables page, as described in the Plots and Tables: General Options section.
Downloading the Honeycomb Plot results in the download of 2 files: an image file (.png file) of the honeycomb plot and a table (.csv file) of the legend.
Motif Analysis
Motif analysis displays the amino acid composition of the selected CDR3 chain in your data. Motif analysis is performed using the logoplot_cdr3_motif function in Scirpy. More details can be found in the Scirpy tutorial.
- The letters represent the amino acids at each position in the chain sequence.
- The colors represent the properties of the amino acids, as defined by the chemistry parameter in the Python logoplot function.
By default, the plot shows all CDR3 amino acid sequences of the specified chain and length (see ‘CDR3 selection’ menu below) for all samples and all cell sets.
The first dropdown menu in the ‘Select data’ menu allows users to select the samples/groups to plot. The second dropdown menu allows users to select the cell sets to plot. The project will be subsetted on these selections (e.g. on Activated T cells in the Control group) before plotting the chains according to the CDR3 selection criteria in the ‘CDR3 selection’ menu.
The ‘CDR3 selection’ menu dictates the chain, chain length and format of the data to be plotted:
- The dropdown menu dictates the chain that is plotted. Options for TCR data are TRA or TRB, with TRA as the default. Options for BCR data are IGK/L and IGH, with IGK/L as the default.
- The slider allows the user to select the number of positions to view, which dictates the x-axis of the plot. The range is 11-17. The default is 11.
- The plot options are Information (default) and Probability, which dictate the y-axis of the plot. High Information at a given position means that the position is dominated by one (or a few) amino acids. This results in one (or a few) amino acid letters being displayed at that position, taking up the majority of the y-axis. Low Information means that the amino acids are more variable at that position. Probability shows the proportion of each identified amino acid at each position.
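The difference between the two y-axis options can be illustrated with a small Python sketch. It uses the standard sequence-logo definition of information content (log2 of the alphabet size minus the Shannon entropy at each position, in bits). Whether the underlying plotting library applies exactly this formula, for example with a background distribution or small-sample correction, is an assumption. The function names are hypothetical.

```python
import math
from collections import Counter


def position_probabilities(sequences):
    """Proportion of each amino acid at each position (equal-length sequences)."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    probs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        probs.append({aa: n / len(sequences) for aa, n in counts.items()})
    return probs


def position_information(probabilities, alphabet_size=20):
    """Information content per position in bits (assumed standard definition)."""
    max_bits = math.log2(alphabet_size)
    info = []
    for p in probabilities:
        entropy = -sum(v * math.log2(v) for v in p.values())
        info.append(max_bits - entropy)
    return info
```

A position where every chain carries the same amino acid reaches the maximum of log2(20), about 4.32 bits, whereas a position with many different amino acids scores close to zero.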
The number of chains included in the plot is displayed below the plot.
Note that chains containing * symbols are filtered out in motif analysis in Trailmaker. This is because the * symbol represents a stop codon, which indicates that the receptor sequence is unproductive or truncated and doesn’t encode a full functional protein. Including such sequences in motif calculations can distort amino acid frequency estimates and introduce noise into the resulting motif plot. By filtering out chains containing stop codons, Trailmaker ensures that motif analysis is performed only on valid CDR3 sequences, producing clearer and more biologically meaningful results.
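As a minimal illustration of this filtering step, the Python sketch below keeps only chains of the selected length that contain no * symbol. The function name and input format are hypothetical and do not reflect Trailmaker's internal code.

```python
def select_cdr3_chains(cdr3_sequences, length=11):
    """Keep CDR3 amino acid sequences of the given length without stop codons.

    cdr3_sequences: list of amino acid strings, where '*' marks a stop codon.
    """
    return [
        seq
        for seq in cdr3_sequences
        if len(seq) == length and "*" not in seq
    ]
```

For example, given the chains CASSLGQGYEQ, CASS*GQGYEQ and CASSL with a length of 11, only CASSLGQGYEQ would be retained for the motif plot.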
Other plot control menus behave in the same way as other plots in the Plots and Tables page, as described in the Plots and Tables: General Options section.
Plot restrictions for Immune only Projects
The Plots and Tables module for immune only projects (TCR or BCR data without a WT parent) displays only the 3 immune plots: Clonotype Frequency Plot, Honeycomb Plot and Motif Analysis.
Some plot customization options, including the ‘Select data’ menu and the ‘Color by’ menu, are restricted due to the lack of cell sets (such as Leiden clusters) in immune only data.
Downloading plots
You can download the plots by clicking on the button with three dots in the top right corner of the plot.
You can save your plots as PNG or SVG. SVG is a vector format, so it can be scaled without loss of quality, whereas PNG has a fixed resolution. Click on your preferred option to start the download.
You can also download your plot by right-clicking on it and selecting “Save image as”.
Citing Trailmaker
For guidance on citing Trailmaker in a publication, see our article: How to use and cite Trailmaker in a publication.