Introduction to Trailmaker™
Getting started with your single cell RNA-sequencing (scRNA-seq) data analysis doesn’t have to be daunting! Trailmaker guides you through the end-to-end analysis of your Evercode™ Whole Transcriptome data, taking you from FASTQ files to figures in just a few simple clicks. You can gain unprecedented insight into the cellular heterogeneity of complex biological systems and drive forward your research programme.
Visit our website or watch an in-depth demo of Trailmaker in action.
Table of Contents
- Getting started
- Pipeline module
- Automatic integration of Pipeline & Insights modules
- Insights module
- Insights module - Data Processing tab
- Automated data processing
- Data processing pipeline status indicator
- Navigating through the data processing steps
- Data processing plots and statistics
- Data Processing steps
- Adjusting a data processing setting
- Pipeline failures
- Saving a processed project
- Exporting the data processing plots
- Downloading the data processing settings
- Summary of Insights Data Processing tab
- Data Exploration
- Plots and Tables
Getting started
Creating an account and logging in
Access Trailmaker at: https://app.trailmaker.parsebiosciences.com/
If you already have an account, simply input your email address and password to log in.
If you do not have an account already, click “Sign Up” to create one. Creating an account takes only a few minutes. Provide your email address and name and set a password. You’ll need to verify your email address by clicking the link in the email that is automatically sent to you during the signup process.
Navigation
On your first time logging into Trailmaker, you’ll be prompted to agree to the Trailmaker terms of use. Acceptance of these terms is mandatory to use Trailmaker.
After accepting the terms, you’ll be directed to a landing page that outlines the functionality of the two major modules of the platform:
- The Pipeline module is where you can upload your Parse Biosciences FASTQ files for alignment to your selected genome. The Pipeline module outputs include a Quality Control report, downloadable count matrices and automated integration with the downstream Insights module.
- The Insights module is where you can upload files that have already been pre-processed in order to conduct downstream analysis and visualization. Supported file types include: output files from the Pipeline module, count matrices of multiple technology types including Parse Biosciences, 10x Chromium™ and BD Rhapsody™, H5 files, and Seurat objects. Alternatively, you can choose to explore one of the demo datasets from the datasets repository.
When you’re logged into Trailmaker, you can navigate between the modules using the navigation bar on the left side. Note that when the Pipeline module is active, the tabs for the Insights module collapse, and vice versa:
Navigation to some modules may be restricted by the current state of your analysis. For example, navigation to the Pipeline Output tab of the Pipeline module is dependent on a Pipeline Run being executed. Similarly, navigation to the tabs within the Insights module (Data Processing, Data Exploration and Plots & Tables) is dependent on successful data upload to this module.
Account settings
The account settings menu can be found in the bottom left corner. In the account settings menu, you can change your name and password.
If you need to change the email address associated with your Trailmaker account, please email support@parsebiosciences.com.
Raw sequencing data compatibility with Trailmaker
Raw sequencing output files are usually in the form of FASTQ or raw binary base call (BCL) files. BCL files require conversion to FASTQ format for most downstream analysis protocols. FASTQ files are large data files containing raw sequence data and quality scores.
Which raw sequencing files are compatible with Trailmaker?
- BCL files are not directly compatible with Trailmaker.
- FASTQ files generated using Parse Biosciences’ Evercode Whole Transcriptome kits can be directly uploaded to the Pipeline module of Trailmaker.
- FASTQ files generated using other single cell technologies need processing to count matrices in order to be compatible with Trailmaker. Contact your single cell sequencing technology provider for more information on the relevant FASTQ file processing pipelines.
Compatible FASTQ files should be uploaded to Trailmaker’s Pipeline module.
Processed data compatibility with Trailmaker
The Trailmaker Insights module supports processed data files in a variety of formats and from several single cell sequencing technologies. These include:
- Count matrices generated using Parse Biosciences’ Evercode Whole Transcriptome technology that have been processed using the Parse Biosciences Pipeline. You should have 3 data files per sample: all_genes.csv, cell_metadata.csv, and count_matrix.mtx or DGE.mtx.
- Count matrices generated using 10x Chromium technology that have been processed using Cell Ranger. You should have 3 data files per sample: barcodes.tsv, features.tsv or genes.tsv, and matrix.mtx.
- Data generated using BD Rhapsody in the expression_data.st file format.
- Seurat V3 objects in the .rds format.
- H5 files in the matrix.h5 file format, such as those output from Cell Ranger.
These data files should be uploaded to Trailmaker’s Insights module.
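Before uploading count matrices, it can help to confirm that each sample folder holds the three expected files. The following is a minimal sketch, not part of Trailmaker: the function name and the technology keys are hypothetical, and the file names are the ones listed above for Parse Biosciences and 10x Chromium data. Gzip-compressed copies (ending in .gz) are treated as present.

```python
from pathlib import Path

# Required file names per technology, taken from the lists above.
# Each inner tuple holds the acceptable alternatives for one required file.
# The dictionary keys ("parse", "10x") are hypothetical labels for this sketch.
REQUIRED_FILES = {
    "parse": [("all_genes.csv",), ("cell_metadata.csv",), ("count_matrix.mtx", "DGE.mtx")],
    "10x": [("barcodes.tsv",), ("features.tsv", "genes.tsv"), ("matrix.mtx",)],
}


def _strip_gz(name):
    """Drop a trailing .gz so compressed and plain files compare equal."""
    return name[:-3] if name.endswith(".gz") else name


def missing_files(sample_dir, technology):
    """Return the required files that are absent from one sample folder."""
    present = {_strip_gz(p.name) for p in Path(sample_dir).iterdir() if p.is_file()}
    missing = []
    for alternatives in REQUIRED_FILES[technology]:
        if not any(name in present for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

An empty list means the folder has one file for each of the three requirements. The check looks at file names only and does not validate the file contents.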
Pipeline module
Overview
The Pipeline module is available for the processing of FASTQ files generated using Parse Biosciences’ Evercode Whole Transcriptome technology.
This FASTQ file processing module handles essential tasks such as barcode correction, read alignment, read deduplication, and transcript quantification. These quantified transcripts are then used to generate a cell-by-gene count matrix used for downstream analyses.
Further details about the Parse Biosciences pipeline are available to customers on the support suite.
Pipeline Run Details page
When you first navigate to the Pipeline module of Trailmaker, your list of Pipeline Runs will be empty. To start your first Pipeline Run with Parse Biosciences data, select the ‘Create New Run’ button:
This action opens the Pipeline module wizard, which guides you through Run creation, experimental information input, and data upload.
In the first step of the wizard, provide the new Run with a name and a description (optional).
Next, provide the details of the experimental setup using the dropdown menus. Specifically, select the Parse Biosciences technology that you used (Evercode WT Mini, Evercode WT, or Evercode WT Mega), and the chemistry version of your kit (v1, v2 or v3).
When you have input the kit type, another field will appear for you to select the number of sublibraries that you would like to process in the current pipeline run. Note that this field is dynamic, with the range determined by the kit choice. In the example below, the Evercode WT Mini kit is selected, and therefore the number of sublibraries can be 1 or 2. Click ‘Next’ when you are ready to proceed.
In the next step, upload your sample loading table in the .xlsm format. This should use the official Parse Biosciences sample loading table template for the relevant kit, available for customers to download from the support suite. Other Excel files will not work.
To upload your sample loading table, simply drag and drop the file into the box, and click ‘Upload’. The file will upload in just a few seconds. Once uploaded, you can view the file name, upload date/time, as well as the number of samples and the sample names. When you are ready, click ‘Next’.
In the next step, select the reference genome from the dropdown menu. If the genome you require is not available in the dropdown menu list, email us at support@parsebiosciences.com or through the Support button in the platform to request your genome of interest to be added.
In the final step of the wizard, FASTQ files are uploaded. Trailmaker offers two options for FASTQ file upload: by drag and drop into the current step of the wizard via your web browser or via console (command line) upload.
The instructions for direct FASTQ file upload to Trailmaker via the web browser are shown by default. Drag and drop your FASTQ files into the box, and click ‘Upload’.
Alternatively, to upload FASTQ files via the command line, select the ‘Console upload’ option. Start by downloading the ‘parse-upload.py’ script. Then, click the ‘Generate token’ button.
Once your token is generated, click the ‘Copy to clipboard’ button at the bottom of the script box.
Open your command line tool (for example, Terminal for Mac users or PowerShell for Windows users) and paste the copied script. There are two changes that you will need to make before running the script:
- Define the path to the parse-upload.py script that you downloaded from Trailmaker
- Define the path to the FASTQ files that you want to upload
When you run the script, you will be prompted to confirm the correct files for upload.
Data upload progress is then shown in the console.
The file upload progress is also reported in Trailmaker, indicating that the file is being uploaded from the console.
When upload is complete, both the console and Trailmaker FASTQ file upload modal report this.
Note that FASTQ file requirements in Trailmaker are as follows:
- FASTQ files from the same Parse Biosciences experiment that have different Illumina indexes should not be concatenated. These files are separate sublibraries.
- FASTQ files from the same Parse Biosciences experiment that share identical Illumina indexes must be concatenated. These files belong to the same sublibrary.
- When uploading FASTQ files, you must provide paired R1 and R2 files.
- You should provide exactly one pair of FASTQ files per sublibrary.
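If one sublibrary was sequenced across several lanes or runs, the files that share an Illumina index need to be joined before upload. The sketch below is one possible way to do this in Python and is not a Parse Biosciences tool. It assumes the FASTQ files are gzip-compressed and relies on the fact that gzip files can be appended byte for byte. The function name and all file names are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_fastq(parts, destination):
    """Append gzipped FASTQ files, in the given order, into one file.

    Gzip files can be joined byte for byte: the result is a valid
    multi-member gzip stream, so nothing is decompressed here.
    """
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(destination)


# Hypothetical file names for one sublibrary sequenced on two lanes.
# R1 and R2 must be joined in the same lane order so read pairs stay matched.
# concatenate_fastq(["S1_L001_R1.fastq.gz", "S1_L002_R1.fastq.gz"], "S1_R1.fastq.gz")
# concatenate_fastq(["S1_L001_R2.fastq.gz", "S1_L002_R2.fastq.gz"], "S1_R2.fastq.gz")
```

Running it once for the R1 files and once for the R2 files yields the single pair per sublibrary that the requirements above call for.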
Note the following details about the FASTQ file upload process in Trailmaker:
- Uploading large FASTQ files can take multiple hours or even days. You must keep your computer running and your browser tab open for the duration of the upload.
- If your internet connection fails, file upload will resume from the last checkpoint. Checkpoints are created every 128 MB.
- The FASTQ file size limit for upload to Trailmaker is 5TB per file.
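The checkpoint and size figures above lend themselves to a quick calculation before a long upload. This is a rough sketch only: the function name is hypothetical, and the byte values are assumptions, because the guide does not say whether MB and TB are binary or decimal units.

```python
CHECKPOINT_BYTES = 128 * 1024**2  # assumption: 1 MB = 1024**2 bytes
MAX_FILE_BYTES = 5 * 1000**4      # assumption: 1 TB = 10**12 bytes


def upload_plan(file_bytes):
    """Return (within_limit, number_of_chunks) for one FASTQ file.

    number_of_chunks is how many checkpoint-sized pieces the file spans,
    rounded up with integer arithmetic.
    """
    within_limit = file_bytes <= MAX_FILE_BYTES
    chunks = -(-file_bytes // CHECKPOINT_BYTES)  # ceiling division
    return within_limit, chunks


# A 50 GiB file is within the limit and spans 400 pieces of 128 MiB each.
print(upload_plan(50 * 1024**3))  # (True, 400)
```

Under these assumptions, a dropped connection should cost no more than one checkpoint's worth of data, since upload resumes from the last checkpoint.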
FASTQ files are deleted from Trailmaker 30 days after upload. After this time, your Pipeline Run Details and any Outputs will continue to be available but the FASTQ files will be marked as 'Expired'.
Running the Pipeline
Running the pipeline is blocked when any of the required fields in the Run Details page are incomplete. In this case, the ‘Run the pipeline’ button is disabled.
When all required fields are complete and the required data files have been successfully uploaded, all sections will be marked with a green tick and the ‘Run the pipeline’ button becomes enabled.
Clicking ‘Run the pipeline’ starts your pipeline run. For the first few minutes, the pipeline launches and does some initial checks. You can select to cancel the pipeline run.
Then, when the pipeline is fully running, the progress is shown, together with the option to view the current logs by selecting the sublibrary. Each sublibrary has its own log stream.
The duration of your pipeline run depends on the number of cells in your experiment as well as the sequencing depth. A typical WT Mini pipeline run time is 6-12 hours; for a WT kit it’s 12-24 hours; and for a WT Mega kit it could take 24+ hours.
Whilst your pipeline is running, you can navigate away from Trailmaker and shut down your computer - the pipeline will continue to run. You can choose to receive an email notification when your run is finished.
Pipeline Version
The Pipeline module in Trailmaker operates the Parse pipeline. The current and previous versions of the pipeline used in Trailmaker are reported below:
- From 7th November 2024 to date: v1.4.0
- From 26th March 2024 to 7th November 2024: v1.2.1
- From 21st March 2024 to 26th March 2024: v1.2.0
The pipeline version used to process your run in Trailmaker is stated at the bottom of the Pipeline Outputs page. We recommend that you report the pipeline version when publishing your data analysis.
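For older runs, the date ranges above can be turned into a simple lookup. The sketch below is illustrative and the function name is hypothetical. Because the listed ranges share their boundary dates, it assumes that the newer version applies on a changeover day. The version stated on the Pipeline Outputs page remains the authoritative record.

```python
from datetime import date

# (first day in use, version), newest first, from the list above.
PIPELINE_VERSIONS = [
    (date(2024, 11, 7), "v1.4.0"),
    (date(2024, 3, 26), "v1.2.1"),
    (date(2024, 3, 21), "v1.2.0"),
]


def pipeline_version_on(run_date):
    """Return the version in use on run_date, or None before the first entry.

    Assumption: on a changeover day the newer version applies.
    """
    for start, version in PIPELINE_VERSIONS:
        if run_date >= start:
            return version
    return None


print(pipeline_version_on(date(2024, 6, 1)))  # v1.2.1
```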
Pipeline Outputs
Successful pipeline runs will display the reports in the Pipeline Outputs tab for you to explore. The “all samples” report is shown by default. You can choose to view the reports for individual samples using the dropdown menu at the top of the page.
In the barcode rank plot, you’re looking for a clearly defined ‘knee’ with the threshold in the steepest part of the drop. This threshold is dynamically set and is likely to be different for different samples.
The QC metrics include the estimated number of cells as well as the median number of genes and transcripts per cell. These metrics can be compared across samples, and can be considered in the context of published data or your previous experiments.
Further metrics are available in the csv file that can be downloaded in the "Combined reports" option.
The plate heatmaps underneath the plots display transcripts and cells per well and are useful for catching pipetting and plate loading errors. Ideally, you'd like to see a homogeneous distribution across the plates with no streaks or outliers.
At the bottom of the Pipeline Outputs page, the pipeline version used to process your FASTQ files is stated. Further details of the pipeline versions used in Trailmaker can be found in the Pipeline Version section.
Pipeline outputs are available to download.
- The count matrices can be found in the “Unfiltered matrices” and “Filtered matrices” options. These are useful if you choose to perform downstream analysis outside of Trailmaker. Note that the filtered matrices expire 30 days after creation.
- The “Combined reports” option allows you to download the all_summaries.zip file which contains the html reports, QC metrics (as a csv file) and log files that are output from Parse’s pipeline (also known as split-pipe) combine mode.
- The “Sublibrary reports” option allows you to download the html reports, QC metrics (as a csv file) and log files for each independent sublibrary within your pipeline Run.
- The “All files” option contains the full pipeline output, including the alignment BAM files. Downloading the “All files” option might take a long time for large datasets. For users who are comfortable with the command line, the “All files” option can be downloaded by copying the download command. Note that the “All files” download option expires 30 days after the creation of the pipeline outputs following a successful pipeline run.
Failed pipeline runs give the option to download the logs for troubleshooting purposes. Contact us through the Support button in the platform if you encounter a failed pipeline run.
Share Pipeline Run Details and Outputs
Within the Pipeline module Run Details page, the 'Share' button enables data sharing between users.
Clicking the 'Share' button will open a modal where you can input the email address of the colleague, collaborator or Parse Biosciences team member with whom you want to share your Pipeline Run. Once the email address is inserted, click 'Done'.
Any individual with whom you share your Pipeline Run will have 'Explorer' permissions specifically for the Pipeline module. This means that the 'Explorer' user will be able to view the Pipeline Details and Pipeline Outputs, but they cannot make any changes to the pipeline run.
You can revoke access within the same 'Share' modal.
Note that any downstream analyses linked to this Pipeline Run (related projects in the Insights module) need to be shared separately. To do this, navigate to the Insights module Project Details page.
Automatic integration of Pipeline & Insights modules
The outputs of successful pipeline runs are automatically sent to the Insights module for downstream analysis and visualization. Simply click the “Go to Insights downstream analysis” button to navigate to the Data Exploration tab of the Insights module where you can begin to deep dive into your dataset.
From the Insights module, when a project has been generated automatically from a Pipeline run, you can navigate back to view the pipeline run outputs using the ‘Go to Pipeline Outputs’ button. The details of the related Pipeline Run are provided in the project description.
Insights module
Overview
The Insights module is where processed data files (such as count matrices, Seurat objects, etc.) can be uploaded for downstream analysis and visualization. This module offers advanced filtering and data cleanup, integration of multi-sample datasets, customization of data visualization and clustering, cluster annotation, and plot customization for the generation of publication-ready figures.
Insights Project Details page
Selecting ‘Insights’ in the left side navigation sidebar takes you to the Insights module Project Details page.
A list of all available projects in your account is displayed on this page, along with details about the selected project. Among these details are project name, description, sample list, and associated data and metadata. You can edit existing projects and create new projects on this page.
New users will see an empty project list.
By clicking ‘Create new project’, users can choose to upload their own data or to navigate to the datasets repository to explore one of our publicly available datasets:
Exploring an example dataset from the datasets repository
Trailmaker provides a repository of publicly available datasets that you can use to get started with the platform.
To access the public dataset repository click on the ‘Get started using one of our example datasets!’ button in the Project Details view. Alternatively, click on the ‘Create New Project’ button, and then choose ‘Select from Dataset Repository’.
Click ‘Explore’ on any dataset within the repository to add the selected project to your Project list.
The datasets repository allows you to quickly begin exploring a dataset in Trailmaker.
Uploading your own data
By clicking ‘Create New Project’ and selecting the ‘Upload Project’ option, you can begin to upload your own dataset to Trailmaker.
Now, you can name your project and add a project description (optional). Note that the project name must be different from other projects in your account.
Projects can be easily renamed in the list of projects by clicking on the Edit button next to the project name. Input the new project name and click save.
To begin uploading data to your new project, select the ‘Add data’ button. Samples that you want to analyze together should be uploaded to a single project.
The file format requirements for data upload to the Trailmaker Insights module vary depending on the Technology selection.
- For Parse Biosciences Evercode WT data that has already been processed using the Pipeline, you should upload the unfiltered count matrices. These are stored in the pipeline output folder entitled ‘DGE_unfiltered’. You should have the following 3 files for each sample in your experiment: all_genes.csv; cell_metadata.csv and count_matrix.mtx or DGE.mtx. The files are usually in gzip format i.e. ending in .gz. The files should be uploaded within folders that are named with the sample names. Note that the all-files unfiltered count matrices are not currently supported.
- For 10x Chromium count matrices that are output from Cell Ranger, we recommend uploading the unfiltered count matrices that can be found in the folders entitled ‘raw_bc_matrix’. You should have 3 data files per sample: barcodes.tsv; features.tsv or genes.tsv and matrix.mtx. The files are usually in gzip format i.e. ending in .gz. The files should be uploaded within folders that are named with the sample names.
- Data generated using BD Rhapsody in the expression_data.st file format. The zip files that are output by the primary processing pipeline contain the .st files that should be uploaded and they must be unzipped first. The folder with Multiplet and Undetermined cells should not be uploaded since it would distort the analysis. Note that AbSeq data is filtered out by default. After uploading your data, you can elect to include AbSeq data by checking the box in the Project Details page. Support for AbSeq is currently for visualization purposes only, as experiment-wide normalization will be slightly skewed. In case there is AbSeq data in your experiment, we suggest you create two projects; one including AbSeq data and one without, and compare the results.
- Seurat objects in the .rds format. There is a size limit of 15GB. If file size is over 15GB, try removing any assays not indicated in the list of requirements below. Ensure the default dimensionality reduction in your Seurat object is named exactly umap or tsne. If the default reduction name includes umap or tsne (e.g., ref.umap), it will be automatically renamed. If the default reduction is different and does not contain these names, the upload will not be successful. The Seurat object must contain the following slots and metadata:
- scdata$samples: sample assignment. If absent, it will be treated as a single-sample experiment.
- scdata[['RNA']]@counts: raw feature counts
- scdata@reductions: contains the embeddings for pca, as well as either umap or tsne
- Note that cluster metadata in scdata@meta.data is auto-detected
- Note that sample level metadata in scdata@meta.data that groups samples in scdata$samples is auto-detected for downstream analysis
- H5 files in the matrix.h5 file format, such as those output from Cell Ranger.
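The naming rule for the default dimensionality reduction in Seurat objects, described above, has three outcomes. The sketch below restates that decision logic in Python for clarity and is not Trailmaker's implementation. The function name is hypothetical, and two details are assumptions: that matching is case-sensitive, and that a renamed reduction ends up called umap or tsne.

```python
def resolve_reduction_name(default_reduction):
    """Return the name the default reduction would end up with, or None.

    None stands for the case where the upload would not succeed.
    Case-sensitive matching and the target name after renaming are
    assumptions made for this sketch.
    """
    if default_reduction in ("umap", "tsne"):
        return default_reduction  # already named exactly as required
    for accepted in ("umap", "tsne"):
        if accepted in default_reduction:
            return accepted  # e.g. ref.umap would be renamed
    return None  # no umap or tsne in the name


print(resolve_reduction_name("ref.umap"))  # umap
print(resolve_reduction_name("pca"))       # None
```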
Full instructions on the required data files are available in Trailmaker when you make the relevant selection in the ‘Technology’ dropdown menu.
You can rename the samples after upload.
Our team has previously provided a guide on converting csv and tsv data files into a format that’s compatible with upload to Trailmaker, which is available in this blog post.
If you are dealing with a file format that hasn't been listed above and you're not sure how to convert it, please use the Support button in the platform for assistance.
To upload data, simply drag and drop it into the data upload modal. You can remove unwanted files using the delete icon that appears next to each file. Select ‘Upload’ to begin uploading the files. Note that multiple samples can easily be uploaded at the same time.
The files will be compressed (if not yet so) before uploading. You can see the status of the upload for each file from the upload bar. Files that are getting compressed appear in orange. Successfully uploaded files appear in green. Files that fail to upload will show in red. Examples of these file upload statuses are shown below:
You can click on the “Uploaded” or “Upload error” text of the specific sample to see the details of the file. In the case of a successful upload, you will be able to download or replace the file. In case of a failed upload, the modal will show options to retry the upload or replace the file.
Once sample files have been uploaded, you can re-order samples in the sample list. Drag the sample to the desired position by using the button (3 lines) on the left of the sample name. The sample order on this page determines the order that samples appear in the other modules of Trailmaker. Sample order can be changed later in Data Exploration.
Adding metadata
The addition of metadata is important for multi-sample experiments in order to assign samples to groups. For example, samples within a dataset could be assigned as “control” and “treated”; or “healthy” and “disease”. Assigning metadata will then allow the comparison of groups to determine differentially expressed genes (e.g. to calculate differentially expressed genes in a cluster of interest comparing two groups) and visualization of groups (e.g. a dot plot showing the expression of multiple genes of interest across two or more groups) further downstream in the platform. Samples can be assigned to multiple metadata groups.
Sample-level metadata
Sample-level metadata provides context about the sample as a whole, with details like the biological source (organism, tissue, cell types), experimental conditions (treatment, disease, etc.) or collection datetimes (in time-course studies). Sample-level metadata is usually defined during the experimental design, and known a priori.
Once the samples are uploaded, you can add metadata to the samples by clicking the “Metadata” button, followed by “Sample level” and “Create track”. You will be asked to name the “metadata track”. This results in a column being added to the sample information table. Metadata can be assigned to each sample.
For example, you might label the metadata track “Treatment” and assign each sample as “control” or “treated”. Alternatively, you might label the metadata track “Tissue” and assign each sample as “blood” or “skin”. There is no limit to the number of metadata tracks that you can add to a project.
Alternatively, clicking “Metadata”, then “Sample level” and “Upload file” allows you to upload sample-level metadata in bulk in the form of a tab-separated (.tsv) file.
Once all the metadata has been inserted, click on “Process project” and confirm by clicking “Yes”. This will launch your data analysis.
Cell-level metadata
Cell-level metadata is specific to individual cells, and usually includes information that is variable from cell to cell, such as the cell type and subtypes, measures for different quality metrics (read counts per cell, number of genes, proportion of mitochondrial genes, etc), information about cell cycle stage, clustering assignment, among others. Cell-level metadata is usually generated by processing the data, by running different types of analyses whose output is a different value for each cell, irrespective of the sample that they belong to.
You can upload cell-level metadata to Trailmaker by clicking on the “Metadata” button followed by “Cell-level”. A modal will open where you can drag and drop your cell-level metadata TSV file.
The file must be structured as a tidy, long format table. That is, each row should contain the information for a single cell, with each variable (cell type, cell subtype, proportion of mitochondrial content) as a column, as shown in the following screenshot.
The ‘barcode’ column is mandatory: it identifies each cell, which allows the values in the table to be matched to the corresponding cells. In addition, a ‘sample’ column is highly recommended, in case there are duplicate barcodes between samples. If there is no ‘sample’ column in the uploaded table and there are duplicate barcodes, no metadata will be added to those cells (because there is no way to tell which cell the data belongs to).
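A cell-level metadata table in this layout can be written with a few lines of Python. The sketch below is an illustration under stated assumptions: the ‘barcode’ and ‘sample’ column names come from the description above, while the function names and any other column (such as cell_type) are hypothetical. The second function flags barcodes that repeat without a ‘sample’ value to tell them apart.

```python
import csv
from collections import Counter


def write_cell_metadata(rows, path):
    """Write one row per cell to a tab-separated file.

    rows is a list of dicts that share the same keys; 'barcode' is required.
    """
    fieldnames = list(rows[0].keys())
    if "barcode" not in fieldnames:
        raise ValueError("the 'barcode' column is mandatory")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


def ambiguous_barcodes(rows):
    """Return barcodes that repeat and are not separated by a 'sample' value."""
    keys = Counter((row["barcode"], row.get("sample")) for row in rows)
    return sorted({barcode for (barcode, _), n in keys.items() if n > 1})
```

Calling ambiguous_barcodes on the rows before writing the file gives a quick indication of which cells might receive no metadata after upload.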
Launching an Insights project analysis
Clicking on the “Process project” button initiates the conversion of the count matrices into a Seurat object. Sample names and metadata that were input in the Insights module Project Details page are inserted into the Seurat object.
The progress of this conversion process is displayed to the user. This step might take some time for large datasets, so you can opt to get notified via email once this step is completed and leave the screen.
The generation of the Seurat object is an essential prerequisite to all downstream analysis in Trailmaker. If the generation of the Seurat object fails, you will see an error screen like the one below. You can try to re-run the process, or return to the Insights module Project Details page (click 'Insights') where you can choose to launch another analysis.
If the generation of the Seurat object completes successfully, the Insights module Data Processing pipeline will be triggered automatically to run using our automatically determined settings. For more information, see the chapter of this user guide that’s dedicated to the Data Processing module.
Share a project
Insights module Projects can be shared with colleagues or collaborators using the “Share” button in the Insights Project Details page (accessible by clicking 'Insights').
In the ‘Share’ pop-up, you can input the email addresses of your collaborators. Additionally, you can revoke access to the project for specific collaborators in the same modal.
You will remain the owner of the project: owners have control over the upload of sample data, the addition of metadata, the running of the Insights module Data Processing pipeline, and the sharing of projects.
Anyone you share a project with will have explorer roles: explorers can use the Data Exploration and Plots and Tables modules, but cannot make any changes to samples or metadata in the Insights Project Details page or re-run the pipeline in the Data Processing module.
Note that any upstream analysis linked to this Insights Project (related Run in the Pipeline module) needs to be shared separately. To do this, navigate to the Pipeline module Run Details page.
Copy a project
The 'Copy' button in the Insights module Project Details page allows you to quickly and easily create a copy of an existing project. In doing so, you can create multiple versions of analysis of a dataset, for example, to compare different data processing settings side-by-side.
Click ‘Copy’ and a copy of your project automatically appears in your project list.
Downloading data from a project
All uploaded data files can be downloaded from the sample list view by clicking on the green ‘Uploaded’ text for each file.
Additionally, the processed Seurat object can be downloaded as an R data file (.rds) using the ‘Download’ button at the top of the Project Details panel.
The Insights module Data Processing settings (which contain the values for every parameter) can also be downloaded from here as a text (.txt) file.
Insights module - Data Processing tab
Overview
Data generated from a single cell RNA-sequencing experiment always requires filtering and cleanup. During data processing, background, dead cells, doublets and poor quality cells are excluded from the downstream analysis. These steps ensure that the processed data are high quality and return accurate results during downstream analysis.
After data upload, you will be prompted to process your project. Trailmaker applies a default data processing pipeline to your dataset in the Data Processing tab within the Insights module to prepare it for analysis and visualization. The default data processing pipeline applies automatic default settings to your dataset so that you can immediately access and explore the first pass of the analysis. All data processing settings can be adjusted to your preferences.
The Insights module data processing pipeline consists of 7 sequential steps. The output of each step in this module becomes the input for the next step. Steps 1-5 consist of filters to remove unwanted and poor quality data from each individual sample. In step 6, multiple sample datasets are integrated to remove batch effects, and dimensionality reduction is performed. Finally, in step 7, the embedding is configured (e.g. UMAP or t-SNE) and clustering is applied.
The filtered, integrated data with clustering is then available for downstream exploration and visualization in the Data Exploration and Plots and Tables tabs within the Insights module.
Automated data processing
In the Insights module Project Details page of Trailmaker, the data processing pipeline is triggered by the ‘Process Project’ button. This first run of the data processing pipeline uses default settings to dynamically estimate appropriate thresholds for filtering, and standard settings for integration and clustering.
The automated data processing pipeline sets each value according to either current best practice in the field or the spread of the data in each sample. Specific details on the default values for each step in the data processing pipeline can be found in the Data Processing Steps section below.
Data processing pipeline status indicator
At the top right of the page in the Data Processing tab, there is a pipeline status indicator. When the data processing pipeline is complete, the pipeline status indicator will appear green (screenshot A, below), whilst steps that are in progress appear gray (screenshot B). If the pipeline fails, the indicator will appear incomplete and marked as failed (screenshot C). The step that is currently being viewed is marked in orange (screenshots A-C).
For more information on what to do if your pipeline fails, see the Pipeline Failures section below.
Navigating through the data processing steps
Trailmaker has the following steps in the Data Processing module:
1. Classifier filter
2. Cell size distribution filter
3. Mitochondrial content filter
4. Number of genes vs transcripts filter
5. Doublet filter
6. Data integration
7. Configure embedding
You can navigate between these filters using the dropdown menu on the top left of the page or the navigation arrows on the top right of the page.
The dropdown menu and the status bar also show if the step is completed or not. Steps with a check mark (✔) to the left are complete; steps with a cross mark (❌) to the left have failed.
Filtering steps (1-5) can be disabled using the ‘Disable’ button at the top of the page. Filtering steps that are disabled are shown in the dropdown menu with the step name in strikethrough.
In filtering steps 1-5, the samples available within the project are listed vertically with one plot for each sample. You can scroll through the samples easily. Individual sample plots can be minimized by clicking on the sample name above the plot.
Data processing plots and statistics
For the filtering steps (steps 1-5), data is filtered on a per sample basis. A plot is shown for each individual sample within each filter. Each plot can be fully customized to your design preferences using the ‘Plot styling’ menu.
Below each sample plot, there is a table that describes the filtering statistics for each sample: ‘# before’ describes the number of barcodes present in the sample before the current filtering step; ‘# after’ describes the number of barcodes present in the sample after the current filtering step; ‘% changed’ describes the proportion change in barcode number as a result of the current filtering step.
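As a rough illustration of how the three columns relate to each other, here is a minimal Python sketch. The function name and the sign convention (a negative value when barcodes are removed) are assumptions made for this example, not a description of how Trailmaker computes or displays the table.

```python
def filter_stats(n_before: int, n_after: int) -> dict:
    """Summarise one filtering step for one sample."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    # Percentage change in barcode number relative to the input of the step.
    pct_changed = 100.0 * (n_after - n_before) / n_before
    return {
        "# before": n_before,
        "# after": n_after,
        "% changed": round(pct_changed, 2),
    }


# Hypothetical sample: 5000 barcodes enter the step and 4600 remain.
print(filter_stats(n_before=5000, n_after=4600))
```

Because the output of each step is the input of the next, the ‘# after’ value of one step should match the ‘# before’ value of the following step for the same sample.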
An example filtering plot and associated statistics table for a single sample in filtering step 4 is shown below:
Data Processing steps
Step 1: Classifier filter
The classifier filter aims to exclude background and retain cells. To achieve this, the filter uses the ‘emptyDrops’ method to calculate the False Discovery Rate (FDR), a statistical value which represents the probability that a droplet is empty (read more about this method here). The default FDR value is 0.01 for all samples. Only data points with FDR < 0.01 are retained. Therefore, in this step, barcodes with low FDR are retained for downstream analysis, whilst barcodes with a high FDR are removed from downstream analysis.
Note that step 1, classifier filter, is not needed for Parse Biosciences data, as the emptyDrops method is only needed for droplet based technologies. You can choose to enable this filter using the Enable button at the top of the page. For 10x Chromium and BD Rhapsody data, this filter is enabled by default. Note that for datasets that have been pre-filtered (e.g. 10x Chromium data that has been filtered in Cell Ranger) the Classifier filter is disabled.
Two plots are provided to visualize the data in the classifier filter. The first plot is a knee plot, which shows where the FDR threshold falls relative to the transcript counts of the barcodes. The knee plot ranks cells by the number of distinct transcripts for each barcode on a logarithmic scale. Using the log value of the transcripts exposes a “knee” on the graph curve where the number of transcripts drops sharply. The turning point in the “knee” is usually used as the point to set the FDR threshold. Barcodes with low transcript counts have a higher probability of being background or empty droplets, so barcodes above the FDR threshold (orange) are filtered out. The cells in the green region have an FDR < 0.01 and are retained. The gray “mixed” region contains some cells that are retained and some cells that are filtered out.
The second plot is an “empty drops plot”. This is an alternative visualization of the data which plots the number of transcripts against the probability that the cell is a real data point or background. The red line shows the threshold value that is set to filter the cells. Cells below this red line are retained, while cells above the red line are excluded from downstream analysis.
The default FDR value is set to 0.01 in this filtering step. This is the standard threshold used for the emptyDrops method. Although it is possible to override this filter threshold, we do not recommend that you do so.
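The retention rule described above can be sketched in a few lines of Python. This is only an illustration of the thresholding step: it assumes the per-barcode FDR values have already been computed by emptyDrops, and the function name and barcode labels are hypothetical.

```python
def classifier_filter(fdr_by_barcode: dict, fdr_threshold: float = 0.01) -> list:
    """Return the barcodes whose FDR is strictly below the threshold."""
    return [
        barcode
        for barcode, fdr in fdr_by_barcode.items()
        if fdr < fdr_threshold
    ]


# Hypothetical FDR values for four barcodes.
example_fdr = {"AAAC": 0.0004, "AAAG": 0.2, "AACT": 0.009, "AAGT": 0.01}

# A barcode with FDR exactly equal to the threshold is not retained,
# because the rule is FDR < 0.01.
print(classifier_filter(example_fdr))  # ['AAAC', 'AACT']
```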
Step 2: Cell size distribution filter
The cell size distribution filter can be used to fine-tune the classifier filter, by further discarding background from your dataset. For Parse Biosciences data, this is the main filter that is used to exclude background. For other data types (including 10x Chromium and BD Rhapsody), this filter is disabled by default though you can choose to enable it. You will then need to re-run the pipeline in order to apply the changes.
Unlike the classifier filter which works on probability, this filter sets a hard threshold on the minimum number of transcripts. Cells with transcript counts lower than this threshold are filtered out.
The data for this filter is visualized as a knee plot. The plot ranks cells according to the number of transcripts on a logarithmic scale. The inflection point around the “knee” signifies the threshold at which the number of transcripts in a cell changes drastically. Note that the cell rank on the x-axis is on a logarithmic scale, which means the area under the curve does not proportionally represent the number of cells that are filtered / unfiltered.
The second plot view in this filter is a histogram that shows the number of cells that are affected by the cell size distribution filter. This histogram visualizes cells below (orange) and above (green) the set threshold. The orange cells are filtered out of the dataset whereas the green cells are retained. If the histogram plot shows a bimodal distribution (two distinct peaks) then consider switching on this filter. For example, in the histogram plot below the cells identified in orange may in fact be background and you should consider filtering them out.
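To make the hard threshold and the knee plot ranking concrete, here is a minimal Python sketch. The function names and the example counts are hypothetical, and the sketch does not attempt to reproduce how Trailmaker chooses the default threshold; it only shows how a given threshold splits the cells.

```python
def knee_plot_points(transcript_counts):
    """Return (rank, count) pairs; rank 1 is the cell with the most transcripts."""
    ordered = sorted(transcript_counts, reverse=True)
    return list(enumerate(ordered, start=1))


def cell_size_filter(transcript_counts, min_transcripts):
    """Split cells into retained (>= threshold) and removed (< threshold)."""
    retained = [c for c in transcript_counts if c >= min_transcripts]
    removed = [c for c in transcript_counts if c < min_transcripts]
    return retained, removed


# Hypothetical sample with three likely cells and three likely background barcodes.
counts = [5200, 4100, 3900, 150, 90, 40]

retained, removed = cell_size_filter(counts, min_transcripts=1000)
print(knee_plot_points(counts)[:2])  # [(1, 5200), (2, 4100)]
print(retained)  # [5200, 4100, 3900]
print(removed)   # [150, 90, 40]
```

Both axes of the knee plot are logarithmic, so when plotting these pairs you would typically take the log of both the rank and the count.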
Step 3: Mitochondrial content filter
Cells may be alive, dead or dying. The mitochondria of dead and dying cells rupture, spilling out transcripts of mitochondrial genes into the cell. The presence of these mitochondrial gene sequences can skew the analysis results, as transcripts from live cells are often of interest rather than transcripts of dead cells. Thus, for most datasets it is advisable to filter out dead cells from the analysis.
The mitochondrial content filter removes dead and dying cells by looking at the percentage of mitochondrial transcripts and setting an appropriate threshold. Cells with mitochondrial content higher than the threshold are removed from downstream analysis.
The default threshold for the proportion of mitochondrial genes is calculated per sample. The typical cut-off range is 5-30% of mitochondrial reads per cell, with the default cut-off in Trailmaker determined as 3 median absolute deviations above the median.
Two plot views are available in this filter. The first plot is a histogram which shows percentages of mitochondrial reads and their corresponding proportions of cells. The percentage of mitochondrial reads is the percentage of transcripts mapped to mitochondrial genes from total number of transcripts. Dead cells (blue) are filtered out and live cells (green) are retained.
The second plot is a scatter plot which shows the total number of transcripts in each cell plotted against the percentage of mitochondrial reads. Each dot in this plot is an individual cell. As in the previous plot, the dead cells are filtered out (blue) and live cells (green) are retained.
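The default cut-off described above (3 median absolute deviations above the median) can be sketched as follows. This is a simplified illustration: it uses the raw, unscaled MAD, and whether Trailmaker applies a scaling constant to the MAD is not stated here, so treat the exact numbers as an assumption. The function name and example values are hypothetical.

```python
import statistics


def mito_threshold(pct_mito, n_mads=3.0):
    """Median plus n_mads median absolute deviations (unscaled MAD)."""
    median = statistics.median(pct_mito)
    mad = statistics.median([abs(x - median) for x in pct_mito])
    return median + n_mads * mad


# Hypothetical percentages of mitochondrial reads for six cells in one sample.
pct_mito = [2.0, 3.0, 4.0, 5.0, 6.0, 40.0]

threshold = mito_threshold(pct_mito)
live_cells = [x for x in pct_mito if x <= threshold]

print(threshold)   # 9.0
print(live_cells)  # [2.0, 3.0, 4.0, 5.0, 6.0]
```

Because the median and MAD are robust to outliers, the single cell at 40% barely influences the threshold, which is why this kind of rule adapts well to each sample.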
Step 4: Number of genes vs transcripts filter
The number of genes vs transcripts filter works on the principle that the number of unique transcripts increases linearly with the number of genes. Cells that deviate from this linear relationship fall into one of two categories:
(1) Cells contain a lot of genes but few transcripts. This means transcripts are not amplified well.
(2) Cells contain few genes but a lot of transcripts. This means that the few transcripts that exist are over-amplified.
This filter visualizes the data using a scatter plot to map the number of gene counts on a logarithmic scale against the number of transcripts on a logarithmic scale. The range of acceptable data points is defined with 2 linear thresholds, signified by red lines. Cells not located between the two red lines are considered outliers and are filtered out.
The scatter plot is interactive - moving the red prediction interval lines will help you to choose the most appropriate value to filter cells in your samples. To do so, override the automatic settings as shown in the Adjusting a data processing setting section, and use the prediction interval slider to choose your preferred values.
If one or more samples in your dataset contains a separate population of cells in this filter plot, such as in the example plot above, then we recommend further investigating the population to determine if it should be excluded or retained. One way to do this is to disable the filter (using the ‘Disable’ button at the top of the page) which will retain all cells for downstream analysis, and allow you to further investigate the secondary population in downstream modules in Trailmaker.
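The idea behind this filter can be sketched in Python as a straight-line fit on log-transformed counts, followed by removal of cells that fall too far from the line. Note that this is a deliberately simplified stand-in: it uses a band of a fixed number of residual standard deviations around the fit rather than a true prediction interval, and the function names, the band_width parameter and the example data are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def genes_vs_transcripts_keep(n_transcripts, n_genes, band_width=3.0):
    """Return one boolean per cell: True if the cell lies inside the band."""
    xs = [math.log10(t) for t in n_transcripts]
    ys = [math.log10(g) for g in n_genes]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [abs(r) <= band_width * resid_sd for r in residuals]


# Hypothetical sample: 30 well-behaved cells plus one cell with many
# transcripts but very few genes (category 2 above).
transcripts = [1000 + 100 * i for i in range(30)] + [2000]
genes = [round(t ** 0.9) for t in transcripts[:30]] + [50]

keep = genes_vs_transcripts_keep(transcripts, genes)
print(f"{keep.count(False)} of {len(keep)} cells fall outside the band")
```

Widening band_width in this sketch plays a similar role to loosening the prediction interval slider: fewer cells are treated as outliers.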
Step 5: Doublet filter
Doublets contain the content of multiple cells which can lead to skewed data and false conclusions, especially concerning cellular heterogeneity and identity. There are several reasons for doublet occurrence, which can vary according to the technology type.
The doublet filter calculates the doublet probability for all cells and filters out cells with a high probability of being a doublet. Calculation of the probability is carried out using the scDblFinder algorithm (a detailed explanation of this method can be found here).
This filter sets a hard threshold above which cells are filtered out. This threshold is marked by the red line in the provided plot for this filter. The plot shows the proportions of cells and their corresponding probabilities of being doublets.
For samples that contain few cells, the calculation of doublet score probabilities has less power and, therefore, tends to show more cells with an intermediate score between 0.2 and 0.8. Care should be taken to check the doublet filter threshold for samples with few cells.
Step 6: Data integration
The Data Integration step removes batch effects and reduces the dimensionality of the data.
Batch effects are variations caused by differences in experimental conditions, introducing noise which skews the true variation for a sample. Runs of different samples have different values of noise. Hence, comparing these samples directly without addressing batch effects would compound the noise. Removing batch effects enables comparison and composition of samples analyzed in different runs with minimized error. In essence, batch effect correction ensures that downstream analysis focuses on real biological differences between samples, rather than irrelevant sample-to-sample or batch-to-batch variation.
Three data integration methods are available – Harmony, Fast MNN, and Seurat v4. You can also select ‘No integration’. Harmony is selected as default. However, you can select the integration method and set the controls based on your requirements.
Normalization is applied to each sample before integration. There are several methods to achieve normalization; the default method in Trailmaker is LogNormalize. ‘SCTransform’ claims to recover sharper biological distinction compared to log-normalization. SCTransform can only be applied when the integration method is set to Seurat v4.
Dimensionality reduction reduces the complexity of the dataset while preserving variation. In essence, dimensionality reduction “compresses” the data to enable visualization in 2 dimensions. There are many methods of dimensionality reduction, but one of the most popular in the field is Principal Component Analysis (PCA). This method introduces principal components (PCs) - a linear combination of variables in the data that better explain variations. The largest variance is accounted for by the first PC, the second largest variance by the second PC, and so on.
PCA is great for high dimensional data, but it is not optimized to generate 2-dimensional embedding. In practice, PCA is used to reduce the raw data into a lower dimension, acting as a pre-processing step. The resulting data is fed into other dimensional reduction algorithms, such as UMAP or t-SNE, to reduce the data into 2 dimensions.
Normalization can be biased by certain gene categories, such as ribosomal, mitochondrial and cell cycle genes. In the data integration step, these three gene categories can be excluded from the analysis if you are working with human or mouse data. For example, cell cycle genes should be removed if sampling timepoints occurred throughout the day. Those genes can otherwise introduce within-cell-type heterogeneity that can obscure the differences in expression between cell types. To mitigate this, cell cycle genes can be excluded from the analysis of human and mouse species under ‘Dimensionality reduction settings’.
- Ribosomal genes are excluded based on the selection of genes that contain “rps”, “rpl”, “mrps” or “mrpl”, as well as the following three specific genes: FAU, UBA52 and DAP3. If this gene nomenclature is not true for your species, this feature will not work and should not be used.
- Mitochondrial genes are excluded based on the selection of genes that start with “mt-”. If this gene nomenclature is not true for your species, this feature will not work and should not be used.
- Cell cycle genes can be excluded from human and mouse datasets only. Trailmaker uses the list of cell cycle genes reported in the following article: Tirosh et al. “Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq.” Science (New York, N.Y.) vol. 352,6282 (2016): 189-96. doi:10.1126/science.aad0501. If you are using a species other than human or mouse, this feature will not work and should not be used.
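The name-based rules for ribosomal and mitochondrial genes listed above can be sketched in Python as follows. Matching is done case-insensitively here, which is an assumption made for this example; the function names and the gene list are hypothetical. Cell cycle genes are not covered because they come from a fixed published list rather than a naming pattern.

```python
import re

# Substrings named in the rules above; "mrps"/"mrpl" already contain
# "rps"/"rpl" but are listed to mirror the description.
RIBOSOMAL_PATTERN = re.compile(r"rps|rpl|mrps|mrpl", re.IGNORECASE)
RIBOSOMAL_EXTRA = {"FAU", "UBA52", "DAP3"}


def is_ribosomal(gene: str) -> bool:
    return bool(RIBOSOMAL_PATTERN.search(gene)) or gene.upper() in RIBOSOMAL_EXTRA


def is_mitochondrial(gene: str) -> bool:
    return gene.lower().startswith("mt-")


def exclude_genes(genes, ribosomal=True, mitochondrial=True):
    """Return the genes that remain after the selected exclusions."""
    kept = []
    for gene in genes:
        if ribosomal and is_ribosomal(gene):
            continue
        if mitochondrial and is_mitochondrial(gene):
            continue
        kept.append(gene)
    return kept


gene_names = ["RPS6", "Rpl13a", "MRPL12", "mt-Co1", "MT-ND1", "FAU", "ACTB", "Cd3e"]
print(exclude_genes(gene_names))  # ['ACTB', 'Cd3e']
```

A quick check like this against your own gene names is one way to see whether the nomenclature of your species matches the patterns before enabling these options.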
There are 3 plot views available in the data integration step. To change the plot type, select the desired plot under the plot view menu.
The first plot is a preview of the embedding generated after the dimensional reduction. This plot is only available for multi-sample datasets, and is not available in projects with only one sample.
The second plot is a frequency plot which shows the contribution of each sample to each cluster. This plot is also only available in projects with multiple samples.
These first two plot views allow you to assess the quality of the integration of multi-sample datasets. Well integrated datasets will display good distribution of each sample across all clusters.
The third plot is an elbow plot which maps the percentage contribution of each Principal Component (PC) to the total variation in the dataset. The default setting for the number of PCs is defined by Trailmaker as the number of PCs that explains 85% of the variation if that number is less than 30, or 30 PCs otherwise.
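The default rule for the number of PCs can be expressed as a short Python sketch. It assumes you have the percentage of variation explained by each PC in order (the values shown in the elbow plot); the function name and example values are hypothetical.

```python
def default_n_pcs(variance_explained_pct, target_pct=85.0, max_pcs=30):
    """Smallest number of PCs reaching target_pct, capped at max_pcs."""
    cumulative = 0.0
    for n_pcs, pct in enumerate(variance_explained_pct, start=1):
        cumulative += pct
        if cumulative >= target_pct:
            return min(n_pcs, max_pcs)
    # Target never reached with the PCs provided: fall back to the cap.
    return min(len(variance_explained_pct), max_pcs)


# Hypothetical dataset where the first four PCs explain 85% of the variation.
print(default_n_pcs([40.0, 20.0, 15.0, 10.0, 5.0, 4.0, 3.0, 2.0, 1.0]))  # 4

# Hypothetical dataset where 85% is only reached after 43 PCs, so the cap applies.
print(default_n_pcs([2.0] * 50))  # 30
```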
Downsampling
Large datasets (e.g. >100,000 cells) can be downsampled specifically for the integration step. This reduces the time it takes to integrate large datasets using some methods (especially Seurat v4 and Fast MNN) and enables large datasets to successfully complete the pipeline. Once the data are integrated, the full data are available for downstream analysis and visualization.
Geometric sketching finds random subsamples of a dataset that preserve the underlying geometry, which is described in this paper: Geometric sketching compactly summarizes the single-cell transcriptomic landscape. In short, geometric sketching divides the transcriptional space into variable-sized hypercubes and then randomly samples the same number of cells from each of the cubes; the resulting sketches preserve the data structure and put more emphasis on small and underrepresented cell types, leading to improvements even over using the whole dataset.
You can downsample your data under Downsampling Options.
Then change the Method to Geometric sketching. If you wish, you can also change the percentage of cells to keep.
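To give an intuition for why sampling evenly across regions of the space favors rare cell types, here is a toy Python sketch. It is not the geometric sketching algorithm itself: it simply bins cells into fixed-size grid boxes and draws cells from the boxes in turn. The function name, the cell_size parameter and the example coordinates are all hypothetical.

```python
import random
from collections import defaultdict


def grid_sketch(points, n_keep, cell_size, seed=0):
    """Pick n_keep point indices by sampling evenly across occupied grid boxes."""
    rng = random.Random(seed)
    boxes = defaultdict(list)
    for index, point in enumerate(points):
        # A box is identified by the integer grid coordinates of the point.
        key = tuple(int(coord // cell_size) for coord in point)
        boxes[key].append(index)
    for members in boxes.values():
        rng.shuffle(members)

    selected = []
    box_list = list(boxes.values())
    # Round-robin over boxes so each occupied box contributes equally
    # until it runs out of cells or the target size is reached.
    while len(selected) < n_keep and any(box_list):
        for members in box_list:
            if members and len(selected) < n_keep:
                selected.append(members.pop())
    return sorted(selected)


# Hypothetical 2D embedding: 90 cells of a common type, 10 of a rare type.
common = [(0.01 * i, 0.5) for i in range(90)]
rare = [(5.0 + 0.01 * i, 5.5) for i in range(10)]
points = common + rare

sketch = grid_sketch(points, n_keep=10, cell_size=1.0)
n_rare = sum(1 for index in sketch if index >= 90)
print(f"{n_rare} of {len(sketch)} sampled cells are from the rare type")
```

In this toy example the rare type makes up 10% of the cells but half of the sketch, whereas uniform random sampling would be expected to pick about one rare cell in ten.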
Step 7: Configure embedding
In the last step of the Data Processing module, integrated data is further reduced into a 2-dimensional embedding. An embedding is a space which allows for the translation of data of a high dimension into a low dimensional space. High dimensional data represents a data set where the number of features is higher than the number of samples. The low dimensional space should represent the meaningful properties of the high dimensional data.
Trailmaker provides two methods to visualize embedding: UMAP and t-SNE. UMAP is a more recent technique with an algorithm that is more readily adjustable to parallelization and works faster than t-SNE. Hence, UMAP scales better for large datasets compared to t-SNE. Generally, it is recommended to use UMAP embedding to visualize your data.
After creating the embedding, the embedded data points are clustered and colored according to those cluster annotations. Clustering is the process of grouping cells of high similarity. There are several clustering methods available, but the most widely used are the Louvain and Leiden methods. Trailmaker uses the Leiden clustering method by default. Clusters are color-coded and numbered so they can be identified and explored in downstream analysis in Trailmaker.
The clustering result can be modified by adjusting the clustering resolution in the clustering settings menu in step 7 of the Data Processing tab within the Insights module. The embedding and clustering results that are produced in this step propagate to the Data Exploration and Plots and Tables modules of Trailmaker.
There are two plot views available in step 7 where the embedding and clustering settings are configured, as well as multiple metrics that can be viewed on the plots. To change the type of the embedding plot, select the desired plot under the plot view menu.
Selecting ‘Embedding’ shows a UMAP embedding by default, showing cells from all samples clustered and colored according to the Leiden clustering algorithm. The plot view can be changed to any of the options in the ‘Colour plot by’ menu, such as Number of genes.
This plot can be useful for identifying clusters or areas in the embedding plot that have particularly high or low metrics. For example, if high mitochondrial content is concentrated in one cluster, this would suggest that there is a population of dead cells that is clustered together. In this case, we recommend returning to the mitochondrial content filter (step 3) and reducing the threshold of the percentage of mitochondrial reads to try to remove the cluster.
Selecting ‘Violin’ switches the plot to a violin plot. This plot is particularly useful for visualizing the quality control metrics, such as number of genes, across the samples in your dataset:
Adjusting a data processing setting
The default settings for each data processing step can be overridden using the ‘manual’ button in the Filtering Settings menu that is available within each step. For filtering steps 1-5, the setting can be altered for one specific sample or the adjusted setting can be applied to all samples using the ‘Copy to all samples’ button.
Specific filtering steps (steps 1-5) can be disabled using the ‘Disable’ button at the top of the filter. Using this button will switch off the selected filter so that no cells are filtered out at this step, and all cells are taken forward to the next step of data processing.
When a data processing setting is adjusted or a filter is disabled or enabled, you will be prompted to re-run the data processing pipeline. You can elect to ‘Run’ the pipeline or ‘Discard’ the changes. When the pipeline reruns, only the steps that have adjusted settings will be re-run. Note that re-running the data processing pipeline is likely to take several minutes, with the exact time dependent on the size of your dataset.
Pipeline failures
It’s important to note that the data processing steps may fail for a variety of reasons. One major reason for data processing failures is if there is not enough data to be processed. Steps 1-5 in the pipeline can fail if the number of cells is very low, and the data integration step (step 6) can fail if the number of cells is lower than 100.
If this happens and your pipeline fails, there are several ways that you can address this problem:
- Reduce the number of cells being filtered out: You can alter the filtering settings in steps 1-5 (e.g. by lowering thresholds) or disable specific filters to reduce the number of cells that are filtered out which will, therefore, increase the number of cells proceeding to downstream processing and analysis. However, this can lead to poor quality cells being included in the downstream analysis.
- Elect to use no integration: You can select ‘No integration’ in the data integration settings in step 6. This may result in a successful pipeline run, but there may be significant batch effects that have not been removed from the analysis. If you choose to select ‘No integration’, we recommend that integration quality is checked using the embedding and frequency plots in step 6.
- Exclude the sample(s) with too few cells: It is possible to exclude the problem sample(s) that has too few cells. This can be done in the existing project by setting a high threshold in one of the filtering steps in order to filter out 100% of the cells in that sample. Alternatively, you could delete the problem sample(s) in the Insights module Project Details page in the existing project or create a new project and upload only the other samples, excluding the sample(s) with few cells.
If your pipeline fails and you are not sure how to fix the issue, use the Support button in the platform for assistance.
Saving a processed project
Whilst the project is being processed, you can leave the screen, log out of Trailmaker and close your web browser without affecting the processing - the data processing pipeline will continue to run. You can elect to receive an email confirmation when your data processing pipeline completes using the ‘Receive email notifications’ toggle button when you first process a project.
The processed project is saved automatically by Trailmaker. When you log out and then return to the platform, you can immediately view the processed Insights project.
Exporting the data processing plots
All plots that are available in the data processing module can be fully customized using the ‘Plot styling’ menu. You may want to consider including these quality control plots in your manuscript as evidence of sample quality.
To download a plot from the data processing module, select the ‘...’ menu on the top right of the plot. Download options include SVG (high resolution) and PNG (lower resolution).
Downloading the data processing settings
All data processing settings can be downloaded as a text file (.txt) using the ‘Download’ button in the Project Details page of the Insights module.
We recommend that you report these data processing settings (filtering thresholds, integration method, etc.) in your manuscripts. We can assist with writing this paragraph, if needed - use the Support button in the platform for assistance.
Summary of Insights Data Processing tab
The Data Processing tab within the Insights module of Trailmaker filters out background, dead cells, doublets and low-quality cells, removes batch effects and reduces the dimensionality of the data. Data processing is an essential prerequisite for downstream data analysis and visualization.
Data Exploration
Overview
The Data Exploration tab within the Insights module of Trailmaker has a wide variety of features for in-depth exploration of your data. Using this module, users can identify which cell types are represented by their cell sets, fully customize cell set selection, and generate insight into the dataset using gene expression visualization and differential expression.
Custom cell sets can be created using selection tools, based on the expression of one or more genes, or by manipulating the default Leiden or Louvain clusters. It's easy to rename clusters or recolor by sample, metadata, or gene. Standard analysis actions such as marker heatmap and UMAP are pre-loaded. Cell set annotation can be done automatically using the scType annotation method or manually using the marker heatmap and differential expression features.
Users can calculate differential expression between cell sets within a sample/group or compare a cell set between samples and groups. Differential expression results can be filtered further, for example, by selecting only upregulated genes. Users can perform pathway analysis on the list of differentially expressed genes using external services - Pantherdb or Enrichr.
Navigation
The Data Exploration tab consists of several tiles. On the top left, we have the UMAP embedding that was created and customized in step 7 of the Data Processing (1). In the middle, we have the list of default clusters (Leiden or Louvain), any custom cell sets that have been created, as well as the list of samples and metadata (2). On the right, the gene list shows the full list of genes present in the dataset ordered by dispersion (3). Dispersion is a measure of variability, so some of the most variable genes in the dataset are listed at the top of the gene list. At the bottom, the heatmap shows marker genes for your selected clustering method (Leiden or Louvain) (4).
The width and height of each tile can be changed to suit your preference. The tiles can also be moved around using the move arrows, and closed using the X button on the top right corner of each tile. To restore the default layout of the Data Exploration module, refresh the page or click on another module and then return to Data Exploration.
Cell sets and Metadata tile
In the Cell sets and Metadata tile, there are two tabs.
The ‘Cell sets’ tab shows the list of the default clusters (Leiden or Louvain) and any custom cell sets that have been created, as well as the list of samples and metadata. To expand any of the lists, click on the arrow on the left of the list name.
The ‘Annotate clusters’ tab within the Cell sets and Metadata tile is where you can perform automatic cluster annotation using ScType. Details of the ScType method of automatic annotation can be found in the ScType paper and in the ScType github repository. Select the tissue type and species of your samples from the dropdown menus, then click ‘Compute’. Note that the ‘Compute’ button is disabled until the dropdown menu selections are complete.
The annotated clusters appear in a new cell set ‘ScType’ within the Cell sets tab.
Changing names and colors of cell sets
In Leiden/Louvain clusters, Custom cell sets lists, ScType annotated cell sets, samples and metadata in the Cell sets and Metadata tile, you can change the names and colors of cell sets.
To change the name, click the edit button next to the cell set name. After inputting the new name, click the checkmark to save the name or the cross to cancel. Any changes to cell set names made in this tile propagate to all other modules of the platform.
To change the color of a cluster, click on the colored circle next to the cell set. In the popup, choose the new color and it will be applied automatically. Any changes to cell set colors made in this tile propagate to all other modules of the platform.
Reordering cell sets
It’s possible to rearrange the order of cell set and group lists. To reorder a cell set or group list, drag an item to the desired position using the button (3 lines) on the left of the item name. The new order of cell sets or groups in the Cell sets and Metadata tile will then be reflected in the heatmap in the Data Exploration module, as well as in all plots in the Plots and Tables module. This is a useful feature for ensuring that your samples, metadata groups and annotated cell sets are plotted in the order that you want them to be.
Creating custom cell sets
Creating custom cell sets using the lasso tool in the UMAP
New cell sets can be created using the lasso selection tool in the UMAP (or t-SNE) embedding plot. The lasso tool allows for a precise selection of an area of cells in your UMAP embedding plot. You can name the new cell set. The new cell set will appear in the ‘Custom cell sets’ list in the Cell sets and Metadata tile. To see the new cell set colored in UMAP, click on the eye icon next to ‘Custom cell sets’.
Custom cell sets based on gene expression
You may want to create a new custom cell set based on the expression (or lack of expression) of one or more genes. In the gene list on the right-hand side of the Data Exploration tab, you can select one or more genes of interest using the checkboxes next to the genes.
By selecting genes and clicking the 'Cellset' button, you can generate a new cell set based on the expression of the selected genes. In the Cellset modal, set the thresholds of expression for each selected gene. For example, you can select only the cells that express a particular gene at very high levels; or you can select only the cells that lack expression of selected genes. Then click ‘Create’. Your new cell set will appear in the Custom cell sets list in the Cell sets and Metadata tile. To see the new cell set colored in UMAP, click on the eye icon next to ‘Custom cell sets’.
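Conceptually, a gene expression based cell set is a selection of the cells that satisfy a threshold for each selected gene. The Python sketch below illustrates this under two assumptions that are not taken from the platform: that the conditions for several genes are combined with AND, and that a gene missing from a cell counts as zero expression. The function name, gene names and values are hypothetical.

```python
import operator

COMPARATORS = {">": operator.gt, "<": operator.lt}


def cell_set_from_expression(expression, rules):
    """Return the IDs of cells that satisfy every (comparison, threshold) rule.

    expression: {cell_id: {gene: value}}
    rules: {gene: (comparison, threshold)} with comparison ">" or "<"
    """
    selected = set()
    for cell_id, genes in expression.items():
        if all(
            COMPARATORS[comparison](genes.get(gene, 0.0), threshold)
            for gene, (comparison, threshold) in rules.items()
        ):
            selected.add(cell_id)
    return selected


# Hypothetical expression values for four cells.
expression = {
    "cell_a": {"Gene1": 0.8, "Gene2": 0.2},
    "cell_b": {"Gene1": 0.05, "Gene2": 0.1},
    "cell_c": {"Gene1": 1.5, "Gene2": 2.3},
    "cell_d": {"Gene1": 0.3},  # Gene2 not detected in this cell
}

# Cells that express Gene1 above 0.1 and Gene2 below 1.
rules = {"Gene1": (">", 0.1), "Gene2": ("<", 1.0)}
print(sorted(cell_set_from_expression(expression, rules)))  # ['cell_a', 'cell_d']
```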
Custom cell sets of combined Leiden or Louvain clusters
Locate the list of Leiden or Louvain clusters in the Cell sets and Metadata tile. Select two or more clusters that you would like to combine using the checkboxes next to cluster names. Then click the ‘Combine’ button. In the popup, name the new cluster and click on the tick button to save.
The new cluster will appear in the ‘Custom cell sets’ list.
Note that if you want to copy over all your other Leiden or Louvain clusters to the Custom cell sets list, you can do so using the 'Combine' button with only one cluster selected at a time. This essentially copies the selected Leiden or Louvain cluster to the Custom cell sets list.
Intersect selected cell sets
Intersecting selected cell sets can be very useful when working with non-mutually exclusive cell sets. For example, suppose you’ve created two new cell sets based on gene expression. Cell set 1 contains cells with gene expression of Gene 1 greater than 0.1, and cell set 2 contains cells with gene expression of Gene 2 less than 1. Some cells may be present in both of these new cell sets - those with gene expression of Gene 1 greater than 0.1 and Gene 2 less than 1. Intersecting cell set 1 and cell set 2 will select the cells present in both cell sets and combine them in a new cluster.
To use this function, locate the list of clusters in the Cell sets and Metadata tile. To create an intersection of cells, select clusters using the checkboxes next to cluster names. Then click the ‘Intersection’ button. In the popup, name the new cluster and click on the save button. The new cluster will appear in the ‘Custom cell sets’ list.
Create a new custom cell set from the complement of selected cell sets
Using this function, you can create a new custom cell set that contains all cells that are not in the selected cluster(s). This can be useful and time-saving when you have many clusters and want to create a cell set with all cells outside of these clusters.
Select the cell set(s) that you want to create a complement of, and click the “Complement” button. In the popup, name the new cluster and click on the save button. The new cluster will appear in the ‘Custom cell sets’ list. To see the new cell set colored in the UMAP, click on the eye icon next to ‘Custom cell sets’.
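The Combine, Intersection and Complement functions described above correspond to the standard set operations union, intersection and difference. The following Python sketch illustrates the logic only; the helper names, cell IDs and cell set names are hypothetical and do not reflect how Trailmaker stores cell sets.

```python
def combine(*cell_sets):
    """Cells present in at least one of the selected cell sets (union)."""
    return set().union(*cell_sets)


def intersect(*cell_sets):
    """Cells present in every selected cell set (intersection)."""
    return set.intersection(*map(set, cell_sets))


def complement(all_cells, *cell_sets):
    """Cells in the project that are in none of the selected cell sets."""
    return set(all_cells) - combine(*cell_sets)


# Hypothetical project with six cells and three cell sets.
all_cells = {"c1", "c2", "c3", "c4", "c5", "c6"}
cluster_0 = {"c1", "c2", "c3"}
cluster_1 = {"c4", "c5"}
gene1_high = {"c2", "c3", "c4"}

print(sorted(combine(cluster_0, cluster_1)))                # ['c1', 'c2', 'c3', 'c4', 'c5']
print(sorted(intersect(cluster_0, gene1_high)))             # ['c2', 'c3']
print(sorted(complement(all_cells, cluster_0, cluster_1)))  # ['c6']
```

Calling combine with a single cell set returns a copy of that set, which mirrors the tip above about copying a single cluster to the Custom cell sets list.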
Subset selected cell sets to a new project
You can create a new project by subsetting a cell selection from your project. This allows for a further deep dive into part of your data, or for removal of contamination from your project.
When you have made a selection of a group of cells, a subset button appears in the Cell sets and Metadata tile.
When you click on the subset button, a pop-up appears to start a new project from your current cell selection. You can change the name of the new project, if you wish. Then click ‘Create’, to make a new project containing your selection.
Data Processing is run for this subset project, after which you can start your deep dive in the Data Exploration module for the subset of cells.
UMAP or t-SNE embedding tile
The tile on the top left of the Data Exploration module shows the embedding - UMAP or t-SNE - that was customized in step 7 of Data Processing. UMAP is shown by default. To change between UMAP and t-SNE, go back to step 7 of Data Processing to change your selection. The embedding plot in the Data Exploration module is interactive, allowing you to zoom in and out to focus on a particular area of interest, move, and hover over single cells.
Hovering over a single cell gives you information about the cell ID and the cluster the cell belongs to, with the selected cell simultaneously highlighted in the marker heatmap.
The embedding is colored by Leiden or Louvain clusters by default. The coloring of the embedding can be changed using the ‘eye’ icons throughout the Data Exploration tab, for example to visualize samples or metadata, or the expression of a single gene from the gene list. Note that when cell sets are viewed on the embedding, cells that are not assigned to a cell set appear in gray.
The UMAP plot also allows the creation of new custom cell sets by using the lasso tool.
Heatmap
The heatmap shows marker genes for the Leiden or Louvain clusters by default. The number of genes shown per cluster varies depending on how many clusters you have in your dataset. You can zoom in on a specific cell set of interest in the heatmap, and hover over marker genes to identify the gene name which will help to identify the represented cell type.
The heatmap settings menu is accessed by clicking on the gear/cog icon. In this menu, you can add sample/metadata tracks to the heatmap view or reorder the heatmap, as explained below.
Adding sample/metadata track to the heatmap view
To add sample or metadata tracks to the heatmap view, hover over Metadata tracks in the heatmap settings menu. In the sub-menu, toggle the eye icon to add a metadata track. The toggled selections appear as colored tracks above the heatmap view. The order of the metadata tracks can be changed by clicking on the up and down arrows. The item on top of the list is also going to be shown at the top of the heatmap tile. Note that this doesn’t reorder the cells within the heatmap itself - this is done using the ‘Group by’ function (see the next section).
Reordering the cells on the heatmap using ‘Group by’ parameter
To reorder the cells viewed in the heatmap, hover over ‘Group by’ in the settings menu. In the sub-menu, hover over ‘Select the parameters to group by’ dropdown menu. Click + to add a parameter you want to order cells by. To exclude a parameter, click - on the left of the parameter.
Then, in the ‘Group by’ sub-menu, arrange the parameters in descending order of grouping priority.
In the example below, the heatmap is ordered first by sample and then by Louvain clusters:
Viewing genes in the heatmap
You can search for specific genes of interest in the gene list. If you want to look at these genes in the heatmap, you can select them using the checkbox and click ‘Heatmap’.
This gives you the option to add the selected gene(s) to the heatmap, remove them from the heatmap, or overwrite the heatmap with the selected gene(s).
Clicking remove will remove the selected gene(s) from the heatmap. Clicking add will add the selected gene(s) to the heatmap.
Using overwrite, the heatmap only shows the expression of the selected gene(s). If at any point you want to reload the default heatmap view showing marker genes, simply reload the page.
Hiding Cell Sets
You can hide one or more clusters, samples, or metadata groups from the embedding plot and heatmap.
To hide a particular cluster from the embedding plot and heatmap, click the Hide button on the right side of the cluster name in the Cell sets and Metadata tile. To unhide a cluster or clusters, click the ‘Unhide’ button or use ‘Unhide all’ to unhide all hidden clusters. Metadata groups and Samples can also be hidden/unhidden in this way.
Gene list
You can find the full Gene list for your dataset in the ‘Genes’ tile on the right-hand side of the Data Exploration module. By default, genes are presented in descending order by dispersion. Dispersion describes how much the variance deviates from the mean. Genes with high dispersion have a high level of variation between cells in the dataset. You can rearrange the gene list based on the gene name or dispersion by clicking on the column names (Gene and Dispersion).
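As a rough illustration of why highly variable genes rise to the top of the list, the sketch below ranks genes by a variance-to-mean ratio. This is an assumption made for illustration only: the text does not state the exact dispersion formula Trailmaker uses, and the gene names, values, and the `dispersion` function are hypothetical.

```python
from statistics import mean, pvariance


def dispersion(values):
    """Variance-to-mean ratio; returns 0.0 for a gene with zero mean."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


# Hypothetical per-cell expression values for two genes.
expression = {
    "GeneA": [1.0, 1.0, 1.0, 1.0],   # identical in every cell
    "GeneB": [0.0, 0.0, 0.0, 4.0],   # expressed in one cell only
}

# Descending order by dispersion, as in the default gene list ordering.
ranked = sorted(expression, key=lambda g: dispersion(expression[g]), reverse=True)
print(ranked)
```

Both genes have the same mean expression, but only GeneB varies between cells, so it is ranked first.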
Search for genes in the gene list
You can search for genes that contain, start with or end with certain letter/s or possible subunits. Your search is applied automatically to the gene list as you type.
To clear a gene search, delete your input in the search box or click the cross button (✖) on the left of the search box.
Viewing gene information
If you want to view information on a particular gene in the Gene List, click on the gene name. This action opens a new window showing the selected gene in GeneCards.
Note that the GeneCards database is used primarily for human genes and may not provide useful information if your dataset is from a species other than human.
Differential expression analysis
Differential expression analysis allows you to determine which genes are expressed at different levels between experimental groups. Differentially expressed genes can then be used in pathway analysis to offer insight into the biological processes affected by the condition of interest.
Using Trailmaker, you can find the differentially expressed genes between two groups of cells, where each group must have at least 3 cells. Differential expression can be calculated using the differential expression tab on the right side in the ‘Genes’ block.
You can compare cell sets within a sample/group, which allows you to find marker genes that distinguish clusters from one another.
Alternatively, you can compare a selected cell set between samples/groups to find genes that are differentially expressed between two experimental groups.
Compare cell sets within a sample/group
The differential expression calculation to compare cell sets within a sample or group uses the presto implementation of the Wilcoxon rank sum test and auROC analysis. For more information see the presto vignette.
To perform this analysis, choose a cell set you want to compare in the first drop-down menu. In the second drop-down menu, choose another cell set, or the option ‘Rest of Louvain clusters’ or ‘All other cells’. [Note that in the case of Louvain clusters, ‘Rest of Louvain clusters’ and ‘All other cells’ are the same because all cells are assigned to a Louvain cluster; whereas for Custom cell sets, these two options will be different if not all cells in the dataset are assigned to a Custom cell set.] Lastly, select the sample/group within which you want to compare cell sets or choose the option ‘All’. Then click compute.
You will be presented with the differential expression (DE) results table: a list of genes in descending order of log fold change (logFC). The table returns the following results:
- LogFC: The fold change is the ratio of the expression of a gene between the two groups being compared; this ratio is then log-transformed. Genes with a positive logFC, which appear at the top of the list, are expressed at higher levels in comparison group A than in group B. Given that logFC = log2(A) - log2(B), the logFC is positive whenever log2(A) is greater than log2(B), for example when log2(A) is positive and log2(B) is negative.
- Adj p-value: The probability of observing a difference in expression at least as large as the one measured for a given gene, under the assumption that said gene is not differentially expressed. In addition, the value is adjusted using the Benjamini–Hochberg correction for multiple hypothesis testing, to account for the fact that when testing thousands of genes, some might have a small p-value due to random chance. The smaller it is, the stronger the evidence that the gene is actually differentially expressed.
- Pct1: The percentage of cells where the gene is expressed in the first group (A).
- Pct2: The percentage of cells where the gene is expressed in the second group (B).
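- AUC: Area under the receiver operating characteristic (ROC) curve. It is proportional to the Wilcoxon U statistic calculated by the rank-sum test. A value of 0.5 means the gene does not separate the two groups; the closer the value is to 1, the better the gene's expression distinguishes group A from group B and the more likely it is that the gene is differentially expressed.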
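The LogFC, Pct1, Pct2, and AUC columns listed above can be illustrated with a minimal sketch for a single gene. This is not the presto implementation: the function name `de_summary`, the toy values, and the use of the log2 of the group means for the fold change are assumptions made for illustration.

```python
from math import log2


def de_summary(group_a, group_b):
    """Summarise one gene from per-cell expression values in two groups.

    Assumes both groups have a mean expression greater than zero.
    """
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    # logFC = log2(A) - log2(B); positive when A is higher than B.
    log_fc = log2(mean_a) - log2(mean_b)
    # Percentage of cells in each group where the gene is detected.
    pct1 = 100 * sum(v > 0 for v in group_a) / len(group_a)
    pct2 = 100 * sum(v > 0 for v in group_b) / len(group_b)
    # Rank-sum U statistic: count pairs where A > B, ties count as half.
    u = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
    # AUC is U divided by the number of (A, B) pairs.
    auc = u / (len(group_a) * len(group_b))
    return {"logFC": log_fc, "Pct1": pct1, "Pct2": pct2, "AUC": auc}


result = de_summary([2.0, 4.0, 0.0, 2.0], [1.0, 0.0, 0.0, 1.0])
print(result)
```

In this toy example the mean of group A is 2.0 and the mean of group B is 0.5, so log2(A) is 1, log2(B) is -1, and the logFC is 2.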
The DE gene list can be reordered in the table by other calculated parameters - adjusted p-value, PCT 1 (the percentage of cells where the feature is detected in the first group), PCT 2 (the percentage of cells where the feature is detected in the second group), and AUC (area under the receiver operating characteristic curve). Both ascending and descending options are available to view.
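The adjusted p-value column relies on the Benjamini–Hochberg correction. The sketch below shows the standard form of that procedure on a hypothetical list of raw p-values; the function name `benjamini_hochberg` is illustrative, and the code is not taken from Trailmaker.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # hypothetical raw p-values for four genes
print(benjamini_hochberg(raw))
```

Each raw p-value is multiplied by the number of tests divided by its rank, which is why adjusted values are never smaller than the raw ones.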
Clicking on ‘Show settings’ will show your chosen cell sets and samples/groups that have been compared in this DE calculation.
Note that to download the DE results, you must visit the Volcano plot in the Plots and Tables module. Unfortunately, the DE results table cannot be downloaded from the Data Exploration module.
Compare a selected cell set between samples/groups
The differential expression comparison of a selected cell set between samples or groups uses a pseudobulk limma-voom workflow. This is considered best practice for between-sample comparisons. Pseudobulk differential expression sums the counts for all cells within a cluster for each sample and then uses standard differential expression methods designed for bulk RNA-seq. One major benefit of doing this is that it treats the sample as the level of replication, instead of falsely assuming that each cell is independent.
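The aggregation step of a pseudobulk analysis can be sketched as follows. This covers only the summing of counts per sample, not the limma-voom modelling; the data layout, sample names, and the `pseudobulk` function are hypothetical.

```python
from collections import defaultdict


def pseudobulk(cells, cluster):
    """Sum raw counts per sample for the cells of one cluster.

    `cells` is a list of (sample, cluster, {gene: count}) tuples.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cell_cluster, counts in cells:
        if cell_cluster != cluster:
            continue  # only cells from the selected cell set contribute
        for gene, count in counts.items():
            totals[sample][gene] += count
    return {sample: dict(genes) for sample, genes in totals.items()}


cells = [
    ("S1", "c1", {"G1": 3, "G2": 0}),
    ("S1", "c1", {"G1": 2, "G2": 1}),
    ("S1", "c2", {"G1": 9, "G2": 9}),   # different cluster, ignored
    ("S2", "c1", {"G1": 1, "G2": 4}),
]
print(pseudobulk(cells, "c1"))
```

The result has one count profile per sample, so the number of observations in the downstream test equals the number of samples rather than the number of cells.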
To perform this DE analysis, choose a cell set you want to compare in the first drop-down menu. Choose the first sample/group to compare in the second drop-down menu. Lastly, select the second sample/group you want to compare with the first sample/group or choose the option ‘Rest of Samples’ or ‘All other cells’. When you have made your selections for the DE calculation parameters, click ‘Compute’.
Note that in some comparison selections, this warning message will appear:
The message explains that in your selected comparison, there are fewer than 3 samples with the minimum number of cells that’s required to perform the DE calculation. The most likely explanation is that you are comparing 1 sample to 1 other sample. An alternative explanation is that you are comparing 3 or more samples, but that there are too few cells (<10) in one or more of the comparison groups, which is resulting in only 2 ‘valid’ comparison groups that contain enough cells to perform the DE calculation.
In this case, you can still go ahead and perform the DE calculation, but the DE results table will only display the list of DE genes and logFC value. No adjusted p-value will be calculated as it is not considered statistically sound to calculate such a p-value on a 1 versus 1 comparison.
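The condition behind this warning can be sketched as a simple check. The thresholds of 10 cells per sample and 3 valid samples are taken from the explanation above; the function name `can_compute_p_values` and the sample names are hypothetical, and the check Trailmaker actually performs may differ in detail.

```python
from collections import Counter

MIN_CELLS = 10    # a sample with fewer cells than this is not 'valid'
MIN_SAMPLES = 3   # valid samples needed before adjusted p-values are reported


def can_compute_p_values(sample_of_each_cell):
    """Return (ok, valid_samples) for the cells of the selected cell set."""
    counts = Counter(sample_of_each_cell)
    valid = sorted(s for s, n in counts.items() if n >= MIN_CELLS)
    return len(valid) >= MIN_SAMPLES, valid


# Three samples are compared, but 'ctrl2' contributes only 4 cells.
cells = ["ctrl1"] * 12 + ["ctrl2"] * 4 + ["treated"] * 15
print(can_compute_p_values(cells))
```

Here only two samples pass the cell-count threshold, so the comparison would return logFC values without adjusted p-values.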
You will be presented with the differential expression (DE) results table: a list of genes in descending order of log fold change (logFC).
The table returns the following results:
- If the comparison contains 3 or more samples (e.g. 2 control vs 1 treated) then the DE results table presents both the logFC and the adj p-value. In this case, the p-values generated from pseudo-bulk comparisons are statistically accurate and can be used to determine biological significance.
- For 1 vs 1 comparisons (e.g. 1 control vs 1 treated), only the logFC is returned in the results table, because p-values are not appropriate with this small N. LogFC estimates can be used to ascertain the magnitude of the difference between the two samples but not to draw any statistical inferences.
The gene list can be reordered in the table by other calculated parameters by clicking on the column titles.
Clicking on ‘Show settings’ will show your chosen cell sets and samples/groups that have been compared.
Advanced filtering
To filter the DE gene list, click ‘Advanced filtering’.
In the popup menu, you can select advanced filtering options. There are three pre-set filtering options which allow you to quickly filter for only the up-regulated genes (with a positive logFC), only the down-regulated genes (with a negative logFC) or only the significant genes (with an adjusted p-value of <0.05):
Alternatively, you can add your own custom filter using the ‘Add custom filter’ option. Here, you can select to filter by any of the DE results parameters and set a filtering threshold of your choice.
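The pre-set and custom filters amount to simple threshold comparisons on the DE results columns. The sketch below assumes a hypothetical results table; the column names `logFC` and `p_val_adj` and the `custom_filter` function are illustrative and do not reflect Trailmaker's internal format.

```python
import operator

# Hypothetical DE results, one row per gene.
de_results = [
    {"gene": "GeneA", "logFC": 2.1, "p_val_adj": 0.001},
    {"gene": "GeneB", "logFC": -1.4, "p_val_adj": 0.03},
    {"gene": "GeneC", "logFC": 0.2, "p_val_adj": 0.60},
]

COMPARISONS = {"<": operator.lt, ">": operator.gt}


def custom_filter(rows, column, comparison, threshold):
    """Keep rows where `column` is less/greater than `threshold`."""
    compare = COMPARISONS[comparison]
    return [row for row in rows if compare(row[column], threshold)]


# The three pre-set options, expressed as custom filters.
up_regulated = custom_filter(de_results, "logFC", ">", 0)
down_regulated = custom_filter(de_results, "logFC", "<", 0)
significant = custom_filter(de_results, "p_val_adj", "<", 0.05)

print([row["gene"] for row in significant])
```

Filters can be chained by passing the output of one call as the input of the next, for example to keep only significant up-regulated genes.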
Pathway enrichment analysis
Pathway analysis identifies biological pathways that are enriched in the differentially expressed gene list more than would be expected by chance. The goal is to give the list of genes across different phenotypes a biological context by condensing down a potentially long list of genes into a few select biological pathways.
Click ‘Pathway analysis’ after performing differential expression, to start your pathway analysis. We strongly recommend using Advanced filtering to filter your list of DE genes before performing pathway analysis. This is because the list of differentially expressed genes is very long and contains both up- and down-regulated genes with varying levels of significance. So, further filtering will lead to more consistent and clear pathway analysis results.
Once your list of DE genes has been filtered using the ‘Advanced filtering’ tool, select ‘Pathway analysis’ to begin:
Note that if you have not already filtered your gene list, you will be prompted to do so.
Pathway analysis can be performed on a list of differentially expressed genes using the external service providers PantherDB or Enrichr. The list of genes and species will be submitted to the external service, and no other information will be sent.
We recommend running your pathway analysis using both PantherDB and Enrichr and then comparing the results. Your final choice for the pathway analysis service might depend on the databases in the platforms, user interface, and, ultimately, on your personal preference.
For help using these external pathway analysis services, we recommend visiting the Help pages for PantherDB and Enrichr.
PantherDB
Select the ‘pantherdb’ toggle at the top of the pathway analysis modal:
In the pathway analysis modal, you can confirm the species of your dataset. You can also select the number of differentially expressed genes that will be included in the pathway analysis by clicking ‘Top’ and inputting the desired number. To send all the genes in your filtered list, select ‘All’.
Then, initiate your pathway analysis by clicking ‘Launch’.
PantherDB is hosted on an unsecured server (HTTP), so you will see a warning upon launch. Click “Send anyway” to continue. The list of genes and species will be submitted to the external service, and no other information will be sent. See the example below.
You will be redirected to the PantherDB website in a new tab.
We recommend setting the reference list of genes under "Reference List" on the PantherDB results page and re-running the pathway analysis. If gene names in Trailmaker differ from those in the reference list of genes on PantherDB (for instance, lowercase vs. uppercase gene names), the results of the pathway analysis will be incorrect.
For further help using PantherDB, please visit the relevant help pages on the PantherDB website: http://pantherdb.org/help/PANTHERhelp.jsp.
Enrichr
Select ‘enrichr’ at the top of the pathway analysis modal.
In the pathway analysis modal, you can confirm the species of your dataset. You can also select the number of differentially expressed genes that will be included in the pathway analysis by clicking ‘Top’ and inputting the desired number. To send all the genes in your filtered list, select ‘All’.
Then, initiate your pathway analysis by clicking ‘Launch’.
You will be redirected to the maayanlab.cloud Enrichr page in a new tab.
For further help using the Enrichr pathway analysis tool, please visit the relevant help pages on the Enrichr website: https://maayanlab.cloud/Enrichr/help.
Plots and Tables
Overview
The Plots and Tables module of Trailmaker provides a wide range of pre-loaded data visualization options to quickly and easily get insights from your data. It also allows users to customize the plots and export them in a variety of formats.
The module is organized into three sections to make finding the right plot easy and intuitive. The Cell sets & metadata section contains plots that graphically represent cell set properties - categorical embedding, frequency plot, and a trajectory plot. The Gene expression section contains plots that represent the expression of individual genes across cell sets, such as violin plots, dot plots, and more. The Differential expression section includes a volcano plot that visually represents differences between and within groups.
General options
All the plots have general customization options!
Main schema
Under the main schema control, you can change the dimensions of the plot - customize the plot’s height and width using the slider scale.
In the title menu, you can define the plot's title, change the title's font size, and indicate the location of the title.
In the font menu, you can change the text font in the plot from Sans Serif to Sans or Monospace.
Axes and margins
Under the Axes and margins control, you can customize the y-axis and x-axis. You can also customize the margins and grid lines.
- You can change the titles of the x- and y-axis, as well as the size of the axis titles, using the slider. Just slide the dot to your preferred value. The changes to axes titles will be applied to the plot automatically.
- You can also rotate the labels on the x-axis. To do this, toggle the “Rotate X-Axis Labels” button.
- You can change the size of axes labels using a slider scale. Just slide the dot to your preferred value.
- To change the margins in the plot, use the slider scale to change the margins from 0 to your preferred value. This will move the plot off-center by offsetting automatic margins.
- To add grid lines to the plot, use the slider scale to change the grid line weight from 0 to your preferred value.
In this menu, you can also override the automatic axes ranges. To manually input values for axes ranges, deselect the Auto control under X-axis and/or Y-axis. Then input your preferred minimum and maximum values, and click Save.
Color inversion
The Colour inversion control allows inverting the color of the background. If the standard color of the plot's background is white, this control enables you to turn the background black.
Markers
This menu applies to embedding and volcano plots. Here, you can change the style and shape of markers.
The point (marker dot) size can be changed from 1 to 100 using a slider scale. Examples of small point size of 1 and large point size of 10 are shown below on the left and right, respectively.
Point opacity can also be changed using a slider scale for the embeddings. The default opacity is at 5, but it can be customized on a scale from 1 to 10. The examples below show opacity settings of 1 (left) and 10 (right).
There are two options for point shape - diamond and round. To change the shape, select your preferred point shape.
Legend
Under Legend control, you can decide whether to show or hide the plot legend. To hide the legend, toggle the Hide option. You can also choose the position of the legend by clicking Top, Bottom, or Right.
Labels
The label control applies to the categorical and continuous embedding plots. You can use this control to show or hide the cell set labels.
You can also change the size of the labels if you choose to show them on the plot. This might be particularly helpful if you have a lot of clusters in the embedding and their names overlap. To change the size, use the size slider to choose your preferred value.
Additionally, in the volcano plot, you can find a control called “Add labels.” This option allows specifying the negative log10 of the adjusted p-value. Above your chosen values, labels for upregulated and downregulated genes will be displayed.
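The effect of the volcano plot label threshold can be sketched as follows. The column names, gene names, and the `genes_to_label` function are hypothetical; the sketch only shows how a threshold on the negative log10 of the adjusted p-value selects genes for labelling.

```python
from math import log10


def genes_to_label(de_results, min_neg_log10_p):
    """Return (gene, direction) pairs whose -log10(adj p) exceeds the threshold.

    Assumes every adjusted p-value is greater than zero.
    """
    labels = []
    for row in de_results:
        if -log10(row["p_val_adj"]) > min_neg_log10_p:
            direction = "up" if row["logFC"] > 0 else "down"
            labels.append((row["gene"], direction))
    return labels


de_results = [
    {"gene": "GeneA", "logFC": 2.1, "p_val_adj": 1e-5},    # -log10 is about 5
    {"gene": "GeneB", "logFC": -1.4, "p_val_adj": 1e-6},   # -log10 is about 6
    {"gene": "GeneC", "logFC": 0.2, "p_val_adj": 0.01},    # -log10 is about 2
]
print(genes_to_label(de_results, 4))
```

Raising the threshold labels fewer genes, which helps when labels overlap on a crowded plot.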
Reset plots
All the plots have a reset button that appears after you make any changes to the default plot.
Click the blue reset button on top of the plot to return to the default plot and undo all changes.
Cell sets & metadata
Categorical Embedding
The default categorical embedding plot shows a UMAP embedding of cells from all samples clustered and colored according to the Louvain clustering algorithm. You can read more about how this plot is generated in the Step 7: Configure embedding section.
Categorical embedding allows the coloring of the UMAP according to categorical variables. These variables are discrete and used to split data based on specific characteristics, such as samples. The default embedding plot displays Louvain clusters.
Group by
You can use the Group by control to change the cell set category by which you would like to group cells.
Select data
Using Select data control, you can select a sample of interest. This will result in the embedding only showing cells from the selected sample instead of all samples.
Frequency Plot
A frequency plot shows the distributional information of a variable. Put simply, it summarizes the data by plotting how frequently a specific value occurs. In Trailmaker, the default frequency plot shows the proportions of cells from each cluster in every sample. The y-axis represents the proportions, which are the frequency values, while the x-axis represents the samples by which the cells are grouped. You can use a frequency plot to see if there is a significant shift in the proportions of cells between samples.
Select data
You can change the metadata and cell sets used for this plot using select data control. If you have multiple metadata tracks, you can change the variables represented by the x-axis. Then you can change the cell set whose frequency you want to plot in the metadata intervals.
Plot type
You can change the plot type to a frequency plot of absolute counts. To do this, use the Plot type control and click on Count. Absolute counts reflect the number of cells in that cluster in a sample, while proportions reflect the proportion of cells in the cluster compared to all other clusters.
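The difference between the two plot types can be sketched with a small table of cells. Proportions are expressed here as percentages of each sample's cells, which is an assumption about the scale; the `frequency_table` function and the sample and cluster names are hypothetical.

```python
from collections import Counter


def frequency_table(cells, as_proportion=True):
    """Build {sample: {cluster: value}} from a list of (sample, cluster) pairs."""
    counts = Counter(cells)
    totals = Counter(sample for sample, _ in cells)
    table = {}
    for (sample, cluster), n in counts.items():
        # Proportion: share of the sample's cells found in this cluster.
        value = 100 * n / totals[sample] if as_proportion else n
        table.setdefault(sample, {})[cluster] = value
    return table


cells = (
    [("S1", "c1")] * 3 + [("S1", "c2")] * 1 +
    [("S2", "c1")] * 2 + [("S2", "c2")] * 2
)
print(frequency_table(cells))                        # proportions
print(frequency_table(cells, as_proportion=False))   # absolute counts
```

Proportions make samples with different total cell numbers comparable, whereas counts show how many cells actually support each bar.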
Trajectory Analysis
Trajectory analysis allows the user to determine a pattern of a dynamic biological process experienced by cells - a "trajectory" of gene expression changes. Then the cells are arranged according to their progression through that process, which means they are placed at their proper position in the trajectory. This progression can be quantitatively measured using pseudotime. Pseudotime has been defined as “an abstract unit of progress: simply the distance between a cell and the start of the trajectory, measured along the shortest path.” [1].
Trajectory analysis in Trailmaker is calculated using the Monocle3 method.
Calculate root nodes
To get started, select cell sets to use for trajectory analysis. By default, all Louvain clusters are included. However, you can choose to perform trajectory analysis for specific clusters, samples, or metadata. Click on the box above “Calculate root nodes”, and a drop-down menu will appear with all the cell sets in your dataset. Once you’ve made your selection, click “Calculate root nodes”.
Select root nodes
After the calculation is done, select root nodes by clicking on the white points. The root nodes signify where you want the trajectory to start. You can select multiple nodes at once by drawing a selection. Hold down the Shift key to do this, and then click and drag. Nodes inside the selection will be added as root nodes and appear in red.
Deselect nodes by clicking on a selected node or by clicking Clear selection.
When you have made your selection, click “Calculate pseudotime”. The trajectory plot will be colored by pseudotime. Move around the plot by panning (click and drag) and zooming (pinch and zoom/scroll).
If you have made changes to your selection (e.g., cleared the selection or added new nodes to the selection) after clicking calculate, you can recalculate the pseudotime.
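The idea of pseudotime as a shortest-path distance from the selected root nodes can be sketched with Dijkstra's algorithm on a small graph. This is a conceptual illustration only, not the Monocle3 implementation; the graph, node names, edge lengths, and the `pseudotime` function are hypothetical.

```python
import heapq


def pseudotime(graph, roots):
    """Shortest-path distance from the nearest root to every reachable node.

    `graph` maps a node to a list of (neighbour, edge_length) pairs.
    """
    dist = {root: 0.0 for root in roots}
    queue = [(0.0, root) for root in roots]
    heapq.heapify(queue)
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, length in graph.get(node, []):
            candidate = d + length
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


# Small undirected trajectory graph: A - B, with branches B - C and B - D.
graph = {
    "A": [("B", 1.0)],
    "B": [("A", 1.0), ("C", 2.0), ("D", 0.5)],
    "C": [("B", 2.0)],
    "D": [("B", 0.5)],
}
print(pseudotime(graph, ["A"]))
```

With more than one root selected, each node receives the distance to its closest root, which is why changing the root selection changes the pseudotime values.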
Check out “How to reproduce a published trajectory analysis plot with Trailmaker” for a guided walkthrough of the trajectory analysis plot.
Display
Under the display control, you can change plot values from pseudotime to cell sets. Before selecting root nodes and calculating pseudotime, the default plot values will be based on cell sets. After calculating pseudotime, the default plot will show pseudotime.
You can hide the starting nodes (white points) from the plot using the Show/Hide Trajectory controls.
Gene expression
Continuous Embedding
The continuous embedding plot allows you to see the expression of a particular gene.
Gene selection
Type the gene name in the search box to select a gene of interest. You can find the search box under the Gene selection control.
Select data
You can also select the data to view on the embedding. For example, you can choose to see the gene expression in cells from a specific sample. To do this, use the Select data control.
Expression values
You can choose to have capped or uncapped values under the “Expression values” control, where the default is set to capped. Capping applies to cells in which a gene is expressed at a level above a predetermined threshold, determined by the 95th percentile. These values are said to be capped because they are artificially set to this threshold, even though the true expression level may be higher.
Uncapped values, by contrast, are reported as their actual expression value.
Capping can be done to manage the potential high variability often found in scRNA-seq data. By capping the expression values at this threshold, one can mitigate the impact of extremely high outliers which may not be biologically relevant but rather artefacts or noise. However, a limitation of this approach is that it could potentially result in the loss of meaningful information about genes that are naturally expressed at extremely high levels or in specific cellular conditions.
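Capping at the 95th percentile can be sketched as follows. The sketch assumes the percentile is computed per gene using the nearest-rank method; the text does not specify how Trailmaker computes the percentile, and the `cap_expression` function and values are hypothetical.

```python
def cap_expression(values, percentile=95):
    """Clip values above the given percentile (nearest-rank method).

    Returns the capped values and the threshold that was applied.
    """
    ordered = sorted(values)
    # Nearest rank: ceiling of n * percentile / 100, at least 1.
    rank = max(1, -(-len(ordered) * percentile // 100))
    threshold = ordered[rank - 1]
    return [min(v, threshold) for v in values], threshold


# Nineteen ordinary cells and one extreme outlier for a single gene.
values = list(range(1, 20)) + [100]
capped, threshold = cap_expression(values)
print(threshold, capped[-1])
```

Only the outlier is changed, so the color scale of the plot is no longer stretched by a single extreme cell.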
Heatmap
The heatmap shows marker genes for the Leiden or Louvain clusters by default. You can choose to see custom genes or marker genes in the heatmap.
Gene selection
By default, three marker genes per cluster are shown. To view custom genes on the Heatmap, select the “custom genes” option, type in a gene name and select it to add it to the plot. You will see automatic suggestions for the genes when you are typing out the gene name. Click on the gene in the suggestion box to add the gene, or click on the Add button.
To add multiple genes, separate them with a space or comma. Gene lists can be pasted into the gene search box from the Data Exploration module or from a document or spreadsheet.
The genes can be reordered on the y-axis of the heatmap by dragging and dropping the dots next to the gene name. To deselect a gene click on the X on the right of the gene name.
To see the expression of marker genes, click on the “Marker genes” option. Type the number of marker genes per cluster that you want to plot, and click “Run”. You can also choose to show or hide gene labels in this menu.
Metadata tracks
To add metadata tracks, click on the Metadata tracks control. Toggle the eye icon to add a metadata track to the heatmap. The toggled selections appear as colored tracks above the heatmap.
To change the order of metadata tracks or Louvain cluster tracks in the heatmap, click on the arrow icon to move the track up or down in the plot. The item on top of the list will also be shown at the top of the heatmap block. Note that this doesn’t reorder the default ordering of cells as it’s still grouped by Louvain clusters. Cells can be reordered within the heatmap using the ‘Group by’ control (see the next section).
Group by
To reorder the cell ordering in the heatmap, click on the ‘Group by’ control. In the popup, hover over the ‘Select the parameters to group by’ drop-down menu. Click + to add a parameter you want to order cells by. To exclude the parameter, click - on the left of the parameter. Then click on the up arrow to change the ordering of the cells. The parameter on the top of the list will be used as a grouping parameter.
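Grouping by several parameters is equivalent to sorting the cells by a key built from those parameters in priority order. The following minimal sketch uses hypothetical cell records and a hypothetical `order_cells` function.

```python
def order_cells(cells, group_by):
    """Sort cells by the grouping parameters, highest priority first."""
    return sorted(cells, key=lambda cell: tuple(cell[p] for p in group_by))


cells = [
    {"id": "cell1", "sample": "S2", "cluster": 0},
    {"id": "cell2", "sample": "S1", "cluster": 1},
    {"id": "cell3", "sample": "S1", "cluster": 0},
    {"id": "cell4", "sample": "S2", "cluster": 1},
]

# Order first by sample, then by cluster within each sample.
by_sample = [c["id"] for c in order_cells(cells, ["sample", "cluster"])]
# Swapping the priority groups by cluster first.
by_cluster = [c["id"] for c in order_cells(cells, ["cluster", "sample"])]
print(by_sample)
print(by_cluster)
```

Moving a parameter to the top of the list corresponds to placing it first in `group_by`.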
Expression values
You can change the type and capping of the expression values under the Expression values control. You can choose to use raw values or Z-scores. You can also choose to have capped or uncapped values. Capped values are expression values for a gene in a given cell that fall below a predetermined threshold, typically set to be the detection limit of the scRNA-seq experiment. They are said to be capped because they are artificially set to this threshold, even though the true expression level may be lower. Uncapped values, by contrast, are reported as their actual expression value.
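A Z-score rescales each gene so that its expression is shown relative to that gene's own mean and spread. The sketch below assumes the Z-score is computed per gene across cells using the population standard deviation; the exact definition used by Trailmaker is not stated in the text, and the `z_scores` function is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert one gene's expression values to Z-scores."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # constant gene: no variation to scale
    return [(v - m) / s for v in values]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
print(z_scores(raw))
```

A Z-score of 2 means the cell's expression is two standard deviations above the gene's mean, which makes genes with very different expression ranges comparable in one heatmap.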
Violin Plot
The violin plot allows you to look at the distribution of normalized expression of a gene of interest across Leiden or Louvain clusters by default. The black dots represent cells.
Sometimes you can see black horizontal lines at the bottom of kernels. These are points that signify the cells where the gene is not expressed, and visually can look like a line on the plot.
Gene selection
Under the Gene selection control, you can select your gene of interest. Type the gene into the search box. You will see automatic suggestions of the gene when you are typing out the gene name. You can click on the suggested gene to autocomplete the gene name. Click search to plot the violin plot.
View multiple plots
You can view multiple violin plots for the expression of different genes in a grid view in one window.
Type the gene name into the search box to plot multiple violin plots. You will see automatic suggestions of the gene when you are typing out the gene name. You can click on the suggested gene to autocomplete the gene name. To add multiple genes, separate them with a space or comma. Click add to plot the expression of your selected genes. The selected genes are going to appear at the bottom of the controls menu.
Drag and drop the genes in the gene list to rearrange the order of plots in the grid. To deselect a gene and remove a plot from the grid, click on the X on the right of the gene name.
You can also change the dimensions of the grid. The grid dimensions are represented as Rows x Columns. For example, to view four plots you could choose a 1x4 grid or a 2x2 grid.
You can also find the options to select a specific plot and update the controls. If you have selected “Controls update: All plots”, then changes in other controls such as Select data and Data transformation are going to be applied to all the plots in the grid.
If you select a plot and choose “Controls update: Selected plot”, changes in controls are going to be applied only to the selected plot.
Note that each plot needs to be saved individually.
Select data
You can change the metadata and cell sets used for this plot using the first dropdown menu in the select data control. The selection in the first dropdown menu controls the x-axis of the plot.
In the second dropdown menu in the select data controls, you can change the cell set or metadata to be used as data. The default option is to show ‘All’. However, you can choose to display only a part of the data such as an individual sample or metadata group.
Data transformation
Under the Data transformation control, you can change the type of gene expression values from normalized to raw values. Note the change in the values on the y-axis in the screenshots below.
You can also adjust the bandwidth, which impacts the density fit of the kernels. To change the bandwidth, move the slider to your preferred value. Values range from 0 to 1 in 0.05 intervals.
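To illustrate why the bandwidth changes the shape of a violin, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. It is not Trailmaker's implementation: the function name `gaussian_kde` and the example values are hypothetical, and how Trailmaker maps the slider value to a kernel width is not specified here.

```python
import math

def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point in `grid`."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in grid:
        # Each cell contributes a Gaussian bump centered on its expression value.
        total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        densities.append(norm * total)
    return densities

# Hypothetical normalized expression values for one gene in one cluster.
expression = [0.0, 0.0, 0.1, 0.9, 1.0, 1.1, 2.0]
grid = [i * 0.1 for i in range(0, 31)]  # evaluation points from 0.0 to 3.0

narrow = gaussian_kde(expression, grid, bandwidth=0.1)  # sharp, separate peaks
wide = gaussian_kde(expression, grid, bandwidth=0.8)    # one smooth, flat curve
```

A small bandwidth follows the individual values closely and produces a bumpy outline, while a large bandwidth smooths the peaks into a broader shape.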
Dot Plot
In Trailmaker, the dot plot shows the percentage of cells expressing the genes of your choice. The size of each dot represents the percentage of cells in a specific cluster that express the gene: the smaller the dot, the smaller the percentage. A bigger dot in a specific cluster means that a larger fraction of the cells in that cluster express the gene. The color reflects the level of expression of the gene.
By default, three genes with the highest dispersion across all cells are shown.
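The two quantities encoded by a dot can be sketched as follows. This is an illustrative example only: the function name `dot_plot_stats` and the data are hypothetical, and it assumes that a cell "expresses" a gene when its value is above zero and that color is based on the mean over all cells in the group, neither of which is confirmed for Trailmaker.

```python
from collections import defaultdict

def dot_plot_stats(expression, clusters):
    """Return {cluster: (percent_expressing, mean_expression)} for one gene.

    `expression` holds one value per cell; `clusters` holds the matching
    cluster label per cell. A cell counts as expressing if its value is > 0
    (an assumption made for this sketch).
    """
    grouped = defaultdict(list)
    for value, cluster in zip(expression, clusters):
        grouped[cluster].append(value)
    stats = {}
    for cluster, values in grouped.items():
        expressing = sum(1 for v in values if v > 0)
        percent = 100.0 * expressing / len(values)  # would drive dot size
        mean = sum(values) / len(values)            # would drive dot color
        stats[cluster] = (percent, mean)
    return stats

# Hypothetical expression of one gene across eight cells in two clusters.
expression = [0.0, 2.0, 0.0, 0.0, 1.0, 3.0, 2.0, 0.0]
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]
stats = dot_plot_stats(expression, clusters)
```

In this example, cluster B would get a larger dot than cluster A because three of its four cells express the gene, compared with one of four in cluster A.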
Gene selection
You can look at the expression of custom genes of your choice or marker genes.
To select custom genes, type a gene name in the gene search box. Suggestions appear as you type. To add multiple genes, separate them with a space or comma. Gene lists can be pasted in from the Data Exploration module or from a list in a document or spreadsheet. Click Add to plot your selected gene(s). The gene(s) you have selected will appear below the search bar.
To rearrange the order of the genes on the x-axis, drag and drop these genes in the gene list below the search box. To deselect a gene, click on the X on the right of the gene name.
To see the expression of marker genes, click on the “Marker genes” option. Type the number of marker genes per cluster that you want to plot. Click Run to plot the marker gene dot plot.
Select data
In the dot plot, you can also change the cell sets or metadata that cells are grouped by, using the first dropdown menu in the “select data” controls, which determines the y-axis.
You can also select the cell sets or metadata to be shown as data. For example, the cells can be grouped by Louvain clusters (y-axis), and you can select to view data only from one sample using the second dropdown menu in the select data controls.
Size scale
You can change the size scale of the dot plot. There are two available options: relative and absolute scale. Absolute scale will show total expression, while relative scale will be relative to what you select in the “Select data” control. So, if you select Louvain clusters, the size scale will be relative to all clusters, but if you select samples, the size scale will be relative to all samples.
Normalized expression matrix
In the Plots and Tables module, you can download the normalized expression matrix for specific samples, metadata groups, clusters, and custom cell sets. The normalized expression matrix contains genes as rows and cells as columns, with a normalized expression value for each gene in each cell. The normalized values allow us to see biological variability more clearly. The Seurat object is subset before the matrix is exported.
To export the full normalized expression matrix, just click download. The matrix is exported as a CSV file.
To subset the matrix, click on the “All” box below the parameter. Note that you can also subset using multiple parameters. For example, let's subset the normalized expression matrix based on clusters.
Click on the cluster(s) you want to subset the matrix by. You can choose multiple clusters, and the selected cluster(s) will appear in the box. To deselect a cluster, click on X.
When you have selected your preferred parameters, click download.
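Once downloaded, a genes-by-cells CSV can be read with standard tools. The sketch below uses only the Python standard library and assumes a layout that is not confirmed by this guide: a header row whose first field labels the gene column and whose remaining fields are cell identifiers, followed by one row per gene. The function name `read_expression_matrix` and the sample data are hypothetical; check the header of your own file before relying on this layout.

```python
import csv
import io

def read_expression_matrix(handle):
    """Parse a genes-by-cells CSV into (cell_ids, {gene: [values]})."""
    reader = csv.reader(handle)
    header = next(reader)
    cell_ids = header[1:]  # first header field labels the gene column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = [float(x) for x in row[1:]]
    return cell_ids, matrix

# Hypothetical excerpt standing in for a downloaded file.
example = io.StringIO(
    ",cell_1,cell_2,cell_3\n"
    "GeneA,0.0,1.2,0.4\n"
    "GeneB,2.1,0.0,0.0\n"
)
cell_ids, matrix = read_expression_matrix(example)
```

For a real file, pass `open(path, newline="")` instead of the in-memory example. Full matrices can be large, so a dedicated data-frame library may be more practical for routine work.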
Differential expression
Volcano Plot
A volcano plot is a type of scatter plot that represents the differential expression of genes. This plot allows you to identify possible biologically significant genes. The x-axis represents the log fold change, and the y-axis represents the negative log10 of the adjusted p-value.
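As a small worked example of the axes described above, the position of one gene on the plot can be computed like this. The function name `volcano_point` and the input values are hypothetical.

```python
import math

def volcano_point(log_fc, adj_p):
    """Return the (x, y) position of a gene on a volcano plot."""
    if not 0 < adj_p <= 1:
        raise ValueError("adjusted p-value must be in (0, 1]")
    # x is the log fold change; y is the negative log10 of the adjusted p-value.
    return log_fc, -math.log10(adj_p)

x, y = volcano_point(log_fc=1.5, adj_p=0.001)
```

Because of the negative logarithm, smaller adjusted p-values map to larger y values, which is why the most significant genes sit at the top of the plot.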
If you performed differential expression analysis in the Data Exploration module, your most recent selections for that analysis are reflected in the Volcano plot controls menu. To plot a volcano plot with the selection from Data Exploration, you only need to click compute.
Compare cell sets within a sample/group
Here, you can perform differential expression analysis to find marker genes that distinguish one cluster from another cluster or from all other clusters and plot a volcano plot that represents these genes. The calculation uses the presto implementation of the Wilcoxon rank sum test and auROC analysis. For more information, see the presto vignette.
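The relationship between the Wilcoxon rank sum statistic and the auROC can be sketched for a single gene as follows. This is a simplified illustration, not the presto code: presto works on whole matrices and also reports p-values, which are omitted here. The function name `rank_sum_auroc` and the example values are hypothetical.

```python
def rank_sum_auroc(group, rest):
    """Return (rank_sum, U, auROC) for `group` versus `rest`, averaging tied ranks."""
    combined = [(v, 0) for v in group] + [(v, 1) for v in rest]
    combined.sort(key=lambda pair: pair[0])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum = sum(r for r, (_, label) in zip(ranks, combined) if label == 0)
    n1, n2 = len(group), len(rest)
    u = rank_sum - n1 * (n1 + 1) / 2.0  # Mann-Whitney U for `group`
    return rank_sum, u, u / (n1 * n2)

# Hypothetical expression of one gene in a cluster and in all other cells.
cluster_cells = [3.0, 2.0, 2.5]
other_cells = [0.0, 1.0, 2.0]
rank_sum, u, auroc = rank_sum_auroc(cluster_cells, other_cells)
```

An auROC near 1 means the gene is almost always higher in the selected cell set, a value near 0.5 means it does not separate the two groups, and a value near 0 means it is almost always lower.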
To perform the differential expression analysis and plot the volcano plot, go to the Differential expression control. Choose a cell set you want to compare in the first drop-down menu (1). Choose another cell set, or the option ‘Rest of Louvain clusters’ or ‘All other cells’, in the second drop-down menu (2). [Note that in the case of Louvain clusters, ‘Rest of Louvain clusters’ and ‘All other cells’ are the same because all cells are assigned to a Louvain cluster. For Custom cell sets, however, these two options will differ if not all cells in the dataset are assigned to a Custom cell set.]
Lastly, select the sample/group within which you want to compare cell sets or choose the option ‘All’ (3). Then click compute.
The volcano plot based on the computed differential expression analysis will appear in the plot tile.
The most upregulated genes in the plot are toward the right (in blue), while the most downregulated genes are toward the left (in red). Other genes with a small magnitude of change are colored in gray. The most statistically significant genes are toward the top of the plot.
You can also export the results of the differential expression analysis in a CSV format. Read more in the Export to CSV section.
Compare a selected cell set between samples/groups
With this differential expression analysis, you can find differentially expressed genes between two experimental groups and plot the respective volcano plot. This analysis uses a pseudobulk limma-voom workflow.
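The "pseudobulk" part of this workflow can be illustrated with a short sketch: counts from the individual cells of each sample are combined into one profile per sample, so that samples rather than cells act as replicates. The example below assumes that raw counts are summed per sample, which is the common approach but is not confirmed here for Trailmaker. The limma-voom modeling step is not reproduced, and the function name `pseudobulk` and the data are hypothetical.

```python
from collections import defaultdict

def pseudobulk(counts, cell_samples):
    """Sum raw counts per gene over the cells of each sample.

    `counts` maps gene -> list of raw counts (one per cell);
    `cell_samples` gives the sample label of each cell, in the same order.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for gene, values in counts.items():
        for value, sample in zip(values, cell_samples):
            totals[sample][gene] += value
    return {sample: dict(genes) for sample, genes in totals.items()}

# Hypothetical raw counts for two genes in four cells from two samples.
counts = {"GeneA": [1, 0, 3, 2], "GeneB": [0, 0, 5, 1]}
cell_samples = ["S1", "S1", "S2", "S2"]
bulk = pseudobulk(counts, cell_samples)
```

Because each sample contributes a single aggregated profile, the statistical test needs several samples per comparison, which is why comparisons with very few samples are flagged later in this section.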
To perform this differential expression analysis and plot the volcano plot, go to the Differential expression control. Then choose a cell set you want to compare in the first drop-down menu (1). Choose the first sample/group to compare in the second drop-down menu (2). Lastly, select the second sample/group you want to compare with the first sample/group, or choose the option ‘Rest of Samples’ or ‘All other cells’ (3).
When you have made your selections, click compute.
The volcano plot based on the computed differential expression analysis will appear in the plot tile.
The most upregulated genes in the plot are toward the right (in blue), while the most downregulated genes are toward the left (in red). Other genes with a small magnitude of change are colored in gray. The most statistically significant genes are toward the top of the plot.
You can also export the results of the differential expression analysis in a CSV format. Read more in the Export to CSV section.
Export to CSV
When you have performed the differential expression analysis, an ‘Export to CSV’ button appears at the top of the plot window, alongside the volcano plot.
Click on the button, and your differential expression results will be automatically downloaded in a CSV file.
Data thresholding
Under the Data thresholding control, you can modify the significance thresholds and the design of the guidelines.
Significance Thresholds
You can change the -log10(p-value) threshold on the y-axis. To do this, enter a new threshold in the input box. This can change which genes are considered upregulated or downregulated. The equivalent p-value is displayed below the input box. See the example below.
You can also change the fold change (log) value on the x-axis. The fold change is the ratio of gene expression between the two groups being compared; this ratio is then log-transformed. Adjusting the log fold change threshold can change which genes are considered upregulated or downregulated. The new value will be represented as the negative and positive log fold change thresholds, as shown in the example below.
You can also choose to deselect the option ‘Show guideline.’
For both thresholds, deselecting this option will color the guideline black instead of the default red in the plot tile. When you download the plot, the guidelines will disappear completely. See the example below.
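The effect of the two significance thresholds can be sketched as a simple classification rule. This is an illustration under stated assumptions rather than Trailmaker's code: the function names are hypothetical, and whether values exactly on a threshold count as significant is an assumption.

```python
def classify_gene(log_fc, neg_log10_p, log_fc_threshold, neg_log10_p_threshold):
    """Label a gene using symmetric log fold change and -log10(p) thresholds."""
    if neg_log10_p < neg_log10_p_threshold:
        return "not significant"
    if log_fc >= log_fc_threshold:
        return "upregulated"
    if log_fc <= -log_fc_threshold:
        return "downregulated"
    return "not significant"

def equivalent_p_value(neg_log10_p_threshold):
    """Convert a -log10(p-value) threshold back to a p-value."""
    return 10 ** (-neg_log10_p_threshold)

# Hypothetical thresholds: log fold change of 1 and -log10(p-value) of 2.
label_up = classify_gene(1.5, 3.0, 1.0, 2.0)
label_down = classify_gene(-1.5, 3.0, 1.0, 2.0)
label_weak = classify_gene(1.5, 1.0, 1.0, 2.0)
p_cutoff = equivalent_p_value(2)
```

Raising either threshold moves genes from the colored groups into the gray "not significant" group, and lowering it does the opposite. A -log10(p-value) threshold of 2 corresponds to a p-value of 0.01.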
Guideline design
Under the guideline design, you can adjust the width and color of the guidelines.
To change the width of the lines, just input the new width value.
You can also change the colors of the guidelines. To do so, click on Browse next to “Colors”. Then click on the colored circle on the right of the guideline whose color you want to change.
Then simply select a new color. The change is applied automatically.
Colors
Besides color inversion, the “Colors” control lets you change the appearance of markers. Click on Browse next to “Markers”. To change the color of gene markers, click on the colored circle on the right of the marker type.
Then choose a new color. The change is applied automatically.
Comparing fewer than 3 samples
Note that for some selections, an error message will appear:
The message explains that there are fewer than 3 samples with the minimum number of cells required to perform the differential expression calculation in your selected comparison. The most likely explanation is that you are comparing 1 sample to 1 other sample. An alternative explanation is that you are comparing 3 or more samples but that there are too few cells (<10) in one or more of the comparison groups, resulting in only 2 ‘valid’ comparison groups that contain enough cells to perform the calculation.
As the error message explains, a comparison between only two ‘valid’ samples cannot produce statistically meaningful results. For such a comparison it is therefore not possible to create a volcano plot in Trailmaker.
In this case, you can still perform the differential expression calculation, but the results will consist only of DE genes and logFC values; they should be interpreted with caution and used for exploratory purposes only. No adjusted p-value will be calculated, which is why a volcano plot cannot be drawn. The plot will look like the image below. However, you can download your results in CSV format.
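The condition described by the error message can be sketched as a check on the number of cells per sample. The cutoffs of 10 cells and 3 samples are taken from the wording of the message above; the exact rule Trailmaker applies, the function names, and the example data are assumptions made for illustration.

```python
from collections import Counter

MIN_CELLS = 10          # assumed from the "<10" cells wording in the message
MIN_VALID_SAMPLES = 3   # assumed from the "fewer than 3 samples" wording

def valid_samples(cell_samples, min_cells=MIN_CELLS):
    """Return the sorted sample labels that have at least `min_cells` cells."""
    counts = Counter(cell_samples)
    return sorted(s for s, n in counts.items() if n >= min_cells)

def can_compute_p_values(cell_samples):
    """True if enough samples pass the cell-count cutoff for a full comparison."""
    return len(valid_samples(cell_samples)) >= MIN_VALID_SAMPLES

# Hypothetical comparison: three samples, but one has too few cells
# in the selected cell set.
cells = ["S1"] * 40 + ["S2"] * 25 + ["S3"] * 4
usable = valid_samples(cells)
ok = can_compute_p_values(cells)
```

In this example only two samples pass the cutoff, so the comparison would fall into the logFC-only case described above even though three samples were selected.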
Batch Differential Expression Table
Here you can perform DE analysis and download batch DE tables. There are three comparison options.
Generate a full list of marker genes for all cell sets
First, you can download a full list of marker genes for all cell sets, where each cell set will be compared to all other cells, using all samples. You can select which cell sets you want to use for analysis.
When you have selected the cell sets for which marker genes are to be computed in batch, click “Compute and Download”.
A DE table will be created for each cell set.
Compare two selected samples/groups within a cell set for all cell sets
You can also compare two selected samples/groups within a cell set in a batch for all cell sets.
Choose a sample/group you want to compare in the first drop-down menu (1). Choose another sample/group, or the option ‘Rest of Samples’ or ‘All other cells’, in the second drop-down menu (2). Lastly, select the cell sets within which you want to compare samples/groups (3).
Then click “Compute and Download”. Note that a warning may appear.
Compare between two cell sets for all samples/groups
Lastly, you can compare two cell sets for all samples/groups.
Choose a cell set you want to compare in the first drop-down menu (1). Choose another cell set, or the option ‘Rest of Louvain clusters’ or ‘All other cells’, in the second drop-down menu (2). Lastly, select the sample/group for which you want to batch compare cell sets (3).
Then click “Compute and Download”. Note that you might get a warning message, depending on the selection you made.
The message explains that there are fewer than 3 samples with the minimum number of cells required to perform the differential expression calculation in your selected comparison. The most likely explanation is that you are comparing 1 sample to 1 other sample. An alternative explanation is that you are comparing 3 or more samples but that there are too few cells (<10) in one or more of the comparison groups, resulting in only 2 ‘valid’ comparison groups that contain enough cells to perform the calculation.
As the warning message explains, a comparison between only two ‘valid’ samples cannot produce statistically meaningful results. For such a comparison it is therefore not possible to create a volcano plot in Trailmaker.
In this case, you can still perform the differential expression calculation, but the results will consist only of DE genes and logFC values; they should be interpreted with caution and used for exploratory purposes only. No adjusted p-value will be calculated.
Downloading plots
You can download the plots by clicking on the button with three dots in the top right corner of the plot.
You can save your plots as PNG or SVG. SVG is a vector format, so it can be scaled to any size without loss of quality, whereas PNG has a fixed resolution. Click on your preferred option to start the download.
You can also download your plot by right-clicking it and choosing “Save image as”.