Can I integrate my existing datasets generated using droplet-based methods with Parse Evercode™ data?
Yes, you can integrate Parse Biosciences Evercode™ datasets with datasets previously generated using droplet-based methods. Trailmaker™ offers a streamlined environment to handle this integration efficiently, ensuring that the unique characteristics of each platform are harmonized.
Through head-to-head comparisons, Parse Evercode Whole Transcriptome data and data generated using droplet-based approaches have been found to detect similar cell types and cluster identities. It is worth noting that higher gene detection and lower ambient RNA in Evercode data can lead to higher resolution of cell types and differentially expressed gene profiles. A detailed example is available in the following application note: Comparison of Evercode™ WT v2 and Chromium™ Next GEM Single Cell 3’ Kit v3.1 in Mouse Lymph Node Nuclei.
Additional examples, including four mouse brain nuclei samples, are available on our website at: https://www.parsebiosciences.com/datasets/
When integrating single cell RNA-sequencing datasets generated using multiple technologies, it’s essential to account for technical (non-biological) variation. This batch correction process is similar to working with datasets from different biological donors or from different droplet-based scRNA-seq experimental runs. The Data Processing pipeline in Trailmaker’s Insights module allows integration of datasets (in other words, matching cell populations across datasets), making it easy to adjust for potential batch effects while preserving biological variation. These adjustments ensure that datasets generated using Parse Evercode and droplet-based methods can be reliably compared and analyzed together.
Considerations for data integration
The Insights module within the Trailmaker data analysis platform provides processes such as library size normalization, scaling, log-normalization, and integration, which allow users to compare datasets from different technologies effectively. However, before these comparisons can be made, there are a few important considerations that can impact the accuracy of your results.
1. Downsampling reads: Gene and transcript detection is a function of read depth. Therefore, to ensure both technologies are on an equal footing when comparing gene and transcript counts, the FASTQ pair with the higher mean reads per cell should be downsampled so that the mean reads per cell is the same for both datasets. If you need assistance with downsampling reads, please reach out to support@parsebiosciences.com.
2. Downsampling cells: Datasets with a substantially greater number of cells have a higher probability of detecting novel subpopulations and of detecting genes that would not be detected in a smaller dataset. Small differences in cell numbers between datasets are unlikely to matter; however, where there are large differences in cell numbers, we recommend downsampling cells (see the example R sketch after this list). Please reach out to support@parsebiosciences.com if you need assistance with downsampling cells.
3. Sample-specific count matrices: If your Parse assay included multiple samples, the 'all-well' or 'all-sample' count matrix will encompass all of these samples. Droplet-based methods usually generate gene count matrices for each individual sample. To make sure you are working with the appropriate sample, it’s important to start your analysis with the sample-specific count matrices, for example:
SL1-out/sample_1/DGE_filtered/
SL1-out/sample_1/DGE_unfiltered/
4. Reference genome: Ensure that the same reference genome build has been used for the FASTQ file processing of both datasets, to avoid discrepancies in gene annotations.
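If you prefer to downsample cells yourself in R, the following is a minimal sketch of one possible approach for a Chromium-format matrix. The file names, random seed, and target cell number are illustrative assumptions, and this is not a Parse-provided tool:
# A minimal sketch: randomly downsample a Chromium (genes x cells) count
# matrix and its barcodes to a target number of cells
library(Matrix)

set.seed(42)                                   # for reproducibility
target_cells <- 10000                          # illustrative target

counts <- Matrix::readMM("matrix.mtx")         # genes x cells
barcodes <- readLines("barcodes.tsv")
colnames(counts) <- barcodes

keep <- sort(sample(ncol(counts), min(target_cells, ncol(counts))))
counts_down <- counts[, keep]

Matrix::writeMM(counts_down, "matrix_downsampled.mtx")
writeLines(colnames(counts_down), "barcodes_downsampled.tsv")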
Ready to integrate your data?
If you’ve already got data, you can follow the full instructions below to prepare your datasets for upload to Trailmaker.
How Parse Evercode and Chromium™ data formats differ
When uploading datasets to Trailmaker’s Insights module, the formats required for Parse Evercode data and 10x Genomics™ Chromium data differ slightly. In both cases, count matrices are required, but the specific files have distinct structures.
The three required files are:
- For Evercode data: DGE.mtx (or count_matrix.mtx), cell_metadata.csv, and all_genes.csv.
- For Chromium data: matrix.mtx, barcodes.tsv, and features.tsv.
While these files may share similarities, they are not directly interchangeable due to differences in their format and content.
Both Evercode and Chromium data use a .mtx (Matrix Market format) file to store the gene expression counts. The only difference is that rows and columns are transposed: in Evercode data, each row corresponds to a cell and each column represents a gene, while in Chromium data each row represents a gene and each column corresponds to a cell.
The cell_metadata.csv file (from Evercode) and barcodes.tsv file (from Chromium) both contain information about the cells, but they differ in the structure and details provided:
- cell_metadata.csv: This file includes detailed metadata about each cell, including the cell barcode, species, sample, well in each round of barcoding, and number of transcripts/genes detected. It offers more information than just the barcode, making it more comprehensive.
- barcodes.tsv: This file only lists the unique barcodes for each cell in the experiment. This file essentially provides the labels for the columns in the matrix.mtx file, representing individual cells without additional metadata.
Similarly, the all_genes.csv file (from Evercode) and the features.tsv file (from Chromium) both contain information about the genes detected, but they can include different additional information:
- all_genes.csv: This file contains a list of all detected genes in the experiment. It often includes the gene name, gene ID, and genome.
- features.tsv: This file contains a list of all detected genes in the experiment. In addition to the gene name and ID, it might include information about the feature type (e.g., whether it's a gene, an antibody tag, or other capture feature).
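If you would like to confirm these format differences on your own files before converting anything, a quick inspection in R is enough. The file paths below are illustrative and should be adjusted to your own data:
# Inspect both formats to confirm their orientation (paths are examples)
library(Matrix)

evercode_counts <- Matrix::readMM("DGE_filtered/count_matrix.mtx")  # cells x genes
chromium_counts <- Matrix::readMM("sample_1/matrix.mtx")            # genes x cells

dim(evercode_counts)  # rows = cells, columns = genes
dim(chromium_counts)  # rows = genes, columns = cells

# The accompanying files label those dimensions
head(read.csv("DGE_filtered/cell_metadata.csv"))            # one row per cell
head(read.delim("sample_1/features.tsv", header = FALSE))   # one row per gene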
Why is it not possible to upload Evercode and Chromium files directly to a single Trailmaker Insights project?
Due to the aforementioned differences in data formats between Evercode and Chromium, Trailmaker is currently designed to handle each technology in separate projects. When uploading data to Trailmaker, you must first select the appropriate technology from the dropdown menu in the upload modal. This ensures that the platform processes the data correctly, according to the specific structure and format required for that technology.
As a result, each Trailmaker project can only include samples from a single technology. So, how can Evercode and Chromium datasets be integrated using Trailmaker? The data needs to be converted before uploading to Trailmaker. We will explain how to convert the data in the next section.
How to convert Chromium count matrices to Parse Evercode format
To integrate Evercode and Chromium datasets for analysis in Trailmaker, the most convenient approach is to convert Chromium count matrices into the Evercode format. This ensures compatibility and allows both datasets to be processed and integrated in the Trailmaker Insights module within a single project.
Below is a step-by-step tutorial on how to convert your Chromium data into Evercode format using a simple R script. This script converts the Chromium files (matrix.mtx, features.tsv, and barcodes.tsv) into the required Evercode format (count_matrix.mtx, cell_metadata.csv, and all_genes.csv):
Before starting, ensure that your Chromium data is demultiplexed and in the correct format. You should have the following three files for each sample (the script below assumes the gzipped file names produced by recent versions of Cell Ranger, e.g. features.tsv.gz):
- features.tsv
- barcodes.tsv
- matrix.mtx
Before running the R code reported below, make sure to specify the path of your specific input directory, which contains the demultiplexed Chromium files. You can do this by setting the variable input_dir <- "./", adjusting it to the correct path where your files are located.
R script:
# Load the required libraries (the Matrix and vroom packages must also be
# installed, as they are called below with the :: operator)
library(Seurat)
library(dplyr)
library(fs)

# Define a function that reads gene annotations from the Chromium features file
read_10x_annotations <- function(annot_fpath) {
  annot <- read.delim(annot_fpath, header = FALSE)
  # Remove features that are not "Gene Expression" (e.g. antibody capture)
  if (ncol(annot) > 2 && length(grep("Gene Expression", annot$V3)) > 0) {
    annot <- annot %>% dplyr::filter(V3 == "Gene Expression")
  }
  annot
}

# Set the input directory containing the sample data
# Make sure that the input directory contains only folders corresponding to
# Chromium samples to convert, as the folder names will be used as sample names
input_dir <- "./"
samples <- list.files(input_dir)

# Loop through each sample and convert its data
for (sample in samples) {
  message("Converting sample ", sample)
  sample_dir <- file.path(input_dir, sample)

  # Read the gene annotations from the features file
  annot_fpath <- file.path(sample_dir, "features.tsv.gz")
  annotations <- read_10x_annotations(annot_fpath)

  # Read the count matrix (genes x cells)
  counts <- Seurat::Read10X(sample_dir, gene.column = 1, unique.features = TRUE)
  # If multiple feature types are present, keep the "Gene Expression" matrix
  if (is(counts, "list")) {
    slot <- "Gene Expression"
    if (!(slot %in% names(counts))) slot <- names(counts)[1]
    counts <- counts[[slot]]
  }

  # Create the output directory for the converted files
  out_path <- path(input_dir, "converted", sample)
  if (!dir_exists(out_path)) dir_create(out_path)

  # Write Evercode-format files: the transposed (cells x genes) count matrix,
  # the gene annotations, and a minimal cell metadata table
  Matrix::writeMM(Matrix::t(counts), path(out_path, "count_matrix.mtx"))
  vroom::vroom_write(annotations, path(out_path, "all_genes.csv"), delim = ",")
  vroom::vroom_write(
    data.frame(bc_wells = colnames(counts), sample = sample),
    path(out_path, "cell_metadata.csv"),
    delim = ","
  )
  message("Finished converting sample ", sample)
}
In summary, for each sample, this script reads the Chromium annotations from features.tsv and the count matrix from the matrix.mtx file. The data is then converted to Evercode format by transposing the count matrix and writing out three files: count_matrix.mtx, all_genes.csv, and cell_metadata.csv.
The converted files are saved in a new directory named "converted" that contains one folder per sample, each with the 3 required files.
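As an optional sanity check before upload, you can read one converted sample back into R and confirm that the matrix dimensions match the two metadata files (the sample folder name below is illustrative):
# Sanity check on one converted sample (path is illustrative)
library(Matrix)

out   <- "converted/sample_1"
m     <- Matrix::readMM(file.path(out, "count_matrix.mtx"))  # cells x genes
cells <- read.csv(file.path(out, "cell_metadata.csv"))
genes <- read.csv(file.path(out, "all_genes.csv"))

stopifnot(nrow(m) == nrow(cells),  # one matrix row per cell
          ncol(m) == nrow(genes))  # one matrix column per gene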
Considerations for Trailmaker Data Processing adjustments based on technology
In the Data Processing steps within Trailmaker’s Insights module, some filtering parameters are automatically adjusted based on the technology used to generate the data. This ensures that datasets are processed in a way that reflects the unique characteristics of each platform. It’s important to be aware of these aspects when integrating datasets produced with different technologies.
The technology-specific adjustments that take place in Data Processing are:
- Classifier filter:
- Disabled by default for Evercode data, as it’s based on an algorithm that is primarily designed for droplet-based technologies. For Evercode data, the cell size distribution filter (discussed below) is typically sufficient for filtering out low-quality cells.
- Enabled by default for droplet-based data.
- Cell size distribution filter:
- Enabled by default for Evercode data to replace the classifier filter.
- Disabled by default for droplet-based data because the classifier filter usually provides enough precision in filtering low-quality cells, though it can be enabled.
- The threshold for this filter is calculated using different algorithms for Evercode and droplet-based data.
- Doublet filter:
- Evercode datasets typically exhibit a lower doublet rate than Chromium datasets, and the default doublet filter thresholds in Trailmaker are adjusted accordingly.
- Number of genes vs. transcripts filter:
- A spline approach is applied to Evercode data, as it often shows saturation at higher molecule counts, making this method a better fit.
- A linear approach is applied to droplet-based data.
When you convert Chromium data to Evercode format for upload to Trailmaker, you will upload the data as a Parse Evercode project and the Evercode-specific Data Processing settings will be applied. As a result, some Data Processing filters may need manual adjustment, especially to account for the original characteristics of Chromium data. Whether or not the filters will need to be adjusted will depend on the specific dataset. Here we provide some guidelines that highlight the main points to pay attention to:
- Classifier filter: Since it is not possible in Trailmaker to enable or disable filters for specific samples, you should keep it disabled for all samples, including your Chromium data. The cell size distribution filter should be sufficient to replace the classifier filter.
- Cell size distribution filter: Since the Evercode algorithm is applied to all samples in the project, you should carefully check the filtering threshold for your Chromium samples and adjust if required.
- Number of genes vs. transcripts filter: For Chromium samples, you can manually switch from the spline method (default for Evercode data) to the linear method, which typically fits Chromium data better.
- Doublet filter: To fine-tune the doublet filter for Chromium data, consider running the Chromium samples in a separate Trailmaker Insights project and observing the percentage of cells filtered out at the doublet filter step when the Chromium settings for this filter are applied. You can then use this information to manually adjust the doublet filter probability thresholds for the Chromium samples in the mixed technologies project to achieve a similar filtering percentage.
Controlling data integration in Trailmaker
Within step 6 of Trailmaker’s Insights module, data integration takes place in order to remove batch effects. Batch effects pose substantial challenges as they can drive heterogeneity in the data, obscure or distort true biological differences, and complicate the interpretation of results. When integrating Evercode datasets with datasets generated using other technologies, you should pay careful attention to the integration step.
Trailmaker uses the Harmony method by default. Fast MNN and Seurat v4 methods of integration are also available, or you can choose not to apply an integration method to your data (“no integration”).
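For readers curious about what Harmony-based integration looks like outside Trailmaker, the sketch below shows a typical Harmony workflow in R using the Seurat and harmony packages. It is not Trailmaker's internal pipeline; it assumes a merged Seurat object named seu with a metadata column called technology recording which platform each cell came from, and the number of dimensions is an arbitrary example:
# Illustrative Harmony integration in R (not Trailmaker's internal pipeline)
# Assumes a merged Seurat object 'seu' with a 'technology' metadata column
library(Seurat)
library(harmony)

seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu)
seu <- ScaleData(seu)
seu <- RunPCA(seu)

# Correct the PCA embedding for the technology batch variable
seu <- RunHarmony(seu, group.by.vars = "technology")

# Downstream clustering and UMAP use the corrected embedding
seu <- RunUMAP(seu, reduction = "harmony", dims = 1:30)
seu <- FindNeighbors(seu, reduction = "harmony", dims = 1:30)
seu <- FindClusters(seu)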
Successful integration is all about minimizing technical variation (batch effects) while preserving the true biological signals in the data. When integration has worked well, cells with the same identity from different samples or technologies cluster together. Moreover, no unexpected separations that seem unrelated to biology should appear; in other words, cells should not cluster by technology. In Trailmaker, you can check this using the UMAP, where you should see good overlap of all the samples in all the clusters. Another helpful plot for checking integration quality is the frequency plot, where cells from each sample should be represented throughout each of the clusters.
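If you also have the integrated data available in R, the same idea as the frequency plot can be reproduced with a simple cross-tabulation. This sketch assumes a Seurat object named seu whose cluster identities are set and whose metadata contains a sample column; both names are assumptions for illustration:
# Per-cluster sample proportions, analogous to Trailmaker's frequency plot
# Assumes a Seurat object 'seu' with cluster identities and a 'sample' column
library(Seurat)

tab  <- table(cluster = Idents(seu), sample = seu$sample)
prop <- prop.table(tab, margin = 1)   # proportions within each cluster
round(prop, 2)

# Clusters dominated by a single sample or technology may indicate
# residual batch effects rather than true biological differences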
If you want to know more about data integration in Trailmaker, you can check our free online course at https://courses.trailmaker.parsebiosciences.com/courses/mastering-scrna-seq-with-trailmaker.