Can I integrate my existing datasets generated using droplet-based methods with Parse Evercode™ data?
Yes, you can integrate Parse Biosciences Evercode™ datasets with datasets previously generated using droplet-based methods. Trailmaker™ offers a streamlined environment to handle this integration efficiently, ensuring that the unique characteristics of each platform are harmonized.
Through head-to-head comparisons, Parse Evercode Whole Transcriptome data and data generated using droplet-based approaches have been found to detect similar cell types and cluster identities. It is worth noting that higher gene detection and lower ambient RNA in Evercode data can lead to higher resolution of cell types and differentially expressed gene profiles. A detailed example is available in the following application note: Comparison of Evercode™ WT v2 and Chromium™ Next GEM Single Cell 3’ Kit v3.1 in Mouse Lymph Node Nuclei.
Additional examples, including four mouse brain nuclei samples, are available on our website at: https://www.parsebiosciences.com/datasets/
When integrating single cell RNA-sequencing datasets generated using multiple technologies, it’s essential to account for technical (non-biological) variation. This batch correction process is similar to working with datasets from different biological donors or from different droplet-based scRNA-seq experimental runs. The Data Processing pipeline in Trailmaker’s Insights module allows integration of datasets (in other words, matching cell populations across datasets), making it easy to adjust for potential batch effects while preserving biological variation. These adjustments ensure that datasets generated using Parse Evercode and droplet-based methods can be reliably compared and analyzed together.
Considerations for data integration
The Insights module within the Trailmaker data analysis platform provides processes such as library size normalization, scaling, log-normalization, and integration, which allow users to compare datasets from different technologies effectively. However, before these comparisons can be made, there are a few important considerations that can impact the accuracy of your results.
1. Downsampling reads: Gene and transcript detection is a function of read depth. Therefore, to ensure both technologies are on an equal footing when comparing gene and transcript counts, the FASTQ pair with the higher mean reads per cell should be downsampled so that the mean reads per cell is the same for both datasets. If you need assistance with downsampling reads, please reach out to support@parsebiosciences.com.
2. Downsampling cells: Datasets with a substantially greater number of cells have a higher probability of detecting novel subpopulations and of detecting genes that would not be detected in a smaller dataset. Small differences in cell numbers between datasets are unlikely to matter; however, where there are large differences in cell numbers, we recommend downsampling cells (see the example R sketch after this list). Please reach out to support@parsebiosciences.com if you need assistance with downsampling cells.
3. Sample-specific count matrices: If your Parse assay included multiple samples, the 'all-well' or 'all-sample' count matrix will encompass all of these samples. Droplet-based methods usually generate gene count matrices for each individual sample. To make sure you are working with the appropriate sample, it’s important to start your analysis with the sample-specific count matrices, for example:
SL1-out/sample_1/DGE_filtered/
SL1-out/sample_1/DGE_unfiltered/
4. Reference genome: Ensure that the same reference genome build has been used for the FASTQ file processing of both datasets, to avoid discrepancies in gene annotations.
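If you prefer to downsample cells yourself in R, the following is a minimal sketch of one possible approach for a Chromium-format matrix. The file names, random seed, and target cell number are illustrative assumptions, and this is not a Parse-provided tool:
# A minimal sketch: randomly downsample a Chromium (genes x cells) count
# matrix and its barcodes to a target number of cells
library(Matrix)

set.seed(42)                                   # for reproducibility
target_cells <- 10000                          # illustrative target

counts <- Matrix::readMM("matrix.mtx")         # genes x cells
barcodes <- readLines("barcodes.tsv")
colnames(counts) <- barcodes

keep <- sort(sample(ncol(counts), min(target_cells, ncol(counts))))
counts_down <- counts[, keep]

Matrix::writeMM(counts_down, "matrix_downsampled.mtx")
writeLines(colnames(counts_down), "barcodes_downsampled.tsv")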
Ready to integrate your data?
If you’ve already got data, you can follow the full instructions below to prepare your datasets for upload to Trailmaker.
How Parse Evercode and Chromium™ data formats differ
When uploading datasets to Trailmaker’s Insights module, the formats required for Parse Evercode data and 10x Genomics™ Chromium data differ slightly. In both cases, count matrices are required, but the specific files have distinct structures.
The three required files are:
- For Evercode data: DGE.mtx (or count_matrix.mtx), cell_metadata.csv, and all_genes.csv.
- For Chromium data: matrix.mtx, barcodes.tsv, and features.tsv.
While these files may share similarities, they are not directly interchangeable due to differences in their format and content.
Both Evercode and Chromium data use a .mtx (Matrix Market format) file to store the gene expression counts. The only difference is that rows and columns are transposed: in Evercode data, each row corresponds to a cell and each column represents a gene, while in Chromium data each row represents a gene and each column corresponds to a cell.
The cell_metadata.csv file (from Evercode) and barcodes.tsv file (from Chromium) both contain information about the cells, but they differ in the structure and details provided:
- cell_metadata.csv: This file includes detailed metadata about each cell, including the cell barcode, species, sample, well in each round of barcoding, and number of transcripts/genes detected. It offers more information than just the barcode, making it more comprehensive.
- barcodes.tsv: This file only lists the unique barcodes for each cell in the experiment. This file essentially provides the labels for the columns in the matrix.mtx file, representing individual cells without additional metadata.
Similarly, the all_genes.csv file (from Evercode) and the features.tsv file (from Chromium) both contain information about the genes detected, but they can include different additional information:
- all_genes.csv: This file contains a list of all detected genes in the experiment. It often includes the gene name, gene ID, and genome.
- features.tsv: This file contains a list of all detected genes in the experiment. In addition to the gene name and ID, it might include information about the feature type (e.g., whether it's a gene, an antibody tag, or other capture feature).
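If you would like to confirm these format differences on your own files before converting anything, a quick inspection in R is enough. The file paths below are illustrative and should be adjusted to your own data:
# Inspect both formats to confirm their orientation (paths are examples)
library(Matrix)

evercode_counts <- Matrix::readMM("DGE_filtered/count_matrix.mtx")  # cells x genes
chromium_counts <- Matrix::readMM("sample_1/matrix.mtx")            # genes x cells

dim(evercode_counts)  # rows = cells, columns = genes
dim(chromium_counts)  # rows = genes, columns = cells

# The accompanying files label those dimensions
head(read.csv("DGE_filtered/cell_metadata.csv"))            # one row per cell
head(read.delim("sample_1/features.tsv", header = FALSE))   # one row per gene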
Why is it not possible to upload Evercode and Chromium files directly to a single Trailmaker Insights project?
Due to the aforementioned differences in data formats between Evercode and Chromium, Trailmaker is currently designed to handle each technology in separate projects. When uploading data to Trailmaker, you must first select the appropriate technology from the dropdown menu in the upload modal. This ensures that the platform processes the data correctly, according to the specific structure and format required for that technology.
As a result, each Trailmaker project can only include samples from a single technology. So, how can Evercode and Chromium datasets be integrated using Trailmaker? The data needs to be converted before uploading to Trailmaker. We will explain how to convert the data in the next section.
How to convert Chromium count matrices to Parse Evercode format
To integrate Evercode and Chromium datasets for analysis in Trailmaker, the most convenient approach is to convert Chromium count matrices into the Evercode format. This ensures compatibility and allows both datasets to be processed and integrated in the Trailmaker Insights module within a single project.
Below is a step-by-step tutorial on how to convert your Chromium data into Evercode format using a simple R script. This script converts the Chromium files (matrix.mtx, features.tsv, and barcodes.tsv) into the required Evercode format (count_matrix.mtx, cell_metadata.csv, and all_genes.csv):
Before starting, ensure that your Chromium data is demultiplexed and in the correct format. You should have the following three files for each sample (the script below assumes the gzipped file names produced by recent versions of Cell Ranger, e.g. features.tsv.gz):
- features.tsv
- barcodes.tsv
- matrix.mtx
Before running the R code reported below, make sure to specify the path of your specific input directory, which contains the demultiplexed Chromium files. You can do this by setting the variable input_dir <- "./", adjusting it to the correct path where your files are located.
R script:
# Load the required libraries (the Matrix and vroom packages must also be
# installed, as they are called below with the :: operator)
library(Seurat)
library(dplyr)
library(fs)

# Define a function that reads gene annotations from the Chromium features file
read_10x_annotations <- function(annot_fpath) {
  annot <- read.delim(annot_fpath, header = FALSE)
  # Remove features that are not "Gene Expression" (e.g. antibody capture)
  if (ncol(annot) > 2 && length(grep("Gene Expression", annot$V3)) > 0) {
    annot <- annot %>% dplyr::filter(V3 == "Gene Expression")
  }
  annot
}

# Set the input directory containing the sample data
# Make sure that the input directory contains only folders corresponding to
# Chromium samples to convert, as the folder names will be used as sample names
input_dir <- "./"
samples <- list.files(input_dir)

# Loop through each sample and convert its data
for (sample in samples) {
  message("Converting sample ", sample)
  sample_dir <- file.path(input_dir, sample)

  # Read the gene annotations from the features file
  annot_fpath <- file.path(sample_dir, "features.tsv.gz")
  annotations <- read_10x_annotations(annot_fpath)

  # Read the count matrix (genes x cells)
  counts <- Seurat::Read10X(sample_dir, gene.column = 1, unique.features = TRUE)
  # If multiple feature types are present, keep the "Gene Expression" matrix
  if (is(counts, "list")) {
    slot <- "Gene Expression"
    if (!(slot %in% names(counts))) slot <- names(counts)[1]
    counts <- counts[[slot]]
  }

  # Create the output directory for the converted files
  out_path <- path(input_dir, "converted", sample)
  if (!dir_exists(out_path)) dir_create(out_path)

  # Write Evercode-format files: the transposed (cells x genes) count matrix,
  # the gene annotations, and a minimal cell metadata table
  Matrix::writeMM(Matrix::t(counts), path(out_path, "count_matrix.mtx"))
  vroom::vroom_write(annotations, path(out_path, "all_genes.csv"), delim = ",")
  vroom::vroom_write(
    data.frame(bc_wells = colnames(counts), sample = sample),
    path(out_path, "cell_metadata.csv"),
    delim = ","
  )
  message("Finished converting sample ", sample)
}
In summary, for each sample, this script reads the Chromium annotations from features.tsv and the count matrix from the matrix.mtx file. The data is then converted to Evercode format by transposing the count matrix and writing out three files: count_matrix.mtx, all_genes.csv, and cell_metadata.csv.
The converted files are saved in a new directory named "converted" that contains one folder per sample, each with the 3 required files.
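As an optional sanity check before upload, you can read one converted sample back into R and confirm that the matrix dimensions match the two metadata files (the sample folder name below is illustrative):
# Sanity check on one converted sample (path is illustrative)
library(Matrix)

out   <- "converted/sample_1"
m     <- Matrix::readMM(file.path(out, "count_matrix.mtx"))  # cells x genes
cells <- read.csv(file.path(out, "cell_metadata.csv"))
genes <- read.csv(file.path(out, "all_genes.csv"))

stopifnot(nrow(m) == nrow(cells),  # one matrix row per cell
          ncol(m) == nrow(genes))  # one matrix column per gene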
Considerations for Trailmaker Data Processing adjustments based on technology
In the Data Processing steps within Trailmaker’s Insights module, some filtering parameters are automatically adjusted based on the technology used to generate the data. This ensures that datasets are processed in a way that reflects the unique characteristics of each platform. It’s important to be aware of these aspects when integrating datasets produced with different technologies.
The technology-specific adjustments that take place in Data Processing are:
- Classifier filter:
- Disabled by default for Evercode data, as it’s based on an algorithm that is primarily designed for droplet-based technologies. For Evercode data, the cell size distribution filter (discussed below) is typically sufficient for filtering out low-quality cells.
- Enabled by default for droplet-based data.
- Cell size distribution filter:
- Enabled by default for Evercode data to replace the classifier filter.
- Disabled by default for droplet-based data because the classifier filter usually provides enough precision in filtering low-quality cells, though it can be enabled.
- The threshold for this filter is calculated using different algorithms for Evercode and droplet-based data.
- Doublet filter:
- Evercode datasets typically exhibit a lower doublet rate than Chromium datasets, and the default doublet filter thresholds in Trailmaker are adjusted accordingly.
- Number of genes vs. transcripts filter:
- A spline approach is applied to Evercode data, as it often shows saturation at higher molecule counts, making this method a better fit.
- A linear approach is applied to droplet-based data.
When you convert Chromium data to Evercode format for upload to Trailmaker, you will upload the data as a Parse Evercode project and the Evercode-specific Data Processing settings will be applied. As a result, some Data Processing filters may need manual adjustment, especially to account for the original characteristics of Chromium data. Whether or not the filters will need to be adjusted will depend on the specific dataset. Here we provide some guidelines that highlight the main points to pay attention to:
- Classifier filter: Since it is not possible in Trailmaker to enable or disable filters for specific samples, you should keep it disabled for all samples, including your Chromium data. The cell size distribution filter should be sufficient to replace the classifier filter.
- Cell size distribution filter: Since the Evercode algorithm is applied to all samples in the project, you should carefully check the filtering threshold for your Chromium samples and adjust if required.
- Number of genes vs. transcripts filter: For Chromium samples, you can manually switch from the spline method (default for Evercode data) to the linear method, which typically fits Chromium data better.
- Doublet filter: To fine-tune the doublet filter for Chromium data, consider running the Chromium samples in a separate Trailmaker Insights project and observing the percentage of cells filtered out at the doublet filter step when the Chromium settings for this filter are applied. You can then use this information to manually adjust the doublet filter probability thresholds for the Chromium samples in the mixed technologies project to achieve a similar filtering percentage.
Controlling data integration in Trailmaker
Within step 6 of Trailmaker’s Insights module, data integration takes place in order to remove batch effects. Batch effects pose substantial challenges as they can drive heterogeneity in the data, obscure or distort true biological differences, and complicate the interpretation of results. When integrating Evercode datasets with datasets generated using other technologies, you should pay careful attention to the integration step.
Trailmaker uses the Harmony method by default. Fast MNN and Seurat v4 methods of integration are also available, or you can choose not to apply an integration method to your data (“no integration”).
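For readers curious about what Harmony-based integration looks like outside Trailmaker, the sketch below shows a typical Harmony workflow in R using the Seurat and harmony packages. It is not Trailmaker's internal pipeline; it assumes a merged Seurat object named seu with a metadata column called technology recording which platform each cell came from, and the number of dimensions is an arbitrary example:
# Illustrative Harmony integration in R (not Trailmaker's internal pipeline)
# Assumes a merged Seurat object 'seu' with a 'technology' metadata column
library(Seurat)
library(harmony)

seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu)
seu <- ScaleData(seu)
seu <- RunPCA(seu)

# Correct the PCA embedding for the technology batch variable
seu <- RunHarmony(seu, group.by.vars = "technology")

# Downstream clustering and UMAP use the corrected embedding
seu <- RunUMAP(seu, reduction = "harmony", dims = 1:30)
seu <- FindNeighbors(seu, reduction = "harmony", dims = 1:30)
seu <- FindClusters(seu)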
Successful integration is all about minimizing technical variation (batch effects) while preserving the true biological signals in the data. When integration has worked well, cells with the same identity from different samples or technologies cluster together. Moreover, no unexpected separations that seem unrelated to biology should appear; in other words, cells should not cluster by technology. In Trailmaker, you can check this using the UMAP, where you should see good overlap of all the samples in all the clusters. Another helpful plot for checking integration quality is the frequency plot, where cells from each sample should be represented throughout each of the clusters.
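If you also have the integrated data available in R, the same idea as the frequency plot can be reproduced with a simple cross-tabulation. This sketch assumes a Seurat object named seu whose cluster identities are set and whose metadata contains a sample column; both names are assumptions for illustration:
# Per-cluster sample proportions, analogous to Trailmaker's frequency plot
# Assumes a Seurat object 'seu' with cluster identities and a 'sample' column
library(Seurat)

tab  <- table(cluster = Idents(seu), sample = seu$sample)
prop <- prop.table(tab, margin = 1)   # proportions within each cluster
round(prop, 2)

# Clusters dominated by a single sample or technology may indicate
# residual batch effects rather than true biological differences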
If you want to know more about data integration in Trailmaker, you can check our free online course at https://courses.trailmaker.parsebiosciences.com/courses/mastering-scrna-seq-with-trailmaker.