How to adjust data processing settings to fit your dataset and troubleshoot data processing failures

The Insights module in Trailmaker™ runs a 7-step data processing pipeline to filter, integrate, and visualize your dataset (see the user guide for a full explanation of the data processing steps). Trailmaker applies automatically determined default settings during the first run of data processing. However, you should always check every filtering threshold for each sample to make sure that poor quality cells are excluded while good quality cells, including rare cell populations, are retained. In this article, we provide guidance on how to adjust data processing parameters to fit your dataset and discuss how to troubleshoot data processing failures.

How to adjust data processing settings to fit your dataset

The automatically determined default data processing settings have been optimized to suit the majority of uploaded datasets. However, given the range of technologies, species, experimental designs, and sample quality encountered in single cell RNA-sequencing data, you may need to adjust one or more of the settings to fit your dataset.

Data processing adjustments according to single cell technology

Trailmaker’s data processing pipeline is automatically adjusted according to the single cell technology used in your experiment. For example, for Parse Biosciences Evercode™ WT data, the Cell size distribution filter (filter 2) is enabled for removing background by setting a filtering threshold at the inflection point (often referred to as the “knee”) on the barcode rank plot. The Classifier filter (filter 1) is disabled for Evercode data as the emptyDrops method used by that filter is not optimized for non-droplet data. The opposite is the case for droplet-based and microwell-based technologies: the Classifier filter is enabled as it is appropriate for other technology types, while the Cell size distribution filter is disabled. Trailmaker also makes some automatic adjustments to the doublet filter algorithm (filter 5) according to the technology used. This is because the expected doublet rate differs for different technologies.

These technology-specific setting adjustments are made automatically in Trailmaker, as determined by the technology selected during data upload to the Insights module. Note that there are some specific nuances when integrating data from multiple technologies in a single Trailmaker project; these are covered in the article “How to integrate Parse Evercode and droplet-based data for analysis in Trailmaker”.

Data processing adjustments for single nuclei versus single cell data

Trailmaker supports both single cell and single nuclei data. However, it’s worth noting that the automatically determined default settings for data processing are primarily optimized for single cell data. Therefore, if you’re working with single nuclei data you should carefully check all steps of data processing to ensure that the filtering thresholds are appropriate for your nuclei samples.

Removal of background happens in filters 1 and 2. As mentioned in the previous section of this article, the default settings for these filters vary according to the single cell technology used in your project. Whichever combination of filters 1 and 2 you use to filter background and/or empty droplets from your dataset, you should pay particular attention to the filtering thresholds to ensure that nuclei with a low number of transcripts are retained.

In nuclei data, particular attention should also be paid to the mitochondrial content filter, as the mitochondrial content can indicate how effective the nuclei preparation process was. The number of mitochondrial reads is expected to be very low in nuclei data. Below are example mitochondrial content plots (filter 3) of mouse brain cells (left) and nuclei (right):

Trailmaker makes adjustments to the automatically calculated filtering settings to account for the expected low number of mitochondrial transcripts in nuclei data. However, it is advisable to verify the thresholds and make sure that no overfiltering is occurring.

Data processing adjustments for different species

The Trailmaker Insights module is deliberately designed to be species agnostic. Most features available in this module for exploring and plotting your data will work well regardless of the species used in your experiment (with the exception of pathway analysis, which does have some species restrictions).

For the data processing settings, we recommend that you consider the following species-specific limitations:

Mitochondrial content filter (filter 3): In order to filter cells based on mitochondrial content, Trailmaker detects reads from mitochondrial genes that start with “mt-”. This gene nomenclature holds for human and mouse genomes, but may not hold for other genomes. Below is an example mitochondrial content plot for a zebrafish dataset whose annotation does not contain mitochondrial genes starting with “mt-”. In this case, all cells are plotted at or near zero on the x-axis:

If the genome for the species you are working with does not contain annotations starting with “mt-” for mitochondrial genes, then you should consider disabling the mitochondrial content filter.
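The effect of a missing “mt-” prefix can be illustrated with a minimal Python sketch. This is not Trailmaker's code: the function name `percent_mito`, the case-insensitive matching, and the second cell with differently named genes are assumptions made purely for illustration.

```python
def percent_mito(cell_counts, prefix="mt-"):
    """Percentage of a cell's transcripts that come from genes starting with `prefix`.

    Matching is case-insensitive here (an assumption), so both "MT-CO1" and
    "mt-Co1" count as mitochondrial.
    """
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in cell_counts.items()
               if gene.lower().startswith(prefix))
    return 100.0 * mito / total


# Mouse-style names carry the prefix, so mitochondrial reads are detected.
mouse_cell = {"mt-Co1": 30, "mt-Nd1": 20, "Actb": 150}

# Hypothetical annotation where the same kind of genes lack the prefix:
# the cell has the same counts, but nothing is recognised as mitochondrial.
other_cell = {"COX1": 30, "ND1": 20, "ACTB": 150}

print(percent_mito(mouse_cell))  # 25.0
print(percent_mito(other_cell))  # 0.0
```

When every cell in a sample evaluates to zero like the second example, a threshold on mitochondrial content cannot separate cells, which is why disabling the filter is worth considering.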

Data integration (step 6): During the data integration step it is possible to exclude specific gene categories from the UMAP or t-SNE embedding computation.

This feature is species-dependent and you should consider the following:

Ribosomal genes are excluded based on the selection of genes that contain “rps”, “rpl”, “mrps” or “mrpl”. If this gene nomenclature is not true for your species, this feature will not work and should not be used.
Mitochondrial genes are excluded based on the selection of genes that start with “mt-”. If this gene nomenclature is not true for your species, this feature will not work and should not be used.
Cell cycle genes can be excluded from human and mouse datasets only. Trailmaker uses the list of cell cycle genes reported in Tirosh et al., “Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq.” Science (New York, N.Y.) vol. 352,6282 (2016): 189-96. doi:10.1126/science.aad0501. If you are using a species other than human or mouse, this feature will not work and should not be used.
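Before relying on the ribosomal or mitochondrial exclusion options, it can help to check how many genes in your annotation would be matched at all. The sketch below applies the name patterns listed above to a list of gene names. The helper `count_matches` and the case-insensitive comparison are assumptions for illustration; the article does not state how Trailmaker implements the matching.

```python
# Patterns taken from the list above; everything else is illustrative.
RIBOSOMAL_PATTERNS = ("rps", "rpl", "mrps", "mrpl")
MITO_PREFIX = "mt-"


def count_matches(gene_names):
    """Count genes matched by each name-based exclusion category."""
    names = [g.lower() for g in gene_names]
    return {
        "ribosomal": sum(any(p in g for p in RIBOSOMAL_PATTERNS) for g in names),
        "mitochondrial": sum(g.startswith(MITO_PREFIX) for g in names),
    }


print(count_matches(["RPS6", "Rpl13", "MRPL12", "MT-CO1", "ACTB"]))
# {'ribosomal': 3, 'mitochondrial': 1}
```

A count of zero for a category suggests that the corresponding exclusion option would have no effect for that annotation.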

How to spot potential issues in data processing plots and adjust settings

Occasionally, the default filtering settings calculated by Trailmaker are suboptimal and result in either too lenient or too strict filtering of the data. In these cases, settings should be adjusted in order to optimize filtering of your dataset. Below are some examples of suboptimal filtering where we recommend making manual adjustments.

The Cell size distribution filter (filter 2) sets a filtering threshold according to the inflection point (often referred to as the “knee”) on the barcode rank plot. In cases where there is no distinct steep decline in the number of transcripts (i.e. no distinctive “knee”), the default setting calculated by Trailmaker may be incorrect. This can happen when sample quality is poor and can result in too lenient or too strict filtering of background.

Examples of good quality (left) and poor quality (right) knee plots are shown below:

You can manually adjust the filtering threshold using the filtering settings menu, and then re-run data processing to apply the changes. Use the literature to determine what is typical for the cell types and/or tissues present in your dataset. It is advisable in these cases to start with less stringent values and increase the thresholds, if necessary, after a preliminary cell annotation process.
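To get a rough, independent feel for where the knee sits in a sample, one simple heuristic is to find the point on the log-log barcode rank curve that lies furthest above the straight line joining its two ends. The sketch below implements that heuristic only; it is not the method Trailmaker uses to set the default, and the function name `knee_threshold` is hypothetical.

```python
import math


def knee_threshold(counts):
    """Return the transcript count at a heuristic knee of the barcode rank curve.

    Barcodes are ranked by descending count, both axes are log10-transformed,
    and the knee is taken as the point furthest above the chord that joins the
    first and last points of the curve.
    """
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    if len(ordered) < 3:
        raise ValueError("need at least three barcodes with non-zero counts")
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(c) for c in ordered]
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])

    def height_above_chord(i):
        return ys[i] - (ys[0] + slope * (xs[i] - xs[0]))

    knee = max(range(len(ordered)), key=height_above_chord)
    return ordered[knee]


# Synthetic sample: 100 cell barcodes with 1000 transcripts, 900 background
# barcodes with 10 transcripts.
counts = [1000] * 100 + [10] * 900
threshold = knee_threshold(counts)
print(threshold)                              # 1000
print(sum(c >= threshold for c in counts))    # 100
```

On real data without a distinct drop, a heuristic like this becomes unstable, which mirrors the situation described above where the default threshold needs a manual check.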

The mitochondrial content filter (filter 3) filters out dead and dying cells based on mitochondrial content. Examples of mitochondrial content plots with very few (left) and very many (right) dead/dying cells are shown below:


In this scenario, the user has to decide where to set the threshold and, specifically in this case, whether to include or exclude the middle peak. It is advisable in these cases to start with less stringent values and increase the thresholds, if necessary, after a preliminary cell annotation process.

The number of genes versus transcripts filter (filter 4) applies a linear or spline interpolation to the data and excludes data points outside the two red dotted lines. For Parse data, the default is spline whereas for other data types the default is linear. It is possible that the interpolation may not accurately fit your sample data, and you might consider changing from spline to linear or vice versa.

Below is an example of a sample from a droplet-based technology where the default linear setting excludes some data points, particularly at the lower left area of the plot (left). Changing the setting to spline reduces the over-filtering at the lower and upper bounds (right plot).

In some cases, you might see a secondary population of poorer quality in this filtering step, as shown in the example below:


In this example, the secondary population is filtered out by default (it lies outside the red dotted lines that indicate the filtering threshold) and is, therefore, excluded from downstream analysis. It is advisable to investigate this population further, for example by disabling filter 4 so that the population is carried into downstream analysis, and then decide whether it is biologically relevant and should be kept or excluded.
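The idea behind the linear option of filter 4 can be pictured with a small sketch: fit a straight line to genes versus transcripts and flag cells that fall outside a band around it. The log10 scale, the least-squares fit, and the band width of three residual standard deviations are all assumptions chosen for illustration; the article does not specify the exact model or band that Trailmaker applies.

```python
import math
import statistics


def flag_outliers(transcripts, genes, n_sd=3.0):
    """Flag cells whose gene count lies far from a straight-line fit.

    Fits log10(genes) against log10(transcripts) by least squares and returns
    one boolean per cell: True when the residual exceeds `n_sd` standard
    deviations of all residuals.
    """
    xs = [math.log10(t) for t in transcripts]
    ys = [math.log10(g) for g in genes]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = statistics.pstdev(residuals)
    return [abs(r) > n_sd * sd for r in residuals]


# Twenty synthetic cells on a clean trend plus one cell with many transcripts
# but very few genes.
transcripts = [1000 + 500 * i for i in range(20)] + [5000]
genes = [t // 2 for t in transcripts[:20]] + [50]
flags = flag_outliers(transcripts, genes)
print(flags.count(True))  # 1
print(flags[-1])          # True
```

A straight line cannot follow curvature in the data, so cells at the low or high end can end up outside the band; that is the kind of over-filtering that switching to the spline option is meant to reduce.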

The doublet filter (filter 5) calculates the doublet score for each cell in the dataset on a per sample basis. If a sample has a low number of cells, the doublet score calculation will have low power. In this case, the distribution of the plot can look messy, with fewer cells at the extremes (close to 0 and close to 1) and more cells in the middle (between 0.1 and 0.9). On the left is an example of a plot with low statistical power, containing ~100 cells, and on the right is an example with greater statistical power, containing 10,000 cells:

If you have samples with few cells (e.g., <1000 cells), such as the example on the left, you should carefully check that the automatically determined default setting for the doublet filter is appropriate. It is advisable to check if the filtering threshold is consistent across all samples in the dataset. If necessary, you can manually adjust the setting for individual samples and re-run data processing to apply the change(s).

If a sample has too few cells to calculate the doublet score (<100 cells), a warning message will appear stating that the doublet scores have not been calculated for that sample. In this case, data processing continues without any filtering of that sample at the doublet step.
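The cell-count guidance above can be turned into a quick per-sample checklist. This sketch only encodes the two cut-offs mentioned in this article (fewer than 100 cells and fewer than 1000 cells); the function name `doublet_review` and the status labels are hypothetical.

```python
def doublet_review(cells_per_sample):
    """Classify samples by cell count for review of the doublet filter.

    "skipped": fewer than 100 cells, doublet scores are not calculated
    "review":  fewer than 1000 cells, check the threshold manually
    "ok":      otherwise
    """
    status = {}
    for sample, n_cells in cells_per_sample.items():
        if n_cells < 100:
            status[sample] = "skipped"
        elif n_cells < 1000:
            status[sample] = "review"
        else:
            status[sample] = "ok"
    return status


print(doublet_review({"S1": 80, "S2": 450, "S3": 10000}))
# {'S1': 'skipped', 'S2': 'review', 'S3': 'ok'}
```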

In the data integration step (step 6), the Harmony method of integration is applied to all datasets by default. You can select a different integration method, or no integration at all, using the dropdown menu. Well-integrated data would usually be expected to have all samples distributed evenly across all clusters and, therefore, across all areas of the UMAP. Below are example plots of good integration (left) and poor integration (right):

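Besides inspecting the UMAP by eye, sample mixing can be summarized numerically if you have cluster and sample labels for each cell, for example from an exported metadata table. The sketch below is a minimal illustration under that assumption; the function names and the 0.9 cut-off are hypothetical and not part of Trailmaker.

```python
from collections import Counter, defaultdict


def sample_fractions_by_cluster(clusters, samples):
    """For each cluster, return the fraction of its cells from each sample."""
    per_cluster = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        per_cluster[cluster][sample] += 1
    return {
        cluster: {s: n / sum(counts.values()) for s, n in counts.items()}
        for cluster, counts in per_cluster.items()
    }


def single_sample_clusters(clusters, samples, max_fraction=0.9):
    """Clusters in which one sample contributes more than `max_fraction` of cells."""
    fractions = sample_fractions_by_cluster(clusters, samples)
    return sorted(c for c, f in fractions.items() if max(f.values()) > max_fraction)


clusters = [0, 0, 0, 0, 1, 1, 1, 1]
samples = ["A", "B", "A", "B", "A", "A", "A", "A"]
print(sample_fractions_by_cluster(clusters, samples))
# {0: {'A': 0.5, 'B': 0.5}, 1: {'A': 1.0}}
print(single_sample_clusters(clusters, samples))
# [1]
```

A cluster dominated by one sample is not necessarily a sign of poor integration, since a cell type may genuinely be present in only some samples, so flagged clusters are a prompt for closer inspection rather than a verdict.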

How to troubleshoot data processing failures

If you experience a data processing failure in Trailmaker, you’ll see that the status is reported as failed, and the steps that did not complete are marked with a red “x” by that step in the dropdown menu. A message appears at the top of the page to recommend that you check all data processing steps for warnings in individual samples.

In the filtering step where the issue(s) occurred, warning message(s) appear on the sample(s) that caused the issue(s). Below is an example of a warning message in the Cell size distribution filter (filter 2), when poor quality data resulted in removing too many cells:

The common causes of data processing failures are:

The Cell size distribution filter (filter 2) excludes too many cells in one or more samples, which can cause data processing to fail at the Genes versus transcripts filtering step (filter 4). In this case, you should manually change the Cell size distribution filter threshold for the affected sample(s) to be less strict. The warning message(s) point to the affected sample(s).
Whilst Trailmaker allocates data processing resources relevant to dataset size, in some cases failures can occur as a result of resource limitations. In these cases, the pipeline status will report a failure due to a timeout which can be seen by hovering over the pipeline status bar. Users with Parse Biosciences data should first check that the kit type selection made in the Insights Project Details page is correct (Evercode WT Mini, Evercode WT or Evercode WT Mega). If an incorrect kit type was selected, change the selection using the dropdown menu and process the project again. If the failure persists or if you are working with a data type that is not Parse, contact us to request an increase in backend computational power.
Seurat integration (step 6 with selection of the Seurat integration method) can fail due to resource limitation as this method is not memory efficient. A warning is presented to suggest using an alternative integration method.
Very rarely, temporary outages of AWS servers may result in a data processing failure. On Trailmaker, this would result in a failure with no clear cause. In this case, try re-running data processing. If the issue persists, contact us for support.

To adjust settings, select the manual option in the Filtering Settings and change the threshold accordingly. It is advisable to check all samples in the affected step in case over-filtering has occurred. Re-run the Data Processing pipeline to apply your changes.

In summary, the information in this article together with the warning messages in Trailmaker itself should help you to address data processing failures. If you follow the advice provided here but continue to encounter a data processing issue, contact us at support@parsebiosciences.com for further support.