How to analyze a 1 million cell data set using Scanpy and Harmony – Support Suite - Parse Biosciences

Introduction

Here we present an example of a Scanpy analysis on a 1 million cell data set generated with the Evercode™ WT Mega kit. This dataset is composed of peripheral blood mononuclear cells (PBMCs) from 12 healthy and 12 Type-1 diabetic donors from a commercial vendor, which were all barcoded and sequenced in a single experiment.

Experimental protocol

Samples were collected over the course of 3 weeks. Each sample was fixed at the time of collection using a Parse Biosciences Cell Fixation kit. Fixed samples were frozen at -80°C until ready to process with the Evercode WT Mega kit.

All 24 samples were then processed in parallel with with the Evercode WT Mega kit and then sequenced on an Illumina® NovaSeq™ with a S4 flow cell. The resulting FASTQ files were analyzed using our data analysis pipeline (v0.9.6). The resulting Digital Gene Matrix file was used in the Scanpy analysis described below.

Analyzing the scRNA-seq output using Scanpy and Harmony

This tutorial will walk you through a standard single cell analysis using the Python package Scanpy, and then follow with the Python implementation of Harmony for integration. Harmony is also easily accessible via Scanpy's external API, making it convenient to work with a preexisting Scanpy analysis or starting anew.

Recommendations for executing the Scanpy analysis

To avoid computer crashes and program failures, we recommend you execute the Python code on a cloud computing platform. The instance you select on the cloud platform to run this analysis should consist of at least 8 threads and 160 GB of RAM.

Accessing the Parse Biosciences files used in this tutorial

To execute the Harmony integration analysis described below, customer will need to download three files:

The digital gene matrix file (DGE_1M_PBMC.mtx)
The cell metadata file (cell_metadata_1M_PBMC)
The genes file (all_genes_1M_PBMC)

These three files are compressed into a single folder and available for download here. You can also download the data via command line by using wget:

wget "https://cdn.parsebiosciences.com/1M_PBMC_T1D_Parse.zip"

Tutorial

Install Harmony

pip install harmonypy

Load libraries and set the location path for analysis data

import os
import sys
import scipy
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import scipy.io as sio
import scanpy.external as sce
import matplotlib.pyplot as plt

sc.settings.verbosity = 1
sc.settings.figdir = '/my-figure-directory/'
obj_save_path = '/my-object-save-path/'
mat_path = '/my-expression-matrix-path/'

# Adjust Scanpy figure defaults
sc.settings.set_figure_params(dpi=100, fontsize=10, dpi_save=400,
    facecolor = 'white', figsize=(6,6), format='png')

Reading in data

# The DGE_filtered folder contains the expression matrix, genes, and files
adata = sc.read_mtx(mat_path + 'DGE_1M_PBMC.mtx')

adata.write(obj_save_path + 'adata_obj1.h5ad')
# adata = sc.read(obj_save_path + 'adata_obj1.h5ad')

# reading in gene and cell data
gene_data = pd.read_csv(mat_path + 'all_genes_1M_PBMC.csv')
cell_meta = pd.read_csv(mat_path + 'cell_metadata_1M_PBMC.csv')

# find genes with nan values and filter
gene_data = gene_data[gene_data.gene_name.notnull()]
notNa = gene_data.index
notNa = notNa.to_list()

# remove genes with nan values and assign gene names
adata = adata[:,notNa]
adata.var = gene_data
adata.var.set_index('gene_name', inplace=True)
adata.var.index.name = None
adata.var_names_make_unique()

# add cell meta data to anndata object
adata.obs = cell_meta
adata.obs.set_index('bc_wells', inplace=True)
adata.obs.index.name = None
adata.obs_names_make_unique()

sc.pp.filter_cells(adata, min_counts=100)
sc.pp.filter_genes(adata, min_cells=5)
adata.shape

Cell Quality Control

In this step you will perform a cell QC control check to prevent outlier cells from influencing downstream analyses. This step will check for:

cells that have unusually high transcript or gene counts, which are likely to be multiplets (a term for two or more cells that have identical barcodes as the result of being stuck together during processing)
cells that have very low transcript or genes counts
cells with high mitochondrial gene percentages, which may be indicative of cell stress (Luecken and Theis 2019)

adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Scanpy will prepend the string in the save argument with "violin"
# and save it to our figure directory defined in the first step.
sc.pl.violin(adata, ['n_genes_by_counts'], save='_n_genes', jitter=0.4)
sc.pl.violin(adata, ['total_counts'], save='_total_counts', jitter=0.4)
sc.pl.violin(adata, ['pct_counts_mt'], save='_mito_pct', jitter=0.4)

# Check cells counts after filtering before assigning
adata[(adata.obs.pct_counts_mt < 25) & 
(adata.obs.n_genes_by_counts < 5000) & 
(adata.obs.total_counts < 25000),:].shape

# Do the filtering
adata = adata[(adata.obs.pct_counts_mt < 25) & 
(adata.obs.n_genes_by_counts < 5000) & 
(adata.obs.total_counts < 25000),:]

Now that we've removed the outlier cells, we can normalize the matrix to 10,000 reads per cell and log transform the results.

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Identify highly-variable genes

Our next goal is to identify genes with the greatest amount of variance (i.e. genes that are likely to be the most informative). This subset of genes will be used to calculate a set of principal components which will determine how our cells are classified using Leiden clustering and UMAP. You can fine tune variable gene selection by adjusting the min/max mean expression and min/max dispersion.

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.25)
# This saves the original set of genes 
adata.raw = adata

adata = adata[:,adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)

Principal component analysis

The most "informative" cells and genes have been extracted from this dataset, and now it is time to start the process of dimensional reduction by generating a list of principal components (PCs).

Principal component analysis will reduce the number of columns (variable gene expression values) to set of PCs which explain the variance in our dataset.

In the analysis we use a simple and effective method for choosing the PCs by plotting the variance ratio for each PC and choosing the last PC where the ratio starts to "flatten" out.

For this particular dataset we've chosen 50 PCs, which will be passed to the sc.pp.neighbors function in the next step.

sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50, save='') # scanpy generates the filename automatically

Perform UMAP and Leiden Clustering

This step will involve reducing the dimensionality of our data into two dimensions using uniform manifold approximation (UMAP), allowing us to visualize our cell populations as they are binned into discrete populations using Leiden clustering. The n_neighbors parameter in the sc.pp.neighbors function will determine the size of each cell cluster; lower values will translate to a greater number of clusters by breaking up the dataset into smaller communities, and visa versa for larger values. We can also fine tune the number of clusters using the resolution parameter in the sc.tl.leiden function.

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=50)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

Plots

Define a helper function to set a single column for our UMAP figure legends. This is to keep the sample and Leiden cluster plots UMAP plots at similar proportions.

def one_col_lgd(umap):
    legend = umap.legend(bbox_to_anchor=[1.00, 0.5],
    loc='center left', ncol=1, prop={'size': 6})
    legend.get_frame().set_linewidth(0.0)
    for handle in legend.legendHandles:
        handle.set_sizes([25.0])
    return legend

Generate the UMAPs:

donor_umap = sc.pl.umap(adata, color=['sample'],
show=False, palette=sns.color_palette("husl", 24),
    legend_fontsize=6, frameon=True, title='Donor')

lgd = one_col_lgd(donor_umap)

fig = donor_umap.get_figure()
fig.set_size_inches(5, 5)
fig.savefig(str(sc.settings.figdir) + '/umap_lgd_sample',
    dpi=400, bbox_extra_artists=(lgd,), bbox_inches='tight')

# by cluster
leiden_umap = sc.pl.umap(adata, color=['leiden'],
    show=False, palette=sns.color_palette("husl", 24),
legend_fontsize=6, frameon=True, title='Leiden')

lgd = one_col_lgd(leiden_umap)

fig = leiden_umap.get_figure()
fig.set_size_inches(5, 5)
fig.savefig(str(sc.settings.figdir) + '/umap_lgd_leiden',
    dpi=400, bbox_extra_artists=(lgd,), bbox_inches='tight')

You can examine the relationship between each donor sample and the Leiden clusters in further detail after running Harmony to get a side-by-side comparison between both analyses. For now, it's important to notice how the clusters form and separate by their condition when observing both the sample and cluster UMAP plots.

Note that these algorithms are stochastic, and the result is that the UMAP figures produced by customers may differ slightly from the figures in this tutorial. The reasons for this slight difference in the results is explained in greater detail here.

Integration with Harmony

Harmony (Korsunsky et al. 2019) is a single cell integration method which groups cells by type (as opposed to conditions) onto a shared embedding (e.g., UMAP or t-SNE). This is particularly useful for identifying the same cell type among different conditions, where biological and technical differences can obfuscate cell identity and make it difficult to draw comparisons between each condition in a standard workflow.

Why is it advantageous to remove batch effects due to differences in biological conditions?

Technical variation and differences in underlying biology can make it difficult to study specific cell types across conditions as their transcriptomic signatures may vary. This is especially difficult when using multiple primary samples and complex study designs. Harmony enables scientists to make comparisons between a specific set of conditions with less guesswork about which clusters correspond to a specific cell type.

Running Harmony

It is now time to execute Harmony on the anndata object from the above workflow by specifying "sample" as the cell attribute to be corrected.

sce.pp.harmony_integrate(adata, 'sample')

Harmony works by adjusting the principal components. After the integration function runs to completion, assign the new PCs to the standard PCA slot, and run through the subsequent steps from our standard analysis.

adata.obsm['X_pca'] = adata.obsm['X_pca_harmony']
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

Plots

This step follows the same UMAP plot template as our standard workflow. To generate the plots, execute the code below.

# by sample
donor_umap = sc.pl.umap(adata, color=['sample'],
    show=False, palette=sns.color_palette("husl", 24),
legend_fontsize=6, frameon=True, title='Donor')

lgd = one_col_lgd(donor_umap)

fig = donor_umap.get_figure()
fig.set_size_inches(5, 5)
fig.savefig(str(sc.settings.figdir) + '/umap_lgd_harmony_sample',
    dpi=400, bbox_extra_artists=(lgd,), bbox_inches='tight')

# by cluster
leiden_umap = sc.pl.umap(adata, color=['leiden'],
show=False, palette=sns.color_palette("husl", 24),
    legend_fontsize=6, frameon=True, title='Leiden')

lgd = one_col_lgd(leiden_umap)

fig = leiden_umap.get_figure()
fig.set_size_inches(5, 5)
fig.savefig(str(sc.settings.figdir) + '/umap_lgd_harmony_leiden',
    dpi=400, bbox_extra_artists=(lgd,), bbox_inches='tight')

The first and most notable effect of running the integration function to completion is seeing how the samples overlap with another, which generates fewer clusters by merging similar cell types.

Comparing the Two Analyses

(To zoom in right click on each image and select "Open image in new tab")

Conclusion

We demonstrate how to analyze and integrate Parse Biosciences Evercode WT scRNA-sequencing data with Scanpy and Harmony. Using Harmony to remove batch effects makes it easier to compare across conditions with less guesswork about which clusters correspond to a specific cell type. As scRNA-seq studies become larger with more complex study designs, data integration is increasingly important for data analysis.

Analysis and visualization of data generated from Parse Biosciences Evercode^TM WT technology can be performed using Trailmaker^TM or using code-based methods such as Seurat or Scanpy. Trailmaker is a user-friendly platform that incorporates both Seurat and Scanpy options, and covers all essential steps in the data analysis process (including integration with Harmony). For more information on Trailmaker, visit our website or sign up to start exploring - it’s free to all Parse customers and all academic users.

Introduction

Experimental protocol

Analyzing the scRNA-seq output using Scanpy and Harmony

Recommendations for executing the Scanpy analysis

Accessing the Parse Biosciences files used in this tutorial

Tutorial

Install Harmony

Load libraries and set the location path for analysis data

Reading in data

Cell Quality Control

Identify highly-variable genes

Principal component analysis

Perform UMAP and Leiden Clustering

Plots

Integration with Harmony

Why is it advantageous to remove batch effects due to differences in biological conditions?

Running Harmony

Plots

Comparing the Two Analyses

Conclusion

Related articles