Generating Full TCR Alpha and Beta Sequences with stitchr – Support Suite - Parse Biosciences

stitchr is a tool for reconstructing full-length T cell receptor (TCR) sequences from minimal input data. This guide focuses on using the command-line utility thimble, a wrapper included with stitchr, for batch processing of TCR alpha and beta chains.

The Evercode™ TCR assay is optimized to capture CDR3 information from the TRA and TRB chains and to confidently determine paired clonotypes. The assay achieves sufficient coverage across the V(D)J gene segments to provide gene-segment annotations alongside the CDR3 information: V, D, and J for TRB, V and J for TRA, and the constant regions TRAC and TRBC. The Parse bioinformatics pipeline generates TCR output files such as tcr_annotation_airr.tsv, in standardized AIRR format, which lists the contigs identified in each cell along with their V(D)J annotations. stitchr can use tcr_annotation_airr.tsv to combine the germline gene annotations with the CDR3 sequences and output full-length coding nucleotide sequences for the TRA and TRB chains.

Prerequisites

Before you begin, ensure you have installed stitchr and the necessary data (for the genome of interest) by following the instructions here.

Input File Format

We will use the thimble wrapper provided by stitchr to run the analysis. thimble expects the input file to have the specific tab-separated format shown below.

TCR_name TRAV TRAJ TRA_CDR3 TRBV TRBJ TRB_CDR3 TRAC TRBC TRA_leader TRB_leader Linker Link_order TRA_5_prime_seq TRA_3_prime_seq TRB_5_prime_seq TRB_3_prime_seq
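A quick header check can catch formatting slips before a file is handed to thimble. The sketch below compares the first line of a file against the 17 column names listed above. The temporary toy file is created only for illustration; in practice you would point `toy_in` at your own input file.

```shell
# Expected thimble header, written space-separated here for readability
expected='TCR_name TRAV TRAJ TRA_CDR3 TRBV TRBJ TRB_CDR3 TRAC TRBC TRA_leader TRB_leader Linker Link_order TRA_5_prime_seq TRA_3_prime_seq TRB_5_prime_seq TRB_3_prime_seq'
expected_tab=$(printf '%s' "$expected" | tr ' ' '\t')

# Toy input file (illustration only) that holds just the header line
toy_in=$(mktemp)
printf '%s\n' "$expected_tab" > "$toy_in"

# Read the first line, count its tab-separated fields, and compare
header=$(head -n 1 "$toy_in")
n_cols=$(printf '%s\n' "$header" | awk -F'\t' '{ print NF }')
if [ "$header" = "$expected_tab" ]; then
    header_status="ok"
else
    header_status="mismatch"
fi
echo "columns: $n_cols, header: $header_status"
rm -f "$toy_in"
```

To check a real file, replace the toy file with a path such as thimble_TRA_input.tsv and drop the `rm` line.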

We will use the tcr_annotation_airr.tsv output file from the Parse split-pipe TCR pipeline (run with the --tcr_analysis parameter) as a starting point and analyze the TRA and TRB chains separately. The following commands convert tcr_annotation_airr.tsv into the expected format for the TRA and TRB chains.

# Extract TRA chains into the expected format
awk -F'\t' 'NR==1 {
    print "TCR_name\tTRAV\tTRAJ\tTRA_CDR3\tTRBV\tTRBJ\tTRB_CDR3\tTRAC\tTRBC\tTRA_leader\tTRB_leader\tLinker\tLink_order\tTRA_5_prime_seq\tTRA_3_prime_seq\tTRB_5_prime_seq\tTRB_3_prime_seq"
}
NR>1 && $3=="TRA" {
    split($8, J, "_")                # J gene: keep the text before the first "_"
    split($6, V, "_"); val = V[1]    # V gene: keep the text before the first "_"
    gsub(/-?DV/, "/DV", val)         # write DV genes IMGT-style, e.g. TRAV14/DV4
    gsub(/-?OR/, "/OR", val)         # write OR genes IMGT-style
    # TRB columns stay blank; the last field printed fills the TRAC column
    print $1 "\t" val "\t" J[1] "\t" $14 "\t \t \t \t" $9
}' /PATH/TO/tcr_annotation_airr.tsv > thimble_TRA_input.tsv

# Extract TRB chains into the expected format
awk -F'\t' 'NR==1 {
    print "TCR_name\tTRAV\tTRAJ\tTRA_CDR3\tTRBV\tTRBJ\tTRB_CDR3\tTRAC\tTRBC\tTRA_leader\tTRB_leader\tLinker\tLink_order\tTRA_5_prime_seq\tTRA_3_prime_seq\tTRB_5_prime_seq\tTRB_3_prime_seq"
}
NR>1 && $3=="TRB" {
    split($8, J, "_")                # J gene: keep the text before the first "_"
    split($6, V, "_"); val = V[1]    # V gene: keep the text before the first "_"
    gsub(/-?DV/, "/DV", val)         # write DV genes IMGT-style
    gsub(/-?OR/, "/OR", val)         # write OR genes IMGT-style, e.g. TRBV20/OR9-2
    # TRA columns and TRAC stay blank; the last field printed fills the TRBC column
    print $1 "\t \t \t \t" val "\t" J[1] "\t" $14 "\t \t" $9
}' /PATH/TO/tcr_annotation_airr.tsv > thimble_TRB_input.tsv
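To see what the gene-name clean-up in these commands does, the same `split` and `gsub` steps can be run on a few sample names. This is only an illustration: the four gene calls below are made-up inputs chosen to exercise each rule, not values taken from a real tcr_annotation_airr.tsv.

```shell
# Apply the same V gene clean-up used above to a few made-up gene calls
normalized=$(printf '%s\n' 'TRAV14DV4_x' 'TRAV29-DV5' 'TRBV20-OR9-2' 'TRAV1-2' |
awk '{
    split($0, V, "_"); val = V[1]   # drop anything after the first underscore
    gsub(/-?DV/, "/DV", val)        # "DV" or "-DV" becomes "/DV"
    gsub(/-?OR/, "/OR", val)        # "OR" or "-OR" becomes "/OR"
    print $0 " -> " val
}')
printf '%s\n' "$normalized"
# TRAV14DV4_x -> TRAV14/DV4
# TRAV29-DV5 -> TRAV29/DV5
# TRBV20-OR9-2 -> TRBV20/OR9-2
# TRAV1-2 -> TRAV1-2
```

Names without a DV or OR component, such as the last one, pass through unchanged.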

Running `thimble`

Once you have the input files, you can run thimble using the commands below.

#For analyzing TRA sequences
thimble -i thimble_TRA_input.tsv -o thimble_TRA_output.tsv -r a -s HUMAN
#For analyzing TRB sequences
thimble -i thimble_TRB_input.tsv -o thimble_TRB_output.tsv -r b -s HUMAN
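The two commands differ only in the chain letter, so they can be wrapped in a loop. This is a sketch that assumes the file names used above. The check that skips a chain when thimble is not on the PATH, or when the input file is missing or empty, is an addition for illustration and not part of stitchr.

```shell
# Run thimble once per chain, using the file names from the commands above
for chain in a b; do
    label=$(printf '%s' "$chain" | tr 'ab' 'AB')    # a -> A, b -> B
    in="thimble_TR${label}_input.tsv"
    out="thimble_TR${label}_output.tsv"
    # Guard added for this sketch: only run when thimble and the input exist
    if command -v thimble >/dev/null 2>&1 && [ -s "$in" ]; then
        thimble -i "$in" -o "$out" -r "$chain" -s HUMAN
    else
        echo "Skipping TR${label}: thimble not on PATH or $in missing/empty" >&2
    fi
done
```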

Required Parameters

-i: Input file path
-o: Output file path
-r: Receptor type: a/b (alpha/beta) or g/d (gamma/delta)
-s: Species, e.g. HUMAN or MOUSE

Additional options can be found here.

Output File Format

The output is a TSV file with these key columns:

Column Name	Description
TCR_name	The name given to the TCR. In this case, it will be the name of the contig from `tcr_annotation_airr.tsv`
TRA_nt or TRB_nt	The nucleotide sequence of the TCR as predicted by `stitchr`
TRA_aa or TRB_aa	The amino acid sequence of the TCR as predicted by `stitchr`
TRAV or TRBV	The V gene that was used by `stitchr`
TRAJ or TRBJ	The J gene that was used by `stitchr`
TRA_CDR3 or TRB_CDR3	The CDR3 that was used by `stitchr`
TRAC or TRBC	The C gene that was used by `stitchr`
Warnings/Errors	Any warnings or errors related to that prediction
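After a run, it can be useful to list only the TCRs that carry a warning or error. The sketch below finds the Warnings/Errors column by its header name rather than by position, then prints the TCR name and message for each flagged row. The toy file and its rows are invented for illustration, and a real output file contains more columns than shown here.

```shell
# Toy output file (invented rows; a real thimble output has more columns)
toy_out=$(mktemp)
printf 'TCR_name\tTRA_nt\tWarnings/Errors\n' > "$toy_out"
printf 'contig_1\tATGGCC\t\n' >> "$toy_out"
printf 'contig_2\t\tExample warning text\n' >> "$toy_out"

# Locate the Warnings/Errors column from the header, then print flagged rows
flagged=$(awk -F'\t' '
    NR==1 { for (i = 1; i <= NF; i++) if ($i == "Warnings/Errors") col = i; next }
    col && $col != "" { print $1 "\t" $col }
' "$toy_out")
printf '%s\n' "$flagged"
rm -f "$toy_out"
```

To apply this to a real run, replace the toy file with thimble_TRA_output.tsv or thimble_TRB_output.tsv.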

Notes

Please keep in mind that this is a prediction tool that joins together germline annotations with predicted CDR3 sequences to determine the full-length TCR sequence.
stitchr uses IMGT as a reference database. If the pipeline was run using a non-IMGT database, stitchr will try to find a corresponding IMGT annotation with the same gene name and will default to the *01 allele. More information is available in the "Gene/allele default behavior" section of the stitchr documentation.

Explanations of some of the warnings and errors that can appear in the output files can be found here.

Additional Resources

stitchr GitHub: https://github.com/jamieheather/stitchr
stitchr documentation: https://jamieheather.github.io/stitchr

References

James M Heather, Matthew J Spindler, Marta Herrero Alonso, Yifang Ivana Shui, David G Millar, David S Johnson, Mark Cobbold, Aaron N Hata, Stitchr: stitching coding TCR nucleotide sequences from V/J/CDR3 information, Nucleic Acids Research, 2022, gkac190, https://doi.org/10.1093/nar/gkac190.