stitchr is a tool for reconstructing full-length T cell receptor (TCR) sequences from minimal input data. This guide focuses on thimble, a command-line wrapper included with stitchr, for batch processing of TCR alpha and beta chains.
The Evercode™ TCR assay is optimized to capture CDR3 information from TRA and TRB chains and to confidently determine paired clonotypes. The assay achieves sufficient coverage across the VDJ gene segments to provide gene-segment annotations alongside the CDR3 information: V, D, and J for TRB, V and J for TRA, and the constant genes TRAC and TRBC. The Parse bioinformatics pipeline generates TCR output files such as tcr_annotation_airr.tsv, in standardized AIRR format, which lists the contigs identified in each cell along with their VDJ chain annotations. stitchr can use the tcr_annotation_airr.tsv file to stitch germline annotations together with the CDR3 sequences and output predicted full-length coding nucleotide sequences for the TRA and TRB chains.
Prerequisites
Before you begin, ensure you have installed stitchr and the necessary data (for the genome of interest) by following the instructions here.
Input File Format
We will use the thimble wrapper provided by stitchr to run the analysis. thimble expects a tab-separated input file with the following columns:
TCR_name TRAV TRAJ TRA_CDR3 TRBV TRBJ TRB_CDR3 TRAC TRBC TRA_leader TRB_leader Linker Link_order TRA_5_prime_seq TRA_3_prime_seq TRB_5_prime_seq TRB_3_prime_seq
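Before running thimble, it can help to confirm that an input file starts with exactly this header. The following is a minimal sketch, not part of stitchr: `check_thimble_header` and `THIMBLE_COLS` are hypothetical names introduced here, and the check assumes the 17 column names listed above are tab-separated on the first line.

```shell
# Column names copied from the list above (space-separated here for readability).
THIMBLE_COLS="TCR_name TRAV TRAJ TRA_CDR3 TRBV TRBJ TRB_CDR3 TRAC TRBC TRA_leader TRB_leader Linker Link_order TRA_5_prime_seq TRA_3_prime_seq TRB_5_prime_seq TRB_3_prime_seq"

# Hypothetical helper: checks the first line of stdin (or of the files given
# as arguments) against the expected tab-separated header.
check_thimble_header() {
  awk -F'\t' -v cols="$THIMBLE_COLS" '
    NR == 1 {
      n = split(cols, want, " ")
      if (NF != n) { print "expected " n " columns, found " NF; exit 1 }
      for (i = 1; i <= n; i++)
        if ($i != want[i]) { print "column " i ": expected " want[i] ", found " $i; exit 1 }
      print "header OK"
      exit 0
    }
    END { if (NR == 0) { print "empty input"; exit 1 } }
  ' "$@"
}

# A correct header passes.
printf '%s\n' "$THIMBLE_COLS" | tr ' ' '\t' | check_thimble_header

# A truncated header is reported and the function returns non-zero.
printf 'TCR_name\tTRAV\tTRAJ\n' | check_thimble_header || true
```

To check a file on disk, pass its path as an argument, for example `check_thimble_header thimble_TRA_input.tsv`.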
We will use the tcr_annotation_airr.tsv output file from the Parse split-pipe TCR pipeline (run with the --tcr_analysis parameter) as a starting point, and analyze TRA and TRB chains separately. The following commands convert tcr_annotation_airr.tsv into the expected format for TRA and TRB chains.
# Extract TRA chains into the expected format
awk -F'\t' 'NR==1 {
  # Write the thimble header once
  print "TCR_name\tTRAV\tTRAJ\tTRA_CDR3\tTRBV\tTRBJ\tTRB_CDR3\tTRAC\tTRBC\tTRA_leader\tTRB_leader\tLinker\tLink_order\tTRA_5_prime_seq\tTRA_3_prime_seq\tTRB_5_prime_seq\tTRB_3_prime_seq"
}
NR>1 && $3=="TRA" {
  split($8, J, "_")             # J gene: keep the text before the first underscore
  split($6, V, "_"); val=V[1]   # V gene: keep the text before the first underscore
  gsub(/-?DV/, "/DV", val)      # rewrite DV gene names in slash form
  gsub(/-?OR/, "/OR", val)      # rewrite OR gene names in slash form
  # TRB columns are left blank; column 9 fills TRAC
  print $1 "\t" val "\t" J[1] "\t" $14 "\t \t \t \t" $9
}' /PATH/TO/tcr_annotation_airr.tsv > thimble_TRA_input.tsv
# Extract TRB chains into the expected format
awk -F'\t' 'NR==1 {
  # Write the thimble header once
  print "TCR_name\tTRAV\tTRAJ\tTRA_CDR3\tTRBV\tTRBJ\tTRB_CDR3\tTRAC\tTRBC\tTRA_leader\tTRB_leader\tLinker\tLink_order\tTRA_5_prime_seq\tTRA_3_prime_seq\tTRB_5_prime_seq\tTRB_3_prime_seq"
}
NR>1 && $3=="TRB" {
  split($8, J, "_")             # J gene: keep the text before the first underscore
  split($6, V, "_"); val=V[1]   # V gene: keep the text before the first underscore
  gsub(/-?DV/, "/DV", val)      # rewrite DV gene names in slash form
  gsub(/-?OR/, "/OR", val)      # rewrite OR gene names in slash form
  # TRA columns and TRAC are left blank; column 9 fills TRBC
  print $1 "\t \t \t \t" val "\t" J[1] "\t" $14 "\t \t" $9
}' /PATH/TO/tcr_annotation_airr.tsv > thimble_TRB_input.tsv
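To see what the gene-name clean-up in the awk commands above does, the same three steps can be applied to a single name. This is an illustrative sketch: `normalize_v_gene` is a hypothetical helper, and the input spellings are made-up examples rather than values taken from a real tcr_annotation_airr.tsv file.

```shell
# Hypothetical helper that repeats the V-gene steps from the awk commands:
# keep the text before the first "_", then rewrite DV and OR names in slash form.
normalize_v_gene() {
  printf '%s\n' "$1" | awk '{
    split($0, V, "_"); val = V[1]
    gsub(/-?DV/, "/DV", val)
    gsub(/-?OR/, "/OR", val)
    print val
  }'
}

normalize_v_gene "TRAV14DV4_suffix"   # prints TRAV14/DV4
normalize_v_gene "TRAV29-DV5"         # prints TRAV29/DV5
normalize_v_gene "TRBV20-OR9-2"       # prints TRBV20/OR9-2
normalize_v_gene "TRBV12-3"           # prints TRBV12-3 (unchanged)
```

One consequence of these substitutions is that a name already written with a slash, such as TRAV14/DV4, would come out as TRAV14//DV4, so the commands assume the input does not already use the slash form.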
Running thimble
Once you have the input files, you can run thimble using the commands below.
#For analyzing TRA sequences
thimble -i thimble_TRA_input.tsv -o thimble_TRA_output.tsv -r a -s HUMAN
#For analyzing TRB sequences
thimble -i thimble_TRB_input.tsv -o thimble_TRB_output.tsv -r b -s HUMAN
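The two commands differ only in the chain letter, so they can be wrapped in a small loop. The sketch below uses only the flags shown above; `run_thimble_chains` is a hypothetical helper, and it assumes the input files are named as in the previous section and sit in the current directory.

```shell
# Hypothetical helper: run thimble once per chain, skipping missing inputs.
# $1 = species passed to -s (defaults to HUMAN, as in the commands above).
run_thimble_chains() {
  species=${1:-HUMAN}
  for chain in a b; do
    suffix=$(printf '%s' "$chain" | tr 'ab' 'AB')   # a -> A, b -> B
    input="thimble_TR${suffix}_input.tsv"
    output="thimble_TR${suffix}_output.tsv"
    if [ ! -f "$input" ]; then
      echo "skipping $input (not found)"
      continue
    fi
    thimble -i "$input" -o "$output" -r "$chain" -s "$species"
  done
}

run_thimble_chains HUMAN
```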
Required Parameters
- -i: Input file path
- -o: Output file path
- -r: Kind of TCR, i.e. a/b or g/d
- -s: Species, e.g. HUMAN or MOUSE
Additional options can be found here.
Output File Format
The output is a TSV file with these key columns:
Column Name | Description |
TCR_name | The name given to the TCR. In this case, it will be the name of the contig from tcr_annotation_airr.tsv |
TRA_nt or TRB_nt | The nucleotide sequence of the TCR as predicted by stitchr |
TRA_aa or TRB_aa | The amino acid sequence of the TCR as predicted by stitchr |
TRAV or TRBV | The V gene that was used by stitchr |
TRAJ or TRBJ | The J gene that was used by stitchr |
TRA_CDR3 or TRB_CDR3 | The CDR3 that was used by stitchr |
TRAC or TRBC | The C gene that was used by stitchr |
Warnings/Errors | Any warnings or errors related to that prediction |
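A quick way to review the results is to list only the TCRs that carry a message in the Warnings/Errors column. The following sketch assumes the output header uses the column names exactly as written in the table above; `flagged_tcrs` is a hypothetical helper, and the sample rows are invented for illustration.

```shell
# Hypothetical helper: print TCR_name and the Warnings/Errors text for every
# row where that column is non-blank. Columns are located by header name.
flagged_tcrs() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "TCR_name") name = i
        if ($i == "Warnings/Errors") warn = i
      }
      if (!name || !warn) { print "required columns not found" > "/dev/stderr"; exit 1 }
      next
    }
    $warn ~ /[^ ]/ { print $name "\t" $warn }
  ' "$@"
}

# Invented three-column example: only contig_2 has a message.
{
  printf 'TCR_name\tTRA_nt\tWarnings/Errors\n'
  printf '%s\t%s\t%s\n' contig_1 ATG "" contig_2 ATG "example warning text"
} | flagged_tcrs
```

On a real run, pass the output file as an argument, for example `flagged_tcrs thimble_TRA_output.tsv`.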
Notes
- Please keep in mind that this is a prediction tool that joins together germline annotations with predicted CDR3 sequences to determine the full-length TCR sequence.
- stitchr uses IMGT as a reference database. If the pipeline was run using a non-IMGT database, stitchr will try to find a corresponding IMGT annotation with the same gene name and will default to the *01 allele. More information is available in the Gene/allele default behavior section of the stitchr documentation.
- Explanations of some of the Warnings/Errors that appear in the output files can be found here.
Additional Resources
- stitchr GitHub: https://github.com/jamieheather/stitchr
- stitchr documentation: https://jamieheather.github.io/stitchr
References
James M Heather, Matthew J Spindler, Marta Herrero Alonso, Yifang Ivana Shui, David G Millar, David S Johnson, Mark Cobbold, Aaron N Hata, Stitchr: stitching coding TCR nucleotide sequences from V/J/CDR3 information, Nucleic Acids Research, 2022, gkac190, https://doi.org/10.1093/nar/gkac190.