Quality control of designation of V(D)J recombinations with .should-vdj.fa tests

The .should-vdj.fa tests are sequences with manually curated V(D)J designations. These designations were checked by hand, possibly with the help of some bioinformatics tools. Tests may range from very easy cases with unambiguous V(D)J designations to borderline or difficult cases, including incomplete or unusual recombinations or translocations.

This collection of sequences, distributed as open-source data, may help improve the robustness of any software doing immune repertoire sequencing (RepSeq) analysis.

Contributing to the tests

Users and developers of RepSeq software are encouraged to send us their manually curated sequences, ideally in the format described below, or to propose pull requests directly on GitLab with new tests in the algo/tests/should-vdj directory. We can also help encode sequences in this format. The current tests were contributed by:

A .should-vdj.fa file

A .should-vdj.fa file is (almost) a FASTA file, containing one or several sequences with their V(D)J designation:

>IGKV1-5*03 9/4/1 IGKJ1*01  [IGK]

# Patient 0122
>TRGV5*01 4/AG/5 TRGJP2*01  [TRG]   # 1st clone

>TRGV11 TRGJ1  [TRG]               # 2nd clone

Nucleotide sequences can mix upper-case and lower-case characters, and may contain arbitrary line breaks. Comments can be given either as lines starting with #, or parts of headers, again starting with #.
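The reading rules above can be sketched as a small Python reader. This is only an illustration, not the parser used by vidjil-algo: the names `Record` and `read_should_vdj` are hypothetical, upper-casing the sequence is a normalization chosen here, and the nucleotides in the sample are placeholders rather than a real recombination.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    header: str    # designation text, without '>' and without trailing comment
    comment: str   # text after '#' in the header, or '' if there is none
    sequence: str  # nucleotides, line breaks removed, upper-cased (our choice)


def read_should_vdj(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per '>' header found in the given lines."""
    header = None
    comment = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        if line.startswith(">"):
            if header is not None:
                yield Record(header, comment, "".join(chunks).upper())
            head, _, tail = line[1:].partition("#")
            header, comment, chunks = head.strip(), tail.strip(), []
        elif header is not None:
            chunks.append(line)  # sequence data, possibly split over lines
    if header is not None:
        yield Record(header, comment, "".join(chunks).upper())


# Placeholder nucleotides, for illustration only.
sample = """\
# Patient 0122
>TRGV5*01 4/AG/5 TRGJP2*01  [TRG]   # 1st clone
acgtAC
GTtt

>TRGV11 TRGJ1  [TRG]
ggcc
"""
records = list(read_should_vdj(sample.splitlines()))
for record in records:
    print(record)
```

Keeping the header comment separate from the designation lets later steps work on the designation alone.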

Encoding the V(D)J designation

The header of each sequence, beginning with >, gives the V(D)J designation of the underlying sequence, such as in >IGKV1-5*03 9/CTAC/1 IGKJ1*01 [IGK].

The designation can thus be very short, such as in >TRGV2 TRGJP1. However, it is advised to put as much information as possible in the designation, such as in >TRGV2*01 9/CCCTGG/1 TRGJP1*01 [TRG]. Such complete designations will give more extensive tests.
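To make the short and complete forms concrete, here is a minimal Python sketch that splits a designation into gene and junction items. It assumes that a slash-separated token such as 9/CCCTGG/1 reads as deletions / insertion / deletions, and that the insertion may be given as nucleotides, as a count, or left empty. The names `Gene`, `Junction` and `parse_designation` are hypothetical, and the sketch does not handle alternatives in parentheses or JUNCTION/CDR3 information in curly braces.

```python
import re
from typing import List, NamedTuple, Optional, Union


class Gene(NamedTuple):
    name: str
    allele: Optional[str]  # None when the designation gives no allele


class Junction(NamedTuple):
    left_deletions: int
    insertion: str         # nucleotides, a count such as '4', or ''
    right_deletions: int


# Assumed token shape: <digits>/<nucleotides or digits or nothing>/<digits>
JUNCTION_RE = re.compile(r"^(\d+)/([ACGTNacgtn]*|\d+)/(\d+)$")


def parse_designation(header: str) -> List[Union[Gene, Junction]]:
    """Split a header into Gene and Junction items, ignoring locus and comment."""
    text = header.lstrip(">").split("#", 1)[0]
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop the bracketed locus
    items: List[Union[Gene, Junction]] = []
    for token in text.split():
        match = JUNCTION_RE.match(token)
        if match:
            items.append(
                Junction(int(match.group(1)), match.group(2), int(match.group(3)))
            )
        else:
            name, _, allele = token.partition("*")
            items.append(Gene(name, allele or None))
    return items


short = parse_designation(">TRGV2 TRGJP1")
full = parse_designation(">TRGV2*01 9/CCCTGG/1 TRGJP1*01 [TRG]")
print(short)
print(full)
```

The short form yields only two gene names to check, whereas the complete form also carries alleles and junction details, which is why it gives more extensive tests.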

VDJ recombinations can be encoded, such as

Incomplete or unusual recombinations can also be specified, such as

Very special cases should be explained by comments in plain English.

Encoding the locus

The end of the header may also contain information on the locus, between brackets, leading to additional tests. This also makes it possible to specify only >[TRG] for a sequence that should be recognized as TRG even if it is difficult to choose a precise VJ designation.

Human loci should be encoded as [IGH], [IGK], [IGL], [TRA], [TRB], [TRG], or [TRD]. Incomplete or unusual recombinations can be encoded with an additional + character, such as in [IGH+] or [TRD+]. Mixed TRA/TRD recombinations can be encoded with [TRA+D].
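The locus codes listed above can be recognized with a few lines of Python. This is a sketch under assumptions: the `Locus` tuple, its `kind` labels and the function `parse_locus` are hypothetical names chosen for illustration, and rejecting unknown codes with an exception is a design choice made here, not documented behavior.

```python
import re
from typing import NamedTuple, Optional

HUMAN_LOCI = {"IGH", "IGK", "IGL", "TRA", "TRB", "TRG", "TRD"}
LOCUS_RE = re.compile(r"\[([^\]\s]+)\]")


class Locus(NamedTuple):
    code: str  # as written between the brackets, e.g. 'IGH+'
    base: str  # locus without the '+' marker, e.g. 'IGH'
    kind: str  # 'complete', 'incomplete_or_unusual' or 'mixed_tra_trd'


def parse_locus(header: str) -> Optional[Locus]:
    """Return the locus given between brackets, or None if the header has none."""
    match = LOCUS_RE.search(header.split("#", 1)[0])  # ignore trailing comment
    if match is None:
        return None
    code = match.group(1)
    if code == "TRA+D":
        return Locus(code, "TRA", "mixed_tra_trd")
    if code.endswith("+"):
        base, kind = code[:-1], "incomplete_or_unusual"
    else:
        base, kind = code, "complete"
    if base not in HUMAN_LOCI:
        raise ValueError(f"unknown locus code: [{code}]")
    return Locus(code, base, kind)


for header in (">[TRG]", ">TRGV2 TRGJP1", ">[IGH+]", ">[TRA+D]"):
    print(header, "->", parse_locus(header))
```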

Other special cases, such as translocations involving BCL1 or BCL2, should for now be written as comments after a # character.

Encoding the JUNCTION/CDR3 information

JUNCTION or CDR3 information can be optionally encoded, using curly braces:


Ambiguous or alternate designations

On some sequences, several V(D)J designations may be equally acceptable. These alternate choices can be encoded as (choice1, choice2). For difficult cases, it is advised to also leave a comment in plain English:

# The D/J junction can be seen as 2//7, 3//6, or 4//5
>IGHV3-48*01 0/AA/6 IGHD5-12*01 (2//7, 3//6, 4//5) IGHJ4*02  [IGH]

# TRGJ1*01 or TRGJ1*02
>TRGV5*01 (TRGJ1*01, TRGJ1*02)  [TRG]

# TRGJ1*01 or TRGJ1*02 but with different deletions
>TRGV4*02 (4/4/4 TRGJ1*01, 4/4/1 TRGJ1*02) [TRG]
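One way to read the (choice1, choice2) notation is to expand a header into the list of plain designations it accepts. The Python sketch below does this for the three examples above. The function name `expand_alternatives` is hypothetical, nested parentheses are not supported, and the sketch says nothing about how vidjil-algo itself handles alternatives.

```python
import itertools
import re
from typing import List

ALTERNATIVES_RE = re.compile(r"\(([^()]*)\)")


def expand_alternatives(designation: str) -> List[str]:
    """Expand every '(a, b, ...)' group into one plain designation per choice."""
    text = designation.split("#", 1)[0]  # ignore a trailing comment
    # With one capturing group, re.split alternates literal text (even
    # indices) and the contents of each parenthesized group (odd indices).
    parts = ALTERNATIVES_RE.split(text)
    options = []
    for index, part in enumerate(parts):
        if index % 2:
            options.append([choice.strip() for choice in part.split(",")])
        else:
            options.append([part])
    # Join each combination and normalize runs of whitespace.
    return [
        " ".join("".join(combo).split())
        for combo in itertools.product(*options)
    ]


for line in (
    "IGHV3-48*01 0/AA/6 IGHD5-12*01 (2//7, 3//6, 4//5) IGHJ4*02  [IGH]",
    "TRGV5*01 (TRGJ1*01, TRGJ1*02)  [TRG]",
    "TRGV4*02 (4/4/4 TRGJ1*01, 4/4/1 TRGJ1*02) [TRG]",
):
    for expanded in expand_alternatives(line):
        print(expanded)
```

A header without parentheses expands to a single designation, so the same function can be applied to every header.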

Program-specific information


The automated test suite of vidjil-algo launches the analysis on all these .should-vdj sequences and compares the computed designations with the curated designations. Cases that currently fail are marked as TODO. Behaving correctly on these tests may be a goal for future releases.
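The comparison step could look like the following Python sketch. It is not the vidjil-algo test suite: the matching rule (every curated token must appear, in order, in the computed designation, and a curated gene without allele matches any allele of that gene) is an assumption made here for illustration, and the names `token_matches` and `designation_matches` are hypothetical.

```python
def token_matches(curated: str, computed: str) -> bool:
    """A curated token matches itself, or any allele of a gene it names."""
    return computed == curated or computed.startswith(curated + "*")


def designation_matches(curated: str, computed: str) -> bool:
    """True if all curated tokens are found, in order, among computed tokens."""
    remaining = iter(computed.split())
    # The shared iterator is consumed as tokens are found, which enforces order.
    return all(
        any(token_matches(token, candidate) for candidate in remaining)
        for token in curated.split()
    )


computed = "TRGV2*01 9/CCCTGG/1 TRGJP1*01 [TRG]"
print(designation_matches("TRGV2 TRGJP1", computed))
print(designation_matches("TRGV2*01 8/CCCTGG/1 TRGJP1*01 [TRG]", computed))
```

Under this rule a short curated designation checks only the gene names, while a complete one also constrains alleles, junction and locus.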

Within the algo/tests directory:


The paper [1] includes an evaluation of the V(D)J designation of 125 clones.

  1. Yann Ferret, A. Caillault et al., Multi-loci diagnosis of acute lymphoblastic leukaemia with high-throughput sequencing and bioinformatics analysis, British Journal of Haematology, 2016, 173, 413–420, https://hal.archives-ouvertes.fr/hal-01279160

  2. Mikaël Salson et al., A dataset of sequences with manually curated V(D)J designations, RepSeq 2015, https://hal.inria.fr/hal-01331556