fuse.py: converting and merging immune repertoire data

Merging files to follow clones in several samples

Many immune repertoire sequencing studies aim to track clones across several samples. One can compare repertoires from several samples, taken from the same person or from different people, and detect and quantify the clones they share. For example, in a minimal residual disease (MRD) setup, we want to follow the main clones identified at diagnosis in the subsequent follow-up samples.

Let us assume that four .vidjil files have been produced, one per sample (namely diag.vidjil, fu1.vidjil, fu2.vidjil, fu3.vidjil). They can be merged as follows:

python tools/fuse.py --output mrd.vidjil --top 100 diag.vidjil fu1.vidjil fu2.vidjil fu3.vidjil

The --top parameter sets how many of the top clones of each sample are kept. The default value is 50. Here, --top 100 means that the top 100 clones of each sample are kept and followed in all the other samples, even if they are not in the top 100 there. This makes it possible to follow and quantify targeted clones even when they have only a few reads in some samples.
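The selection rule described above can be illustrated with a minimal Python sketch. It is not the code of fuse.py: the function names (followed_clones, fuse_counts), the dictionary-based input and the tie-breaking by clone identifier are assumptions made for the example. It only shows the idea that a clone in the top N of any sample is reported in every sample.

```python
def followed_clones(samples, top=50):
    """Return the ids of clones that are in the top `top` of at least one sample.

    `samples` is a list of dicts mapping a clone id to its read count
    (a hypothetical, simplified stand-in for the content of .vidjil files).
    """
    keep = set()
    for counts in samples:
        # Rank clones by decreasing read count; ties are broken by clone id.
        ranked = sorted(counts, key=lambda clone: (-counts[clone], clone))
        keep.update(ranked[:top])
    return keep


def fuse_counts(samples, top=50):
    """Report, for every followed clone, its read count in each sample."""
    keep = followed_clones(samples, top)
    return {
        clone: [counts.get(clone, 0) for counts in samples]
        for clone in sorted(keep)
    }


# Made-up read counts for a diagnosis sample and one follow-up sample.
diag = {"cloneA": 9000, "cloneB": 500, "cloneC": 20}
fu1 = {"cloneA": 3, "cloneD": 800, "cloneB": 40}

merged = fuse_counts([diag, fu1], top=1)
print(merged)
```

With top=1, cloneA is kept because it dominates the diagnosis sample, and it is still reported in the follow-up sample where it has only 3 reads.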

The mrd.vidjil file can then be fed to the web client.

Using AIRR data

The AIRR community has published a standard representation to describe the results of immune receptor repertoire analysis. This .tsv format is used by a growing number of software tools and makes it easy to transfer immune repertoire data between pipelines.

The AIRR output of vidjil-algo makes it possible to feed vidjil-algo results to other software. Conversely, fuse.py can take one or several AIRR .tsv files and produce a .vidjil file that can be opened by the Vidjil web application:

python tools/fuse.py --output out.vidjil sample1.tsv sample2.tsv
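Before fusing, it can be useful to check what an AIRR .tsv file contains. The sketch below reads a tab-separated table with Python's csv module and sums read counts per junction. The column names (sequence_id, v_call, j_call, junction, duplicate_count) come from the AIRR Rearrangement schema, but the data and the helper reads_per_junction are made up for the example. This is not how fuse.py parses AIRR files, and which columns fuse.py actually uses is not stated here.

```python
import csv
import io

# Tiny in-memory AIRR-like table with made-up values.
AIRR_TSV = (
    "sequence_id\tv_call\tj_call\tjunction\tduplicate_count\n"
    "seq1\tIGHV3-23*01\tIGHJ4*02\tTGTGCGAAAGATTACTGG\t120\n"
    "seq2\tIGHV1-69*01\tIGHJ6*02\tTGTGCGAGAGGGTACTGG\t35\n"
    "seq3\tIGHV3-23*01\tIGHJ4*02\tTGTGCGAAAGATTACTGG\t5\n"
)


def reads_per_junction(handle):
    """Sum the duplicate_count column for each distinct junction sequence."""
    totals = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        junction = row["junction"]
        totals[junction] = totals.get(junction, 0) + int(row["duplicate_count"])
    return totals


totals = reads_per_junction(io.StringIO(AIRR_TSV))
for junction, reads in totals.items():
    print(junction, reads)
```

To inspect a real file, the same function can be given an open file handle, for example open("sample1.tsv", newline="").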

Within the same analysis, you can mix .vidjil and AIRR files.

However, the following points should be taken into account: