fuse.py: converting and merging immune repertoire data
Merging files to follow clones in several samples
Many immune repertoire sequencing studies aim to track clones across several samples. One can compare repertoires from samples taken from the same person or from different people, and detect and quantify the clones they share. For example, in a minimal residual disease (MRD) setup, the goal is to follow the main clones identified at diagnosis in the follow-up samples.
Let us assume that four .vidjil files have been produced, one per sample (namely diag.vidjil, fu1.vidjil, fu2.vidjil, and fu3.vidjil). They can be merged as follows:
python tools/fuse.py --output mrd.vidjil --top 100 diag.vidjil fu1.vidjil fu2.vidjil fu3.vidjil
The --top parameter sets how many top clones per sample are kept. The default value is 50. Here, --top 100 means that the top 100 clones of each sample are kept and followed in all the other samples, even if they are not in the top 100 there. This makes it possible to follow and quantify targeted clones even when they have only a few reads in some samples.
The mrd.vidjil file can then be fed to the web client.
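The selection rule behind --top can be illustrated with a short Python sketch. This is not the code of fuse.py: the function names followed_clones and follow, as well as the toy read counts, are hypothetical. The sketch only mirrors the behavior described above, assuming each sample is a plain mapping from clone identifier to read count.

```python
def followed_clones(samples, top=50):
    """Return the clones that rank in the top `top` of at least one sample.

    `samples` maps a sample name to a {clone_id: read_count} dict
    (a simplified, hypothetical stand-in for the content of a .vidjil file).
    """
    kept = set()
    for counts in samples.values():
        # Rank the clones of this sample by decreasing read count.
        ranked = sorted(counts, key=counts.get, reverse=True)
        kept.update(ranked[:top])
    return kept


def follow(samples, top=50):
    """Report the size of every followed clone in every sample (0 if absent)."""
    kept = followed_clones(samples, top)
    return {
        clone: [counts.get(clone, 0) for counts in samples.values()]
        for clone in sorted(kept)
    }


# Toy data: clone "A" dominates at diagnosis and is nearly gone at follow-up.
samples = {
    "diag": {"A": 900, "B": 80, "C": 5},
    "fu1": {"A": 3, "D": 40, "B": 1},
}

table = follow(samples, top=1)
for clone, sizes in table.items():
    print(clone, sizes)
```

With top=1, clone "A" is kept because it is the top clone at diagnosis, so its 3 reads in the follow-up sample are still reported even though "D" is the top clone there.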
Using AIRR data
The AIRR community has published a standard representation to describe the results of immune receptor repertoire analysis. Used by a growing number of software tools, this .tsv format makes it easy to transfer immune repertoire data between pipelines. The AIRR output of vidjil-algo makes it possible to feed vidjil-algo results to other software.
Conversely, fuse.py can take one or several AIRR .tsv files and produce a .vidjil file that can be opened by the Vidjil web application:
python tools/fuse.py --output out.vidjil sample1.tsv sample2.tsv
fuse.py accepts both .vidjil and AIRR files.
However, the following points should be taken into account:
- The Vidjil web application uses the duplicate_count value of each clone in a .tsv file as the size of that clone. This was discussed on the AIRR mailing list, but other software may use other fields. Note that the AIRR output of vidjil-algo uses the same convention.
- Some RepSeq software (such as IgBlast) does not cluster clones at all and only analyzes each read independently. As fuse.py does not add clustering information, the output of such software is also shown unclustered in the Vidjil web application.
- More generally, RepSeq software tools use various definitions of clones (see What is a clone?). When processed with fuse.py, clones from several samples are identified as the same clone when they share the same clone_id value. When merging data from different samples, one must ensure that the software outputs consistent clone_id values for these very same clones; otherwise they appear as unrelated in the web application (but they can still be clustered there). Such mismatches are also frequent when merging files produced by different software.
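Before merging AIRR files, it can help to check how the points above apply to a given dataset. The following Python sketch is not part of fuse.py: the function clone_sizes and the two inline TSV samples are hypothetical. It assumes the files carry the AIRR fields clone_id and duplicate_count, sums duplicate_count per clone_id, and lists which clone identifiers are shared between two samples.

```python
import csv
import io


def clone_sizes(tsv_text):
    """Sum duplicate_count per clone_id in an AIRR-style TSV string.

    Rows with an empty or missing clone_id are counted separately as
    unclustered reads. Treating a missing duplicate_count as 0 is an
    assumption of this sketch.
    """
    sizes = {}
    unclustered = 0
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        count = int(row.get("duplicate_count") or 0)
        clone = row.get("clone_id")
        if not clone:
            unclustered += count
        else:
            sizes[clone] = sizes.get(clone, 0) + count
    return sizes, unclustered


# Two tiny, made-up AIRR-style samples.
SAMPLE1 = (
    "sequence_id\tclone_id\tduplicate_count\n"
    "r1\tc1\t120\n"
    "r2\tc2\t30\n"
)
SAMPLE2 = (
    "sequence_id\tclone_id\tduplicate_count\n"
    "r7\tc1\t4\n"
    "r8\tc9\t55\n"
    "r9\t\t2\n"
)

sizes1, unclustered1 = clone_sizes(SAMPLE1)
sizes2, unclustered2 = clone_sizes(SAMPLE2)

# Clone identifiers present in both samples can be followed across them.
shared = sorted(set(sizes1) & set(sizes2))
# Identifiers seen in only one sample would appear as unrelated clones.
only_one = sorted(set(sizes1) ^ set(sizes2))

print("shared:", shared)
print("only in one sample:", only_one)
print("unclustered reads in sample 2:", unclustered2)
```

If the shared list is empty although the samples are expected to contain common clones, the clone_id values were probably assigned independently for each sample or by different software.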