Here are aggregated notes forming part of the developer documentation on vidjil-algo.
These notes are a work in progress and are not as polished as the user documentation.
Developers should also have a look at the documentation for bioinformaticians and server administrators, at the issues, at the commit messages, and at the source code.
Development notes -- Vidjil-algo
The algorithm roughly follows these steps:
- The germlines are read. Germlines are in the FASTA format and are read by the Fasta class (core/fasta.h). Germlines are built using the Germline (or MultiGermline) class (
- The input sequence file (.fasta, .fastq, .gz) is read by an OnlineFasta (core/fasta.h). Unlike the Fasta class, it does not store all the data in memory: the file is read online, keeping only the current entry.
- Windows must be extracted from the reads, which is done by the WindowExtractor class (core/windowExtractor.h). This class has an extract method which returns a WindowsStorage object (core/windows.h) in which windows are stored.
- To save space, not all the reads linked to a given window are stored. Only the longer ones are kept. The BinReadStorage class is used for that purpose (
- In the WindowsStorage, we now have the information on the clusters and on the abundance of each cluster. However, we lack a sequence representative of each cluster. For that purpose, the class provides a getRepresentativeComputer method that returns a KmerRepresentativeComputer (core/representative.h). This class can compute a representative sequence using the (long) reads that were stored for a given window.
- The representative can then be segmented to determine which V, D and J genes are involved. This is done by the FineSegmenter (
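The first steps above (online reading, window extraction, bounded storage of the longer reads) can be illustrated with a small self-contained sketch. Everything in it is a hypothetical stand-in and not the vidjil-algo code: StreamingFasta only mimics the idea behind OnlineFasta, ToyWindowStore only mimics the idea of keeping the longest reads per window, and extract_window uses a deliberately naive rule (the w characters starting at a fixed anchor motif) in place of the real window extraction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are vidjil-algo classes.
struct Entry {
    std::string label;
    std::string sequence;
};

// Reads one FASTA entry at a time, so only the current entry is held in memory.
class StreamingFasta {
public:
    explicit StreamingFasta(std::istream &in) : in_(in) {}

    // Fills `entry` with the next record; returns false at end of input.
    bool next(Entry &entry) {
        std::string line;
        // Skip forward to the next header unless one is already buffered.
        while (pending_header_.empty() && std::getline(in_, line)) {
            if (!line.empty() && line[0] == '>') pending_header_ = line;
        }
        if (pending_header_.empty()) return false;
        entry.label = pending_header_.substr(1);
        entry.sequence.clear();
        pending_header_.clear();
        // Concatenate sequence lines until the next header or end of input.
        while (std::getline(in_, line)) {
            if (!line.empty() && line[0] == '>') { pending_header_ = line; break; }
            entry.sequence += line;
        }
        return true;
    }

private:
    std::istream &in_;
    std::string pending_header_;
};

// Per-window bookkeeping: total read count plus a bounded list of reads.
struct WindowInfo {
    std::size_t count = 0;
    std::vector<std::string> kept;  // at most max_kept reads, longest first
};

class ToyWindowStore {
public:
    explicit ToyWindowStore(std::size_t max_kept) : max_kept_(max_kept) {}

    // Counts the read for its window and keeps only the longest reads.
    void add(const std::string &window, const std::string &read) {
        WindowInfo &info = windows_[window];
        ++info.count;
        info.kept.push_back(read);
        std::stable_sort(info.kept.begin(), info.kept.end(),
                         [](const std::string &a, const std::string &b) {
                             return a.size() > b.size();
                         });
        if (info.kept.size() > max_kept_) info.kept.resize(max_kept_);
    }

    const std::map<std::string, WindowInfo> &windows() const { return windows_; }

private:
    std::size_t max_kept_;
    std::map<std::string, WindowInfo> windows_;
};

// Naive rule (an assumption of this sketch): the window is the w characters
// starting at the first occurrence of `anchor`; empty string if unavailable.
std::string extract_window(const std::string &read, const std::string &anchor,
                           std::size_t w) {
    std::size_t pos = read.find(anchor);
    if (pos == std::string::npos || pos + w > read.size()) return "";
    return read.substr(pos, w);
}

// Streams the input once, clustering reads by their window.
ToyWindowStore process(std::istream &in, const std::string &anchor,
                       std::size_t w, std::size_t max_kept) {
    assert(w > 0);  // a zero-length window would be meaningless
    StreamingFasta reader(in);
    ToyWindowStore store(max_kept);
    Entry entry;
    while (reader.next(entry)) {
        std::string window = extract_window(entry.sequence, anchor, w);
        if (!window.empty()) store.add(window, entry.sequence);
    }
    return store;
}

// Convenience overload for FASTA text already held in memory.
ToyWindowStore process_text(const std::string &text, const std::string &anchor,
                            std::size_t w, std::size_t max_kept) {
    std::istringstream in(text);
    return process(in, anchor, w, max_kept);
}
```

The count of each window plays the role of the cluster abundance, while the kept reads are what a representative could later be computed from. Reads without a usable window are simply skipped in this sketch.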
The xxx germline
- All germlines are inserted in one index using build_with_one_index(), and the segmentation method is set to SEG_METHOD_MAX12 to indicate that the segmentation must proceed differently.
- So that the FineSegmenter correctly segments the sequence, the
Fasta) of the xxx germline are modified by the FineSegmenter. The
override_rep5_rep3_from_labels() method from the Germline is the one that overwrites those members with the Fasta corresponding to the affectation found by the KmerSegmenter.
Unit tests are managed using an internal, lightweight, poorly designed library that outputs a TAP file. They are located in the algo/tests directory.
All the tests are defined in the tests.cpp file. But, for the sake of clarity, this file includes other cpp files that incorporate all the tests. A call to make compiles and launches the tests.cpp file, which outputs a TAP file (in case of total success) and creates a file (in every case).
TAP test library
The library is defined in the testing.h file.
Tests must be declared in the tests.h file:
- Define a new macro (in the enum) corresponding to the test name.
- Use RECORD_TAP_TEST to associate the macro with a description (that will be displayed in the TAP output file).
Then testing can be done using the TAP_TEST macro. The macro takes three arguments. The first one is a boolean that is supposed to be true, the second is the test name (using the macro defined in tests.h) and the third one (which can be an empty string) is something which is displayed when the test fails.
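To make the declare-then-test workflow concrete, here is a minimal toy harness in the same spirit. It is not the contents of testing.h: the ToyTap structure, the TOY_ macro names, the test names in the enum, and the argument order of the recording macro are all assumptions made for this sketch. Only the three-argument shape of the test macro (condition, test name, failure message) follows the description above.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical test names; a real project would list its own in the enum.
enum ToyTestId { TEST_REVCOMP, TEST_FASTA_SIZE };

// Toy TAP producer (stand-in, not the vidjil-algo testing library).
struct ToyTap {
    std::map<int, std::string> descriptions;  // test id -> description
    std::ostringstream out;                   // accumulated TAP result lines
    int run = 0;
    int failed = 0;

    // Associates a test id with the description shown in the TAP output.
    void record(int id, const std::string &description) {
        assert(!description.empty());  // an empty description would be useless
        descriptions[id] = description;
    }

    // Emits one TAP result line; the message is shown only on failure.
    void check(bool ok, int id, const std::string &message) {
        ++run;
        if (!ok) ++failed;
        out << (ok ? "ok " : "not ok ") << run << " - " << descriptions[id] << "\n";
        if (!ok && !message.empty()) out << "# " << message << "\n";
    }

    // Full TAP text, with the plan line ("1..N") placed at the end.
    std::string report() const {
        std::ostringstream plan;
        plan << "1.." << run << "\n";
        return out.str() + plan.str();
    }
};

// Single shared instance used by the macros below.
inline ToyTap &toy_tap() {
    static ToyTap instance;
    return instance;
}

// Toy counterparts of the macros described above (names are hypothetical).
#define TOY_RECORD_TAP_TEST(id, description) toy_tap().record((id), (description))
#define TOY_TAP_TEST(condition, id, message) toy_tap().check((condition), (id), (message))
```

With this sketch, a test file would first call TOY_RECORD_TAP_TEST(TEST_REVCOMP, "reverse complement") and then TOY_TAP_TEST(result == expected, TEST_REVCOMP, "unexpected sequence"); calling toy_tap().report() returns the TAP text.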