Note
These aggregated notes form part of the developer documentation on vidjil-algo.
They are a work in progress and are not as polished as the user documentation.
Developers should also have a look at the documentation for bioinformaticians and server administrators,
at the issues, at the commit messages, and at the source code.
Development notes -- Vidjil-algo
Code organization
The algorithm roughly follows these steps:
- The germlines are read. Germlines are in the FASTA format and are read by the Fasta class (`core/fasta.h`). Germlines are built using the Germline (or MultiGermline) class (`core/germline.h`).
- The input sequence file (.fasta, .fastq, .gz) is read by an OnlineFasta (`core/fasta.h`). Unlike the Fasta class, it does not store all the data in memory: the file is read online, storing only the current entry.
- Windows must be extracted from the reads, which is done by the WindowExtractor class (`core/windowExtractor.h`). This class has an `extract` method which returns a WindowsStorage object (`core/windows.h`) in which windows are stored.
- To save space, not all the reads linked to a given window are stored. Only the longer ones are kept. The BinReadStorage class is used for that purpose (`core/read_storage.h`).
- In the WindowsStorage, we now have the information on the clusters and on the abundance of each cluster. However, we lack a sequence representative of the cluster. For that purpose the class provides a `getRepresentativeComputer` method that provides a KmerRepresentativeComputer (`core/representative.h`). This class can compute a representative sequence using the (long) reads that were stored for a given window.
- The representative can then be segmented to determine what V, D and J genes are at play. This is done by the FineSegmenter (`core/segment.h`).
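The streaming read and the "keep only the longer reads per window" steps above can be illustrated with a small toy sketch. Everything in it (`ToyOnlineFasta`, `ToyWindowStore`, `process_reads`, and the caller-supplied `extract_window` function) is hypothetical and written only for this illustration. It does not reproduce the OnlineFasta, WindowExtractor, WindowsStorage or BinReadStorage classes, whose actual interfaces should be checked in the source code.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical streaming FASTA reader: only the current entry is kept in memory.
// Illustration only, this is NOT the OnlineFasta class of core/fasta.h.
class ToyOnlineFasta {
public:
    explicit ToyOnlineFasta(std::istream& in) : in_(in) {}

    // Loads the next entry into label/sequence; returns false at end of input.
    bool next() {
        std::string line;
        if (pending_header_.empty()) {
            // Skip anything before the first header line.
            while (std::getline(in_, line)) {
                if (!line.empty() && line[0] == '>') { pending_header_ = line; break; }
            }
            if (pending_header_.empty()) return false;
        }
        label = pending_header_.substr(1);
        pending_header_.clear();
        sequence.clear();
        // Concatenate sequence lines until the next header (kept for the next call).
        while (std::getline(in_, line)) {
            if (!line.empty() && line[0] == '>') { pending_header_ = line; break; }
            sequence += line;
        }
        return true;
    }

    std::string label;
    std::string sequence;

private:
    std::istream& in_;
    std::string pending_header_;
};

// Hypothetical window store: counts every read of a window (the abundance),
// but keeps at most max_reads_per_window reads, the longest ones.
struct ToyWindowStore {
    std::size_t max_reads_per_window;
    std::map<std::string, std::size_t> counts;
    std::map<std::string, std::vector<std::string>> kept_reads;

    explicit ToyWindowStore(std::size_t max_reads) : max_reads_per_window(max_reads) {}

    void add(const std::string& window, const std::string& read) {
        ++counts[window];
        std::vector<std::string>& reads = kept_reads[window];
        reads.push_back(read);
        // Longest reads first, then drop the surplus.
        std::stable_sort(reads.begin(), reads.end(),
                         [](const std::string& a, const std::string& b) {
                             return a.size() > b.size();
                         });
        if (reads.size() > max_reads_per_window) reads.resize(max_reads_per_window);
    }
};

// Streams the reads and stores them by window. How a window is found is left
// to the caller: extract_window returns an empty string when no window exists.
void process_reads(std::istream& in, ToyWindowStore& store,
                   const std::function<std::string(const std::string&)>& extract_window) {
    ToyOnlineFasta reader(in);
    while (reader.next()) {
        const std::string window = extract_window(reader.sequence);
        if (!window.empty()) store.add(window, reader.sequence);
    }
}
```

In this sketch the abundance of a cluster stays exact (`counts`) even though most reads are discarded, which is the trade-off described in the list above.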
The xxx germline
- All germlines are inserted in one index using `build_with_one_index()` and the segmentation method is set to `SEG_METHOD_MAX12` to tell that the segmentation must somehow differ.
- So that the FineSegmenter correctly segments the sequence, the `rep_5` and `rep_3` members (class `Fasta`) of the xxx germline are modified by the FineSegmenter. The `override_rep5_rep3_from_labels()` method from the Germline is the one that overwrites those members with the Fasta corresponding to the affectation found by the KmerSegmenter.
Tests
Unit
Unit tests are managed using an internal lightweight poorly-designed library that outputs a TAP file. They are organized in the directory algo/tests.
All the tests are defined in the `tests.cpp` file. But, for the sake of
clarity, this file includes other `cpp` files that incorporate all the
tests. A call to `make` compiles and launches the `tests.cpp` file, which
outputs a TAP file (in case of total success) and creates a `tests.cpp.tap`
file (in every case).
- Tap test library
The library is defined in the `testing.h` file.
Tests must be declared in the `tests.h` file:
1. Define a new macro (in the enum) corresponding to the test name.
2. In `declare_tests()`, use `RECORD_TAP_TEST` to associate the macro with a
description (that will be displayed in the TAP output file).
Then testing can be done using the `TAP_TEST` macro. The macro takes three
arguments. The first one is a boolean that is supposed to be true, the
second is the test name (using the macro defined in `tests.h`) and the
third one (which can be an empty string) is something which is displayed
when the test fails.
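To make the declare-then-check workflow more concrete, here is a minimal self-contained sketch of a TAP recorder. It is not the content of `testing.h`: the `ToyTap` structure, its `record`, `check` and `report` methods, and the test identifiers are all hypothetical. Only the argument order of `check` (condition, test identifier, failure message) is modelled on the description of `TAP_TEST` above.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical test identifiers, in the spirit of the enum declared in tests.h.
enum ToyTestId { TEST_ADDITION, TEST_STRING_LENGTH };

// Minimal stand-in for a TAP recorder (assumption: NOT the real testing.h).
struct ToyTap {
    std::map<int, std::string> descriptions;  // test id -> description
    std::map<int, bool> passed;               // test id -> no check failed so far
    std::ostringstream diagnostics;           // messages of the failed checks

    // Plays the role described for RECORD_TAP_TEST: attach a description to an id.
    void record(int id, const std::string& description) {
        descriptions[id] = description;
        passed[id] = true;
    }

    // Plays the role described for TAP_TEST: condition, test id, failure message.
    void check(bool ok, int id, const std::string& message) {
        if (!ok) {
            passed[id] = false;
            diagnostics << "# " << descriptions[id] << ": " << message << "\n";
        }
    }

    // Writes the plan line, one result line per test, then the diagnostics.
    void report(std::ostream& out) const {
        out << "1.." << descriptions.size() << "\n";
        int n = 0;
        for (const auto& entry : descriptions) {
            ++n;
            out << (passed.at(entry.first) ? "ok " : "not ok ")
                << n << " - " << entry.second << "\n";
        }
        out << diagnostics.str();
    }
};

// Declares two tests, runs one passing and one deliberately failing check,
// and returns the resulting TAP text.
std::string run_toy_tests() {
    ToyTap tap;
    tap.record(TEST_ADDITION, "addition");
    tap.record(TEST_STRING_LENGTH, "string length");

    tap.check(std::string("ACGT").size() == 4, TEST_STRING_LENGTH, "");
    tap.check(1 + 1 == 3, TEST_ADDITION, "1 + 1 is not 3");

    std::ostringstream out;
    tap.report(out);
    return out.str();
}
```

In this sketch a test is reported as `not ok` as soon as one of its checks fails, and the third argument only appears in the output for failed checks.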