Post-sequencer workflow before upload to a Vidjil server
This help is intended for bioinformaticians preparing workflows after their sequencer output. See also considerations on libraries and recombinations.
It is recommended to upload .fastq.gz files to the Vidjil server.
Indeed, vidjil-algo takes into account the quality information in the output of the representative sequence.
When the base quality is not available, it is also possible to upload
Note that vidjil-algo (and the Vidjil server) also accept uncompressed
.bam files (but the added information of
.bam files is not taken into account,
so uploading such files is not optimal).
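Before uploading, it can be worth checking that a compressed file is intact and counting its reads. The following is a minimal sketch, assuming a standard FASTQ layout with four lines per read; `check_fastq_gz` is a hypothetical helper name, not part of Vidjil.

```shell
# Hypothetical helper: verify a gzip-compressed FASTQ file and print its read count.
# Assumes the usual FASTQ layout with 4 lines per read.
check_fastq_gz() {
    file="$1"
    # gzip -t tests the integrity of the compressed stream without writing any output
    gzip -t "$file" || { echo "corrupted: $file" >&2; return 1; }
    # Each read spans 4 lines: header, sequence, '+' separator, qualities
    gzip -dc "$file" | awk 'END { print NR / 4 }'
}
```

It could be called as `check_fastq_gz sample.fastq.gz`, where `sample.fastq.gz` stands for any file you intend to upload.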
Pre-filtering of large datasets
On large capture or RNA-seq datasets, very few reads are expected to have V(D)J recombinations, typically as few as 0.01%, 0.001%, or even 0.0001%. Vidjil-algo was designed to efficiently find such a few needles in a stack of needles.
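To give an idea of the orders of magnitude, the fractions above can be converted into read counts. This is only an arithmetic sketch: the total of 100 million reads is a hypothetical dataset size, and `expected_vdj_reads` is an illustrative helper name.

```shell
# Hypothetical helper: expected number of reads with a V(D)J recombination,
# given a total read count ($1) and the recombined fraction as a percentage ($2).
expected_vdj_reads() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.0f\n", n * p / 100 }'
}

# Example with a hypothetical dataset of 100 million reads
for percent in 0.01 0.001 0.0001; do
    echo "$percent% -> $(expected_vdj_reads 100000000 "$percent") reads"
done
```

With 100 million reads, these fractions correspond to roughly 10,000, 1,000 and 100 reads of interest.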
Large files may be hard to upload and to store. To save bandwidth and disk space, it is thus advised to locally pre-process reads to merge them (when applicable) and to filter them, with a first iteration of Vidjil-algo, before uploading to a Vidjil server. This filtering will produce much smaller files that could also be used by other software.
We offer two versions:
- The latest stable version, vidjil-algo-latest, which is in production for clinical applications.
- The alpha version, vidjil-algo-alpha, which provides at least a 5× speed-up on multi-locus filtering. Sensitivity should be equivalent to or even better than that of the stable version. Work is underway to release this version for production.
- Requirements (more documentation): on a recent Ubuntu system,
sudo apt-get install zlib1g-dev
- Download and extract http://www.vidjil.org/releases/vidjil-algo-latest.tar.gz or http://www.vidjil.org/releases/vidjil-algo-alpha.tar.gz
- In the vidjil-algo-... directory, build it with make (it both compiles vidjil-algo and fetches the germline/ directory, with germline gene repertoires created from IMGT and NCBI)
- Download and extract https://github.com/dstreett/FLASH2/archive/master.zip
- In the FLASH2-master directory, build it with
You may copy
flash2 binaries to folders available from your
flash2 outputs several files: merged reads, unmerged reads from the R1 file, unmerged reads from the R2 file, and a histogram.
You can concatenate merged reads and one of the unmerged files to keep the same number of reads as in the initial fastq file (as the pre-processing on the Vidjil server does).
The following command line thus keeps
out.notCombined_1, from R1,
supposing that R1 reads are "more centered" on the V(D)J junction than R2 reads.
flash2 only works with
flash2 R1.fastq R2.fastq -M 300 -t 4 -z (-t 4: run on 4 threads)
- Concatenate the files you want to keep, for example
cat out.extendedFrags.fastq.gz out.notCombined_1.fastq.gz > merged-reads.fastq.gz
vidjil-algo --filter-reads --gz -g germline/homo-sapiens.g merged-reads.fastq.gz (
germline/ is the path to the gene repertoires directory, in
The merged-reads.detected.vdj.fa.gz file can be uploaded to any Vidjil server, or re-analyzed with vidjil-algo or with other software.
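The steps above can be gathered into a few shell functions. This is a sketch only: it reuses the flash2 and vidjil-algo command lines quoted in the text, assumes both binaries are reachable from the current directory or the search path, and the function names are hypothetical.

```shell
# Sketch of the local pre-processing steps, reusing the command lines quoted above.
# Function names are hypothetical and not part of flash2 or vidjil-algo.

# Merge paired-end reads with flash2 (-t 4: 4 threads, -z: gzip-compressed output).
merge_pairs() {
    flash2 "$1" "$2" -M 300 -t 4 -z
}

# Concatenate gzip-compressed files into $1; several gzip streams written
# one after another still form a valid gzip file.
concat_reads() {
    out="$1"; shift
    cat "$@" > "$out"
}

# Keep only the reads in which a V(D)J recombination is detected.
filter_reads() {
    vidjil-algo --filter-reads --gz -g germline/homo-sapiens.g "$1"
}
```

A run could then chain `merge_pairs R1.fastq R2.fastq`, `concat_reads merged-reads.fastq.gz out.extendedFrags.fastq.gz out.notCombined_1.fastq.gz` and `filter_reads merged-reads.fastq.gz`. Note that all files given to `concat_reads` should be compressed in the same way, otherwise the result is not a valid gzip file.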
Once the filtering has begun, interrupting
SIGINT) gracefully stops execution while still producing (partial) files.
It can be used to check how the filtering works before a full run.
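One way to try this without watching the terminal is to let a timer send the interrupt. The sketch below assumes GNU coreutils `timeout` is installed; `run_for` is a hypothetical helper name and the duration is arbitrary.

```shell
# Hypothetical helper: run a command for at most $1 seconds, then send it SIGINT.
# Relies on GNU coreutils `timeout`, which exits with status 124 when the
# time limit was reached.
run_for() {
    seconds="$1"; shift
    timeout --signal=INT "$seconds" "$@"
}
```

For instance, `run_for 60 vidjil-algo --filter-reads --gz -g germline/homo-sapiens.g merged-reads.fastq.gz` would interrupt the filtering after one minute, leaving partial files to inspect.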
Read trimming, sequencing indexes, primers
It is essential to upload reads to Vidjil without any sequencing index. When they are not removed, these indexes may impact the analysis, yielding spurious hypermutations or wrong V(D)J designations. The sequencer workflows usually provide tools to output files without these indexes.
You can further trim reads based on quality. Some reads are of poor quality overall, or only at their extremities. Trimming these reads may remove some artifactual mutations.
See also Working with primers.