Vidjil – Web Application Manual

Vidjil is an open-source platform for the analysis of high-throughput sequencing data from lymphocytes. V(D)J recombinations in lymphocytes are essential for immunological diversity. They are also useful markers of pathologies, and in leukemia, are used to quantify the minimal residual disease during patient follow-up. High-throughput sequencing (NGS/HTS) now enables the deep sequencing of a lymphoid population with dedicated Rep-Seq methods and software.

This is the help for the Vidjil web application. For further help, write to contact@vidjil.org. We can also arrange a phone or Skype meeting.

The Vidjil team (Mathieu, Mikaël, Aurélien, Florian, Marc, Ryan and Tatiana)

1 Requirements

1.1 Web application

The Vidjil web application runs in any modern browser. It has been successfully tested on the following platforms:

  • Firefox version >= 32
  • Chrome version >= 38
  • IE version >= 10.0 (Vidjil will not run on IE 9.0 or below)
  • Opera version >= XX
  • Safari version >= 6.0

1.2 The .vidjil files

The Vidjil web application displays .vidjil files that summarize the V(D)J recombinations and the sequences found in a run.

The easiest way to get these files is to request an account on the public Vidjil test server. You will then be able to upload, manage, and process your runs (.fasta, .fastq, .gz or .clntab files) directly in the web application (see 4); the server behind the patient/experiment database computes the .vidjil files. Otherwise, .vidjil files can be obtained:

  • from the command-line version of Vidjil (starting from .fasta, .fastq or .gz files, see algo.org). To gather several .vidjil files, use the fuse.py script,
  • or from any other V(D)J analysis pipeline able to output files respecting the .vidjil file format (contact us if you are interested).

2 First aid

  • Open data:
    • either with “patients”/“open patient” if you are connected to a patient/experiment database, such as on http://app.vidjil.org/ (in this case, there are always some "Demo" datasets for demonstration purposes),
    • or with “file”/“import/export”, manually selecting a .vidjil file.
  • You can change the number of displayed clones by moving the slider “number of clones” (menu “filter”). The maximal number of clones that can be displayed depends on the preceding processing step. See below "Can I see all the clones ?" (6).
  • Clones can be selected by clicking on them either in the list, on the time graph, or the grid (simple selection or rectangle selection).
  • There are often very similar clones, coming from either somatic hypermutations or sequencing errors. You can select such clones (for example those sharing the same V and the same J), then:
    • inspect the sequences in the lower panel (possibly using the “align” function),
    • remove some of these sequences from the selection (by clicking on their name in the lower panel),
    • cluster them (button “cluster”) into a single clone. Once several clones are clustered, you can still visualize them by clicking on “+” in the list of clones.
  • Your analysis (clone tagging, renaming, clustering) can be saved:
    • either with “patients”/“save analysis” if you are connected to a patient/experiment database
    • or with “file”/“export .analysis”

You are advised to go through the tutorial available from http://www.vidjil.org/doc to learn the essential features of Vidjil.

3 The elements of the Vidjil web application

3.1 The info panel (upper left panel)

patient/run information
the user can enter notes in this box to record information about the patient or the run.
locus
germline used for analyzing the data. In case of multi-locus data, you can select what locus should be displayed (see locus.org)
analysis
name (without extension) of the loaded file used for displaying the data
sample
name of the current sample point. You can also change the current point by clicking directly on its name in the graph panel (when available).

date
date of the run of the current sample point (it can be edited in the database, on the patient/run tab). You can change the point viewed by clicking on the ← and → buttons. A cycling view is available with the fix button.
analyzed reads
number of reads where the underlying RepSeq algorithm found a V(D)J recombination, for that sample point. See 7.1 below.
total
total number of reads for that sample point

3.2 The list of clones (left panel)

  • You can assign other tags with colors to clones using the “★” button. The “filter” menu lets you further filter clones by tags.
  • Under the “★” button it is possible to normalize clone concentrations according to this clone. You must specify the expected concentration in the “expected size” field (e.g. 0.01 for 1%). See 7.2 below.
  • The “i” button displays additional information on each clone.
  • The list can be sorted by V gene, J gene or clone abundance. The “+” and “-” buttons respectively un-cluster or re-cluster all clones that have already been clustered.
  • Clones can be searched (“search” box) by either their name, their custom name, or their DNA sequence.
  • The concentration of some clones may not be displayed. Instead, a + symbol or a - symbol is shown. The + symbol means that the clone has been detected (positive) but in few reads (typically fewer than five). The - symbol means that the clone has not been detected (negative) in the sample but has been detected at another time point that is not currently displayed.
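
The +/- display rule can be sketched as follows. This is only an illustration, not the code of the web application: the function name `size_label` and its parameters are hypothetical, and the threshold of five reads is taken from the "typically" wording above.

```python
def size_label(reads_in_sample, analyzed_reads, low_read_threshold=5):
    """Return the text shown next to a clone for the current sample.

    Hypothetical helper: names and the default threshold are assumptions.
    """
    if reads_in_sample == 0:
        # Not detected here; the clone is listed because it was
        # detected at another time point.
        return "-"
    if reads_in_sample < low_read_threshold:
        # Detected, but in too few reads to give a reliable concentration.
        return "+"
    return f"{100 * reads_in_sample / analyzed_reads:.2f}%"
```

For example, a clone with 3 reads is shown as "+", whereas a clone with 100 reads out of 1000 analyzed reads is shown with its ratio.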

3.3 The time graph

The time graph is hidden when there is only one time point. It shows the X most frequent clones of the sample (this number can be altered with the “filter” menu).

  • The current point is highlighted with a vertical gray bar. You can change it by clicking on another point or by using ← and →.
  • The gray areas at the bottom of the graph show, for each point, the resolution (1 read / 5 reads).
  • You can reorder the points by dragging them, and hide some points by dragging them on the “+” mark at the right of the points. If you want to recover some hidden points, you need to drag them from the “+” mark to the graph.
  • If your dataset contains sampling dates (for example in an MRD setup), you can switch between point keys and dates in “settings > point key”.

3.4 The plot view

The grid view shows the clones of a selected germline. All the germlines used are listed on the right of the grid. You can change the germline by clicking on it or by using the associated shortcuts (see 8 below).

  • The "plot" menu allows you to change the plot type (grid plot, bar plot) as well as the X and Y axes of these plots. Some presets are available.
  • In the bar plot mode, the Y axis corresponds to the order of clones inside each bar.
  • The “focus” button (bottom right) allows you to further analyze a selection of clones. To exit the focus mode, click on the “X” near the search box.

To further analyze a set of clones sharing the same V and J, it is often useful to focus on these clones, then to display them according to either their “clone length” or their “N length” (that is, N1-D-N2 in the case of VDJ recombinations).

3.5 The sequence view (bottom panel)

The sequence view displays nucleotide sequences from selected clones.

  • See "What is the sequence displayed for each clone ?" (5.2) below
  • Sequences can be aligned together (“align” button), identifying substitutions, insertions and deletions.
  • You can remove sequences from the aligner (and the selection) by clicking on their name
  • You can further analyze the selected sequences with IMGT/V-QUEST and IgBlast. This opens another window/tab.
  • You can unselect all sequences by clicking on the background of the grid.

4 The patient/experiment database and the server

If a server with a patient/experiment database is configured with your installation of Vidjil (as on http://app.vidjil.org/), the 'patient' menu gives you access to the server.

With authentication, you can add patients or runs, then add "samples" (.fasta, .fastq, .gz or .clntab files), possibly pre-processed, then process your data and save the results of your analysis.

4.1 Patients

Once you are authenticated, this page shows the patient list. Here you can see your own patients and the patients to which you have been granted access.

New patients can be added ('add patient'), edited ('e') or deleted ('X'). By default, you are the only one who can see and update a new patient. If you have admin access, you can grant access to other users ('p').

4.2 Runs

Runs can be manipulated the same way as patients. New runs can be added ('add run'), edited ('e') or deleted ('X'). Runs and patients are both used to build sets of samples that share the same patient or that have been sequenced in the same run. A sample can be included in both a patient sample set and a run sample set.

4.3 Samples and pre-processes

Clicking on a patient or on a run gives access to the "samples" page. Each sample is a .fasta, .fastq, .gz or .clntab file that will be processed by one or several pipelines with one or several configurations that set software options.

Depending on your granted access, you can add a new sample to the list (+ sample), download sample files when they are available (dl) or delete sequence files (X). Note that sample files may be deleted (in particular to save server disk space), which is not the case for the results (unless the user wants so).

You can see which samples have been processed with the selected config, and access the results (See results, bottom right).

  1. Adding a sample

    To add a sample (+ sample), you must add at least one sample file. Each sample file must be linked to a patient or to a run. One of those fields will be automatically completed depending on whether you accessed the sample page from a patient or from a run. Both fields provide autocompletion to help you enter the correct patient or the correct run. It is advised to fill in both fields (when it makes sense). However, please note that the corresponding patients and runs must have been created beforehand.

  2. Pre-processing

    The sample files may be pre-processed by selecting a pre-process scenario when adding a sample. At the moment, the only pre-process available on the public http://app.vidjil.org/ server is paired-end read merging.

    1. Read merging

      People using Illumina sequencers may sequence paired-end R1/R2 fragments. It is highly recommended to merge those reads in order to have a read that consists of the whole DNA fragment instead of split fragments. To merge R1/R2 fragments, select an adapted pre-process scenario and provide both R1/R2 files at once when adding a sample.

      There are two scenarios to merge reads, because when merging is not possible for some reads, one of the fragments (either R1 or R2) must be kept. We cannot keep both because it would bias the quantification (there would be two unmerged reads instead of one). Depending on the sequencing strategy, it may be better to keep R1 or R2 in such a case, so the choice is left to the user: keep the fragment that most probably contains both a part of the V gene and a part of the J gene.
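
The reason for keeping exactly one fragment per pair can be illustrated with a short sketch. It is not the merging tool used by the server: `toy_merge` is a hypothetical stand-in that only accepts an exact overlap, and it assumes that R2 has already been reverse-complemented. The point is that the output always contains one read per sequenced fragment, whichever scenario (`keep="R1"` or `keep="R2"`) is chosen.

```python
def toy_merge(r1, r2, min_overlap=4):
    """Merge two reads on an exact suffix/prefix overlap, or return None.

    Toy stand-in for a real paired-end merger (assumption: R2 is already
    reverse-complemented and the overlap contains no sequencing error).
    """
    for k in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-k:] == r2[:k]:
            return r1 + r2[k:]
    return None


def merge_or_keep(pairs, keep="R1"):
    """Return one read per (R1, R2) pair: the merged read when merging
    succeeds, otherwise the single fragment selected by `keep`."""
    output = []
    for r1, r2 in pairs:
        merged = toy_merge(r1, r2)
        if merged is not None:
            output.append(merged)
        else:
            # Keeping both R1 and R2 here would count this fragment twice.
            output.append(r1 if keep == "R1" else r2)
    return output
```

With two input pairs, the result always has two reads, so clone quantification is not biased by unmerged pairs.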

4.4 Processing samples, configs

Depending on your granted access, you can schedule a processing for a sequence file (select a config and run). The processing can take a few seconds to a few hours, depending on the software launched, the options set in the config, the size of the sample and the server load.

The base human configurations with the Vidjil built-in algorithm are « TRG », « IGH », « multi » (-g germline), « multi+inc » (-g germline -i), « multi+inc+xxx » (-g germline -i -2, default advised configuration). See https://github.com/vidjil/vidjil/blob/master/doc/locus.org for information on these configurations. There are also configurations for other species and for other RepSeq algorithms, such as « MiXCR ». The server maintainers can add new configurations tailored to specific needs; contact us if you have other needs.

The « reload » button (bottom left) updates the status of the task, which should go through QUEUED → ASSIGNED → RUNNING → COMPLETED. It is possible to launch several processes at the same time (some will wait in the QUEUED / ASSIGNED states), and also to launch processes while you are uploading data. Finally, you can safely close the window with the patient/experiment database (and even your web browser) when some processes are queued/launched. The only thing you should not do is to completely close your web browser while sequences are uploading.

4.5 Groups

Each patient and run is assigned to at least one group. This determines which groups have access to a patient or run. Users are assigned to different groups and therefore gain access to any patients and runs that these groups have access to.

There are also groups that may be clustered together. Usually this represents an organisation, such as a Hospital. The organisation has a group to which subgroups are associated. This allows users with different sets of permissions to gain access to files uploaded to the organisation's group automatically.

Users may be part of several groups. By default, users are assigned a personal group to which they can upload files, and they are then the only ones with access to these files. Different groups imply different sets of permissions. A user may not have the same permissions on a file accessed from an organisation's group as (s)he does on files from her/his personal group, or even from a group associated with another organisation.

The different permissions that can be attributed are:

  • Read: Permissions to view patients/runs to which a group or organisation has access
  • Create: Permissions to create patients/runs
  • Upload: Permissions to upload samples to the patients/runs of a group
  • Run: Permissions to run Vidjil on samples uploaded to the patients/runs of a group
  • View Details: Permissions to view patient/run data in an unencrypted manner for the patients/runs of a group
  • Save: Permissions to save an analysis for the patients/runs of a group
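
The interplay between groups, organisations and permissions can be pictured with the following sketch. It is a simplified model written for this manual, not the server's access-control code: the function `can`, the dictionary layout and the group names are all hypothetical.

```python
def can(action, user_groups, patient_groups, permissions, parent):
    """Return True if one of the user's groups grants `action` on a patient.

    Hypothetical model:
      permissions: {group: set of actions granted to members of that group}
      parent:      {subgroup: organisation group it is associated with}
    A group reaches the patients assigned to itself or to its organisation.
    """
    for group in user_groups:
        reachable = {group, parent.get(group)}
        if reachable & patient_groups and action in permissions.get(group, set()):
            return True
    return False


# Illustrative setup: one hospital with two subgroups having different rights.
permissions = {
    "hospital_readers": {"read"},
    "hospital_technicians": {"read", "upload", "run", "save"},
}
parent = {
    "hospital_readers": "hospital",
    "hospital_technicians": "hospital",
}
patient_groups = {"hospital"}
```

In this setup, a member of `hospital_readers` can read a patient of the hospital but cannot upload samples to it, whereas a member of `hospital_technicians` can do both.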

5 How do you define clones, their sequences, and their V(D)J designation?

The Vidjil web application allows you to run several RepSeq algorithms. Each RepSeq algorithm (selected by « config », see above) has its own definition of what a clone is (or, more precisely, a clonotype), how to output its sequence and how to assign a V(D)J designation. Knowing how clones are defined is important to be aware of the potential biases that could affect your analysis.

5.1 How do you define a clone? How are clones gathered?

In the built-in Vidjil algorithm, sequences are gathered into the same clone as long as they share the same 50bp DNA sequence around the CDR3 sequence. In a first step, the algorithm uses a quick heuristic which detects approximately where the CDR3 lies and extracts a 50bp nucleotide sequence centered on that region. This region is called a window in Vidjil's algorithm. When two sequences share the same window, they belong to the same clone. Therefore in Vidjil clones are only defined based on the exact match of a long DNA sequence. This explains why some small clones can be seen around larger clones: they may be due to sequencing errors that lead to different windows. However, those small differences can also be due to a real biological process inside the cells. Therefore we let the user choose whether the clones should be manually clustered or not.
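
The window-based definition can be sketched as follows. This is a minimal illustration and not the actual Vidjil implementation: it assumes that a hypothetical earlier step has already estimated, for each read, the position of the center of the CDR3, and the names `window` and `gather_clones` are ours.

```python
from collections import defaultdict

WINDOW_SIZE = 50  # default window length mentioned in the text


def window(read, cdr3_center, size=WINDOW_SIZE):
    """Extract the window centered on the estimated CDR3 position.

    Returns None when the read is too short to contain the full window.
    """
    start = cdr3_center - size // 2
    end = start + size
    if start < 0 or end > len(read):
        return None
    return read[start:end]


def gather_clones(reads_with_centers):
    """Group reads into clones: same window (exact match) means same clone."""
    clones = defaultdict(list)
    for read, center in reads_with_centers:
        w = window(read, center)
        if w is not None:
            clones[w].append(read)
    return dict(clones)
```

Because the match is exact, a single substitution inside the window puts a read in a different (usually much smaller) clone, which is the situation where manual clustering may be considered.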

In MiXCR, clones are defined based on the amino-acid CDR3 sequence, on the V gene used and on the hypermutations.

5.2 What is the sequence displayed for each clone ?

The sequences displayed for each clone are not individual reads. The clones may gather thousands of reads, and all these reads can have some differences. Depending on the sequencing technology, the reads inside a clone can have different lengths or can be shifted, especially in the case of overlapping paired-end sequencing. There can also be some sequencing errors. The .vidjil file has to give one consensus sequence per clone, and Rep-Seq algorithms have to handle these differences with great care in order not to gather reads from different clones.

For the built-in Vidjil algorithm, it is required that the window centered on the CDR3 is exactly shared by all the reads. The other positions in the consensus sequence are guaranteed to be present in at least half of the reads. The consensus sequence can thus be shorter than some reads.
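
To see why the consensus can be shorter than some reads, consider the following sketch of the "at least half of the reads" rule. It only computes how far the consensus can extend on each side of the window and is not the algorithm used by Vidjil. It assumes a non-empty clone in which each read is given with the (hypothetical) start position of its window.

```python
import math


def consensus_extent(reads, window_size=50, min_fraction=0.5):
    """reads: non-empty list of (sequence, window_start) pairs of one clone.

    Returns (left, right): the number of bases before and after the window
    that are covered by at least `min_fraction` of the reads.
    """
    needed = math.ceil(len(reads) * min_fraction)  # reads that must cover a position
    lefts = sorted((start for _, start in reads), reverse=True)
    rights = sorted(
        (len(seq) - start - window_size for seq, start in reads), reverse=True
    )
    # The `needed`-th longest flank is covered by at least `needed` reads.
    return lefts[needed - 1], rights[needed - 1]


def consensus_length(reads, window_size=50):
    left, right = consensus_extent(reads, window_size)
    return left + window_size + right
```

With three reads whose left flanks are 10, 20 and 30 bases and whose right flanks are 5, 15 and 25 bases, the consensus keeps 20 bases on the left and 15 on the right: it is 85 bases long, whereas the longest read has 105 bases.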

5.3 How are the V(D)J designations computed?

In the built-in Vidjil algorithm, V(D)J designations are computed after the clone clustering by dynamic programming, finding the most similar V (or 5') and J (or 3') gene, then trying to match a D gene. Note that the algorithm also detects some VDDJ or VDDDJ recombinations that may happen in the TRD locus. Some incomplete or unusual rearrangements (Dh/Jh, Dd2/Dd3, KDE-Intron, mixed TRA-TRD recombinations) are also detected.

Once clones are selected, you can send their sequence to IMGT/V-QUEST and IgBlast by clicking on the links just above the sequence view (bottom left). This opens another window/tab.

6 Can I see all the clones ?

The interest of NGS/RepSeq studies is to provide a deep view of any V(D)J repertoire. The underlying analysis software (such as Vidjil) tries to analyze as many reads as possible (see 7.1 below). One often wants to "see all clones", but a complete list is difficult to see in itself. In a typical dataset with about 10^6 reads, even in the presence of a dominant clone, there can be 10^4 or 10^5 different clones detected.

6.1 The "top" slider in the "filter" menu

The "top 50" clones are the clones that are among the first 50 in at least one sample. As soon as one clone is in this "top 50" list, it is displayed for every sample, even if its concentration is very low in other samples. Most of the time, a "top 50" is enough. The hidden clones are thus the ones that never reach the first 50. With a default installation, the slider can be set to display clones up to the "top 100" on the grid (and, on the graph, up to the "top 20").
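
The "top" rule can be written as a small sketch. The function name `displayed_clones` and the input layout (a dictionary giving, for each clone, its number of reads in each sample) are assumptions made for this illustration.

```python
def displayed_clones(reads_per_clone, top=50):
    """Return the clones ranked among the first `top` in at least one sample.

    reads_per_clone: {clone_id: [reads in sample 0, reads in sample 1, ...]}
    """
    if not reads_per_clone:
        return set()
    n_samples = len(next(iter(reads_per_clone.values())))
    shown = set()
    for s in range(n_samples):
        ranked = sorted(
            reads_per_clone, key=lambda c: reads_per_clone[c][s], reverse=True
        )
        shown.update(ranked[:top])
    return shown
```

A clone that is dominant in only one sample is therefore displayed for every sample, while a clone that never enters the first `top` ranks stays hidden.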

However, in some cases, one may want to track some clones that are never in the "top 100", as for example:

  • a standard/spike with low concentration
  • a clone in the MRD follow-up of a patient without the diagnostic point

(Upcoming feature). If a clone is "tagged" in the .vidjil or in the .analysis file, it will always be shown even if it does not meet the "top" filter.

6.2 The "smaller clones"

There is a virtual clone per locus in the clone list which groups all clones that are hidden (because of the "top" or because of hiding some tags). The sum of ratios in the list of clones is always 100%: thus the "smaller clones" changes when one uses the "filter" menu.

Note that the ratios include the "smaller clones": if a clone is reported to have 10.54%, this 10.54% ratio relates to the number of analyzed reads, including the hidden clones.
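
As a sketch of this bookkeeping for a single sample and a single locus (the function name and the input layout are assumptions made for this illustration):

```python
def clone_ratios(reads_per_clone, shown, analyzed_reads):
    """Ratios of the displayed clones plus the virtual "smaller clones" entry.

    reads_per_clone: {clone_id: number of reads in the sample}
    shown:           ids of the clones currently displayed
    analyzed_reads:  number of analyzed reads in the sample
    """
    ratios = {c: reads_per_clone[c] / analyzed_reads for c in shown}
    hidden_reads = analyzed_reads - sum(reads_per_clone[c] for c in shown)
    ratios["smaller clones"] = hidden_reads / analyzed_reads
    return ratios
```

With 10000 analyzed reads, a clone with 1054 reads is reported at 10.54% whichever clones are hidden, and the ratios in the list, including the "smaller clones" entry, always sum to 100%.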

7 How can I assess the quality of the data and the analysis ?

To make sure that the PCR, the sequencing and the RepSeq analysis went well, several elements can be checked.

7.1 Number of analyzed reads

A first control is to check the number of “analyzed reads” in the info panel (top left box). This shows the number of reads where the underlying RepSeq algorithm found some V(D)J recombination in the selected sample.

With DNA-Seq sequencing with specific V(D)J primers, ratios above 90% usually mean very good results. Smaller ratios, especially under 60%, often mean that something went wrong. On the other hand, capture with many probes or RNA-Seq strategies usually lead to datasets with less than 0.1% V(D)J recombinations.
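
For datasets obtained with specific V(D)J primers, these rules of thumb can be expressed as a small check. The function names and the returned labels are hypothetical, and the thresholds are only the indicative values given above; they do not apply to capture or RNA-Seq data.

```python
def analyzed_ratio(analyzed_reads, total_reads):
    """Fraction of reads in which a V(D)J recombination was found."""
    return analyzed_reads / total_reads


def assess_primer_dataset(ratio):
    """Indicative reading of the ratio for DNA-Seq with specific V(D)J primers."""
    if ratio > 0.90:
        return "very good"
    if ratio < 0.60:
        return "something may have gone wrong"
    return "intermediate"
```

For example, 950 analyzed reads out of 1000 give a ratio of 0.95, which falls in the "very good" range.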

The “info” button further details the causes of non-analysis (for the built-in Vidjil algorithm, UNSEG, see detail on algo.org). Several causes can lead to bad ratios:

7.1.1 Analysis or biological causes

  • The data actually contains another germline/locus than the one that was searched for (solution: relaunch the processing, or ask that we relaunch it, with the correct germline sequences). See locus.org for information on the analyzable human loci with the built-in Vidjil algorithm, and contact us if you would like to analyze data from species that are not currently available.
  • There are incomplete/exceptional recombinations (Vidjil can process some of them, config multi+inc, see locus.org for details)
  • There are too many somatic hypermutations (usually Vidjil can process sequences with up to a 10% mutation rate; above that threshold, some sequences may be lost).
  • There are chimeric sequences or translocations (Vidjil does not process all of these sequences).

7.1.2 PCR or sequencing causes

  • The read length is too short and the reads do not span the junction zone (UNSEG too few V/J or UNSEG only V/J). The built-in Vidjil algorithm detects a “window” including the CDR3. By default this window is 50bp long, so the read needs to be at least that long, centered on the junction.
  • In particular, for paired-end sequencing, one of the ends can lead to reads not fully containing the CDR3 region. Solutions are to merge the ends with very conservative parameters (see “Read merging” in 4.3 above), to ignore this end, or to extend the read length.
  • There were too many PCR or sequencing errors (this can be assessed by inspecting the related clones, checking if there is a large dispersion around the main clone).

7.2 Control with standard/spike

  • If your sample included a standard/spike control, you should first identify the main standard sequence (if that is not already done) and specify its expected concentration (by clicking on the “★” button). Then the data is normalized according to that sequence.
  • You can (de)activate normalization in the settings menu.
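
As an illustration of what such a normalization does, here is a sketch that assumes a simple linear rescaling: every concentration is multiplied by the ratio between the expected and the observed concentration of the standard. Both this formula and the function name are assumptions made for this example; they are not taken from the Vidjil source code.

```python
def normalize(sizes, standard_id, expected_size):
    """Rescale clone concentrations so that the standard gets its expected size.

    sizes:         {clone_id: observed concentration, e.g. 0.02 for 2%}
    standard_id:   id of the clone chosen as standard/spike
    expected_size: expected concentration of the standard (e.g. 0.01 for 1%)
    Assumption: a single linear factor is applied to all clones.
    """
    factor = expected_size / sizes[standard_id]
    return {clone: size * factor for clone, size in sizes.items()}
```

If the standard is observed at 2% while 1% is expected, all concentrations are halved: a clone observed at 10% is reported at 5% after normalization.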

7.3 Steadiness verification

  • When assessing different PCR primers, PCR enzymes, PCR cycles, one may want to see how regular the concentrations are among the samples.
  • When following a patient one may want to identify any clone that is emerging.
  • To do so, you may want to change the color system: in the “color by” menu, select “abundance”. The color ranges from red (high concentration) to purple (low concentration), which makes it easy to spot on the graph any large change in concentration.

7.4 Clone coverage

In the built-in Vidjil algorithm, the clone coverage is the ratio of the length of the clone consensus sequence to the median read length in the clone. A consensus sequence is displayed for each clone (see "What is the sequence displayed for each clone ?", 5.2). Its length should be representative of the read lengths among that clone. A clone can consist of thousands of reads of various lengths. We expect the consensus sequence length to be close to the median read length of the clone. The clone coverage measures this: a clone coverage between 0.85 and 1 is quite frequent. On the contrary, a value of 0.5 means that the consensus sequence is only half as long as the median read length in the clone.
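
The definition translates directly into a few lines of code. The function names are ours; the 0.5 threshold is the one used in this section to flag a low coverage.

```python
from statistics import median


def clone_coverage(consensus, reads):
    """Length of the consensus sequence divided by the median read length."""
    return len(consensus) / median(len(read) for read in reads)


def has_low_coverage(consensus, reads, threshold=0.5):
    """True when the clone coverage is below the warning threshold."""
    return clone_coverage(consensus, reads) < threshold
```

For a clone whose reads are 100, 100 and 120 bases long, the median read length is 100: a 90-base consensus gives a coverage of 0.9, whereas a 45-base consensus gives 0.45 and would be flagged.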

A bad clone coverage (< 0.5) occurs when reads share the same window (this is how Vidjil defines a clone) but have frequent discrepancies outside of the window. Such cases have been observed with chimeric reads which share the same V(D)J recombinations in their first half and have totally different and unknown sequences in their second half.

In the web application, the clones with a low clone coverage (< 0.5) are displayed in the list with an orange I on the right. You can also visualize the clones according to their clone coverage by selecting for example “clone coverage/GC content” in the preset menu of the “plot” box.

8 Keyboard shortcuts

← and →                navigate between samples
Shift-← and Shift-→    decrease or increase the number of displayed clones
numeric keypad, 0-9    switch between available plot presets
a: TRA                 change the selected germline/locus
b: TRB
g: TRG
d: TRD, TRD+
h: IGH, IGH+
l: IGL
k: IGK, IGK+
x: xxx

Note: You can select just one locus by holding the Shift key while pressing the letter corresponding to the locus of interest.

Ctrl-s save the analysis (when connected to a database)
Shift-p open the 'patient' window (when connected to a database)

9 References

If you use Vidjil for your research, please cite the following references:

Marc Duez et al., “Vidjil: A web platform for analysis of high-throughput repertoire sequencing”, PLOS ONE 2016, 11(11):e0166126 http://dx.doi.org/10.1371/journal.pone.0166126

Mathieu Giraud, Mikaël Salson, et al., “Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing”, BMC Genomics 2014, 15:409 http://dx.doi.org/10.1186/1471-2164-15-409

Author: The Vidjil team (Mathieu, Mikaël, Aurélien, Florian, Marc, Ryan and Tatiana)

Created: 2017-04-21 Fri 09:20

