LAGAN Toolkit User Manual

The LAGAN Toolkit is a set of alignment programs for comparative genomics. Its three main components are a pairwise aligner (LAGAN), a multiple aligner (M-LAGAN), and a glocal aligner (Shuffle-LAGAN). The results can be visualized using the VISTA server, as well as the novel Phylo-VISTA tool.

LAGAN

Input
The input to LAGAN consists of two sequence files, in FASTA format. You can optionally provide a name for each sequence.
We will run RepeatMasker on the sequences for you, but you should select the closest organism. "Simple" masking will mask simple (low-complexity) repeats, but not interspersed elements. Generally this is sufficient, but you may want to premask your sequence yourself if the server does not support masking for one of the sequences you are aligning. It is generally sufficient to mask all but one of the sequences you are aligning (hence, if you are aligning human, mouse, and fugu, you can mask just the human and mouse sequences).
You can also ask the server to reverse-complement the second sequence, if you suspect there is homology on the opposite DNA strand.
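If you would rather reverse-complement a sequence yourself before submitting it, the operation can be sketched in a few lines of Python. This is only an illustration, not the server's code; the function name `reverse_complement` is hypothetical, and the alphabet is limited to the letters "ACTGN" that LAGAN accepts.

```python
# Complement table for the letters LAGAN accepts; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    # Complement every base, then reverse the whole string.
    return seq.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACGTN"))  # NACGTT
```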
Output
The output will be sent to you by e-mail. The alignment can be sent in one of three formats: BLAST-like text format, Multi-FASTA format, or as a binary file. You can also choose to compress the output using gzip or Zip.
Visualization
You can request a visualization of your alignment by the VISTA server. Phylo-VISTA visualization is only available for M-LAGAN, not pairwise LAGAN alignments (see below for details), but you can submit two sequences as an M-LAGAN job to see the result in Phylo-VISTA.

Multi-LAGAN

Input
The input to M-LAGAN consists of several sequence files, in FASTA format. You must provide a name for each sequence, and specify a phylogenetic tree that relates all of them. The tree must be pairwise: it is a parenthetical statement in which each set of parentheses contains exactly two elements, each of which is either a sequence name or another parenthetical statement.
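
The pairwise rule can be checked before you submit a job. The following Python sketch parses a tree string and rejects any set of parentheses that does not hold exactly two elements. It is not part of the toolkit: the function name `parse_pairwise_tree` is hypothetical, and the choice to accept either whitespace or commas between elements is an assumption, since the exact separator is not stated here.

```python
import re


def parse_pairwise_tree(text):
    """Parse a parenthetical tree into nested 2-tuples (hypothetical helper).

    Raises ValueError if any set of parentheses does not contain
    exactly two elements.
    """
    # Assumption: elements are separated by whitespace and/or commas.
    tokens = re.findall(r"[()]|[^\s(),]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of tree")
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok != "(":
            return tok  # a sequence name
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            children.append(parse_node())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        if len(children) != 2:
            raise ValueError(
                f"parentheses must hold exactly two elements, found {len(children)}"
            )
        return (children[0], children[1])

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing text after tree")
    return tree


print(parse_pairwise_tree("((human mouse) fugu)"))
# (('human', 'mouse'), 'fugu')
```

A tree such as `(human mouse fugu)` raises `ValueError`, because its single set of parentheses holds three elements.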

We will run RepeatMasker on the sequences for you, but you should select the closest organism. "Simple" masking will mask simple (low-complexity) repeats, but not interspersed elements. Generally this is sufficient, but you may want to premask your sequence yourself if the server does not support masking for one of the sequences you are aligning. It is generally sufficient to mask all but one of the sequences you are aligning (hence, if you are aligning human, mouse, and fugu, you can mask just the human and mouse sequences).
Output
The output will be sent to you by e-mail. The alignment can be sent either in a linear text format for multiple alignments or Multi-FASTA format. You can also choose to compress the output using gzip or Zip.
Visualization
You can request a visualization of your alignment by the VISTA server. You can also get a Phylo-VISTA visualization of the alignment by following a link included in your e-mail. Phylo-VISTA is a novel interactive tool for multiple alignment visualization that uses the phylogenetic relationship between the sequences to better display the sequence similarities.

Shuffle-LAGAN

Coming soon

VISTA

LAGAN can visualize your alignments through the VISTA server. Note that LAGAN and VISTA are not affiliated: if you request an alignment directly on the VISTA page it will be done using the AVID aligner, not LAGAN.

Annotation file
The annotation file can be specified in the GFF format, or a different, simpler format: The start and the finish for each gene, as well as the name, should be listed on one line. A greater than (>) or less than (<) sign should be placed before this line to indicate whether the gene is transcribed from the plus strand or minus strand, respectively. The numbering should always be according to the plus strand. The exons should then be listed individually with the word "exon", after the start and finish of each exon. UTRs are annotated the same way an exon is, with the word "utr" replacing "exon". For example:
```
< 106481 116661 unknown 
106481 106497 utr 
107983 108069 exon 
109884 110033 exon 
111865 112023 exon 
114352 114562 exon 
116587 116661 utr
> 39424 42368 GDF 9 
39424 39820 exon 
41401 42368 exon
> 77817 81088 hypothetical 
77817 78820 utr 
79538 80107 exon 
80193 80334 exon 
80435 80707 exon 
80829 81088 exon
```
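To check an annotation file in the simpler format before uploading it, a small reader along the following lines may help. It is a sketch rather than the parser VISTA uses: the function name `parse_annotation` and the dictionary layout are hypothetical, and it assumes that everything after the two coordinates on a gene line is the gene name (so a name such as "GDF 9" may contain spaces).

```python
def parse_annotation(text):
    """Read the simple annotation format into a list of gene dictionaries."""
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[0] in "<>":
            # Gene line: strand sign, start, finish, then the gene name.
            fields = line[1:].split()
            if len(fields) < 2:
                raise ValueError(f"bad gene line: {raw!r}")
            genes.append({
                "strand": "+" if line[0] == ">" else "-",
                "start": int(fields[0]),
                "end": int(fields[1]),
                "name": " ".join(fields[2:]),
                "features": [],
            })
        else:
            # Feature line: start, finish, and the word "exon" or "utr".
            fields = line.split()
            if len(fields) != 3 or fields[2].lower() not in ("exon", "utr"):
                raise ValueError(f"bad feature line: {raw!r}")
            if not genes:
                raise ValueError("feature line before any gene line")
            genes[-1]["features"].append(
                (int(fields[0]), int(fields[1]), fields[2].lower())
            )
    return genes


sample = """> 39424 42368 GDF 9
39424 39820 exon
41401 42368 exon
"""
for gene in parse_annotation(sample):
    print(gene["name"], gene["strand"], gene["features"])
```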
Conservation criteria
The window and conservation criteria are used as follows: conserved segments with percent identity X and length Y are defined to be regions in which every contiguous subsegment of length Y is at least X% identical to its paired sequence. These segments are then merged to define the conserved regions. Exons are treated differently: those that are shorter than the length cutoff but still meet the percentage criterion are also included in the list of conserved regions.
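
One way to read this definition is as a sliding window over the alignment columns: every window of length Y that reaches X% identity is kept, and overlapping or touching windows are merged. The Python sketch below follows that reading and is not VISTA's actual implementation. The function name `conserved_regions` is hypothetical. It also assumes that windows are measured in alignment columns, that a column containing a gap ("-") or an "N" counts as a mismatch, and it leaves out the special treatment of short exons.

```python
def conserved_regions(aln_a, aln_b, min_identity, window):
    """Return merged conserved regions as half-open (start, end) column ranges.

    aln_a and aln_b are two aligned rows of equal length; min_identity is
    a percentage (X) and window is the window length in columns (Y).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned rows must have the same length")
    n = len(aln_a)
    if window <= 0 or n < window:
        return []
    # Assumption: gap and N columns never count as identical.
    match = [int(a == b and a not in "-N") for a, b in zip(aln_a, aln_b)]
    count = sum(match[:window])  # matches in the first window
    regions = []
    for i in range(n - window + 1):
        if i > 0:
            # Slide the window one column to the right.
            count += match[i + window - 1] - match[i - 1]
        if count * 100 >= min_identity * window:
            if regions and i <= regions[-1][1]:
                # Overlaps or touches the previous region: merge.
                regions[-1] = (regions[-1][0], i + window)
            else:
                regions.append((i, i + window))
    return regions


print(conserved_regions("AAAATTTTAAAA", "AAAACCCCAAAA", 75, 4))
# [(0, 5), (7, 12)]
```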

Formats

FASTA format
A file in FASTA format starts with a header line, which begins with a ">" sign, followed by the sequence name (one word, no spaces) and any comments. The second and all subsequent lines should contain the letters of the sequence. LAGAN only accepts input sequences with the letters "ACTGN".
If you request an alignment in the multi-FASTA (MFA) format, it will have all the sequences in a single file, with a "-" used within each sequence to indicate where gaps were inserted in the alignment. The ">" symbol indicates the next fasta header line, and hence the start of the next sequence.
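As an illustration of how an MFA alignment can be read back, the Python sketch below collects each sequence under its header name and checks that only the letters "ACTGN" and the gap character "-" appear. The function names `read_mfa` and `check_alignment` are hypothetical. The check that all rows have the same length is an assumption based on the usual property of aligned sequences, not on a statement in this manual.

```python
def read_mfa(text):
    """Return {name: sequence} from multi-FASTA text, in file order."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("header line without a sequence name")
            name = header[0]  # first word is the name, the rest is comment
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def check_alignment(seqs):
    """Raise ValueError on unexpected letters or unequal row lengths."""
    for name, seq in seqs.items():
        bad = set(seq) - set("ACTGN-")
        if bad:
            raise ValueError(f"{name}: unexpected characters {sorted(bad)}")
    # Assumption: every row of an alignment has the same number of columns.
    if len({len(seq) for seq in seqs.values()}) > 1:
        raise ValueError("aligned sequences differ in length")


mfa = """>seq1 first sequence
ACTG--AC
TTGA
>seq2
ACTGGGAC
T-GA
"""
seqs = read_mfa(mfa)
check_alignment(seqs)
print(seqs)
# {'seq1': 'ACTG--ACTTGA', 'seq2': 'ACTGGGACT-GA'}
```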
Shuffle-LAGAN creates alignments in the XMFA (eXtended Multi-FAsta) format. The XMFA format allows one to record multiple local alignments in a single file. It differs from the regular MFA format in two ways:

Text format

This format is perhaps the best one for reading by eye. Here is an example:


              9730             9740      9750      9760      9770
seq1     ACTCCAGACTCCCTGGT-------TCCTCACCTCCCCGCCCCCTCACCACCCCCACCGAG
         ||||||||||  |  ||       ||  |  | || | | ||    |  ||||  || ||
seq2     ACTCCAGACTTACCAGTCTGGGTCTCAACCTCCCCTCCCTCCTCGCCACCCCCACCCCAG
              14480     14490     14500     14510     14520     14530

           9780      9790      9800        9810      9820      9830
seq1     GCGCTCCGAATTTCCTGCCCGACCGAGGC--CCGGCTCGGGCGGGTGGAGGAGGGCTGGC
         ||||||||  ||||||||||||| | |||  |||  | ||||||||||||||||| ||||
seq2     GCGCTCCGGGTTTCCTGCCCGACTGGGGCCTCCGTTTGGGGCGGGTGGAGGAGGGGTGGC
              14540     14550     14560     14570     14580     14590

Binary format
Binary format is meant for computer processing (and is understood by VISTA). Each byte is divided into two halves, one for each sequence. The half-byte (4 bits) is then assigned a value from 0 to 5, standing for "-ACTGN", in that order.
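A decoder for this layout can be sketched in Python as follows. It is an illustration only: the function name `decode_binary_alignment` is hypothetical, and the manual does not say which half of the byte belongs to which sequence, so the sketch assumes that the high four bits hold the first sequence and the low four bits hold the second. Swap the two if your files decode in the wrong order.

```python
ALPHABET = "-ACTGN"  # values 0 to 5, in the order given above


def decode_binary_alignment(data: bytes):
    """Decode a binary pairwise alignment into two gapped strings."""
    first, second = [], []
    for offset, byte in enumerate(data):
        # Assumption: high nibble = first sequence, low nibble = second.
        hi, lo = byte >> 4, byte & 0x0F
        if hi > 5 or lo > 5:
            raise ValueError(f"invalid value in byte {offset}: {byte:#04x}")
        first.append(ALPHABET[hi])
        second.append(ALPHABET[lo])
    return "".join(first), "".join(second)


print(decode_binary_alignment(bytes([0x11, 0x20, 0x34])))
# ('ACT', 'A-G')
```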

For any questions not answered on the site please contact the authors