LAGAN TOOLKIT USER MANUAL
The LAGAN Toolkit is a set of alignment programs for comparative genomics. The three main components are a pairwise aligner (LAGAN), a multiple aligner (M-LAGAN), and a glocal aligner (Shuffle-LAGAN). The results can be visualized using the VISTA server, as well as the novel Phylo-VISTA tool.
We will run RepeatMasker on the sequences for you, but you should select the closest organism. "Simple" masking will mask simple (low-complexity) repeats, but not interspersed elements. This is generally sufficient, but you may want to premask a sequence yourself if the server does not support masking for one of the sequences you are aligning. It is usually enough to mask all but one of the sequences being aligned (for example, if you are aligning human, mouse, and fugu, you can mask just the human and mouse sequences).
You can also ask the server to reverse-complement the second sequence, if you suspect there is homology on the opposite DNA strand.
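If you prefer to reverse-complement a sequence yourself before uploading it, the operation can be sketched in a few lines of Python. This is only an illustration and not the server's own code; the function name is hypothetical, and the sketch assumes an uppercase sequence over the letters "ACTGN".

```python
# Complement table for the letters A, C, G, T and N (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    # Complement every base, then reverse the result.
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ACTGN"))
```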
The annotation file can be specified in the GFF format, or a different, simpler format: The start and the finish for each gene, as well as the name, should be listed on one line. A greater than (>) or less than (<) sign should be placed before this line to indicate whether the gene is transcribed from the plus strand or minus strand, respectively. The numbering should always be according to the plus strand. The exons should then be listed individually with the word "exon", after the start and finish of each exon. UTRs are annotated the same way an exon is, with the word "utr" replacing "exon". For example:
< 106481 116661 unknown
106481 106497 utr
107983 108069 exon
109884 110033 exon
111865 112023 exon
114352 114562 exon
116587 116661 utr

> 39424 42368 GDF 9
39424 39820 exon
41401 42368 exon

> 77817 81088 hypothetical
77817 78820 utr
79538 80107 exon
80193 80334 exon
80435 80707 exon
80829 81088 exon
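To check an annotation file in the simple format before submitting it, a small reader along the following lines may help. It is a sketch, not part of the toolkit: the class and function names are hypothetical, and it assumes one gene or one exon/UTR record per line, with blank lines ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    strand: str   # "+" for a ">" line, "-" for a "<" line
    start: int
    end: int
    name: str
    features: list = field(default_factory=list)  # (start, end, kind) tuples


def parse_annotation(text: str) -> list:
    """Parse the simple annotation format into a list of Gene records."""
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[0] in "<>":
            # Gene line: strand sign, start, finish, then the name
            # (the name may contain spaces, as in "GDF 9").
            strand = "+" if line[0] == ">" else "-"
            parts = line[1:].split()
            if len(parts) < 3:
                raise ValueError(f"incomplete gene line: {raw!r}")
            genes.append(Gene(strand, int(parts[0]), int(parts[1]),
                              " ".join(parts[2:])))
        else:
            # Feature line: start, finish, and the word "exon" or "utr".
            if not genes:
                raise ValueError("feature line appears before any gene line")
            start, end, kind = line.split()
            if kind not in ("exon", "utr"):
                raise ValueError(f"unknown feature type: {kind!r}")
            genes[-1].features.append((int(start), int(end), kind))
    return genes
```

Calling parse_annotation on the example above would give three Gene records, the first on the minus strand and the other two on the plus strand.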
The window and conservation criteria are used as follows: conserved segments with percent identity X and length Y are defined as regions in which every contiguous subsegment of length Y is at least X% identical to its paired sequence. These segments are then merged to define the conserved regions. Exons are treated differently: those that are shorter than the length cutoff but still meet the percentage criterion are also included in the list of conserved regions.
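The window criterion can be illustrated with a short sliding-window sketch. This is one possible reading of the definition, not the VISTA implementation. It makes several assumptions: identity is counted over alignment columns, a column with a gap counts as a mismatch, and windows that pass are merged when they overlap or touch. The function name is hypothetical, and the special handling of short exons is not covered.

```python
def conserved_regions(a: str, b: str, window: int, min_identity: float) -> list:
    """Return merged (start, end) column intervals, end exclusive, built from
    every window of `window` columns that is at least `min_identity` percent
    identical. `a` and `b` are aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    n = len(a)
    if window <= 0 or n < window:
        return []
    # 1 where both sequences carry the same base; gap columns count as 0.
    match = [1 if x == y and x != "-" else 0 for x, y in zip(a, b)]
    count = sum(match[:window])  # matches in the first window
    regions = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one column: add the entering one, drop the leaving one.
            count += match[start + window - 1] - match[start - 1]
        if count * 100 >= min_identity * window:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1][1] = max(regions[-1][1], end)  # extend last region
            else:
                regions.append([start, end])
    return [tuple(r) for r in regions]
```

For instance, with a window of 4 and a threshold of 100, two sequences that agree on their first eight columns and disagree afterwards yield the single region (0, 8).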
A file in FASTA format starts with a header line, which starts with a ">" sign, followed by the sequence name (one word, no spaces) and any comments. The second and all subsequent lines should contain the letters of the sequence. LAGAN only accepts input sequences with the letters "ACTGN".
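A quick way to check a file against these rules before submitting it is sketched below. The function name is hypothetical, and the sketch assumes a single sequence per file and uppercase letters only; whether lowercase letters are accepted is not stated here, so the check is deliberately strict.

```python
ALLOWED = set("ACTGN")


def read_fasta(text: str) -> tuple:
    """Parse a single-sequence FASTA string into (name, comment, sequence)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first line must be a header starting with '>'")
    header = lines[0][1:].split(None, 1)  # name is the first word
    if not header:
        raise ValueError("header line has no sequence name")
    name = header[0]
    comment = header[1] if len(header) > 1 else ""
    sequence = "".join(lines[1:])
    bad = set(sequence) - ALLOWED
    if bad:
        raise ValueError(f"unexpected letters in sequence: {sorted(bad)}")
    return name, comment, sequence
```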
If you request an alignment in the multi-FASTA (MFA) format, it will have all the sequences in a single file, with a "-" used within each sequence to indicate where gaps were inserted in the alignment. The ">" symbol indicates the next FASTA header line, and hence the start of the next sequence.
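Reading an MFA alignment back into memory can be sketched as follows. The function name is hypothetical. The sketch assumes that the first word of each header is the sequence name and that all aligned sequences have the same length once gaps are included, which is what one would expect of an alignment.

```python
def read_mfa(text: str) -> dict:
    """Parse a multi-FASTA alignment into {name: aligned sequence with '-'}."""
    parts_by_name = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line starts the next sequence.
            words = line[1:].split()
            if not words:
                raise ValueError("header line has no sequence name")
            name = words[0]
            if name in parts_by_name:
                raise ValueError(f"duplicate sequence name: {name}")
            parts_by_name[name] = []
        elif name is None:
            raise ValueError("sequence data appears before any header line")
        else:
            parts_by_name[name].append(line)
    aligned = {n: "".join(p) for n, p in parts_by_name.items()}
    if len({len(s) for s in aligned.values()}) > 1:
        raise ValueError("aligned sequences differ in length")
    return aligned
```

The original, ungapped sequence can be recovered from any entry by removing the "-" characters, for example with aligned["human"].replace("-", "").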
Shuffle-LAGAN creates alignments in the XMFA (eXtended
Multi-FAsta) format. The XMFA format allows one to record multiple
local alignments in a single file. It differs from the regular MFA
format in two ways:
This format is perhaps optimal for reading by eye. Here is an example:
           9730             9740      9750      9760      9770
seq1 ACTCCAGACTCCCTGGT-------TCCTCACCTCCCCGCCCCCTCACCACCCCCACCGAG
     ||||||||||  |  ||       || |   | || | | ||    |  ||||  || ||
seq2 ACTCCAGACTTACCAGTCTGGGTCTCAACCTCCCCTCCCTCCTCGCCACCCCCACCCCAG
          14480     14490     14500     14510     14520     14530

        9780      9790      9800        9810      9820      9830
seq1 GCGCTCCGAATTTCCTGCCCGACCGAGGC--CCGGCTCGGGCGGGTGGAGGAGGGCTGGC
     ||||||||  ||||||||||||| | |||  |||  | ||||||||||||||||| ||||
seq2 GCGCTCCGGGTTTCCTGCCCGACTGGGGCCTCCGTTTGGGGCGGGTGGAGGAGGGGTGGC
          14540     14550     14560     14570     14580     14590
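In the example, the "|" characters appear to mark the columns in which both sequences carry the same base. Under that assumption, the middle line can be regenerated from the two aligned rows with a sketch like the one below. The function name is hypothetical.

```python
def match_line(a: str, b: str) -> str:
    """Return a line with '|' under identical columns and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    # A gap never counts as a match, even against another gap.
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


if __name__ == "__main__":
    # First 20 columns of the example alignment.
    top = "ACTCCAGACTCCCTGGT---"
    bottom = "ACTCCAGACTTACCAGTCTG"
    print(top)
    print(match_line(top, bottom))
    print(bottom)
```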
Binary format is meant for computer processing (and is understood by VISTA). Each byte is divided into two halves, one for each sequence. The half-byte (4 bits) is then assigned a value from 0 to 5, standing for "-ACTGN", in that order.
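The half-byte encoding can be sketched as follows. The description above does not say which half of the byte belongs to which sequence, so the sketch assumes that the first sequence occupies the high four bits and the second the low four bits; that ordering and the function names are assumptions, not part of the format description.

```python
ALPHABET = "-ACTGN"  # values 0 to 5, in this order


def pack(a: str, b: str) -> bytes:
    """Pack two aligned sequences into one byte per alignment column."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # ASSUMPTION: first sequence in the high half-byte, second in the low.
    # str.index raises ValueError for letters outside "-ACTGN".
    return bytes((ALPHABET.index(x) << 4) | ALPHABET.index(y)
                 for x, y in zip(a, b))


def unpack(data: bytes) -> tuple:
    """Recover the two aligned sequences from the packed bytes."""
    a = "".join(ALPHABET[byte >> 4] for byte in data)
    b = "".join(ALPHABET[byte & 0x0F] for byte in data)
    return a, b
```

With this layout, the column pair ("C", "N") is stored as the byte 0x25, because C has the value 2 and N the value 5.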