ProSplign is a global alignment tool developed by Dr. Boris Kiryutin. It produces accurate spliced alignments and can align distantly related proteins with low similarity. Extra effort is taken to locate frameshift positions. The ProSplign algorithm is an integral component of the NCBI Eukaryotic Genome Annotation Pipeline, which has been used to annotate important genomes from many plant and animal species (such as human, mouse, and cow). The Pipeline was used by the Sea Urchin Genome Sequencing center for sequence analysis of the 814-megabase genome of the sea urchin Strongylocentrotus purpuratus, published in Science in 2006. Integrating ProSplign into the genome annotation pipeline significantly improved the quality of genome annotation over previously available methods. Following this success, the method was used to annotate Tribolium castaneum (Nature, 2008), taurine cattle (Science, 2009), Acyrthosiphon pisum (PLoS Biology, 2010), Nasonia (Science, 2010), and many other genomes.
ProSplign is also a central part of the automatic pipeline for influenza virus genomes, an important part of the Influenza Genome Sequencing Project. Sponsored by the National Institutes of Health, the Influenza Project is an international collaboration of critical importance for public health. It has already led to multiple new discoveries about the recent evolution and pathogenesis of influenza, which have been published in leading journals including Journal of Virology, PLoS Biology, and Nature.
ProSplign is a utility for computing the alignment of proteins to genomic nucleotide sequence. This alignment can include eukaryotic splicing. At the heart of the program is a global alignment algorithm that specifically accounts for introns and splice signals. This algorithm is what makes ProSplign accurate in determining splice sites and tolerant of sequencing errors. ProSplign uses BLAST hits first to identify possible locations of genes and their duplications on genomic sequences, and then to speed up the core dynamic programming.
This web site is a single-point source of information on ProSplign, the tool for computing protein-to-genomic alignments that account for mRNA splicing. ProSplign was developed with the following goals in mind:
ProSplign is used in the NCBI Eukaryotic Genome Annotation Pipeline to compute spliced protein alignments, and in the NCBI Prokaryotic Genome Annotation Pipeline to find frameshifted genes and to locate frameshift positions on the genome. ProSplign is available for use in a number of different ways. There is no online version of ProSplign. You must download and install the console version, which is available for Linux (versions for a few other platforms may be available on request). You can also link to the ProSplign library from your own applications in a portable way, since ProSplign is a part of the NCBI C++ Toolkit. Finally, ProSplign is available as a plugin for NCBI Genome Workbench.
Reference: ProSplign - Protein to Genomic Alignment Tool. B. Kiryutin, A. Souvorov, T. Tatusova. Manuscript in preparation.
Binaries (updated 02/23/15), Sources, Graphical view
Using the console version
Algorithmic details
ProSplign works with input sequences on a pairwise basis. In other words, exon/intron structures are determined independently for each query/subject pair. The dynamic programming alone is accurate in determining splice junctions but computationally expensive. Also, if several copies of a gene lie on the same genomic sequence and strand, applying it directly may produce incorrect results by connecting exons from different copies. Thus, for every input query/subject pair, it is important to localize genes on the genomic sequence. ProSplign achieves this with an algorithm that groups the BLAST hits into compartments.
The compartmentization step starts with computing protein-to-genomic BLAST hits. These give initial insight into the structure of compartments. Hits are separated into two same-strand sets, and compartments are then identified within each strand. To do so, we formally define the optimization problem in terms of genomic sequence coverage and then solve it with a dynamic programming algorithm whose running time is short compared to the core dynamic programming described above.
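The compartmentization idea described above can be illustrated with a small Python sketch. This is not ProSplign's actual algorithm or scoring, which this page does not specify. It is a simplified stand-in that splits hits by strand and then repeatedly extracts the collinear chain of hits covering the most genomic bases, using a standard chaining dynamic program. The `Hit` record, the function names, and the `max_gap` limit on the genomic distance between consecutive hits are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical, simplified protein-to-genomic BLAST hit.

    Genomic coordinates are assumed to be given in the orientation of
    the hit's strand, so they increase along the protein.
    """
    strand: str   # '+' or '-'
    g_start: int  # genomic start (inclusive)
    g_end: int    # genomic end (inclusive)
    p_start: int  # protein start (inclusive)
    p_end: int    # protein end (inclusive)


def split_by_strand(hits: List[Hit]) -> Dict[str, List[Hit]]:
    """Separate hits into two same-strand sets."""
    by_strand: Dict[str, List[Hit]] = {'+': [], '-': []}
    for h in hits:
        by_strand[h.strand].append(h)
    return by_strand


def best_chain(hits: List[Hit], max_gap: int) -> List[Hit]:
    """Chaining DP: the collinear chain covering the most genomic bases."""
    hits = sorted(hits, key=lambda h: (h.g_start, h.p_start))
    n = len(hits)
    if n == 0:
        return []
    score = [h.g_end - h.g_start + 1 for h in hits]  # chain score ending at i
    prev = [-1] * n                                  # back-pointers
    for i in range(n):
        own = hits[i].g_end - hits[i].g_start + 1
        for j in range(i):
            gap = hits[i].g_start - hits[j].g_end
            # j may precede i only if both genomic and protein order agree
            # and the genomic gap is not too large (assumed limit).
            if 0 < gap <= max_gap and hits[j].p_end < hits[i].p_start:
                if score[j] + own > score[i]:
                    score[i] = score[j] + own
                    prev[i] = j
    i = max(range(n), key=lambda k: score[k])
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = prev[i]
    chain.reverse()
    return chain


def compartments(hits: List[Hit], max_gap: int = 5000) -> List[List[Hit]]:
    """Peel off best chains per strand until every hit is assigned."""
    result = []
    for group in split_by_strand(hits).values():
        remaining = list(group)
        while remaining:
            chain = best_chain(remaining, max_gap)
            result.append(chain)
            chosen = set(chain)
            remaining = [h for h in remaining if h not in chosen]
    return result
```

In this sketch, two copies of a gene on the same strand end up in separate compartments because a hit cannot follow another hit that already covers the same protein region, and because hits separated by more than `max_gap` bases are never chained. Each resulting compartment would then be passed on its own to the more expensive spliced alignment step.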
Frequently Asked Questions
Q: Why am I getting "Unable to locate XXX" exceptions?
Q: What does the 'No compartment found' log file message mean? What is a compartment?