
Fast Statistical Alignment

Author

Listed:
  • Robert K Bradley
  • Adam Roberts
  • Michael Smoot
  • Sudeep Juvekar
  • Jaeyoung Do
  • Colin Dewey
  • Ian Holmes
  • Lior Pachter

Abstract

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches), yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.

Author Summary

Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
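
As a rough illustration of how posterior probabilities can be combined greedily into an alignment, the toy sketch below handles only two sequences. It is not FSA's implementation: the function name `greedy_anneal`, the `posterior` dictionary, its values, and the 0.5 `threshold` are all hypothetical choices made for this example. Candidate pairs are visited from most to least probable, and a pair is accepted only if it keeps the alignment order-preserving.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (position in sequence X, position in sequence Y)


def greedy_anneal(posterior: Dict[Pair, float], threshold: float = 0.5) -> List[Pair]:
    """Toy two-sequence sketch (an assumption, not FSA's code).

    Visit candidate pairs by decreasing posterior probability and accept
    a pair only if it is consistent with every pair accepted so far.
    """
    accepted: List[Pair] = []
    for (i, j), p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        if p <= threshold:
            break  # remaining pairs are too uncertain; leave them unaligned
        # Consistent means: no position is reused and no two pairs cross.
        if all((i < a and j < b) or (i > a and j > b) for a, b in accepted):
            accepted.append((i, j))
    return sorted(accepted)


# Hypothetical posterior match probabilities for two short sequences.
posterior = {
    (0, 0): 0.90,
    (1, 1): 0.80,
    (1, 2): 0.60,  # conflicts with (1, 1): position 1 of X is already used
    (2, 1): 0.55,  # conflicts with (1, 1): position 1 of Y is already used
    (2, 2): 0.40,
    (3, 3): 0.30,
}
result = greedy_anneal(posterior)
print(result)
```

With these made-up values, only the two high-confidence pairs are aligned and the remaining positions stay unaligned, which mirrors the abstract's point about leaving unrelated sequence largely unaligned rather than forcing a match.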

Suggested Citation

  • Robert K Bradley & Adam Roberts & Michael Smoot & Sudeep Juvekar & Jaeyoung Do & Colin Dewey & Ian Holmes & Lior Pachter, 2009. "Fast Statistical Alignment," PLOS Computational Biology, Public Library of Science, vol. 5(5), pages 1-15, May.
  • Handle: RePEc:plo:pcbi00:1000392
    DOI: 10.1371/journal.pcbi.1000392

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000392
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1000392&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1000392?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Elena Rivas & Sean R Eddy, 2008. "Probabilistic Phylogenetic Inference with Insertions and Deletions," PLOS Computational Biology, Public Library of Science, vol. 4(9), pages 1-21, September.
    2. Manolis Kellis & Nick Patterson & Matthew Endrizzi & Bruce Birren & Eric S. Lander, 2003. "Sequencing and comparison of yeast species to identify genes and regulatory elements," Nature, Nature, vol. 423(6937), pages 241-254, May.
    3. Saurabh Sinha & Xin He, 2007. "MORPH: Probabilistic Alignment Combined with Hidden Markov Models of cis-Regulatory Modules," PLOS Computational Biology, Public Library of Science, vol. 3(11), pages 1-15, November.
    4. Michael Worobey & Marlea Gemmel & Dirk E. Teuwen & Tamara Haselkorn & Kevin Kunstman & Michael Bunce & Jean-Jacques Muyembe & Jean-Marie M. Kabongo & Raphaël M. Kalengayi & Eric Van Marck & M. Thomas , 2008. "Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960," Nature, Nature, vol. 455(7213), pages 661-664, October.

    Citations

    Cited by:

    1. Stephen F Altschul & John C Wootton & Elena Zaslavsky & Yi-Kuo Yu, 2010. "The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment," PLOS Computational Biology, Public Library of Science, vol. 6(7), pages 1-17, July.
    2. Michiaki Hamada & Hisanori Kiryu & Wataru Iwasaki & Kiyoshi Asai, 2011. "Generalized Centroid Estimators in Bioinformatics," PLOS ONE, Public Library of Science, vol. 6(2), pages 1-20, February.
    3. Erick Moreno-Centeno & Richard M. Karp, 2013. "The Implicit Hitting Set Approach to Solve Combinatorial Optimization Problems with an Application to Multigenome Alignment," Operations Research, INFORMS, vol. 61(2), pages 453-468, April.
    4. Lewis Stevens & Isaac Martínez-Ugalde & Erna King & Martin Wagah & Dominic Absolon & Rowan Bancroft & Pablo Gonzalez de la Rosa & Jessica L. Hall & Manuela Kieninger & Agnieszka Kloch & Sarah Pelan & , 2023. "Ancient diversity in host-parasite interaction genes in a model parasitic nematode," Nature Communications, Nature, vol. 14(1), pages 1-19, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tao Song & Hong Gu, 2014. "Discriminative Motif Discovery via Simulated Evolution and Random Under-Sampling," PLOS ONE, Public Library of Science, vol. 9(2), pages 1-10, February.
    2. Qi Dai & Lihua Li & Xiaoqing Liu & Yuhua Yao & Fukun Zhao & Michael Zhang, 2011. "Integrating Overlapping Structures and Background Information of Words Significantly Improves Biological Sequence Comparison," PLOS ONE, Public Library of Science, vol. 6(11), pages 1-10, November.
    3. Alexander Kawrykow & Gary Roumanis & Alfred Kam & Daniel Kwak & Clarence Leung & Chu Wu & Eleyine Zarour & Phylo players & Luis Sarmenta & Mathieu Blanchette & Jérôme Waldispühl, 2012. "Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment," PLOS ONE, Public Library of Science, vol. 7(3), pages 1-9, March.
    4. Alessandro L. V. Coradini & Christopher Ne Ville & Zachary A. Krieger & Joshua Roemer & Cara Hull & Shawn Yang & Daniel T. Lusk & Ian M. Ehrenreich, 2023. "Building synthetic chromosomes from natural DNA," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    5. Anthony Mveyange & Christian Skovsgaard & Tine Lesner, 2015. "Does HIV/AIDS matter for economic growth in sub-Saharan Africa?," WIDER Working Paper Series wp-2015-086, World Institute for Development Economic Research (UNU-WIDER).
    6. Matthew Gandy, 2022. "THE ZOONOTIC CITY: Urban Political Ecology and the Pandemic Imaginary," International Journal of Urban and Regional Research, Wiley Blackwell, vol. 46(2), pages 202-219, March.
    7. Valerie Storms & Marleen Claeys & Aminael Sanchez & Bart De Moor & Annemieke Verstuyf & Kathleen Marchal, 2010. "The Effect of Orthology and Coregulation on Detecting Regulatory Motifs," PLOS ONE, Public Library of Science, vol. 5(2), pages 1-11, February.
    8. Rahul Siddharthan & Eric D Siggia & Erik van Nimwegen, 2005. "PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny," PLOS Computational Biology, Public Library of Science, vol. 1(7), pages 1-23, December.
    9. Harri Lähdesmäki & Alistair G Rust & Ilya Shmulevich, 2008. "Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources," PLOS ONE, Public Library of Science, vol. 3(3), pages 1-24, March.
    10. Leelavati Narlikar & Raluca Gordân & Alexander J Hartemink, 2007. "A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast," PLOS Computational Biology, Public Library of Science, vol. 3(11), pages 1-10, November.
    11. J Roman Arguello & Carolina Sellanes & Yann Ru Lou & Robert A Raguso, 2013. "Can Yeast (S. cerevisiae) Metabolic Volatiles Provide Polymorphic Signaling?," PLOS ONE, Public Library of Science, vol. 8(8), pages 1-12, August.
    12. Marcella M. Alsan & David M. Cutler, 2010. "Why did HIV decline in Uganda?," NBER Working Papers 16171, National Bureau of Economic Research, Inc.
    13. Fabio Pardi & Nick Goldman, 2005. "Species Choice for Comparative Genomics: Being Greedy Works," PLOS Genetics, Public Library of Science, vol. 1(6), pages 1-1, December.
    14. Krishna B. S. Swamy & Hsin-Yi Lee & Carmina Ladra & Chien-Fu Jeff Liu & Jung-Chi Chao & Yi-Yun Chen & Jun-Yi Leu, 2022. "Proteotoxicity caused by perturbed protein complexes underlies hybrid incompatibility in yeast," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    15. Isiaq Oseni & Ibrahim Odusanya & Sakiru Akinbode, 2022. "Effectiveness of Foreign Aid for Health in Reducing HIV Prevalence in Sub-Saharan Africa," South-Eastern Europe Journal of Economics, Association of Economic Universities of South and Eastern Europe and the Black Sea Region, vol. 20(2), pages 141-158.
    16. Anthony Mveyange & Christian Skovsgaard & Tine Lesner, 2015. "Does HIV/AIDS matter for economic growth in sub-Saharan Africa?," WIDER Working Paper Series 086, World Institute for Development Economic Research (UNU-WIDER).
    17. Aridaman Pandit & Somdatta Sinha, 2011. "Differential Trends in the Codon Usage Patterns in HIV-1 Genes," PLOS ONE, Public Library of Science, vol. 6(12), pages 1-10, December.
    18. Rebecca Katz & Sangeeta Mookherji & Morgan Kaminski & Vibhuti Haté & Julie E. Fischer, 2012. "Urban Governance of Disease," Administrative Sciences, MDPI, vol. 2(2), pages 1-13, April.
    19. Eilon Sharon & Shai Lubliner & Eran Segal, 2008. "A Feature-Based Approach to Modeling Protein–DNA Interactions," PLOS Computational Biology, Public Library of Science, vol. 4(8), pages 1-17, August.
    20. Siewert Elizabeth A & Kechris Katerina J, 2009. "Prediction of Motifs Based on a Repeated-Measures Model for Integrating Cross-Species Sequence and Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 8(1), pages 1-34, September.
