
PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny

Authors

  • Rahul Siddharthan
  • Eric D Siggia
  • Erik van Nimwegen

Abstract

A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
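The sampling idea behind this kind of motif finder can be illustrated with a much simpler, single-species Gibbs site sampler. The sketch below is not the PhyloGibbs implementation: it assumes exactly one site per sequence, a uniform background, no phylogeny, no alignments, and no simulated annealing. The names `column_probs`, `gibbs_motif_sampler`, and `DEMO_SEQS` are hypothetical and chosen only for this illustration. It does show the two ingredients the abstract mentions: sites are resampled in proportion to a probabilistic score, and the fraction of samples in which a position is chosen serves as a rough estimate of its posterior probability.

```python
import random

ALPHABET = "ACGT"

# Toy input (hypothetical): four short sequences sharing the word TGACTCA.
DEMO_SEQS = [
    "ACGTTGACTCAGGCT",
    "TTGACTCACCGATAG",
    "GGCATATGACTCATC",
    "CATGACTCATTGCGA",
]


def column_probs(sites, pseudocount=1.0):
    """Per-column base probabilities (a weight matrix) from equal-length sites."""
    width = len(sites[0])
    probs = []
    for j in range(width):
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        probs.append({b: counts[b] / total for b in ALPHABET})
    return probs


def gibbs_motif_sampler(seqs, width, sweeps=200, seed=0):
    """Simplified Gibbs site sampler: one site per sequence, uniform background.

    Returns the final start positions and, for every sequence, the fraction
    of sweeps in which each start position was sampled.
    """
    rng = random.Random(seed)
    n_pos = [len(s) - width + 1 for s in seqs]
    starts = [rng.randrange(n) for n in n_pos]
    visits = [[0] * n for n in n_pos]

    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Build the weight matrix from the current sites of all OTHER sequences.
            others = [
                seqs[k][starts[k]:starts[k] + width]
                for k in range(len(seqs))
                if k != i
            ]
            wm = column_probs(others)

            # Likelihood ratio (motif vs. uniform background) for each start.
            weights = []
            for p in range(n_pos[i]):
                w = 1.0
                for j in range(width):
                    w *= wm[j][seq[p + j]] / 0.25
                weights.append(w)

            # Sample a new start in proportion to its weight.
            starts[i] = rng.choices(range(n_pos[i]), weights=weights)[0]
            visits[i][starts[i]] += 1

    freqs = [[v / sweeps for v in row] for row in visits]
    return starts, freqs


if __name__ == "__main__":
    final_starts, site_freqs = gibbs_motif_sampler(DEMO_SEQS, width=7)
    for seq, start, row in zip(DEMO_SEQS, final_starts, site_freqs):
        print(seq[start:start + 7], round(max(row), 2))
```

Because the result is a set of sampling frequencies rather than a single best answer, positions that are chosen in most sweeps can be reported with high confidence and the rest with low confidence. The full method described in the abstract additionally handles multiple aligned species, multiple motifs, and an arbitrary number of sites.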
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast is available at http://www.swissregulon.unibas.ch.

Synopsis: Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Until recently, motif finders would typically take one of the following two general approaches. Given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a “phylogenetic footprinting” approach one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.

Suggested Citation

  • Rahul Siddharthan & Eric D Siggia & Erik van Nimwegen, 2005. "PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny," PLOS Computational Biology, Public Library of Science, vol. 1(7), pages 1-23, December.
  • Handle: RePEc:plo:pcbi00:0010067
    DOI: 10.1371/journal.pcbi.0010067

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0010067
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.0010067&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Christopher T. Harbison & D. Benjamin Gordon & Tong Ihn Lee & Nicola J. Rinaldi & Kenzie D. Macisaac & Timothy W. Danford & Nancy M. Hannett & Jean-Bosco Tagne & David B. Reynolds & Jane Yoo & Ezra G., 2004. "Transcriptional regulatory code of a eukaryotic genome," Nature, Nature, vol. 431(7004), pages 99-104, September.
    2. Manolis Kellis & Nick Patterson & Matthew Endrizzi & Bruce Birren & Eric S. Lander, 2003. "Sequencing and comparison of yeast species to identify genes and regulatory elements," Nature, Nature, vol. 423(6937), pages 241-254, May.
    3. Antonis Rokas & Barry L. Williams & Nicole King & Sean B. Carroll, 2003. "Genome-scale approaches to resolving incongruence in molecular phylogenies," Nature, Nature, vol. 425(6960), pages 798-804, October.

    Citations

    Cited by:

    1. Harri Lähdesmäki & Alistair G Rust & Ilya Shmulevich, 2008. "Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources," PLOS ONE, Public Library of Science, vol. 3(3), pages 1-24, March.
    2. Ivan Dotu & Scott I Adamson & Benjamin Coleman & Cyril Fournier & Emma Ricart-Altimiras & Eduardo Eyras & Jeffrey H Chuang, 2018. "SARNAclust: Semi-automatic detection of RNA protein binding motifs from immunoprecipitation data," PLOS Computational Biology, Public Library of Science, vol. 14(3), pages 1-25, March.
    3. Jia Lu & Xiaoyi Cao & Sheng Zhong, 2018. "A likelihood approach to testing hypotheses on the co-evolution of epigenome and genome," PLOS Computational Biology, Public Library of Science, vol. 14(12), pages 1-28, December.
    4. Saeed Omidi & Mihaela Zavolan & Mikhail Pachkov & Jeremie Breda & Severin Berger & Erik van Nimwegen, 2017. "Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors," PLOS Computational Biology, Public Library of Science, vol. 13(7), pages 1-22, July.
    5. Aqil M Azmi & Abdulrakeeb Al-Ssulami, 2014. "Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs," PLOS ONE, Public Library of Science, vol. 9(5), pages 1-9, May.
    6. Timothy E Reddy & Charles DeLisi & Boris E Shakhnovich, 2007. "Binding Site Graphs: A New Graph Theoretical Framework for Prediction of Transcription Factor Binding Sites," PLOS Computational Biology, Public Library of Science, vol. 3(5), pages 1-11, May.
    7. Kenzie D MacIsaac & Ernest Fraenkel, 2006. "Practical Strategies for Discovering Regulatory DNA Sequence Motifs," PLOS Computational Biology, Public Library of Science, vol. 2(4), pages 1-10, April.
    8. Siewert Elizabeth A & Kechris Katerina J, 2009. "Prediction of Motifs Based on a Repeated-Measures Model for Integrating Cross-Species Sequence and Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 8(1), pages 1-34, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Harri Lähdesmäki & Alistair G Rust & Ilya Shmulevich, 2008. "Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources," PLOS ONE, Public Library of Science, vol. 3(3), pages 1-24, March.
    2. Leelavati Narlikar & Raluca Gordân & Alexander J Hartemink, 2007. "A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast," PLOS Computational Biology, Public Library of Science, vol. 3(11), pages 1-10, November.
    3. Eilon Sharon & Shai Lubliner & Eran Segal, 2008. "A Feature-Based Approach to Modeling Protein–DNA Interactions," PLOS Computational Biology, Public Library of Science, vol. 4(8), pages 1-17, August.
    4. Siewert Elizabeth A & Kechris Katerina J, 2009. "Prediction of Motifs Based on a Repeated-Measures Model for Integrating Cross-Species Sequence and Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 8(1), pages 1-34, September.
    5. Kenzie D MacIsaac & Ernest Fraenkel, 2006. "Practical Strategies for Discovering Regulatory DNA Sequence Motifs," PLOS Computational Biology, Public Library of Science, vol. 2(4), pages 1-10, April.
    6. Wang Yuancheng & Degnan James H, 2011. "Performance of Matrix Representation with Parsimony for Inferring Species from Gene Trees," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-39, May.
    7. Tao Song & Hong Gu, 2014. "Discriminative Motif Discovery via Simulated Evolution and Random Under-Sampling," PLOS ONE, Public Library of Science, vol. 9(2), pages 1-10, February.
    8. Zing Tsung-Yeh Tsai & Shin-Han Shiu & Huai-Kuang Tsai, 2015. "Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast," PLOS Computational Biology, Public Library of Science, vol. 11(8), pages 1-22, August.
    9. Gross, Eitan, 2015. "Effect of environmental stress on regulation of gene expression in the yeast," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 430(C), pages 224-235.
    10. Alexander Kawrykow & Gary Roumanis & Alfred Kam & Daniel Kwak & Clarence Leung & Chu Wu & Eleyine Zarour & Phylo players & Luis Sarmenta & Mathieu Blanchette & Jérôme Waldispühl, 2012. "Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment," PLOS ONE, Public Library of Science, vol. 7(3), pages 1-9, March.
    11. Armita Nourmohammad & Michael Lässig, 2011. "Formation of Regulatory Modules by Local Sequence Duplication," PLOS Computational Biology, Public Library of Science, vol. 7(10), pages 1-12, October.
    12. Martín Espariz & Federico A Zuljan & Luis Esteban & Christian Magni, 2016. "Taxonomic Identity Resolution of Highly Phylogenetically Related Strains and Selection of Phylogenetic Markers by Using Genome-Scale Methods: The Bacillus pumilus Group Case," PLOS ONE, Public Library of Science, vol. 11(9), pages 1-17, September.
    13. Alessandro L. V. Coradini & Christopher Ne Ville & Zachary A. Krieger & Joshua Roemer & Cara Hull & Shawn Yang & Daniel T. Lusk & Ian M. Ehrenreich, 2023. "Building synthetic chromosomes from natural DNA," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    14. Wei-Sheng Wu & Fu-Jou Lai, 2016. "Detecting Cooperativity between Transcription Factors Based on Functional Coherence and Similarity of Their Target Gene Sets," PLOS ONE, Public Library of Science, vol. 11(9), pages 1-12, September.
    15. Valerie Storms & Marleen Claeys & Aminael Sanchez & Bart De Moor & Annemieke Verstuyf & Kathleen Marchal, 2010. "The Effect of Orthology and Coregulation on Detecting Regulatory Motifs," PLOS ONE, Public Library of Science, vol. 5(2), pages 1-11, February.
    16. Robert K Bradley & Adam Roberts & Michael Smoot & Sudeep Juvekar & Jaeyoung Do & Colin Dewey & Ian Holmes & Lior Pachter, 2009. "Fast Statistical Alignment," PLOS Computational Biology, Public Library of Science, vol. 5(5), pages 1-15, May.
    17. Jens Keilwagen & Jan Grau & Ivan A Paponov & Stefan Posch & Marc Strickert & Ivo Grosse, 2011. "De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference," PLOS Computational Biology, Public Library of Science, vol. 7(2), pages 1-13, February.
    18. Guo-Cheng Yuan & Jun S Liu, 2008. "Genomic Sequence Is Highly Predictive of Local Nucleosome Depletion," PLOS Computational Biology, Public Library of Science, vol. 4(1), pages 1-11, January.
    19. Saket Navlakha & Anthony Gitter & Ziv Bar-Joseph, 2012. "A Network-based Approach for Predicting Missing Pathway Interactions," PLOS Computational Biology, Public Library of Science, vol. 8(8), pages 1-13, August.
    20. Jeremiah J Faith & Boris Hayete & Joshua T Thaden & Ilaria Mogno & Jamey Wierzbowski & Guillaume Cottarel & Simon Kasif & James J Collins & Timothy S Gardner, 2007. "Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles," PLOS Biology, Public Library of Science, vol. 5(1), pages 1-13, January.
