
Predicting Gene Expression from Sequence: A Reexamination

Author

Listed:
  • Yuan Yuan
  • Lei Guo
  • Lei Shen
  • Jun S Liu

Abstract

Although much of the information regarding gene expression is encoded in the genome, deciphering such information has been very challenging. We reexamined Beer and Tavazoie's (BT) approach to predicting mRNA expression patterns of 2,587 genes in Saccharomyces cerevisiae from the information in their respective promoter sequences. Instead of fitting complex Bayesian network models, we trained naïve Bayes classifiers using only the sequence-motif matching scores provided by BT. Our simple models correctly predict expression patterns for 79% of the genes, based on the same criterion and the same cross-validation (CV) procedure as BT, which compares favorably to the 73% accuracy of BT. The fact that our approach did not use position and orientation information of the predicted binding sites but achieved a higher prediction accuracy motivated us to investigate a few biological predictions made by BT. We found that some of their predictions, especially those related to motif orientations and positions, are at best circumstantial. For example, the combinatorial rules suggested by BT for the PAC and RRPE motifs are not unique to the cluster of genes from which the predictive model was inferred, and there are simpler rules that are statistically more significant than BT's. We also show that the CV procedure used by BT to estimate their method's prediction accuracy is inappropriate and may have overestimated the prediction accuracy by about 10%.

Through binding to certain sequence-specific sites upstream of the target genes, a special class of proteins called transcription factors (TFs) control the transcription activities, i.e., expression levels, of the downstream genes. The DNA sequence patterns bound by TFs are called motifs. An article by Beer and Tavazoie (BT) published in Cell in 2004 showed that a gene's expression pattern can be predicted well based only on its upstream sequence information, in the form of matching scores for a set of sequence motifs and the location and orientation of the corresponding predicted binding sites. Here we report a new naïve Bayes method for such a prediction task. Compared to BT's work, our model is simpler, more robust, and achieves a higher prediction accuracy using only the motif matching scores. In our method, the location and orientation information does not further help the prediction in a global way. Our result also casts doubt on several biological hypotheses generated by BT based on their model. Finally, we show that the cross-validation procedure used by BT to estimate their method's prediction accuracy is inappropriate and may have overestimated the accuracy by about 10%.
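
To make the prediction task concrete, the following is a minimal sketch of a naïve Bayes classifier evaluated by k-fold cross-validation. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian likelihood, the 5-fold split, the synthetic "motif scores", and all function names (fit_gaussian_nb, predict, cross_validate, make_toy_data) are hypothetical choices made here, since the abstract does not specify these details.

```python
import math
import random

def fit_gaussian_nb(X, y):
    """Estimate a log prior and per-feature mean/variance for each class.
    Assumption: Gaussian likelihoods; the paper's actual likelihood model
    is not stated in the abstract."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6  # small floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(y)), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior, treating the
    features (motif scores) as conditionally independent given the class."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))

def cross_validate(X, y, k=5, seed=0):
    """Plain k-fold CV: each fold is held out once and predicted by a
    model trained on the remaining folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)

def make_toy_data(n_per_class=100, seed=1):
    """Synthetic stand-in for motif matching scores (NOT the BT data):
    two expression-pattern classes, three motif scores per gene."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centers in enumerate([(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]):
        for _ in range(n_per_class):
            X.append([rng.gauss(c, 1.0) for c in centers])
            y.append(label)
    return X, y

X, y = make_toy_data()
accuracy = cross_validate(X, y)
print(f"5-fold CV accuracy on toy data: {accuracy:.2f}")
```

Note that this sketch only covers the classifier and the fold loop. Any step that looks at expression labels before the split (for example, choosing which motifs to use) would also have to be repeated inside each training fold for the accuracy estimate to be unbiased.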

Suggested Citation

  • Yuan Yuan & Lei Guo & Lei Shen & Jun S Liu, 2007. "Predicting Gene Expression from Sequence: A Reexamination," PLOS Computational Biology, Public Library of Science, vol. 3(11), pages 1-7, November.
  • Handle: RePEc:plo:pcbi00:0030243
    DOI: 10.1371/journal.pcbi.0030243

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030243
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.0030243&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ana Helena Tavares & Jakob Raymaekers & Peter J. Rousseeuw & Paula Brito & Vera Afreixo, 2020. "Clustering genomic words in human DNA using peaks and trends of distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(1), pages 57-76, March.
    2. Tuglus Catherine & van der Laan Mark J., 2011. "Repeated Measures Semiparametric Regression Using Targeted Maximum Likelihood Methodology with Application to Transcription Factor Activity Discovery," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-31, January.
    3. Harri Lähdesmäki & Alistair G Rust & Ilya Shmulevich, 2008. "Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources," PLOS ONE, Public Library of Science, vol. 3(3), pages 1-24, March.
    4. Luis Carvalho, 2013. "Bayesian Centroid Estimation for Motif Discovery," PLOS ONE, Public Library of Science, vol. 8(12), pages 1-12, December.
    5. Sohn, Insuk & Shim, Jooyong & Hwang, Changha & Kim, Sujong & Lee, Jae Won, 2009. "Informative transcription factor selection using support vector machine-based generalized approximate cross validation criteria," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1727-1735, March.
    6. Baierl, Andreas & Futschik, Andreas & Bogdan, Malgorzata & Biecek, Przemyslaw, 2007. "Locating multiple interacting quantitative trait loci using robust model selection," Computational Statistics & Data Analysis, Elsevier, vol. 51(12), pages 6423-6434, August.
    7. Insuk Sohn & Jooyong Shim & Changha Hwang & Sujong Kim & Jae Won Lee, 2014. "Transcription factor-binding site identification and gene classification via fusion of the supervised-weighted discrete kernel clustering and support vector machine," Journal of Applied Statistics, Taylor & Francis Journals, vol. 41(3), pages 573-581, March.
    8. Siewert Elizabeth A & Kechris Katerina J, 2009. "Prediction of Motifs Based on a Repeated-Measures Model for Integrating Cross-Species Sequence and Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 8(1), pages 1-34, September.
    9. Fran Lewitter, 2007. "Moving Education Forward," PLOS Computational Biology, Public Library of Science, vol. 3(1), pages 1-2, January.
