
Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

Author

Listed:
  • Alexandra M Schnoes
  • Shoshana D Brown
  • Igor Dodevski
  • Patricia C Babbitt

Abstract

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

Author Summary: One of the core elements of modern biological scientific investigation is the universal availability of millions of protein sequences from thousands of different organisms, allowing for exciting new investigations into biological questions. These sequences, found in large primary sequence databases such as GenBank NR or UniProt/TrEMBL, in secondary databases such as the valuable pathways database KEGG, or in highly curated databases such as UniProt/Swiss-Prot, are often annotated by computationally predicted protein functions. The scale of the available predicted function information is enormous but the accuracy of these predictions is essentially unknown. We investigate the critical question of the accuracy of functional predictions in these four public databases. We used 37 well-characterized enzyme families as a gold standard for comparing the accuracy of functional annotations in these databases. We find that function prediction error (i.e., misannotation) is a serious problem in all but the manually curated database Swiss-Prot. We discuss several approaches for mitigating the consequences of these high levels of misannotation.

Suggested Citation

  • Alexandra M Schnoes & Shoshana D Brown & Igor Dodevski & Patricia C Babbitt, 2009. "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies," PLOS Computational Biology, Public Library of Science, vol. 5(12), pages 1-13, December.
  • Handle: RePEc:plo:pcbi00:1000605
    DOI: 10.1371/journal.pcbi.1000605

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000605
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1000605&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Holly J Atkinson & John H Morris & Thomas E Ferrin & Patricia C Babbitt, 2009. "Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies," PLOS ONE, Public Library of Science, vol. 4(2), pages 1-14, February.

    Citations

    Cited by:

    1. Rui Fa & Domenico Cozzetto & Cen Wan & David T Jones, 2018. "Predicting human protein function with multi-task deep neural networks," PLOS ONE, Public Library of Science, vol. 13(6), pages 1-16, June.
    2. Elisa Boari de Lima & Wagner Meira Júnior & Raquel Cardoso de Melo-Minardi, 2016. "Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering," PLOS Computational Biology, Public Library of Science, vol. 12(6), pages 1-32, June.
    3. Michal Brylinski & Daswanth Lingam, 2012. "eThread: A Highly Optimized Machine Learning-Based Approach to Meta-Threading and the Modeling of Protein Tertiary Structures," PLOS ONE, Public Library of Science, vol. 7(11), pages 1-12, November.
    4. Akira R Kinjo & Haruki Nakamura, 2012. "Composite Structural Motifs of Binding Sites for Delineating Biological Functions of Proteins," PLOS ONE, Public Library of Science, vol. 7(2), pages 1-11, February.
    5. Matthew N Benedict & Michael B Mundy & Christopher S Henry & Nicholas Chia & Nathan D Price, 2014. "Likelihood-Based Gene Annotations for Gap Filling and Quality Assessment in Genome-Scale Metabolic Models," PLOS Computational Biology, Public Library of Science, vol. 10(10), pages 1-14, October.
    6. Thomas J Sharpton & Samantha J Riesenfeld & Steven W Kembel & Joshua Ladau & James P O'Dwyer & Jessica L Green & Jonathan A Eisen & Katherine S Pollard, 2011. "PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data," PLOS Computational Biology, Public Library of Science, vol. 7(1), pages 1-13, January.
    7. Yuval Bussi & Ruti Kapon & Ziv Reich, 2021. "Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy," PLOS ONE, Public Library of Science, vol. 16(10), pages 1-27, October.
    8. Wing-Cheong Wong & Sebastian Maurer-Stroh & Frank Eisenhaber, 2010. "More Than 1,001 Problems with Protein Domain Databases: Transmembrane Regions, Signal Peptides and the Issue of Sequence Homology," PLOS Computational Biology, Public Library of Science, vol. 6(7), pages 1-19, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marco Orlando & Patrick C F Buchholz & Marina Lotti & Jürgen Pleiss, 2021. "The GH19 Engineering Database: Sequence diversity, substrate scope, and evolution in glycoside hydrolase family 19," PLOS ONE, Public Library of Science, vol. 16(10), pages 1-30, October.
    2. Juan Pablo Bascur & Suzan Verberne & Nees Jan Eck & Ludo Waltman, 2023. "Academic information retrieval using citation clusters: in-depth evaluation based on systematic reviews," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(5), pages 2895-2921, May.
    3. Bryan Korithoski & Oralia Kolaczkowski & Krishanu Mukherjee & Reema Kola & Chandra Earl & Bryan Kolaczkowski, 2015. "Evolution of a Novel Antiviral Immune-Signaling Interaction by Partial-Gene Duplication," PLOS ONE, Public Library of Science, vol. 10(9), pages 1-26, September.
