Automated Protein Subfamily Identification and Classification

Author

Listed:
  • Duncan P Brown
  • Nandini Krishnamurthy
  • Kimmen Sjölander

Abstract

Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. 
Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.

Predicting the function of a gene or protein (gene product) from its primary sequence is a major focus of many bioinformatics methods. In this paper, the authors present a three-stage computational pipeline for gene functional annotation in an evolutionary framework to reduce the systematic errors associated with the standard protocol (annotation transfer from predicted homologs). In the first stage, a functional hierarchy is estimated for each protein family and subfamilies are identified. In the second stage, hidden Markov models (HMMs), a type of statistical model, are constructed for each subfamily to model both the family-defining and subfamily-specific signatures. In the third stage, subfamily HMMs are used to assign novel sequences to functional subtypes. Extensive experimental validation of these methods shows that predicted subfamilies correspond closely to functional subtypes identified by experts and to conserved clades in phylogenetic trees; that subfamily HMMs increase the separation between homologs and non-homologs in sequence database discrimination tests relative to the use of a single HMM for the family; and that specificity of classification of novel sequences to subfamilies using subfamily HMMs is near perfect (1.5% error rate when sequences are assigned to the top-scoring subfamily, and
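
The classification step described in the abstract (assign a query to its top-scoring subfamily HMM, then use logistic regression to flag sequences that may represent an entirely novel subfamily) can be illustrated with a minimal Python sketch. It assumes each subfamily HMM has already produced a score for the query, with higher scores meaning a better match. The function name `classify_query`, the single-feature logistic model, the coefficient values, and the example scores are all hypothetical and chosen for illustration only; the features and fitted parameters used by the authors are not given in this record.

```python
import math


def classify_query(scores, intercept=-5.0, slope=0.1, threshold=0.5):
    """Assign a query to its top-scoring subfamily, or flag it as novel.

    scores    -- dict mapping subfamily name to an HMM score (higher is better)
    intercept -- hypothetical logistic-regression intercept (assumption)
    slope     -- hypothetical coefficient on the top score (assumption)
    threshold -- minimum probability required to accept the assignment
    """
    if not scores:
        raise ValueError("no subfamily scores supplied")

    # Pick the subfamily whose HMM gives the highest score.
    best = max(scores, key=scores.get)

    # Logistic function: estimated probability that the query belongs
    # to one of the known subfamilies, based only on the top score.
    p_known = 1.0 / (1.0 + math.exp(-(intercept + slope * scores[best])))

    label = best if p_known >= threshold else "novel"
    return label, p_known


if __name__ == "__main__":
    # Made-up scores for two queries against three subfamily HMMs.
    strong_hit = {"subfam_A": 120.0, "subfam_B": 40.0, "subfam_C": 15.0}
    weak_hit = {"subfam_A": 10.0, "subfam_B": 20.0, "subfam_C": 5.0}
    print(classify_query(strong_hit))
    print(classify_query(weak_hit))
```

In this sketch a query with a high top score is assigned to that subfamily, while a query whose best score is low is reported as `"novel"`. A real model would be fitted on held-out training data and could use more than one feature.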

Suggested Citation

  • Duncan P Brown & Nandini Krishnamurthy & Kimmen Sjölander, 2007. "Automated Protein Subfamily Identification and Classification," PLOS Computational Biology, Public Library of Science, vol. 3(8), pages 1-13, August.
  • Handle: RePEc:plo:pcbi00:0030160
    DOI: 10.1371/journal.pcbi.0030160

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030160
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.0030160&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Barbara E Engelhardt & Michael I Jordan & Kathryn E Muratore & Steven E Brenner, 2005. "Protein Molecular Function Prediction by Bayesian Phylogenomics," PLOS Computational Biology, Public Library of Science, vol. 1(5), pages 1-1, October.

    Citations

    Cited by:

    1. Stephen F Altschul & John C Wootton & Elena Zaslavsky & Yi-Kuo Yu, 2010. "The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment," PLOS Computational Biology, Public Library of Science, vol. 6(7), pages 1-17, July.
    2. Elisa Boari de Lima & Wagner Meira Júnior & Raquel Cardoso de Melo-Minardi, 2016. "Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering," PLOS Computational Biology, Public Library of Science, vol. 12(6), pages 1-32, June.
    3. Yi-An Chen & Lokesh P Tripathi & Benoit H Dessailly & Johan Nyström-Persson & Shandar Ahmad & Kenji Mizuguchi, 2014. "Integrated Pathway Clusters with Coherent Biological Themes for Target Prioritisation," PLOS ONE, Public Library of Science, vol. 9(6), pages 1-11, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Nils Weinhold & Oliver Sander & Francisco S Domingues & Thomas Lengauer & Ingolf Sommer, 2008. "Local Function Conservation in Sequence and Structure Space," PLOS Computational Biology, Public Library of Science, vol. 4(7), pages 1-13, July.
    2. Jianzhu Ma & Sheng Wang & Zhiyong Wang & Jinbo Xu, 2014. "MRFalign: Protein Homology Detection through Alignment of Markov Random Fields," PLOS Computational Biology, Public Library of Science, vol. 10(3), pages 1-12, March.
    3. Adrian Schröder & Johannes Eichner & Jochen Supper & Jonas Eichner & Dierk Wanke & Carsten Henneges & Andreas Zell, 2010. "Predicting DNA-Binding Specificities of Eukaryotic Transcription Factors," PLOS ONE, Public Library of Science, vol. 5(11), pages 1-15, November.
    4. David K Crockett & Stephen R Piccolo & Perry G Ridge & Rebecca L Margraf & Elaine Lyon & Marc S Williams & Joyce A Mitchell, 2011. "Predicting Phenotypic Severity of Uncertain Gene Variants in the RET Proto-Oncogene," PLOS ONE, Public Library of Science, vol. 6(3), pages 1-7, March.
