The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Author

Listed:
  • Shibu Yooseph
  • Granger Sutton
  • Douglas B Rusch
  • Aaron L Halpern
  • Shannon J Williamson
  • Karin Remington
  • Jonathan A Eisen
  • Karla B Heidelberg
  • Gerard Manning
  • Weizhong Li
  • Lukasz Jaroszewski
  • Piotr Cieplak
  • Christopher S Miller
  • Huiying Li
  • Susan T Mashiyama
  • Marcin P Joachimiak
  • Christopher van Belle
  • John-Marc Chandonia
  • David A Soergel
  • Yufeng Zhai
  • Kannan Natarajan
  • Shaun Lee
  • Benjamin J Raphael
  • Vineet Bafna
  • Robert Friedman
  • Steven E Brenner
  • Adam Godzik
  • David Eisenberg
  • Jack E Dixon
  • Susan S Taylor
  • Robert L Strausberg
  • Marvin Frazier
  • J Craig Venter

Abstract

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature. 
The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique, shotgun sequencing, allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data, nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.

The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.
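The abstract's closing observation, that new families appear at a linear or almost linear rate as sequences are added, can be illustrated with a small discovery curve. The sketch below is not the authors' pipeline. It assumes each sequence has already been assigned a family (cluster) label by some clustering step, and the function names `discovery_curve` and `slope_ratio`, the toy label lists, and the `step` parameter are all hypothetical and chosen only for illustration.

```python
def discovery_curve(family_labels, step):
    """Count distinct families seen as sequences are added in order.

    Returns a list of (n_sequences, n_families_seen) pairs, one per
    `step` sequences. `family_labels` is a hypothetical per-sequence
    cluster assignment, not data from the paper.
    """
    seen = set()
    curve = []
    for i, label in enumerate(family_labels, start=1):
        seen.add(label)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


def slope_ratio(curve):
    """Compare the discovery rate in the second half of the curve to the first.

    A ratio near 1 suggests roughly linear growth in families;
    a ratio near 0 suggests that sampling is saturating.
    """
    mid = len(curve) // 2
    (x0, y0), (xm, ym), (x1, y1) = curve[0], curve[mid], curve[-1]
    early = (ym - y0) / (xm - x0)
    late = (y1 - ym) / (x1 - xm)
    if early == 0:
        raise ValueError("no families discovered in the first half of the curve")
    return late / early


# Toy data (assumed, for illustration only).
# Saturating case: only 50 families exist, so discovery stops early.
saturating = [i % 50 for i in range(10_000)]
# Open-ended case: every fourth sequence founds a new family.
open_ended = [i // 4 for i in range(10_000)]

saturating_ratio = slope_ratio(discovery_curve(saturating, step=10))
open_ended_ratio = slope_ratio(discovery_curve(open_ended, step=10))
print(f"saturating: {saturating_ratio:.3f}, open-ended: {open_ended_ratio:.3f}")
```

In this toy setting the saturating series gives a ratio of zero and the open-ended series a ratio close to one. The near-linear trend reported for the GOS data corresponds to the second pattern. A real analysis would also need to account for sequence order, cluster size thresholds, and fragmentary sequences, none of which this sketch models.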

Suggested Citation

  • Shibu Yooseph & Granger Sutton & Douglas B Rusch & Aaron L Halpern & Shannon J Williamson & Karin Remington & Jonathan A Eisen & Karla B Heidelberg & Gerard Manning & Weizhong Li & Lukasz Jaroszewski, 2007. "The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families," PLOS Biology, Public Library of Science, vol. 5(3), pages 1-35, March.
  • Handle: RePEc:plo:pbio00:0050016
    DOI: 10.1371/journal.pbio.0050016

    Download full text from publisher

    File URL: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0050016
    Download Restriction: no

    File URL: https://journals.plos.org/plosbiology/article/file?id=10.1371/journal.pbio.0050016&type=printable
    Download Restriction: no

    Citations


    Cited by:

    1. Natarajan Kannan & Susan S Taylor & Yufeng Zhai & J Craig Venter & Gerard Manning, 2007. "Structural and Functional Diversity of the Microbial Kinome," PLOS Biology, Public Library of Science, vol. 5(3), pages 1-12, March.
    2. Meishun Yu & Menghui Zhang & Runying Zeng & Ruolin Cheng & Rui Zhang & Yanping Hou & Fangfang Kuang & Xuejin Feng & Xiyang Dong & Yinfang Li & Zongze Shao & Min Jin, 2024. "Diversity and potential host-interactions of viruses inhabiting deep-sea seamount sediments," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    3. Morgan N Price & Paramvir S Dehal & Adam P Arkin, 2008. "FastBLAST: Homology Relationships for Millions of Proteins," PLOS ONE, Public Library of Science, vol. 3(10), pages 1-8, October.
    4. Katharina Mir & Steffen Schober, 2014. "Selection Pressure in Alternative Reading Frames," PLOS ONE, Public Library of Science, vol. 9(10), pages 1-7, October.
    5. Yael Baran & Eran Halperin, 2012. "Joint Analysis of Multiple Metagenomic Samples," PLOS Computational Biology, Public Library of Science, vol. 8(2), pages 1-11, February.
    6. Armstrong, Claire W. & Foley, Naomi S. & Tinch, Rob & van den Hove, Sybille, 2012. "Services from the deep: Steps towards valuation of deep sea goods and services," Ecosystem Services, Elsevier, vol. 2(C), pages 2-13.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kelly J. Whaley-Martin & Lin-Xing Chen & Tara Colenbrander Nelson & Jennifer Gordon & Rose Kantor & Lauren E. Twible & Stephanie Marshall & Sam McGarry & Laura Rossi & Benoit Bessette & Christian Baro, 2023. "O2 partitioning of sulfur oxidizing bacteria drives acidity and thiosulfate distributions in mining waters," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    2. Jean-Sebastien Gounot & Minghao Chia & Denis Bertrand & Woei-Yuh Saw & Aarthi Ravikrishnan & Adrian Low & Yichen Ding & Amanda Hui Qi Ng & Linda Wei Lin Tan & Yik-Ying Teo & Henning Seedorf & Niranjan, 2022. "Genome-centric analysis of short and long read metagenomes reveals uncharacterized microbiome diversity in Southeast Asians," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    3. Xiaoquan Su & Weihua Pan & Baoxing Song & Jian Xu & Kang Ning, 2014. "Parallel-META 2.0: Enhanced Metagenomic Data Analysis with Functional Annotation, High Performance Computing and Advanced Visualization," PLOS ONE, Public Library of Science, vol. 9(3), pages 1-13, March.
    4. Angelina Beavogui & Auriane Lacroix & Nicolas Wiart & Julie Poulain & Tom O. Delmont & Lucas Paoli & Patrick Wincker & Pedro H. Oliveira, 2024. "The defensome of complex bacterial communities," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
