IDEAS home Printed from https://ideas.repec.org/a/plo/pcbi00/1006721.html
   My bibliography  Save this article

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

Author

Listed:
  • Stephen Woloszynek
  • Zhengqiao Zhao
  • Jian Chen
  • Gail L Rosen

Abstract

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.Author summary: Improvements in the way genomes are sequenced have led to an abundance of microbiome data. With the right approaches, researchers use these data to thoroughly characterize how microbes interact with each other and their host, but sequencing data is of a form (sequences of letters) not ideal for many data analysis approaches. We therefore present an approach to transform sequencing data into arrays of numbers that can capture interesting qualities of the data at the sub-sequence, full-sequence, and sample levels. This allows us to measure the importance of certain microbial sequences with respect to the type of microbe and the condition of the host. Also, representing sequences in this way improves our ability to use other complicated modeling approaches. Using microbiome data from human samples, we show that our numeric representations captured differences between various types of microbes, as well as differences in the body site location from which the samples were collected.

Suggested Citation

  • Stephen Woloszynek & Zhengqiao Zhao & Jian Chen & Gail L Rosen, 2019. "16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses," PLOS Computational Biology, Public Library of Science, vol. 15(2), pages 1-25, February.
  • Handle: RePEc:plo:pcbi00:1006721
    DOI: 10.1371/journal.pcbi.1006721
    as

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006721
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1006721&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1006721?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    References listed on IDEAS

    as
    1. Friedman, Jerome H. & Hastie, Trevor & Tibshirani, Rob, 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 33(i01).
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    2. Rui Wang & Naihua Xiu & Kim-Chuan Toh, 2021. "Subspace quadratic regularization method for group sparse multinomial logistic regression," Computational Optimization and Applications, Springer, vol. 79(3), pages 531-559, July.
    3. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    4. Chen, Le-Yu & Lee, Sokbae, 2018. "Best subset binary prediction," Journal of Econometrics, Elsevier, vol. 206(1), pages 39-56.
    5. Chuliá, Helena & Garrón, Ignacio & Uribe, Jorge M., 2024. "Daily growth at risk: Financial or real drivers? The answer is not always the same," International Journal of Forecasting, Elsevier, vol. 40(2), pages 762-776.
    6. Sung Jae Jun & Sokbae Lee, 2024. "Causal Inference Under Outcome-Based Sampling with Monotonicity Assumptions," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 42(3), pages 998-1009, July.
    7. Xiangwei Li & Thomas Delerue & Ben Schöttker & Bernd Holleczek & Eva Grill & Annette Peters & Melanie Waldenberger & Barbara Thorand & Hermann Brenner, 2022. "Derivation and validation of an epigenetic frailty risk score in population-based cohorts of older adults," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    8. Christopher J Greenwood & George J Youssef & Primrose Letcher & Jacqui A Macdonald & Lauryn J Hagg & Ann Sanson & Jenn Mcintosh & Delyse M Hutchinson & John W Toumbourou & Matthew Fuller-Tyszkiewicz &, 2020. "A comparison of penalised regression methods for informing the selection of predictive markers," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-14, November.
    9. Mai Li & Ying Lin & Qianmei Feng & Wenjiang Fu & Shenglin Peng & Siwei Chen & Mahesh Paidpilli & Chirag Goel & Eduard Galstyan & Venkat Selvamanickam, 2025. "Quantile regression-enriched event modeling framework for dropout analysis in high-temperature superconductor manufacturing," Journal of Intelligent Manufacturing, Springer, vol. 36(5), pages 3009-3030, June.
    10. Heng Chen & Daniel F. Heitjan, 2022. "Analysis of local sensitivity to nonignorability with missing outcomes and predictors," Biometrics, The International Biometric Society, vol. 78(4), pages 1342-1352, December.
    11. S Ariane Christie & Amanda S Conroy & Rachael A Callcut & Alan E Hubbard & Mitchell J Cohen, 2019. "Dynamic multi-outcome prediction after injury: Applying adaptive machine learning for precision medicine in trauma," PLOS ONE, Public Library of Science, vol. 14(4), pages 1-13, April.
    12. Zhu Wang, 2022. "MM for penalized estimation," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 31(1), pages 54-75, March.
    13. Ida Kubiszewski & Kenneth Mulder & Diane Jarvis & Robert Costanza, 2022. "Toward better measurement of sustainable development and wellbeing: A small number of SDG indicators reliably predict life satisfaction," Sustainable Development, John Wiley & Sons, Ltd., vol. 30(1), pages 139-148, February.
    14. Gustavo A. Alonso-Silverio & Víctor Francisco-García & Iris P. Guzmán-Guzmán & Elías Ventura-Molina & Antonio Alarcón-Paredes, 2021. "Toward Non-Invasive Estimation of Blood Glucose Concentration: A Comparative Performance," Mathematics, MDPI, vol. 9(20), pages 1-13, October.
    15. Christopher Kath & Florian Ziel, 2018. "The value of forecasts: Quantifying the economic gains of accurate quarter-hourly electricity price forecasts," Papers 1811.08604, arXiv.org.
    16. Naimoli, Antonio, 2022. "Modelling the persistence of Covid-19 positivity rate in Italy," Socio-Economic Planning Sciences, Elsevier, vol. 82(PA).
    17. Gurgul Henryk & Machno Artur, 2017. "Trade Pattern on Warsaw Stock Exchange and Prediction of Number of Trades," Statistics in Transition New Series, Statistics Poland, vol. 18(1), pages 91-114, March.
    18. Ahmed Ismaïl & Hartikainen Anna-Liisa & Järvelin Marjo-Riitta & Richardson Sylvia, 2011. "False Discovery Rate Estimation for Stability Selection: Application to Genome-Wide Association Studies," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-20, November.
    19. Vitaly Meursault & Daniel Moulton & Larry Santucci & Nathan Schor, 2022. "One Threshold Doesn’t Fit All: Tailoring Machine Learning Predictions of Consumer Default for Lower-Income Areas," Working Papers 22-39, Federal Reserve Bank of Philadelphia.
    20. Michael Funke & Kadri Männasoo & Helery Tasane, 2023. "Regional Economic Impacts of the Øresund Cross-Border Fixed Link: Cui Bono?," CESifo Working Paper Series 10557, CESifo.

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1006721. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/ .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.