
Genetic Classification of Populations Using Supervised Learning

Author

Listed:
  • Michael Bridges
  • Elizabeth A Heron
  • Colm O'Dushlaine
  • Ricardo Segurado
  • The International Schizophrenia Consortium (ISC)
  • Derek Morris
  • Aiden Corvin
  • Michael Gill
  • Carlos Pinto

Abstract

There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case–control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
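
The supervised idea described in the abstract (use known population labels to fit a classifier, then measure how well it separates individuals it has not seen) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the genotypes are simulated rather than taken from the study, the allele-frequency shift of 0.08 and the sample sizes are arbitrary, and a simple nearest-centroid rule stands in for the neural networks and support vector machines that the paper actually applies.

```python
import random

def simulate(n, freqs, rng):
    """Draw n individuals; each genotype is an allele count (0, 1 or 2) per SNP."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n)]

def centroid(rows):
    """Per-SNP mean genotype of a labelled group."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train(pop_a, pop_b):
    """Fit a nearest-centroid rule: a linear boundary midway between group means."""
    ca, cb = centroid(pop_a), centroid(pop_b)
    weights = [a - b for a, b in zip(ca, cb)]
    midpoint = [(a + b) / 2 for a, b in zip(ca, cb)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias

def predict(model, genotype):
    """Assign an individual to population "A" or "B"."""
    weights, bias = model
    score = sum(w * g for w, g in zip(weights, genotype)) + bias
    return "A" if score > 0 else "B"

def accuracy(model, pop_a, pop_b):
    """Fraction of individuals assigned to their true population."""
    correct = sum(predict(model, g) == "A" for g in pop_a)
    correct += sum(predict(model, g) == "B" for g in pop_b)
    return correct / (len(pop_a) + len(pop_b))

# Hypothetical setup: 300 SNPs, population B differs from A by a small
# allele-frequency shift at every SNP (illustrative values, not from the paper).
rng = random.Random(0)
n_snps = 300
freqs_a = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
freqs_b = [p + rng.choice((-0.08, 0.08)) for p in freqs_a]

# The labels are used for training only; accuracy is measured on held-out
# individuals so that it reflects real distinguishability, not overfitting.
train_a, train_b = simulate(100, freqs_a, rng), simulate(100, freqs_b, rng)
test_a, test_b = simulate(100, freqs_a, rng), simulate(100, freqs_b, rng)

model = train(train_a, train_b)
held_out_accuracy = accuracy(model, test_a, test_b)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

In a quality-control setting the two labels would be, for example, the genotyping laboratories; held-out accuracy well above 0.5 would suggest that the batches are genetically distinguishable.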

Suggested Citation

  • Michael Bridges & Elizabeth A Heron & Colm O'Dushlaine & Ricardo Segurado & The International Schizophrenia Consortium (ISC) & Derek Morris & Aiden Corvin & Michael Gill & Carlos Pinto, 2011. "Genetic Classification of Populations Using Supervised Learning," PLOS ONE, Public Library of Science, vol. 6(5), pages 1-12, May.
  • Handle: RePEc:plo:pone00:0014802
    DOI: 10.1371/journal.pone.0014802

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014802
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0014802&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Gyaneshwer Chaubey & Anurag Kadian & Saroj Bala & Vadlamudi Raghavendra Rao, 2015. "Genetic Affinity of the Bhil, Kol and Gond Mentioned in Epic Ramayana," PLOS ONE, Public Library of Science, vol. 10(6), pages 1-11, June.
    2. Yata, Kazuyoshi & Aoshima, Makoto, 2013. "PCA consistency for the power spiked model in high-dimensional settings," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 334-354.
    3. Jung, Sungkyu & Sen, Arusharka & Marron, J.S., 2012. "Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA," Journal of Multivariate Analysis, Elsevier, vol. 109(C), pages 190-203.
    4. Forzani, Liliana & Gieco, Antonella & Tolmasky, Carlos, 2017. "Likelihood ratio test for partial sphericity in high and ultra-high dimensions," Journal of Multivariate Analysis, Elsevier, vol. 159(C), pages 18-38.
    5. Kay Young McChesney, 2015. "Teaching Diversity," SAGE Open, vol. 5(4), pages 21582440156, October.
    6. Hachem, Walid & Loubaton, Philippe & Mestre, Xavier & Najim, Jamal & Vallet, Pascal, 2013. "A subspace estimator for fixed rank perturbations of large random matrices," Journal of Multivariate Analysis, Elsevier, vol. 114(C), pages 427-447.
    7. Couillet, Romain, 2015. "Robust spiked random matrices and a robust G-MUSIC estimator," Journal of Multivariate Analysis, Elsevier, vol. 140(C), pages 139-161.
    8. Rozaimi Mohamad Razali & Juan Rodriguez-Flores & Mohammadmersad Ghorbani & Haroon Naeem & Waleed Aamer & Elbay Aliyev & Ali Jubran & Andrew G. Clark & Khalid A. Fakhro & Younes Mokrab, 2021. "Thousands of Qatari genomes inform human migration history and improve imputation of Arab haplotypes," Nature Communications, Nature, vol. 12(1), pages 1-16, December.
    9. Mark S Hibbins & Matthew W Hahn, 2021. "The effects of introgression across thousands of quantitative traits revealed by gene expression in wild tomatoes," PLOS Genetics, Public Library of Science, vol. 17(11), pages 1-20, November.
    10. David B. Stern & Nathan W. Anderson & Juanita A. Diaz & Carol Eunmi Lee, 2022. "Genome-wide signatures of synergistic epistasis during parallel adaptation in a Baltic Sea copepod," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    11. Joongyeub Yeo & George Papanicolaou, 2016. "Random matrix approach to estimation of high-dimensional factor models," Papers 1611.05571, arXiv.org, revised Nov 2017.
    12. S Justin Carlus & Saumya Sarkar & Sandeep Kumar Bansal & Vertika Singh & Kiran Singh & Rajesh Kumar Jha & Nirmala Sadasivam & Sri Revathy Sadasivam & P S Gireesha & Kumarasamy Thangaraj & Singh Rajend, 2016. "Is MTHFR 677 C>T Polymorphism Clinically Important in Polycystic Ovarian Syndrome (PCOS)? A Case-Control Study, Meta-Analysis and Trial Sequential Analysis," PLOS ONE, Public Library of Science, vol. 11(3), pages 1-15, March.
    13. Deo, Rohit S., 2016. "On the Tracy–Widom approximation of studentized extreme eigenvalues of Wishart matrices," Journal of Multivariate Analysis, Elsevier, vol. 147(C), pages 265-272.
    14. Nick Patterson & Alkes L Price & David Reich, 2006. "Population Structure and Eigenanalysis," PLOS Genetics, Public Library of Science, vol. 2(12), pages 1-20, December.
    15. Benaych-Georges, Florent & Nadakuditi, Raj Rao, 2012. "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, Elsevier, vol. 111(C), pages 120-135.
    16. Brendan P. W. Ames & Mingyi Hong, 2016. "Alternating direction method of multipliers for penalized zero-variance discriminant analysis," Computational Optimization and Applications, Springer, vol. 64(3), pages 725-754, July.
    17. Patrick K. Kimes & Yufeng Liu & David Neil Hayes & James Stephen Marron, 2017. "Statistical significance for hierarchical clustering," Biometrics, The International Biometric Society, vol. 73(3), pages 811-821, September.
    18. Edoardo Saccenti & Marieke E. Timmerman, 2017. "Considering Horn’s Parallel Analysis from a Random Matrix Theory Point of View," Psychometrika, Springer;The Psychometric Society, vol. 82(1), pages 186-209, March.
    19. Ding, Xiucai & Ji, Hong Chang, 2023. "Spiked multiplicative random matrices and principal components," Stochastic Processes and their Applications, Elsevier, vol. 163(C), pages 25-60.
    20. Shu Wang & Jia-Ren Lin & Eduardo D Sontag & Peter K Sorger, 2019. "Inferring reaction network structure from single-cell, multiplex data, using toric systems theory," PLOS Computational Biology, Public Library of Science, vol. 15(12), pages 1-25, December.
