Printed from https://ideas.repec.org/a/plo/pone00/0014802.html

Genetic Classification of Populations Using Supervised Learning

Author

Listed:
  • Michael Bridges
  • Elizabeth A Heron
  • Colm O'Dushlaine
  • Ricardo Segurado
  • The International Schizophrenia Consortium (ISC)
  • Derek Morris
  • Aiden Corvin
  • Michael Gill
  • Carlos Pinto

Abstract

There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case–control studies and quality control (when participants in a study have been genotyped at different laboratories). This last application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
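
The contrast the abstract draws between unsupervised and supervised analysis can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes simulated 0/1/2 genotypes for two hypothetical populations, uses the first principal component as the unsupervised method, and substitutes a plain linear support vector machine trained by subgradient descent for the paper's classifiers. All function names and parameter values (`delta`, `lam`, `lr`, `epochs`) are illustrative assumptions.

```python
import numpy as np


def simulate_genotypes(n_per_pop, n_snps, delta, rng):
    """Simulate 0/1/2 genotypes for two populations whose allele
    frequencies differ by a small random shift at every SNP."""
    base = rng.uniform(0.2, 0.8, size=n_snps)
    shift = rng.normal(0.0, delta, size=n_snps)
    freq_a = np.clip(base + shift, 0.01, 0.99)
    freq_b = np.clip(base - shift, 0.01, 0.99)
    geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    X = np.vstack([geno_a, geno_b]).astype(float)
    y = np.concatenate([np.ones(n_per_pop), -np.ones(n_per_pop)])
    return X, y


def standardise(X_fit, X_apply):
    """Scale both matrices using the mean and std of X_fit only."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    std[std == 0] = 1.0  # monomorphic SNPs carry no information
    return (X_fit - mean) / std, (X_apply - mean) / std


def first_pc_scores(X):
    """Scores of every individual on the first principal component."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * S[0]


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Linear SVM fitted by full-batch subgradient descent on the
    L2-regularised hinge loss (a deliberately simple stand-in)."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1  # points violating the margin
        grad_w = lam * w - (X[active] * y[active][:, None]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def compare_methods(n_per_pop=200, n_snps=1000, delta=0.02, seed=0):
    """Return (pca_accuracy, svm_accuracy) on a held-out quarter."""
    rng = np.random.default_rng(seed)
    X, y = simulate_genotypes(n_per_pop, n_snps, delta, rng)
    idx = rng.permutation(len(y))
    n_test = len(y) // 4
    test, train = idx[:n_test], idx[n_test:]

    # Unsupervised: PCA sees every individual but never the labels.
    X_all, _ = standardise(X, X)
    pc1 = first_pc_scores(X_all)
    pred = np.where(pc1[test] > 0, 1.0, -1.0)
    acc = np.mean(pred == y[test])
    # The sign of a component is arbitrary, so take the better labelling.
    pca_acc = max(acc, 1.0 - acc)

    # Supervised: fit on labelled training individuals, score the rest.
    X_tr, X_te = standardise(X[train], X[test])
    w, b = train_linear_svm(X_tr, y[train])
    pred = np.where(X_te @ w + b > 0, 1.0, -1.0)
    svm_acc = np.mean(pred == y[test])
    return float(pca_acc), float(svm_acc)


if __name__ == "__main__":
    pca_acc, svm_acc = compare_methods()
    print(f"PCA (first component) accuracy: {pca_acc:.2f}")
    print(f"Linear SVM held-out accuracy:   {svm_acc:.2f}")
```

Lowering `delta` weakens the population signal, which is the regime the abstract is concerned with. How the two accuracies compare depends on the chosen sample size, SNP count and seed, so the sketch is meant for exploration rather than as a reproduction of the reported results. Note that picking the better sign for the principal component uses the labels at evaluation time, which slightly favours the unsupervised score.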

Suggested Citation

  • Michael Bridges & Elizabeth A Heron & Colm O'Dushlaine & Ricardo Segurado & The International Schizophrenia Consortium (ISC) & Derek Morris & Aiden Corvin & Michael Gill & Carlos Pinto, 2011. "Genetic Classification of Populations Using Supervised Learning," PLOS ONE, Public Library of Science, vol. 6(5), pages 1-12, May.
  • Handle: RePEc:plo:pone00:0014802
    DOI: 10.1371/journal.pone.0014802

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014802
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0014802&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0014802?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. David Reich & Kumarasamy Thangaraj & Nick Patterson & Alkes L. Price & Lalji Singh, 2009. "Reconstructing Indian population history," Nature, Nature, vol. 461(7263), pages 489-494, September.
    2. Baik, Jinho & Silverstein, Jack W., 2006. "Eigenvalues of large sample covariance matrices of spiked population models," Journal of Multivariate Analysis, Elsevier, vol. 97(6), pages 1382-1408, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Gyaneshwer Chaubey & Anurag Kadian & Saroj Bala & Vadlamudi Raghavendra Rao, 2015. "Genetic Affinity of the Bhil, Kol and Gond Mentioned in Epic Ramayana," PLOS ONE, Public Library of Science, vol. 10(6), pages 1-11, June.
    2. Yata, Kazuyoshi & Aoshima, Makoto, 2013. "PCA consistency for the power spiked model in high-dimensional settings," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 334-354.
    3. Jung, Sungkyu & Sen, Arusharka & Marron, J.S., 2012. "Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA," Journal of Multivariate Analysis, Elsevier, vol. 109(C), pages 190-203.
    4. Forzani, Liliana & Gieco, Antonella & Tolmasky, Carlos, 2017. "Likelihood ratio test for partial sphericity in high and ultra-high dimensions," Journal of Multivariate Analysis, Elsevier, vol. 159(C), pages 18-38.
    5. Hachem, Walid & Loubaton, Philippe & Mestre, Xavier & Najim, Jamal & Vallet, Pascal, 2013. "A subspace estimator for fixed rank perturbations of large random matrices," Journal of Multivariate Analysis, Elsevier, vol. 114(C), pages 427-447.
    6. S Justin Carlus & Saumya Sarkar & Sandeep Kumar Bansal & Vertika Singh & Kiran Singh & Rajesh Kumar Jha & Nirmala Sadasivam & Sri Revathy Sadasivam & P S Gireesha & Kumarasamy Thangaraj & Singh Rajend, 2016. "Is MTHFR 677 C>T Polymorphism Clinically Important in Polycystic Ovarian Syndrome (PCOS)? A Case-Control Study, Meta-Analysis and Trial Sequential Analysis," PLOS ONE, Public Library of Science, vol. 11(3), pages 1-15, March.
    7. Nick Patterson & Alkes L Price & David Reich, 2006. "Population Structure and Eigenanalysis," PLOS Genetics, Public Library of Science, vol. 2(12), pages 1-20, December.
    8. Brendan P. W. Ames & Mingyi Hong, 2016. "Alternating direction method of multipliers for penalized zero-variance discriminant analysis," Computational Optimization and Applications, Springer, vol. 64(3), pages 725-754, July.
    9. Ding, Xiucai & Ji, Hong Chang, 2023. "Spiked multiplicative random matrices and principal components," Stochastic Processes and their Applications, Elsevier, vol. 163(C), pages 25-60.
    10. Shu Wang & Jia-Ren Lin & Eduardo D Sontag & Peter K Sorger, 2019. "Inferring reaction network structure from single-cell, multiplex data, using toric systems theory," PLOS Computational Biology, Public Library of Science, vol. 15(12), pages 1-25, December.
    11. Feldman, Michael J., 2023. "Spiked singular values and vectors under extreme aspect ratios," Journal of Multivariate Analysis, Elsevier, vol. 196(C).
    12. Bo Zhang & Jiti Gao & Guangming Pan & Yanrong Yang, 2019. "Spiked Eigenvalues of High-Dimensional Separable Sample Covariance Matrices," Monash Econometrics and Business Statistics Working Papers 31/19, Monash University, Department of Econometrics and Business Statistics.
    13. Zhijun Wu & Yuqing Lou & Wei Jin & Yan Liu & Lin Lu & Guoping Lu, 2012. "The Pro12Ala Polymorphism in the Peroxisome Proliferator-Activated Receptor Gamma-2 Gene (PPARγ2) Is Associated with Increased Risk of Coronary Artery Disease: A Meta-Analysis," PLOS ONE, Public Library of Science, vol. 7(12), pages 1-14, December.
    14. Buzbas, Erkan Ozge & Verdu, Paul, 2018. "Inference on admixture fractions in a mechanistic model of recurrent admixture," Theoretical Population Biology, Elsevier, vol. 122(C), pages 149-157.
    15. Gunjan Sharma & Rakesh Tamang & Ruchira Chaudhary & Vipin Kumar Singh & Anish M Shah & Sharath Anugula & Deepa Selvi Rani & Alla G Reddy & Muthukrishnan Eaaswarkhanth & Gyaneshwer Chaubey & Lalji Sing, 2012. "Genetic Affinities of the Central Indian Tribal Populations," PLOS ONE, Public Library of Science, vol. 7(2), pages 1-8, February.
    16. Li, Weiming & Zhu, Junpeng, 2023. "CLT for spiked eigenvalues of a sample covariance matrix from high-dimensional Gaussian mean mixtures," Journal of Multivariate Analysis, Elsevier, vol. 193(C).
    17. Jeffrey D. Wall & J. Fah Sathirapongsasuti & Ravi Gupta & Asif Rasheed & Radha Venkatesan & Saurabh Belsare & Ramesh Menon & Sameer Phalke & Anuradha Mittal & John Fang & Deepak Tanneeru & Manjari Des, 2023. "South Asian medical cohorts reveal strong founder effects and high rates of homozygosity," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    18. Collins, Benoît & Matsumoto, Sho & Saad, Nadia, 2014. "Integration of invariant matrices and moments of inverses of Ginibre and Wishart matrices," Journal of Multivariate Analysis, Elsevier, vol. 126(C), pages 1-13.
    19. Passemier, Damien & Yao, Jianfeng, 2014. "Estimation of the number of spikes, possibly equal, in the high-dimensional case," Journal of Multivariate Analysis, Elsevier, vol. 127(C), pages 173-183.
    20. Paul, Debashis & Silverstein, Jack W., 2009. "No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix," Journal of Multivariate Analysis, Elsevier, vol. 100(1), pages 37-57, January.
