
On polygenic risk scores for complex traits prediction

Author

Listed:
  • Bingxin Zhao
  • Fei Zou

Abstract

Polygenic risk scores (PRS) have gained substantial attention for complex traits prediction in genome‐wide association studies (GWAS). Motivated by the polygenic model of complex traits, we study the statistical properties of PRS under the high‐dimensional but sparsity free setting where the triplet $(n, p, m) \rightarrow (\infty, \infty, \infty)$ with $n, p, m$ being the sample size, the number of assayed single‐nucleotide polymorphisms (SNPs), and the number of assayed causal SNPs, respectively. First, we derive asymptotic results on the out‐of‐sample (prediction) R‐squared for PRS. These results help understand the widespread observed gap between the in‐sample heritability (or partial R‐squared due to the genetic features) estimate and the out‐of‐sample R‐squared for most complex traits. Next, we investigate how features should be selected (e.g., by a p‐value threshold) for constructing optimal PRS. We reveal that the optimal threshold depends largely on the genetic architecture underlying the complex trait and the sample size of the training GWAS, or the $m/n$ ratio. For highly polygenic traits with a large $m/n$ ratio, it is difficult to separate causal and null SNPs and stringent feature selection in principle often leads to poor PRS prediction. We numerically illustrate the theoretical results with intensive simulation studies and real data analysis on 33 complex traits with a wide range of genetic architectures in the UK Biobank database.
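
The thresholding idea described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or their simulation design; it is a minimal toy example under stated assumptions: genotypes are replaced by independent standard normal variables (no linkage disequilibrium), the first m SNPs are causal with i.i.d. normal effects, SNP weights are marginal estimates from the training sample, and p-values come from an approximate z-test. All function names (`simulate_genotypes`, `phenotype`, `marginal_effects`, `r_squared`, `prs_r2`) and default parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_genotypes(n, p, rng):
    """n x p matrix of independent standard normal stand-ins for standardized SNPs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]


def phenotype(X, beta, h2, rng):
    """y = X beta + noise, with noise variance 1 - h2 (assumed heritability h2)."""
    sd = math.sqrt(1.0 - h2)
    return [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sd) for row in X]


def marginal_effects(X, y):
    """Per-SNP marginal estimates X_j'y / n and approximate two-sided z-test p-values."""
    n, p = len(X), len(X[0])
    sd_y = math.sqrt(sum(v * v for v in y) / n)  # y is simulated with mean zero
    norm = NormalDist()
    est, pvals = [], []
    for j in range(p):
        b = sum(X[i][j] * y[i] for i in range(n)) / n
        z = b * math.sqrt(n) / sd_y
        est.append(b)
        pvals.append(2.0 * (1.0 - norm.cdf(abs(z))))
    return est, pvals


def r_squared(a, b):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return 0.0 if saa == 0.0 or sbb == 0.0 else sab * sab / (saa * sbb)


def prs_r2(n_train=300, n_test=300, p=400, m=200, h2=0.5,
           thresholds=(1.0, 0.05, 0.001), seed=1):
    """Out-of-sample R-squared of a thresholded PRS, one value per p-value threshold."""
    rng = random.Random(seed)
    # First m SNPs are causal; effects are scaled so genetic variance is about h2.
    beta = [rng.gauss(0.0, math.sqrt(h2 / m)) if j < m else 0.0 for j in range(p)]
    X_tr = simulate_genotypes(n_train, p, rng)
    X_te = simulate_genotypes(n_test, p, rng)
    y_tr = phenotype(X_tr, beta, h2, rng)
    y_te = phenotype(X_te, beta, h2, rng)
    est, pvals = marginal_effects(X_tr, y_tr)
    out = {}
    for t in thresholds:
        # Keep only SNPs whose training p-value passes the threshold.
        w = [b if pv <= t else 0.0 for b, pv in zip(est, pvals)]
        score = [sum(x * wj for x, wj in zip(row, w)) for row in X_te]
        out[t] = r_squared(score, y_te)
    return out


if __name__ == "__main__":
    for t, r2 in prs_r2().items():
        print(f"p-value threshold {t}: out-of-sample R^2 = {r2:.3f}")
```

Varying `m` and `n_train` changes the m/n ratio, which gives a rough feel for how the choice of threshold interacts with polygenicity in this toy setting. The numbers it prints are specific to these simplifying assumptions and should not be read as reproducing the paper's results.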

Suggested Citation

  • Bingxin Zhao & Fei Zou, 2022. "On polygenic risk scores for complex traits prediction," Biometrics, The International Biometric Society, vol. 78(2), pages 499-511, June.
  • Handle: RePEc:bla:biomet:v:78:y:2022:i:2:p:499-511
    DOI: 10.1111/biom.13466

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13466
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Bingxin Zhao & Fei Zou & Hongtu Zhu, 2023. "Cross‐trait prediction accuracy of summary statistics in genome‐wide association studies," Biometrics, The International Biometric Society, vol. 79(2), pages 841-853, June.
    2. Meng An & Haixiang Zhang, 2023. "High-Dimensional Mediation Analysis for Time-to-Event Outcomes with Additive Hazards Model," Mathematics, MDPI, vol. 11(24), pages 1-11, December.
    3. Tomohiro Ando & Ruey S. Tsay, 2009. "Model selection for generalized linear models with factor‐augmented predictors," Applied Stochastic Models in Business and Industry, John Wiley & Sons, vol. 25(3), pages 207-235, May.
    4. Shuichi Kawano, 2014. "Selection of tuning parameters in bridge regression models via Bayesian information criterion," Statistical Papers, Springer, vol. 55(4), pages 1207-1223, November.
    5. Jing Zhang & Qihua Wang & Xuan Wang, 2022. "Surrogate-variable-based model-free feature screening for survival data under the general censoring mechanism," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 74(2), pages 379-397, April.
    6. Sauvenier, Mathieu & Van Bellegem, Sébastien, 2023. "Direction Identification and Minimax Estimation by Generalized Eigenvalue Problem in High Dimensional Sparse Regression," LIDAM Discussion Papers CORE 2023005, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    7. Mitchell, Brittany L. & Hansell, Narelle K. & McAloney, Kerrie & Martin, Nicholas G. & Wright, Margaret J. & Renteria, Miguel E. & Grasby, Katrina L., 2022. "Polygenic influences associated with adolescent cognitive skills," Intelligence, Elsevier, vol. 94(C).
    8. Ahmed Ismaïl & Hartikainen Anna-Liisa & Järvelin Marjo-Riitta & Richardson Sylvia, 2011. "False Discovery Rate Estimation for Stability Selection: Application to Genome-Wide Association Studies," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-20, November.
    9. Ilias Georgakopoulos-Soares & Chengyu Deng & Vikram Agarwal & Candace S. Y. Chan & Jingjing Zhao & Fumitaka Inoue & Nadav Ahituv, 2023. "Transcription factor binding site orientation and order are major drivers of gene regulatory activity," Nature Communications, Nature, vol. 14(1), pages 1-16, December.
    10. Emre Demirkaya & Yang Feng & Pallavi Basu & Jinchi Lv, 2022. "Large-scale model selection in misspecified generalized linear models [Information theory and an extension of the maximum likelihood principle]," Biometrika, Biometrika Trust, vol. 109(1), pages 123-136.
    11. Shan Luo & Zehua Chen, 2014. "Sequential Lasso Cum EBIC for Feature Selection With Ultra-High Dimensional Feature Space," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 109(507), pages 1229-1240, September.
    12. Shi Chen & Wolfgang Karl Härdle & Brenda López Cabrera, 2020. "Regularization Approach for Network Modeling of German Power Derivative Market," Papers 2009.09739, arXiv.org.
    13. Wang, Christina Dan & Chen, Zhao & Lian, Yimin & Chen, Min, 2022. "Asset selection based on high frequency Sharpe ratio," Journal of Econometrics, Elsevier, vol. 227(1), pages 168-188.
    14. Laurent Ferrara & Anna Simoni, 2023. "When are Google Data Useful to Nowcast GDP? An Approach via Preselection and Shrinkage," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 41(4), pages 1188-1202, October.
    15. Borup, Daniel & Christensen, Bent Jesper & Mühlbach, Nicolaj Søndergaard & Nielsen, Mikkel Slot, 2023. "Targeting predictors in random forest regression," International Journal of Forecasting, Elsevier, vol. 39(2), pages 841-868.
    16. Linh H. Nghiem & Francis K.C. Hui & Samuel Müller & A.H. Welsh, 2023. "Screening methods for linear errors‐in‐variables models in high dimensions," Biometrics, The International Biometric Society, vol. 79(2), pages 926-939, June.
    17. Caroline Jardet & Baptiste Meunier, 2022. "Nowcasting world GDP growth with high‐frequency data," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 41(6), pages 1181-1200, September.
    18. Peter Bühlmann & Jacopo Mandozzi, 2014. "High-dimensional variable screening and bias in subsequent inference, with an empirical comparison," Computational Statistics, Springer, vol. 29(3), pages 407-430, June.
    19. Sangjin Kim & Jong-Min Kim, 2019. "Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data," Mathematics, MDPI, vol. 7(6), pages 1-16, May.
    20. Anders Bredahl Kock, 2012. "On the Oracle Property of the Adaptive Lasso in Stationary and Nonstationary Autoregressions," CREATES Research Papers 2012-05, Department of Economics and Business Economics, Aarhus University.
