
A simple new approach to variable selection in regression, with application to genetic fine mapping

Author

Listed:
  • Gao Wang
  • Abhishek Sarkar
  • Peter Carbonetto
  • Matthew Stephens

Abstract

We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model—the ‘sum of single effects’ model, called ‘SuSiE’—which comes from writing the sparse vector of regression coefficients as a sum of ‘single‐effect’ vectors, each with one non‐zero element. We also introduce a corresponding new fitting procedure—iterative Bayesian stepwise selection (IBSS)—which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods but, instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under SuSiE. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a credible set of variables for each selection. Our methods are particularly well suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and we illustrate their application to fine mapping genetic variants influencing alternative splicing in human cell lines. We also discuss the potential and challenges for applying these methods to generic variable‐selection problems.
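
To illustrate how the pieces described in the abstract fit together, the following Python sketch implements a stripped-down version of IBSS. It is not the authors' implementation. It assumes that y and the columns of X have been centered, that the residual variance (sigma2) and the prior effect variance (sigma0_sq) are fixed and known, and that every variable is a priori equally likely to carry each effect. The stopping rule (change in fitted coefficients) is also a simplification made here, and all function names and defaults are hypothetical.

```python
import numpy as np


def single_effect_regression(X, r, sigma2, sigma0_sq, xtx):
    """Posterior for a regression with exactly one non-zero coefficient.

    Returns alpha (probability that each variable is the single effect)
    and post_mean (posterior mean of the effect given that variable).
    """
    beta_hat = X.T @ r / xtx              # per-variable least-squares estimates
    s2 = sigma2 / xtx                     # sampling variances of those estimates
    # Log Bayes factor: "variable j carries the effect" versus "no effect".
    log_bf = (0.5 * np.log(s2 / (s2 + sigma0_sq))
              + 0.5 * (beta_hat ** 2 / s2) * sigma0_sq / (s2 + sigma0_sq))
    # Uniform prior over variables (assumption), so alpha is a softmax of log_bf.
    w = np.exp(log_bf - log_bf.max())
    alpha = w / w.sum()
    post_var = 1.0 / (1.0 / s2 + 1.0 / sigma0_sq)
    post_mean = post_var / s2 * beta_hat
    return alpha, post_mean


def ibss(X, y, L=5, sigma2=1.0, sigma0_sq=1.0, n_iter=100, tol=1e-6):
    """Fit a sum of L single effects by cycling over them one at a time."""
    n, p = X.shape
    xtx = np.einsum("ij,ij->j", X, X)     # x_j' x_j for every column j
    alpha = np.full((L, p), 1.0 / p)
    mu = np.zeros((L, p))
    for _ in range(n_iter):
        b_old = (alpha * mu).sum(axis=0)
        for l in range(L):
            # Residualize y on the current estimates of all other effects,
            # then refit effect l as a single-effect regression.
            b_others = (alpha * mu).sum(axis=0) - alpha[l] * mu[l]
            r = y - X @ b_others
            alpha[l], mu[l] = single_effect_regression(
                X, r, sigma2, sigma0_sq, xtx)
        b_new = (alpha * mu).sum(axis=0)
        # Simplified stopping rule (assumption): coefficients stopped moving.
        if np.max(np.abs(b_new - b_old)) < tol:
            break
    return alpha, mu


def credible_set(alpha_l, coverage=0.95):
    """Smallest set of variables whose alpha values sum to >= coverage."""
    order = np.argsort(-alpha_l)
    k = np.searchsorted(np.cumsum(alpha_l[order]), coverage) + 1
    return order[:k]
```

Each row of alpha is the "distribution on variables" that replaces the single hard choice made by ordinary stepwise selection, and credible_set(alpha[l]) turns one such row into a credible set for that effect. A per-variable summary can be obtained as 1 - np.prod(1 - alpha, axis=0), the probability that a variable is picked by at least one of the L effects under this approximation.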

Suggested Citation

  • Gao Wang & Abhishek Sarkar & Peter Carbonetto & Matthew Stephens, 2020. "A simple new approach to variable selection in regression, with application to genetic fine mapping," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 82(5), pages 1273-1300, December.
  • Handle: RePEc:bla:jorssb:v:82:y:2020:i:5:p:1273-1300
    DOI: 10.1111/rssb.12388

    Download full text from publisher

    File URL: https://doi.org/10.1111/rssb.12388
    Download Restriction: no


    References listed on IDEAS

    1. David M. Blei & Alp Kucukelbir & Jon D. McAuliffe, 2017. "Variational Inference: A Review for Statisticians," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 112(518), pages 859-877, April.
    2. Matteo Sesia & Eugene Katsevich & Stephen Bates & Emmanuel Candès & Chiara Sabatti, 2020. "Multi-resolution localization of causal variants across the genome," Nature Communications, Nature, vol. 11(1), pages 1-10, December.
    3. Killick, Rebecca & Eckley, Idris A., 2014. "changepoint: An R Package for Changepoint Analysis," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 58(i03).
    4. Loann David Denis Desboulets, 2018. "A Review on Variable Selection in Regression Analysis," Econometrics, MDPI, vol. 6(4), pages 1-27, November.
    5. Gerhard Moser & Sang Hong Lee & Ben J Hayes & Michael E Goddard & Naomi R Wray & Peter M Visscher, 2015. "Simultaneous Discovery, Estimation and Prediction Analysis of Complex Traits Using a Bayesian Mixture Model," PLOS Genetics, Public Library of Science, vol. 11(4), pages 1-22, April.
    6. Matthew Stephens, 2013. "A Unified Framework for Association Analysis with Multiple Related Phenotypes," PLOS ONE, Public Library of Science, vol. 8(7), pages 1-19, July.
    7. Matteo Sesia & Eugene Katsevich & Stephen Bates & Emmanuel Candès & Chiara Sabatti, 2020. "Publisher Correction: Multi-resolution localization of causal variants across the genome," Nature Communications, Nature, vol. 11(1), pages 1-1, December.
    8. Jeffrey T Leek & John D Storey, 2007. "Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis," PLOS Genetics, Public Library of Science, vol. 3(9), pages 1-12, September.
    9. Matthew Stephens, 2000. "Dealing with label switching in mixture models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 62(4), pages 795-809.
    10. Clive J Hoggart & John C Whittaker & Maria De Iorio & David J Balding, 2008. "Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies," PLOS Genetics, Public Library of Science, vol. 4(7), pages 1-8, July.
    11. Xiang Zhou & Peter Carbonetto & Matthew Stephens, 2013. "Polygenic Modeling with Bayesian Sparse Linear Mixed Models," PLOS Genetics, Public Library of Science, vol. 9(2), pages 1-14, February.
    12. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    13. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.
    14. Loann D. Desboulets, 2018. "A Review on Variable Selection in Regression Analysis," AMSE Working Papers 1852, Aix-Marseille School of Economics, France.
    15. Nicolai Meinshausen, 2008. "Hierarchical testing of variable importance," Biometrika, Biometrika Trust, vol. 95(2), pages 265-278.
    16. Erdman, Chandra & Emerson, John W., 2007. "bcp: An R Package for Performing a Bayesian Analysis of Change Point Problems," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 23(i03).

    Citations

    Cited by:

    1. N. Hernández & J. Soenksen & P. Newcombe & M. Sandhu & I. Barroso & C. Wallace & J. L. Asimit, 2021. "The flashfm approach for fine-mapping multiple quantitative traits," Nature Communications, Nature, vol. 12(1), pages 1-14, December.
    2. Mary P. LaPierre & Katherine Lawler & Svenja Godbersen & I. Sadaf Farooqi & Markus Stoffel, 2022. "MicroRNA-7 regulates melanocortin circuits involved in mammalian energy homeostasis," Nature Communications, Nature, vol. 13(1), pages 1-17, December.
    3. Isabelle Austin-Zimmerman & Daniel F. Levey & Olga Giannakopoulou & Joseph D. Deak & Marco Galimberti & Keyrun Adhikari & Hang Zhou & Spiros Denaxas & Haritz Irizar & Karoline Kuchenbaecker & Andrew M, 2023. "Genome-wide association studies and cross-population meta-analyses investigating short and long sleep duration," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    4. Santini, Alberto & Malaguti, Enrico, 2024. "The min-Knapsack problem with compactness constraints and applications in statistics," European Journal of Operational Research, Elsevier, vol. 312(1), pages 385-397.
    5. Joel T. Rämö & Tuomo Kiiskinen & Richard Seist & Kristi Krebs & Masahiro Kanai & Juha Karjalainen & Mitja Kurki & Eija Hämäläinen & Paavo Häppölä & Aki S. Havulinna & Heidi Hautakangas & Reedik Mägi &, 2023. "Genome-wide screen of otosclerosis in population biobanks: 27 loci and shared associations with skeletal structure," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    6. Qingbo S. Wang & Ryuya Edahiro & Ho Namkoong & Takanori Hasegawa & Yuya Shirai & Kyuto Sonehara & Hiromu Tanaka & Ho Lee & Ryunosuke Saiki & Takayoshi Hyugaji & Eigo Shimizu & Kotoe Katayama & Masahir, 2022. "The whole blood transcriptional regulation landscape in 465 COVID-19 infected samples from Japan COVID-19 Task Force," Nature Communications, Nature, vol. 13(1), pages 1-19, December.
    7. Satu Strausz & Erik Abner & Grace Blacker & Sarah Galloway & Paige Hansen & Qingying Feng & Brandon T. Lee & Samuel E. Jones & Hele Haapaniemi & Sten Raak & George Ronald Nahass & Erin Sanders & Pille, 2024. "SCGB1D2 inhibits growth of Borrelia burgdorferi and affects susceptibility to Lyme disease," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    8. Sylvia Hartmann & Summaira Yasmeen & Benjamin M. Jacobs & Spiros Denaxas & Munir Pirmohamed & Eric R. Gamazon & Mark J. Caulfield & Harry Hemingway & Maik Pietzner & Claudia Langenberg, 2023. "ADRA2A and IRX1 are putative risk genes for Raynaud’s phenomenon," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    9. Ya Cui & Frederick J. Arnold & Fanglue Peng & Dan Wang & Jason Sheng Li & Sebastian Michels & Eric J. Wagner & Albert R. Spada & Wei Li, 2023. "Alternative polyadenylation transcriptome-wide association study identifies APA-linked susceptibility genes in brain disorders," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    10. Yanyu Xiao & Jingjing Wang & Jiaqi Li & Peijing Zhang & Jingyu Li & Yincong Zhou & Qing Zhou & Ming Chen & Xin Sheng & Zhihong Liu & Xiaoping Han & Guoji Guo, 2023. "An analytical framework for decoding cell type-specific genetic variation of gene regulation," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    11. Yunfeng Huang & Dora Bodnar & Chia-Yen Chen & Gabriela Sanchez-Andrade & Mark Sanderson & Jun Shi & Katherine G. Meilleur & Matthew E. Hurles & Sebastian S. Gerety & Ellen A. Tsai & Heiko Runz, 2023. "Rare genetic variants impact muscle strength," Nature Communications, Nature, vol. 14(1), pages 1-8, December.
    12. Marion Patxot & Daniel Trejo Banos & Athanasios Kousathanas & Etienne J. Orliac & Sven E. Ojavee & Gerhard Moser & Alexander Holloway & Julia Sidorenko & Zoltan Kutalik & Reedik Mägi & Peter M. Vissch, 2021. "Probabilistic inference of the genetic architecture underlying functional enrichment of complex traits," Nature Communications, Nature, vol. 12(1), pages 1-16, December.
    13. Elmo C. Saarentaus & Juha Karjalainen & Joel T. Rämö & Tuomo Kiiskinen & Aki S. Havulinna & Juha Mehtonen & Heidi Hautakangas & Sanni Ruotsalainen & Max Tamlander & Nina Mars & Sanna Toppila-Salmi & M, 2023. "Inflammatory and infectious upper respiratory diseases associate with 41 genomic loci and type 2 inflammation," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    14. Nathan LaPierre & Kodi Taraszka & Helen Huang & Rosemary He & Farhad Hormozdiari & Eleazar Eskin, 2021. "Identifying causal variants by fine mapping across multiple studies," PLOS Genetics, Public Library of Science, vol. 17(9), pages 1-19, September.
    15. Mingxuan Cai & Zhiwei Wang & Jiashun Xiao & Xianghong Hu & Gang Chen & Can Yang, 2023. "XMAP: Cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    16. Wenhan Chen & Yang Wu & Zhili Zheng & Ting Qi & Peter M. Visscher & Zhihong Zhu & Jian Yang, 2021. "Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors," Nature Communications, Nature, vol. 12(1), pages 1-10, December.
    17. Eeva Sliz & Jaakko S. Tyrmi & Nilufer Rahmioglu & Krina T. Zondervan & Christian M. Becker & Outi Uimari & Johannes Kettunen, 2023. "Evidence of a causal effect of genetic tendency to gain muscle mass on uterine leiomyomata," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    18. Linda Ottensmann & Rubina Tabassum & Sanni E. Ruotsalainen & Mathias J. Gerl & Christian Klose & Elisabeth Widén & Kai Simons & Samuli Ripatti & Matti Pirinen, 2023. "Genome-wide association analysis of plasma lipidome identifies 495 genetic associations," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    19. Joaquim Fernando Pinto da Costa & Manuel Cabral, 2022. "Statistical Methods with Applications in Data Mining: A Review of the Most Recent Works," Mathematics, MDPI, vol. 10(6), pages 1-22, March.
    20. Alan Selewa & Kaixuan Luo & Michael Wasney & Linsin Smith & Xiaotong Sun & Chenwei Tang & Heather Eckart & Ivan P. Moskowitz & Anindita Basu & Xin He & Sebastian Pott, 2023. "Single-cell genomics improves the discovery of risk variants and genes of atrial fibrillation," Nature Communications, Nature, vol. 14(1), pages 1-18, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Fakhri J. Hasanov & Muhammad Javid & Frederick L. Joutz, 2022. "Saudi Non-Oil Exports before and after COVID-19: Historical Impacts of Determinants and Scenario Analysis," Sustainability, MDPI, vol. 14(4), pages 1-38, February.
    2. Kimia Keshanian & Daniel Zantedeschi & Kaushik Dutta, 2022. "Features Selection as a Nash-Bargaining Solution: Applications in Online Advertising and Information Systems," INFORMS Journal on Computing, INFORMS, vol. 34(5), pages 2485-2501, September.
    3. Aneiros, Germán & Novo, Silvia & Vieu, Philippe, 2022. "Variable selection in functional regression models: A review," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    4. Szefer Elena & Graham Jinko & Lu Donghuan & Beg Mirza Faisal & Nathoo Farouk, 2017. "Multivariate association between single-nucleotide polymorphisms in Alzgene linkage regions and structural changes in the brain: discovery, refinement and validation," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 16(5-6), pages 349-365, December.
    5. Gary Koop & Dimitris Korobilis, 2023. "Bayesian Dynamic Variable Selection In High Dimensions," International Economic Review, Department of Economics, University of Pennsylvania and Osaka University Institute of Social and Economic Research Association, vol. 64(3), pages 1047-1074, August.
    6. Feihong Xia & Rabikar Chatterjee & Jerrold H. May, 2019. "Using Conditional Restricted Boltzmann Machines to Model Complex Consumer Shopping Patterns," Marketing Science, INFORMS, vol. 38(4), pages 711-727, July.
    7. Zemin Zheng & Jinchi Lv & Wei Lin, 2021. "Nonsparse Learning with Latent Variables," Operations Research, INFORMS, vol. 69(1), pages 346-359, January.
    8. McMahan Christopher & Bridges William & Joyner Chase & Lund Robert & Baurley James & Kacamarga Muhamad Fitra & Pardamean Carissa & Pardamean Bens, 2017. "A Bayesian hierarchical model for identifying significant polygenic effects while controlling for confounding and repeated measures," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 16(5-6), pages 407-419, December.
    9. Rajiv Sambasivan & Sourish Das & Sujit K. Sahu, 2020. "A Bayesian perspective of statistical machine learning for big data," Computational Statistics, Springer, vol. 35(3), pages 893-930, September.
    10. Oliver J. Rutz & Garrett P. Sonnier, 2019. "VANISH regularization for generalized linear models," Quantitative Marketing and Economics (QME), Springer, vol. 17(4), pages 415-437, December.
    11. Petropoulos, Fotios & Apiletti, Daniele & Assimakopoulos, Vassilios & Babai, Mohamed Zied & Barrow, Devon K. & Ben Taieb, Souhaib & Bergmeir, Christoph & Bessa, Ricardo J. & Bijak, Jakub & Boylan, Joh, 2022. "Forecasting: theory and practice," International Journal of Forecasting, Elsevier, vol. 38(3), pages 705-871.
      • Fotios Petropoulos & Daniele Apiletti & Vassilios Assimakopoulos & Mohamed Zied Babai & Devon K. Barrow & Souhaib Ben Taieb & Christoph Bergmeir & Ricardo J. Bessa & Jakub Bijak & John E. Boylan & Jet, 2020. "Forecasting: theory and practice," Papers 2012.03854, arXiv.org, revised Jan 2022.
    12. Charlotte Soneson & Sarah Gerster & Mauro Delorenzi, 2014. "Batch Effect Confounding Leads to Strong Bias in Performance Estimates Obtained by Cross-Validation," PLOS ONE, Public Library of Science, vol. 9(6), pages 1-13, June.
    13. Cox Lwaka Tamba & Yuan-Li Ni & Yuan-Ming Zhang, 2017. "Iterative sure independence screening EM-Bayesian LASSO algorithm for multi-locus genome-wide association studies," PLOS Computational Biology, Public Library of Science, vol. 13(1), pages 1-20, January.
    14. Heather E Wheeler & Kaanan P Shah & Jonathon Brenner & Tzintzuni Garcia & Keston Aquino-Michaels & GTEx Consortium & Nancy J Cox & Dan L Nicolae & Hae Kyung Im, 2016. "Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues," PLOS Genetics, Public Library of Science, vol. 12(11), pages 1-23, November.
    15. Lee, Kuo-Jung & Feldkircher, Martin & Chen, Yi-Chi, 2021. "Variable selection in finite mixture of regression models with an unknown number of components," Computational Statistics & Data Analysis, Elsevier, vol. 158(C).
    16. Thangjam, Aditya & Jaipuria, Sanjita & Dadabada, Pradeep Kumar, 2023. "Time-Varying approaches for Long-Term Electric Load Forecasting under economic shocks," Applied Energy, Elsevier, vol. 333(C).
    17. Niloy Biswas & Anirban Bhattacharya & Pierre E. Jacob & James E. Johndrow, 2022. "Coupling‐based convergence assessment of some Gibbs samplers for high‐dimensional Bayesian regression with shrinkage priors," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(3), pages 973-996, July.
    18. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    19. Oxana Babecka Kucharcukova & Jan Bruha, 2016. "Nowcasting the Czech Trade Balance," Working Papers 2016/11, Czech National Bank.
    20. Carstensen, Kai & Heinrich, Markus & Reif, Magnus & Wolters, Maik H., 2020. "Predicting ordinary and severe recessions with a three-state Markov-switching dynamic factor model," International Journal of Forecasting, Elsevier, vol. 36(3), pages 829-850.
