Printed from https://ideas.repec.org/a/plo/pcbi00/1010180.html

BOSO: A novel feature selection algorithm for linear regression with high-dimensional data

Author

Listed:
  • Luis V Valcárcel
  • Edurne San José-Enériz
  • Xabier Cendoya
  • Ángel Rubio
  • Xabier Agirre
  • Felipe Prósper
  • Francisco J Planes

Abstract

With the rapid growth of high-dimensional datasets in different biomedical domains, there is an urgent need to develop predictive methods able to deal with this complexity. Feature selection is a relevant strategy in machine learning to address this challenge. We introduce a novel feature selection algorithm for linear regression called BOSO (Bilevel Optimization Selector Operator). We benchmarked BOSO against key algorithms in the literature and found superior accuracy for feature selection in high-dimensional datasets. A proof of concept of BOSO for predicting drug sensitivity in cancer is presented. A detailed analysis is carried out for methotrexate, a well-studied drug targeting cancer metabolism.

Author summary

We present BOSO (Bilevel Optimization Selector Operator), a novel method to conduct feature selection in linear regression models. In machine learning, feature selection consists of identifying the subset of input variables (features) that are truly associated with the response variable to be predicted. Adequate feature selection is particularly relevant for high-dimensional datasets, commonly encountered in biomedical research questions that rely on -omics data, e.g. predictive models of drug sensitivity, resistance or toxicity, construction of gene regulatory networks, biomarker selection or association studies. The need for feature selection is especially acute in many of these complex problems because the number of features is greater than the number of samples, which makes it harder to obtain accurate and general predictive models. In this context, we show that the models derived by BOSO achieve a better combination of accuracy and simplicity than competing approaches in the literature. The relevance of BOSO is illustrated in the prediction of drug sensitivity of cancer cell lines, using RNA-seq data and drug screenings from the GDSC (Genomics of Drug Sensitivity in Cancer) database. BOSO obtains linear regression models with a similar level of accuracy but involving a substantially lower number of features, which simplifies the interpretation and validation of predictive models.
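To make the setting in the summary concrete, the following is a minimal sketch of feature selection for linear regression when the number of features exceeds the number of samples. It is not BOSO and does not reproduce the authors' method: it uses plain greedy forward selection scored on a held-out validation split, chosen only because it is short. The function name `forward_select`, the synthetic data, the split sizes and the coefficient values are all assumptions made for illustration.

```python
import numpy as np

def forward_select(X_train, y_train, X_val, y_val, max_features):
    """Greedy forward selection (illustrative stand-in, not BOSO).

    At each round, fit ordinary least squares on the training split for every
    candidate feature added to the current set, and keep the candidate that
    gives the lowest validation mean squared error. Stop when no candidate
    improves the validation error or max_features is reached.
    """
    n_features = X_train.shape[1]
    selected = []
    # Baseline: intercept-only model evaluated on the validation split.
    best_err = np.mean((y_val - y_train.mean()) ** 2)
    for _ in range(max_features):
        best_j = None
        for j in range(n_features):
            if j in selected:
                continue
            cols = selected + [j]
            A = np.column_stack([np.ones(len(y_train)), X_train[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
            A_val = np.column_stack([np.ones(len(y_val)), X_val[:, cols]])
            err = np.mean((y_val - A_val @ coef) ** 2)
            if err < best_err:
                best_err, best_j = err, j
        if best_j is None:
            break  # no remaining feature lowers the validation error
        selected.append(best_j)
    return selected, best_err

# Synthetic example with more features (200) than samples (60).
# Only three features (hypothetical indices) carry signal.
rng = np.random.default_rng(0)
n_samples, n_total = 60, 200
true_idx = [3, 17, 42]
true_coef = np.array([2.0, -3.0, 1.5])
X = rng.standard_normal((n_samples, n_total))
y = X[:, true_idx] @ true_coef + 0.1 * rng.standard_normal(n_samples)

# Hold out the last 20 samples for validation.
X_tr, X_va = X[:40], X[40:]
y_tr, y_va = y[:40], y[40:]

selected, val_err = forward_select(X_tr, y_tr, X_va, y_va, max_features=10)
print("selected features:", selected)
print("validation MSE:", val_err)
```

A sparse model in this sense is one where `selected` is much shorter than the full feature list while the validation error stays low, which is the accuracy versus simplicity trade-off the summary describes.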

Suggested Citation

  • Luis V Valcárcel & Edurne San José-Enériz & Xabier Cendoya & Ángel Rubio & Xabier Agirre & Felipe Prósper & Francisco J Planes, 2022. "BOSO: A novel feature selection algorithm for linear regression with high-dimensional data," PLOS Computational Biology, Public Library of Science, vol. 18(5), pages 1-29, May.
  • Handle: RePEc:plo:pcbi00:1010180
    DOI: 10.1371/journal.pcbi.1010180

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010180
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010180&type=printable
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Vanessa R. Marcelino & Caitlin Welsh & Christian Diener & Emily L. Gulliver & Emily L. Rutten & Remy B. Young & Edward M. Giles & Sean M. Gibbons & Chris Greening & Samuel C. Forster, 2023. "Disease-specific loss of microbial cross-feeding interactions in the human gut," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    2. Byron Botha & Rulof Burger & Kevin Kotzé & Neil Rankin & Daan Steenkamp, 2023. "Big data forecasting of South African inflation," Empirical Economics, Springer, vol. 65(1), pages 149-188, July.
    3. Frommlet, Florian & Ruhaltinger, Felix & Twaróg, Piotr & Bogdan, Małgorzata, 2012. "Modified versions of Bayesian Information Criterion for genome-wide association studies," Computational Statistics & Data Analysis, Elsevier, vol. 56(5), pages 1038-1051.
    4. Zak-Szatkowska, Malgorzata & Bogdan, Malgorzata, 2011. "Modified versions of the Bayesian Information Criterion for sparse Generalized Linear Models," Computational Statistics & Data Analysis, Elsevier, vol. 55(11), pages 2908-2924, November.
    5. Elizabeth Heyes & Anna S. Wilhelmson & Anne Wenzel & Gabriele Manhart & Thomas Eder & Mikkel B. Schuster & Edwin Rzepa & Sachin Pundhir & Teresa D’Altri & Anne-Katrine Frank & Coline Gentil & Jakob Wo, 2023. "TET2 lesions enhance the aggressiveness of CEBPA-mutant acute myeloid leukemia by rebalancing GATA2 expression," Nature Communications, Nature, vol. 14(1), pages 1-18, December.
    6. Matthew S. Bramble & Victor Fourcassié & Neerja Vashist & Florence Roux-Dalvai & Yun Zhou & Guy Bumoko & Michel Lupamba Kasendue & D’Andre Spencer & Hilaire Musasa Hanshi-Hatuhu & Vincent Kambale-Mast, 2024. "Glutathione peroxidase 3 is a potential biomarker for konzo," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    7. Gaorong Li & Liugen Xue & Heng Lian, 2012. "SCAD-penalised generalised additive models with non-polynomial dimensionality," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 24(3), pages 681-697.
    8. Xiaotong Shen & Wei Pan & Yunzhang Zhu & Hui Zhou, 2013. "On constrained and regularized high-dimensional regression," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 65(5), pages 807-832, October.
    9. Emre Demirkaya & Yang Feng & Pallavi Basu & Jinchi Lv, 2022. "Large-scale model selection in misspecified generalized linear models [Information theory and an extension of the maximum likelihood principle]," Biometrika, Biometrika Trust, vol. 109(1), pages 123-136.
    10. J. McClatchy & R. Strogantsev & E. Wolfe & H. Y. Lin & M. Mohammadhosseini & B. A. Davis & C. Eden & D. Goldman & W. H. Fleming & P. Conley & G. Wu & L. Cimmino & H. Mohammed & A. Agarwal, 2023. "Clonal hematopoiesis related TET2 loss-of-function impedes IL1β-mediated epigenetic reprogramming in hematopoietic stem and progenitor cells," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    11. Shan Luo & Zehua Chen, 2014. "Sequential Lasso Cum EBIC for Feature Selection With Ultra-High Dimensional Feature Space," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 109(507), pages 1229-1240, September.
    12. Lu Tang & Ling Zhou & Peter X. K. Song, 2019. "Fusion learning algorithm to combine partially heterogeneous Cox models," Computational Statistics, Springer, vol. 34(1), pages 395-414, March.
    13. Lian, Heng & Du, Pang & Li, YuanZhang & Liang, Hua, 2014. "Partially linear structure identification in generalized additive models with NP-dimensionality," Computational Statistics & Data Analysis, Elsevier, vol. 80(C), pages 197-208.
    14. Molly C. Klanderman & Kathryn B. Newhart & Tzahi Y. Cath & Amanda S. Hering, 2020. "Fault isolation for a complex decentralized waste water treatment facility," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 69(4), pages 931-951, August.
    15. Tang, Yanlin & Song, Xinyuan & Wang, Huixia Judy & Zhu, Zhongyi, 2013. "Variable selection in high-dimensional quantile varying coefficient models," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 115-132.
    16. Li, Yujie & Li, Gaorong & Lian, Heng & Tong, Tiejun, 2017. "Profile forward regression screening for ultra-high dimensional semiparametric varying coefficient partially linear models," Journal of Multivariate Analysis, Elsevier, vol. 155(C), pages 133-150.
    17. Yunxiao Chen & Xiaoou Li & Jingchen Liu & Zhiliang Ying, 2017. "Regularized Latent Class Analysis with Application in Cognitive Diagnosis," Psychometrika, Springer;The Psychometric Society, vol. 82(3), pages 660-692, September.
    18. Li, Xinyi & Wang, Li & Nettleton, Dan, 2019. "Sparse model identification and learning for ultra-high-dimensional additive partially linear models," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 204-228.
    19. Jones, Benjamin A., 2018. "Forest-attacking Invasive Species and Infant Health: Evidence From the Invasive Emerald Ash Borer," Ecological Economics, Elsevier, vol. 154(C), pages 282-293.
    20. Zhaoliang Wang & Liugen Xue & Gaorong Li & Fei Lu, 2019. "Spline estimator for ultra-high dimensional partially linear varying coefficient models," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 71(3), pages 657-677, June.
