
BOSO: A novel feature selection algorithm for linear regression with high-dimensional data

Author

Listed:
  • Luis V Valcárcel
  • Edurne San José-Enériz
  • Xabier Cendoya
  • Ángel Rubio
  • Xabier Agirre
  • Felipe Prósper
  • Francisco J Planes

Abstract

With the frenetic growth of high-dimensional datasets in different biomedical domains, there is an urgent need to develop predictive methods able to deal with this complexity. Feature selection is a relevant strategy in machine learning to address this challenge. We introduce a novel feature selection algorithm for linear regression called BOSO (Bilevel Optimization Selector Operator). We benchmarked BOSO against key algorithms in the literature and found superior feature selection accuracy in high-dimensional datasets. A proof-of-concept of BOSO for predicting drug sensitivity in cancer is presented. A detailed analysis is carried out for methotrexate, a well-studied drug targeting cancer metabolism.

Author summary: We present BOSO (Bilevel Optimization Selector Operator), a novel method for feature selection in linear regression models. In machine learning, feature selection consists of identifying the subset of input variables (features) that are genuinely associated with the response variable to be predicted. Adequate feature selection is particularly relevant for high-dimensional datasets, commonly encountered in biomedical research questions that rely on -omics data, e.g. predictive models of drug sensitivity, resistance or toxicity, construction of gene regulatory networks, biomarker selection or association studies. The need for feature selection is emphasized in many of these complex problems, since the number of features is greater than the number of samples, which makes it harder to obtain accurate and general predictive models. In this context, we show that the models derived by BOSO achieve a better combination of accuracy and simplicity than competing approaches in the literature. The relevance of BOSO is illustrated in the prediction of drug sensitivity of cancer cell lines, using RNA-seq data and drug screenings from the GDSC (Genomics of Drug Sensitivity in Cancer) database. BOSO obtains linear regression models with a similar level of accuracy but involving a substantially lower number of features, which simplifies the interpretation and validation of predictive models.
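
To make the setting in the summary concrete (more features than samples, with only a few features truly related to the response), the following is a minimal illustrative sketch. It is not the BOSO algorithm, whose formulation is not given in this record. It is a plain greedy forward selection that fits ordinary least squares on a training split and keeps a feature only if it lowers the error on a held-out validation split. The function names (`validation_mse`, `forward_select`), the synthetic data, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def validation_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit OLS (with intercept) on the chosen columns of the training split
    and return the mean squared error on the validation split."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ coef - y_va) ** 2))


def forward_select(X_tr, y_tr, X_va, y_va, max_features, tol=1e-10):
    """Greedy forward selection (illustrative only, not BOSO): repeatedly add
    the feature that most reduces validation error; stop when none helps."""
    n_features = X_tr.shape[1]
    selected = []
    # Start from the intercept-only (null) model.
    best_mse = validation_mse(X_tr, y_tr, X_va, y_va, selected)
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        scores = {
            j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j])
            for j in candidates
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_mse - tol:
            break  # no candidate improves the validation error
        selected.append(j_best)
        best_mse = scores[j_best]
    return selected, best_mse


# Synthetic high-dimensional example (assumed values): 200 features,
# 60 training samples, and only features 0, 1 and 2 affect the response.
rng = np.random.default_rng(0)
n_train, n_val, n_features = 60, 60, 200
X = rng.normal(size=(n_train + n_val, n_features))
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + 3.0 * X[:, 2]
y = y + rng.normal(scale=0.1, size=len(y))
X_tr, X_va = X[:n_train], X[n_train:]
y_tr, y_va = y[:n_train], y[n_train:]

null_mse = validation_mse(X_tr, y_tr, X_va, y_va, [])
selected, best_mse = forward_select(X_tr, y_tr, X_va, y_va, max_features=5)
print("selected features:", selected)
print("validation MSE: null model", null_mse, "selected model", best_mse)
```

Greedy search is used here only because it is short to write. It evaluates one feature at a time and can miss subsets that an exact search over subsets would find, so it should not be read as a stand-in for the method described in the abstract.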

Suggested Citation

  • Luis V Valcárcel & Edurne San José-Enériz & Xabier Cendoya & Ángel Rubio & Xabier Agirre & Felipe Prósper & Francisco J Planes, 2022. "BOSO: A novel feature selection algorithm for linear regression with high-dimensional data," PLOS Computational Biology, Public Library of Science, vol. 18(5), pages 1-29, May.
  • Handle: RePEc:plo:pcbi00:1010180
    DOI: 10.1371/journal.pcbi.1010180

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010180
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010180&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Jiahua Chen & Zehua Chen, 2008. "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, Biometrika Trust, vol. 95(3), pages 759-771.
    2. Pietro Belotti & Pierre Bonami & Matteo Fischetti & Andrea Lodi & Michele Monaci & Amaya Nogales-Gómez & Domenico Salvagnin, 2016. "On handling indicator constraints in mixed integer programming," Computational Optimization and Applications, Springer, vol. 65(3), pages 545-566, December.
    3. Florian Rohart & Benoît Gautier & Amrit Singh & Kim-Anh Lê Cao, 2017. "mixOmics: An R package for ‘omics feature selection and multiple data integration," PLOS Computational Biology, Public Library of Science, vol. 13(11), pages 1-19, November.
    4. Jeffrey W. Tyner & Cristina E. Tognon & Daniel Bottomly & Beth Wilmot & Stephen E. Kurtz & Samantha L. Savage & Nicola Long & Anna Reister Schultz & Elie Traer & Melissa Abel & Anupriya Agarwal & Auro, 2018. "Functional genomic landscape of acute myeloid leukaemia," Nature, Nature, vol. 562(7728), pages 526-531, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Frommlet, Florian & Ruhaltinger, Felix & Twaróg, Piotr & Bogdan, Małgorzata, 2012. "Modified versions of Bayesian Information Criterion for genome-wide association studies," Computational Statistics & Data Analysis, Elsevier, vol. 56(5), pages 1038-1051.
    2. Zak-Szatkowska, Malgorzata & Bogdan, Malgorzata, 2011. "Modified versions of the Bayesian Information Criterion for sparse Generalized Linear Models," Computational Statistics & Data Analysis, Elsevier, vol. 55(11), pages 2908-2924, November.
    3. Matthew S. Bramble & Victor Fourcassié & Neerja Vashist & Florence Roux-Dalvai & Yun Zhou & Guy Bumoko & Michel Lupamba Kasendue & D’Andre Spencer & Hilaire Musasa Hanshi-Hatuhu & Vincent Kambale-Mast, 2024. "Glutathione peroxidase 3 is a potential biomarker for konzo," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    4. Xiaotong Shen & Wei Pan & Yunzhang Zhu & Hui Zhou, 2013. "On constrained and regularized high-dimensional regression," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 65(5), pages 807-832, October.
    5. J. McClatchy & R. Strogantsev & E. Wolfe & H. Y. Lin & M. Mohammadhosseini & B. A. Davis & C. Eden & D. Goldman & W. H. Fleming & P. Conley & G. Wu & L. Cimmino & H. Mohammed & A. Agarwal, 2023. "Clonal hematopoiesis related TET2 loss-of-function impedes IL1β-mediated epigenetic reprogramming in hematopoietic stem and progenitor cells," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    6. Shan Luo & Zehua Chen, 2014. "Sequential Lasso Cum EBIC for Feature Selection With Ultra-High Dimensional Feature Space," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 109(507), pages 1229-1240, September.
    7. Lu Tang & Ling Zhou & Peter X. K. Song, 2019. "Fusion learning algorithm to combine partially heterogeneous Cox models," Computational Statistics, Springer, vol. 34(1), pages 395-414, March.
    8. Molly C. Klanderman & Kathryn B. Newhart & Tzahi Y. Cath & Amanda S. Hering, 2020. "Fault isolation for a complex decentralized waste water treatment facility," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 69(4), pages 931-951, August.
    9. Li, Xinyi & Wang, Li & Nettleton, Dan, 2019. "Sparse model identification and learning for ultra-high-dimensional additive partially linear models," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 204-228.
    10. Jones, Benjamin A., 2018. "Forest-attacking Invasive Species and Infant Health: Evidence From the Invasive Emerald Ash Borer," Ecological Economics, Elsevier, vol. 154(C), pages 282-293.
    11. Tae-Hwy Lee & Ekaterina Seregina, 2020. "Learning from Forecast Errors: A New Approach to Forecast Combination," Working Papers 202024, University of California at Riverside, Department of Economics.
    12. Chenchen Ma & Jing Ouyang & Gongjun Xu, 2023. "Learning Latent and Hierarchical Structures in Cognitive Diagnosis Models," Psychometrika, Springer;The Psychometric Society, vol. 88(1), pages 175-207, March.
    13. Gonzalo García-Donato & María Eugenia Castellanos & Alicia Quirós, 2021. "Bayesian Variable Selection with Applications in Health Sciences," Mathematics, MDPI, vol. 9(3), pages 1-16, January.
    14. Yinjun Chen & Hao Ming & Hu Yang, 2024. "Efficient variable selection for high-dimensional multiplicative models: a novel LPRE-based approach," Statistical Papers, Springer, vol. 65(6), pages 3713-3737, August.
    15. Rebecca Anderson & Lance D. Miller & Scott Isom & Jeff W. Chou & Kristin M. Pladna & Nathaniel J. Schramm & Leslie R. Ellis & Dianna S. Howard & Rupali R. Bhave & Megan Manuel & Sarah Dralle & Susan L, 2022. "Phase II trial of cytarabine and mitoxantrone with devimistat in acute myeloid leukemia," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    16. Sakyajit Bhattacharya & Paul McNicholas, 2014. "A LASSO-penalized BIC for mixture model selection," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(1), pages 45-61, March.
    17. Luo, Shan & Chen, Zehua, 2014. "Edge detection in sparse Gaussian graphical models," Computational Statistics & Data Analysis, Elsevier, vol. 70(C), pages 138-152.
    18. Neubeck, Markus & Karbach, Julia & Könen, Tanja, 2022. "Network models of cognitive abilities in younger and older adults," Intelligence, Elsevier, vol. 90(C).
    19. Sacha Epskamp & Mijke Rhemtulla & Denny Borsboom, 2017. "Generalized Network Psychometrics: Combining Network and Latent Variable Models," Psychometrika, Springer;The Psychometric Society, vol. 82(4), pages 904-927, December.
    20. Xiandeng Jiang & Le Chang & Yanlin Shi, 2023. "Housing price diffusions in mainland China: evidence from a spatially penalized graphical VAR model," Empirical Economics, Springer, vol. 64(2), pages 765-795, February.
