
Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets

Author

Listed:
  • Lagani, Vincenzo
  • Athineou, Giorgos
  • Farcomeni, Alessio
  • Tsagris, Michail
  • Tsamardinos, Ioannis

Abstract

The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple predictive feature subsets whose performances are statistically equivalent. In this respect, SES subsumes and extends previous feature selection algorithms, such as the max-min parents and children algorithm. SES is implemented in a function of the same name included in the R package MXM, which stands for mens ex machina, Latin for 'mind from the machine'. The MXM implementation of SES handles several data-analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real-world data.

Suggested Citation

  • Lagani, Vincenzo & Athineou, Giorgos & Farcomeni, Alessio & Tsagris, Michail & Tsamardinos, Ioannis, 2016. "Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets," MPRA Paper 72772, University Library of Munich, Germany.
  • Handle: RePEc:pra:mprapa:72772

    Download full text from publisher

    File URL: https://mpra.ub.uni-muenchen.de/72772/1/MPRA_paper_72772.pdf
    File Function: original version
    Download Restriction: no

    Citations


    Cited by:

    1. Daniela Marella & Paola Vicard, 2022. "Bayesian network structural learning from complex survey data: a resampling based approach," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(4), pages 981-1013, October.
    2. Sašo Karakatič, 2020. "EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms," Mathematics, MDPI, vol. 8(6), pages 1-29, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Scutari Marco & Balding David & Mackay Ian, 2013. "Improving the efficiency of genomic selection," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 12(4), pages 517-527, August.
    2. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    3. Rui Wang & Naihua Xiu & Kim-Chuan Toh, 2021. "Subspace quadratic regularization method for group sparse multinomial logistic regression," Computational Optimization and Applications, Springer, vol. 79(3), pages 531-559, July.
    4. Prabal Das & D. A. Sachindra & Kironmala Chanda, 2022. "Machine Learning-Based Rainfall Forecasting with Multiple Non-Linear Feature Selection Algorithms," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 36(15), pages 6043-6071, December.
    5. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    6. Chen, Le-Yu & Lee, Sokbae, 2018. "Best subset binary prediction," Journal of Econometrics, Elsevier, vol. 206(1), pages 39-56.
    7. Totterman, Stephen, 2021. "Vehicle-based recreation and compliance for three beaches in northern New South Wales," OSF Preprints ja8h6, Center for Open Science.
    8. Sung Jae Jun & Sokbae Lee, 2020. "Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions," Papers 2004.08318, arXiv.org, revised Oct 2023.
    9. Bernard W T Coetzee & Kevin J Gaston & Steven L Chown, 2014. "Local Scale Comparisons of Biodiversity as a Test for Global Protected Area Ecological Performance: A Meta-Analysis," PLOS ONE, Public Library of Science, vol. 9(8), pages 1-11, August.
    10. Xiangwei Li & Thomas Delerue & Ben Schöttker & Bernd Holleczek & Eva Grill & Annette Peters & Melanie Waldenberger & Barbara Thorand & Hermann Brenner, 2022. "Derivation and validation of an epigenetic frailty risk score in population-based cohorts of older adults," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    11. Christopher J Greenwood & George J Youssef & Primrose Letcher & Jacqui A Macdonald & Lauryn J Hagg & Ann Sanson & Jenn Mcintosh & Delyse M Hutchinson & John W Toumbourou & Matthew Fuller-Tyszkiewicz &, 2020. "A comparison of penalised regression methods for informing the selection of predictive markers," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-14, November.
    12. Heng Chen & Daniel F. Heitjan, 2022. "Analysis of local sensitivity to nonignorability with missing outcomes and predictors," Biometrics, The International Biometric Society, vol. 78(4), pages 1342-1352, December.
    13. S Ariane Christie & Amanda S Conroy & Rachael A Callcut & Alan E Hubbard & Mitchell J Cohen, 2019. "Dynamic multi-outcome prediction after injury: Applying adaptive machine learning for precision medicine in trauma," PLOS ONE, Public Library of Science, vol. 14(4), pages 1-13, April.
    14. Christian Kleiber & Achim Zeileis, 2016. "Visualizing Count Data Regressions Using Rootograms," The American Statistician, Taylor & Francis Journals, vol. 70(3), pages 296-303, July.
    15. Zhu Wang, 2022. "MM for penalized estimation," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 31(1), pages 54-75, March.
    16. Ida Kubiszewski & Kenneth Mulder & Diane Jarvis & Robert Costanza, 2022. "Toward better measurement of sustainable development and wellbeing: A small number of SDG indicators reliably predict life satisfaction," Sustainable Development, John Wiley & Sons, Ltd., vol. 30(1), pages 139-148, February.
    17. Roland R. Ramsahai, 2020. "Connecting actuarial judgment to probabilistic learning techniques with graph theory," Papers 2007.15475, arXiv.org.
    18. Gustavo A. Alonso-Silverio & Víctor Francisco-García & Iris P. Guzmán-Guzmán & Elías Ventura-Molina & Antonio Alarcón-Paredes, 2021. "Toward Non-Invasive Estimation of Blood Glucose Concentration: A Comparative Performance," Mathematics, MDPI, vol. 9(20), pages 1-13, October.
    19. Christopher Kath & Florian Ziel, 2018. "The value of forecasts: Quantifying the economic gains of accurate quarter-hourly electricity price forecasts," Papers 1811.08604, arXiv.org.
    20. Tang, Kayu & Parsons, David J. & Jude, Simon, 2019. "Comparison of automatic and guided learning for Bayesian networks to analyse pipe failures in the water distribution system," Reliability Engineering and System Safety, Elsevier, vol. 186(C), pages 24-36.

    More about this item

    Keywords

    feature selection; constraint-based algorithms; multiple predictive signatures;

    JEL classification:

    • C88 - Mathematical and Quantitative Methods > Data Collection and Data Estimation Methodology; Computer Programs > Other Computer Software
