
Analysis of feature selection stability on high dimension and small sample data

Author

Listed:
  • Dernoncourt, David
  • Hanczar, Blaise
  • Zucker, Jean-Daniel

Abstract

Feature selection is an important step when building a classifier on high-dimensional data. When the number of observations is small, feature selection tends to be unstable: two feature subsets obtained from different datasets that address the same classification problem often do not overlap significantly. Although this is a crucial problem, few studies have addressed selection stability. The behavior of feature selection is analyzed under various conditions, with a focus on, but not limited to, t-score based feature selection approaches and small sample data. The analysis proceeds in three steps: the first is theoretical, using a simple mathematical model; the second is empirical and based on artificial data; and the third is based on real data. These three analyses lead to the same results and give a better understanding of the feature selection problem in high-dimensional data.
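
The instability described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or code. It assumes two Gaussian classes with unit variance, a Welch t-score as the ranking criterion, top-k selection, and the Jaccard index as the overlap measure. All function names and parameter values (200 features, 10 informative, mean shift of 1.0) are hypothetical choices made for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Welch t-score between two samples given as lists of floats."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def make_dataset(rng, n_per_class, n_features, n_informative, shift):
    """Two Gaussian classes; only the first n_informative features differ in mean."""
    def sample(mean_shift):
        return [[rng.gauss(mean_shift if j < n_informative else 0.0, 1.0)
                 for j in range(n_features)]
                for _ in range(n_per_class)]
    return sample(0.0), sample(shift)


def top_k_by_t_score(class0, class1, k):
    """Indices of the k features with the largest absolute t-score."""
    n_features = len(class0[0])
    scores = []
    for j in range(n_features):
        col0 = [row[j] for row in class0]
        col1 = [row[j] for row in class1]
        scores.append(abs(welch_t(col0, col1)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(s1, s2):
    """Overlap between two selected subsets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)


def mean_pairwise_stability(n_per_class, n_datasets=10, n_features=200,
                            n_informative=10, shift=1.0, k=10, seed=0):
    """Average Jaccard overlap of top-k subsets selected on independent datasets."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_datasets):
        c0, c1 = make_dataset(rng, n_per_class, n_features, n_informative, shift)
        subsets.append(top_k_by_t_score(c0, c1, k))
    pairs = [(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return statistics.mean(jaccard(a, b) for a, b in pairs)


if __name__ == "__main__":
    # Compare selection stability for a small and a larger sample size.
    for n in (5, 100):
        print(n, round(mean_pairwise_stability(n_per_class=n), 3))
```

Under these assumptions, the subsets selected from datasets with few observations per class should overlap much less than those selected from larger datasets, which is the qualitative effect the abstract describes. Other stability measures or selection criteria could be substituted without changing the structure of the experiment.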

Suggested Citation

  • Dernoncourt, David & Hanczar, Blaise & Zucker, Jean-Daniel, 2014. "Analysis of feature selection stability on high dimension and small sample data," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 681-693.
  • Handle: RePEc:eee:csdana:v:71:y:2014:i:c:p:681-693
    DOI: 10.1016/j.csda.2013.07.012

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947313002570
    Download Restriction: Full text for ScienceDirect subscribers only.


    References listed on IDEAS

    1. Anne-Claire Haury & Pierre Gestraud & Jean-Philippe Vert, 2011. "The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures," PLOS ONE, Public Library of Science, vol. 6(12), pages 1-12, December.
    2. Pavel Pudil & Petr Somol, 2008. "Identifying the most Informative Variables for Decision-Making Problems - a Survey of Recent Approaches and Accompanying Problems," Acta Oeconomica Pragensia, Prague University of Economics and Business, vol. 2008(4), pages 37-55.
    3. Yao, Weixin & Wang, Qin, 2013. "Robust variable selection through MAVE," Computational Statistics & Data Analysis, Elsevier, vol. 63(C), pages 42-49.

    Citations



    Cited by:

    1. Yu, Lean & Zhang, Xiaoming, 2021. "Can small sample dataset be used for efficient internet loan credit risk assessment? Evidence from online peer to peer lending," Finance Research Letters, Elsevier, vol. 38(C).
    2. Kristof Lommers & Ouns El Harzli & Jack Kim, 2021. "Confronting Machine Learning With Financial Research," Papers 2103.00366, arXiv.org, revised Mar 2021.
    3. Abpeykar, Shadi & Ghatee, Mehdi & Zare, Hadi, 2019. "Ensemble decision forest of RBF networks via hybrid feature clustering approach for high-dimensional data classification," Computational Statistics & Data Analysis, Elsevier, vol. 131(C), pages 12-36.
    4. Pierre Michel & Nicolas Ngo & Jean-François Pons & Stéphane Delliaux & Roch Giorgi, 2021. "A filter approach for feature selection in classification: application to automatic atrial fibrillation detection in electrocardiogram recordings," Post-Print hal-03222439, HAL.
    5. David Juárez-Varón & Victoria Tur-Viñes & Alejandro Rabasa-Dolado & Kristina Polotskaya, 2020. "An Adaptive Machine Learning Methodology Applied to Neuromarketing Analysis: Prediction of Consumer Behaviour Regarding the Key Elements of the Packaging Design of an Educational Toy," Social Sciences, MDPI, vol. 9(9), pages 1-23, September.
    6. He, Yan-Lin & Wang, Ping-Jiang & Zhang, Ming-Qing & Zhu, Qun-Xiong & Xu, Yuan, 2018. "A novel and effective nonlinear interpolation virtual sample generation method for enhancing energy prediction and analysis on small data problem: A case study of Ethylene industry," Energy, Elsevier, vol. 147(C), pages 418-427.
    7. Xianlong Zhang & Fei Zhang & Hsiang-te Kung & Ping Shi & Ayinuer Yushanjiang & Shidan Zhu, 2018. "Estimation of the Fe and Cu Contents of the Surface Water in the Ebinur Lake Basin Based on LIBS and a Machine Learning Algorithm," IJERPH, MDPI, vol. 15(11), pages 1-20, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lv, Jing & Yang, Hu & Guo, Chaohui, 2015. "An efficient and robust variable selection method for longitudinal generalized linear models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 74-88.
    2. Rekabdarkolaee, Hossein Moradi & Boone, Edward & Wang, Qin, 2017. "Robust estimation and variable selection in sufficient dimension reduction," Computational Statistics & Data Analysis, Elsevier, vol. 108(C), pages 146-157.
    3. Nataliya Sokolovska & Olivier Teytaud & Salwa Rizkalla & MicroObese consortium & Karine Clément & Jean-Daniel Zucker, 2015. "Sparse Zero-Sum Games as Stable Functional Feature Selection," PLOS ONE, Public Library of Science, vol. 10(9), pages 1-16, September.
    4. Zhang, Jing & Wang, Qin & Mays, D'Arcy, 2021. "Robust MAVE through nonconvex penalized regression," Computational Statistics & Data Analysis, Elsevier, vol. 160(C).
    5. Giuseppe Jurman & Samantha Riccadonna & Roberto Visintainer & Cesare Furlanello, 2012. "Algebraic Comparison of Partial Lists in Bioinformatics," PLOS ONE, Public Library of Science, vol. 7(5), pages 1-20, May.
