
Compactness score: a fast filter method for unsupervised feature selection

Authors

  • Peican Zhu (Northwestern Polytechnical University (NWPU))

  • Xin Hou (Northwestern Polytechnical University (NWPU))

  • Keke Tang (Cyberspace Institute of Advanced Technology, Guangzhou University)

  • Zhen Wang (School of Cybersecurity, Northwestern Polytechnical University (NWPU))

  • Feiping Nie (Northwestern Polytechnical University (NWPU))
Abstract

The rapid development of the big data era generates huge amounts of data every day across many fields. Because these data are large-scale and high-dimensional, effective decision-making in practical applications is often difficult, so efficient big data analysis methods are urgently needed. Within feature engineering, feature selection is an important research topic that aims to pick "excellent" features from the candidates. Feature selection not only reduces dimensionality but also improves a model's computational efficiency and performance. In many classification tasks, samples from the same class tend to lie close to one another; local compactness is therefore a valuable criterion for evaluating a feature. Based on this observation, we propose a fast unsupervised feature selection algorithm, named Compactness Score (CSUFS), to select the desired features. To demonstrate the superiority of the proposed algorithm, extensive experiments are performed on several public data sets: the feature subsets selected by several different algorithms are applied to clustering tasks, clustering performance is measured with two well-known evaluation metrics, and efficiency is reflected by the corresponding running time. The results show that the proposed algorithm is more accurate and efficient than existing methods.
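
This page does not reproduce the CSUFS scoring formula, but the idea described above, ranking features by how little they vary inside each sample's local neighbourhood, can be sketched as a generic compactness-style filter score. The snippet below is a minimal illustration of that idea under stated assumptions, not the authors' exact CSUFS formulation: the function names, the neighbourhood size `n_neighbors`, and the squared-difference score are choices made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def compactness_scores(X, n_neighbors=5):
    """Score features by local compactness (smaller score = more compact).

    For every sample we find its k nearest neighbours in the full feature
    space and measure, feature by feature, how much that feature varies
    within the neighbourhood. Features that stay nearly constant in local
    neighbourhoods receive small scores.
    """
    n_samples, n_features = X.shape
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)        # idx[:, 0] is the sample itself
    neighbors = idx[:, 1:]           # drop the self-neighbour

    scores = np.empty(n_features)
    for j in range(n_features):
        # squared deviation of feature j between each sample and its neighbours
        diffs = X[:, j][:, None] - X[neighbors, j]
        scores[j] = np.mean(diffs ** 2)
    return scores


def select_features(X, n_selected, n_neighbors=5):
    """Return the indices of the n_selected most compact features."""
    scores = compactness_scores(X, n_neighbors=n_neighbors)
    return np.argsort(scores)[:n_selected]
```

In the evaluation protocol described in the abstract, the selected feature subset would then be fed to a clustering algorithm (e.g., k-means) and assessed with standard clustering metrics, with running time used to compare efficiency.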

Suggested Citation

  • Peican Zhu & Xin Hou & Keke Tang & Zhen Wang & Feiping Nie, 2025. "Compactness score: a fast filter method for unsupervised feature selection," Annals of Operations Research, Springer, vol. 348(1), pages 299-315, May.
  • Handle: RePEc:spr:annopr:v:348:y:2025:i:1:d:10.1007_s10479-023-05271-z
    DOI: 10.1007/s10479-023-05271-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10479-023-05271-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Hoai An Le Thi & Manh Cuong Nguyen, 2017. "DCA based algorithms for feature selection in multi-class support vector machine," Annals of Operations Research, Springer, vol. 249(1), pages 273-300, February.
    2. Cathy Maugis & Gilles Celeux & Marie-Laure Martin-Magniette, 2009. "Variable Selection for Clustering with Gaussian Mixture Models," Biometrics, The International Biometric Society, vol. 65(3), pages 701-709, September.
    3. Mostafa Rezaei & Ivor Cribben & Michele Samorani, 2021. "A clustering-based feature selection method for automatically generated relational attributes," Annals of Operations Research, Springer, vol. 303(1), pages 233-263, August.
    4. Onur Şeref & Ya-Ju Fan & Elan Borenstein & Wanpracha A. Chaovalitwongse, 2018. "Information-theoretic feature selection with discrete k-median clustering," Annals of Operations Research, Springer, vol. 263(1), pages 93-118, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Luca Scrucca, 2014. "Graphical tools for model-based mixture discriminant analysis," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(2), pages 147-165, June.
    2. Maugis, C. & Celeux, G. & Martin-Magniette, M.-L., 2011. "Variable selection in model-based discriminant analysis," Journal of Multivariate Analysis, Elsevier, vol. 102(10), pages 1374-1387, November.
    3. Alessandro Casa & Andrea Cappozzo & Michael Fop, 2022. "Group-Wise Shrinkage Estimation in Penalized Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 648-674, November.
    4. Wang, Ketong & Porter, Michael D., 2018. "Optimal Bayesian clustering using non-negative matrix factorization," Computational Statistics & Data Analysis, Elsevier, vol. 128(C), pages 395-411.
    5. Scindhiya Laxmi & S. K. Gupta & Sumit Kumar, 2024. "Intuitionistic fuzzy least square twin support vector machines for pattern classification," Annals of Operations Research, Springer, vol. 339(3), pages 1329-1378, August.
    6. Fulvia Pennoni & Francesco Bartolucci & Silvia Pandolfi, 2024. "Erratum to: Variable Selection for Hidden Markov Models with Continuous Variables and Missing Data," Journal of Classification, Springer;The Classification Society, vol. 41(3), pages 590-590, November.
    7. Fabio Centofanti & Antonio Lepore & Biagio Palumbo, 2024. "Sparse and smooth functional data clustering," Statistical Papers, Springer, vol. 65(2), pages 795-825, April.
    8. Alaleh Razmjoo & Petros Xanthopoulos & Qipeng Phil Zheng, 2019. "Feature importance ranking for classification in mixed online environments," Annals of Operations Research, Springer, vol. 276(1), pages 315-330, May.
    9. Faizal Hafiz & Jan Broekaert & Davide Torre & Akshya Swain, 2024. "A multi-criteria approach to evolve sparse neural architectures for stock market forecasting," Annals of Operations Research, Springer, vol. 336(1), pages 1219-1263, May.
    10. Paul McLaughlin & Brian C. Franczak & Adam B. Kashlak, 2024. "Unsupervised Classification with a Family of Parsimonious Contaminated Shifted Asymmetric Laplace Mixtures," Journal of Classification, Springer;The Classification Society, vol. 41(1), pages 65-93, March.
    11. Kazim Topuz & Behrooz Davazdahemami & Dursun Delen, 2024. "A Bayesian belief network-based analytics methodology for early-stage risk detection of novel diseases," Annals of Operations Research, Springer, vol. 341(1), pages 673-697, October.
    12. Floriello, Davide & Vitelli, Valeria, 2017. "Sparse clustering of functional data," Journal of Multivariate Analysis, Elsevier, vol. 154(C), pages 1-18.
    13. Anzanello, Michel J. & Fogliatto, Flavio S., 2011. "Selecting the best clustering variables for grouping mass-customized products involving workers' learning," International Journal of Production Economics, Elsevier, vol. 130(2), pages 268-276, April.
    14. Hoai An Le Thi & Tao Pham Dinh, 2024. "Open issues and recent advances in DC programming and DCA," Journal of Global Optimization, Springer, vol. 88(3), pages 533-590, March.
    15. Dolnicar, Sara & Grün, Bettina & Leisch, Friedrich, 2016. "Increasing sample size compensates for data problems in segmentation studies," Journal of Business Research, Elsevier, vol. 69(2), pages 992-999.
    16. Sahin, Özge & Czado, Claudia, 2022. "Vine copula mixture models and clustering for non-Gaussian data," Econometrics and Statistics, Elsevier, vol. 22(C), pages 136-158.
    17. F. Benedetto & L. Mastroeni & P. Vellucci, 2021. "Modeling the flow of information between financial time-series by an entropy-based approach," Annals of Operations Research, Springer, vol. 299(1), pages 1235-1252, April.
    18. Hivert, Benjamin & Agniel, Denis & Thiébaut, Rodolphe & Hejblum, Boris P., 2024. "Post-clustering difference testing: Valid inference and practical considerations with applications to ecological and biological data," Computational Statistics & Data Analysis, Elsevier, vol. 193(C).
    19. Laura C. Dawkins & Daniel B. Williamson & Stewart W. Barr & Sally R. Lampkin, 2020. "‘What drives commuter behaviour?': a Bayesian clustering approach for understanding opposing behaviours in social surveys," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 183(1), pages 251-280, January.
    20. Roberto Rocci & Maurizio Vichi & Monia Ranalli, 2025. "Mixture models for simultaneous classification and reduction of three-way data," Computational Statistics, Springer, vol. 40(1), pages 469-507, January.

