
Compactness score: a fast filter method for unsupervised feature selection

Author

Listed:
  • Peican Zhu

    (Northwestern Polytechnical University (NWPU))

  • Xin Hou

    (Northwestern Polytechnical University (NWPU))

  • Keke Tang

    (Cyberspace Institute of Advanced Technology, Guangzhou University)

  • Zhen Wang

    (School of Cybersecurity, Northwestern Polytechnical University (NWPU))

  • Feiping Nie

    (Northwestern Polytechnical University (NWPU))

Abstract

The big data era generates huge amounts of data every day across many fields. Because these data are large-scale and high-dimensional, it is often difficult to make good decisions from them in practical applications, so efficient big data analysis methods are urgently needed. In feature engineering, feature selection is an important research topic: it aims to select “excellent” features from the candidate ones. Feature selection not only reduces dimensionality but also improves the computational efficiency and performance of the model. In many classification tasks, researchers have found that data points from the same class tend to lie close to each other; local compactness is therefore of great importance when evaluating a feature. Based on this observation, we propose a fast unsupervised feature selection algorithm, named Compactness Score (CSUFS), to select the desired features. To demonstrate the superiority of the proposed algorithm, we perform extensive experiments on several public data sets. In these experiments, feature subsets selected by several different algorithms are applied to a clustering task. Clustering performance is measured by two well-known evaluation metrics, and efficiency is measured by the corresponding running time. The results show that the proposed algorithm is more accurate and more efficient than existing ones.
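
The abstract does not state the CSUFS formula, so the following is only a minimal sketch of the general idea of a filter that ranks features by local compactness. It is not the authors' algorithm. The scoring rule (mean distance to the k nearest neighbours along each min-max-normalized feature, lower meaning more compact), the function names `local_compactness_scores` and `select_features`, and the parameter `k` are all assumptions made for illustration.

```python
import numpy as np

def local_compactness_scores(X, k=5):
    """Illustrative score (an assumption, not the paper's definition):
    per-feature mean distance from each sample to its k nearest
    neighbours, measured along that single feature."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n_samples")
    scores = np.empty(d)
    for j in range(d):
        col = X[:, j]
        span = col.max() - col.min()
        if span == 0:
            # Constant feature carries no structure; rank it last.
            scores[j] = np.inf
            continue
        col = (col - col.min()) / span  # min-max normalize to [0, 1]
        # Pairwise 1-D distances; O(n^2) memory, acceptable for a sketch.
        dist = np.abs(col[:, None] - col[None, :])
        dist.sort(axis=1)
        # Column 0 is the zero self-distance; average the next k columns.
        scores[j] = dist[:, 1:k + 1].mean()
    return scores

def select_features(X, m, k=5):
    """Return indices of the m features with the lowest (most compact) score."""
    return np.argsort(local_compactness_scores(X, k))[:m]
```

Because each feature is scored independently of any learning model, this follows the filter pattern the title refers to. The quadratic pairwise-distance step is chosen here for readability and makes no claim about the running time reported in the paper.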

Suggested Citation

  • Peican Zhu & Xin Hou & Keke Tang & Zhen Wang & Feiping Nie, 2025. "Compactness score: a fast filter method for unsupervised feature selection," Annals of Operations Research, Springer, vol. 348(1), pages 299-315, May.
  • Handle: RePEc:spr:annopr:v:348:y:2025:i:1:d:10.1007_s10479-023-05271-z
    DOI: 10.1007/s10479-023-05271-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10479-023-05271-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s10479-023-05271-z?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Maugis, C. & Celeux, G. & Martin-Magniette, M.-L., 2011. "Variable selection in model-based discriminant analysis," Journal of Multivariate Analysis, Elsevier, vol. 102(10), pages 1374-1387, November.
    2. Scindhiya Laxmi & S. K. Gupta & Sumit Kumar, 2024. "Intuitionistic fuzzy least square twin support vector machines for pattern classification," Annals of Operations Research, Springer, vol. 339(3), pages 1329-1378, August.
    3. Kazim Topuz & Behrooz Davazdahemami & Dursun Delen, 2024. "A Bayesian belief network-based analytics methodology for early-stage risk detection of novel diseases," Annals of Operations Research, Springer, vol. 341(1), pages 673-697, October.
    4. Anzanello, Michel J. & Fogliatto, Flavio S., 2011. "Selecting the best clustering variables for grouping mass-customized products involving workers' learning," International Journal of Production Economics, Elsevier, vol. 130(2), pages 268-276, April.
    5. Hoai An Le Thi & Tao Pham Dinh, 2024. "Open issues and recent advances in DC programming and DCA," Journal of Global Optimization, Springer, vol. 88(3), pages 533-590, March.
    6. Dolnicar, Sara & Grün, Bettina & Leisch, Friedrich, 2016. "Increasing sample size compensates for data problems in segmentation studies," Journal of Business Research, Elsevier, vol. 69(2), pages 992-999.
    7. Sahin, Özge & Czado, Claudia, 2022. "Vine copula mixture models and clustering for non-Gaussian data," Econometrics and Statistics, Elsevier, vol. 22(C), pages 136-158.
    8. F. Benedetto & L. Mastroeni & P. Vellucci, 2021. "Modeling the flow of information between financial time-series by an entropy-based approach," Annals of Operations Research, Springer, vol. 299(1), pages 1235-1252, April.
    9. M. Tanveer & T. Rajani & R. Rastogi & Y. H. Shao & M. A. Ganaie, 2024. "Comprehensive review on twin support vector machines," Annals of Operations Research, Springer, vol. 339(3), pages 1223-1268, August.
    10. Paul D. McNicholas, 2016. "Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 33(3), pages 331-373, October.
    11. Giuliano Galimberti & Lorenzo Nuzzi & Gabriele Soffritti, 2021. "Covariance matrix estimation of the maximum likelihood estimator in multivariate clusterwise linear regression," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(1), pages 235-268, March.
    12. Abbaszadehpeivasti, Hadi, 2024. "Performance analysis of optimization methods for machine learning," Other publications TiSEM 3050a62d-1a1f-494e-99ef-7, Tilburg University, School of Economics and Management.
    13. Crook Oliver M. & Gatto Laurent & Kirk Paul D. W., 2019. "Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 18(6), pages 1-20, December.
    14. Melnykov, Volodymyr, 2016. "Model-based biclustering of clickstream data," Computational Statistics & Data Analysis, Elsevier, vol. 93(C), pages 31-45.
    15. Monia Ranalli & Roberto Rocci, 2017. "A Model-Based Approach to Simultaneous Clustering and Dimensional Reduction of Ordinal Data," Psychometrika, Springer;The Psychometric Society, vol. 82(4), pages 1007-1034, December.
    16. Cappozzo, Andrea & Greselin, Francesca & Murphy, Thomas Brendan, 2021. "Robust variable selection for model-based learning in presence of adulteration," Computational Statistics & Data Analysis, Elsevier, vol. 158(C).
    17. Hadi Abbaszadehpeivasti & Etienne Klerk & Moslem Zamani, 2024. "On the Rate of Convergence of the Difference-of-Convex Algorithm (DCA)," Journal of Optimization Theory and Applications, Springer, vol. 202(1), pages 475-496, July.
    18. Bouveyron, Charles & Brunet-Saumard, Camille, 2014. "Model-based clustering of high-dimensional data: A review," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 52-78.
    19. Wilson Toussile & Elisabeth Gassiat, 2009. "Variable selection in model-based clustering using multilocus genotype data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 3(2), pages 109-134, September.
    20. Giuliano Galimberti & Gabriele Soffritti, 2020. "Seemingly unrelated clusterwise linear regression," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(2), pages 235-260, June.
