
Dendrograms, minimum spanning trees and feature selection

Author

Listed:
  • Labbé, Martine
  • Landete, Mercedes
  • Leal, Marina

Abstract

Feature selection is a fundamental process for avoiding overfitting and reducing the size of databases without significant loss of information, and it applies to hierarchical clustering. Dendrograms are graphical representations of hierarchical clustering algorithms which, for single linkage clustering, can be interpreted as minimum spanning trees in the complete network defined by the database. In this work, we introduce the problem of jointly determining a set of features and a dendrogram according to the single linkage method. We propose different formulations that include the minimum spanning tree problem constraints as well as the feature selection constraints. Different bounds on the objective function are studied. For one of the models, several families of valid inequalities are proposed and the problem of separating them is studied. For another formulation, a decomposition algorithm is designed. In an extensive computational study, the effectiveness of the different models is discussed, and the model with valid inequalities is compared with the decomposition algorithm. The computational results also illustrate that integrating feature selection into the optimization model makes it possible to retain a satisfactory percentage of information.
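
The correspondence between single linkage dendrograms and minimum spanning trees mentioned in the abstract can be illustrated with a small sketch. It is not the authors' optimization model: the feature subset is fixed by hand rather than optimized, the distance is assumed to be Euclidean, and all function names (`project`, `mst_edge_weights`, `single_linkage_heights`) are hypothetical. It only checks that, for a given feature subset, the sorted edge weights of a minimum spanning tree of the complete graph match the merge heights of a naive single linkage clustering.

```python
from itertools import combinations
from math import dist


def project(points, features):
    """Keep only the selected feature indices of every point (hand-picked subset)."""
    return [tuple(p[f] for f in features) for p in points]


def mst_edge_weights(points):
    """Prim's algorithm on the complete graph; returns the sorted MST edge weights."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection of each node to the tree
    best[0] = 0.0
    weights = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the starting node is not attached by an edge
            weights.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return sorted(weights)


def single_linkage_heights(points):
    """Naive agglomerative single linkage; returns merge heights in merge order."""
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        # Cluster distance = smallest pairwise distance between their members.
        d, a, b = min(
            (min(dist(points[i], points[j])
                 for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        heights.append(d)
        clusters[a] += clusters[b]
        del clusters[b]  # safe because a < b
    return heights


# Toy data with three features; only features 0 and 1 are kept (an assumption).
data = [(0, 0, 5), (1, 0, -3), (4, 0, 9), (4, 3, 0)]
selected = project(data, (0, 1))
print(mst_edge_weights(selected))
print(single_linkage_heights(selected))
```

Changing the tuple of selected features changes the distances and hence both the tree and the dendrogram, which is the coupling that a joint selection problem has to account for.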

Suggested Citation

  • Labbé, Martine & Landete, Mercedes & Leal, Marina, 2023. "Dendrograms, minimum spanning trees and feature selection," European Journal of Operational Research, Elsevier, vol. 308(2), pages 555-567.
  • Handle: RePEc:eee:ejores:v:308:y:2023:i:2:p:555-567
    DOI: 10.1016/j.ejor.2022.11.031

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0377221722008906
    Download Restriction: Full text for ScienceDirect subscribers only



