A robust approach to model-based classification based on trimming and constraints

Authors

  • Andrea Cappozzo

    (University of Milano-Bicocca)

  • Francesca Greselin

    (University of Milano-Bicocca)

  • Thomas Brendan Murphy

    (University College Dublin)

Abstract

In a standard classification framework a set of trustworthy learning data are employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Therefore, unreliable labelled observations, namely outliers and data with incorrect labels, can strongly undermine the classifier performance, especially if the training size is small. The present work introduces a robust modification to the Model-Based Classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles noise presence in both response and exploratory variables, providing reliable classification even when dealing with contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, are provided to underline the benefits of the proposed method.
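The two ingredients named in the abstract, impartial trimming and a bound on the eigenvalue ratio of the group scatter matrices, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `eigenvalue_ratio` and `trim_mask`, the trimming level `alpha`, and the use of per-observation log-densities as the trimming score are all assumptions made for illustration.

```python
import numpy as np


def eigenvalue_ratio(scatter_matrices):
    """Ratio between the largest and the smallest eigenvalue,
    taken over all group scatter matrices together.
    (Hypothetical helper; a constrained fit would keep this ratio
    below some user-chosen bound.)"""
    eigvals = np.concatenate(
        [np.linalg.eigvalsh(S) for S in scatter_matrices]
    )
    return eigvals.max() / eigvals.min()


def trim_mask(log_densities, alpha):
    """Boolean mask that keeps the ceil(n * (1 - alpha)) observations
    with the highest log-density and flags the rest as trimmed.
    (Hypothetical helper; alpha is an assumed trimming fraction.)"""
    log_densities = np.asarray(log_densities, dtype=float)
    n = log_densities.shape[0]
    n_keep = int(np.ceil(n * (1.0 - alpha)))
    order = np.argsort(log_densities)[::-1]  # most plausible first
    keep = np.zeros(n, dtype=bool)
    keep[order[:n_keep]] = True
    return keep


# Two illustrative diagonal scatter matrices and four illustrative scores.
scatters = [np.diag([1.0, 4.0]), np.diag([2.0, 8.0])]
ratio = eigenvalue_ratio(scatters)
mask = trim_mask([-1.0, -10.0, -2.0, -3.0], alpha=0.25)
```

In this sketch the least plausible quarter of the observations is discarded before any parameter would be re-estimated, and the eigenvalue ratio is only computed, not enforced; a full procedure would alternate both steps inside an iterative fit.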

Suggested Citation

  • Andrea Cappozzo & Francesca Greselin & Thomas Brendan Murphy, 2020. "A robust approach to model-based classification based on trimming and constraints," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(2), pages 327-354, June.
  • Handle: RePEc:spr:advdac:v:14:y:2020:i:2:d:10.1007_s11634-019-00371-w
    DOI: 10.1007/s11634-019-00371-w

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-019-00371-w
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    Citations

    Cited by:

    1. Cappozzo, Andrea & Greselin, Francesca & Murphy, Thomas Brendan, 2021. "Robust variable selection for model-based learning in presence of adulteration," Computational Statistics & Data Analysis, Elsevier, vol. 158(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. García-Escudero, Luis Angel & Gordaliza, Alfonso & Greselin, Francesca & Ingrassia, Salvatore & Mayo-Iscar, Agustín, 2016. "The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 99(C), pages 131-147.
    2. Luis Angel García-Escudero & Alfonso Gordaliza & Francesca Greselin & Salvatore Ingrassia & Agustín Mayo-Iscar, 2018. "Eigenvalues and constraints in mixture modeling: geometric and computational issues," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 203-233, June.
    3. Marco Riani & Andrea Cerioli & Domenico Perrotta & Francesca Torti, 2015. "Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 9(4), pages 461-481, December.
    4. Pietro Coretto & Christian Hennig, 2016. "Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1648-1659, October.
    5. Angelo Mazza & Antonio Punzo, 2020. "Mixtures of multivariate contaminated normal regression models," Statistical Papers, Springer, vol. 61(2), pages 787-822, April.
    6. Francesca Torti & Domenico Perrotta & Marco Riani & Andrea Cerioli, 2019. "Assessing trimming methodologies for clustering linear regression data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 227-257, March.
    7. Cong, Lin & Yao, Weixin, 2021. "A Likelihood Ratio Test of a Homoscedastic Multivariate Normal Mixture Against a Heteroscedastic Multivariate Normal Mixture," Econometrics and Statistics, Elsevier, vol. 18(C), pages 79-88.
    8. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    9. L. García-Escudero & A. Gordaliza & A. Mayo-Iscar, 2013. "Comments on: model-based clustering and classification with non-normal mixture distributions," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 22(4), pages 459-461, November.
    10. Shaikh Mateen & McNicholas Paul D & Desmond Anthony F, 2010. "A Pseudo-EM Algorithm for Clustering Incomplete Longitudinal Data," The International Journal of Biostatistics, De Gruyter, vol. 6(1), pages 1-17, March.
    11. C. Ruwet & L. García-Escudero & A. Gordaliza & A. Mayo-Iscar, 2013. "On the breakdown behavior of the TCLUST clustering procedure," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 22(3), pages 466-487, September.
    12. Tang, Yang & Browne, Ryan P. & McNicholas, Paul D., 2015. "Model based clustering of high-dimensional binary data," Computational Statistics & Data Analysis, Elsevier, vol. 87(C), pages 84-101.
    13. Diego Rivera-García & Luis A. García-Escudero & Agustín Mayo-Iscar & Joaquín Ortega, 2019. "Robust clustering for functional data based on trimming and constraints," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 201-225, March.
    14. L. García-Escudero & A. Gordaliza & A. Mayo-Iscar, 2014. "A constrained robust proposal for mixture modeling avoiding spurious solutions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(1), pages 27-43, March.
    15. Morris, Katherine & McNicholas, Paul D., 2016. "Clustering, classification, discriminant analysis, and dimension reduction via generalized hyperbolic mixtures," Computational Statistics & Data Analysis, Elsevier, vol. 97(C), pages 133-150.
    16. Sanjeena Subedi & Antonio Punzo & Salvatore Ingrassia & Paul McNicholas, 2013. "Clustering and classification via cluster-weighted factor analyzers," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 7(1), pages 5-40, March.
    17. Andrews, Jeffrey L. & McNicholas, Paul D. & Subedi, Sanjeena, 2011. "Model-based classification via mixtures of multivariate t-distributions," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 520-529, January.
    18. Cristina Tortora & Brian C. Franczak & Ryan P. Browne & Paul D. McNicholas, 2019. "A Mixture of Coalesced Generalized Hyperbolic Distributions," Journal of Classification, Springer;The Classification Society, vol. 36(1), pages 26-57, April.
    19. Cristina Tortora & Paul D. McNicholas & Ryan P. Browne, 2016. "A mixture of generalized hyperbolic factor analyzers," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 10(4), pages 423-440, December.
    20. Antonio Punzo & Paul. D. McNicholas, 2017. "Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model," Journal of Classification, Springer;The Classification Society, vol. 34(2), pages 249-293, July.
