
Assessing trimming methodologies for clustering linear regression data

Author

Listed:
  • Francesca Torti

    (European Commission)

  • Domenico Perrotta

    (European Commission)

  • Marco Riani

    (University of Parma)

  • Andrea Cerioli

    (University of Parma)

Abstract

We assess the performance of state-of-the-art robust clustering tools for regression structures under a variety of different data configurations. We focus on two methodologies that use trimming and restrictions on group scatters as their main ingredients. We also give particular care to the data generation process through the development of a flexible simulation tool for mixtures of regressions, where the user can control the degree of overlap between the groups. Level of trimming and restriction factors are input parameters for which appropriate tuning is required. Since we find that incorrect specification of the second-level trimming in the Trimmed CLUSTering REGression model (TCLUST-REG) can deteriorate the performance of the method, we propose an improvement where the second-level trimming is not fixed in advance but is data dependent. We then compare our adaptive version of TCLUST-REG with the Trimmed Cluster Weighted Restricted Model (TCWRM) which provides a powerful extension of the robust clusterwise regression methodology. Our overall conclusion is that the two methods perform comparably, but with notable differences due to the inherent degree of modeling implied by them.
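
The abstract names trimming and restrictions on group scatters as the two main ingredients of the methods compared. For illustration only, a minimal Python sketch of that general idea might look like the following. It is not the authors' TCLUST-REG or TCWRM implementation. It assumes a single (first-level) trimming step, Gaussian errors, and a crude clipping rule that bounds the ratio of residual variances by a factor c. The function name, its parameters, and the simulated data in the usage note are hypothetical.

```python
import numpy as np

def trimmed_clusterwise_regression(X, y, k=2, alpha=0.1, c=5.0, n_iter=50, seed=0):
    """Illustrative trimmed clusterwise linear regression (not TCLUST-REG itself).

    X: (n, p) regressors without intercept; y: (n,) response.
    alpha: fraction of observations to trim; c >= 1: upper bound on the ratio
    between the largest and smallest residual variance across groups.
    Returns labels (-1 marks trimmed points), coefficients, and variances.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    Z = np.column_stack([np.ones(n), X])      # design matrix with intercept
    h = int(np.ceil(n * (1.0 - alpha)))       # number of observations kept
    labels = rng.integers(0, k, size=n)       # random initial partition
    beta = np.zeros((k, Z.shape[1]))
    sigma2 = np.ones(k)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # Estimation step: least squares within each group of untrimmed points.
        for g in range(k):
            idx = labels == g
            if idx.sum() > Z.shape[1]:        # leave too-small groups unchanged
                beta[g] = np.linalg.lstsq(Z[idx], y[idx], rcond=None)[0]
                resid = y[idx] - Z[idx] @ beta[g]
                sigma2[g] = max(resid @ resid / idx.sum(), 1e-12)
                weights[g] = idx.sum() / h

        # Restriction step (simplified): raise small variances so that
        # max(sigma2) / min(sigma2) <= c.
        sigma2 = np.maximum(sigma2, sigma2.max() / c)

        # Classification step: weighted Gaussian log-density per group.
        resid_all = y[:, None] - Z @ beta.T   # shape (n, k)
        logdens = (np.log(weights)
                   - 0.5 * np.log(2.0 * np.pi * sigma2)
                   - 0.5 * resid_all ** 2 / sigma2)
        new_labels = logdens.argmax(axis=1)

        # Trimming step: discard the n - h worst-fitting observations.
        best = logdens.max(axis=1)
        new_labels[np.argsort(best)[: n - h]] = -1

        if np.array_equal(new_labels, labels):
            break                             # partition is stable
        labels = new_labels

    return labels, beta, sigma2
```

A possible use on simulated data with two regression lines and a few gross outliers is `labels, beta, sigma2 = trimmed_clusterwise_regression(x[:, None], y, k=2, alpha=0.25)`. A single random start can end in a poor local solution, so a more careful version would run several starts and keep the best fit.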

Suggested Citation

  • Francesca Torti & Domenico Perrotta & Marco Riani & Andrea Cerioli, 2019. "Assessing trimming methodologies for clustering linear regression data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 227-257, March.
  • Handle: RePEc:spr:advdac:v:13:y:2019:i:1:d:10.1007_s11634-018-0331-4
    DOI: 10.1007/s11634-018-0331-4

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-018-0331-4
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Marco Riani & Andrea Cerioli & Domenico Perrotta & Francesca Torti, 2015. "Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 9(4), pages 461-481, December.
    2. Riani, Marco & Perrotta, Domenico & Cerioli, Andrea, 2015. "The Forward Search for Very Large Datasets," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 67(c01).
    3. Alessio Farcomeni & Francesco Dotto, 2018. "The power of (extended) monitoring in robust clustering," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 651-660, December.
    4. N. Gershenfeld & B. Schoner & E. Metois, 1999. "Cluster-weighted modelling for time-series analysis," Nature, Nature, vol. 397(6717), pages 329-332, January.
    5. Marco Riani & Anthony C. Atkinson & Andrea Cerioli, 2009. "Finding an unknown number of multivariate outliers," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(2), pages 447-466, April.
    6. Fritz, Heinrich & García-Escudero, Luis A. & Mayo-Iscar, Agustín, 2013. "A fast algorithm for robust constrained clustering," Computational Statistics & Data Analysis, Elsevier, vol. 61(C), pages 124-136.
    7. Melnykov, Volodymyr & Chen, Wei-Chen & Maitra, Ranjan, 2012. "MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 51(i12).
    8. Domenico Perrotta & Francesca Torti, 2018. "Discussion of “The power of monitoring: how to make the most of a contaminated multivariate sample”," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 641-649, December.
    9. Wayne DeSarbo & William Cron, 1988. "A maximum likelihood methodology for clusterwise linear regression," Journal of Classification, Springer;The Classification Society, vol. 5(2), pages 249-282, September.
    10. Robert B. Davies, 1980. "The Distribution of a Linear Combination of χ2 Random Variables," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 29(3), pages 323-333, November.
    11. Hennig, Christian, 2003. "Clusters, outliers, and regression: fixed point clusters," Journal of Multivariate Analysis, Elsevier, vol. 86(1), pages 183-212, July.
    12. García-Escudero, L.A. & Gordaliza, A. & Mayo-Iscar, A. & San Martín, R., 2010. "Robust clusterwise linear regression through trimming," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 3057-3069, December.
    13. García-Escudero, Luis Angel & Gordaliza, Alfonso & Greselin, Francesca & Ingrassia, Salvatore & Mayo-Iscar, Agustín, 2016. "The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 99(C), pages 131-147.
    14. Cerioli, Andrea, 2010. "Multivariate Outlier Detection With High-Breakdown Estimators," Journal of the American Statistical Association, American Statistical Association, vol. 105(489), pages 147-156.
    15. Salvatore Ingrassia & Simona Minotti & Giorgio Vittadini, 2012. "Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions," Journal of Classification, Springer;The Classification Society, vol. 29(3), pages 363-401, October.
    16. Neykov, N. & Filzmoser, P. & Dimova, R. & Neytchev, P., 2007. "Robust fitting of mixtures using the trimmed likelihood estimator," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 299-308, September.
    17. Andrea Cerioli & Domenico Perrotta, 2014. "Robust clustering around regression lines with high density regions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(1), pages 5-26, March.

    Citations


    Cited by:

    1. Andrea Cappozzo & Luis Angel García Escudero & Francesca Greselin & Agustín Mayo-Iscar, 2021. "Parameter Choice, Stability and Validity for Robust Cluster Weighted Modeling," Stats, MDPI, vol. 4(3), pages 1-14, July.
    2. Luca Greco, 2022. "Robust fitting of mixtures of GLMs by weighted likelihood," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 106(1), pages 25-48, March.
    3. Luca Greco & Antonio Lucadamo & Claudio Agostinelli, 2021. "Weighted likelihood latent class linear regression," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(2), pages 711-746, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Francesca Torti & Marco Riani & Gianluca Morelli, 2021. "Semiautomatic robust regression clustering of international trade data," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(3), pages 863-894, September.
    2. Salvatore Ingrassia & Simona Minotti & Giorgio Vittadini, 2012. "Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions," Journal of Classification, Springer;The Classification Society, vol. 29(3), pages 363-401, October.
    3. Marco Riani & Andrea Cerioli & Domenico Perrotta & Francesca Torti, 2015. "Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 9(4), pages 461-481, December.
    4. Antonio Punzo & Paul. D. McNicholas, 2017. "Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model," Journal of Classification, Springer;The Classification Society, vol. 34(2), pages 249-293, July.
    5. Luca Greco & Antonio Lucadamo & Claudio Agostinelli, 2021. "Weighted likelihood latent class linear regression," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(2), pages 711-746, June.
    6. Yao, Weixin & Wei, Yan & Yu, Chun, 2014. "Robust mixture regression using the t-distribution," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 116-127.
    7. Wu, Qiang & Yao, Weixin, 2016. "Mixtures of quantile regressions," Computational Statistics & Data Analysis, Elsevier, vol. 93(C), pages 162-176.
    8. Marco Riani & Andrea Cerioli & Francesca Torti, 2014. "On consistency factors and efficiency of robust S-estimators," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 23(2), pages 356-387, June.
    9. Bai, Xiuqin & Yao, Weixin & Boyer, John E., 2012. "Robust fitting of mixture regression models," Computational Statistics & Data Analysis, Elsevier, vol. 56(7), pages 2347-2359.
    10. Andrea Cerasa, 2016. "Combining homogeneous groups of preclassified observations with application to international trade," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 70(3), pages 229-259, August.
    11. Salvatore D. Tomarchio & Paul D. McNicholas & Antonio Punzo, 2021. "Matrix Normal Cluster-Weighted Models," Journal of Classification, Springer;The Classification Society, vol. 38(3), pages 556-575, October.
    12. Andrea Cappozzo & Francesca Greselin & Thomas Brendan Murphy, 2020. "A robust approach to model-based classification based on trimming and constraints," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(2), pages 327-354, June.
    13. Angelo Mazza & Antonio Punzo, 2020. "Mixtures of multivariate contaminated normal regression models," Statistical Papers, Springer, vol. 61(2), pages 787-822, April.
    14. Silvia Salini & Andrea Cerioli & Fabrizio Laurini & Marco Riani, 2016. "Reliable Robust Regression Diagnostics," International Statistical Review, International Statistical Institute, vol. 84(1), pages 99-127, April.
    15. Andrea Cerioli & Marco Riani & Anthony C. Atkinson & Aldo Corbellini, 2018. "The power of monitoring: how to make the most of a contaminated multivariate sample," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 559-587, December.
    16. Luca Greco, 2022. "Robust fitting of mixtures of GLMs by weighted likelihood," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 106(1), pages 25-48, March.
    17. L. García-Escudero & A. Gordaliza & A. Mayo-Iscar, 2013. "Comments on: model-based clustering and classification with non-normal mixture distributions," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 22(4), pages 459-461, November.
    18. A. Pedro Duarte Silva & Peter Filzmoser & Paula Brito, 2018. "Outlier detection in interval data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(3), pages 785-822, September.
    19. Anthony C. Atkinson & Marco Riani & Andrea Cerioli, 2018. "Cluster detection and clustering with random start forward searches," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(5), pages 777-798, April.
    20. Anthony C. Atkinson & Aldo Corbellini & Marco Riani, 2017. "Robust Bayesian regression with the forward search: theory and data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(4), pages 869-886, December.
