Printed from https://ideas.repec.org/p/ehl/lserod/72291.html

Cluster detection and clustering with random start forward searches

Author

Listed:
  • Atkinson, Anthony C.
  • Riani, Marco
  • Cerioli, Andrea

Abstract

The forward search is a method of robust data analysis in which outlier-free subsets of the data of increasing size are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tentative clusters. Application of the forward search to the proposed individual clusters leads to the establishment of cluster membership through the identification of non-cluster members as outlying. The method requires no prior information on the number of clusters and does not seek to classify all observations. These properties are illustrated by the analysis of 200 six-dimensional observations on Swiss banknotes. The importance of linked plots and brushing in elucidating data structures is illustrated. We also provide an automatic method for determining cluster centres and compare the behaviour of our method with model-based clustering. In a simulated example with 8 clusters, our method provides more stable and accurate solutions than model-based clustering. We consider the computational requirements of both procedures.
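The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes bivariate data (so the covariance matrix can be inverted by hand), an initial subset of three observations, and that the quantity monitored at each step is the minimum Mahalanobis distance among observations outside the current subset; the abstract does not spell out these details. The function names `mahalanobis_sq`, `forward_search` and `random_start_searches`, and the toy grid data, are hypothetical.

```python
import math
import random


def mahalanobis_sq(points, subset):
    """Squared Mahalanobis distances of all 2-D points from the fit to `subset`."""
    m = len(subset)
    mx = sum(points[i][0] for i in subset) / m
    my = sum(points[i][1] for i in subset) / m
    sxx = sum((points[i][0] - mx) ** 2 for i in subset) / (m - 1)
    syy = sum((points[i][1] - my) ** 2 for i in subset) / (m - 1)
    sxy = sum((points[i][0] - mx) * (points[i][1] - my) for i in subset) / (m - 1)
    det = sxx * syy - sxy ** 2
    if det <= 1e-12:
        raise ValueError("singular covariance: subset is (nearly) collinear")
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the explicit inverse of the 2x2 covariance matrix.
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(max(q, 0.0))
    return dists


def forward_search(points, start):
    """Grow the fitting subset one unit at a time, ordering by closeness to the fit.

    Returns a list of (subset size, minimum distance outside subset, subset).
    """
    n = len(points)
    subset = sorted(start)
    steps = []
    for m in range(len(subset), n):
        d2 = mahalanobis_sq(points, subset)
        inside = set(subset)
        d_min = math.sqrt(min(d2[i] for i in range(n) if i not in inside))
        steps.append((m, d_min, subset))
        # The next subset holds the m + 1 observations closest to the current fit.
        subset = sorted(sorted(range(n), key=d2.__getitem__)[: m + 1])
    return steps


def random_start_searches(points, n_starts, seed=0):
    """Run forward searches from randomly chosen three-point subsets."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_starts):
        start = rng.sample(range(len(points)), 3)
        try:
            runs.append(forward_search(points, start))
        except ValueError:
            continue  # degenerate (collinear) subset: skip this start
    return runs


# Toy data (an assumption for illustration): two well-separated 3x3 grids.
cluster_a = [(float(x), float(y)) for x in range(3) for y in range(3)]
cluster_b = [(x + 50.0, y + 50.0) for x, y in cluster_a]
points = cluster_a + cluster_b

steps = forward_search(points, start=[0, 1, 3])
runs = random_start_searches(points, n_starts=5, seed=1)
```

In this toy run the search started inside the first grid absorbs all nine of its points before any point of the second grid, and the monitored distance jumps sharply once only second-grid points remain outside the subset. A jump of this kind is the sort of diagnostic signal that can suggest a tentative cluster; with real, continuous data, collinear starting subsets would be rare.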

Suggested Citation

  • Atkinson, Anthony C. & Riani, Marco & Cerioli, Andrea, 2017. "Cluster detection and clustering with random start forward searches," LSE Research Online Documents on Economics 72291, London School of Economics and Political Science, LSE Library.
  • Handle: RePEc:ehl:lserod:72291

    Download full text from publisher

    File URL: http://eprints.lse.ac.uk/72291/
    File Function: Open access version.
    Download Restriction: no

    Citations


    Cited by:

    1. Francesca Torti & Aldo Corbellini & Anthony C. Atkinson, 2021. "fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search," Stats, MDPI, vol. 4(2), pages 1-21, April.
    2. Torti, Francesca & Corbellini, Aldo & Atkinson, Anthony C., 2021. "fsdaSAS: a package for robust regression for very large datasets including the batch forward search," LSE Research Online Documents on Economics 109895, London School of Economics and Political Science, LSE Library.
    3. Šárka Brodinová & Peter Filzmoser & Thomas Ortner & Christian Breiteneder & Maia Rohm, 2019. "Robust and sparse k-means clustering for high-dimensional data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(4), pages 905-932, December.
    4. Alessio Farcomeni & Antonio Punzo, 2020. "Robust model-based clustering with mild and gross outliers," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(4), pages 989-1007, December.
    5. Marco Riani & Anthony C. Atkinson & Andrea Cerioli & Aldo Corbellini, 2019. "Comments on: Data science, big data and statistics," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(2), pages 349-352, June.
    6. Andrea Cerioli & Marco Riani & Anthony C. Atkinson & Aldo Corbellini, 2018. "The power of monitoring: how to make the most of a contaminated multivariate sample," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 559-587, December.
    7. Reiko Aoki & Juan P. M. Bustamante & Gilberto A. Paula, 2022. "Local influence diagnostics with forward search in regression analysis," Statistical Papers, Springer, vol. 63(5), pages 1477-1497, October.
    8. Anthony C. Atkinson & Aldo Corbellini & Marco Riani, 2017. "Robust Bayesian regression with the forward search: theory and data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(4), pages 869-886, December.
    9. Brenton R. Clarke & Andrew Grose, 2023. "A further study comparing forward search multivariate outlier methods including ATLA with an application to clustering," Statistical Papers, Springer, vol. 64(2), pages 395-420, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marco Riani & Andrea Cerioli & Francesca Torti, 2014. "On consistency factors and efficiency of robust S-estimators," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 23(2), pages 356-387, June.
    2. Anthony C. Atkinson & Aldo Corbellini & Marco Riani, 2017. "Robust Bayesian regression with the forward search: theory and data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(4), pages 869-886, December.
    3. Francesca Torti & Domenico Perrotta & Marco Riani & Andrea Cerioli, 2019. "Assessing trimming methodologies for clustering linear regression data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 227-257, March.
    4. Andrea Cerioli & Marco Riani & Anthony C. Atkinson & Aldo Corbellini, 2018. "The power of monitoring: how to make the most of a contaminated multivariate sample," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 559-587, December.
    5. Pourahmadi, Mohsen & Daniels, Michael J. & Park, Trevor, 2007. "Simultaneous modelling of the Cholesky decomposition of several covariance matrices," Journal of Multivariate Analysis, Elsevier, vol. 98(3), pages 568-587, March.
    6. Stefano Tonellato & Andrea Pastore, 2013. "On the comparison of model-based clustering solutions," Working Papers 2013:05, Department of Economics, University of Venice "Ca' Foscari".
    7. Scrucca, Luca, 2011. "Model-based SIR for dimension reduction," Computational Statistics & Data Analysis, Elsevier, vol. 55(11), pages 3010-3026, November.
    8. Metaxas, Theodore & Kallioras, Dimitris, 2013. "Small and medium-sized firms' competitiveness and territorial characteristics/assets: The cases of Bari, Varna and Thessaloniki," MPRA Paper 52446, University Library of Munich, Germany.
    9. Di Zio, Marco & Guarnera, Ugo & Luzi, Orietta, 2007. "Imputation through finite Gaussian mixture models," Computational Statistics & Data Analysis, Elsevier, vol. 51(11), pages 5305-5316, July.
    10. Sylvia Frühwirth‐Schnatter & Christoph Pamminger & Andrea Weber & Rudolf Winter‐Ebmer, 2012. "Labor market entry and earnings dynamics: Bayesian inference using mixtures‐of‐experts Markov chain clustering," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 27(7), pages 1116-1137, November.
    11. Montanari, Angela & Viroli, Cinzia, 2011. "Maximum likelihood estimation of mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2712-2723, September.
    12. Giovanna Devetag & Sibilla Guida & Luca Polonio, 2016. "An eye-tracking study of feature-based choice in one-shot games," Experimental Economics, Springer;Economic Science Association, vol. 19(1), pages 177-201, March.
    13. Minjung Kyung & Ju-Hyun Park & Ji Yeh Choi, 2022. "Bayesian Mixture Model of Extended Redundancy Analysis," Psychometrika, Springer;The Psychometric Society, vol. 87(3), pages 946-966, September.
    14. Wu, Han-Ming, 2011. "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1969-1979, May.
    15. Anthony C. Atkinson & Marco Riani & Aldo Corbellini, 2020. "The analysis of transformations for profit‐and‐loss data," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 69(2), pages 251-275, April.
    16. Salvatore Ingrassia & Simona Minotti & Giorgio Vittadini, 2012. "Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions," Journal of Classification, Springer;The Classification Society, vol. 29(3), pages 363-401, October.
    17. So Pyay Thar & Thiagarajah Ramilan & Robert J. Farquharson & Deli Chen, 2021. "Identifying Potential for Decision Support Tools through Farm Systems Typology Analysis Coupled with Participatory Research: A Case for Smallholder Farmers in Myanmar," Agriculture, MDPI, vol. 11(6), pages 1-20, June.
    18. Bianco, Ana & Boente, Graciela & Pires, Ana M. & Rodrigues, Isabel M., 2008. "Robust discrimination under a hierarchy on the scatter matrices," Journal of Multivariate Analysis, Elsevier, vol. 99(6), pages 1332-1357, July.
    19. Domenico Perrotta & Marco Riani & Francesca Torti, 2009. "New robust dynamic plots for regression mixture detection," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 3(3), pages 263-279, December.
    20. Theodore Metaxas, 2012. "Urban Advantages and Disadvantages in Southeastern Europe: An Appreciation of Industrial Firms by Using Exploratory Factor Analysis," European Research Studies Journal, European Research Studies Journal, vol. 0(2), pages 81-104.

    More about this item

    Keywords

    brushing; data structure; forward search; graphical methods; linked plots; Mahalanobis distance; MM estimation; outliers; S estimation; Tukey’s biweight

    JEL classification:

    • C1 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General
