
From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering

Author

Listed:
  • Sylvia Frühwirth-Schnatter

    (Vienna University of Economics and Business (WU))

  • Gertraud Malsiner-Walli

    (Vienna University of Economics and Business (WU))

Abstract

In model-based clustering, mixture models are used to group data points into clusters. A useful concept introduced for Gaussian mixtures by Malsiner-Walli et al. (Stat Comput 26:303–324, 2016) is that of sparse finite mixtures, where the prior distribution on the weight distribution of a mixture with K components is chosen in such a way that, a priori, the number of clusters in the data is random and is allowed to be smaller than K with high probability. The number of clusters is then inferred a posteriori from the data. The present paper makes the following contributions in the context of sparse finite mixture modelling. First, it is illustrated that the concept of sparse finite mixtures is very generic and easily extended to cluster various types of non-Gaussian data, in particular discrete data and continuous multivariate data arising from non-Gaussian clusters. Second, sparse finite mixtures are compared to Dirichlet process mixtures with respect to their ability to identify the number of clusters. For both model classes, a random hyper prior is considered for the parameters determining the weight distribution. By suitable matching of these priors, it is shown that the choice of this hyper prior is far more influential on the cluster solution than whether a sparse finite mixture or a Dirichlet process mixture is taken into consideration.
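
The prior behaviour described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the paper. It assumes that the weights of the finite mixture follow a symmetric Dirichlet prior with parameter e0 (the abstract does not name the prior) and that the Dirichlet process has concentration parameter alpha. The function names, the sample sizes, and the pairing alpha = K * e0 are illustrative assumptions only. Both functions draw the number of non-empty clusters among N observations by allocating observations one at a time.

```python
import numpy as np


def n_clusters_sparse_finite(n_obs, n_components, e0, rng):
    """Draw the prior number of non-empty components in a K-component mixture.

    Assumes a symmetric Dirichlet(e0) prior on the weights (an assumption of
    this sketch). With the weights integrated out, observation i + 1 falls
    into a still-empty component with probability
    (K - k_plus) * e0 / (i + K * e0).
    """
    k_plus = 0
    for i in range(n_obs):
        p_new = (n_components - k_plus) * e0 / (i + n_components * e0)
        if rng.random() < p_new:
            k_plus += 1
    return k_plus


def n_clusters_dirichlet_process(n_obs, alpha, rng):
    """Draw the prior number of clusters under a Dirichlet process.

    Uses the Chinese restaurant process: observation i + 1 opens a new
    cluster with probability alpha / (i + alpha).
    """
    p_new = alpha / (alpha + np.arange(n_obs))
    return int(np.sum(rng.random(n_obs) < p_new))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_obs, n_components, e0 = 100, 10, 0.01  # hypothetical settings
    alpha = n_components * e0                # illustrative pairing only

    finite = [n_clusters_sparse_finite(n_obs, n_components, e0, rng)
              for _ in range(2000)]
    dp = [n_clusters_dirichlet_process(n_obs, alpha, rng)
          for _ in range(2000)]

    print("sparse finite mixture, mean number of clusters:", np.mean(finite))
    print("Dirichlet process mixture, mean number of clusters:", np.mean(dp))
```

With a small e0, most of the K components stay empty a priori, which is the sense in which the finite mixture is sparse. Raising e0 (or alpha) shifts prior mass towards more clusters, so the values chosen here shape the result as much as the choice of model class does.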

Suggested Citation

  • Sylvia Frühwirth-Schnatter & Gertraud Malsiner-Walli, 2019. "From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 33-64, March.
  • Handle: RePEc:spr:advdac:v:13:y:2019:i:1:d:10.1007_s11634-018-0329-y
    DOI: 10.1007/s11634-018-0329-y

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-018-0329-y
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Omiros Papaspiliopoulos & Gareth O. Roberts, 2008. "Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models," Biometrika, Biometrika Trust, vol. 95(1), pages 169-186.
    2. Peter J. Green & Sylvia Richardson, 2001. "Modelling Heterogeneity With and Without the Dirichlet Process," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 28(2), pages 355-375, June.
    3. Sylvia Richardson & Peter J. Green, 1997. "On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion)," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 59(4), pages 731-792.
    4. Nicholas G. Polson & James G. Scott & Jesse Windle, 2013. "Bayesian Inference for Logistic Models Using Pólya-Gamma Latent Variables," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(504), pages 1339-1349, December.
    5. Sharon Lee & Geoffrey McLachlan, 2013. "Rejoinder to the discussion of “Model-based clustering and classification with non-normal mixture distributions”," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 22(4), pages 473-479, November.
    6. Sharon Lee & Geoffrey McLachlan, 2013. "Model-based clustering and classification with non-normal mixture distributions," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 22(4), pages 427-454, November.
    7. Adelchi Azzalini & Antonella Capitanio, 2003. "Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t‐distribution," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 65(2), pages 367-389, May.
    8. Linzer, Drew A. & Lewis, Jeffrey B., 2011. "poLCA: An R Package for Polytomous Variable Latent Class Analysis," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 42(i10).
    9. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    10. Zoé van Havre & Nicole White & Judith Rousseau & Kerrie Mengersen, 2015. "Overfitting Bayesian Mixture Models with an Unknown Number of Components," PLOS ONE, Public Library of Science, vol. 10(7), pages 1-27, July.
    11. Frühwirth-Schnatter, Sylvia & Wagner, Helga, 2008. "Marginal likelihoods for non-Gaussian models using auxiliary mixture sampling," Computational Statistics & Data Analysis, Elsevier, vol. 52(10), pages 4608-4624, June.
    12. Fernando A. Quintana & Pilar L. Iglesias, 2003. "Bayesian clustering and product partition models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 65(2), pages 557-574, May.
    13. repec:dau:papers:123456789/4648 is not listed on IDEAS
    14. Jeffrey W. Miller & Matthew T. Harrison, 2018. "Mixture Models With a Prior on the Number of Components," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 113(521), pages 340-356, January.

    Citations


    Cited by:

    1. Todd E. Clark & Florian Huber & Gary Koop & Massimiliano Marcellino, 2022. "Forecasting US Inflation Using Bayesian Nonparametric Models," Working Papers 22-05, Federal Reserve Bank of Cleveland.
    2. Kaito Shimamura & Shuichi Kawano, 2021. "Bayesian sparse convex clustering via global-local shrinkage priors," Computational Statistics, Springer, vol. 36(4), pages 2671-2699, December.
    3. Florian Huber & Gary Koop, 2023. "Fast and Order-invariant Inference in Bayesian VARs with Non-Parametric Shocks," Working Papers 2309, University of Strathclyde Business School, Department of Economics.
    4. Minjung Kyung & Ju-Hyun Park & Ji Yeh Choi, 2022. "Bayesian Mixture Model of Extended Redundancy Analysis," Psychometrika, Springer;The Psychometric Society, vol. 87(3), pages 946-966, September.
    5. Joao, Igor Custodio & Lucas, André & Schaumburg, Julia & Schwaab, Bernd, 2023. "Dynamic nonparametric clustering of multivariate panel data," Working Paper Series 2780, European Central Bank.
    6. Yong Song & Tomasz Woźniak, 2020. "Markov Switching," Papers 2002.03598, arXiv.org.
    7. Jan Vávra & Arnošt Komárek, 2023. "Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(2), pages 369-406, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Im, Yunju & Tan, Aixin, 2021. "Bayesian subgroup analysis in regression using mixture models," Computational Statistics & Data Analysis, Elsevier, vol. 162(C).
    2. Zhu, Xuwen & Melnykov, Volodymyr, 2018. "Manly transformation in finite mixture modeling," Computational Statistics & Data Analysis, Elsevier, vol. 121(C), pages 190-208.
    3. Wang, Ketong & Porter, Michael D., 2018. "Optimal Bayesian clustering using non-negative matrix factorization," Computational Statistics & Data Analysis, Elsevier, vol. 128(C), pages 395-411.
    4. Murray, Paula M. & Browne, Ryan P. & McNicholas, Paul D., 2017. "Hidden truncation hyperbolic distributions, finite mixtures thereof, and their application for clustering," Journal of Multivariate Analysis, Elsevier, vol. 161(C), pages 141-156.
    5. Derek S. Young & Xi Chen & Dilrukshi C. Hewage & Ricardo Nilo-Poyanco, 2019. "Finite mixture-of-gamma distributions: estimation, inference, and model-based clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(4), pages 1053-1082, December.
    6. Volodymyr Melnykov & Xuwen Zhu, 2019. "An extension of the K-means algorithm to clustering skewed data," Computational Statistics, Springer, vol. 34(1), pages 373-394, March.
    7. Melnykov, Volodymyr & Zhu, Xuwen, 2018. "On model-based clustering of skewed matrix data," Journal of Multivariate Analysis, Elsevier, vol. 167(C), pages 181-194.
    8. Li, Mingyang & Meng, Hongdao & Zhang, Qingpeng, 2017. "A nonparametric Bayesian modeling approach for heterogeneous lifetime data with covariates," Reliability Engineering and System Safety, Elsevier, vol. 167(C), pages 95-104.
    9. Wan-Lun Wang & Tsung-I Lin, 2015. "Robust model-based clustering via mixtures of skew-t distributions with missing information," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 9(4), pages 423-445, December.
    10. Naderi, Mehrdad & Mirfarah, Elham & Wang, Wan-Lun & Lin, Tsung-I, 2023. "Robust mixture regression modeling based on the normal mean-variance mixture distributions," Computational Statistics & Data Analysis, Elsevier, vol. 180(C).
    11. Wraith, Darren & Forbes, Florence, 2015. "Location and scale mixtures of Gaussians with flexible tail behaviour: Properties, inference and application to multivariate clustering," Computational Statistics & Data Analysis, Elsevier, vol. 90(C), pages 61-73.
    12. Azzalini, Adelchi & Browne, Ryan P. & Genton, Marc G. & McNicholas, Paul D., 2016. "On nomenclature for, and the relative merits of, two formulations of skew distributions," Statistics & Probability Letters, Elsevier, vol. 110(C), pages 201-206.
    13. Lee, Sharon X. & McLachlan, Geoffrey J., 2022. "An overview of skew distributions in model-based clustering," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    14. Billio, Monica & Casarin, Roberto & Rossini, Luca, 2019. "Bayesian nonparametric sparse VAR models," Journal of Econometrics, Elsevier, vol. 212(1), pages 97-115.
    15. Yuan Fang & Dimitris Karlis & Sanjeena Subedi, 2022. "Infinite Mixtures of Multivariate Normal-Inverse Gaussian Distributions for Clustering of Skewed Data," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 510-552, November.
    16. Ludkin, Matthew, 2020. "Inference for a generalised stochastic block model with unknown number of blocks and non-conjugate edge models," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    17. Villani, Mattias & Kohn, Robert & Nott, David J., 2012. "Generalized smooth finite mixtures," Journal of Econometrics, Elsevier, vol. 171(2), pages 121-133.
    18. Stefano Tonellato, 2019. "Bayesian nonparametric clustering as a community detection problem," Working Papers 2019: 20, Department of Economics, University of Venice "Ca' Foscari".
    19. Nicola Loperfido, 2019. "Finite mixtures, projection pursuit and tensor rank: a triangulation," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 145-173, March.
    20. Evelina Gabasova & John Reid & Lorenz Wernisch, 2017. "Clusternomics: Integrative context-dependent clustering for heterogeneous datasets," PLOS Computational Biology, Public Library of Science, vol. 13(10), pages 1-29, October.
