
Overfitting Bayesian mixtures of factor analyzers with an unknown number of components

Author

Listed:
  • Papastamoulis, Panagiotis

Abstract

Recent advances in overfitting Bayesian mixture models provide a solid and straightforward approach for inferring the underlying number of clusters and model parameters in heterogeneous datasets. The applicability of such a framework in clustering correlated high-dimensional data is demonstrated. For this purpose an overfitting mixture of factor analyzers is introduced, assuming that the number of factors is fixed. A Markov chain Monte Carlo (MCMC) sampler combined with a prior parallel tempering scheme is used to estimate the posterior distribution of model parameters. The optimal number of factors is estimated using information criteria. Identifiability issues related to the label switching problem are dealt with by post-processing the simulated MCMC sample with relabeling algorithms. The method is benchmarked against state-of-the-art software for maximum likelihood estimation of mixtures of factor analyzers using an extensive simulation study. Finally, the applicability of the method is illustrated on publicly available data.
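
The abstract does not spell out how the number of clusters is read off an overfitted mixture. A common approach in the overfitting literature is to count, at each MCMC iteration, the components that have at least one observation allocated to them, and to report the most frequent count. The following is a minimal sketch of that idea, not the paper's implementation; the function names and the toy allocation draws are hypothetical and used for illustration only.

```python
from collections import Counter


def count_alive_components(allocations):
    """Return how many distinct component labels hold at least one observation."""
    return len(set(allocations))


def estimate_num_clusters(allocation_draws):
    """Return the most frequent number of non-empty components across draws,
    together with the full frequency table."""
    counts = Counter(count_alive_components(z) for z in allocation_draws)
    # most_common(1) returns a list with one (value, frequency) pair
    return counts.most_common(1)[0][0], counts


# Hypothetical allocation vectors (one per MCMC iteration) for 6 observations
# in a mixture deliberately fitted with more components than needed.
draws = [
    [0, 0, 3, 3, 7, 7],
    [0, 0, 3, 3, 7, 7],
    [0, 0, 3, 3, 3, 7],
    [0, 0, 3, 3, 3, 3],
    [0, 2, 3, 3, 7, 7],
]

k_hat, freq = estimate_num_clusters(draws)
print(k_hat, dict(freq))
```

In practice the draws would come from the sampler's output after burn-in, and relabeling would still be needed before summarising component-specific parameters.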

Suggested Citation

  • Papastamoulis, Panagiotis, 2018. "Overfitting Bayesian mixtures of factor analyzers with an unknown number of components," Computational Statistics & Data Analysis, Elsevier, vol. 124(C), pages 220-234.
  • Handle: RePEc:eee:csdana:v:124:y:2018:i:c:p:220-234
    DOI: 10.1016/j.csda.2018.03.007

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947318300550
    Download Restriction: Full text for ScienceDirect subscribers only.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Briana J. K. Stephenson & Amy H. Herring & Andrew F. Olshan, 2022. "Derivation of maternal dietary patterns accounting for regional heterogeneity," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), pages 1957-1977, November.
    2. Roy Costilla & Ivy Liu & Richard Arnold & Daniel Fernández, 2019. "Bayesian model-based clustering for longitudinal ordinal data," Computational Statistics, Springer, vol. 34(3), pages 1015-1038, September.
    3. Wan-Lun Wang & Tsung-I Lin, 2020. "Automated learning of mixtures of factor analysis models with missing information," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(4), pages 1098-1124, December.
    4. Shotwell Matthew S & Slate Elizabeth H, 2010. "Bayesian Modeling of Footrace Finishing Times," Journal of Quantitative Analysis in Sports, De Gruyter, vol. 6(3), pages 1-21, July.
    5. Kai Yang & Qingqing Zhang & Xinyang Yu & Xiaogang Dong, 2023. "Bayesian inference for a mixture double autoregressive model," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 77(2), pages 188-207, May.
    6. Kelvyn Jones & David Manley & Ron Johnston & Dewi Owen, 2018. "Modelling residential segregation as unevenness and clustering: A multilevel modelling approach incorporating spatial dependence and tackling the MAUP," Environment and Planning B, , vol. 45(6), pages 1122-1141, November.
    7. Łukasz Lenart & Justyna Mokrzycka-Gajda, 2025. "Imitated student’s t distribution: a Bayesian approach," Statistical Papers, Springer, vol. 66(4), pages 1-44, June.
    8. Anastasios Bellas & Charles Bouveyron & Marie Cottrell & Jérôme Lacaille, 2013. "Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 7(3), pages 281-300, September.
    9. Park, Byung-Jung & Zhang, Yunlong & Lord, Dominique, 2010. "Bayesian mixture modeling approach to account for heterogeneity in speed data," Transportation Research Part B: Methodological, Elsevier, vol. 44(5), pages 662-673, June.
    10. Komárek, Arnost, 2009. "A new R package for Bayesian estimation of multivariate normal mixtures allowing for selection of the number of components and interval-censored data," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 3932-3947, October.
    11. You, Na & Dai, Hongsheng & Wang, Xueqin & Yu, Qingyun, 2024. "Sequential estimation for mixture of regression models for heterogeneous population," Computational Statistics & Data Analysis, Elsevier, vol. 194(C).
    12. Voleti, Sudhir & Srinivasan, V. & Ghosh, Pulak, 2017. "An approach to improve the predictive power of choice-based conjoint analysis," International Journal of Research in Marketing, Elsevier, vol. 34(2), pages 325-335.
    13. Jianbin Tan & Ye Shen & Yang Ge & Leonardo Martinez & Hui Huang, 2023. "Age‐related model for estimating the symptomatic and asymptomatic transmissibility of COVID‐19 patients," Biometrics, The International Biometric Society, vol. 79(3), pages 2525-2536, September.
    14. Montanari, Angela & Viroli, Cinzia, 2011. "Maximum likelihood estimation of mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2712-2723, September.
    15. Marco, Nicholas & Şentürk, Damla & Jeste, Shafali & DiStefano, Charlotte C. & Dickinson, Abigail & Telesca, Donatello, 2024. "Flexible regularized estimation in high-dimensional mixed membership models," Computational Statistics & Data Analysis, Elsevier, vol. 194(C).
    16. Arnab Kumar Maity & Sanjib Basu & Santu Ghosh, 2021. "Bayesian criterion‐based variable selection," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 70(4), pages 835-857, August.
    17. Royce Anders & William Batchelder, 2015. "Cultural Consensus Theory for the Ordinal Data Case," Psychometrika, Springer;The Psychometric Society, vol. 80(1), pages 151-181, March.
    18. Lin, L. & Fong, D.K.H., 2019. "Bayesian multidimensional scaling procedure with variable selection," Computational Statistics & Data Analysis, Elsevier, vol. 129(C), pages 1-13.
    19. Shuhui Guo & Lihua Xiong & Jie Chen & Shenglian Guo & Jun Xia & Ling Zeng & Chong-Yu Xu, 2023. "Nonstationary Regional Flood Frequency Analysis Based on the Bayesian Method," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 37(2), pages 659-681, January.
    20. Antonio Punzo & Paul. D. McNicholas, 2017. "Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model," Journal of Classification, Springer;The Classification Society, vol. 34(2), pages 249-293, July.
