
A Bayesian Formulation of Exploratory Data Analysis and Goodness‐of‐fit Testing

Author

Listed:
  • Andrew Gelman

Abstract

Exploratory data analysis (EDA) and Bayesian inference (or, more generally, complex statistical modeling), which are generally considered unrelated statistical paradigms, can be particularly effective in combination. In this paper, we present a Bayesian framework for EDA based on posterior predictive checks. We explain how posterior predictive simulations can be used to create reference distributions for EDA graphs, and how this approach resolves some theoretical problems in Bayesian data analysis. We show how the generalization of Bayesian inference to include replicated data y^rep and replicated parameters θ^rep follows a long tradition of generalizations in Bayesian theory. On the theoretical level, we present a predictive Bayesian formulation of goodness-of-fit testing, distinguishing between p-values (posterior probabilities that specified antisymmetric discrepancy measures will exceed 0) and u-values (data summaries with uniform sampling distributions). We explain that p-values, unlike u-values, are Bayesian probability statements in that they condition on observed data. Having reviewed the general theoretical framework, we discuss the implications for statistical graphics and exploratory data analysis, with the goal of unifying exploratory data analysis with more formal statistical methods based on probability models. We interpret various graphical displays as posterior predictive checks and discuss how Bayesian inference can be used to determine reference distributions. The goal of this work is not to downgrade descriptive statistics, or to suggest they be replaced by Bayesian modeling, but rather to suggest how exploratory data analysis fits into the probability-modeling paradigm. We conclude with a discussion of the implications for practical Bayesian inference. In particular, we anticipate that Bayesian software can be generalized to draw simulations of replicated data and parameters from their posterior predictive distribution, and these can in turn be used to calibrate EDA graphs.
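
As a rough illustration of the posterior predictive p-value described in the abstract, the sketch below simulates replicated data y^rep from a posterior distribution and estimates the posterior probability that an antisymmetric discrepancy, here T(y^rep) - T(y), exceeds 0. The toy model (normal data with known standard deviation and a flat prior on the mean), the function name `posterior_predictive_pvalue`, and the example data are all assumptions made for illustration; none of them is taken from the paper.

```python
import numpy as np


def posterior_predictive_pvalue(y, test_stat, n_sims=4000, sigma=1.0, seed=0):
    """Estimate Pr(T(y_rep) - T(y) > 0 | y) by simulation.

    Assumed toy model (not from the paper): y_i ~ Normal(theta, sigma) with
    sigma known and a flat prior on theta, so theta | y ~ Normal(ybar, sigma/sqrt(n)).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Draw theta from its posterior distribution.
    theta = rng.normal(y.mean(), sigma / np.sqrt(n), size=n_sims)

    # Draw one replicated dataset of size n for each posterior draw of theta.
    y_rep = rng.normal(theta[:, None], sigma, size=(n_sims, n))

    # Reference distribution of the test statistic under the fitted model.
    t_rep = np.apply_along_axis(test_stat, 1, y_rep)

    # Antisymmetric discrepancy: T(y_rep) - T(y).
    discrepancy = t_rep - test_stat(y)
    return float(np.mean(discrepancy > 0)), t_rep


# Hypothetical data: 50 standard normal draws plus one gross outlier.
data_rng = np.random.default_rng(1)
y_obs = np.append(data_rng.normal(0.0, 1.0, size=50), 8.0)

p_max, t_rep_max = posterior_predictive_pvalue(y_obs, np.max)
p_mean, _ = posterior_predictive_pvalue(y_obs, np.mean)
print("p-value for T = max:", p_max)
print("p-value for T = mean:", p_mean)
```

The simulated values in `t_rep_max` play the role of a reference distribution: a histogram of them with the observed maximum marked would be one simple graphical check. The maximum is sensitive to the outlier, whereas the mean is fitted directly by this model and so is not expected to reveal misfit.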

Suggested Citation

  • Andrew Gelman, 2003. "A Bayesian Formulation of Exploratory Data Analysis and Goodness‐of‐fit Testing," International Statistical Review, International Statistical Institute, vol. 71(2), pages 369-382, August.
  • Handle: RePEc:bla:istatr:v:71:y:2003:i:2:p:369-382
    DOI: 10.1111/j.1751-5823.2003.tb00203.x



