Statistical file-matching of non-Gaussian data: A game theoretic approach

Authors
  • Ahfock, Daniel
  • Pyne, Saumyadipta
  • McLachlan, Geoffrey J.

Abstract

The statistical file-matching problem is a data integration problem with structured missing data. The general form involves the analysis of multiple datasets that only have a strict subset of variables jointly observed across all datasets. Missing-data imputation is complicated by the fact that the joint distribution of the variables is nonidentifiable as there are no completely observed cases. Nonparametric imputation methods typically involve an implicit conditional independence assumption that is forced by the missing-data pattern. Parametric imputation does not require conditional independence assumptions, but can be challenging due to identifiability issues and the difficulty of parameter estimation. The identification problem can be studied using game theory, and it is possible to establish a general characterization of the minimax optimal strategy under negative log likelihood loss. For non-Gaussian models, imputation using the minimax optimal strategy can lead to different results compared to generic methods. Computationally feasible procedures for parameter estimation can be implemented using data augmentation schemes and the EM algorithm. Comparisons of the minimax optimal imputation scheme to standard algorithms on real data from flow cytometry show that minimax strategies can better preserve the joint distribution of the variables.
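The nonidentifiability described in the abstract can be illustrated with a simplified case that is not taken from the paper: a trivariate Gaussian in which X is observed in both files, Y only in the first file, and Z only in the second. The correlations of X with Y and of X with Z are then estimable, but the correlation of Y with Z is only constrained by the requirement that the full correlation matrix be positive semidefinite. The sketch below, in which the function name and the example correlations are hypothetical, computes that admissible interval and the value implied by assuming Y and Z are conditionally independent given X.

```python
import math


def admissible_yz_correlation(rho_xy, rho_xz):
    """Bounds on corr(Y, Z) when only corr(X, Y) and corr(X, Z) are identified.

    Assumes a trivariate Gaussian (an illustrative simplification, not the
    non-Gaussian models treated in the paper). Returns (lower, ci_value, upper),
    where ci_value is the correlation implied by conditional independence of
    Y and Z given X.
    """
    # Under conditional independence the partial correlation of Y and Z
    # given X is zero, so corr(Y, Z) = corr(X, Y) * corr(X, Z).
    ci_value = rho_xy * rho_xz
    # Positive semidefiniteness of the 3x3 correlation matrix restricts
    # corr(Y, Z) to an interval centred on the conditional independence value.
    half_width = math.sqrt((1.0 - rho_xy**2) * (1.0 - rho_xz**2))
    return ci_value - half_width, ci_value, ci_value + half_width


# Hypothetical identified correlations, chosen only for illustration.
lower, ci_value, upper = admissible_yz_correlation(0.6, 0.8)
print(f"corr(Y, Z) lies in [{lower:.2f}, {upper:.2f}]; "
      f"conditional independence picks {ci_value:.2f}")
```

Every value in the interval is equally compatible with the observed data, which is why an imputation method must choose a point in it either implicitly (conditional independence) or by an explicit criterion such as the minimax rule the abstract refers to.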

Suggested Citation

  • Ahfock, Daniel & Pyne, Saumyadipta & McLachlan, Geoffrey J., 2022. "Statistical file-matching of non-Gaussian data: A game theoretic approach," Computational Statistics & Data Analysis, Elsevier, vol. 168(C).
  • Handle: RePEc:eee:csdana:v:168:y:2022:i:c:s0167947321002218
    DOI: 10.1016/j.csda.2021.107387