Printed from https://ideas.repec.org/p/arx/papers/2505.00282.html

A Unifying Framework for Robust and Efficient Inference with Unstructured Data

Authors
  • Jacob Carlson
  • Melissa Dell

Abstract

To analyze unstructured data (text, images, audio, video), economists typically first extract low-dimensional structured features with a neural network. Neural networks do not make generically unbiased predictions, and biases will propagate to estimators that use their predictions. While structured variables extracted from unstructured data have traditionally been treated as proxies - implicitly accepting arbitrary measurement error - this poses various challenges in an era where constantly evolving AI can cheaply extract data. Researcher degrees of freedom (e.g., the choice of neural network architecture, training data or prompts, and numerous implementation details) raise concerns about p-hacking and how to best show robustness, the frequent deprecation of proprietary neural networks complicates reproducibility, and researchers need a principled way to determine how accurate predictions need to be before making costly investments to improve them. To address these challenges, this study develops MAR-S (Missing At Random Structured Data), a semiparametric missing data framework that enables unbiased, efficient, and robust inference with unstructured data, by correcting for neural network prediction error with a validation sample. MAR-S synthesizes and extends existing methods for debiased inference using machine learning predictions and connects them to familiar problems such as causal inference, highlighting valuable parallels. We develop robust and efficient estimators for both descriptive and causal estimands and address inference with aggregated and transformed neural network predictions, a common scenario outside the existing literature.
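The idea of correcting neural network prediction error with a validation sample can be illustrated with a toy simulation. The sketch below is not the MAR-S estimator from the paper. It is a minimal difference-style correction for a population mean, and it assumes that the validation labels are drawn completely at random, which is simpler than the setting the abstract describes. The function names `debiased_mean` and `simulate`, the 0.3 prevalence, and the classifier error rates are all hypothetical choices made for illustration.

```python
import random
import statistics


def debiased_mean(preds_all, preds_val, labels_val):
    """Full-sample mean prediction plus the mean validation residual."""
    if len(preds_val) != len(labels_val):
        raise ValueError("validation predictions and labels must align")
    naive = statistics.fmean(preds_all)
    # Residuals (label minus prediction) estimate the average prediction error.
    residuals = [y - p for p, y in zip(preds_val, labels_val)]
    return naive + statistics.fmean(residuals)


def simulate(seed=0, n=20_000, n_val=2_000):
    """Return (true_mean, naive_estimate, debiased_estimate) on toy data."""
    rng = random.Random(seed)
    # Hypothetical ground truth: a binary feature with prevalence 0.3.
    labels = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    # Hypothetical classifier: 20% false positives, 10% false negatives.
    preds = [
        (0 if rng.random() < 0.1 else 1) if y == 1
        else (1 if rng.random() < 0.2 else 0)
        for y in labels
    ]
    # Units are i.i.d., so the first n_val form a simple random validation sample.
    debiased = debiased_mean(preds, preds[:n_val], labels[:n_val])
    return statistics.fmean(labels), statistics.fmean(preds), debiased


if __name__ == "__main__":
    truth, naive, debiased = simulate()
    print(f"truth={truth:.3f} naive={naive:.3f} debiased={debiased:.3f}")
```

In this toy setup the asymmetric error rates push the naive mean of the predictions away from the true prevalence, while adding the average validation residual brings the estimate back toward it, at the cost of extra sampling noise from the smaller validation sample.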

Suggested Citation

  • Jacob Carlson & Melissa Dell, 2025. "A Unifying Framework for Robust and Efficient Inference with Unstructured Data," Papers 2505.00282, arXiv.org, revised Feb 2026.
  • Handle: RePEc:arx:papers:2505.00282

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2505.00282
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Imbens,Guido W. & Rubin,Donald B., 2015. "Causal Inference for Statistics, Social, and Biomedical Sciences," Cambridge Books, Cambridge University Press, number 9780521885881, November.
    2. Keisuke Hirano & Guido W. Imbens & Geert Ridder, 2003. "Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score," Econometrica, Econometric Society, vol. 71(4), pages 1161-1189, July.

    Cited by:

    1. Xiaohong Chen & Haitian Xie, 2025. "Local Overidentification and Efficiency Gains in Modern Causal Inference and Data Combination," Cowles Foundation Discussion Papers 2467, Cowles Foundation for Research in Economics, Yale University.
    2. Niclas Griesshaber & Jochen Streb, 2025. "Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)," Papers 2512.19675, arXiv.org.
    3. Xiaohong Chen & Haitian Xie, 2025. "On Local Overidentification and Efficiency Gains in Modern Causal Inference and Data Combination," Papers 2510.16683, arXiv.org, revised Feb 2026.
    4. Timothy Christensen & Giovanni Compiani, 2026. "From Unstructured Data to Demand Counterfactuals: Theory and Practice," Papers 2601.05374, arXiv.org.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Alexandre Belloni & Victor Chernozhukov & Denis Chetverikov & Christian Hansen & Kengo Kato, 2018. "High-dimensional econometrics and regularized GMM," CeMMAP working papers CWP35/18, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    2. Ruoxuan Xiong & Allison Koenecke & Michael Powell & Zhu Shen & Joshua T. Vogelstein & Susan Athey, 2021. "Federated Causal Inference in Heterogeneous Observational Data," Papers 2107.11732, arXiv.org, revised Apr 2023.
    3. Hairu Wang & Yukun Liu & Haiying Zhou, 2025. "Score test for unconfoundedness under a logistic treatment assignment model," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 77(4), pages 517-533, August.
    4. Pedro H. C. Sant'Anna & Xiaojun Song & Qi Xu, 2022. "Covariate distribution balance via propensity scores," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 37(6), pages 1093-1120, September.
    5. Sung Jae Jun & Sokbae Lee, 2024. "Causal Inference Under Outcome-Based Sampling with Monotonicity Assumptions," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 42(3), pages 998-1009, July.
    6. Jiannan Lu & Peng Ding & Tirthankar Dasgupta, 2018. "Treatment Effects on Ordinal Outcomes: Causal Estimands and Sharp Bounds," Journal of Educational and Behavioral Statistics, vol. 43(5), pages 540-567, October.
    7. Vincent Starck, 2025. "Improving control over unobservables with network data," Papers 2511.00612, arXiv.org.
    8. Jinglong Zhao, 2024. "Experimental Design For Causal Inference Through An Optimization Lens," Papers 2408.09607, arXiv.org, revised Aug 2024.
    9. Graham, Bryan S. & Pinto, Cristine Campos de Xavier, 2022. "Semiparametrically efficient estimation of the average linear regression function," Journal of Econometrics, Elsevier, vol. 226(1), pages 115-138.
    10. Susan Athey & Stefan Wager, 2021. "Policy Learning With Observational Data," Econometrica, Econometric Society, vol. 89(1), pages 133-161, January.
    11. Jianhua Mei & Fu Ouyang & Thomas T. Yang, 2025. "Dimension Reduction for Conditional Density Estimation with Applications to High-Dimensional Causal Inference," Papers 2507.22312, arXiv.org, revised Oct 2025.
    12. repec:osf:socarx:qzm7y_v1 is not listed on IDEAS
    13. Victor Chernozhukov & Mert Demirer & Esther Duflo & Iván Fernández‐Val, 2025. "Fisher–Schultz Lecture: Generic Machine Learning Inference on Heterogeneous Treatment Effects in Randomized Experiments, With an Application to Immunization in India," Econometrica, Econometric Society, vol. 93(4), pages 1121-1164, July.
    14. Sung Jae Jun & Sokbae (Simon) Lee, 2020. "Causal inference in case-control studies," CeMMAP working papers CWP19/20, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    15. Rahul Singh & Liyuan Xu & Arthur Gretton, 2020. "Kernel Methods for Causal Functions: Dose, Heterogeneous, and Incremental Response Curves," Papers 2010.04855, arXiv.org, revised Oct 2022.
    16. Difang Huang & Jiti Gao & Tatsushi Oka, 2025. "Semiparametric single-index estimation for average treatment effects," Econometric Reviews, Taylor & Francis Journals, vol. 44(6), pages 843-885, July.
    17. N. Krasnopeeva & E. Nazrullaeva & A. Peresetsky & E. Shchetinin, 2016. "To export or not to export? The link between the exporter status of a firm and its technical efficiency in Russia’s manufacturing sector," VOPROSY ECONOMIKI, N.P. Redaktsiya zhurnala "Voprosy Economiki", vol. 7.
    18. Lin, Zhexiao & Han, Fang, 2025. "On regression-adjusted imputation estimators of average treatment effects," Journal of Econometrics, Elsevier, vol. 251(C).
    19. Guido W. Imbens, 2015. "Matching Methods in Practice: Three Examples," Journal of Human Resources, University of Wisconsin Press, vol. 50(2), pages 373-419.
    20. Sonna Vikhil & K.S. Kavi Kumar, 2025. "Impact Evaluation of Cash Transfer: Case Study of Agriculture, Telangana," Working Papers 2025-278, Madras School of Economics,Chennai,India.
    21. Kevin P. Josey & Elizabeth Juarez‐Colunga & Fan Yang & Debashis Ghosh, 2021. "A framework for covariate balance using Bregman distances," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 48(3), pages 790-816, September.
