
Detecting and Mitigating Treatment Leakage in Text-Based Causal Inference: Distillation and Sensitivity Analysis

Authors

  • Adel Daoud
  • Richard Johansson
  • Connor T. Jerzak

Abstract

Text-based causal inference increasingly employs textual data as proxies for unobserved confounders, yet this approach introduces a previously undertheorized source of bias: treatment leakage. Treatment leakage occurs when text intended to capture confounding information also contains signals predictive of treatment status, thereby inducing post-treatment bias in causal estimates. Critically, this problem can arise even when documents precede treatment assignment, as authors may employ future-referencing language that anticipates subsequent interventions. Despite growing recognition of this issue, no systematic methods exist for identifying and mitigating treatment leakage in text-as-confounder applications. This paper addresses this gap through three contributions. First, we provide formal statistical and set-theoretic definitions of treatment leakage that clarify when and why bias occurs. Second, we propose four text distillation methods -- similarity-based passage removal, distant supervision classification, salient feature removal, and iterative nullspace projection -- designed to eliminate treatment-predictive content while preserving confounder information. Third, we validate these methods through simulations using synthetic text and an empirical application examining International Monetary Fund structural adjustment programs and child mortality. Our findings indicate that moderate distillation optimally balances bias reduction against confounder retention, whereas overly stringent approaches degrade estimate precision.
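As a rough illustration of the first distillation method named above, similarity-based passage removal, the sketch below drops passages that closely resemble a description of the treatment. It is not the authors' implementation. The bag-of-words cosine similarity is a simple stand-in for whatever text representation the paper uses, and the example passages, the treatment description, and the thresholds 0.3 and 0.1 are all invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a text embedding (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def distill(passages, treatment_description, threshold):
    """Keep passages whose similarity to the treatment description is below threshold."""
    query = bag_of_words(treatment_description)
    return [p for p in passages if cosine(bag_of_words(p), query) < threshold]


# Hypothetical pre-treatment document, split into passages (invented for illustration).
passages = [
    "Inflation rose sharply and foreign reserves fell during the year.",
    "The government expects to sign a structural adjustment program with the IMF next year.",
    "Rural clinics reported shortages under the national vaccination program.",
]
treatment = "IMF structural adjustment program"

# A lower threshold removes more text, i.e. it is the more stringent setting.
moderate = distill(passages, treatment, threshold=0.3)
stringent = distill(passages, treatment, threshold=0.1)
```

With these toy inputs, the moderate threshold removes only the forward-looking passage about the anticipated program, while the stringent threshold also removes the passage about rural clinics, which shares just the word "program" with the treatment description but plausibly carries confounder information. This mirrors, in miniature, the trade-off the abstract describes between bias reduction and confounder retention.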

Suggested Citation

  • Adel Daoud & Richard Johansson & Connor T. Jerzak, 2025. "Detecting and Mitigating Treatment Leakage in Text-Based Causal Inference: Distillation and Sensitivity Analysis," Papers 2601.02400, arXiv.org.
  • Handle: RePEc:arx:papers:2601.02400

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2601.02400
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sallin, Aurélien, 2021. "Estimating returns to special education: combining machine learning and text analysis to address confounding," Economics Working Paper Series 2109, University of St. Gallen, School of Economics and Political Science.
    2. Daoud, Adel & Johansson, Fredrik, 2019. "Estimating Treatment Heterogeneity of International Monetary Fund Programs on Child Poverty with Generalized Random Forest," SocArXiv awfjt, Center for Open Science.
    3. Aurélien Sallin, 2021. "Estimating returns to special education: combining machine learning and text analysis to address confounding," Papers 2110.08807, arXiv.org, revised Feb 2022.
    4. Jeffrey Smith & Arthur Sweetman, 2016. "Viewpoint: Estimating the causal effects of policies and programs," Canadian Journal of Economics, Canadian Economics Association, vol. 49(3), pages 871-905, August.
    5. Ting-Chih Hung & Yu-Chang Chen, 2026. "The Proximal Surrogate Index: Long-Term Treatment Effects under Unobserved Confounding," Papers 2601.17712, arXiv.org.
    6. Jiaming Zeng & Michael F. Gensheimer & Daniel L. Rubin & Susan Athey & Ross D. Shachter, 2022. "Uncovering interpretable potential confounders in electronic medical records," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    7. Myoung‐jae Lee, 2021. "Instrument residual estimator for any response variable with endogenous binary treatment," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 83(3), pages 612-635, July.
    8. Tarek Azzam & Michael Bates & David Fairris, 2019. "Do Learning Communities Increase First Year College Retention? Testing Sample Selection and External Validity of Randomized Control Trials," Working Papers 202002, University of California at Riverside, Department of Economics.
    9. Guido Imbens & Stefan Wager, 2019. "Optimized Regression Discontinuity Designs," The Review of Economics and Statistics, MIT Press, vol. 101(2), pages 264-278, May.
    10. Marco Caliendo & Stefan Tübbicke, 2020. "New evidence on long-term effects of start-up subsidies: matching estimates and their robustness," Empirical Economics, Springer, vol. 59(4), pages 1605-1631, October.
    11. Pedro H. C. Sant'Anna & Xiaojun Song & Qi Xu, 2022. "Covariate distribution balance via propensity scores," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 37(6), pages 1093-1120, September.
    12. Caloffi, Annalisa & Freo, Marzia & Ghinoi, Stefano & Mariani, Marco & Rossi, Federica, 2022. "Assessing the effects of a deliberate policy mix: The case of technology and innovation advisory services and innovation vouchers," Research Policy, Elsevier, vol. 51(6).
    13. Mellace, Giovanni & Ventura, Marco, 2019. "Intended and unintended effects of public incentives for innovation. Quasi-experimental evidence from Italy," Discussion Papers on Economics 9/2019, University of Southern Denmark, Department of Economics.
    14. Aliou Diagne & Steven Glover & Ben Groom & Jonathan Phillips, 2012. "Africa's Green Revolution? The determinants of the adoption of NERICAs in West Africa," Working Papers 174, Department of Economics, SOAS University of London, UK.
    15. Gurgen OHANYAN, 2015. "Recent Changes of IMF Conditionality and Its Effects on Social Spending," REVISTA DE MANAGEMENT COMPARAT INTERNATIONAL/REVIEW OF INTERNATIONAL COMPARATIVE MANAGEMENT, Faculty of Management, Academy of Economic Studies, Bucharest, Romania, vol. 16(5), pages 591-602, December.
    16. Ali Burak Güven, 2012. "The IMF, the World Bank, and the Global Economic Crisis: Exploring Paradigm Continuity," Development and Change, International Institute of Social Studies, vol. 43(4), pages 869-898, July.
    17. Daniel Burkhard & Christian P. R. Schmid & Kaspar Wüthrich, 2019. "Financial incentives and physician prescription behavior: Evidence from dispensing regulations," Health Economics, John Wiley & Sons, Ltd., vol. 28(9), pages 1114-1129, September.
    18. Sloczynski, Tymon, 2018. "A General Weighted Average Representation of the Ordinary and Two-Stage Least Squares Estimands," IZA Discussion Papers 11866, IZA Network @ LISER.
    19. Michael Lechner, 2023. "Causal Machine Learning and its use for public policy," Swiss Journal of Economics and Statistics, Springer;Swiss Society of Economics and Statistics, vol. 159(1), pages 1-15, December.
    20. Brantly Callaway & Derek Dyal & Pedro H. C. Sant'Anna & Emmanuel S. Tsyawo, 2025. "Beyond Parallel Trends: An Identification-Strategy-Robust Approach to Causal Inference with Panel Data," Papers 2511.21977, arXiv.org.
