Printed from https://ideas.repec.org/p/arx/papers/2605.01311.html

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

Author

Listed:
  • Jikai Jin
  • Vasilis Syrgkanis

Abstract

Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
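
To make the confounding problem in the abstract concrete, here is a minimal sketch that computes exact population quantities for a toy setting. It is not the paper's estimator or identification result. The user types, the two models "A" and "B", and every probability and score below are hypothetical values chosen only for illustration. The sketch contrasts the per-model mean score seen in a self-selected log with the mean seen under randomized assignment.

```python
# Toy population: all names and numbers here are illustrative assumptions,
# not values from the paper.
P_TYPE = {"expert": 0.5, "novice": 0.5}  # share of each latent user type

# Expected judged score of each model's output, by user type.
MEAN_SCORE = {
    "expert": {"A": 0.9, "B": 0.8},
    "novice": {"A": 0.5, "B": 0.6},
}

# Observational log: user type drives model choice (confounding).
OBS_POLICY = {
    "expert": {"A": 0.9, "B": 0.1},
    "novice": {"A": 0.2, "B": 0.8},
}

# Randomized experiment: model choice is overridden by a fair coin.
EXP_POLICY = {
    "expert": {"A": 0.5, "B": 0.5},
    "novice": {"A": 0.5, "B": 0.5},
}


def causal_value(model):
    """Mean score if every user in the population were given `model`."""
    return sum(P_TYPE[u] * MEAN_SCORE[u][model] for u in P_TYPE)


def logged_mean(model, policy):
    """Mean score among the users who ended up with `model` under `policy`."""
    weight = {u: P_TYPE[u] * policy[u][model] for u in P_TYPE}
    total = sum(weight.values())
    return sum(weight[u] * MEAN_SCORE[u][model] for u in P_TYPE) / total


if __name__ == "__main__":
    for m in ("A", "B"):
        print(
            m,
            "causal:", round(causal_value(m), 3),
            "confounded log:", round(logged_mean(m, OBS_POLICY), 3),
            "randomized:", round(logged_mean(m, EXP_POLICY), 3),
        )
```

In this toy setting both models have the same causal value (0.7), yet the confounded log reports roughly 0.827 for A and 0.622 for B, because experts both prefer A and score outputs higher. Under randomized assignment the per-model means coincide with the causal values.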

Suggested Citation

  • Jikai Jin & Vasilis Syrgkanis, 2026. "The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice," Papers 2605.01311, arXiv.org.
  • Handle: RePEc:arx:papers:2605.01311

    Download full text from publisher

    File URL: https://arxiv.org/pdf/2605.01311
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Xuelin Yang & Licong Lin & Susan Athey & Michael I. Jordan & Guido W. Imbens, 2025. "Cross-Validated Causal Inference: a Modern Method to Combine Experimental and Observational Data," Papers 2511.00727, arXiv.org.
    2. Evan T.R. Rosenman & Guillaume Basse & Art B. Owen & Mike Baiocchi, 2023. "Combining observational and experimental datasets using shrinkage estimators," Biometrics, The International Biometric Society, vol. 79(4), pages 2961-2973, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhexiao Lin & Peter J. Bickel & Peng Ding, 2026. "Introducing the b-value: combining unbiased and biased estimators from a sensitivity analysis perspective," Papers 2602.16310, arXiv.org.
    2. Harsh Parikh & Trang Quynh Nguyen & Elizabeth A. Stuart & Kara E. Rudolph & Caleb H. Miles, 2025. "A Cautionary Tale on Integrating Studies with Disparate Outcome Measures for Causal Inference," Papers 2505.11014, arXiv.org.
    3. George Z. Gui, 2024. "Combining Observational and Experimental Data to Improve Efficiency Using Imperfect Instruments," Marketing Science, INFORMS, vol. 43(2), pages 378-391, March.
    4. Irina Degtiar & Tim Layton & Jacob Wallace & Sherri Rose, 2023. "Conditional cross‐design synthesis estimators for generalizability in Medicaid," Biometrics, The International Biometric Society, vol. 79(4), pages 3859-3872, December.
    5. Francisco Blasques & Paolo Gorgi & Siem Jan Koopman & Noah Stegehuis, 2024. "Mitigating Estimation Risk: a Data-Driven Fusion of Experimental and Observational Data," Tinbergen Institute Discussion Papers 24-066/III, Tinbergen Institute.
    6. Xuelin Yang & Licong Lin & Susan Athey & Michael I. Jordan & Guido W. Imbens, 2025. "Cross-Validated Causal Inference: a Modern Method to Combine Experimental and Observational Data," Papers 2511.00727, arXiv.org.
    7. Carlos Fernández-Loría & Foster Provost, 2025. "Observational vs. Experimental Data When Making Automated Decisions Using Machine Learning," INFORMS Journal on Data Science, INFORMS, vol. 4(3), pages 197-229, July.
    8. Quinn Lanners & Cynthia Rudin & Alexander Volfovsky & Harsh Parikh, 2025. "Data Fusion for Partial Identification of Causal Effects," Papers 2505.24296, arXiv.org.
    9. Shosei Sakaguchi, 2025. "The Identification Power of Combining Experimental and Observational Data for Distributional Treatment Effect Parameters," Papers 2508.12206, arXiv.org, revised Apr 2026.
    10. Kevin Han & Han Wu & Linjia Wu & Yu Shi & Canyao Liu, 2024. "Estimating Treatment Effects Using Observational Data and Experimental Data with Non-Overlapping Support," Econometrics, MDPI, vol. 12(3), pages 1-11, September.
    11. Ana Armendariz & Martin Huber, 2026. "Testing Effect Homogeneity and Confounding in High-Dimensional Experimental and Observational Studies," Papers 2602.19703, arXiv.org.
