Author
Listed:
- Sidi Chang
- Peiying Zhu
- Yuxiao Chen
Abstract
LLM-based financial agents increasingly produce investment rationales before the outcomes needed to evaluate them are observable. This creates a delayed-ground-truth evaluation problem: realized returns remain the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting shortcut for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueBlindBench, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueBlindBench clears the aggregate agreement gate at \(\bar{\kappa}_w = 0.7168\) but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (\texttt{constraint\_awareness}, \(\bar{\kappa}_w = 0.2022\)), single-judge rankings are family-dependent, and terse-correct rationales receive a \(\Delta = -2.81\) rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The scientific object is therefore not a leaderboard and not a claim to measure true investment skill. ValueBlindBench is a pre-calibration metrology layer for AI-finance evaluation: it governs whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.
Suggested Citation
Sidi Chang & Peiying Zhu & Yuxiao Chen, 2026.
"ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable,"
Papers
2604.25224, arXiv.org, revised May 2026.
Handle:
RePEc:arx:papers:2604.25224
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:arx:papers:2604.25224. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: arXiv administrators (email available below). General contact details of provider: http://arxiv.org/ .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.