ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

Published 28 Apr 2026 in cs.AI and q-fin.CP | (2604.25224v2)

Abstract: LLM-based financial agents increasingly produce investment rationales before the outcomes needed to evaluate them are observable. This creates a delayed-ground-truth evaluation problem: realized returns remain the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting shortcut for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueBlindBench, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueBlindBench clears the aggregate agreement gate at $\bar{\kappa}_w = 0.7168$ but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (constraint_awareness, $\bar{\kappa}_w = 0.2022$), single-judge rankings are family-dependent, and terse-correct rationales receive a $\Delta = -2.81$ rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The scientific object is therefore not a leaderboard and not a claim to measure true investment skill. ValueBlindBench is a pre-calibration metrology layer for AI-finance evaluation: it governs whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.


Authors (3)

Summary

The paper introduces a preregistered protocol that uses quadratic-weighted kappa and multi-dimensional diagnostics to assess LLM-judged investment rationales before returns are observable.
The methodology incorporates adversarial controls targeting both overconfident verbose and terse but accurate rationales, revealing judge bias and measurement instability.
Empirical results demonstrate that ensemble judging and explicit, dimension-specific anchors outperform single-score leaderboards in ensuring reliable AI-finance evaluations.

ValueBlindBench: Formal Stress-Testing for LLM-Judged Investment Rationales

Motivation and Problem Formulation

The evaluation of AI-generated investment rationales faces a core methodological challenge: the outcomes that validate or falsify financial decisions are typically delayed, noisy, and confounded by exogenous market factors. Traditional realized-return metrics (P&L) provide eventual ground truth but cannot inform model development, selection, or deployment oversight at operational timescales. LLMs offer an accessible mechanism for producing and evaluating rationale statements, yet relying on LLM-judged rubrics interposes a new, unvalidated measurement instrument between the model and future returns. The central contribution of "ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable" (2604.25224) is a preregistered, adversarially audited protocol that gates the publication and interpretation of LLM-evaluated rationales independently of realized outcomes.

ValueBlindBench Protocol: Structure and Adversarial Design

ValueBlindBench is designed as a pre-calibration metrology layer, not an imitation of outcomes. Its deliverables are multifield verdicts rather than scalar scores, specifying for each claim: scope, agreement status, stability, adversarial robustness, and publication permission. The core method operationalizes publication gates through four checks: the mean quadratic-weighted kappa ($\bar{\kappa}_w$) for inter-judge agreement (with $\bar{\kappa}_w \geq 0.4$ as a preregistered cutoff), per-dimension reliability gates, leave-one-family-out (LOFO) judge-family stability diagnostics, and cell-specific adversarial tests targeting construct contamination, especially verbosity and anchor ambiguity.
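The agreement gate can be illustrated with a minimal sketch. It assumes that $\bar{\kappa}_w$ is the mean of pairwise quadratic-weighted Cohen's kappas across judges and that rubric scores are integers on a 1 to 5 scale; neither detail is confirmed by the summary above. The function names are hypothetical and the panel scores are synthetic.

```python
from itertools import combinations

def quadratic_weighted_kappa(a, b, k):
    """Cohen's kappa with quadratic weights for two raters scoring on 1..k."""
    n = len(a)
    # Observed joint distribution of the two raters' scores.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x - 1][y - 1] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic disagreement weight
            num += w * obs[i][j]                 # observed weighted disagreement
            den += w * row[i] * col[j]           # chance-expected disagreement
    # Kappa is undefined when both raters give one identical constant score.
    return 1.0 - num / den if den > 0 else float("nan")

def mean_pairwise_kappa(scores_by_judge, k=5):
    """Average kappa over all judge pairs (assumed definition of kappa-bar)."""
    pairs = combinations(sorted(scores_by_judge), 2)
    kappas = [quadratic_weighted_kappa(scores_by_judge[p], scores_by_judge[q], k)
              for p, q in pairs]
    return sum(kappas) / len(kappas)

def passes_agreement_gate(scores_by_judge, k=5, cutoff=0.4):
    """Apply the preregistered cutoff of 0.4 quoted in the text."""
    return mean_pairwise_kappa(scores_by_judge, k) >= cutoff

# Synthetic scores from three hypothetical judges on six rationales.
panel = {
    "judge_a": [5, 4, 4, 2, 1, 3],
    "judge_b": [5, 4, 3, 2, 1, 3],
    "judge_c": [4, 4, 4, 2, 2, 3],
}
print(round(mean_pairwise_kappa(panel), 4), passes_agreement_gate(panel))
```

Quadratic weights penalize a 1-versus-5 disagreement far more than a 4-versus-5 one, which suits ordinal rubric scores better than unweighted kappa.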

Empirical validation uses a controlled market-state allocation substrate with four distinct rationale-generating agent families (OpenAI GPT-5.5, Anthropic Claude Sonnet 4.6, Google Gemini 3.1, Alibaba Qwen3), and a three-judge ensemble (replicated Claude judge, GPT-5.5, Gemini). Primary adversarial controls are engineered as:

Cell A: verbose, confident but substantively incorrect rationales
Cell B: terse but factually correct rationales (≤60 tokens, ≥3 cited features, no hedging)
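The surface constraints on Cell B (length cap, minimum cited features, no hedging) could be checked mechanically. The sketch below is illustrative only: the tokenizer, the hedge-word list, and the feature names are assumptions, not the paper's definitions, and the check says nothing about whether a rationale is factually correct.

```python
import re

# Hypothetical hedge-word list; the paper's actual criterion is not given here.
HEDGE_WORDS = {"may", "might", "could", "possibly", "perhaps", "likely"}

def is_valid_cell_b(rationale, known_features, max_tokens=60, min_features=3):
    """Check the stated Cell B surface constraints on one rationale."""
    # Assumed tokenization: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9_]+", rationale.lower())
    if len(tokens) > max_tokens:
        return False
    token_set = set(tokens)
    # Only single-word feature names are matched in this sketch.
    cited = {f for f in known_features if f.lower() in token_set}
    if len(cited) < min_features:
        return False
    return not (token_set & HEDGE_WORDS)

# Hypothetical market-state features.
features = ["volatility", "momentum", "drawdown", "liquidity"]
terse = "Cut equity to 40%: volatility high, momentum negative, drawdown near limit."
print(is_valid_cell_b(terse, features))
```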

Repetition stability, dimension-level verdicts, and bootstrap-based ranking stability provide deeper noise and bias audits.

Key Results and Measurement Findings

Agreement and Overclaim Prevention: Aggregate panel agreement is robust: $\bar{\kappa}_w = 0.7168$ (95% CI [0.7006, 0.7330]), comfortably exceeding the 0.4 publication gate. Per-dimension analysis, however, shows that this agreement is not uniform: the constraint awareness dimension fails its gate with $\bar{\kappa}_w = 0.2022$, demonstrating that high overall agreement can mask dimension-level arbitrariness.
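The masking effect can be made concrete with a small sketch that applies a gate separately to each reported value. It assumes the per-dimension gate reuses the 0.4 cutoff, which the summary does not state, and it includes only the two kappa values reported above; the function name is hypothetical.

```python
GATE = 0.4  # assumed to apply per dimension as well as in aggregate

def dimension_verdicts(kappa_by_dimension, gate=GATE):
    """Map each dimension's mean weighted kappa to a pass/fail verdict."""
    return {dim: ("pass" if kappa >= gate else "fail")
            for dim, kappa in kappa_by_dimension.items()}

# Only the two values quoted in the text; other dimensions are not reported here.
reported = {"aggregate": 0.7168, "constraint_awareness": 0.2022}
verdicts = dimension_verdicts(reported)
print(verdicts)
```

Under this reading, a claim resting on the aggregate score would pass while any claim about constraint awareness specifically would be blocked.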

Adversarial Control Outcomes: The adversarial cells yield findings that contradict a naively trusted leaderboard. Cell A is correctly penalized (panel mean 1.44 vs. honest 4.35, $p < 10^{-3}$), but Cell B, the terse-but-correct rationales, receives a severe penalty relative to honest rationales ($\Delta = -2.81$ rubric points), far beyond the minimum detectable effect (MDE). Naive LLM judging therefore confounds rhetorical coverage with substantive adequacy: LLM panels systematically undervalue correct but terse reasoning.

Stability and Ranking: Only Claude Sonnet 4.6 maintains a stable rank-1 position across bootstrap and judge-ablation (LOFO) analyses. Lower agent ranks are not meaningfully separable; noise dominates judge-family-specific score gaps, and removal of individual judge families rapidly dissolves apparent distinctions (Spearman $\rho$ for positions 2–4 collapses to 0.2). Single-judge protocols produce spurious, unstable distinctions.
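A leave-one-family-out check of this kind can be sketched as follows. The agent names, judge families, and scores are synthetic, and ranking by the unweighted mean across families is an assumption about how the panel score is formed; the Spearman formula used here assumes no tied ranks.

```python
def rank_agents(mean_scores):
    """mean_scores: {agent: {judge_family: mean score}} -> agents, best first."""
    avg = {a: sum(s.values()) / len(s) for a, s in mean_scores.items()}
    return sorted(avg, key=lambda a: (-avg[a], a))

def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two tie-free orderings (n > 1)."""
    n = len(order_a)
    pos_b = {agent: i for i, agent in enumerate(order_b)}
    d2 = sum((i - pos_b[agent]) ** 2 for i, agent in enumerate(order_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def lofo_stability(mean_scores):
    """Re-rank after dropping each judge family; compare with the full ranking."""
    full = rank_agents(mean_scores)
    families = sorted({f for s in mean_scores.values() for f in s})
    results = {}
    for dropped in families:
        reduced = {a: {f: v for f, v in s.items() if f != dropped}
                   for a, s in mean_scores.items()}
        order = rank_agents(reduced)
        results[dropped] = (order, spearman_rho(full, order))
    return full, results

# Synthetic per-family mean scores for four hypothetical agents.
scores = {
    "agent_a": {"fam_x": 4.5, "fam_y": 4.4, "fam_z": 4.6},
    "agent_b": {"fam_x": 3.6, "fam_y": 3.0, "fam_z": 3.3},
    "agent_c": {"fam_x": 3.0, "fam_y": 3.5, "fam_z": 3.2},
    "agent_d": {"fam_x": 3.1, "fam_y": 3.2, "fam_z": 3.0},
}
full, results = lofo_stability(scores)
for dropped, (order, rho) in results.items():
    print(dropped, order, round(rho, 2))
```

In this synthetic example the top agent keeps rank 1 under every ablation, while dropping `fam_x` swaps the second and third agents, the kind of instability that argues for reporting lower ranks as a tie-class.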

Anchor Specificity: An anchor-specificity probe on the constraint awareness dimension underlines the importance of operationalizing rubric anchors with explicit, quantitative requirements. Weak or verbal-only anchors promote judge-family ambiguity and measurement noise.

Implications: Theoretical and Practical

The theoretical implication is a substantive critique of off-the-shelf LLM-judged leaderboards in financial AI. ValueBlindBench demonstrates that leaderboard-like outputs dramatically overstate the reliability and interpretability of LLM-judged rationales unless strict measurement discipline—ensemble auditing, adversarial control, and per-dimension gates—is enforced. Notably, it extends classical metrology perspectives (Cronbach–Meehl, Messick) into the AI-evaluated finance domain: reproducibility and agreement are necessary but not sufficient for claim validity.

Practically, the ValueBlindBench verdict tuple can be adopted as a standardized reporting object for AI-finance model evaluation, procurement, and governance in the pre-realization regime. This explicit foregrounding of claim permission, stability, and contamination serves as a direct safeguard against surface-plausibility selection, leaderboard inflation, and style/anchor-induced failure modes.
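One way to picture the verdict tuple as a reporting object is the sketch below. The five fields follow the list given in the protocol description, and the three permission levels follow the abstract; the class names, field types, and the rule mapping gate outcomes to a permission are hypothetical, not the paper's specification.

```python
from dataclasses import dataclass
from enum import Enum

class Permission(Enum):
    PUBLISHABLE = "publishable"
    QUALIFIED = "qualified"
    INVALID = "invalid"

@dataclass(frozen=True)
class Verdict:
    scope: str                  # what the claim is about
    agreement_passed: bool      # inter-judge agreement gate
    stable: bool                # survives bootstrap / judge-family ablation
    adversarially_robust: bool  # no contamination flagged by adversarial cells
    permission: Permission

def decide(scope, agreement_passed, stable, adversarially_robust):
    """Hypothetical rule: failed agreement invalidates; other failures qualify."""
    if not agreement_passed:
        permission = Permission.INVALID
    elif stable and adversarially_robust:
        permission = Permission.PUBLISHABLE
    else:
        permission = Permission.QUALIFIED
    return Verdict(scope, agreement_passed, stable, adversarially_robust, permission)

verdict = decide("rank-1 agent identity", True, True, False)
print(verdict.permission.value)
```

Keeping the fields separate, rather than folding them into one score, preserves the reason a claim was downgraded.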

For rubric and protocol designers, the anchor-specificity findings argue that every financial rubric dimension should be explicitly grounded in operationalizable, context-specific constructs to reduce judge-family variance—e.g., requiring explicit constraint arithmetic rather than abstract constraint “consideration.”

Limitations and Future Work

The authors explicitly delimit their claims: ValueBlindBench does not attempt to replace realized returns as the final arbiter, does not validate latent agent reasoning, and is presently scoped to a capital-allocation prototype rather than full fundamental equity research. Key areas for extension include expert panel benchmarks (with ValueBlindBench gating), multi-asset and multi-modal settings, rationale-length normalization to mitigate verbosity bias, and adversarial tests for paraphrase and arithmetic robustness.

The current protocol is vulnerable to gaming if made public, so held-out rubric variants and post-hoc P&L cross-checks are needed before production deployment.

Conclusion

ValueBlindBench provides a rigorous, preregistered discipline for the publication and interpretation of LLM-judged investment rationale claims in the delayed-outcome, pre-realization regime. It surfaces and enforces explicit conditions under which LLM-panel outputs may or may not be acted upon, prioritizing measurement stability, dimension-level reliability, and adversarial robustness over naive score reporting. The protocol's findings challenge the reliability of both single-judge and raw aggregate leaderboards, highlighting rhetorical-coverage bias and dimension-specific ambiguity as practical risks. Adoption of similar adversarial-gated, multi-ensemble protocols is recommended for future AI-finance evaluation infrastructure.
