- The paper introduces a preregistered protocol that uses quadratic-weighted kappa and multi-dimensional diagnostics to assess LLM-judged investment rationales before returns are observable.
- The methodology incorporates adversarial controls targeting both overconfident, verbose rationales and terse but accurate ones, revealing judge bias and measurement instability.
- Empirical results demonstrate that ensemble judging and explicit, dimension-specific anchors outperform single-score leaderboards in ensuring reliable AI-finance evaluations.
The evaluation of AI-generated investment rationales faces a core methodological challenge: the outcomes that validate or falsify financial decisions are typically delayed, noisy, and confounded by exogenous market factors. Traditional realized-return metrics (P&L) provide eventual ground truth but cannot inform model development, selection, or deployment oversight at operational timescales. LLMs offer an accessible mechanism for producing and evaluating rationale statements, yet relying on LLM-judged rubrics interposes a new, unvalidated measurement instrument between the model and future returns. The central contribution of "ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable" (2604.25224) is a preregistered, adversarially audited protocol that gates the publication and interpretation of LLM-evaluated rationales independently of realized outcomes.
ValueBlindBench Protocol: Structure and Adversarial Design
ValueBlindBench is architected to serve as a pre-calibration metrology layer, not an outcome imitation. The protocol’s deliverables are structured as multifield verdicts, not scalar scores, specifying for each claim: scope, agreement status, stability, adversarial robustness, and publication permission. The core method is to operationalize publication gates via the mean quadratic-weighted kappa ($\bar{\kappa}_w$) for inter-judge agreement (with $\bar{\kappa}_w \geq 0.4$ as a preregistered cutoff), per-dimension reliability gates, judge-family stability diagnostics (LOFO), and cell-specific adversarial tests targeting construct contamination, especially verbosity and anchor ambiguity.
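The agreement gate can be illustrated with a minimal sketch. It assumes integer rubric scores on a 1 to 5 scale and assumes the panel statistic is the mean of pairwise quadratic-weighted Cohen's kappa values; neither detail is confirmed by the summary above, and the function names are hypothetical.

```python
from itertools import combinations


def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=5):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if not all(min_rating <= r <= max_rating for r in (*a, *b)):
        raise ValueError("ratings must lie within the assumed scale")
    k = max_rating - min_rating + 1
    n = len(a)
    # Observed confusion counts between the two raters.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # chance-expected count
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - num / den


def mean_pairwise_kappa(ratings_by_judge, **scale):
    """Average quadratic-weighted kappa over all unordered judge pairs (assumed aggregation)."""
    pairs = list(combinations(ratings_by_judge, 2))
    return sum(
        quadratic_weighted_kappa(ratings_by_judge[p], ratings_by_judge[q], **scale)
        for p, q in pairs
    ) / len(pairs)


def passes_gate(kappa_bar, threshold=0.4):
    """Publication gate: panel agreement must reach the preregistered cutoff."""
    return kappa_bar >= threshold
```

Quadratic weights penalize a 1-versus-5 disagreement sixteen times more than a 1-versus-2 disagreement, which suits ordinal rubric scores where near misses are less serious than opposite verdicts.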
Empirical validation uses a controlled market-state allocation substrate with four distinct rationale-generating agent families (OpenAI GPT-5.5, Anthropic Claude Sonnet 4.6, Google Gemini 3.1, Alibaba Qwen3), and a three-judge ensemble (replicated Claude judge, GPT-5.5, Gemini). Primary adversarial controls are engineered as:
- Cell A: verbose, confident but substantively incorrect rationales
- Cell B: terse but factually correct rationales (≤60 tokens, ≥3 cited features, no hedging)
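The Cell B constraints (at most 60 tokens, at least 3 cited features, no hedging) lend themselves to an automated admissibility check. The sketch below is illustrative only: whitespace splitting stands in for whatever tokenizer the authors used, and the hedge lexicon, the feature names, and the function name `is_valid_cell_b` are hypothetical.

```python
import re

# Hypothetical hedge lexicon; the protocol's actual no-hedging criterion is not specified here.
HEDGE_TERMS = frozenset({"may", "might", "could", "possibly", "perhaps", "probably", "likely"})


def is_valid_cell_b(rationale, feature_names, max_tokens=60, min_features=3):
    """Check a candidate terse-but-correct rationale against the Cell B constraints."""
    text = rationale.lower()
    tokens = text.split()  # whitespace tokens approximate the real token count
    words = set(re.findall(r"[a-z]+", text))
    cited = {name for name in feature_names if name.lower() in text}
    return (
        len(tokens) <= max_tokens
        and len(cited) >= min_features
        and not (words & HEDGE_TERMS)
    )
```

A check of this kind only verifies the surface constraints; factual correctness of the rationale still has to be established separately.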
Repetition stability, dimension-level verdicts, and bootstrap-based ranking stability provide deeper noise and bias audits.
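One way to picture bootstrap-based ranking stability is to resample the evaluated items with replacement and count how often each agent finishes first. The sketch below assumes item-level resampling and per-item panel scores; the paper's exact bootstrap design is not described here, and the function name is hypothetical.

```python
import random


def bootstrap_rank1_frequency(item_scores, n_boot=1000, seed=0):
    """Estimate how often each agent ranks first under item resampling.

    item_scores maps agent -> list of per-item scores, with items in the
    same order for every agent.
    """
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    agents = list(item_scores)
    n_items = len(item_scores[agents[0]])
    wins = {agent: 0 for agent in agents}
    for _ in range(n_boot):
        # Draw one bootstrap replicate of item indices, shared by all agents.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        means = {a: sum(item_scores[a][i] for i in idx) / n_items for a in agents}
        wins[max(agents, key=lambda a: means[a])] += 1
    return {agent: wins[agent] / n_boot for agent in agents}
```

An agent whose rank-1 frequency stays near 1.0 holds a stable top position; frequencies spread across several agents indicate that the ordering is not separable from sampling noise.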
Key Results and Measurement Findings
Agreement and Overclaim Prevention: Aggregate panel agreement is robust: $\bar{\kappa}_w = 0.7168$ (95% CI [0.7006, 0.7330]), comfortably exceeding the publication gate. However, per-dimension analysis reveals rapid collapse: the constraint awareness dimension fails with $\bar{\kappa}_w = 0.2022$, demonstrating that high overall agreement can mask dimension-level arbitrariness.
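The masking effect is easy to see when the gate is applied per dimension rather than once on the aggregate. The sketch below uses only the two agreement values reported above and assumes the same 0.4 cutoff applies at the dimension level, which the summary does not state explicitly; the key names are hypothetical.

```python
def gate_by_dimension(kappa_by_dimension, threshold=0.4):
    """Return, for each dimension, whether its agreement clears the cutoff."""
    return {dim: kappa >= threshold for dim, kappa in kappa_by_dimension.items()}


# Values reported in the text; other rubric dimensions are omitted because
# their agreement figures are not given here.
reported = {"aggregate": 0.7168, "constraint_awareness": 0.2022}
verdicts = gate_by_dimension(reported)
failed = [dim for dim, ok in verdicts.items() if not ok]
```

Under these assumptions the aggregate passes while constraint awareness is withheld, so a claim about that dimension would not be publishable even though the headline agreement looks strong.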
Adversarial Control Outcomes: The protocol’s adversarial cells produce strong results that contradict a naively trusted leaderboard. While Cell A is correctly penalized (panel mean 1.44 vs. honest 4.35, $p < 10^{-3}$), Cell B (terse but correct rationales) receives a severe penalty relative to honest rationales ($\Delta = -2.81$ rubric points), a gap far beyond the minimum detectable effect (MDE). Thus, naive LLM-judging confounds rhetorical coverage with substantive adequacy: LLM panels systematically undervalue correct but terse reasoning.
Stability and Ranking: Only Claude Sonnet 4.6 maintains a stable rank-1 position across bootstrap and judge-ablation (LOFO) analyses. Lower agent ranks are not meaningfully separable; noise dominates judge-family-specific score gaps, and removal of individual judge families rapidly dissolves apparent distinctions (Spearman ρ for positions 2–4 collapses to 0.2). Single-judge protocols produce spurious, unstable distinctions.
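The judge-ablation diagnostic can be sketched as follows, reading LOFO as leave-one-family-out (an assumption based on the description of removing individual judge families). Each judge family is dropped in turn, agents are re-ranked on the remaining panel mean, and the ablated ranking is compared with the full-panel ranking via Spearman's rho. The data layout and function names are hypothetical, and the rho formula used is the tie-free form.

```python
def panel_mean(scores, exclude=None):
    """Mean score per agent over all judge families except `exclude`."""
    judges = [j for j in scores if j != exclude]
    agents = scores[judges[0]]
    return {a: sum(scores[j][a] for j in judges) / len(judges) for a in agents}


def rank_agents(mean_scores):
    """Map agent -> rank, where rank 1 is the highest mean (ties broken by name)."""
    ordered = sorted(mean_scores, key=lambda a: (-mean_scores[a], a))
    return {agent: i + 1 for i, agent in enumerate(ordered)}


def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for two tie-free rankings of the same agents."""
    n = len(rank_x)
    d2 = sum((rank_x[a] - rank_y[a]) ** 2 for a in rank_x)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def lofo_rank_stability(scores):
    """Rho between the full-panel ranking and each leave-one-family-out ranking."""
    full = rank_agents(panel_mean(scores))
    return {
        judge: spearman_rho(full, rank_agents(panel_mean(scores, exclude=judge)))
        for judge in scores
    }
```

Here `scores` maps judge family to a dictionary of agent mean scores. A rho near 1 for every ablation suggests the ranking does not depend on any single judge family, while low values flag distinctions that one family is driving.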
Anchor Specificity: An anchor-specificity probe on the constraint awareness dimension underlines the importance of operationalizing rubric anchors with explicit, quantitative requirements. Weak or verbal-only anchors promote judge-family ambiguity and measurement noise.
Implications: Theoretical and Practical
The theoretical implication is a substantive critique of off-the-shelf LLM-judged leaderboards in financial AI. ValueBlindBench demonstrates that leaderboard-like outputs dramatically overstate the reliability and interpretability of LLM-judged rationales unless strict measurement discipline—ensemble auditing, adversarial control, and per-dimension gates—is enforced. Notably, it extends classical metrology perspectives (Cronbach–Meehl, Messick) into the AI-evaluated finance domain: reproducibility and agreement are necessary but not sufficient for claim validity.
Practically, the ValueBlindBench verdict tuple can be adopted as a standardized reporting object for AI-finance model evaluation, procurement, and governance in the pre-realization regime. This explicit foregrounding of claim permission, stability, and contamination serves as a direct safeguard against surface-plausibility selection, leaderboard inflation, and style/anchor-induced failure modes.
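As an illustration of what such a reporting object might look like, the sketch below models the verdict fields named earlier (scope, agreement status, stability, adversarial robustness, publication permission) as a small immutable record. The field names are hypothetical, and the rule that permission requires all three checks to pass is an assumption rather than the paper's stated logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical multifield verdict for one evaluation claim."""

    claim: str
    scope: str
    agreement_ok: bool          # inter-judge agreement cleared its gate
    stable: bool                # ranking or score survived stability diagnostics
    adversarially_robust: bool  # claim survived the adversarial cells

    @property
    def publication_permitted(self) -> bool:
        # Assumed rule: every gate must pass before the claim may be published.
        return self.agreement_ok and self.stable and self.adversarially_robust


example = Verdict(
    claim="constraint awareness scores separate agents",
    scope="capital-allocation prototype",
    agreement_ok=False,
    stable=True,
    adversarially_robust=True,
)
```

Reporting a record like this instead of a single score keeps the reason for a withheld claim visible to whoever consumes the evaluation.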
For rubric and protocol designers, the anchor-specificity findings argue that every financial rubric dimension should be explicitly grounded in operationalizable, context-specific constructs to reduce judge-family variance—e.g., requiring explicit constraint arithmetic rather than abstract constraint “consideration.”
Limitations and Future Work
The authors explicitly delimit their claims: ValueBlindBench does not attempt to replace realized returns as the final arbiter, does not validate latent agent reasoning, and is presently scoped to a capital-allocation prototype, not full fundamental equity research. Key areas for extension include: expert panel benchmarks (with ValueBlindBench gating), multi-asset and multi-modal settings, rationale-length normalization to mitigate verbosity bias, and adversarial tests for paraphrase and arithmetic robustness.
The current protocol is sensitive to potential gaming if made public, necessitating held-out rubric variants and post-hoc P&L cross-checks prior to production deployment.
Conclusion
ValueBlindBench provides a rigorous, preregistered discipline for the publication and interpretation of LLM-judged investment rationale claims in the delayed-outcome, pre-realization regime. It surfaces and enforces explicit conditions under which LLM-panel outputs may or may not be acted upon, prioritizing measurement stability, dimension-level reliability, and adversarial robustness over naive score reporting. The protocol's findings challenge the reliability of both single-judge and raw aggregate leaderboards, highlighting rhetorical-coverage bias and dimension-specific ambiguity as practical risks. Adoption of similar adversarial-gated, multi-ensemble protocols is recommended for future AI-finance evaluation infrastructure.