Hedge-Bench: Robust Benchmarking Suite

Updated 9 June 2026

Hedge-Bench is a suite of diverse, rigorously benchmarked frameworks and datasets designed to evaluate hedged decision-making across domains such as financial reasoning, packet classification, machine learning, and risk management.
It employs explicit evaluation protocols, expert-derived rubrics, entropy tests, and conformal prediction methods to measure adversarial robustness and ensure reproducibility.
Its protocols include deterministic scoring for AI-driven financial tasks, calibrated entropy metrics for network security, and dynamic programming solutions for risk-minimizing hedging strategies.

Hedge-Bench is a term applied to several rigorously benchmarked frameworks, datasets, and evaluation protocols, each rooted in a distinct domain: (1) robust AI-agent financial reasoning (Cho et al., 2 Jun 2026), (2) high-entropy packet classification (Casino et al., 2019), (3) hedged prediction in machine learning via conformal prediction [0611011], (4) benchmarking risk-minimizing hedging strategies in partial-information markets (Ceci et al., 2013), (5) optimal multibenchmark risk-hedging with constraints (Jiao et al., 2013), and (6) hallucination detection for multimodal vision-language models (Gautam et al., 16 Nov 2025). All share the methodological theme of evaluating “hedged” or adversarially robust behaviors, whether predictions, strategies, or decisions, against explicit or implicit benchmarks.

1. Hedge-Bench for Financial Reasoning Agents

The “Hedge-Bench” suite, as introduced in "Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning" (Cho et al., 2 Jun 2026), targets open-ended reasoning in professional investment research environments. Unlike prior tool-use or QA datasets, Hedge-Bench 1.0 consists of 102 real, on-the-job research tasks with complete expert analyst reasoning traces and structured rubrics derived from full diligence transcripts. Each task involves an adversarial, multi-step research challenge set in a deterministic Docker environment containing real, time-stamped filings and curated news, with agents evaluated on their ability to (1) ground claims in file-derived evidence, (2) structurally cover all expert-identified analytical “themes” (where each theme requires multiple “moves”), and (3) synthesize disparate or conflicting findings.

Scoring is deterministic and multi-dimensional: theme coverage is tallied using $\tau = \max(1, \min(n-1, 3))$ for $n$ submoves per theme, with strict requirements that “grounded” (non-hallucinated) evidence is necessary for credit; synthesis is rewarded at the highest rubric tier. Macro-averaged pass@1 (“perfect solution in one run”) remains below $16\%$ for frontier models (e.g., Claude-Opus, GPT-5.5), with mean dense scores rarely exceeding $2/4$. Notably, the hardest categories involve forward-looking, ungroundable risk analysis, while categories like valuation afford higher success rates due to more document-anchored inference. Hallucination detection and error attribution are integral: hallucinated moves, uncovered themes, and lack of synthesis are tabulated, providing granular diagnostic signals.
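The threshold formula above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's grader: it assumes (the source does not state this explicitly) that a theme counts as covered once at least $\tau$ grounded moves are matched, and the names `coverage_threshold` and `theme_covered` are hypothetical.

```python
def coverage_threshold(n_moves: int) -> int:
    """Return tau = max(1, min(n - 1, 3)) for a theme with n submoves."""
    if n_moves < 1:
        raise ValueError("a theme needs at least one submove")
    return max(1, min(n_moves - 1, 3))


def theme_covered(n_moves: int, grounded_hits: int) -> bool:
    """Assumed rule: a theme is covered when grounded hits reach tau.

    Only grounded (non-hallucinated) moves should be counted in
    grounded_hits, mirroring the rule that ungrounded evidence earns
    no credit.
    """
    return grounded_hits >= coverage_threshold(n_moves)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 6):
        print(f"n={n} submoves -> tau={coverage_threshold(n)}")
```

Under this reading, the threshold saturates at three grounded moves however many submoves a theme has, while themes with one or two submoves need only a single grounded move.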

The experimental harness leverages the Harbor format and Terminus execution/judging stack, with all agent trials fully reproducible via open-source code (github.com/Trata-Inc/trata-hedge-bench). Limitations include analyst subjectivity in rubric generation, potential for rubric drift, fixed document pools, and LLM-based judge limitations.

2. Hedge-Bench in High-Entropy Network Packet Classification

In encrypted/compressed traffic classification, “Hedge-Bench” refers to a systematic dataset and protocol (Casino et al., 2019) underpinning the HEDGE (High Entropy DistinGuishEr) algorithm. The dataset construction encompasses six canonical file types—images, text, PDFs, audio, video, executables—drawn from public sources. Each raw file is transformed into ten variants via AES/Camellia encryption and ZIP/RAR/BZIP2/GZIP compression, then split into blocks of size 1–64 KB (powers of two). The resulting dataset is strictly balanced by transformation type and block size, yielding equal volumes (≈1 GB/class/size) and hundreds of thousands of packets per configuration.
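The block-splitting step can be sketched as follows. This is an illustrative helper rather than the dataset's actual tooling: the name `split_blocks` is hypothetical, 1 KB is assumed to mean 1024 bytes, and dropping a trailing partial block is an assumption made here to keep every block the same length.

```python
from typing import Iterator

# Block sizes in KB as powers of two: 1, 2, 4, 8, 16, 32, 64.
BLOCK_SIZES_KB = [2 ** k for k in range(7)]


def split_blocks(data: bytes, size_kb: int) -> Iterator[bytes]:
    """Yield consecutive full blocks of size_kb * 1024 bytes.

    A trailing partial block is dropped (an assumption of this sketch).
    """
    if size_kb not in BLOCK_SIZES_KB:
        raise ValueError(f"unsupported block size: {size_kb} KB")
    size = size_kb * 1024
    for start in range(0, len(data) - size + 1, size):
        yield data[start:start + size]
```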

Benchmarking proceeds using a set of low-latency randomness tests: Shannon entropy, chi-square ($\chi^2$) goodness-of-fit, and three NIST SP 800-22 battery subtests (frequency, cumulative sums, approximate entropy). Classification thresholds are empirically calibrated (e.g., $|\chi^2-\mu_1| \leq \gamma \cdot \sigma_1$, $x_2$ in $\{<1\%, >99\%\}$, $x_3=0$ failed blocks).
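Two of these tests are simple enough to sketch with the Python standard library. The code below computes byte-level Shannon entropy and the chi-square statistic against a uniform byte distribution, then applies a band rule of the form $|\chi^2-\mu_1| \leq \gamma \cdot \sigma_1$. The function names are hypothetical, and the demo uses the theoretical mean (255) and standard deviation ($\sqrt{510}$) of a chi-square variable with 255 degrees of freedom, plus $\gamma = 3$, purely as stand-ins for the empirically calibrated values.

```python
import math
import os
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (between 0 and 8)."""
    if not block:
        raise ValueError("empty block")
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())


def chi_square_uniform(block: bytes) -> float:
    """Chi-square statistic of byte counts against a uniform distribution."""
    if not block:
        raise ValueError("empty block")
    expected = len(block) / 256
    counts = Counter(block)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))


def within_band(stat: float, mu: float, sigma: float, gamma: float) -> bool:
    """Band rule |stat - mu| <= gamma * sigma with calibrated mu, sigma, gamma."""
    return abs(stat - mu) <= gamma * sigma


if __name__ == "__main__":
    block = os.urandom(64 * 1024)  # random bytes as a stand-in for ciphertext
    chi2 = chi_square_uniform(block)
    # Stand-in parameters (assumption): theoretical chi-square moments, gamma = 3.
    print(shannon_entropy(block), chi2, within_band(chi2, 255.0, math.sqrt(510), 3.0))
```

Both statistics need only a single pass over the block to build a byte histogram, which is what keeps tests of this kind cheap per packet.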

Evaluation uses repeated inverse 10-fold splits for robust estimation of accuracy, recall, precision, and F1. Reported accuracy varies strongly with block size, rising from $\sim 68.7\%$ at 1 KB to a markedly higher value at 64 KB, and outstrips deep CNN baselines on comparable tests. The computational pipeline is $O(n)$ per block of $n$ bytes (with early exit on feature threshold failure) and embarrassingly parallelizable.
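The evaluation split can be sketched as below. The interpretation of "inverse" 10-fold is an assumption here: each round trains on a single fold and tests on the remaining nine, the reverse of ordinary 10-fold cross-validation. The names `inverse_kfold` and `f1_score` are hypothetical helpers, not part of any published code.

```python
import random
from typing import Iterator, List, Tuple


def inverse_kfold(n_samples: int, k: int = 10, seed: int = 0
                  ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs: one fold trains, k - 1 folds test."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, train in enumerate(folds):
        test = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Repeating the procedure with different seeds gives the repeated estimates from which mean accuracy, precision, recall, and F1 can be reported.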

3. Hedge-Bench: Hedged Prediction Benchmarking in Machine Learning

The archetype “Hedge-Bench” framework in machine learning [0611011] benchmarks hedged (conformal) predictors for diverse base learning algorithms. In this setting, the goal is to produce for each input $x$ a prediction set $\Gamma^{\epsilon}(x)$ with explicit, provable coverage guarantees: $\mathbb{P}\left(y \in \Gamma^{\epsilon}(x)\right) \geq 1 - \epsilon$, holding for exchangeable data and any choice of the miscoverage parameter $\epsilon \in (0, 1)$. The conformal predictor utilizes a nonconformity measure $A$, computes leave-one-out (or split-conformal) scores over calibration data, and inverts these to construct prediction sets or intervals for new data points.
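A minimal split-conformal sketch for regression illustrates the inversion step. It assumes the absolute residual as the nonconformity score and takes an arbitrary fitted point predictor `predict`; both are illustrative choices, not details taken from the source. The threshold is the $\lceil (n+1)(1-\epsilon) \rceil$-th smallest calibration score, the standard finite-sample quantile for exchangeable data.

```python
import math
from typing import Callable, Sequence, Tuple


def conformal_quantile(scores: Sequence[float], epsilon: float) -> float:
    """Return the ceil((n + 1)(1 - epsilon))-th smallest calibration score.

    Returns +inf when that rank exceeds n, meaning there are too few
    calibration points to certify the requested coverage.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - epsilon))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def split_conformal_interval(
    predict: Callable[[float], float],
    calibration: Sequence[Tuple[float, float]],
    x_new: float,
    epsilon: float,
) -> Tuple[float, float]:
    """Prediction interval for x_new from held-out (x, y) calibration pairs."""
    # Nonconformity score (assumed here): absolute residual of the predictor.
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, epsilon)
    centre = predict(x_new)
    return centre - q, centre + q
```

With only three calibration points and $\epsilon = 0.1$, the required rank is 4, so the sketch returns an unbounded interval instead of silently under-covering.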

The benchmarking protocol (Hedge-Bench) tracks empirical coverage (the fraction of test points whose true label lies inside the prediction set), average set size, efficiency (normalized set size), and computation time, sweeping over nominal confidence levels $1-\epsilon$, where $\epsilon$ is the miscoverage parameter, and base learners (SVM, kernel ridge, nearest neighbor, etc.). Core code snippets, LaTeX formulas, and data structures are provided to support reproducible research and experimental comparisons.
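The two headline metrics can be sketched for interval-valued predictions as follows. The function names and the `coverage_report` layout are hypothetical, and the sketch covers only coverage and width, not timing or normalized efficiency.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]


def empirical_coverage(intervals: Sequence[Interval], targets: Sequence[float]) -> float:
    """Fraction of targets that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, targets))
    return hits / len(targets)


def average_width(intervals: Sequence[Interval]) -> float:
    """Mean interval width, used here as the set-size measure."""
    return sum(hi - lo for lo, hi in intervals) / len(intervals)


def coverage_report(
    results: Dict[float, Sequence[Interval]], targets: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Rows of (nominal level, empirical coverage, average width), sorted by level."""
    return [
        (level, empirical_coverage(ivs, targets), average_width(ivs))
        for level, ivs in sorted(results.items())
    ]
```

A well-calibrated predictor should show empirical coverage close to or above each nominal level, so the interesting comparison between base learners is usually the width column.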

4. Hedge-Bench in Financial Mathematics: Benchmark Approach to Risk-Minimizing Hedging

The “Hedge-Bench” concept in the context of risk-minimizing hedging under partial market information (Ceci et al., 2013) formalizes optimal portfolio construction in continuous-time, incomplete semimartingale markets. The market comprises $d$ traded assets, including a numéraire portfolio $S^{*}$ such that any self-financing wealth process $V$, when benchmarked ($\hat{V} = V / S^{*}$), evolves as a supermartingale under the real-world measure $\mathbb{P}$. The risk-minimization objective is to hedge a claim $H$ using only information in a restricted filtration $\mathbb{H} \subseteq \mathbb{F}$.

Optimal strategies are extracted via the Galtchouk–Kunita–Watanabe (GKW) decomposition of the benchmarked claim $\hat{H} = H / S^{*}_T$: under full information, the decomposition yields

$\hat{H} = \hat{H}_0 + \int_0^T \xi^{\mathbb{F}}_u \, d\hat{S}_u + L^{\mathbb{F}}_T$

with $\hat{S} = S / S^{*}$ the benchmarked asset prices, $\xi^{\mathbb{F}}$ the $\mathbb{F}$-predictable integrand, and $L^{\mathbb{F}}$ a martingale strongly orthogonal to $\hat{S}$. Under restricted information, the $\mathbb{H}$-predictable integrand $\xi^{\mathbb{H}}$ is computed from $\xi^{\mathbb{F}}$ via predictable dual projections, yielding

$\xi^{\mathbb{H}}_t = \frac{d\left(\int_0^t \xi^{\mathbb{F}}_u \, d\langle \hat{S} \rangle_u\right)^{p,\mathbb{H}}}{d\langle \hat{S} \rangle^{p,\mathbb{H}}_t}$

for $t \in [0, T]$, given quadratic covariation matrices $\langle \hat{S} \rangle$. This approach generalizes to Markovian jump-diffusion models where the price dynamics depend on latent stochastic factors.
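A toy finite-scenario illustration may help build intuition for the projection step. It rests on strong simplifying assumptions that are not taken from the source: a single asset, a fixed time, a quadratic-variation density $a > 0$ (so that the bracket of the benchmarked price grows like $a\,dt$), and restricted information that reveals only a coarse `signal`. Under these assumptions the restricted-information integrand reduces to the ratio $\mathbb{E}[\xi a \mid \text{signal}] / \mathbb{E}[a \mid \text{signal}]$, where $\xi$ is the full-information integrand. The name `restricted_integrand` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple

# One scenario: (observed signal, probability, full-information integrand, density a).
Scenario = Tuple[Hashable, float, float, float]


def restricted_integrand(scenarios: Iterable[Scenario]) -> Dict[Hashable, float]:
    """Return E[xi * a | signal] / E[a | signal] for each observed signal.

    Assumes a > 0 in every scenario so that the denominator is positive.
    """
    num: Dict[Hashable, float] = defaultdict(float)
    den: Dict[Hashable, float] = defaultdict(float)
    for signal, prob, xi_full, a in scenarios:
        num[signal] += prob * xi_full * a
        den[signal] += prob * a
    return {signal: num[signal] / den[signal] for signal in den}


if __name__ == "__main__":
    scenarios = [
        ("up", 0.25, 1.0, 1.0),
        ("up", 0.25, 3.0, 3.0),
        ("down", 0.50, 2.0, 1.0),
    ]
    print(restricted_integrand(scenarios))
```

In the demo, the two "up" scenarios have full-information integrands 1 and 3, but the restricted integrand is 2.5 and not their plain average of 2, because scenarios with a larger density $a$ carry more weight in the projection.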

5. Hedge-Bench for Multibenchmark Risk-Constrained Hedging

In "Hedging under multiple risk constraints" (Jiao et al., 2013), “Hedge-Bench” encapsulates solving for minimum-cost portfolios that outperform a vector of stochastic benchmarks under expected shortfall constraints. The financial market is complete, with $d$ risky assets and one riskless asset, and the agent must cover stochastic liabilities $P_1, \dots, P_n$ at scheduled times $t_1 < \dots < t_n$. Loss is measured by a convex, decreasing function $\ell$ of the surplus $V_{t_i} - P_i$ (e.g., call loss, exponential loss), and at each benchmark date $t_i$ the expected loss $\mathbb{E}[\ell(V_{t_i} - P_i)]$ cannot exceed threshold $\alpha_i$.

Three constraint modes are formalized:

European (EU): global expectation constraints at each $t_i$;
Time-consistent (TC): conditional expectation constraints at each $t_i$ given $\mathcal{F}_{t_{i-1}}$;
Lookback (LB): expected maximum shortfall constraint.

The admissible portfolio sets of the three modes are related by strict inclusions, with a corresponding ordering of the minimal hedging costs. Resolution is achieved via dynamic programming (Bellman recursions) that handle either unconditional or conditional budget tracking, adapting to non-Markovian settings. Explicit solutions are derived in special cases, including exponential loss, while computationally tractable recursions exist for more general cases.
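The three constraint modes can be compared on a small two-date scenario tree. The sketch below only checks whether a given portfolio satisfies each mode; it does not solve the minimum-cost problem. The data layout, function names, and thresholds are all hypothetical, and "surplus" denotes portfolio value minus liability at each date.

```python
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

# One scenario: (node reached at the first date, probability,
#                surplus at the first date, surplus at the second date).
Path = Tuple[str, float, float, float]
Loss = Callable[[float], float]


def call_loss(x: float) -> float:
    """Shortfall loss max(-x, 0): convex and decreasing in the surplus x."""
    return max(-x, 0.0)


def european_ok(paths: Sequence[Path], loss: Loss, alphas: Tuple[float, float]) -> bool:
    """EU: unconditional expected loss at each date stays below its threshold."""
    e1 = sum(p * loss(x1) for _, p, x1, _ in paths)
    e2 = sum(p * loss(x2) for _, p, _, x2 in paths)
    return e1 <= alphas[0] and e2 <= alphas[1]


def time_consistent_ok(paths: Sequence[Path], loss: Loss, alphas: Tuple[float, float]) -> bool:
    """TC: the second-date constraint must hold conditionally at every first-date node."""
    e1 = sum(p * loss(x1) for _, p, x1, _ in paths)
    mass: Dict[str, float] = defaultdict(float)
    num: Dict[str, float] = defaultdict(float)
    for node, p, _, x2 in paths:
        mass[node] += p
        num[node] += p * loss(x2)
    return e1 <= alphas[0] and all(num[n] / mass[n] <= alphas[1] for n in mass)


def lookback_ok(paths: Sequence[Path], loss: Loss, alpha: float) -> bool:
    """LB: expected maximum loss over the two dates stays below alpha."""
    return sum(p * max(loss(x1), loss(x2)) for _, p, x1, x2 in paths) <= alpha
```

For instance, with paths `[("u", 0.5, 1.0, 1.0), ("d", 0.5, 0.0, -1.0)]` and thresholds `(0.6, 0.6)`, the European check passes because the unconditional expected shortfall at the second date is 0.5, while the time-consistent check fails because the conditional shortfall at node `"d"` is 1.0. This shows concretely why the conditional formulation is the more demanding of the two.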

6. Hedge-Bench in Multimodal Hallucination Detection

The “hedge-bench” library in vision-language research (Gautam et al., 16 Nov 2025) implements modular hallucination detection via robust geometric entropy measures. The pipeline executes visual perturbations (e.g., affine, noise, color-jitter), samples model responses at high and low temperatures, clusters answers semantically (embedding- or NLI-based techniques), and computes metrics such as Semantic Entropy (SE), RadFlag, and VASE (Vision-Amplified Semantic Entropy). Resulting scores quantify consistency and robustness of VQA models to input-level perturbations. The framework is explicitly designed for reproducible benchmarking, extensibility (custom distortions, clustering), and compute-aware evaluation.
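The clustering-then-entropy step can be sketched in its simplest discrete form: group sampled answers into meaning clusters and take the Shannon entropy of the cluster frequencies. This is not the library's implementation. The equivalence test `same_meaning` is a hypothetical stand-in for embedding- or NLI-based clustering, and the perturbation-aware scores (RadFlag, VASE) are not reproduced here.

```python
import math
from typing import Callable, List, Sequence


def cluster_answers(
    answers: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose representative matches."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    clusters = cluster_answers(answers, same_meaning)
    total = len(answers)
    return -sum(len(c) / total * math.log(len(c) / total) for c in clusters)


if __name__ == "__main__":
    def normalized_match(a: str, b: str) -> bool:
        # Toy equivalence (assumption): case- and whitespace-insensitive match.
        return a.strip().lower() == b.strip().lower()

    print(semantic_entropy(["Yes", "yes ", "no", "YES"], normalized_match))
```

Low entropy means the sampled answers agree in meaning even when their surface forms differ; higher entropy across samples or perturbations is the signal such pipelines treat as a hallucination risk.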

Summary Table: Hedge-Bench Contexts

Domain	Core Object	Benchmark Protocol Key Features
Financial Reasoning (Cho et al., 2 Jun 2026)	Open-ended agent reasoning	102 real tasks, expert rubric, deterministic grading
Packet Classification (Casino et al., 2019)	High-entropy traffic	Encrypted/compressed, entropy/randomness metrics
Conformal Prediction [0611011]	Hedged prediction sets	Provable coverage, split-conformal, efficiency report
Risk-Minimization Hedging (Ceci et al., 2013)	Partial-info portfolio	GKW decomposition, dual projections, explicit formulae
Multi-Risk Hedging (Jiao et al., 2013)	Benchmark outperformance	EU/TC/LB constraints, DP recursion, closed-form and DP solutions
VQA Hallucination (Gautam et al., 16 Nov 2025)	Multimodal model robustness	Perturbation sampling, clustering, VASE/SE metrics

Each Hedge-Bench instance is unified by a methodological orientation towards rigorous, adversarially framed benchmarking, whether of agent behavior, statistical robustness, or hedge strategy, grounded in transparent protocols and explicit performance metrics. The Hedge-Bench suite for financial reasoning in particular is positioned as a reference benchmark for evaluating open-ended, document-grounded, stepwise decision-making in high-stakes domains.