Herculean: An Agentic Benchmark for Financial Intelligence

Published 14 May 2026 in cs.AI and cs.CL | (2605.14355v1)

Abstract: As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

Authors (64)

Summary

The paper introduces HERCULEAN, a benchmark that evaluates AI agents on realistic, multi-step financial workflows like trading, hedging, market insights, and auditing.
The paper demonstrates that agent execution frameworks significantly affect performance, with robust controller designs outperforming mere backbone scaling in complex financial tasks.
The paper reveals critical gaps in current financial AI, urging the integration of state management and deterministic verification to meet professional-grade financial standards.

Motivation and Benchmark Design

HERCULEAN introduces a comprehensive benchmark for evaluating AI agents across a heterogeneous set of high-fidelity, end-to-end financial workflows. The principal motivation is to move beyond atomized tasks such as question answering or isolated document retrieval and toward rigorous assessment of whether agents can execute workflows that reflect professional financial work. Existing benchmarks largely restrict evaluation to static, reductionist tasks, which cannot capture the interaction dynamics, state consistency, and verification demands that characterize financial decision-making.

HERCULEAN instantiates four canonical agentic workflows: Trading, Hedging, Market Insights, and Auditing. Each is realized as a skill-based environment grounded in the Model Context Protocol (MCP), encapsulating task-specific tools, constraints, and procedural logic. This design yields architecture-agnostic evaluation, facilitating consistent comparisons across diverse agent and backbone model configurations while maintaining operational fidelity.

Workflows and Environment Structure

Trading

The Trading workflow operationalizes single-asset trading under realistic market conditions. Agents make daily buy/sell/hold decisions over a multi-month horizon, informed by multi-modal signals including OHLCV prices, corporate filings, and financial news. The environment precludes information leakage through strict chronological enforcement and requires sequential commitment, preventing retroactive modification of decisions. Performance is measured using standard quantitative finance metrics—cumulative return, Sharpe ratio, and maximum drawdown—allowing for risk-adjusted outcome comparisons.
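The three Trading metrics named above can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the function names are hypothetical, and the annualization factor (252 trading days), the zero risk-free rate, and the use of the sample standard deviation are assumptions that the summary does not specify.

```python
import math


def cumulative_return(daily_returns):
    """Compound simple daily returns into one total return for the horizon."""
    total = 1.0
    for r in daily_returns:
        total *= 1.0 + r
    return total - 1.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio (assumed convention: sample std, 252 days)."""
    excess = [r - risk_free_daily for r in daily_returns]
    n = len(excess)
    if n < 2:
        return 0.0
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if variance == 0.0:
        return 0.0  # no volatility observed, so the ratio is undefined; report 0
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough fall of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


if __name__ == "__main__":
    returns = [0.10, -0.50, 0.20]  # made-up daily returns for illustration
    print(cumulative_return(returns), sharpe_ratio(returns), max_drawdown(returns))
```

Reporting all three together is what allows a risk-adjusted comparison: two agents with the same cumulative return can differ sharply in drawdown.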

Hedging

Hedging focuses on pair trading, demanding agents select and manage a market-neutral position across correlated equities. The agent must conduct pair selection at horizon initiation, then manage exposure via day-by-day position updates (long-short, hold, close), exploiting relative mispricing rather than absolute predictions. This workflow emphasizes cross-asset reasoning, relational judgment, and persistent position tracking—capabilities rarely probed by prior financial agent benchmarks.
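A rough sketch of the position bookkeeping this workflow requires is given below. It assumes a simplified state with only "flat" and "long the first leg, short the second", equal dollar sizing on both legs, and that an action takes effect on the same day's return; the function names and these conventions are illustrative assumptions, not the benchmark's interface.

```python
def next_position(position, action):
    """Update the pair position from a daily action.

    position: 1 means long leg A and short leg B, 0 means flat.
    """
    if action == "LONG_SHORT":
        return 1
    if action == "HOLD":
        return position
    if action == "CLOSE":
        return 0
    raise ValueError(f"unknown action: {action}")


def pair_daily_return(position, ret_a, ret_b):
    """One-day return of the pair on gross capital.

    Each leg uses half of the capital, so long and short dollars cancel
    and only the difference between the two legs is earned.
    """
    return position * 0.5 * (ret_a - ret_b)


def simulate(actions, returns_a, returns_b):
    """Carry the position across days and collect daily pair returns."""
    position = 0
    daily = []
    for action, ra, rb in zip(actions, returns_a, returns_b):
        position = next_position(position, action)
        daily.append(pair_daily_return(position, ra, rb))
    return daily


if __name__ == "__main__":
    # Made-up returns: both legs rise 3%, so the hedged position earns nothing.
    print(pair_daily_return(1, 0.03, 0.03))
```

The example shows why the workflow rewards relative rather than absolute prediction: a market-wide move that lifts both legs equally leaves the pair return at zero, and a forgotten open position keeps accruing profit or loss until it is closed.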

Market Insights

Market Insights evaluates the synthesis, structuring, and presentation of weekly investment research reports. Agents must aggregate signals and generate markdown reports that are multi-sectioned, logically structured, and grounded in evidence, including an explicit investment rating. Performance is assessed on both simulated trading outcomes derived from agent-issued ratings and comprehensive rubric-based evaluations of report quality, targeting structure, content accuracy, evidence grounding, and reasoning depth.
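To make "simulated trading outcomes derived from agent-issued ratings" concrete, here is one possible way a weekly rating could be turned into a return series. The mapping is entirely hypothetical: the summary only names a scale running from STRONG_BUY to STRONG_SELL, so the intermediate labels, the exposure values, and the choice to apply a rating to the following week's return are all assumptions.

```python
# Hypothetical five-point scale and exposures (assumption, not from the paper).
RATING_TO_EXPOSURE = {
    "STRONG_BUY": 1.0,
    "BUY": 0.5,
    "HOLD": 0.0,
    "SELL": -0.5,
    "STRONG_SELL": -1.0,
}


def strategy_returns(weekly_ratings, next_week_returns):
    """Scale each following-week asset return by the exposure of the rating."""
    if len(weekly_ratings) != len(next_week_returns):
        raise ValueError("need exactly one return per rating")
    return [
        RATING_TO_EXPOSURE[rating] * ret
        for rating, ret in zip(weekly_ratings, next_week_returns)
    ]


if __name__ == "__main__":
    ratings = ["STRONG_BUY", "HOLD", "STRONG_SELL"]
    returns = [0.02, -0.03, -0.04]  # made-up weekly returns
    print(strategy_returns(ratings, returns))
```

Whatever mapping is used, the resulting series can be scored with the same return and risk metrics as the Trading workflow, which is why report quality and trading outcome can diverge.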

Auditing

Auditing models deterministic, calculation-based financial verification, requiring agents to identify and recompute specific numeric facts from multi-document XBRL filings according to the US-GAAP taxonomy, incorporating calculation network reasoning and concept semantics. This is the most verification-intensive workflow, demanding strict structural validity, robust extraction, and precise deterministic computation. Evaluation decomposes accuracy into granular error rates spanning structural failures, extraction errors, and calculation inaccuracies.
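The recomputation step can be illustrated with a toy calculation network. In an XBRL calculation linkbase, a parent concept is the weighted sum of its children, with weights typically +1 or -1. The sketch below assumes a single hand-written link and made-up values; the helper names are hypothetical, and real filings add contexts, units, and rounding tolerances that are ignored here.

```python
from decimal import Decimal

# Toy calculation links: parent concept -> [(child concept, weight)].
CALC_LINKS = {
    "GrossProfit": [("Revenues", Decimal(1)), ("CostOfRevenue", Decimal(-1))],
}


def recompute_from_children(concept, facts, links):
    """Weighted sum of the reported child facts for one parent concept."""
    return sum(
        (weight * facts[child] for child, weight in links[concept]),
        Decimal(0),
    )


def audit_fact(concept, facts, links):
    """Compare the reported value with the value implied by its children."""
    reported = facts[concept]
    expected = recompute_from_children(concept, facts, links)
    return {
        "concept": concept,
        "reported": reported,
        "expected": expected,
        "is_error": reported != expected,
    }


if __name__ == "__main__":
    facts = {  # made-up values for illustration
        "Revenues": Decimal("1000"),
        "CostOfRevenue": Decimal("600"),
        "GrossProfit": Decimal("450"),
    }
    print(audit_fact("GrossProfit", facts, CALC_LINKS))
```

Using `Decimal` rather than floats keeps the comparison exact, which matters when the task is to decide whether a reported figure is off by any amount.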

Agent and Backbone Evaluation

The benchmark systematically evaluates five agent frameworks (ReAct Agent, Claude Code, Codex, Hermes, OpenClaw), each paired with four foundation models (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-397B-A17B, Qwen3.5-27B). This grid surfaces not only the isolated impact of backbone scaling but, critically, the effects of agent framework design—particularly in task orchestration, tool integration, and execution trajectory stability.

Notably, all agents operate exclusively in context, with persistent memory, parametric retrieval, and web search disabled, isolating in-context reasoning and workflow execution fidelity as the sole axes of evaluation.

Empirical Analysis and Key Findings

Workflow-Dependent Capability Gaps

A central empirical finding is that financial reasoning competence remains highly workflow-dependent. Agents perform relatively well in Trading and Market Insights, particularly on generative reporting and lightweight tool use: some configurations (e.g., ReAct Agent with Sonnet) achieve Market Insights rubric scores above 9.0 out of 10, and agents often surpass buy-and-hold in cumulative return on select assets. However, performance deteriorates markedly in Hedging and Auditing, where the requirements for cross-asset relational reasoning, persistent state management, and deterministic calculation exceed what current agent frameworks and foundation models deliver.

Importance of Execution Framework

Results indicate that the agent execution framework exerts a substantial influence independent of the backbone LLM. For instance, in Auditing with the same Sonnet backbone, Claude Code and OpenClaw achieve 66.15% accuracy, while ReAct Agent and Hermes are limited to 20.00%, predominantly because of chronic structural errors and invalid trajectories. CLI-oriented and skill-centric frameworks consistently sustain interaction stability and correct tool orchestration in verification-heavy workflows, whereas ReAct-type loops are prone to collapse under prolonged tool chaining.
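The size of this gap is easier to see as instance counts. Assuming the Auditing set contains the 65 instances mentioned in the Knowledge Gaps list further down this page (an inference, since the summary does not state the denominator here), the two percentages correspond to 43 and 13 correct answers, a difference of 30 instances on the same backbone. The helper name is hypothetical.

```python
N_INSTANCES = 65  # assumption: taken from the "65 instances" in the Knowledge Gaps list


def implied_correct(accuracy_pct, n_instances):
    """Number of correct answers implied by a percentage accuracy."""
    return round(accuracy_pct / 100 * n_instances)


if __name__ == "__main__":
    for pct in (66.15, 20.00):
        k = implied_correct(pct, N_INSTANCES)
        # Convert back to a percentage to confirm the count reproduces the figure.
        print(f"{pct:.2f}% of {N_INSTANCES} -> {k} correct ({k / N_INSTANCES:.2%})")
```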

Backbone Scaling is Insufficient

Larger and more capable backbones (Sonnet, GPT-5.4) generally outperform the Qwen variants in narrative synthesis, sequential trading, and report generation. Nevertheless, model scaling alone does not ensure robust workflow proficiency: agents with high generative fluency may still exhibit fundamental deficiencies in structured verification, relational judgment, and deterministic error correction. What matters is how dynamic execution control and backbone reasoning interact.

Substantial Shortfalls in Professional Financial Labor

No agent-model pair achieves dominant performance across all workflows or assets. Execution failures, unstable behavior, and inconsistent results persist even under standardized environments. Deterministic financial verification (Auditing) remains acutely challenging: only the strongest configurations approach two-thirds accuracy, and weaker or poorly matched frameworks fail catastrophically and often.

Implications and Theoretical Significance

HERCULEAN exposes critical deficiencies in the current generation of agentic systems when tasked with reliably executing professional-grade financial workflows. The empirical evidence underscores intrinsic limitations arising not merely from raw financial knowledge or language modeling, but from the orchestrated control structure required for state-preserving, tool-mediated, and workflow-constrained execution.

From a methodological perspective, the benchmark motivates further research on agent designs that couple high-capacity LLMs with robust execution controllers, memory architectures, and relational modules optimized for persistent, stateful, and verifiable reasoning. The observed failure modes—particularly in deterministic verification and cross-asset judgment—highlight the need for hybrid approaches that tightly integrate symbolic computation, automated tool chaining, and rigorous state management.

Practically, HERCULEAN offers a reproducible substrate for comparative evaluation, methodology development, and error analysis in the context of high-stakes financial automation. Benchmarking on agentic workflows surfaces failure modes that are not evident in static QA or document-centric tasks, thus informing risk analysis, regulatory compliance, and deployability in financial domains.

Future Directions

Key areas for further exploration include: expanding workflow and domain coverage to incorporate credit risk analysis, portfolio optimization, and global, multi-asset regimes; designing agentic controllers that afford fine-grained state tracking and compositional tool use; and improving the alignment between rubric-based evaluation and true professional standards (both analytical and regulatory).

Developments in robust agent architectures—leveraging advanced planner-orchestrators, modular tool interaction, and reliable execution loops—will be vital for achieving the workflow-level reliability demanded in financial practice. The benchmark's extension to multilingual, non-US, and less-resourced markets will also be essential for comprehensive agentic finance evaluation.

Conclusion

HERCULEAN sets a new standard for agentic evaluation in finance by targeting end-to-end workflows across Trading, Hedging, Market Insights, and Auditing in MCP-grounded environments. The results demonstrate that, despite progress, AI agents remain far from realizing the requirements of dependable, professional financial labor. Capability gaps are pronounced for workflows demanding structured verification, state management, and cross-asset reasoning, and are not eliminated by further backbone scaling alone. Future research must prioritize the co-design of reasoning backbones and robust execution frameworks to bridge the systemic gaps revealed by HERCULEAN (2605.14355).

Explain it Like I'm 14

Explaining “HERCULEAN: An Agentic Benchmark for Financial Intelligence”

Overview: What’s this paper about?

This paper introduces HERCULEAN, a new way to test how well AI “agents” can do real financial work from start to finish. Instead of just answering questions or summarizing documents, the agents have to carry out full workflows, like a human finance professional would. The benchmark includes four kinds of tasks: Trading, Hedging, Market Insights, and Auditing.

Think of it like a set of realistic “levels” in a finance game. Each level has rules, tools, and goals, and the AI must plan, act, and check its work over days or weeks—just like in the real world.

Key questions the researchers asked

The paper focuses on simple but important questions:

Can current AI agents handle complete financial workflows, not just isolated tasks?
Which kinds of financial tasks are easier or harder for AI agents?
Does the way an agent is designed (its “framework”) matter as much as the LLM it uses?
Do bigger, more advanced LLMs guarantee success in these realistic settings?

How they did the research

The team built HERCULEAN as four “skill environments,” each mirroring a real financial job. Agents interact with these environments through a standardized interface called the Model Context Protocol (MCP). MCP is like a shared controller: it defines what the agent can see, what tools it can use, and how it must report results, so every agent is tested fairly.
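The "what the agent can see" part can be pictured as a data tool that refuses to look ahead, which is how a no-future-information rule might be enforced in the Trading workflow. The sketch below is plain Python and does not use any MCP library; the class, its methods, and the sample records are hypothetical illustrations of the idea.

```python
from datetime import date


class PointInTimeStore:
    """Serve only records dated on or before the simulated current day."""

    def __init__(self, records):
        # records: iterable of (date, payload); sort by date only
        self._records = sorted(records, key=lambda rec: rec[0])
        self._today = None

    def advance_to(self, day):
        """Move the simulation clock forward; moving backwards is rejected."""
        if self._today is not None and day < self._today:
            raise ValueError("simulation clock cannot move backwards")
        self._today = day

    def visible(self):
        """Payloads the agent is allowed to see at the current clock."""
        if self._today is None:
            return []
        return [payload for d, payload in self._records if d <= self._today]


if __name__ == "__main__":
    store = PointInTimeStore([
        (date(2025, 1, 2), "news item B"),   # made-up records
        (date(2025, 1, 1), "news item A"),
        (date(2025, 1, 3), "news item C"),
    ])
    store.advance_to(date(2025, 1, 2))
    print(store.visible())
```

Because the clock only moves forward, an agent cannot revisit an earlier day with later knowledge, which mirrors the sequential-commitment rule.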

Here’s what each workflow looks like:

Trading: The agent makes a daily choice—BUY, SELL, or HOLD—on one stock over three months. It can look at past prices, company news, and official filings, but not the future. Success is measured by:
- Cumulative return (how much you made or lost overall),
- Sharpe ratio (profit relative to volatility/risk),
- Maximum drawdown (the biggest drop from a peak).
Hedging (pairs trading): The agent picks two related stocks (like MSFT and GOOG) and bets on their relationship, not the market’s direction. Each day, it chooses positions like LONG_SHORT (long one, short the other), HOLD, or CLOSE. The portfolio is “dollar neutral,” meaning the long and short sides are equal in size so net dollar exposure is zero. This workflow uses the same return/risk metrics as Trading.
Market Insights: The agent writes a weekly investment report (with sections like summary, rating, risks) for one stock and gives a rating (STRONG_BUY to STRONG_SELL). The report must combine prices, news, filings, and peer comparisons. The team judges:
- Report quality (structure, accuracy, evidence, reasoning),
- Whether following the ratings would have made money (using the same trading metrics).
Auditing: The agent checks a specific number in a company’s official XBRL filing (a digital format for financial statements). Accuracy is judged in steps: whether the output is valid, whether the right number was extracted, and whether the math/logic was correct. The agent must:
- Find the reported value,
- Compute the correct value using the filing’s calculation links and GAAP rules (the accounting standards),
- Report both and say whether there is an error.
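The step-by-step judging of Auditing answers can be sketched as a small classifier that labels each attempt by the first stage it fails. The field names, the labels, and the choice to compute every rate over all instances are assumptions made for illustration; the summary does not give the exact definitions.

```python
from collections import Counter

LABELS = ("structural_error", "extraction_error", "calculation_error", "correct")


def classify(result):
    """Label one audit attempt by the first check it fails."""
    if not result["valid_output"]:
        return "structural_error"
    if not result["extracted_ok"]:
        return "extraction_error"
    if not result["calculated_ok"]:
        return "calculation_error"
    return "correct"


def summarize(results):
    """Share of instances per label; the four shares sum to 1."""
    counts = Counter(classify(r) for r in results)
    n = len(results)
    return {label: counts[label] / n for label in LABELS}


if __name__ == "__main__":
    attempts = [  # made-up outcomes for four attempts
        {"valid_output": False, "extracted_ok": False, "calculated_ok": False},
        {"valid_output": True, "extracted_ok": False, "calculated_ok": False},
        {"valid_output": True, "extracted_ok": True, "calculated_ok": False},
        {"valid_output": True, "extracted_ok": True, "calculated_ok": True},
    ]
    print(summarize(attempts))
```

Ordering the checks this way separates "the agent never produced a usable answer" from "the agent answered but got the number or the math wrong", which is the distinction the findings below rely on.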

To test fairness and robustness, the researchers ran five different agent frameworks with four different LLMs (including advanced closed-source models and smaller open-source ones). They turned off web search and persistent memory to make sure the agents relied only on the provided financial data and tools.

Main findings and why they matter

The results show clear patterns. Here are the highlights:

Agents do better on “talking and summarizing” tasks than on “precise, step-by-step” tasks:
- Market Insights: Many agents wrote high-quality reports (often scoring above 9/10). However, good writing didn’t always mean good investment results.
- Trading: Some agents beat a simple “Buy & Hold” baseline, but gains were small and inconsistent.
- Hedging: Agents struggled. This task requires tracking positions over time and understanding relationships between two stocks—skills current agents found hard.
- Auditing: This was the hardest. The best systems reached about 66% accuracy, but many made structural mistakes (like not following the required format or process), and calculation errors were common. This shows strict, rule-based financial checking is still very tough for AI.
The agent’s design matters as much as the LLM:
- Agents with strong execution control (good at structured tool use and following protocols) did much better, especially in Auditing.
- Agents that rely on a simple “think-then-act” loop often broke the rules or failed to complete tasks over long time spans.
Bigger LLMs help, but they aren’t a magic fix:
- Advanced models improved performance overall.
- Still, being great at writing a convincing report didn’t mean the agent could verify accounting numbers correctly or manage a hedged position reliably.
The core gap:
- Today’s agents can reason and write well, but turning that reasoning into consistent, correct actions over time—especially under strict rules—remains a major challenge.
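One way to picture "strong execution control" is a structural check that runs before an answer is submitted, so that format mistakes are caught and can be retried instead of counting as failures. The sketch below assumes a hypothetical answer format for the Auditing task; the field names and types are invented for illustration and are not the benchmark's schema.

```python
# Hypothetical required fields for an audit answer (assumption, not the real schema).
REQUIRED_FIELDS = {
    "concept": str,
    "reported_value": (int, float),
    "computed_value": (int, float),
    "is_error": bool,
}


def structural_problems(answer):
    """Return a list of format problems; an empty list means the answer is valid."""
    if not isinstance(answer, dict):
        return ["answer is not an object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in answer:
            problems.append(f"missing field: {name}")
            continue
        value = answer[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            problems.append(f"wrong type for field: {name}")
    for name in sorted(set(answer) - set(REQUIRED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems


if __name__ == "__main__":
    draft = {"concept": "GrossProfit", "reported_value": 450, "computed_value": 400}
    print(structural_problems(draft))
```

A controller that loops until this list is empty turns a would-be structural error into, at worst, an extraction or calculation error, which is a more informative failure.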

Implications: What this means going forward

This benchmark pushes AI beyond simple Q&A and toward real professional work. The findings suggest:

To make AI agents truly useful in finance, we need better “execution brains,” not just better “language brains.” That means stronger:
- State tracking (remembering what’s already done or open),
- Tool orchestration (using the right tools, in the right order, with the right parameters),
- Verification (checking math and rules carefully).
Teams building financial AI should focus on workflow stability, not just flashy reasoning. An agent that can follow rules, maintain consistent state, and verify results is more valuable—and safer—than one that only sounds smart.
HERCULEAN gives researchers a common, realistic testbed. It can help the community compare methods fairly and improve agents so they can handle high-stakes tasks more reliably.

Note: The authors released code and data for research purposes. The benchmark uses public information and doesn’t offer financial advice. It focuses on US markets and English-language filings, so results may not generalize globally.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of unresolved issues that future work could address to strengthen the benchmark’s validity, coverage, and scientific conclusions:

External validity across markets: Assess generalization beyond large-cap US equities to small/mid-cap stocks, international markets, and non-English filings.
Accounting regime coverage: Extend Auditing from US GAAP to IFRS and cross-standard mappings; evaluate mixed-standard corpora.
Asset-class breadth: Add fixed income, FX, commodities, options/derivatives (e.g., Greeks-based hedging), ETFs, and crypto to test cross-asset reasoning.
Longer horizons and live evaluation: Run multi-quarter/multi-year out-of-sample and longer live periods with regime shifts; report stability over time.
Market realism: Incorporate transaction costs, slippage, fees, borrow availability, margin, liquidity/impact, and order execution constraints.
Position sizing and risk: Move beyond discrete BUY/SELL/HOLD to include position sizing, leverage, VaR/stop-loss, and risk budgets; evaluate portfolio-level constraints.
Intraday/microstructure: Introduce intraday data and execution tasks (limit/market orders, order book dynamics) to test timing and microstructure-aware decisions.
Trading semantics clarity: Specify inventory/position persistence, shorting rules, and how BUY/SELL map to exposure changes; include stronger baselines (e.g., momentum/mean-reversion) with and without costs.
Hedging strategy realism: Support dynamic pair re-selection, dynamic hedge ratios (e.g., rolling OLS), z-score thresholds, cointegration tests, and rebalancing cadence.
Hedging constraints: Model short-sale constraints, borrow costs, margin calls, and re-hedging under volatility spikes.
Portfolio workflows: Evaluate multi-asset portfolio construction (cross-sectional long–short, risk parity) rather than single-asset or single-pair tasks.
Market Insights impact mapping: Calibrate and justify the mapping from weekly ratings to trading strategies; compare to analyst-style benchmarks and buy-side baselines.
Report quality measurement validity: Validate the rubric and judge-LLM with expert auditors/analysts; report inter-rater agreement and robustness to prompt variations.
Evidence fidelity and citation: Enforce source-linked claims with verifiable citations; score citation correctness and coverage beyond rubric pass/fail.
Data leakage audits: Rigorously audit chronology for prices/news/filings and the weekly metrics aggregator; publish leakage checks and unit tests.
News summarization bias: Quantify how the aggregation/summarization pipeline affects evidence fidelity and downstream decisions; compare to raw/newswire feeds.
Auditing ground truth scale: Expand beyond 65 instances; include diverse concept types, periods, dimensional contexts (hypercubes), footnotes, segment data, and restatements.
Deterministic auditing labels: Provide human-verified numeric ground truth (not LLM-judged) for a substantial subset to benchmark judge accuracy and agent correctness.
Narrative–numeric cross-checks: Add tasks that reconcile narrative disclosures (MD&A/notes) with numeric facts and detect inconsistencies.
Judge-LLM bias and contamination: Quantify judge sensitivity to fluency and backbone family; use multiple orthogonal judges and human adjudication to estimate bias.
Variance and reliability: Report run-to-run variance, seeds, confidence intervals, and sensitivity to reasoning depth and temperature; analyze failure rates over time.
Prompt/tool sensitivity: Ablate skill prompts, tool schemas, and MCP API designs; quantify how interface changes affect execution stability and outcomes.
Memory and retrieval effects: Evaluate the impact of persistent memory, retrieval augmentation, and external web search on all workflows under identical constraints.
Execution-control methods: Compare self-verification, planning, program-of-thought, code execution, and constrained decoding; quantify their effect on Auditing/Hedging.
Learning inside the environment: Explore RL/fine-tuning/curriculum learning for trajectory control and verification, measuring sample efficiency and safety.
Multi-agent and human-in-the-loop: Test role-specialized agent teams and escalation-to-human protocols; measure coordination overhead and error reduction.
Adversarial robustness: Introduce corrupted/noisy/mislabeled data, adversarial tool responses, and stress scenarios; measure brittleness and recovery.
Cost–performance trade-offs: Report token/tool-call budgets, latency, throughput, and cost-normalized performance; study compute-aware agent design.
Process analytics: Publish telemetry on tool-call counts, depth, time per step, and failure taxonomy (analogs of the structural, extraction, and calculation error rates, SER/EER/CER, across workflows) to link behavior to outcomes.
Cross-workflow transfer: Test whether improvements in Auditing (verification) transfer to Trading/Hedging execution; study shared skills vs. specialization.
Additional workflows: Add credit risk/underwriting, AML/KYC/compliance monitoring, portfolio rebalancing, corporate actions processing, and earnings forecasting.
Alternative data and macro: Incorporate macro indicators, options surface/skew, supply-chain and satellite data, and sell-side transcripts to test multimodal integration.
Safety and governance: Define metrics for safe failure, escalation criteria, and policy constraints to prevent high-risk actions; audit for hallucinated or non-compliant outputs.
Interoperability and standards: Evaluate portability beyond MCP to other tool protocols; propose reference schemas for financial tools to reduce framework-induced variance.
Reproducibility of agent setups: Provide full agent configuration, prompts, and deterministic seeds; document API versioning to enable faithful replication.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the benchmark’s MCP-based skill environments, evaluation protocols, and observed agent performance characteristics.

Industry: AI agent vendor evaluation and procurement harness
- What: Use HERCULEAN as a standardized, workflow-faithful testbed to compare agent frameworks/backbones for Trading, Hedging, Market Insights, and Auditing before purchase or deployment.
- Sector(s): Finance, Software
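- Tool/Product/Workflow: Evaluation dashboards reporting cumulative return, Sharpe ratio, and maximum drawdown (CR/SR/MDD) for Trading/Hedging, rubric scores for Market Insights, and accuracy plus structural, extraction, and calculation error rates (ACC/SER/EER/CER) for Auditing; repeatable MCP-based test suites; vendor scorecards.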
- Dependencies/Assumptions: US large-cap equities, GAAP/XBRL focus, benchmark time windows; LLM/API access; does not model transaction costs or market frictions.
Industry: Research analyst co-pilot with rubric-based quality gates
- What: Generate weekly, structured investment reports (ratings plus 8-section Markdown) and gate them with the benchmark’s LLM-as-judge rubrics (structure, fidelity, accuracy, reasoning).
- Sector(s): Finance (buy-side/sell-side), Enterprise Research Ops
- Tool/Product/Workflow: Market Insights MCP skill + rubric evaluator as a CI/CD-like gate for research notes; report templates for portfolio meetings.
- Dependencies/Assumptions: Human-in-the-loop review to mitigate LLM-judge bias toward fluency; coverage limited to benchmark asset universe unless extended.
Industry/Enterprise: Pre-filing XBRL self-checker for internal audit/compliance
- What: Use the Auditing skill as a pre-screen tool to detect calculation-network inconsistencies and sign/balance issues in draft filings.
- Sector(s): Audit, Corporate Finance, RegTech
- Tool/Product/Workflow: Auditing MCP server integrated into SEC-reporting workflows; red-flag reports with traceable calculation steps.
- Dependencies/Assumptions: Current agent accuracy is uneven; treat as assistive triage, not a replacement for professional audit; GAAP taxonomy coverage; requires internal mapping to company-specific extensions.
Engineering/MLOps: Execution-control–oriented agent selection and hardening
- What: Prefer CLI-oriented or schema-enforcing agent frameworks (lower SER) for tool-heavy workflows (e.g., Auditing), based on benchmark findings.
- Sector(s): Software, Finance
- Tool/Product/Workflow: Agent orchestration policies (typed tools, schema validators, trajectory monitors), regression suites using HERCULEAN scenarios.
- Dependencies/Assumptions: Access to frameworks that expose tool typing and strict I/O schemas; ops discipline to maintain evaluation baselines.
Academia/Education: End-to-end finance labs for teaching and evaluation
- What: Course modules that mirror professional workflows (trading, hedging, insights, auditing) rather than static QA.
- Sector(s): Education
- Tool/Product/Workflow: MCP servers + DuckDB datasets as lab infrastructure; assignments on pair selection, weekly reporting, and XBRL verification.
- Dependencies/Assumptions: Faculty/IT support to deploy MCP; curated guardrails to avoid “live trading” misconceptions.
Industry: Paper-trading sandboxes for strategy prototyping
- What: Rapid prototyping of single-asset trading and pairs hedging in a controlled, reproducible environment with tool-mediated data access.
- Sector(s): Asset Management, Fintech
- Tool/Product/Workflow: Broker-simulated backtests driven via MCP; scenario libraries (different assets/time windows).
- Dependencies/Assumptions: Historical-only and limited universe; no slippage/fees unless added; agents show instability on Hedging—keep in sandbox.
Policy/RegTech: SupTech prototypes for disclosure consistency checks
- What: Pilot automated audits of EDGAR filings for calculation-network consistency and taxonomy conformance.
- Sector(s): Policy, Regulation, Audit
- Tool/Product/Workflow: Batch auditing of recent filings with triage dashboards; human reviewer queue for flagged facts.
- Dependencies/Assumptions: Treat outputs as leads, not determinations; align with regulator data-access and security policies; extend to new taxonomy updates.
Daily life/Prosumer: Transparent weekly market summaries
- What: Consumer-facing “explain like I’m an analyst” reports using the Market Insights skill on covered tickers, emphasizing evidence links (news/filings) and risks.
- Sector(s): Personal Finance, Education
- Tool/Product/Workflow: Web app that generates weekly reports with caveats and learning prompts; no trading execution.
- Dependencies/Assumptions: Strict disclaimers (not investment advice); limited to public data; encourage diversified, long-term investing principles.

Long-Term Applications

These applications require further research, scaling, and/or integration work (e.g., stronger execution control, broader data, regulatory acceptance).

Industry: Production-grade agent portfolio managers with market-neutral modules
- What: End-to-end agents that trade/hedge live with robust state tracking, risk budgeting, and compliance-aware execution.
- Sector(s): Asset Management, Brokerage
- Tool/Product/Workflow: Live MCP skills wired to real-time market data and broker APIs; policy engines enforcing position limits, P&L stop-outs, and audit trails.
- Dependencies/Assumptions: Significant improvements in long-horizon coordination and cross-asset reasoning; model risk governance; full treatment of costs and slippage.
Policy/RegTech: Continuous, automated XBRL auditing at scale
- What: Always-on agents that recompute and cross-verify reported facts across issuers/periods, raising probabilistic anomalies for human follow-up.
- Sector(s): Regulation, Exchanges, Audit
- Tool/Product/Workflow: High-throughput auditing pipelines; concept graph reasoning across company-specific extensions; cross-filing consistency checks.
- Dependencies/Assumptions: Higher ACC and lower CER; regulator acceptance; robust handling of taxonomy evolution and restatements.
Standardization: Certification regimes for AI financial agents
- What: Industry-standard tests (built on HERCULEAN-like workflows) certifying execution stability and verification competence before live deployment.
- Sector(s): Finance, Standards Bodies, Risk Management
- Tool/Product/Workflow: Tiered benchmarks and thresholds per workflow; periodic re-certification; incident reporting tied to benchmark regressions.
- Dependencies/Assumptions: Broad community adoption; transparent, versioned datasets; governance for test leakage.
Software/Tooling: Execution-control platforms for agentic finance
- What: Products that provide typed tool schemas, deterministic verification loops, state stores, and “trajectory stabilizers” as a layer atop LLMs.
- Sector(s): Software, Finance, Compliance
- Tool/Product/Workflow: Agent SDKs with schema enforcement, retry/rollback, and deterministic calculators for financial primitives; audit logs.
- Dependencies/Assumptions: Integration with diverse data vendors; compatibility with MCP and future agent standards.
Cross-sector MCP skill libraries
- What: Extend the skill-based, MCP-grounded approach to insurance underwriting, credit risk scoring, procurement auditing, and energy trading.
- Sector(s): Insurance, Banking, Supply Chain, Energy
- Tool/Product/Workflow: Domain-specific skills with canonical tools, constraints, and evaluation criteria (e.g., loss triangles, PD/LGD estimation, invoice matching).
- Dependencies/Assumptions: Domain ontologies, regulatory rulesets, and high-quality labeled data; sector buy-in.
Learning: RL/DPO training for workflow competence
- What: Use HERCULEAN environments as training grounds to optimize for execution metrics (low SER/EER/CER; improved CR/SR with risk constraints).
- Sector(s): AI Research, Finance
- Tool/Product/Workflow: Offline RL with logged trajectories; curriculum learning from Market Insights to Auditing; verifier-in-the-loop optimization.
- Dependencies/Assumptions: Reliable reward shaping without overfitting; cost-effective training; safety constraints.
Multi-agent systems with verifier and memory roles
- What: Architectures where a “doer” agent is paired with a “verifier” and a “state manager” to ensure deterministic checks and consistent long-horizon behavior.
- Sector(s): Finance, Software
- Tool/Product/Workflow: Agent ensembles with explicit role APIs; shared state stores and calculation graphs; escalation to humans on uncertainty.
- Dependencies/Assumptions: Coordination overhead and latency budgets; secure shared memory; robust arbitration policies.
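The doer/verifier/state-manager split can be illustrated with a toy example. This is a sketch under stated assumptions: the role names follow the text, but the `StateManager` class, the `run_step` function, the tolerance, and the sample figures are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class StateManager:
    """Shared store: verified facts plus a queue of items for human review."""
    facts: dict = field(default_factory=dict)
    escalations: list = field(default_factory=list)

    def commit(self, key: str, value: float) -> None:
        self.facts[key] = value

    def escalate(self, key: str, reason: str) -> None:
        self.escalations.append((key, reason))


def verifier(claimed_total: float, components: list, tolerance: float = 0.01) -> bool:
    """Deterministic check: recompute the total from its components."""
    return abs(sum(components) - claimed_total) <= tolerance


def run_step(state: StateManager, key: str, doer, components: list) -> bool:
    """Commit the doer's claim only if the verifier can reproduce it."""
    claim = doer(components)
    if verifier(claim, components):
        state.commit(key, claim)
        return True
    state.escalate(key, f"claimed {claim}, recomputed {sum(components)}")
    return False


def careful_doer(components):
    return sum(components)


def sloppy_doer(components):
    # Simulates an agent that silently drops the last line item.
    return sum(components[:-1])


state = StateManager()
line_items = [120.0, 80.0, 50.0]  # made-up figures
ok_first = run_step(state, "total_assets", careful_doer, line_items)
ok_second = run_step(state, "total_liabilities", sloppy_doer, line_items)
print(state.facts)
print(state.escalations)
```

Keeping the verifier free of any model call is the design choice that matters here: the check is reproducible, so a disagreement is always attributable to the doer and can be escalated instead of silently committed.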
Enterprise: Automated research pipelines from ingestion to publication
- What: Semi-autonomous production of house views and sector decks, with evidence linking, peer-relative benchmarking, and compliance review.
- Sector(s): Investment Research, Corporate Strategy
- Tool/Product/Workflow: End-to-end content generation with rubric gates, fact-citation checks, and compliance sign-offs; knowledge-base integration.
- Dependencies/Assumptions: Strong hallucination control; IP/document access rights; alignment with editorial standards.
Trading infrastructure: Broker and OMS/EMS integration with compliance guards
- What: Agents that propose actions which must pass pre-trade checks (mandates, exposure limits) and post-trade surveillance using audit logs.
- Sector(s): Brokerage, Asset Management
- Tool/Product/Workflow: Policy-as-code libraries; explainability artifacts tied to each action; real-time risk dashboards.
- Dependencies/Assumptions: Low-latency tool orchestration; regulator-ready auditability; robust kill-switches.
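A pre-trade check expressed as policy-as-code might look like the following. This is a minimal sketch, assuming a mandate defined by an allowed-ticker list, a per-position cap, and a gross-exposure cap; the `Mandate` class, the limits, the tickers, and the prices are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mandate:
    """Hypothetical mandate: what may be traded and how much."""
    allowed_tickers: frozenset
    max_position_value: float  # cap on any single position, in dollars
    max_gross_exposure: float  # cap on the sum of absolute position values


def pre_trade_check(mandate, positions, prices, ticker, quantity):
    """Return a list of violations; an empty list means the trade may proceed."""
    if ticker not in mandate.allowed_tickers:
        return [f"{ticker} is outside the mandate"]

    # Evaluate the limits against the portfolio as it would look after the trade.
    proposed = dict(positions)
    proposed[ticker] = proposed.get(ticker, 0) + quantity

    violations = []
    position_value = abs(proposed[ticker] * prices[ticker])
    if position_value > mandate.max_position_value:
        violations.append(f"{ticker} position value {position_value} exceeds cap")

    gross = sum(abs(qty * prices[name]) for name, qty in proposed.items())
    if gross > mandate.max_gross_exposure:
        violations.append(f"gross exposure {gross} exceeds cap")
    return violations


mandate = Mandate(frozenset({"AAA", "BBB"}), 10_000.0, 15_000.0)
positions = {"AAA": 50}
prices = {"AAA": 100.0, "BBB": 50.0}

print(pre_trade_check(mandate, positions, prices, "BBB", 100))  # within limits
print(pre_trade_check(mandate, positions, prices, "AAA", 60))   # position cap
print(pre_trade_check(mandate, positions, prices, "CCC", 10))   # not allowed
```

Returning the full list of violations, instead of stopping at the first one, gives each blocked action an explanation that can be attached to the audit log.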
Policy and education: Model risk management and regulatory curricula
- What: Regulator and practitioner training programs built around workflow benchmarks to teach where agents fail (e.g., verification-heavy tasks).
- Sector(s): Policy, Education, Risk
- Tool/Product/Workflow: Case libraries, hands-on labs, and simulation exercises; guidelines for acceptable use and controls.
- Dependencies/Assumptions: Cross-institution collaboration; continuously updated examples reflecting new agent capabilities and failure modes.

Glossary

Adjusted close: A stock’s closing price adjusted for corporate actions like splits and dividends to reflect true economic value. "OHLCV prices with adjusted close"
Agentic: Refers to AI systems that can act autonomously using tools and multi-step interactions. "the first skilled benchmark for agentic financial intelligence"
Alpha: Excess return relative to a benchmark, often attributed to skill or unique insights. "single-stock alpha, momentum, and sector-relative beta blocks"
Backbone models: The underlying LLMs that power agent frameworks. "Each agent system is tested on four backbone models"
Balance semantics: XBRL/accounting property indicating whether a concept increases with debits or credits. "the concept’s balance semantics"
Beta: Sensitivity of an asset’s returns relative to a market or sector benchmark. "sector-relative beta blocks"
Buy&Hold baseline: A benchmark strategy that buys an asset and holds it without trading. "outperform the negative Buy&Hold baseline"
Calculation linkbase: The XBRL file that encodes arithmetic relationships among reported concepts. "(instance, calculation linkbase, schema, definition linkbase, label linkbase, presentation linkbase)"
Calculation network: The graph of linked XBRL concepts and formulas used to compute or validate values. "the filing’s calculation network"
Cumulative return (CR): Total percentage gain or loss over a period. "cumulative return (CR), Sharpe ratio (SR), and maximum drawdown (MDD)."
Definition linkbase: The XBRL file capturing semantic relationships among concepts beyond pure arithmetic. "(instance, calculation linkbase, schema, definition linkbase, label linkbase, presentation linkbase)"
Dimensional context: The XBRL specification of dimensions (e.g., segments, products) qualifying a reported fact. "dimensional-context resolution"
Dollar-neutral portfolio: A long-short position with equal dollar amounts on the long and short sides, yielding zero net dollar exposure. "Any open pair position is implemented as a dollar-neutral portfolio"
DuckDB: An in-process analytical database used here to store and query market data offline. "is materialized in an offline DuckDB"
Equal-weighted: A portfolio or basket where each constituent has the same weight. "equal-weighted sector basket"
Extraction error rate (EER): The fraction of audit cases where the system extracted the wrong value from a filing. "extraction error rate (EER)"
Form 10-K: The SEC’s annual report filing that provides a comprehensive overview of a company’s business and financials. "Form 10-K and Form 10-Q"
Form 10-Q: The SEC’s quarterly report filing summarizing interim financial performance. "Form 10-K and Form 10-Q"
Hedging: Strategies designed to reduce or offset risk, often via offsetting positions. "Hedging strategies seek to profit not from predicting market direction"
Hierarchical fact-verification: A structured evaluation approach that checks multiple layers of correctness when verifying reported facts. "hierarchical fact-verification task lineage"
Instance document: The XBRL file that contains actual numeric facts reported by a company. "instance document"
Label linkbase: The XBRL file that provides human-readable labels for taxonomy concepts. "label linkbase"
LLM-as-a-judge: An evaluation paradigm where an LLM assesses the correctness or quality of outputs. "hierarchical LLM-as-a-judge framework"
Market-neutral: A strategy designed to have minimal net market exposure, focusing on relative performance. "market-neutral pairs trading strategies"
Market timing: Making buy/sell decisions based on forecasts of short-term market movements. "daily market-timing decisions"
Maximum drawdown (MDD): The largest peak-to-trough decline in a portfolio over a period. "maximum drawdown (MDD)."
MCP (Model Context Protocol): A standardized protocol that packages tools and interactions for agents to use within an environment. "built following the Model Context Protocol (MCP)"
MCP server: The service that implements MCP tools and enforces environment state and evaluation logic. "an MCP server that exposes workflow-specific observations, tools, actions, and evaluation criteria."
Momentum: A signal based on the tendency of assets with recent strong performance to continue performing well (and vice versa). "single-stock alpha, momentum, and sector-relative beta blocks"
Notional exposure: The total value controlled by a position, irrespective of leverage effects. "equal absolute notional exposure"
OHLCV: Open, High, Low, Close, Volume — standard fields in price time series data. "OHLCV prices with adjusted close"
Pair trading: A market-neutral strategy that exploits relative mispricings between two correlated assets. "Pair trading, one of the most representative market-neutral hedging strategies"
Parametric memory: Knowledge encoded in model parameters rather than retrieved from external data/tools. "parametric memory"
Peer mapping: A mapping from a company to its sector peers used for relative comparisons. "a static peer mapping"
Presentation linkbase: The XBRL file that specifies how concepts are organized for display. "(instance, calculation linkbase, schema, definition linkbase, label linkbase, presentation linkbase)"
Schema (XBRL schema): The XBRL file defining the taxonomy’s elements and their data types. "(instance, calculation linkbase, schema, definition linkbase, label linkbase, presentation linkbase)"
SEC EDGAR: The SEC’s Electronic Data Gathering, Analysis, and Retrieval system that hosts company filings. "SEC EDGAR"
Sector basket: A portfolio of peer companies within the same sector used for relative performance benchmarking. "equal-weighted sector basket"
Sharpe ratio (SR): Risk-adjusted return metric defined as the mean excess return over the risk-free rate divided by the standard deviation (volatility) of returns. "cumulative return (CR), Sharpe ratio (SR), and maximum drawdown (MDD)."
U.S. GAAP taxonomy: The standardized set of accounting concepts used in US financial reporting within XBRL. "against the U.S. GAAP taxonomy"
XBRL: eXtensible Business Reporting Language used for machine-readable financial reporting. "verify individual XBRL numeric facts"
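The three performance metrics defined in this glossary (CR, SR, MDD) can be computed from a series of portfolio values as sketched below. These are the common textbook definitions; the paper's exact conventions (risk-free rate, annualization factor, sample versus population volatility) are not stated here, so the zero risk-free rate and the 252-day annualization are assumptions, and the sample values are made up.

```python
import math
import statistics


def cumulative_return(values):
    """Total fractional gain or loss from the first value to the last."""
    return values[-1] / values[0] - 1.0


def period_returns(values):
    """Simple return of each period relative to the previous one."""
    return [later / earlier - 1.0 for earlier, later in zip(values, values[1:])]


def sharpe_ratio(values, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized mean excess return divided by sample volatility.

    Assumes daily data (252 periods per year) and a zero risk-free
    rate unless told otherwise.
    """
    excess = [r - risk_free_per_period for r in period_returns(values)]
    volatility = statistics.stdev(excess)
    if volatility == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


portfolio = [100.0, 110.0, 99.0, 120.0]  # made-up portfolio values
cr = cumulative_return(portfolio)
sr = sharpe_ratio(portfolio)
mdd = max_drawdown(portfolio)
print(f"CR={cr:.2%}  SR={sr:.2f}  MDD={mdd:.2%}")
```

In this example the portfolio ends 20% above its starting value, while the drop from 110 to 99 gives a 10% maximum drawdown, showing how a strategy can have a positive cumulative return and still carry a meaningful drawdown.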

