
OneMillion-Bench: Language Agent Benchmark

Updated 11 March 2026
  • OneMillion-Bench is a comprehensive benchmark that assesses language agents across professional domains using expert-curated tasks with explicit economic metrics.
  • It employs 400 semi-open-ended tasks across five sectors, using detailed rubric-based scoring to evaluate multi-step reasoning and adherence to domain-specific rules.
  • The evaluation highlights significant performance gaps in factual integration and analytical reasoning, stressing the need for enhanced professional readiness in language models.

OneMillion-Bench (1M-Bench) is a comprehensive benchmark designed to evaluate language agents in economically consequential, professional domains. It consists of 400 expert-curated tasks spanning Finance, Law, Healthcare, Natural Science, and Industry, each paired with explicit economic value based on expert time and prevailing market wages. Tasks are constructed to mirror real-world professional workflows, requiring not only accurate answers but robust multi-step reasoning, evidence integration, and adherence to domain-specific constraints. The evaluation protocol employs fine-grained, rubric-based scoring, departing from traditional benchmarks focused solely on answer correctness and targeting professional readiness and agentic reliability in domain-intensive contexts (Yang et al., 9 Mar 2026).

1. Motivation and Limitations of Prior Benchmarks

Classic QA and multiple-choice evaluations such as MMLU, GPQA, and HLE are limited to isolated knowledge units or formal reasoning tasks in highly structured environments. These approaches fail to reflect the complexity of real-world deliverables, neglecting factors such as multi-step workflows, the need for authoritative references, evidence reconciliation, and nuanced domain rules. Tool-agent evaluations like SWE-bench, TravelPlanner, and Terminal-Bench introduce planning and tool usage but lack rigorous domain-grounded reasoning assessment. Crucially, there has been no standard benchmark to quantify how much real economic value (senior professional time × market wages) LLMs might deliver on tasks of direct economic import. OneMillion-Bench was instituted to systematically bridge these gaps through high-value domains, explicit task valuation, and comprehensive process-oriented rubrics (Yang et al., 9 Mar 2026).

2. Benchmark Composition and Task Distribution

OneMillion-Bench comprises 400 semi-open-ended, expert-curated tasks:

  • Domains: Finance, Law, Healthcare, Natural Science, Industry (80 questions each).
  • Language/Cultural Subsets: Each domain is divided evenly between Chinese (CN; 200 tasks) reflecting Mainland-China standards, and English (Global; 200 tasks) based on U.S./international norms.
  • Hierarchical Coverage: 37 subdomains and 92 third-level categories ensure diversity (e.g., Finance → “Life Insurance,” “VC/PE”; Healthcare → “Orthopedics,” “Dentistry”; Law → “M&A,” “Data Compliance”; Industry → “Semiconductors,” “Civil Engineering”; Natural Science → “Quantum Physics,” “Genetics”).
| Domain | Subset | Task Count | Avg Value per Q | Total Value |
|---|---|---|---|---|
| Finance | CN (¥) | 40 | ¥7,410.8 | ¥296,432 |
| Finance | Global (\$) | 40 | \$4,593.2 | \$183,726 |
| Total | CN | 200 | | ¥921,832 |
| Total | Global | 200 | | \$1,008,370 |
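The relationship between the average value per question and a subset's total value can be checked with a few lines of arithmetic. The sketch below assumes 40 tasks per domain in each language subset (80 per domain, split evenly between CN and Global); the function and variable names are illustrative and not taken from the paper.

```python
# Assumption: each domain's 80 tasks are split evenly, 40 per language subset.
TASKS_PER_SUBSET = 40

# Finance rows as reported in the table (CN in yuan, Global in US dollars).
finance = {
    "CN": {"avg_value": 7410.8, "reported_total": 296_432},
    "Global": {"avg_value": 4593.2, "reported_total": 183_726},
}


def implied_total(avg_value: float, n_tasks: int) -> float:
    """Total value implied by an average value per question and a task count."""
    return avg_value * n_tasks


for subset, row in finance.items():
    total = implied_total(row["avg_value"], TASKS_PER_SUBSET)
    gap = total - row["reported_total"]
    print(f"{subset}: implied {total:,.0f}, reported {row['reported_total']:,}, gap {gap:+.0f}")
```

Under this assumption the CN figure matches the reported total, and the Global figure differs by about two dollars, which is consistent with the average having been rounded to one decimal place.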

Tasks are exemplified by prompts such as, “Review the Yen’s 2025 depreciation cycle phases [...] and project 2026 trend-reversion triggers,” requiring structured, multi-source synthesis, quantitative analysis, and forward-looking judgments.

3. Rubric-Based Evaluation Protocol

Each question $q$ is paired with a set of rubrics $R_q$, where every rubric $r$ includes:

  • Explicit measurement criteria (e.g., legal citation, multi-step calculation, guideline application)
  • Weight $w_r$ (ranging up to $+10$ or $-20$) reflecting impact
  • Functional tag: Factual Information (FI), Analytical Reasoning (AR), Instruction Following (IF), Structure & Formatting (SF)
  • (Optionally) references to authoritative sources

Scoring is performed either by an LM judge or a human expert, assigning $s_r \in [-w_r, +w_r]$ for each rubric. The ExpertScore for a question $q$ is

$$\text{ExpertScore}(q) = \max\!\Bigl(0,\; \frac{\sum_{r\in R_q} s_r}{\sum_{r\in R_q^+} w_r}\Bigr),$$

where $R_q^+$ is the set of rubrics with positive weights. Scores are clipped to $[0,1]$ for normalization. The PassRate across a question set $Q$ is defined as

$$\text{PassRate}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\bigl[\text{ExpertScore}(q) \ge 0.7 \bigr].$$

Aggregations can be performed per domain, rubric type, or as an overall mean benchmark score (Yang et al., 9 Mar 2026).
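The ExpertScore and PassRate definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `RubricResult` container, the function names, and the example rubrics are hypothetical, and the sketch assumes that a triggered penalty rubric contributes its negative weight as its score.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class RubricResult:
    """One judged rubric (hypothetical container, not from the paper)."""
    weight: float  # w_r: positive for desired criteria, negative for penalties
    score: float   # s_r: points the judge actually awarded


def expert_score(results: Sequence[RubricResult]) -> float:
    """Sum of awarded scores over the sum of positive weights, clipped to [0, 1]."""
    positive_total = sum(r.weight for r in results if r.weight > 0)
    if positive_total <= 0:
        raise ValueError("a question needs at least one positive-weight rubric")
    raw = sum(r.score for r in results) / positive_total
    return min(1.0, max(0.0, raw))


def pass_rate(scores: Iterable[float], threshold: float = 0.7) -> float:
    """Fraction of questions whose ExpertScore meets the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("empty question set")
    return sum(s >= threshold for s in scores) / len(scores)


# Three hypothetical questions, each with rubrics weighted +10, +5, +5, -20.
solid = [RubricResult(10, 10), RubricResult(5, 5), RubricResult(5, 0), RubricResult(-20, 0)]
partial = [RubricResult(10, 5), RubricResult(5, 5), RubricResult(5, 0), RubricResult(-20, 0)]
penalized = [RubricResult(10, 10), RubricResult(5, 5), RubricResult(5, 5), RubricResult(-20, -20)]

scores = [expert_score(q) for q in (solid, partial, penalized)]
print(scores)             # per-question ExpertScore
print(pass_rate(scores))  # share of questions scoring at least 0.7
```

In this example the first answer earns 15 of 20 positive points (0.75) and passes, the second earns 10 of 20 (0.5), and the third satisfies every positive rubric but triggers the penalty, which drives its score to 0. Only one of the three questions clears the 0.7 threshold.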

4. Task Requirements and Operational Distinctions

Distinct from prior benchmarks, 1M-Bench tasks explicitly require:

  • Retrieval and citation of authoritative sources (e.g., SEC filings, medical guidelines)
  • Resolution of conflicting evidence (e.g., divergent regulatory thresholds)
  • Application of domain-specific rules (e.g., IFRS 17 methodology, Supreme People’s Court precedents)
  • Constraint decisions (e.g., compliance with “non-extraction only” directives in clinical or industrial contexts)

These dimensions introduce nontrivial professional trade-offs such as risk management and feasibility, and cannot be addressed by pattern-matching or simplistic chain-of-thought approaches. This design aims to reveal agent limitations in handling conflicting information, integrating multi-source evidence, and following professional standards under real-world constraints (Yang et al., 9 Mar 2026).

5. Experimental Results and Performance Insights

Thirty-five models and agentic systems were evaluated under “Vanilla” (no tools) and “Search” (web search) conditions for both Global and CN subsets. Top-performing models (Global subset) included Claude-Opus-4.6 and GPT-5.4-High. Representative metrics:

| Model | Economic Value (\$k), Vanilla | Economic Value (\$k), Search | ExpertScore (%), Vanilla |
|---|---|---|---|
| Claude-Opus-4.6 | 439.2 | 483.8 (↑44.6) | 55.0 |
| GPT-5.4-High | 330.9 | 365.5 (↑34.6) | 55.0 |

Key findings:

  • Claude-Opus-4.6 achieved \$483.8k economic value, 63% ExpertScore, and 43.5% PassRate with search.
  • Search augmentation improved leading models by +4–12 points (ExpertScore) and +6–13 points (PassRate); however, weaker models sometimes deteriorated with noisy retrieval.
  • DeepResearch agents (e.g., o3-DeepResearch, Sonar-DeepResearch) delivered mid-tier results (40–50% ExpertScore) but did not surpass top generalists.
  • Domain difficulty varied: Finance was most challenging; Healthcare and Law yielded the highest scores. This profile was consistent across language subsets.
  • Rubric-type breakdown: Structure & Formatting (85–90%), Instruction Following (65–82%), Analytical Reasoning (40–60%), Factual Information (40–65%). Search most improved Factual Information and Analytical Reasoning for strong models.
  • Pareto analysis indicated that search-enabled agents dominate non-search baselines at equivalent inference costs, while smaller agents offer mid-range performance at reduced computational expense (Yang et al., 9 Mar 2026).
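The Pareto comparison in the last finding can be illustrated with a small dominance check: a system is on the frontier if no other system is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below is only an illustration of that idea; the system names, costs, and scores are invented placeholders, not results from the benchmark.

```python
from typing import List, Tuple

# (name, inference cost in arbitrary units, ExpertScore in %).
# All values are hypothetical placeholders chosen for illustration.
System = Tuple[str, float, float]

systems: List[System] = [
    ("small-vanilla", 2.0, 36.0),
    ("small-search", 2.0, 42.0),
    ("large-vanilla", 6.0, 50.0),
    ("large-search", 6.0, 58.0),
]


def pareto_front(points: List[System]) -> List[str]:
    """Return the names of systems not dominated on (lower cost, higher score)."""
    front = []
    for name, cost, score in points:
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in points
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(systems))
```

With these placeholder numbers, each search-enabled system dominates its non-search counterpart at equal cost, and the smaller search system remains on the frontier as the cheaper, mid-range option.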

6. Professional Readiness, Conclusions, and Future Directions

OneMillion-Bench reveals a persistent reliability gap between leading LLM agents and human experts in high-stakes domains. Even the best agents (with search capabilities) attain only ~60% ExpertScore and ~40% PassRate, falling short on complex, multi-constraint deliverables. High scores in Structure & Formatting can mask deeper failures in the factual and analytical dimensions, indicating limitations in substantive reasoning. Tool use (web search) has conditional benefits: it empowers strong models but introduces risk for retrieval-sensitive systems. Temporal sensitivity also remains a challenge, as performance declines on time-sensitive tasks.

Planned future developments include domain expansion (to energy, climate, public policy), dynamic benchmark adaptation reflecting evolving professional workflows, and the development of rule-based or learned judges for scalable evaluation of intermediate reasoning, citations, and compliance. OneMillion-Bench thus advances the field toward economically grounded, compliance-oriented, and reliability-demanding benchmarks for language agents capable of meaningful value creation in expert settings (Yang et al., 9 Mar 2026).
