OneMillion-Bench: Language Agent Benchmark
- OneMillion-Bench is a comprehensive benchmark that assesses language agents across professional domains using expert-curated tasks with explicit economic metrics.
- It employs 400 semi-open-ended tasks across five sectors, using detailed rubric-based scoring to evaluate multi-step reasoning and adherence to domain-specific rules.
- The evaluation highlights significant performance gaps in factual integration and analytical reasoning, stressing the need for enhanced professional readiness in language models.
OneMillion-Bench (1M-Bench) is a comprehensive benchmark designed to evaluate language agents in economically consequential, professional domains. It consists of 400 expert-curated tasks spanning Finance, Law, Healthcare, Natural Science, and Industry, each assigned an explicit economic value derived from expert time and prevailing market wages. Tasks are constructed to mirror real-world professional workflows, requiring not only accurate answers but also robust multi-step reasoning, evidence integration, and adherence to domain-specific constraints. The evaluation protocol employs fine-grained, rubric-based scoring. Unlike traditional benchmarks that focus solely on answer correctness, it targets professional readiness and agentic reliability in domain-intensive contexts (Yang et al., 9 Mar 2026).
1. Motivation and Limitations of Prior Benchmarks
Classic QA and multiple-choice evaluations such as MMLU, GPQA, and HLE are limited to isolated knowledge units or formal reasoning tasks in highly structured environments. These approaches fail to reflect the complexity of real-world deliverables, neglecting factors such as multi-step workflows, the need for authoritative references, evidence reconciliation, and nuanced domain rules. Tool-agent evaluations like SWE-bench, TravelPlanner, and Terminal-Bench introduce planning and tool usage but lack rigorous domain-grounded reasoning assessment. Crucially, there has been no standard benchmark to quantify how much real economic value (senior professional time × market wages) LLMs might deliver on tasks of direct economic import. OneMillion-Bench was instituted to systematically bridge these gaps through high-value domains, explicit task valuation, and comprehensive process-oriented rubrics (Yang et al., 9 Mar 2026).
2. Benchmark Composition and Task Distribution
OneMillion-Bench comprises 400 semi-open-ended, expert-curated tasks:
- Domains: Finance, Law, Healthcare, Natural Science, Industry (80 tasks each).
- Language/Cultural Subsets: Each domain is split evenly (40 tasks per subset) between a Chinese subset (CN; 200 tasks in total) reflecting Mainland China standards and an English subset (Global; 200 tasks in total) based on U.S./international norms.
- Hierarchical Coverage: 37 subdomains and 92 third-level categories ensure diversity (e.g., Finance → “Life Insurance,” “VC/PE”; Healthcare → “Orthopedics,” “Dentistry”; Law → “M&A,” “Data Compliance”; Industry → “Semiconductors,” “Civil Engineering”; Natural Science → “Quantum Physics,” “Genetics”).

| Domain | Subset | Task Count | Avg Value per Task | Total Value |
|---|---|---|---|---|
| Finance | CN (¥) | 40 | ¥7,410.8 | ¥296,432 |
| Finance | Global (\$) | 40 | \$4,593.2 | \$183,726 |
| Total | CN | 200 | n/a | ¥921,832 |
| Total | Global | 200 | n/a | \$1,008,370 |

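
The Finance figures can be sanity-checked with simple arithmetic. The sketch below assumes 40 tasks per domain per subset (80 tasks per domain, split evenly between CN and Global) and checks that the average value times the task count reproduces the reported totals, allowing for rounding of the averages to one decimal place. The function and variable names are illustrative and are not taken from the benchmark's code.

```python
# Sanity check: average value per task x task count should match the reported
# total, up to the rounding of the published average (one decimal place).
# Figures are copied from the table; all names here are illustrative.

TASKS_PER_DOMAIN = 80
SUBSETS = 2  # CN and Global
TASKS_PER_CELL = TASKS_PER_DOMAIN // SUBSETS  # tasks per domain per subset

finance = {
    "CN": {"avg": 7410.8, "total": 296_432},      # yuan
    "Global": {"avg": 4593.2, "total": 183_726},  # US dollars
}


def implied_total(avg_value: float, n_tasks: int) -> float:
    """Total value implied by a (rounded) per-task average."""
    return avg_value * n_tasks


def max_rounding_error(n_tasks: int, decimals: int = 1) -> float:
    """Largest drift in the total caused by rounding the average."""
    return n_tasks * 0.5 * 10 ** (-decimals)


for subset, row in finance.items():
    gap = abs(implied_total(row["avg"], TASKS_PER_CELL) - row["total"])
    consistent = gap <= max_rounding_error(TASKS_PER_CELL) + 1e-6
    print(f"{subset}: gap={gap:.2f}, consistent={consistent}")
```

With a task count of 80 per row the implied totals would be twice the reported ones, which is why the check uses the per-subset count.
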
Tasks are exemplified by prompts such as, “Review the Yen’s 2025 depreciation cycle phases [...] and project 2026 trend-reversion triggers,” requiring structured, multi-source synthesis, quantitative analysis, and forward-looking judgments.
3. Rubric-Based Evaluation Protocol
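Each question $q$ is paired with a set of rubrics $R_q = \{r_1, \dots, r_n\}$, where every rubric $r_i$ includes:
- Explicit measurement criteria (e.g., legal citation, multi-step calculation, guideline application)
- A weight $w_i$, positive or negative, reflecting impact
- Functional tag: Factual Information (FI), Analytical Reasoning (AR), Instruction Following (IF), Structure & Formatting (SF)
- (Optionally) references to authoritative sources
Scoring is performed either by an LM judge or a human expert, who assigns a score $s_i \in \{0, 1\}$ to each rubric. The ExpertScore for a question $q$ is
$$\mathrm{ExpertScore}(q) = \frac{\sum_{r_i \in R_q} w_i \, s_i}{\sum_{r_i \in R_q^{+}} w_i},$$
where $R_q^{+} \subseteq R_q$ is the set of rubrics with positive weights. Scores are clipped to $[0, 1]$ for normalization. The PassRate across a question set $Q$ is defined as
$$\mathrm{PassRate}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\mathrm{ExpertScore}(q) \ge \tau\right],$$
where $\tau$ is the pass threshold.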
Aggregations can be performed per domain, rubric type, or as an overall mean benchmark score (Yang et al., 9 Mar 2026).
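The scoring described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's reference implementation: it assumes binary rubric judgments, a normalizer equal to the sum of positive weights, and a caller-supplied pass threshold (no threshold value is given in the text above). The class and function names, the example weights, and the example threshold of 0.5 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class RubricResult:
    weight: float    # assumption: negative weights penalize a triggered rubric
    met: bool        # assumption: the judge returns a binary verdict
    tag: str = "FI"  # functional tag: FI, AR, IF, or SF


def expert_score(results: Sequence[RubricResult]) -> float:
    """Weighted rubric score, normalized by positive weights and clipped to [0, 1]."""
    positive_total = sum(r.weight for r in results if r.weight > 0)
    if positive_total <= 0:
        raise ValueError("at least one positively weighted rubric is required")
    earned = sum(r.weight for r in results if r.met)
    return min(1.0, max(0.0, earned / positive_total))


def pass_rate(scores: Iterable[float], threshold: float) -> float:
    """Fraction of questions whose score reaches the (assumed) pass threshold."""
    values: List[float] = list(scores)
    if not values:
        raise ValueError("no scores to aggregate")
    return sum(s >= threshold for s in values) / len(values)


def mean_by_group(scored: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean score per group key, e.g. per domain or per rubric tag."""
    buckets: Dict[str, List[float]] = {}
    for key, score in scored:
        buckets.setdefault(key, []).append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical rubric outcomes for one question.
example = [
    RubricResult(5, True, "FI"),
    RubricResult(3, True, "AR"),
    RubricResult(2, False, "IF"),
    RubricResult(-4, True, "FI"),  # a penalty rubric that was triggered
]
print(expert_score(example))
print(pass_rate([0.4, 0.0, 0.8], threshold=0.5))
print(mean_by_group([("Finance", 0.4), ("Finance", 0.8), ("Law", 0.0)]))
```

Clipping matters because triggered negative-weight rubrics can push the raw ratio below zero; without it, a single heavily penalized answer would distort domain-level means.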
4. Task Requirements and Operational Distinctions
Distinct from prior benchmarks, 1M-Bench tasks explicitly require:
- Retrieval and citation of authoritative sources (e.g., SEC filings, medical guidelines)
- Resolution of conflicting evidence (e.g., divergent regulatory thresholds)
- Application of domain-specific rules (e.g., IFRS 17 methodology, Supreme People’s Court precedents)
- Constraint decisions (e.g., compliance with “non-extraction only” directives in clinical or industrial contexts)
These dimensions introduce nontrivial professional trade-offs such as risk management and feasibility, and cannot be addressed by pattern-matching or simplistic chain-of-thought approaches. This design aims to reveal agent limitations in handling conflicting information, integrating multi-source evidence, and following professional standards under real-world constraints (Yang et al., 9 Mar 2026).
5. Experimental Results and Performance Insights
Thirty-five models and agentic systems were evaluated under “Vanilla” (no tools) and “Search” (web search) conditions for both Global and CN subsets. Top-performing models (Global subset) included Claude-Opus-4.6 and GPT-5.4-High. Representative metrics:

| Model | Economic Value (\$k), Vanilla | Economic Value (\$k), Search | ExpertScore (%), Vanilla |
|---|---|---|---|
| Claude-Opus-4.6 | 439.2 | 483.8 (↑44.6) | 55.0 |
| GPT-5.4-High | 330.9 | 365.5 (↑34.6) | 55.0 |

Key findings:
- Claude-Opus-4.6 achieved \$483.8k economic value, 63% ExpertScore, and 43.5% PassRate with search.
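
The reported search uplift can be rechecked from the Vanilla and Search economic-value figures, and each model's value can be put in context against the Global subset's total task value of \$1,008,370. The sketch below assumes the economic-value figures refer to the Global subset, since they are quoted in dollars. How value is credited to a model per task is not described here, so the code only computes differences and ratios. All names are illustrative.

```python
# Recompute the search uplift and the share of the Global subset's total
# task value. Figures are copied from the text; names are illustrative.

GLOBAL_TOTAL_K = 1008.370  # total Global task value, in thousands of dollars

results_k = {
    "Claude-Opus-4.6": {"vanilla": 439.2, "search": 483.8},
    "GPT-5.4-High": {"vanilla": 330.9, "search": 365.5},
}


def search_uplift(row: dict) -> float:
    """Gain in economic value ($k) from enabling web search."""
    return round(row["search"] - row["vanilla"], 1)


def share_of_total(value_k: float, total_k: float = GLOBAL_TOTAL_K) -> float:
    """Fraction of the subset's total task value represented by value_k."""
    return value_k / total_k


for model, row in results_k.items():
    print(
        f"{model}: uplift {search_uplift(row):.1f}k, "
        f"search share {share_of_total(row['search']):.1%}"
    )
```

Under these assumptions the best search-enabled result corresponds to roughly 48% of the Global subset's total task value.
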
Pareto analysis indicated that search-enabled agents dominate non-search baselines at equivalent inference costs, while smaller agents offer mid-range performance at reduced computational expense (Yang et al., 9 Mar 2026).
6. Professional Readiness, Conclusions, and Future Directions
OneMillion-Bench reveals a persistent reliability gap between leading LLM agents and human experts in high-stakes domains. Even the best agents (with search capabilities) attain only ~60% ExpertScore and ~40% PassRate, falling short on complex, multi-constraint deliverables. High scores in Structure & Formatting can mask deeper failures in factual and analytical dimensions, indicating limitations in substantive reasoning. Tool use (web search) has conditional benefits, empowering strong models but introducing risk for retrieval-sensitive systems, and temporal sensitivity remains a challenge, as performance declines on time-sensitive tasks. Planned future developments include domain expansion (to energy, climate, public policy), dynamic benchmark adaptation reflecting evolving professional workflows, and the development of rule-based or learned judges for scalable evaluation of intermediate reasoning, citations, and compliance. OneMillion-Bench thus advances the field toward economically grounded, compliance-oriented, and reliability-demanding benchmarks for language agents capable of meaningful value creation in expert settings (Yang et al., 9 Mar 2026).