EcomBench: E-Commerce Agent Evaluation
- EcomBench is a comprehensive benchmark that evaluates foundation agents in real-world e-commerce environments with authentic user demands and multi-step reasoning.
- It employs expert curation, fine-grained categorization, and dynamic difficulty stratification to rigorously assess performance across seven key business functions.
- Evaluation results highlight significant challenges in advanced reasoning and domain integration, emphasizing the need for improved agent adaptation in complex e-commerce ecosystems.
EcomBench is a holistic, large-scale benchmark focused on rigorous evaluation of foundation agents within authentic e-commerce ecosystems. Distinct from prior academic agent benchmarks, which emphasize synthetic tasks or narrow scenarios, EcomBench grounds agent assessment in realistic user demands sampled from leading international marketplaces. Methodologically, it combines expert curation, fine-grained categorization, and dynamic stratification of question difficulty to reveal key limitations in contemporary agent reasoning, tool use, and domain knowledge integration (Min et al., 9 Dec 2025).
1. Motivation and Benchmark Framework
E-commerce constitutes a high-complexity, high-volume domain characterized by heterogeneous user intents, dynamic market regulations, and direct ties to economic decision-making. Previous benchmarks predominantly addressed artificial puzzles or simplified business logic, lacking coverage of authentic multi-step procedural tasks, unstable inventory conditions, and non-trivial regulatory constraints. EcomBench was established to provide a domain-grounded evaluation platform, measuring not only basic retrieval or arithmetic but also advanced reasoning, cross-source information synthesis, and robust tool utilization. Its design pipeline, sketched in code after the list, includes:
- Collection and anonymization of millions of raw “user demand” logs (e.g., search, support, dashboard queries) from global platforms such as Amazon, AliExpress, and Walmart.
- LLM-based filtering and rewriting of logs into candidate Q&A pairs, discarding unsolvable or subjective entries.
- Three-stage expert annotation: professional rewrite, ground-truth verification, and consensus labeling (discarding items with any disagreement).
- Difficulty annotation via a tool hierarchy: tasks are classified based on the minimal toolset (atomic search/browse vs. domain-specific APIs) and the number of actions required, yielding Level 1 (easy), Level 2 (medium), and Level 3 (hard).
- Quarterly benchmark refresh: obsolete items are retired, and new questions reflecting market changes are added.
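The LLM-based filtering and rewriting stages (the second and third items above) lend themselves to a simple programmatic form. The sketch below is a minimal illustration under assumptions: the generic `call_llm` client, the prompt wording, and the `Candidate` schema are not specified in the paper.

```python
# Minimal sketch of the LLM-based filtering and rewriting stages.
# `call_llm` is a placeholder for any chat-completion client; the prompt
# wording and the Candidate schema are illustrative assumptions.
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any LLM client here

@dataclass
class Candidate:
    question: str
    answer: str | None = None  # filled in later during expert annotation

def is_solvable_and_objective(log_text: str) -> bool:
    """Discard ambiguous, subjective, or unanswerable demand logs."""
    verdict = call_llm(
        "Is the following user demand objectively answerable from "
        f"verifiable e-commerce data? Reply YES or NO.\n\n{log_text}"
    )
    return verdict.strip().upper().startswith("YES")

def rewrite_to_question(log_text: str) -> Candidate:
    """Rewrite a raw, anonymized demand log into a candidate question."""
    question = call_llm(
        "Rewrite this anonymized user demand as one self-contained, "
        f"verifiable benchmark question:\n\n{log_text}"
    )
    return Candidate(question=question)

def curate(raw_logs: list[str]) -> list[Candidate]:
    return [rewrite_to_question(t) for t in raw_logs
            if is_solvable_and_objective(t)]
```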
2. Task Categories and Formal Definitions
EcomBench defines seven mutually exclusive fine-grained task categories, each reflecting a core business function:
- Policy Consulting: Given product specification $s$ and target market $m$, retrieve the applicable rule $r(s, m)$ and compute the quantitative limit $\ell(r)$.
- Cost and Pricing: Given costs $c$, duties $d$, taxes $t$, and exchange rates $x$, compute the total landed cost or the optimal price $p^{*}$ under margin/budget constraints.
- Fulfillment Execution: Given routing and shipping options $O = \{o_1, \dots, o_n\}$, select $o^{*} \in O$ minimizing delay within budget $B$.
- Marketing Strategy: For portfolio $P$ and segment $g$, design a promotion plan $\pi$ maximizing reach subject to budget $B$.
- Intelligent Product Selection: Rank categories by forecasted growth using historical time series $\{y_t\}$ and trend signals over horizon $H$.
- Opportunity Discovery: Using metrics $M$, apply anomaly/clustering detection to identify emerging segments $i$ where growth exceeds a threshold, $g_i > \tau$.
- Inventory Control: Given stochastic demand $D$, holding cost $h$, and stockout penalty $p$, choose safety stock $s$ minimizing the expected cost $\mathbb{E}\left[h\,(s - D)^{+} + p\,(D - s)^{+}\right]$.
These formalizations are representative based on the benchmark’s structure and described annotation procedures (Min et al., 9 Dec 2025).
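To make one of these formalizations concrete, the sketch below solves the inventory-control objective in its classical newsvendor form, where the minimizer of $\mathbb{E}[h(s-D)^{+} + p(D-s)^{+}]$ is the critical-fractile quantile $s^{*} = F^{-1}\left(p/(p+h)\right)$. The normal-demand model and the parameter values are illustrative assumptions, not taken from the benchmark.

```python
# Illustrative newsvendor solution for the Inventory Control category.
# Normal demand and the specific parameters are assumptions for the example.
from scipy.stats import norm

def optimal_safety_stock(mu: float, sigma: float, h: float, p: float) -> float:
    """Minimize E[h*(s - D)^+ + p*(D - s)^+] for D ~ Normal(mu, sigma).

    The minimizer is the critical-fractile quantile s* = F^{-1}(p / (p + h)).
    """
    critical_fractile = p / (p + h)
    return norm.ppf(critical_fractile, loc=mu, scale=sigma)

# Example: mean demand 1000 units, std 200, holding cost $2/unit,
# stockout penalty $8/unit -> stock at the 80th percentile of demand.
s_star = optimal_safety_stock(mu=1000, sigma=200, h=2.0, p=8.0)
print(f"Optimal stock level: {s_star:.0f} units")
```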
3. Dataset Construction and Annotation Pipeline
The benchmark leverages anonymized logs representing genuine business and user requests. Raw entries are pre-filtered using LLMs to discard ambiguous or non-verifiable samples. Expert annotators, a team of more than ten full-time specialists, refine each question for clarity, ensure complete specification, and validate each answer independently. Instances with any label disagreement are excluded, ensuring ground-truth reliability. “Large-scale” here refers to approximately 5,000 unique, verified instances, evenly distributed across task categories (within ±5%). Difficulty stratification yields Level 1 (20%), Level 2 (30%), and Level 3 (50%) items; annotation demands about eight human-hours per question (Min et al., 9 Dec 2025).
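The consensus rule and the composition targets can be checked mechanically. The following sketch assumes a hypothetical `Item` record; the field names are illustrative, not the paper's data format.

```python
# Sketch of the unanimity rule and composition checks described above;
# the Item structure and its field names are assumptions for illustration.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    category: str            # one of the seven task categories
    level: int               # 1, 2, or 3
    expert_answers: list[str]

def unanimous(items: list[Item]) -> list[Item]:
    """Keep only items on which every expert produced the same answer."""
    return [it for it in items if len(set(it.expert_answers)) == 1]

def composition(items: list[Item]) -> tuple[Counter, Counter]:
    """Report category balance (target: even within +/-5%) and the
    20/30/50 difficulty split across Levels 1/2/3."""
    return (Counter(it.category for it in items),
            Counter(it.level for it in items))
```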
4. Difficulty Levels and Assignment Rubric
Difficulty is operationalized via atomic vs. domain-specific tool usage and the minimal action count. Let $q$ be a benchmark question, $\mathcal{T}_{\mathrm{web}}$ the set of atomic web-based tools (search/browse), and $n(q)$ the minimal number of tool actions required to solve $q$. The difficulty function is:
- Level 1: $n(q) = 1$ using only $\mathcal{T}_{\mathrm{web}}$ (single-step, superficial lookup).
- Level 2: $n(q) \geq 2$ using only $\mathcal{T}_{\mathrm{web}}$ (multi-step, moderate reasoning).
- Level 3: $q$ is unsolvable via $\mathcal{T}_{\mathrm{web}}$ alone, requiring domain-specific tools $\mathcal{T}_{\mathrm{dom}}$ (e.g., price_lookup, trend_analyzer).
This stratification enforces progressive complexity, mapping directly to critical agent capabilities: deep retrieval, procedural reasoning, and integration of heterogeneous business APIs.
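Under these definitions, the rubric reduces to a small classifier over a minimal solution trace. The sketch below assumes the trace is given as the list of tools invoked; apart from the two domain-tool names quoted in the text, the tool inventory is an assumption.

```python
# Hedged sketch of the difficulty rubric; the trace format is an assumption.
# `tools` lists every tool invocation in a minimal solution of the question.
ATOMIC_WEB_TOOLS = {"search", "browse"}            # T_web
DOMAIN_TOOLS = {"price_lookup", "trend_analyzer"}  # T_dom (examples from text)

def difficulty(tools: list[str]) -> int:
    """Return 1, 2, or 3 per the toolset/action-count rubric."""
    if any(t in DOMAIN_TOOLS for t in tools):
        return 3  # not solvable with atomic web tools alone
    if len(tools) == 1:
        return 1  # single-step, superficial lookup
    return 2      # multi-step with atomic tools only

assert difficulty(["search"]) == 1
assert difficulty(["search", "browse", "search"]) == 2
assert difficulty(["search", "price_lookup"]) == 3
```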
5. Evaluation Metrics
EcomBench adopts a binary-correctness protocol. All model outputs are normalized and compared to ground-truth answers, with a judge LLM assigning a verdict $v_i \in \{0, 1\}$ to each instance $i$. Metrics are:
- Overall accuracy: $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} v_i$ over all $N$ instances.
- Per-level accuracy: $\mathrm{Acc}_{\ell} = \frac{1}{|Q_{\ell}|} \sum_{i \in Q_{\ell}} v_i$, where $Q_{\ell}$ is the set of Level-$\ell$ questions.
- Per-category accuracy: $\mathrm{Acc}_{c} = \frac{1}{|Q_{c}|} \sum_{i \in Q_{c}} v_i$, where $Q_{c}$ is the set of questions in category $c$.
Consistency is verified by random manual audits, which confirm >95% agreement with the automated LLM judgments.
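Given per-instance judge verdicts, all three metrics are simple averages over the relevant subsets. A minimal implementation follows, assuming a flat record format that is not specified in the paper.

```python
# Computes the three accuracy metrics from judge verdicts; the record
# format is an assumption. `verdict` is the judge LLM's 0/1 output.
from collections import defaultdict

def accuracies(records: list[dict]) -> dict:
    """records: [{'verdict': 0|1, 'level': int, 'category': str}, ...]"""
    overall = sum(r["verdict"] for r in records) / len(records)
    by_level, by_cat = defaultdict(list), defaultdict(list)
    for r in records:
        by_level[r["level"]].append(r["verdict"])
        by_cat[r["category"]].append(r["verdict"])
    return {
        "overall": overall,
        "per_level": {k: sum(v) / len(v) for k, v in by_level.items()},
        "per_category": {k: sum(v) / len(v) for k, v in by_cat.items()},
    }
```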
6. Experimental Results and Analysis
Tests include 12 leading agent models (ChatGPT-5.1, Gemini DeepResearch, SuperGrok Expert, Doubao DeepResearch, etc.), each run on provider-default infrastructure (up to 8×A100 GPUs, no further hyperparameter tuning). Results by difficulty:
| Level | Top-10 Model Accuracy |
|---|---|
| Level 1 | 80–95% |
| Level 2 | ~60–75% |
| Level 3 | 46% (best), <35% (typical) |
Level 3 remains challenging, indicating persistent gaps in cross-source integration and long-chain reasoning. Category-wise, no single agent leads in all business areas:
- Policy Consulting / Fulfillment: ChatGPT-5.1, SuperGrok
- Cost / Inventory: SuperGrok Expert, ChatGPT-5.1
- Strategy/Opportunity/Selection/Marketing: Gemini DeepResearch, ChatGPT-5.1
Removal of domain-specific tool APIs (e.g., price_lookup) drops Level 3 accuracy by 15–20%, while disabling chain-of-thought prompts causes ~10% reduction in Level 2.
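These ablations amount to re-running the benchmark under restricted agent configurations. The harness below is a hypothetical sketch: the `agent_factory` interface, the tool names beyond those quoted above, and the `chain_of_thought` flag are assumptions, not the paper's API.

```python
# Hypothetical ablation harness for the tool/prompting experiments above.
# The agent interface and configuration flags are illustrative assumptions.
FULL_TOOLS = ["search", "browse", "price_lookup", "trend_analyzer"]

CONFIGS = {
    "full":      dict(tools=FULL_TOOLS, chain_of_thought=True),
    "no_domain": dict(tools=["search", "browse"], chain_of_thought=True),
    "no_cot":    dict(tools=FULL_TOOLS, chain_of_thought=False),
}

def run_ablation(agent_factory, benchmark, judge) -> dict[str, float]:
    """Return overall accuracy per configuration.

    agent_factory(**cfg) -> agent exposing .answer(question_text) -> str
    judge(question, answer) -> 0 or 1 (binary-correctness verdict)
    """
    results = {}
    for name, cfg in CONFIGS.items():
        agent = agent_factory(**cfg)
        verdicts = [judge(q, agent.answer(q["text"])) for q in benchmark]
        results[name] = sum(verdicts) / len(verdicts)
    return results
```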
7. Limitations, Controversies, and Future Directions
EcomBench currently covers only question-answer tasks, omitting the end-to-end action sequences typical of live transaction environments (e.g., full order placement). Predictive, multimodal, and interactive workflows are reserved for future quarterly releases, alongside planned expansion to a dashboard simulation module, A/B test design, and tabular/image data support. Annotation cost remains high (~8 hours/question), constraining rapid scaling. Community-driven challenge submission is scheduled for future versions.
A plausible implication is that while accuracy on superficially simple questions approaches saturation for top models, agents remain bottlenecked by real-world tool connectivity and domain adaptation. The benchmark continues to evolve, aiming for comprehensive end-to-end assessment grounded in authentic e-commerce complexities (Min et al., 9 Dec 2025).
EcomBench represents the state of the art in holistic domain-specific foundation agent evaluation, extending beyond synthetic logic exercises to cover the multi-modal, multi-entity, and multi-step reasoning demands of modern digital commerce. Its stratified, expert-verified, and quarterly updated framework is positioned as a crucial reference for agent development and deployment in production e-commerce ecosystems (Min et al., 9 Dec 2025).