EcomBench: E-Commerce Agent Evaluation
- EcomBench is a comprehensive benchmark that evaluates foundation agents in real-world e-commerce environments with authentic user demands and multi-step reasoning.
- It employs expert curation, fine-grained categorization, and dynamic difficulty stratification to rigorously assess performance across seven key business functions.
- Evaluation results highlight significant challenges in advanced reasoning and domain integration, emphasizing the need for improved agent adaptation in complex e-commerce ecosystems.
EcomBench is a holistic, large-scale benchmark focused on rigorous evaluation of foundation agents within authentic e-commerce ecosystems. Distinct from prior academic agent benchmarks, which emphasize synthetic tasks or narrow scenarios, EcomBench grounds agent assessment in realistic user demands sampled from leading international marketplaces. Methodologically, it combines expert curation, fine-grained categorization, and dynamic stratification of question difficulty to reveal key limitations in contemporary agent reasoning, tool use, and domain knowledge integration (Min et al., 9 Dec 2025).
1. Motivation and Benchmark Framework
E-commerce constitutes a high-complexity, high-volume domain characterized by heterogeneous user intents, dynamic market regulations, and direct ties to economic decision-making. Previous benchmarks predominantly addressed artificial puzzles or simplified business logic, lacking coverage of authentic multi-step procedural tasks, unstable inventory conditions, and non-trivial regulatory constraints. EcomBench was established to provide a domain-grounded evaluation platform, measuring not only basic retrieval or arithmetic but also advanced reasoning, cross-source information synthesis, and robust tool utilization. Its design pipeline, sketched in code after the list, includes:
- Collection and anonymization of millions of raw “user demand” logs (e.g., search, support, dashboard queries) from global platforms such as Amazon, AliExpress, and Walmart.
- LLM-based filtering and rewriting of logs into candidate Q&A pairs, discarding unsolvable or subjective entries.
- Three-stage expert annotation: professional rewrite, ground-truth verification, and consensus labeling (discarding items with any disagreement).
- Difficulty annotation via a tool hierarchy: tasks are classified based on the minimal toolset (atomic search/browse vs. domain-specific APIs) and the number of actions required, yielding Level 1 (easy), Level 2 (medium), and Level 3 (hard).
- Quarterly benchmark refresh: obsolete items are retired, and new questions reflecting market changes are added.
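The LLM-based filtering and rewriting stages (the second and third items above) lend themselves to a simple programmatic form. The sketch below is a minimal illustration under assumptions: the generic `call_llm` client, the prompt wording, and the `Candidate` schema are not specified in the paper.

```python
# Minimal sketch of the LLM-based filtering and rewriting stages.
# `call_llm` is a placeholder for any chat-completion client; the prompt
# wording and the Candidate schema are illustrative assumptions.
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any LLM client here

@dataclass
class Candidate:
    question: str
    answer: str | None = None  # filled in later during expert annotation

def is_solvable_and_objective(log_text: str) -> bool:
    """Discard ambiguous, subjective, or unanswerable demand logs."""
    verdict = call_llm(
        "Is the following user demand objectively answerable from "
        f"verifiable e-commerce data? Reply YES or NO.\n\n{log_text}"
    )
    return verdict.strip().upper().startswith("YES")

def rewrite_to_question(log_text: str) -> Candidate:
    """Rewrite a raw, anonymized demand log into a candidate question."""
    question = call_llm(
        "Rewrite this anonymized user demand as one self-contained, "
        f"verifiable benchmark question:\n\n{log_text}"
    )
    return Candidate(question=question)

def curate(raw_logs: list[str]) -> list[Candidate]:
    return [rewrite_to_question(t) for t in raw_logs
            if is_solvable_and_objective(t)]
```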
2. Task Categories and Formal Definitions
EcomBench defines seven mutually exclusive fine-grained task categories, each reflecting a core business function:
- Policy Consulting: Given product specification $s$ and target market $m$, retrieve the applicable rule $r(s, m)$ and compute the quantitative limit $\ell(r)$.
- Cost and Pricing: Given costs $c$, duties $d$, taxes $t$, and exchange rates $x$, compute the total landed cost or the optimal price $p^{*}$ under margin/budget constraints.
- Fulfillment Execution: Given routing and shipping options $O = \{o_1, \dots, o_n\}$, select $o^{*} \in O$ minimizing delay within budget $B$.
- Marketing Strategy: For portfolio $P$ and segment $g$, design a promotion plan $\pi$ maximizing reach subject to budget $B$.
- Intelligent Product Selection: Rank categories by forecasted growth using historical time series $\{y_t\}$ and trend signals over horizon $H$.
- Opportunity Discovery: Using metrics $M$, apply anomaly/clustering detection to identify emerging segments $i$ where growth exceeds a threshold, $g_i > \tau$.
- Inventory Control: Given stochastic demand $D$, holding cost $h$, and stockout penalty $p$, choose safety stock $s$ minimizing the expected cost $\mathbb{E}\left[h\,(s - D)^{+} + p\,(D - s)^{+}\right]$.
These formalizations are representative based on the benchmark’s structure and described annotation procedures (Min et al., 9 Dec 2025).
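To make one of these formalizations concrete, the sketch below solves the inventory-control objective in its classical newsvendor form, where the minimizer of $\mathbb{E}[h(s-D)^{+} + p(D-s)^{+}]$ is the critical-fractile quantile $s^{*} = F^{-1}\left(p/(p+h)\right)$. The normal-demand model and the parameter values are illustrative assumptions, not taken from the benchmark.

```python
# Illustrative newsvendor solution for the Inventory Control category.
# Normal demand and the specific parameters are assumptions for the example.
from scipy.stats import norm

def optimal_safety_stock(mu: float, sigma: float, h: float, p: float) -> float:
    """Minimize E[h*(s - D)^+ + p*(D - s)^+] for D ~ Normal(mu, sigma).

    The minimizer is the critical-fractile quantile s* = F^{-1}(p / (p + h)).
    """
    critical_fractile = p / (p + h)
    return norm.ppf(critical_fractile, loc=mu, scale=sigma)

# Example: mean demand 1000 units, std 200, holding cost $2/unit,
# stockout penalty $8/unit -> stock at the 80th percentile of demand.
s_star = optimal_safety_stock(mu=1000, sigma=200, h=2.0, p=8.0)
print(f"Optimal stock level: {s_star:.0f} units")
```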
3. Dataset Construction and Annotation Pipeline
The benchmark leverages anonymized logs representing genuine business and user requests. Raw entries are pre-filtered using LLMs to discard ambiguous or non-verifiable samples. Expert annotators, a team of more than ten full-time specialists, refine each question for clarity, ensure complete specification, and validate each answer independently. Instances with any label disagreement are excluded, ensuring ground-truth reliability. “Large-scale” here refers to approximately 5,000 unique, verified instances, evenly distributed across task categories (within ±5%). Difficulty stratification yields Level 1 (20%), Level 2 (30%), and Level 3 (50%) items; annotation demands about eight human-hours per question (Min et al., 9 Dec 2025).
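The consensus rule and the composition targets can be checked mechanically. The following sketch assumes a hypothetical `Item` record; the field names are illustrative, not the paper's data format.

```python
# Sketch of the unanimity rule and composition checks described above;
# the Item structure and its field names are assumptions for illustration.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    category: str            # one of the seven task categories
    level: int               # 1, 2, or 3
    expert_answers: list[str]

def unanimous(items: list[Item]) -> list[Item]:
    """Keep only items on which every expert produced the same answer."""
    return [it for it in items if len(set(it.expert_answers)) == 1]

def composition(items: list[Item]) -> tuple[Counter, Counter]:
    """Report category balance (target: even within +/-5%) and the
    20/30/50 difficulty split across Levels 1/2/3."""
    return (Counter(it.category for it in items),
            Counter(it.level for it in items))
```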
4. Difficulty Levels and Assignment Rubric
Difficulty is operationalized via atomic vs. domain-specific tool usage and the minimal action count. Let $q$ be a benchmark question, $\mathcal{T}_{\mathrm{web}}$ the set of atomic web-based tools (search/browse), and $n(q)$ the minimal number of tool actions required to solve $q$. The difficulty function is:
- Level 1: $n(q) = 1$ using only $\mathcal{T}_{\mathrm{web}}$ (single-step, superficial lookup).
- Level 2: $n(q) \geq 2$ using only $\mathcal{T}_{\mathrm{web}}$ (multi-step, moderate reasoning).
- Level 3: $q$ is unsolvable via $\mathcal{T}_{\mathrm{web}}$ alone, requiring domain-specific tools $\mathcal{T}_{\mathrm{dom}}$ (e.g., price_lookup, trend_analyzer).
This stratification enforces progressive complexity, mapping directly to critical agent capabilities: deep retrieval, procedural reasoning, and integration of heterogeneous business APIs.
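Under these definitions, the rubric reduces to a small classifier over a minimal solution trace. The sketch below assumes the trace is given as the list of tools invoked; apart from the two domain-tool names quoted in the text, the tool inventory is an assumption.

```python
# Hedged sketch of the difficulty rubric; the trace format is an assumption.
# `tools` lists every tool invocation in a minimal solution of the question.
ATOMIC_WEB_TOOLS = {"search", "browse"}            # T_web
DOMAIN_TOOLS = {"price_lookup", "trend_analyzer"}  # T_dom (examples from text)

def difficulty(tools: list[str]) -> int:
    """Return 1, 2, or 3 per the toolset/action-count rubric."""
    if any(t in DOMAIN_TOOLS for t in tools):
        return 3  # not solvable with atomic web tools alone
    if len(tools) == 1:
        return 1  # single-step, superficial lookup
    return 2      # multi-step with atomic tools only

assert difficulty(["search"]) == 1
assert difficulty(["search", "browse", "search"]) == 2
assert difficulty(["search", "price_lookup"]) == 3
```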
5. Evaluation Metrics
EcomBench adopts a binary-correctness protocol. All model outputs are normalized and compared to ground-truth answers, with a judge LLM assigning a verdict $v_i \in \{0, 1\}$ to each instance $i$. Metrics are:
- Overall accuracy: $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} v_i$ over all $N$ instances.
- Per-level accuracy: $\mathrm{Acc}_{\ell} = \frac{1}{|Q_{\ell}|} \sum_{i \in Q_{\ell}} v_i$, where $Q_{\ell}$ is the set of Level-$\ell$ questions.
- Per-category accuracy: $\mathrm{Acc}_{c} = \frac{1}{|Q_{c}|} \sum_{i \in Q_{c}} v_i$, where $Q_{c}$ is the set of questions in category $c$.
Consistency is verified by random manual audits, which confirm >95% agreement with the automated LLM judgments.
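Given per-instance judge verdicts, all three metrics are simple averages over the relevant subsets. A minimal implementation follows, assuming a flat record format that is not specified in the paper.

```python
# Computes the three accuracy metrics from judge verdicts; the record
# format is an assumption. `verdict` is the judge LLM's 0/1 output.
from collections import defaultdict

def accuracies(records: list[dict]) -> dict:
    """records: [{'verdict': 0|1, 'level': int, 'category': str}, ...]"""
    overall = sum(r["verdict"] for r in records) / len(records)
    by_level, by_cat = defaultdict(list), defaultdict(list)
    for r in records:
        by_level[r["level"]].append(r["verdict"])
        by_cat[r["category"]].append(r["verdict"])
    return {
        "overall": overall,
        "per_level": {k: sum(v) / len(v) for k, v in by_level.items()},
        "per_category": {k: sum(v) / len(v) for k, v in by_cat.items()},
    }
```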
6. Experimental Results and Analysis
Tests include 12 leading agent models (ChatGPT-5.1, Gemini DeepResearch, SuperGrok Expert, Doubao DeepResearch, etc.), each run on provider-default infrastructure (up to 8×A100 GPUs, no further hyperparameter tuning). Results by difficulty:
| Level | Top-10 Model Accuracy |
|---|---|
| Level 1 | 80–95% |
| Level 2 | ~60–75% |
| Level 3 | 46% (best), <35% (typical) |
Level 3 remains challenging, indicating persistent gaps in cross-source integration and long-chain reasoning. Category-wise, no single agent leads in all business areas:
- Policy Consulting / Fulfillment: ChatGPT-5.1, SuperGrok
- Cost / Inventory: SuperGrok Expert, ChatGPT-5.1
- Strategy/Opportunity/Selection/Marketing: Gemini DeepResearch, ChatGPT-5.1
Removal of domain-specific tool APIs (e.g., price_lookup) drops Level 3 accuracy by 15–20%, while disabling chain-of-thought prompts causes ~10% reduction in Level 2.
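These ablations amount to re-running the benchmark under restricted agent configurations. The harness below is a hypothetical sketch: the `agent_factory` interface, the tool names beyond those quoted above, and the `chain_of_thought` flag are assumptions, not the paper's API.

```python
# Hypothetical ablation harness for the tool/prompting experiments above.
# The agent interface and configuration flags are illustrative assumptions.
FULL_TOOLS = ["search", "browse", "price_lookup", "trend_analyzer"]

CONFIGS = {
    "full":      dict(tools=FULL_TOOLS, chain_of_thought=True),
    "no_domain": dict(tools=["search", "browse"], chain_of_thought=True),
    "no_cot":    dict(tools=FULL_TOOLS, chain_of_thought=False),
}

def run_ablation(agent_factory, benchmark, judge) -> dict[str, float]:
    """Return overall accuracy per configuration.

    agent_factory(**cfg) -> agent exposing .answer(question_text) -> str
    judge(question, answer) -> 0 or 1 (binary-correctness verdict)
    """
    results = {}
    for name, cfg in CONFIGS.items():
        agent = agent_factory(**cfg)
        verdicts = [judge(q, agent.answer(q["text"])) for q in benchmark]
        results[name] = sum(verdicts) / len(verdicts)
    return results
```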
7. Limitations, Controversies, and Future Directions
EcomBench currently covers only question-answer tasks, omitting the end-to-end action sequences typical of live transaction environments (e.g., full order placement). Predictive, multimodal, and interactive workflows are reserved for future quarterly releases, alongside planned expansion to a dashboard simulation module, A/B test design, and tabular/image data support. Annotation cost remains high (~8 hours/question), constraining rapid scaling. Community-driven challenge submission is scheduled for future versions.
A plausible implication is that while accuracy on superficially simple questions approaches saturation for top models, agents remain bottlenecked by real-world tool connectivity and domain adaptation. The benchmark continues to evolve, aiming for comprehensive end-to-end assessment grounded in authentic e-commerce complexities (Min et al., 9 Dec 2025).
EcomBench represents the state of the art in holistic domain-specific foundation agent evaluation, extending beyond synthetic logic exercises to cover the multi-modal, multi-entity, and multi-step reasoning demands of modern digital commerce. Its stratified, expert-verified, and quarterly updated framework is positioned as a crucial reference for agent development and deployment in production e-commerce ecosystems (Min et al., 9 Dec 2025).