
EcomBench: E-Commerce Agent Evaluation

Updated 12 December 2025
  • EcomBench is a comprehensive benchmark that evaluates foundation agents in real-world e-commerce environments with authentic user demands and multi-step reasoning.
  • It employs expert curation, fine-grained categorization, and dynamic difficulty stratification to rigorously assess performance across seven key business functions.
  • Evaluation results highlight significant challenges in advanced reasoning and domain integration, emphasizing the need for improved agent adaptation in complex e-commerce ecosystems.

EcomBench is a holistic, large-scale benchmark focused on rigorous evaluation of foundation agents within authentic e-commerce ecosystems. Distinct from prior academic agent benchmarks, which emphasize synthetic tasks or narrow scenarios, EcomBench grounds agent assessment in realistic user demands sampled from leading international marketplaces. Methodologically, it combines expert curation, fine-grained categorization, and dynamic stratification of question difficulty to reveal key limitations in contemporary agent reasoning, tool use, and domain knowledge integration (Min et al., 9 Dec 2025).

1. Motivation and Benchmark Framework

E-commerce constitutes a high-complexity, high-volume domain characterized by heterogeneous user intents, dynamic market regulations, and direct ties to economic decision-making. Previous benchmarks predominantly addressed artificial puzzles or simplified business logic, lacking coverage of authentic multi-step procedural tasks, unstable inventory conditions, and non-trivial regulatory constraints. EcomBench was established to provide a domain-grounded evaluation platform, measuring not only basic retrieval or arithmetic but also advanced reasoning, cross-source information synthesis, and robust tool utilization. Its design pipeline includes:

  • Collection and anonymization of millions of raw “user demand” logs (e.g., search, support, dashboard queries) from global platforms such as Amazon, AliExpress, and Walmart.
  • LLM-based filtering and rewriting of logs into candidate Q&A pairs, discarding unsolvable or subjective entries.
  • Three-stage expert annotation: professional rewrite, ground-truth verification, and consensus labeling (discarding items with any disagreement); a minimal sketch of this flow follows the list.
  • Difficulty annotation via a tool hierarchy: tasks are classified by the minimal toolset required (atomic search/browse vs. domain-specific APIs) and the number of actions needed, yielding Level 1 (easy), Level 2 (medium), and Level 3 (hard).
  • Quarterly benchmark refresh: obsolete items are retired, and new questions reflecting market changes are added.
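
The following is a minimal sketch of the curation flow in Python; llm_rewrite and the experts callables are hypothetical stand-ins for the LLM filtering and three-stage expert annotation described above, not the authors' actual tooling:

```python
def curate(raw_logs, llm_rewrite, experts):
    """Sketch: LLM pre-filter/rewrite -> independent expert
    verification -> keep only unanimously labeled items.

    llm_rewrite(log): returns a question string, or None for
        unsolvable/subjective logs (assumed interface).
    experts: callables mapping a question to a verified answer.
    """
    dataset = []
    for log in raw_logs:
        question = llm_rewrite(log)      # filter + rewrite the raw demand log
        if question is None:
            continue                     # drop ambiguous / non-verifiable entries
        answers = [expert(question) for expert in experts]
        if len(set(answers)) == 1:       # unanimous consensus required
            dataset.append((question, answers[0]))
    return dataset
```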

2. Task Categories and Formal Definitions

EcomBench defines seven mutually exclusive fine-grained task categories, each reflecting a core business function:

  1. Policy Consulting: Given product specification $P$ and target market $M$, retrieve the applicable rule $R$ and compute the quantitative limit $L$.
  2. Cost and Pricing: Given costs $C=\{c_1,\dots,c_m\}$, duties $D$, taxes $\tau$, and exchange rates $E$, compute the total or optimal price $p^*$ under margin/budget constraints.
  3. Fulfillment Execution: Given an $O \to D$ routing and shipping options $S$, select $s \in S$ minimizing delay $d(s)$ within budget $B$.
  4. Marketing Strategy: For portfolio $P$ and segment $U$, design a promotion plan $\pi$ maximizing reach $R(\pi)$ subject to budget $b$.
  5. Intelligent Product Selection: Rank categories $k$ by forecasted growth $g_k$ using historical time series $X$ and trend signals $T$ across horizon $H$.
  6. Opportunity Discovery: Using metrics $M$, apply anomaly/clustering detection to identify emerging segments $S$ where $g(S) \geq \theta$.
  7. Inventory Control: Given demand $D(t)$, holding cost $h$, and penalty $p$, choose safety stock $s^*$ minimizing

$$\text{Loss}(s) = \int_0^T \left[ h \cdot (s - D(t))^+ + p \cdot (D(t) - s)^+ \right] \, dt$$

These formalizations are representative reconstructions based on the benchmark’s structure and described annotation procedures (Min et al., 9 Dec 2025).
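
To make the Inventory Control objective concrete, the sketch below discretizes the loss integral and grid-searches the safety stock. The synthetic demand series, cost values, and step size are illustrative assumptions, not benchmark data:

```python
import numpy as np

def inventory_loss(s, demand, h, p, dt=1.0):
    """Discretized Loss(s) = sum over t of [h*(s - D(t))^+ + p*(D(t) - s)^+] * dt."""
    over = np.maximum(s - demand, 0.0)    # excess stock -> holding cost
    under = np.maximum(demand - s, 0.0)   # shortfall -> penalty cost
    return float(np.sum(h * over + p * under) * dt)

def best_safety_stock(demand, h, p, dt=1.0):
    """Grid-search s* over the observed demand range."""
    grid = np.linspace(demand.min(), demand.max(), 200)
    losses = [inventory_loss(s, demand, h, p, dt) for s in grid]
    return float(grid[int(np.argmin(losses))])

# Illustrative demand: noisy seasonal series over T = 100 periods.
rng = np.random.default_rng(0)
t = np.arange(100)
demand = 50 + 10 * np.sin(2 * np.pi * t / 25) + rng.normal(0, 3, size=100)
s_star = best_safety_stock(demand, h=0.5, p=2.0)  # penalty > holding -> s* above mean demand
```

With the penalty exceeding the holding cost, the minimizer sits above mean demand, consistent with the classical newsvendor critical fractile $p/(p+h)$.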

3. Dataset Construction and Annotation Pipeline

The benchmark leverages anonymized logs representing genuine business and user requests. Raw entries are pre-filtered with LLMs to discard ambiguous or non-verifiable samples. A team of more than ten full-time expert annotators refines each question for clarity and complete specification, and validates each answer independently. Instances with any label disagreement are excluded, ensuring ground-truth reliability. “Large-scale” here refers to approximately 5,000 unique, verified instances, evenly distributed across task categories (within ±5%). Difficulty stratification yields Level 1 (20%), Level 2 (30%), and Level 3 (50%) items, i.e., roughly 1,000, 1,500, and 2,500 questions, respectively; annotation demands about eight human-hours per question (Min et al., 9 Dec 2025).

4. Difficulty Levels and Assignment Rubric

Difficulty is operationalized via atomic vs. domain-specific tool usage and the minimal action count. Let $Q$ be a benchmark question, $T_\text{atomic}$ the set of atomic web-based tools (search/browse), and $S(Q, T)$ the minimal number of actions required to solve $Q$ with toolset $T$. The levels are:

  • Level 1: $S(Q, T_\text{atomic}) \leq 2$ (single-step/superficial lookup).
  • Level 2: $3 \leq S(Q, T_\text{atomic}) \leq 4$ (multi-step, moderate reasoning).
  • Level 3: $S(Q, T_\text{atomic}) > 4$, or $Q$ is unsolvable via $T_\text{atomic}$ and requires $T_\text{ecom\_specific}$ (e.g., price_lookup, trend_analyzer).

This stratification enforces progressive complexity, mapping directly to critical agent capabilities: deep retrieval, procedural reasoning, and integration of heterogeneous business APIs.
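
Under this rubric, level assignment reduces to a simple mapping once $S(Q, T_\text{atomic})$ has been estimated; a minimal sketch, assuming the inputs are precomputed by the annotation pipeline:

```python
def difficulty_level(s_atomic, solvable_with_atomic=True):
    """Map the rubric to Level 1/2/3.

    s_atomic: S(Q, T_atomic), the minimal number of atomic-tool
        actions needed to solve Q (ignored when the question is
        unsolvable with atomic tools alone).
    """
    if not solvable_with_atomic or s_atomic > 4:
        return 3  # requires T_ecom_specific or a long atomic chain
    if s_atomic >= 3:
        return 2  # multi-step, moderate reasoning
    return 1      # single-step / superficial lookup

assert difficulty_level(2) == 1
assert difficulty_level(4) == 2
assert difficulty_level(1, solvable_with_atomic=False) == 3
```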

5. Evaluation Metrics

EcomBench adopts a binary-correctness protocol. All model outputs are normalized and compared to ground-truth answers, with a judge LLM assigning a 0/1 verdict. Metrics are:

  • Overall accuracy: $\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$
  • Per-level accuracy: $\text{Accuracy}_L = \frac{1}{N_L} \sum_{i \in \text{level } L} \mathbb{1}[\hat{y}_i = y_i]$
  • Per-category accuracy: $\text{Accuracy}_\text{cat} = \frac{1}{N_\text{cat}} \sum_{i \in \text{cat}} \mathbb{1}[\hat{y}_i = y_i]$

Consistency is verified via random manual audits, which confirm >95% agreement between human raters and the automated LLM judgments.
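
A compact sketch of this scoring, assuming the judge LLM’s 0/1 verdicts have already been materialized as per-question records (field names are illustrative):

```python
from collections import defaultdict

def accuracy_report(records):
    """Overall, per-level, and per-category accuracy.

    records: list of dicts such as
        {"level": 3, "category": "Inventory Control", "correct": 1},
    where "correct" is the judge's binary verdict 1[y_hat == y].
    """
    overall = sum(r["correct"] for r in records) / len(records)

    def grouped(key):
        total, hit = defaultdict(int), defaultdict(int)
        for r in records:
            total[r[key]] += 1
            hit[r[key]] += r["correct"]
        return {k: hit[k] / total[k] for k in total}

    return {"overall": overall,
            "per_level": grouped("level"),
            "per_category": grouped("category")}
```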

6. Experimental Results and Analysis

Tests include 12 leading agent models (ChatGPT-5.1, Gemini DeepResearch, SuperGrok Expert, Doubao DeepResearch, etc.), each run on provider-default infrastructure (up to 8×A100 GPUs, no further hyperparameter tuning). Results by difficulty:

Level     Top-10 Model Accuracy
Level 1   80–95%
Level 2   ~60–75%
Level 3   46% (best), <35% (typical)

Level 3 remains challenging, indicating persistent gaps in cross-source integration and long-chain reasoning. Category-wise, no single agent leads in all business areas:

  • Policy Consulting / Fulfillment: ChatGPT-5.1, SuperGrok
  • Cost / Inventory: SuperGrok Expert, ChatGPT-5.1
  • Strategy/Opportunity/Selection/Marketing: Gemini DeepResearch, ChatGPT-5.1

Removal of domain-specific tool APIs (e.g., price_lookup) drops Level 3 accuracy by 15–20%, while disabling chain-of-thought prompting causes a ~10% reduction in Level 2 accuracy.

7. Limitations, Controversies, and Future Directions

EcomBench currently covers question–answer tasks only, omitting the end-to-end action sequences typical of live transaction environments (e.g., full order placement). Predictive, multimodal, and interactive workflows are reserved for future quarterly releases, alongside planned expansion to a dashboard simulation module, A/B test design, and tabular/image data support. Annotation cost remains high (~8 hours per question), constraining rapid scaling. Community-driven challenge submission is scheduled for future versions.

A plausible implication is that while accuracy on superficially simple questions approaches saturation for top models, agents remain bottlenecked by real-world tool connectivity and domain adaptation. The benchmark continues to evolve, aiming for comprehensive end-to-end assessment grounded in authentic e-commerce complexities (Min et al., 9 Dec 2025).


EcomBench represents the state of the art in holistic domain-specific foundation agent evaluation, extending beyond synthetic logic exercises to cover the multi-modal, multi-entity, and multi-step reasoning demands of modern digital commerce. Its stratified, expert-verified, and quarterly updated framework is positioned as a crucial reference for agent development and deployment in production e-commerce ecosystems (Min et al., 9 Dec 2025).
