E-Commerce Bargaining Benchmark

Updated 15 September 2025
  • E-Commerce Bargaining Benchmark is a comprehensive framework that evaluates online negotiation by integrating dynamic pricing, double-side aggregation, and logistical optimization.
  • It applies mixed-integer linear programming and incomplete information game models to measure AI-driven buyer and seller negotiation performance using real-world datasets.
  • Enhanced by multi-turn dialogue and intention-aware reasoning, the benchmark supports scalable, safe, and robust evaluation of intelligent bargaining agents in complex e-commerce scenarios.

An E-Commerce Bargaining Benchmark is a formal framework for evaluating the negotiation, aggregation, and reasoning abilities of algorithms and agents—particularly those using machine learning and artificial intelligence—within the context of online commerce. Such benchmarks range in scope from dynamic pricing models and aggregation frameworks to interaction-based agent evaluation, and combine structured negotiation settings, operational parameters, and rigorous evaluation protocols. They enable precise, reproducible comparisons of bargaining capabilities for both buyers and sellers, with particular attention to realistic market constraints, multiple parties, and task heterogeneity.

1. Aggregated Bargaining Models and Double-Side Aggregation

Early benchmark proposals focus on models that aggregate both buyers and sellers for optimal price formation and logistics. The "fair" model (Gallo et al., 2016) and its extension in the "e-fair" framework (Gallo et al., 2017) define cooperative e-commerce environments that move beyond auctions (buyer competition driving prices up) or group-buying (buyer aggregation for volume discounts):

  • Double-Side Aggregation: Both buyers and sellers are clustered; sellers provide price–quantity curves (typically broken-line or piecewise linear functions), buyers pool demand to leverage lower prices, and logistic parameters (distance, waiting time, payment time) are included.
  • System Implementation: Prototypes were built atop OpenCart and evaluated with ad hoc Matlab simulators, integrating dynamic price computation and spatial shipment optimization.
  • Analytical Model: Mixed-integer linear programming models—augmented with auxiliary variables and binary segment indicators—allow for continuous, scalable price optimization across supply-demand channels:

p_{(i)}(x_i) = \begin{cases} f_{(i)}^{(1)} \cdot x_i & 0 < x_i \leq x_i^{(1)} \\ f_{(i)}^{(2)} \cdot x_i + c_2 & x_i^{(1)} < x_i \leq x_i^{(2)} \\ \vdots & \\ f_{(i)}^{(L)} \cdot x_i + c_L & x_i^{(L-1)} < x_i \leq x_i^{(L)} \end{cases}

  • Benefits and Impact: Aggregation yields both purchase and shipment cost savings and alters traditional bargaining power structures, shifting them toward collaborative optimization.
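The broken-line price–quantity curve above can be evaluated with a short sketch (Python; the argument names are assumptions, since the papers specify only the mathematical form):

```python
def piecewise_price(x, slopes, breakpoints, intercepts):
    """Evaluate a broken-line price-quantity curve p_(i)(x_i).

    slopes[k] = f^(k+1), breakpoints[k] = x^(k+1) (upper bound of
    segment k), intercepts[k] = c_{k+1}, with c_1 = 0.  Names are
    illustrative; the source gives only the functional form.
    """
    for slope, upper, c in zip(slopes, breakpoints, intercepts):
        if x <= upper:
            return slope * x + c
    raise ValueError("quantity exceeds the last breakpoint")

# Volume discount: unit price drops from 10 to 8 beyond 100 units;
# continuity at x = 100 gives c_2 = (10 - 8) * 100 = 200.
price = piecewise_price(150.0, slopes=[10.0, 8.0],
                        breakpoints=[100.0, 1000.0],
                        intercepts=[0.0, 200.0])
```

The intercepts keep the curve continuous at the breakpoints, which is what lets the mixed-integer formulation encode each segment with a binary indicator.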

2. Bargaining as Incomplete Information Game and Agent Evaluation

Recent advancements conceptualize bargaining as an asymmetric, incomplete information game suitable for agent evaluation (Xia et al., 24 Feb 2024). Core features include:

  • Formal Bargaining Game: The buyer's budget (B) and the seller's cost (C) are private; the negotiation proceeds in discrete offer rounds, with both sides seeking profit from the deal price D:

P_b = B - D, \quad P_s = D - C

Normalized utility allows for cross-session comparison:

P_b' = (B - D) / |B - C|, \quad P_s' = (D - C) / |B - C|

  • Dataset Grounding: AmazonHistoryPrice dataset (930 products, 18 categories) provides real-world product and price diversity.
  • LLM Agent Benchmarking: Multiple LLM agents (e.g., GPT-4, Llama-2, Yi, Mistral, Qwen) are systematically tested in Buyer/Seller roles. Metrics include valid rate, deal rate, sum of (normalized) profits.
  • Methodological Enhancement: The OG-Narrator pipeline decouples deterministic price generation from natural language "narration," significantly boosting buyer success rates (e.g., deal rate up from 26.67% to 88.88% and 10× profit improvement).
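The profit definitions above translate directly into code (a minimal sketch; function and variable names are assumptions, not the benchmark's API):

```python
def bargaining_profits(budget, cost, deal_price):
    """Raw and normalized profits for a deal closed at price D.

    P_b = B - D and P_s = D - C; dividing by |B - C| makes sessions
    over different products comparable, since the two normalized
    profits always sum to 1 when B > C.
    """
    p_b = budget - deal_price
    p_s = deal_price - cost
    span = abs(budget - cost)
    return p_b, p_s, p_b / span, p_s / span

pb, ps, pbn, psn = bargaining_profits(budget=100.0, cost=60.0,
                                      deal_price=75.0)
```

Here a deal at 75 splits the 40-unit surplus as 25 to the buyer and 15 to the seller, i.e. normalized profits of 0.625 and 0.375.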

3. Bargaining Intention and Intent-Aware Reasoning

Understanding and modeling purchase intentions are critical for effective bargaining. The IntentionQA benchmark (Ding et al., 14 Jun 2024) advances evaluation along two axes:

  • Double-Task MCQA Protocol: Task 1—intent inference from product pairs; Task 2—intent utilization for predicting next purchases.
  • Automated Construction: Product names are conceptually mapped and embedded via ASER graph context with distractor and difficulty sampling based on cosine similarity:

\mathrm{Sim}^{(p)}(p_1, p_2) = \cos_{\mathrm{sim}}(\mathrm{CE}(p_1), \mathrm{CE}(p_2))

  • Human Evaluation: High correctness (96–97%) and low false-negative distractor rates validate benchmark quality.
  • Significance: Results reveal large model gaps in joint product–intention reasoning, highlighting the challenge of intent comprehension for state-of-the-art LLMs.
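The similarity-based distractor sampling can be sketched as follows (Python; the band thresholds and helper names are illustrative assumptions, not IntentionQA's actual parameters):

```python
import math

def cosine_similarity(u, v):
    """cos_sim of two embedding vectors, as in Sim^(p)(p1, p2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sample_distractors(target_emb, candidates, lo=0.3, hi=0.8):
    """Keep candidates similar enough to be plausible distractors but
    dissimilar enough to be wrong; thresholds are illustrative."""
    return [name for name, emb in candidates
            if lo <= cosine_similarity(target_emb, emb) <= hi]

candidates = [("near-duplicate", [1.0, 0.0]),
              ("plausible", [0.6, 0.8]),
              ("unrelated", [0.0, 1.0])]
distractors = sample_distractors([1.0, 0.0], candidates)
```

Banding the similarity this way is also how difficulty can be tuned: a tighter band nearer the target yields harder MCQA items.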

4. Price Competition, Filtering, and Equilibrium Analysis

Elastic price competition under realistic customer choice models—particularly those reflecting search/filter mechanisms—is central to e-commerce bargaining (Banerjee et al., 19 Aug 2024):

  • Flexible CLC Model: Customers first form a consideration set by filtering sellers based on non-price attributes and price willingness. Selection then follows lexicographic attribute ordering.
  • Equilibrium Characterization: Sellers face a pseudo-competitive game; local Nash equilibria may be efficiently computed using "valid order–price pairs" and the gradient dominance condition:

R_i(\mathbf{p}) = p_i \cdot D_i(\mathbf{p}), \qquad g_i(p_i, p_{-i}) = \frac{\partial R_i}{\partial p_i}

Gradient ascent dynamics converge to equilibrium under mild conditions.

  • Robustness: Numerical experiments with diverse customer types (loyal, quality-sensitive, price-sensitive) confirm the model predicts realistic and stable price trajectories, even when "gradient dominance" is relaxed.
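The gradient dynamics can be illustrated with a toy two-seller game (Python; the linear demand function is a stand-in assumption, as the paper's CLC choice model is considerably richer):

```python
def demand(i, p, a=10.0, b=2.0, c=1.0):
    """Illustrative linear demand with substitution between two
    sellers; not the paper's filter-based CLC model."""
    return max(0.0, a - b * p[i] + c * p[1 - i])

def revenue(i, p):
    # R_i(p) = p_i * D_i(p)
    return p[i] * demand(i, p)

def gradient_ascent(p0, lr=0.05, steps=2000, eps=1e-6):
    """Each seller ascends its own revenue gradient g_i = dR_i/dp_i
    (forward finite differences); a fixed point of these coupled
    dynamics is a local Nash equilibrium."""
    p = list(p0)
    for _ in range(steps):
        g = []
        for i in range(2):
            bumped = [p[j] + (eps if j == i else 0.0) for j in range(2)]
            g.append((revenue(i, bumped) - revenue(i, p)) / eps)
        p = [max(0.0, p[i] + lr * g[i]) for i in range(2)]
    return p

# For this symmetric demand the best responses intersect at a/(2b - c).
p_star = gradient_ascent([1.0, 1.0])
```

With a = 10, b = 2, c = 1 both prices converge to 10/3, matching the analytical equilibrium, which is the kind of stable trajectory the numerical experiments report.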

5. Dialogue-Based Bargaining Agents and Benchmarking Seller Strategy

Dialogue-centric benchmarking frameworks are increasingly important for evaluating agents in marketplaces with multi-turn negotiations:

  • FishBargain Agent (Kong et al., 22 Jan 2025): An LLM-based seller assistant for fleamarkets decomposes the bargaining process into price extraction, strategic action selection (with bi-directional adversary modeling), and natural language utterance generation. Performance metrics include success rate, average negotiation turns, and sale-to-list ratio.
  • Multi-Turn Bargain Evaluation (Wang et al., 8 Sep 2025): Large-scale benchmarks (3,014 tasks, 9,892 products, 622 categories) use turn-level Theory of Mind (ToM) annotation, requiring agents to track buyer intent, strategic moves, and tool invocation. Evaluation metrics incorporate intent precision/recall/F1, and failure rates:

\mathrm{IP} = \frac{\mathrm{CI}}{\mathrm{CI} + \mathrm{MMI} + \mathrm{II}}

This fine-grained approach captures agent robustness in dynamic, real-world bargaining.
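The intent metrics can be computed as follows (a sketch; IP follows the formula above, while the recall denominator with missed intents is an assumption based on the standard precision/recall pattern):

```python
def intent_metrics(correct, mismatched, invalid, missed):
    """Turn-level intent metrics for a bargaining agent.

    IP = CI / (CI + MMI + II), with CI = correct intents,
    MMI = mismatched intents, II = invalid intents.  Recall counts
    missed intents in the denominator (assumed); F1 is the usual
    harmonic mean of the two.
    """
    ip = correct / (correct + mismatched + invalid)
    ir = correct / (correct + missed)
    f1 = 2 * ip * ir / (ip + ir)
    return ip, ir, f1

ip, ir, f1 = intent_metrics(correct=80, mismatched=10,
                            invalid=10, missed=20)
```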

6. Heterogeneity, Safety, and Comprehensive Benchmark Coverage

Comprehensive benchmarks such as ChineseEcomQA (Chen et al., 27 Feb 2025), DeepShop (Lyu et al., 3 Jun 2025), ShoppingBench (Wang et al., 6 Aug 2025), WebMall (Peeters et al., 18 Aug 2025), Amazon-Bench (Zhang et al., 18 Aug 2025), and ECom-Bench (Wang et al., 8 Jul 2025) expand evaluation scope:

  • Task and Domain Heterogeneity: Coverage includes fundamental concepts, complex negotiation/bargaining tasks (coupons, vouchers, multi-product aggregation), customer support, multimodal inputs, and safety-critical operations (address management, auto-reload).
  • Benchmark Construction: Scalable automated pipelines, retrieval-augmented validation, and manual annotation balance generality with domain specificity. Metrics such as absolute success rate (ASR), pass@k, precision, recall, and safety classification provide granular diagnosis.
  • Safety Considerations: Automated "LLM-as-Judge" frameworks classify outcomes as success, benign failure, or harmful failure, with formulas such as:

\text{pages}_i = \max\left(m, \frac{\ln(1 + D_i)}{\sum_j \ln(1 + D_j)} \cdot N\right)
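The allocation rule above can be sketched as follows (Python; the source does not define D_i, m, or N, so treating it as a log-proportional per-item page budget with a floor is an assumption):

```python
import math

def page_budget(degrees, total_pages, floor):
    """Allocate total_pages across items proportionally to ln(1 + D_i),
    with a per-item floor m.  Symbol meanings (D_i as a count, N as the
    total budget) are assumed, not specified in the source."""
    weights = [math.log(1 + d) for d in degrees]
    total_weight = sum(weights)
    return [max(floor, w / total_weight * total_pages) for w in weights]

pages = page_budget(degrees=[1, 9, 99], total_pages=100, floor=5)
```

The logarithm dampens the influence of very popular items, while the floor guarantees every item a minimum share of evaluation coverage.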

7. Practical Implications and Future Directions

These benchmarking frameworks inform future directions in algorithm and agent design:

  • Scalability: System prototypes demonstrate integration with existing platforms and applicability to broad product/service domains.
  • Intention-Aware Bargaining: Enhanced comprehension and utilization of purchase intentions improve recommendation, search, and negotiation outcomes.
  • Safety and Robustness: Focus on avoiding harmful actions and controlling negotiation consistency.
  • Multi-Modal and Multi-Turn Reasoning: Advancements in agent architectures and evaluation support more intelligent, context-aware bargaining behavior in complex, high-stakes e-commerce environments.

A plausible implication is that composite benchmarks—integrating aggregation, incomplete information games, intention modeling, and multi-turn agent evaluation—will be central to the credible advancement of intelligent bargaining systems in future online marketplaces.
