Synthetic Buyer Agents in E-Commerce

Updated 2 March 2026

E-Commerce Synthetic Buyer Agents are autonomous AI systems that simulate human shopping behavior using LLMs, VLMs, and tool-use pipelines.
They span diverse architectures, from single-shot query reformulation to multi-agent cooperative simulations, and support product discovery, A/B testing, and safety evaluation.
These agents optimize digital retail experiences by integrating evolutionary algorithms, cohort-weighted simulations, and detailed behavioral analysis to assess market dynamics.

E-commerce synthetic buyer agents are autonomous, AI-driven software systems designed to emulate the decision-making and behavioral patterns of human consumers within digital retail environments. These agents are typically powered by large language models (LLMs), vision-language models (VLMs), and supporting tool-use pipelines, and they handle tasks ranging from natural-language query rewriting and product discovery to complex purchasing behavior, comparative shopping, multi-modal dialogue, and offline experimentation. Their development and deployment aim to (1) automate high-fidelity user simulation for research and optimization, (2) benchmark and improve agentic shopping capabilities, (3) enable traffic-grounded A/B testing, and (4) provide platforms for systematic evaluation of agent impacts on market dynamics, safety, and fairness.

1. Synthetic Agent Architectures and Pipeline Design

E-commerce synthetic buyer agents span a spectrum of architectures, from single-shot LLMs performing query reformulation to multi-agent systems orchestrating complex shopping missions.

Agent Ensemble Modeling

OptAgent employs a multi-agent simulation framework, instantiating $K$ LLM-based agents with varying temperatures $T_i \in \{0.00, 0.25, 0.50, 0.75, 1.00\}$ to induce diverse reasoning and purchase patterns, explicitly avoiding persona prompting to minimize bias. Agents interact with real e-commerce search interfaces, perform product-level scraping, assign semantic relevance scores $s_{i,j} \in \{-1, 0, +1\}$ to retrieved items, and simulate purchase decisions (Handa et al., 4 Oct 2025).
ProductResearch formalizes cooperation among three agents: a User Agent infers persona profiles and complex research queries, a Research Agent executes a ReAct-inspired Plan→Toolcall→Report loop that leverages web and catalog tools, and a Supervisor Agent enforces rubric-aligned, multi-turn supervision to correct erroneous trajectories. Approved, length-filtered synthetic trajectories are distilled and used as fine-tuning data for downstream shopping agents (Wang et al., 27 Feb 2026).
SimGym and PAARS construct population-level agentic simulations by mining behavioral archetypes and persona embeddings from large-scale, real clickstream or session data, clustering user sessions, extracting continuous profile vectors (e.g., price sensitivity, exploration depth), and mapping these to structured prompts for browser-attached agents (Castelo et al., 1 Feb 2026, Mansour et al., 31 Mar 2025).
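The temperature-varied ensemble described for OptAgent can be illustrated with a minimal sketch. It assumes a hypothetical scoring callable standing in for the LLM call; `ensemble_relevance` and `toy_scorer` are illustrative names, not part of OptAgent, and only the temperature values and the score set {-1, 0, +1} come from the text above.

```python
from statistics import mean
from typing import Callable, Sequence

TEMPERATURES = (0.00, 0.25, 0.50, 0.75, 1.00)  # values quoted in the text

# Hypothetical signature: (temperature, query, product_title) -> score in {-1, 0, +1}
ScoreFn = Callable[[float, str, str], int]


def ensemble_relevance(query: str, products: Sequence[str], score_fn: ScoreFn) -> list:
    """Average each product's relevance score over all temperature-varied agents."""
    averaged = []
    for product in products:
        scores = [score_fn(t, query, product) for t in TEMPERATURES]
        if any(s not in (-1, 0, 1) for s in scores):
            raise ValueError("scores must be -1, 0, or +1")
        averaged.append(mean(scores))
    return averaged


def toy_scorer(temperature: float, query: str, product: str) -> int:
    """Stand-in for an LLM judgment: keyword overlap, less confident at high temperature."""
    overlap = len(set(query.lower().split()) & set(product.lower().split()))
    if overlap == 0:
        return -1
    return 1 if overlap >= 2 or temperature < 0.75 else 0


scores = ensemble_relevance(
    "red running shoes",
    ["red running shoes size 10", "blue shoes", "garden hose"],
    toy_scorer,
)
```

In a real pipeline the scorer would wrap a model call at the given temperature; averaging across agents is one simple way to turn the discrete votes into a graded relevance signal.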

Action and Observation Model

Agents operate over partial observations such as HTML DOM accessibility trees, viewport screenshots, or combined modalities. Action spaces encompass natural-language queries, clicks, tool-use API invocations (e.g., Search(), View(), Cart()), and direct browser controls (scrolling, form entry, navigation). Core pipelines include memory support (short-term or episodic), chain-of-thought prompting for explicit reasoning, and dynamic planning modules to adapt strategies under uncertainty (Lyu et al., 3 Jun 2025, Peeters et al., 18 Aug 2025).
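As a rough illustration of this observation and action interface, the sketch below models a toy shop that accepts search, view, and cart actions and returns a text rendering loosely resembling an accessibility tree. All class names, field names, and the tree format are assumptions made for the example and do not reproduce any benchmark's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    ax_tree: Optional[str] = None           # serialized accessibility tree (assumed format)
    screenshot_png: Optional[bytes] = None  # viewport screenshot, unused in this toy


@dataclass
class Action:
    name: str       # "search", "view", or "cart"
    argument: str   # query text or product name


@dataclass
class ToyShop:
    catalog: dict                           # product name -> price
    cart: list = field(default_factory=list)

    def step(self, action: Action) -> Observation:
        """Apply one action and return the resulting partial observation."""
        if action.name == "search":
            hits = [p for p in self.catalog if action.argument.lower() in p.lower()]
            return Observation(ax_tree="\n".join(f"link '{p}'" for p in hits))
        if action.name == "view":
            price = self.catalog[action.argument]
            return Observation(ax_tree=f"heading '{action.argument}'\ntext '${price:.2f}'")
        if action.name == "cart":
            if action.argument not in self.catalog:
                raise KeyError(action.argument)
            self.cart.append(action.argument)
            return Observation(ax_tree=f"text 'cart: {len(self.cart)} item(s)'")
        raise ValueError(f"unknown action: {action.name}")


shop = ToyShop({"USB-C cable": 9.99, "HDMI cable": 12.50})
search_obs = shop.step(Action("search", "cable"))
view_obs = shop.step(Action("view", "USB-C cable"))
cart_obs = shop.step(Action("cart", "USB-C cable"))
```

An agent loop would read each `Observation`, decide on the next `Action`, and repeat until the task ends; memory and planning modules sit on top of this interface.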

2. Persona and Behavioral Modeling

High-fidelity agent simulation necessitates modeling of shopper diversity at both the individual and cohort levels.

Persona Induction and Embedding

PAARS mines consumer profiles from anonymized session logs, assembling age, income, interests, and shopping preferences into structured JSON personas. These personas drive session-level decision making, with agents equipped with prompt-injected identity, behavioral reasoning, and value signals (Mansour et al., 31 Mar 2025).
SimGym further refines this approach by extracting multi-dimensional vectors spanning behavioral and value axes, clustering sessions, and generating intent statements reflecting high-variance behavioral archetypes (Castelo et al., 1 Feb 2026).
In ProductResearch, the User Agent extrapolates multi-dimensional persona and query-evaluation rubrics directly from behavioral histories, conditioning subsequent research agent behavior to reflect user-specific objectives (Wang et al., 27 Feb 2026).
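A structured persona of the kind described above might be injected into an agent prompt roughly as follows. The field names, the validation rule, and the prompt wording are assumptions for illustration; the actual PAARS schema and prompts are not given in this article.

```python
import json

# Hypothetical required fields, loosely following the attributes named in the text.
REQUIRED_FIELDS = ("age", "income", "interests", "shopping_preferences")


def persona_to_prompt(persona: dict, intent: str) -> str:
    """Render a persona dictionary and a session intent as a system-style prompt."""
    missing = [f for f in REQUIRED_FIELDS if f not in persona]
    if missing:
        raise ValueError(f"persona is missing fields: {missing}")
    return (
        "You are simulating a shopper with the following profile.\n"
        f"{json.dumps(persona, indent=2, sort_keys=True)}\n"
        f"Shopping intent: {intent}\n"
        "Stay consistent with this profile when searching, comparing, and buying."
    )


example_persona = {
    "age": "25-34",
    "income": "middle",
    "interests": ["trail running", "camping"],
    "shopping_preferences": {"price_sensitivity": "high", "brand_loyalty": "low"},
}
prompt = persona_to_prompt(example_persona, "find waterproof trail shoes under $100")
```

Serializing the persona as JSON keeps the prompt reproducible across sessions and makes it easy to log which profile produced which trajectory.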

Alignment Evaluation

Both group- and individual-level alignment are quantified: PAARS, for instance, computes KL divergence between empirical distributions of agent and human outcomes (queries, item selections, session metrics), and average cosine similarity for embedding-based queries. Inclusion of persona signals improves both alignment metrics and behavioral diversity, though gaps persist due to limitations in visual and contextual realism (Mansour et al., 31 Mar 2025).
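The two alignment measures can be computed as in the following sketch. The direction of the divergence (human relative to agent) and the additive smoothing constant are assumptions; the article does not state which conventions PAARS uses.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def kl_divergence(human: Sequence[Hashable], agent: Sequence[Hashable],
                  eps: float = 1e-9) -> float:
    """KL(human || agent) over categorical outcomes, with additive smoothing."""
    support = set(human) | set(agent)
    h_counts, a_counts = Counter(human), Counter(agent)
    h_total = len(human) + eps * len(support)
    a_total = len(agent) + eps * len(support)
    total = 0.0
    for outcome in support:
        p = (h_counts[outcome] + eps) / h_total
        q = (a_counts[outcome] + eps) / a_total
        total += p * math.log(p / q)
    return total


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


human_choices = ["shoes", "shoes", "socks", "hat"]
agent_choices = ["shoes", "socks", "socks", "socks"]
divergence = kl_divergence(human_choices, agent_choices)
```

A divergence of zero means the agent and human outcome distributions coincide; larger values indicate a wider group-level gap.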

3. Core Task Domains and Benchmarking

Synthetic buyer agents are evaluated across a range of task granularities and functional benchmarks designed to replicate realistic and high-stakes e-commerce challenges.

Task Taxonomy and Coverage

Amazon-Bench formalizes a comprehensive taxonomy: account management, product and deal search, interaction with and within brand stores, review filtering, media browsing, and safety-critical actions (e.g., address updates, payment configuration) (Zhang et al., 18 Aug 2025).
WebMall and DeepShop introduce multi-shop comparison, price minimization, vague requirement fulfillment, attribute-based and compatibility searches, and end-to-end cart/checkout flows in fully simulated, diverse e-commerce ecosystems (Peeters et al., 18 Aug 2025, Lyu et al., 3 Jun 2025).
ShoppingComp advances evaluation to expert-aligned product retrieval, evidence-based report generation, and explicit safety-critical decision-making, employing fine-grained rubrics and LLM-as-Judge validation (Tou et al., 28 Nov 2025).
Mix-Ecom extends agentic evaluation into multi-type dialogue (QA, recommendation, chit-chat, task-oriented) and complex business-rule adherence, with fine-grained rule enforcement via dynamic modules (Zhou et al., 28 Sep 2025).

Automated Evaluation Protocols

Fine-grained (attribute, filter, sorting sub-goals), holistic (overall task success), and safety metrics (harmful/benign failures) are standardized across benchmarks. Adjudication by both human and automated (LLM-based) judges ensures high-throughput, verifiable scoring. Key metrics include AnswerMatch-F1, scenario coverage, selection accuracy, rationale validity, and composite scores reflecting both effectiveness and safety (Zhang et al., 18 Aug 2025, Tou et al., 28 Nov 2025).
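For retrieval-style tasks, an F1 score over returned products can be computed with a generic set-based definition, sketched below. This is not claimed to be the exact AnswerMatch-F1 procedure of any benchmark, which may use fuzzy matching or LLM adjudication; the function name and product identifiers are illustrative.

```python
from typing import Iterable, Tuple


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted vs. gold product identifiers."""
    predicted_set, gold_set = set(predicted), set(gold)
    if not predicted_set or not gold_set:
        return 0.0, 0.0, 0.0
    true_positives = len(predicted_set & gold_set)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = set_f1(["sku-a", "sku-b", "sku-c"],
                               ["sku-b", "sku-c", "sku-d", "sku-e"])
```

Here two of three predictions are correct and two of four gold items are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.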

Benchmark	Completion Rate or F1 (Basic/Adv)	Safety (harmful-failure or pass rate)	Core Modalities
WebMall	75% / 53% (F1: 87% / 63%)	n/a	AX-Tree, Screenshot
Amazon-Bench	60% (max, GPT-4.1)	4–9%	AX-Tree, CoT
ShoppingComp	11% (GPT-5, F1)	38–65% pass (SOTA)	Tool-use, Report Gen.
DeepShop	≤32% overall (hard)	n/a	Multimodal, Browser

4. Optimization, Evaluation, and A/B Testing Methodologies

Synthetic agents are foundational to both optimization of customer-facing features and rapid, offline experimentation.

Evolutionary and Genetic Algorithms

OptAgent demonstrates a closed-loop pipeline where a genetic algorithm, guided by a multi-agent-derived reward function $F(q)=w_{10}s_{10}+w_{a}s_{a}+w_{p}n$, iteratively evolves natural-language search queries for maximum relevance and purchase value against a live e-commerce retrieval system. Fitness is directly tied to agent-averaged product relevance and simulated purchase outcomes (Handa et al., 4 Oct 2025).
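A toy version of such a loop is sketched below. The article does not define the symbols in $F(q)$, so the code assumes $s_{10}$ is a top-10 relevance score, $s_{a}$ an overall relevance score, and $n$ a simulated purchase count; the weights, the keyword-based evaluator, and the word-level crossover and mutation operators are all hypothetical stand-ins for the agent ensemble and any LLM-driven rewriting.

```python
import random
from typing import Callable, Sequence


def fitness(s10: float, s_a: float, n: float,
            w10: float = 0.5, w_a: float = 0.3, w_p: float = 0.2) -> float:
    """F(q) = w10*s10 + w_a*s_a + w_p*n, with hypothetical default weights."""
    return w10 * s10 + w_a * s_a + w_p * n


TARGET = {"waterproof", "hiking", "boots"}  # toy notion of a good query
VOCAB = ["waterproof", "hiking", "boots", "cheap", "shoes", "leather"]


def toy_evaluate(query: str) -> float:
    """Stand-in for the agent ensemble: derive (s10, s_a, n) from keyword hits."""
    words = set(query.split())
    hits = len(words & TARGET)
    s10 = hits / len(TARGET)
    s_a = hits / max(len(words), 1)
    n = 1 if hits == len(TARGET) else 0
    return fitness(s10, s_a, n)


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover on word lists."""
    wa, wb = a.split(), b.split()
    cut = rng.randint(0, len(wa))
    return " ".join(wa[:cut] + wb[cut:]) or a


def mutate(query: str, rng: random.Random) -> str:
    """Replace one randomly chosen word with a vocabulary word."""
    words = query.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def evolve(seeds: Sequence[str], evaluate: Callable[[str], float],
           generations: int = 10, population_size: int = 8,
           elite: int = 2, seed: int = 0) -> str:
    """Elitist genetic loop: keep the best queries, refill with mutated offspring."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        parents = sorted(population, key=evaluate, reverse=True)[:elite]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - elite)
        ]
        population = parents + children
    return max(population, key=evaluate)


seeds = ["cheap shoes", "leather boots", "hiking shoes"]
best = evolve(seeds, toy_evaluate)
```

Because the elite queries survive each generation unchanged, the best fitness in the population never decreases, which is the property that makes the loop safe to run against a noisy evaluator.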

Multi-Agent Synthetic Trajectory Distillation

ProductResearch generates long-horizon tool-use logs through a supervised multi-agent protocol ($\mathcal{S}$, $\mathcal{R}$, $F_{i,j}$), filters them for trajectory length and consistency, and performs reflective internalization to inject correction signals. The distilled dataset is used to fine-tune large MoE models, raising RACE scores from 31.78 (base) to 45.40 (ProductResearch, 128k context) (Wang et al., 27 Feb 2026).
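The filtering step can be pictured with a small sketch. The record layout, the step labels, the length bounds, and the rule that a consistent trajectory must end in a report are all assumptions made for illustration; the article does not give ProductResearch's actual criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    steps: List[str]   # hypothetical step labels: "plan", "toolcall", "report"
    approved: bool     # whether the supervisor accepted the trajectory


def filter_trajectories(trajectories: List[Trajectory],
                        min_steps: int = 3, max_steps: int = 40) -> List[Trajectory]:
    """Keep approved trajectories within the length bounds that end in a report."""
    kept = []
    for traj in trajectories:
        if not traj.approved:
            continue
        if not (min_steps <= len(traj.steps) <= max_steps):
            continue
        if traj.steps[-1] != "report":
            continue
        kept.append(traj)
    return kept


candidates = [
    Trajectory(["plan", "toolcall", "report"], approved=True),
    Trajectory(["plan", "report"], approved=True),              # too short
    Trajectory(["plan", "toolcall", "report"], approved=False), # rejected by supervisor
    Trajectory(["plan", "toolcall", "toolcall"], approved=True) # never reports
]
training_set = filter_trajectories(candidates)
```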

Traffic-Grounded, Cohort-Weighted Simulation

SimGym constructs shop-specific agent populations, cohort-weighted to match true user traffic. Offline A/B tests against real-world UI variants yield alignment rates of 69% and Pearson $r=0.64$ with actual human outcome shifts, sharply reducing experimentation cycles to under one hour (Castelo et al., 1 Feb 2026).
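The quantities involved can be sketched as follows: a traffic-weighted conversion rate per variant, a directional alignment rate across experiments, and a Pearson correlation between simulated and human outcome shifts. The function names, cohort labels, and numbers are illustrative, and the alignment definition used here (sign agreement of the shift) is an assumption rather than SimGym's documented metric.

```python
import math
from typing import Dict, Sequence


def cohort_weighted_rate(rates: Dict[str, float], traffic_share: Dict[str, float]) -> float:
    """Average per-cohort rates, weighted by each cohort's share of real traffic."""
    total = sum(traffic_share.values())
    return sum(rates[c] * traffic_share[c] for c in traffic_share) / total


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def directional_alignment(sim_deltas: Sequence[float], human_deltas: Sequence[float]) -> float:
    """Fraction of experiments where simulated and human shifts have the same sign."""
    pairs = list(zip(sim_deltas, human_deltas))
    return sum(1 for s, h in pairs if sign(s) == sign(h)) / len(pairs)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient, computed from first principles."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


variant_rate = cohort_weighted_rate(
    {"bargain_hunter": 0.02, "loyal_repeat": 0.06},
    {"bargain_hunter": 0.75, "loyal_repeat": 0.25},
)
alignment = directional_alignment([0.1, -0.2, 0.3], [0.2, 0.1, 0.4])
```

Weighting by traffic share matters because a UI change that helps a small cohort can still hurt the shop overall if the dominant cohort reacts the other way.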

Population-Level A/B Testing and Market Simulation

PAARS and ACES simulate agentic A/B environments to evaluate changes in ranking, recommendation, or UI, quantifying both direction of sales lift and model sensitivity to perturbations in listing order, tags, and descriptions. Synthetic agent results show correct directional agreement but overestimate magnitude, attributed to high simulated purchase “intention” (Mansour et al., 31 Mar 2025, Allouah et al., 4 Aug 2025).

5. Behavioral Analysis, Safety, and Market Impact

Synthetic buyer agents exhibit systematic, model-dependent behaviors that have direct implications for both marketplace dynamics and safety.

Rationality, Biases, and Failure Modes

In ACES, VLM agents favor top-row positions (e.g., $+1.224$ utils for Claude 4), penalize sponsored listings, and show heterogeneous position/attribute sensitivity. Horizontal "heat maps" vary by model, undermining assumptions of universal user behavior (Allouah et al., 4 Aug 2025).
Agent price sensitivity ($\beta_{\text{price}}\approx-1.6$ at one end of the reported range), rating weight, and response to tags are quantifiable: sponsored tags lower selection, while “Overall Pick” endorsements raise it significantly. Model upgrades can induce demand shocks, shifting modal product choice by 10–25% (Allouah et al., 4 Aug 2025).
Safety failures manifest as both benign (no state change) and harmful (unintended account mutation, duplicate purchases) outcomes. In Amazon-Bench, harmful failure rates range from 4–9% across LLMs; chain-of-thought and rule-based safety monitors are recommended (Zhang et al., 18 Aug 2025).
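Coefficients expressed in utils are commonly read through a discrete-choice (logit-style) model, in which each listing gets a linear utility and choice probabilities follow a softmax. The sketch below takes that reading as an assumption; it is not confirmed here as the ACES estimation procedure. The top-row value 1.224 and the price value -1.6 are quoted from the text, while applying the price coefficient to log price and the sponsored coefficient of -0.5 are hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Listing:
    name: str
    price: float
    top_row: bool
    sponsored: bool


COEF = {
    "top_row": 1.224,    # quoted in the text (Claude 4 example)
    "log_price": -1.6,   # quoted magnitude; log-price form is an assumption
    "sponsored": -0.5,   # hypothetical value, sign follows the text
}


def utility(listing: Listing) -> float:
    """Linear utility in utils for one listing."""
    return (COEF["top_row"] * listing.top_row
            + COEF["log_price"] * math.log(listing.price)
            + COEF["sponsored"] * listing.sponsored)


def choice_probabilities(listings: List[Listing]) -> Dict[str, float]:
    """Softmax over utilities, shifted by the maximum for numerical stability."""
    utils = [utility(l) for l in listings]
    shift = max(utils)
    weights = [math.exp(u - shift) for u in utils]
    total = sum(weights)
    return {l.name: w / total for l, w in zip(listings, weights)}


shelf = [
    Listing("A", price=20.0, top_row=True, sponsored=False),
    Listing("B", price=20.0, top_row=False, sponsored=False),
    Listing("C", price=20.0, top_row=False, sponsored=True),
]
probs = choice_probabilities(shelf)
```

Under this reading, moving an otherwise identical listing to the top row multiplies its odds of selection by exp(1.224), which is how a single position coefficient translates into concentrated demand.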

Market-Level Risks and Regulation

Simulations anticipate concentration of demand (e.g., a single product capturing 45% of modal selections), model-specific sensitivity, and the emergence of seller-side optimization. These conditions raise regulatory questions about disclosure, market fairness, and the need for standardization (e.g., the Model Context Protocol) (Allouah et al., 4 Aug 2025).
ShoppingComp highlights risks in product safety (failure to detect hazards), promotional misinformation, and underperformance on retrieval for multi-constraint queries (best F1: 11.2% for GPT-5) (Tou et al., 28 Nov 2025).

6. Limitations, Open Challenges, and Future Research

Model- and Task-Level Gaps

Agent populations remain less diverse than actual shoppers due to static or text-only persona modeling and insufficient incorporation of visual, temporal, or cultural context (Mansour et al., 31 Mar 2025, Castelo et al., 1 Feb 2026).
Key bottlenecks include complex retrieval (multi-constraint open-web queries), robust rule adherence (Mix-Ecom: 63% of errors due to domain-rule violations), and hallucination in both product validation and rationale generation (Zhou et al., 28 Sep 2025, Tou et al., 28 Nov 2025).
Vision input and UI interaction limitations hinder assessment of purely visual theme tweaks and dynamic web widgets (SimGym, Amazon-Bench).

Research Directions and Recommendations

Integrate multimodal fusion (HTML + vision), adaptive query expansion, episodic memory, and end-to-end persona learning for agent pipelines (Lyu et al., 3 Jun 2025, Peeters et al., 18 Aug 2025).
Design safety-first decoding, co-training of in-the-loop verifiers (“LLM-Judge” models), and continuous dataset augmentation through adversarial and failure case-driven sampling (Tou et al., 28 Nov 2025).
Expand cohort and session realism to include multilingual, multi-regional personas, and dynamic fine-tuning against real shopping transcripts to further close the alignment gap.
Incorporate modular evaluation hooks, robust state-tracking, and explicit reasoning over widget/state for improved safety and transparency (Zhang et al., 18 Aug 2025).
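One way to picture a rule-based safety hook with state tracking is a monitor that sits between the agent and the site, blocks state-changing actions the task did not authorize, and rejects duplicate purchases. The action names, the authorization rule, and the log format below are hypothetical; this is a sketch of the idea, not a monitor described in any of the cited benchmarks.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical names for actions that mutate account or order state.
STATE_CHANGING = {"update_address", "set_payment", "purchase"}


@dataclass
class SafetyMonitor:
    authorized: Set[str]                                   # state-changing actions the task allows
    purchased: Set[str] = field(default_factory=set)       # tracked order state
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def check(self, action: str, target: str) -> bool:
        """Return True if the action may proceed; record every decision."""
        if action in STATE_CHANGING and action not in self.authorized:
            reason = "blocked: not authorized by task"
        elif action == "purchase" and target in self.purchased:
            reason = "blocked: duplicate purchase"
        else:
            reason = "allowed"
            if action == "purchase":
                self.purchased.add(target)
        self.log.append((action, target, reason))
        return reason == "allowed"


monitor = SafetyMonitor(authorized={"purchase"})
decisions = [
    monitor.check("search", "usb cable"),
    monitor.check("purchase", "sku-1"),
    monitor.check("purchase", "sku-1"),       # duplicate
    monitor.check("update_address", "home"),  # not authorized
]
```

The decision log doubles as an evaluation hook: a benchmark harness can count blocked actions as would-be harmful failures without letting them reach the site.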

A plausible implication is that future agentic marketplaces will require explicit coordination between platforms, regulatory bodies, and both buyer- and seller-side agent designers to ensure transparency, safety, and fairness as AI-mediated commerce transitions from controlled simulation to dominant reality.