ShopSimulator: Retail & E-Commerce Simulation

Updated 2 February 2026
  • ShopSimulator is a simulation framework that models retail, grocery, and e-commerce operations through high-fidelity 3D environments, agent-based systems, and dialog simulators.
  • It offers extensive APIs, realistic benchmarks, and integration of reinforcement learning and vision-language modules to address tasks in navigation, inventory management, and personalized shopping.
  • The platform combines probabilistic, operational, and adversarial simulation methods to support research on pricing strategies, multi-modal interactions, and real-world decision-making performance.

ShopSimulator refers to a class of simulation environments, frameworks, and toolkits designed for the rigorous modeling, evaluation, and training of AI agents and decision policies in retail, grocery, and e-commerce domains. These environments span physical retail simulations for embodied agents, discrete-event agent-based models for store operations, e-commerce dialog simulators for LLMs, probabilistic market and catalog simulators, and synthetic data generators for reinforcement learning (RL) and benchmarking. The ShopSimulator ecosystem supports research in navigation, manipulation, customer behavior prediction, product search, pricing strategies, inventory management, and multi-modal human–AI interaction.

1. Embodied Retail Environment Simulators

High-fidelity 3D and VR-based ShopSimulator environments have been created for benchmarking embodied agents in navigational, perceptual, and manipulation tasks within photorealistic retail settings. Sari Sandbox ("ShopSimulator") exemplifies this genre, providing:

  • Three convenience-store floorplans modeled after empirical surveys, each differing in aisles, shelf depth, and checkout placement. Rendering is handled via Unity URP, maintaining photorealistic visuals at ~60fps.
  • 250+ interactive grocery item meshes covering 11 categories, annotated for physics realism (mass ∈ [0.1, 1.2] kg, friction ≈ 0.5) and manipulated via primitives (pick, drop, poke, inspect).
  • API primitives including TransformAgent, TransformHands, ToggleGrip, and perception tools (RGB, depth, segmentation).
  • VR human-in-the-loop integration, enabling human participants to generate ground-truth demonstration datasets (SariBench) for navigation and item retrieval under varying difficulty.
  • Vision–language integration, where tasks are grounded and decomposed via models such as Gemini 2.5 Pro in a ReAct-style planning loop with tool calls for object detection and text extraction.
  • Quantitative benchmarking with metrics (success rate SR, path length PL, inspection accuracy, and manipulation rate) and direct human–agent comparison across tasks. For instance, agents require 5–16× more time than humans due to model planning overhead and OCR errors.
  • Roadmap suggestions for realism/scalability improvements: dynamic lighting, deformable integration, procedural catalog and layout expansion, and multi-agent scenarios (Gajo et al., 1 Aug 2025).
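The control primitives above can be illustrated with a minimal pick loop. The primitive names (TransformAgent, ToggleGrip) come from the list above, but the SariEnv wrapper and every signature below are hypothetical assumptions, not the actual Sari Sandbox API:

```python
# Illustrative sketch only: SariEnv and these method signatures are invented
# stand-ins for the TransformAgent / ToggleGrip primitives named above.
class SariEnv:
    def __init__(self):
        self.agent_pos = (0.0, 0.0)
        self.grip_closed = False

    def transform_agent(self, dx, dy):
        # TransformAgent analogue: translate the agent in the store plane.
        x, y = self.agent_pos
        self.agent_pos = (x + dx, y + dy)

    def toggle_grip(self):
        # ToggleGrip analogue: open/close the hand, returning the new state.
        self.grip_closed = not self.grip_closed
        return self.grip_closed

def pick_item(env, target):
    """Move toward a target shelf position, then close the grip."""
    dx = target[0] - env.agent_pos[0]
    dy = target[1] - env.agent_pos[1]
    env.transform_agent(dx, dy)
    return env.toggle_grip()
```

In the real environment these calls would be interleaved with the perception tools (RGB, depth, segmentation) inside the ReAct-style planning loop.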

MarketGen extends this approach to multi-modal, procedurally generated supermarkets:

  • Agent-driven PCG combines spatial (Binary Space Partitioning with constraint refinement) and semantic zoning (via LLMs) to synthesize conformant supermarket layouts from text/image prompts.
  • Asset library contains 1,100+ unique products and 100+ facilities, parameterized for architectural realism (shelf height, aisle width, adjacency constraints).
  • Benchmark tasks include Checkout Unloading (robotic picking and placing) and In-Aisle Item Collection (mobile robotic navigation and manipulation), with metrics such as SR and SPL.
  • Agent stack combines affordance generation (SAM, SoM), motion planning (cuRobo), and LLM-based planning, and demonstrates sim-to-real transfer on standard object shapes with simulation–real success rate discrepancies ≤10%.
  • Exposes a Python API for scenario instantiation, parallelization, and user asset injection, though current limitations include lack of dynamic avatars and non-supermarket layouts (Hu et al., 26 Nov 2025).
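The spatial half of the PCG pipeline can be sketched with a plain Binary Space Partitioning pass over the store footprint. The minimum-zone constraint, split heuristic, and recursion depth below are illustrative assumptions, not MarketGen's actual parameters or constraint-refinement step:

```python
import random

# Sketch of Binary Space Partitioning for supermarket zoning: recursively
# split a (x, y, width, height) rectangle along its longer axis until zones
# fall below a minimum size. Constraint values are illustrative only.
def bsp_partition(rect, min_size=4.0, rng=None, depth=0, max_depth=4):
    rng = rng or random.Random(0)
    x, y, w, h = rect
    # Stop when the zone is too small to split again or depth is exhausted.
    if depth >= max_depth or (w < 2 * min_size and h < 2 * min_size):
        return [rect]
    if w >= h:  # split the longer axis for balanced aisle proportions
        cut = rng.uniform(min_size, w - min_size)
        left, right = (x, y, cut, h), (x + cut, y, w - cut, h)
    else:
        cut = rng.uniform(min_size, h - min_size)
        left, right = (x, y, w, cut), (x, y + cut, w, h - cut)
    return (bsp_partition(left, min_size, rng, depth + 1, max_depth)
            + bsp_partition(right, min_size, rng, depth + 1, max_depth))

zones = bsp_partition((0.0, 0.0, 40.0, 30.0))
```

In the full system, an LLM-based semantic-zoning stage would then assign categories (produce, dairy, checkout) to the resulting rectangles.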

2. Operational and Agent-Based Retail Simulators

Classic ShopSimulator frameworks rooted in discrete-event and agent-based simulation are engineered for modeling store operations, people management practices, and the impact of process and personnel changes:

  • Architecture merges a discrete-event skeleton (customer arrivals, queues, service events) with an agent-based layer (heterogeneous customers, staff, managers, each with memory and decision policies).
  • Arrival, browsing, help, and checkout times are driven by empirically calibrated random processes (e.g., non-homogeneous Poisson for arrivals, triangular distributions for durations).
  • Agents are formally specified by state-transition diagrams and simple, probabilistically parameterized rules (e.g., p_needHelp, p_buyAfterBrowsing), with extensions for memory and persona heterogeneity.
  • Benchmarking covers throughput, waiting time, staff utilization, and satisfaction (both per-visit and cumulative indices), quantified through event-based weighting schemes.
  • These models support counterfactual experiments ("what-if" scenarios) on staff mix, training, empowerment, and loyalty interventions, and are periodically validated against field observational data and transaction logs (Siebers et al., 2010).
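The hybrid architecture above can be sketched as a small event loop: a heap-ordered discrete-event skeleton drives arrivals and checkouts, while per-customer probabilistic rules supply the agent layer. All rates, durations, and probabilities below are illustrative assumptions, not values calibrated in the cited study:

```python
import heapq
import random

# Minimal discrete-event + agent-based sketch of the store model described
# above: Poisson arrivals, triangular browsing durations, and a probabilistic
# purchase rule (a p_buyAfterBrowsing analogue). Parameters are illustrative.
def simulate_store(horizon=480.0, arrival_rate=0.5, p_buy=0.7, seed=42):
    rng = random.Random(seed)
    events = []  # (time, kind) min-heap: the discrete-event skeleton

    # Schedule Poisson arrivals over the trading day (minutes).
    t = rng.expovariate(arrival_rate)
    while t < horizon:
        heapq.heappush(events, (t, "arrival"))
        t += rng.expovariate(arrival_rate)

    visitors = buyers = 0
    while events:
        now, kind = heapq.heappop(events)
        if kind == "arrival":
            visitors += 1
            browse = rng.triangular(2.0, 20.0, 8.0)  # browsing duration
            heapq.heappush(events, (now + browse, "checkout"))
        elif kind == "checkout":
            if rng.random() < p_buy:  # agent-layer purchase decision
                buyers += 1
    return visitors, buyers
```

A "what-if" experiment then amounts to re-running the loop with altered parameters (e.g., a higher p_buy after a staff-training intervention) and comparing throughput and satisfaction indices.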

3. E-Commerce and Multi-Turn Shopping Dialog Simulators

ShopSimulator environments for LLM-based agent evaluation model online shopping as POMDPs involving multi-turn dialog, user personalization, and fine-grained product disambiguation:

  • Environments simulate a user interface with rich context: current page, clickable elements, user utterance, and structured user profile (demographic, behavioral, brand/budget preferences).
  • Action space comprises search, click (including purchase), and user query actions; subsequent transitions model user replies and environment updates.
  • Grounded product catalogs are extracted from real-world e-commerce sites (e.g., 1.34M Taobao items across 12 domains) with task definitions covering single/multi-turn, with vs. without personalization (28K total tasks).
  • Rewards are computed at episode end, using additive and multiplicative alignment rewards over categories, attributes, options, and price (R_loose, R_strict), with the strict full-success rate as the primary objective.
  • Benchmarks show SOTA models (GPT-5, DeepSeek-V3.1) achieve full-success rates below 40%, especially in long-horizon, personalized, or multi-turn scenarios.
  • Training with supervised fine-tuning (SFT) and reinforcement learning (RL, e.g., GRPO in the ROLL framework) under strict rewards yields the largest gains on difficult, compositional shopping tasks (+24.76 pp single-turn, +29.02 pp multi-turn R_succ) (Wang et al., 26 Jan 2026).
  • Design recommendations emphasize reward shaping, structured prompts, LLM role-play user simulation, and initialization of RL policies with SFT-warm starts.
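The loose/strict distinction above can be sketched directly: the additive reward gives partial credit per matched dimension, while the multiplicative reward zeroes out on any mismatch. The dimension set and exact-match logic are illustrative assumptions about the general scheme, not the paper's precise formulas:

```python
# Sketch of additive (R_loose) vs. multiplicative (R_strict) alignment
# rewards over category/attribute/option/price dimensions. The equality-based
# matching below is an illustrative simplification.
def alignment_rewards(pred, target,
                      dims=("category", "attributes", "options", "price")):
    scores = [1.0 if pred.get(d) == target.get(d) else 0.0 for d in dims]
    r_loose = sum(scores) / len(scores)  # partial credit per dimension
    r_strict = 1.0
    for s in scores:                     # any mismatch zeroes the reward
        r_strict *= s
    return r_loose, r_strict
```

This makes concrete why strict-reward training is harder but more discriminative: an agent that gets every dimension but price still earns R_strict = 0.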

4. Market Simulation, Personalization, and Price/Promotion Optimizer Simulators

ShopSimulators for synthetic data generation and RL benchmarking enable large-scale, calibrated experimentation on personalization and pricing-strategy optimization:

  • RetailSynth and related toolkits implement multi-stage customer behavior models (store visit, category choice, product choice, quantity) with latent heterogeneity in price sensitivity and experience, empirically calibrated via public grocery datasets.
  • Product prices are programmatically evolved via Hidden Markov Models (base/discount state switching) and Bayesian parameter sampling.
  • Modular architecture supports dynamic pricing, high–low/seasonal/static policies, and evaluation of scenario-specific outcomes (revenue, penetration, retention, basket size) with tight alignment to real data via KS calibrations.
  • Open-source implementations provide Python APIs, vectorization, pluggable RL agent hooks, and scenario control.
  • In coupon targeting, purchase sparsity is explicitly addressed by batch RL, feature summarization, denser surrogate rewards, and validation through repeated experiments.
  • Bandit and deep RL methods (contextual LinTS, LinUCB, PPO, DQN) outperform static policies; e.g., LinTS/LinUCB and PPO achieve 1.26× normalized revenue over static policies in simulation.
  • Analytical segment analysis reveals that discount offers are adaptively concentrated on price-sensitive segments, but deep coupons may sometimes be over-allocated due to limited model expressivity (Xia et al., 2024, Xia et al., 2023).
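The Hidden Markov price evolution mentioned above can be sketched as a two-state chain that switches between base and discount prices. The transition probabilities and discount depth below are illustrative assumptions, not RetailSynth's calibrated parameters:

```python
import random

# Sketch of a two-state Hidden Markov price process: a product switches
# between its base price and a discounted price with fixed transition
# probabilities. All parameter values are illustrative only.
def simulate_price_path(base_price, n_steps, p_enter=0.2, p_exit=0.5,
                        discount=0.25, seed=0):
    rng = random.Random(seed)
    state = "base"
    prices = []
    for _ in range(n_steps):
        # State transition: base -> discount or discount -> base.
        if state == "base" and rng.random() < p_enter:
            state = "discount"
        elif state == "discount" and rng.random() < p_exit:
            state = "base"
        prices.append(base_price * (1 - discount)
                      if state == "discount" else base_price)
    return prices
```

In the full toolkit, the transition and discount parameters would themselves be drawn by Bayesian parameter sampling per product rather than fixed constants.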

5. Probabilistic Simulators for Spatial Demand and Product Allocation

Advanced ShopSimulators model spatial product allocation as a stochastic process, facilitating offline learning of shelf and region-specific policies:

  • The core is a Bayesian hierarchical demand simulator parameterized over regions and products, fitted via ADVI and NUTS posterior sampling using supermarket sales data.
  • At each simulation epoch, the sold quantity q_ij^(t) for region i and product j is sampled from a truncated normal with dynamic predictors (mean μ_ij^(t) linear in time, region, product, and autoregressive features).
  • Region–product allocations are encoded as a binary matrix Z, enabling explicit modeling of placement effects and demand redistribution.
  • RL (DQN) agents interact with the simulator to learn allocation policies, with actions corresponding to "remove/place" operations. Learnable Q-networks optimize cumulative revenue under placement constraints and intervention costs.
  • DQN outperforms random, do-nothing, and heuristic (Tabu) baselines by up to 24.5% in long-horizon simulated roll-outs, especially as horizon length increases, indicating effective exploitation of spatio-temporal demand correlations (Jenkins et al., 2020).
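One epoch of the demand model above can be sketched as sampling q_ij from a zero-truncated normal wherever the allocation matrix Z places product j in region i. The mean/variance values and the simple rejection sampler are illustrative assumptions, not posterior estimates from the fitted model:

```python
import random

# Sketch of one epoch of the spatial demand simulator: demand is drawn from
# a normal truncated at zero, but only for allocated region-product pairs
# (Z[i][j] == 1). Parameters are illustrative stand-ins for the posterior.
def truncated_normal(rng, mu, sigma, low=0.0):
    # Simple rejection sampling at the truncation bound; adequate when
    # mu is comfortably above `low`.
    while True:
        x = rng.gauss(mu, sigma)
        if x >= low:
            return x

def sample_demand(Z, mu, sigma=2.0, seed=1):
    rng = random.Random(seed)
    q = [[0.0] * len(row) for row in Z]
    for i, row in enumerate(Z):
        for j, placed in enumerate(row):
            if placed:  # demand arises only where the product is allocated
                q[i][j] = truncated_normal(rng, mu[i][j], sigma)
    return q
```

An RL agent's "remove/place" actions then amount to flipping entries of Z between epochs and observing the resampled demand.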

6. Large-Scale, Benchmark-Driven Shopping Sandboxes

Emerging ShopSimulators concentrate on benchmark-driven evaluations for LLM-based agents in highly granular, real-world shopping tasks:

  • ShoppingBench features a shopping sandbox with 2.7M+ real-world products (metadata from Lazada.com), encompassing attribute-rich and constraint-driven tasks (budget, vouchers, same shop) within a POMDP formalism.
  • Action interface exposes granular tools (find_product, view_product_information, budget calculation, web search, recommend, terminate), with agent–environment transitions determined by the outcome of each tool call.
  • Task generation synthesizes realistic, intent-grounded instructions through multi-stage prompt pipelines leveraging attributes, user intents, external knowledge, and constraint enforcement.
  • Benchmark metrics include cumulative average relevance (CAR), absolute success rate (ASR), and finer constraint fulfillment (title similarity, price bounds, attribute overlap, knowledge inclusion).
  • Trajectory distillation with SFT and RL (GRPO reward-format-match) consistently outperforms raw LLMs, yet absolute success rates for complex intent (e.g., coupon, multi-shop) remain below 35–50% even for GPT-4.1 (Wang et al., 6 Aug 2025).
  • Failure analysis reveals that attribute mismatches and constraint failures dominate error modes, with product detail inspection strongly correlated with task success.
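The tool-based action interface above can be sketched as a registry plus a dispatch step, where each tool call's result drives the next agent–environment transition. The tool names come from the description; the registry, toy catalog, and dispatch signatures are illustrative assumptions, not the actual ShoppingBench API:

```python
# Sketch of a tool-dispatch loop over a ShoppingBench-style action interface.
# CATALOG and these function signatures are hypothetical stand-ins.
CATALOG = {"p1": {"title": "usb-c cable", "price": 4.99}}

def find_product(query):
    # Toy substring retrieval over product titles.
    return [pid for pid, p in CATALOG.items() if query in p["title"]]

def view_product_information(pid):
    # Return full metadata for one product, or None if unknown.
    return CATALOG.get(pid)

TOOLS = {
    "find_product": find_product,
    "view_product_information": view_product_information,
}

def step(tool_name, *args):
    """Execute one agent tool call; its result feeds the next transition."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](*args)
```

The failure analysis above suggests why exposing view_product_information-style inspection matters: agents that skip the detail step tend to miss attribute constraints.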

7. Adversarial User–Search Environment Simulations

ShopSimulator frameworks additionally support the adversarial evaluation of search and recommendation algorithms using generative adversarial imitation learning:

  • AESim architecture combines a WGAN-GP for user/query generation and a GAIL feedback module trained on real e-commerce logs, providing a closed-loop simulation for ranking model testing.
  • Simulated sessions involve retrieval, ranking, and synthetic feedback rollouts, enabling direct computation of business metrics (CTR, CVR, GMV) and offline–online correlation (Spearman ~0.8 between AESim GMV and true A/B test revenue lift).
  • GAIL-driven simulators close the gap between static offline metrics (AUC, NDCG) and revenue-centric online KPIs, flagging models with poor real-world performance despite favorable offline ranks.
  • Limitations include single-session constraints and GAIL computational cost; future directions point to multi-turn dialog, personalized re-ranking, and reinforcement learning for long-term retention (Gao et al., 2021).
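The offline–online correlation check above reduces to a Spearman rank correlation between simulated GMV per candidate model and its observed A/B revenue lift. A from-scratch sketch using the standard rank-difference formula (valid when there are no ties) is:

```python
# Sketch of the Spearman rank correlation used to compare simulated GMV
# against real A/B revenue lift. Standard no-ties formula:
#   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
def spearman(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A value near 0.8, as reported for AESim, indicates that ranking candidate models by simulated GMV largely reproduces their ranking by true online revenue lift.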
