- The paper introduces SimGym, a scalable, agent-based framework that simulates e-commerce UI changes via live browser interactions.
- It employs a six-phase data pipeline to extract and ground buyer personas from clickstream data, ensuring realistic simulation of user intent.
- Experimental results show that incorporating memory and personalized behavior yields 69% sign alignment and a 0.64 Pearson correlation with observed outcomes, while simulation reduces test cycle times from weeks to under an hour.
SimGym: Traffic-Grounded, Agent-Based Browser Simulation for Offline E-Commerce A/B Testing
Introduction
The SimGym framework is a scalable architecture for agent-based, offline A/B testing of e-commerce interfaces, using LLM-driven agents that operate in a live browser environment. It targets the inefficiencies and risks of traditional online A/B testing: long cycle times, exposure of real users to suboptimal UI, and the operational costs of instrumentation and confounder control. SimGym shows that synthetic agent-based testing can align closely with observed user behavior, which opens an avenue for rapid pre-deployment evaluation of UI interventions without real user exposure.
Methodology
Persona and Intent Extraction
SimGym’s synthetic agents are underpinned by a six-phase data pipeline that grounds buyer personas and intents directly in production clickstream data:
- Session-Level Clustering: Sessions are clustered (k=5) using behavioral features, including engagement metrics, funnel progression, and economic value.
- Product Preference Extraction: Within each cluster, LLMs extract aggregated buyer interests, producing reusable product themes and reasoning chains.
- Intent Generation: Structured intents are calibrated to cluster-specific conversion tendencies, ensuring realistic browsing vs. acquisition intent mixes for each agent.
- Buyer Behavior Aggregation: Centroid-proximate sessions are aggregated at the buyer level to summarize key behavior.
- Multidimensional Persona Construction: Agents receive fine-grained profiles capturing price sensitivity, exploration depth, and latent value dimensions (premium, performance, ethics), all normalized in a category-aware fashion.
- Prompt Composition: Each agent’s persona, intent, and product preferences are consolidated into a prompt that conditions simulated behavior across variants.
This data-grounding ensures that agents reflect empirical customer heterogeneity, eschewing heuristically constructed or transferred personas.
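The last stages of the pipeline (intent generation and prompt composition) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Persona` fields, the `sample_intent` and `compose_prompt` helpers, the prompt wording, and the example conversion rate are all assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical schema; the source does not publish the actual fields.
    cluster_id: int
    price_sensitivity: float   # 0 = insensitive, 1 = highly sensitive
    exploration_depth: float   # 0 = shallow, 1 = deep browsing
    product_themes: list


def sample_intent(conversion_rate: float, rng: random.Random) -> str:
    """Draw an acquisition or browsing intent at the cluster's conversion tendency."""
    return "acquire" if rng.random() < conversion_rate else "browse"


def compose_prompt(persona: Persona, intent: str) -> str:
    """Consolidate persona, intent, and product preferences into one prompt."""
    themes = ", ".join(persona.product_themes)
    return (
        f"You are a shopper from buyer cluster {persona.cluster_id}.\n"
        f"Price sensitivity: {persona.price_sensitivity:.2f}\n"
        f"Exploration depth: {persona.exploration_depth:.2f}\n"
        f"Product interests: {themes}\n"
        f"Session intent: {intent}"
    )


# Illustrative values only; none of these numbers come from the paper.
rng = random.Random(0)
persona = Persona(
    cluster_id=2,
    price_sensitivity=0.8,
    exploration_depth=0.3,
    product_themes=["trail running shoes", "hydration packs"],
)
prompt = compose_prompt(persona, sample_intent(conversion_rate=0.15, rng=rng))
print(prompt)
```

Sampling the intent per agent, rather than fixing it per cluster, is one simple way to reproduce a cluster-level mix of browsing and acquisition sessions across a population of agents.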
Agent Architecture
SimGym agents directly interact with live storefronts in a full-featured browser:
- Web Perception: Agents use the accessibility tree rather than raw DOM or screenshots, facilitating robust, generalizable action selection across heterogeneous designs.
- Session Memory: Agents maintain full episodic memory of actions, page states, and reasoning, enabling coherent, non-myopic navigation and decision-making.
- Schema-Constrained Action Planning: At each step, the agent uses its persona, intent, and cumulative session state to plan and execute browser interactions, with termination upon goal satisfaction, step exhaustion, or guardrail triggers (loop, error, timeout).
- Fine-Grained Behavioral Control: Guardrails and error propagation ensure failure modes are contained and analyzed for simulation reliability.
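The control flow described above can be sketched as a small episode loop. Everything here is a hedged illustration: `run_episode`, the `ALLOWED_ACTIONS` set, the loop-detection rule, and the toy policy and environment are hypothetical stand-ins for the LLM call and the live browser, and the timeout guardrail is omitted for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical action schema; the real action vocabulary is not given in the source.
ALLOWED_ACTIONS = {"click", "type", "scroll", "add_to_cart", "stop"}


@dataclass
class Session:
    # Episodic memory: one (page_state, action, reasoning) tuple per step.
    memory: list = field(default_factory=list)


def run_episode(policy, env_step, start_state, max_steps=20, loop_window=3):
    """Run one simulated session and return (termination_reason, session)."""
    session = Session()
    state = start_state
    for _ in range(max_steps):
        action, reasoning = policy(state, session.memory)
        if action not in ALLOWED_ACTIONS:        # schema violation -> error guardrail
            return "error", session
        session.memory.append((state, action, reasoning))
        if action == "stop":                     # policy declares its goal satisfied
            return "goal_satisfied", session
        # Loop guardrail: the same (state, action) pair repeated loop_window times.
        recent = [(s, a) for s, a, _ in session.memory[-loop_window:]]
        if len(recent) == loop_window and len(set(recent)) == 1:
            return "loop", session
        state = env_step(state, action)
    return "step_exhaustion", session


def toy_env(state, action):
    # Stand-in for the browser: clicking from the home page opens a product page.
    return "product_page" if action == "click" else state


def toy_policy(state, memory):
    # Stand-in for the LLM: consults memory to avoid repeating add_to_cart.
    if any(a == "add_to_cart" for _, a, _ in memory):
        return "stop", "item already in cart"
    if state == "product_page":
        return "add_to_cart", "matches interests"
    return "click", "open first product"


def memoryless_policy(state, memory):
    # Same rules, but ignores memory entirely.
    if state == "product_page":
        return "add_to_cart", "matches interests"
    return "click", "open first product"


reason_with_memory, session_a = run_episode(toy_policy, toy_env, "home")
reason_without_memory, session_b = run_episode(memoryless_policy, toy_env, "home")
print(reason_with_memory, reason_without_memory)
```

In this toy setup the memory-aware policy stops once the item is in the cart, while the memoryless variant repeats the same action until the loop guardrail fires. It only illustrates the mechanism and is not evidence for the reported results.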
Evaluation and Empirical Results
Ground-Truth Dataset Construction
SimGym is validated on natural UI interventions (theme switches) on a production e-commerce platform, filtered to control for confounders (seasonality, promotions, ramp-up effects, merchandising shifts) via double machine learning. The benchmark comprises 20 shops from 12 countries and diverse verticals, each with pre- and post-switch behavioral data.
Predictive Validity Metrics
Two principal metrics are used:
- Sign Alignment: Agreement in the direction of simulated and observed A2C (add-to-cart) rate changes between UI variants.
- Pearson Correlation: Correlation across buyer clusters between the magnitude of agent-predicted and empirical A2C shifts.
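Both metrics are straightforward to compute from paired lists of simulated and observed A2C changes. The sketch below assumes one delta per buyer cluster (or per intervention); the function names and the example numbers are made up for illustration and are not the paper's data.

```python
import math


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sign_alignment(simulated, observed):
    """Fraction of pairs whose simulated and observed A2C changes share a direction."""
    matches = sum(1 for s, o in zip(simulated, observed) if sign(s) == sign(o))
    return matches / len(simulated)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up A2C rate changes (variant B minus variant A), one value per cluster.
simulated_delta = [0.02, -0.01, 0.03, -0.02]
observed_delta = [0.01, -0.02, -0.01, -0.03]

alignment = sign_alignment(simulated_delta, observed_delta)
correlation = pearson(simulated_delta, observed_delta)
print(f"sign alignment = {alignment:.2f}, Pearson r = {correlation:.2f}")
```

Sign alignment only checks direction, so it is insensitive to effect size. The correlation complements it by checking whether larger simulated shifts correspond to larger observed ones.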
Strong numerical results are reported:
- SimGym (fully enabled) achieves 69% alignment and a 0.64 Pearson correlation, capturing both the direction and relative magnitude of human response across interventions.
- Cycle times for simulated experimentation are reduced from weeks to under an hour.
Ablation Studies
Ablation experiments isolate the causal roles of two critical system features:
- Memory Ablation: Removing episodic session memory collapses predictive validity (correlation drops from 0.64 to 0.29 and alignment from 69% to 55%). The majority of memoryless agents either time out or become stuck in loops, confirming the necessity of contextualized navigation for reliable behavioral prediction.
- Persona Grounding: Removing data-grounded personas (“Intent Only”) yields only 52% alignment and 0.27 correlation; agents still navigate but do not evaluate products in a persona-driven way. Using generic, segment-derived personas raises alignment to 62% but leaves correlation at the same degraded level as Intent Only, indicating that donor-based distributions misrepresent the focal shop’s buyer mix and induce miscalibration.
These results substantiate the paper’s claim that both behavioral memory and shop-specific persona grounding are essential for predictive fidelity in simulation-based A/B testing.
Theoretical and Practical Implications
Causal Predictivity in Synthetic Agent Evaluation
Unlike prior LLM agent research, where validation centered on behavioral realism or task success under static offline conditions, SimGym demonstrates quantitative predictive validity against effects observed in production UI interventions. This addresses a longstanding challenge in agent-based simulation: generating behavioral shifts that not only look realistic but also forecast the direction and relative magnitude of population-level causal deltas.
Rapid Experimentation and Risk Elimination
SimGym’s framework enables merchants to run non-intrusive A/B tests entirely offline, simulating both the buyer population distribution and the causal impact of a UI change. The pipeline compresses the risk-discovery cycle from weeks to under an hour and decouples experiment power from traffic constraints, providing a path toward automated, closed-loop UI optimization in e-commerce.
Pathways for Future Development
- Agent fidelity can be improved via supervised fine-tuning or reinforcement learning from real user session feedback, targeting residual gaps in edge-case persona representation.
- Extension to end-to-end, unsupervised persona discovery could uncover latent behavioral archetypes not captured by explicit clustering.
- Integration with vision-enabled models and multimodal perception would allow evaluation of design changes whose impact is primarily visual, which remains a significant open challenge.
- SimGym could function as an inner loop within continuous optimization systems, proposing and screening high-throughput UI modifications before real-user exposure.
Conclusion
SimGym establishes a foundation for offline, agent-based A/B testing that combines high-fidelity persona grounding with behavioral predictions that align with observed causal effects. Its empirical results show that synthetic LLM agents can meaningfully forecast treatment impacts, creating new opportunities for rapid experimentation without real-user exposure in complex, real-world e-commerce settings. The framework’s modularity and extensibility make it a credible candidate for the inner optimization loop of adaptive online platforms, and its evaluation methodology offers a reference standard for future agent-based research in behavioral simulation and causality.