SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce

Published 1 Feb 2026 in cs.AI | (2602.01443v1)

Abstract: A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by LLM agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control. Even without alignment post-training, SimGym agents achieve state-of-the-art alignment with observed outcome shifts and reduce experiment cycles from weeks to under an hour, enabling rapid experimentation without exposure to real buyers.

Summary

  • The paper introduces SimGym, a scalable, agent-based framework that simulates e-commerce UI changes via live browser interactions.
  • It employs a six-phase data pipeline to extract and ground buyer personas from clickstream data, ensuring realistic simulation of user intent.
  • Experimental results demonstrate that incorporating memory and personalized behavior yields 69% alignment and a 0.64 Pearson correlation, reducing test cycle times from weeks to under an hour.

SimGym: Traffic-Grounded, Agent-Based Browser Simulation for Offline E-Commerce A/B Testing

Introduction

The SimGym framework presents a scalable architecture for conducting agent-based, offline A/B testing of e-commerce interfaces, leveraging LLM-driven agents that operate in a live browser environment. It fundamentally addresses the inefficiencies and risks inherent in traditional online A/B testing paradigms, such as long cycle times, exposure of real users to suboptimal UI, and the operational costs of instrumentation and confounder control. Most notably, SimGym demonstrates that synthetic agent-based testing can achieve high predictive alignment with actual observed user behavior, thereby introducing an avenue for rapid pre-deployment evaluation of UI interventions without requiring real user exposure.

Methodology

Persona and Intent Extraction

SimGym’s synthetic agents are underpinned by a six-phase data pipeline that grounds buyer personas and intents directly in production clickstream data:

  • Session-Level Clustering: Sessions are clustered (k=5) using behavioral features, including engagement metrics, funnel progression, and economic value.
  • Product Preference Extraction: Within each cluster, LLMs extract aggregated buyer interests, producing reusable product themes and reasoning chains.
  • Intent Generation: Structured intents are calibrated to cluster-specific conversion tendencies, ensuring realistic browsing vs. acquisition intent mixes for each agent.
  • Buyer Behavior Aggregation: Centroid-proximate sessions are aggregated at the buyer level to summarize key behavior.
  • Multidimensional Persona Construction: Agents receive fine-grained profiles capturing price sensitivity, exploration depth, and latent value dimensions (premium, performance, ethics), all normalized in a category-aware fashion.
  • Prompt Composition: Each agent’s persona, intent, and product preferences are consolidated into a prompt that conditions simulated behavior across variants.

This data-grounding ensures that agents reflect empirical customer heterogeneity, eschewing heuristically constructed or transferred personas.
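The last steps of the pipeline above (intent calibration, persona attachment, prompt composition) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Cluster` and `Persona` fields, the two-way "purchase"/"browse" intent, the prompt wording, and the idea of sampling agents in proportion to each cluster's traffic share are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical summary of one behavioral cluster."""
    name: str
    traffic_share: float    # fraction of the shop's sessions in this cluster
    conversion_rate: float  # fraction of sessions assumed to carry purchase intent
    product_themes: tuple   # product themes extracted for the cluster


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona dimensions, each normalized to [0, 1]."""
    price_sensitivity: float
    exploration_depth: float


def sample_intent(cluster, rng):
    """Draw 'purchase' with probability equal to the cluster's conversion rate."""
    return "purchase" if rng.random() < cluster.conversion_rate else "browse"


def compose_prompt(cluster, persona, intent):
    """Consolidate persona, intent, and product preferences into one prompt."""
    themes = ", ".join(cluster.product_themes)
    return (
        f"You are a shopper from the '{cluster.name}' segment.\n"
        f"Price sensitivity: {persona.price_sensitivity:.2f}; "
        f"exploration depth: {persona.exploration_depth:.2f}.\n"
        f"Interests: {themes}.\n"
        f"Session intent: {intent}."
    )


def build_cohort(clusters, personas, n_agents, seed=0):
    """Sample agents in proportion to cluster traffic share and build prompts."""
    rng = random.Random(seed)
    weights = [c.traffic_share for c in clusters]
    prompts = []
    for _ in range(n_agents):
        cluster = rng.choices(clusters, weights=weights, k=1)[0]
        intent = sample_intent(cluster, rng)
        prompts.append(compose_prompt(cluster, personas[cluster.name], intent))
    return prompts


if __name__ == "__main__":
    demo_clusters = [
        Cluster("bargain_hunters", 0.7, 0.05, ("discount sneakers",)),
        Cluster("loyal_buyers", 0.3, 0.40, ("premium boots", "leather care")),
    ]
    demo_personas = {
        "bargain_hunters": Persona(0.9, 0.4),
        "loyal_buyers": Persona(0.2, 0.8),
    }
    print(build_cohort(demo_clusters, demo_personas, n_agents=2)[0])
```

Fixing the seed makes a cohort reproducible, so the same set of prompts can be replayed against both the control and the treatment storefront.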

Agent Architecture

SimGym agents directly interact with live storefronts in a full-featured browser:

  • Web Perception: Agents use the accessibility tree rather than raw DOM or screenshots, facilitating robust, generalizable action selection across heterogeneous designs.
  • Session Memory: Agents maintain full episodic memory of actions, page states, and reasoning, enabling coherent, non-myopic navigation and decision-making.
  • Schema-Constrained Action Planning: At each step, the agent uses its persona, intent, and cumulative session state to plan and execute browser interactions, with termination upon goal satisfaction, step exhaustion, or guardrail triggers (loop, error, timeout).
  • Fine-Grained Behavioral Control: Guardrails and error propagation ensure failure modes are contained and analyzed for simulation reliability.
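The control flow described in this list can be sketched as a loop that appends every step to an episodic memory and stops on goal satisfaction, step exhaustion, or a loop guardrail. This is a simplified illustration under stated assumptions, not the paper's implementation: the `policy` and `env` callables stand in for the LLM planner and the live browser, observations are plain strings rather than accessibility trees, and the loop rule (three identical observation/action pairs in a row) is a hypothetical choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    observation: str
    action: str
    reasoning: str


@dataclass
class Session:
    prompt: str
    memory: List[Step] = field(default_factory=list)


def is_looping(memory, window=3):
    """Hypothetical guardrail: the last `window` steps are identical."""
    if len(memory) < window:
        return False
    recent = {(s.observation, s.action) for s in memory[-window:]}
    return len(recent) == 1


def run_session(prompt, policy, env, first_observation, max_steps=20):
    """Run one simulated session.

    policy(prompt, memory, observation) -> (action, reasoning)
    env(action) -> next observation
    Returns (termination_reason, session).
    """
    session = Session(prompt)
    observation = first_observation
    for _ in range(max_steps):
        # The policy sees the full episodic memory, not just the current page.
        action, reasoning = policy(session.prompt, session.memory, observation)
        session.memory.append(Step(observation, action, reasoning))
        if action == "done":
            return "goal_satisfied", session
        if is_looping(session.memory):
            return "guardrail_loop", session
        observation = env(action)
    return "step_exhaustion", session
```

A policy that ignores `memory` and keeps issuing the same action on the same page trips the loop guardrail after three steps, which mirrors the failure mode the memory ablation reports for memoryless agents.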

Evaluation and Empirical Results

Ground-Truth Dataset Construction

SimGym is validated using natural UI interventions on a production e-commerce platform, filtered to control for confounders (seasonality, promotions, ramp-up effects, merchandising shifts) via double machine learning. The benchmark comprises 20 shops from 12 countries and diverse verticals, each with pre/post theme-switch behavioral data.

Predictive Validity Metrics

Two principal metrics are used:

  • Sign Alignment: Agreement in the direction of simulated and observed A2C (add-to-cart) rate changes between UI variants.
  • Pearson Correlation: Correlation across buyer clusters between the magnitude of agent-predicted and empirical A2C shifts.
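Both metrics are straightforward to compute from paired lists of simulated and observed A2C rate changes. The sketch below uses the standard definitions; how the paper pairs the values (per shop, per cluster, or both) and how it treats zero deltas are not specified here, so the pairing and the rule that two zero deltas count as agreement are assumptions.

```python
import math


def _sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_alignment(simulated, observed):
    """Fraction of pairs whose simulated and observed deltas share a sign."""
    if len(simulated) != len(observed) or not simulated:
        raise ValueError("inputs must be non-empty and of equal length")
    agree = sum(1 for s, o in zip(simulated, observed) if _sign(s) == _sign(o))
    return agree / len(simulated)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Sign alignment only checks direction, so it can look acceptable even when predicted magnitudes are poorly ranked; the correlation captures that second aspect, which is why the two are reported together.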

Strong numerical results are reported:

  • SimGym (fully enabled) achieves 69% alignment and a 0.64 Pearson correlation, capturing both the direction and relative magnitude of human response across interventions.
  • Cycle times for simulated experimentation are reduced from weeks to under an hour.

Ablation Studies

Ablation experiments isolate the causal roles of two critical system features:

  • Memory Ablation: Removing episodic session memory collapses predictive validity (correlation drops from 0.64 to 0.29 and alignment from 69% to 55%). The majority of memoryless agents either time out or become stuck in loops, confirming the necessity of contextualized navigation for reliable behavioral prediction.
  • Persona Grounding: Removing data-grounded personas (“Intent Only”) yields only 52% alignment and 0.27 correlation, with agents exhibiting navigation but not persona-driven product evaluation. Using generic, segment-derived personas raises alignment to 62% but leaves correlation at the same degraded level, indicating that donor-based distributions misrepresent the focal shop’s buyer mix and induce miscalibration.

These results substantiate the paper’s claim that both behavioral memory and shop-specific persona grounding are essential for predictive fidelity in simulation-based A/B testing.

Theoretical and Practical Implications

Causal Predictivity in Synthetic Agent Evaluation

Prior LLM agent research has mostly validated agents on behavioral realism or task success under static offline conditions. SimGym instead demonstrates quantitative predictive validity relative to actual causal effect sizes from production A/B interventions. This addresses a longstanding challenge in agent-based simulation: generating behavioral shifts that not only look realistic but also accurately forecast population-level causal deltas.

Rapid Experimentation and Risk Elimination

SimGym’s framework enables merchants to conduct robust, non-intrusive A/B tests entirely offline, simulating both the buyer population distribution and the causal impact of a change. This pipeline compresses the risk-discovery cycle from weeks to under an hour and decouples experiment power from traffic constraints, providing a direct path to automated, closed-loop UI optimization in e-commerce.

Pathways for Future Development

  • Agent fidelity can be improved via supervised fine-tuning or reinforcement learning from real user session feedback, targeting residual gaps in edge-case persona representation.
  • Extension to end-to-end, unsupervised persona discovery could uncover latent behavioral archetypes not captured by explicit clustering.
  • Integration with vision-enabled models and multimodal perception will allow inclusion of design changes whose impact is primarily visual—a significant open challenge.
  • SimGym could function as an inner loop within continuous optimization systems, proposing and screening high-throughput UI modifications before real-user exposure.

Conclusion

SimGym establishes a robust foundation for offline, agent-based A/B testing that achieves both high-fidelity persona grounding and reliable, causally aligned behavioral prediction. Its empirical results demonstrate that synthetic LLM agents can meaningfully forecast treatment impacts, creating new opportunities for rapid experimentation without real-buyer exposure in complex, real-world e-commerce settings. The framework’s modularity and extensibility position it as a credible candidate for integration into the inner optimization loop of adaptive online platforms, while its methodology sets standards for future agent-based evaluation research in behavioral simulation and causality.