
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training (2510.14969v1)

Published 16 Oct 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives. To this end, we introduce $\textbf{UI-Simulator}$, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose $\textbf{UI-Simulator-Grow}$, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.

Summary

  • The paper demonstrates that LLMs can serve as general-purpose simulators by synthesizing diverse UI trajectories, reducing reliance on costly real-world data.
  • The methodology employs a multi-step simulation pipeline with few-shot chain-of-thought prompting and retrieval-augmented techniques to enhance simulation fidelity.
  • Experiments reveal that agents trained via UI-Simulator and UI-Simulator-Grow achieve competitive performance and superior robustness in both web and mobile environments.

LLM-Based Digital World Simulation for Scalable Agent Training

Introduction and Motivation

The paper "LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training" (2510.14969) addresses the critical bottleneck in digital agent development: the scarcity and cost of large-scale, high-quality UI trajectory data. The authors propose UI-Simulator, a paradigm leveraging LLMs as digital world simulators to synthesize diverse, structured UI states and transitions, enabling scalable trajectory generation for agent training. The approach is motivated by the observation that LLMs, pre-trained on front-end code and procedural knowledge, can model environment dynamics and generate plausible UI states, circumventing the resource-intensive process of collecting real-world data. Figure 1

Figure 1: Overview and performance highlights of UI-Simulator and UI-Simulator-Grow.

UI-Simulator: Architecture and Simulation Process

Digital World Model Formulation

UI-Simulator models UI environments as structured accessibility trees, where each state $s_t$ encodes textual content, spatial coordinates, and dynamic attributes. The environment dynamics are governed by a transition function $s_{t+1} = \mathcal{T}(s_t, a_t)$, instantiated by an LLM-based simulator $\mathcal{M}_{\text{LLM}}$ or by deterministic rules for specific actions. Observations $o_t$ are computed by extracting elements whose bounding boxes intersect with the current viewport.
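The observation construction can be made concrete with a short sketch. The snippet below is illustrative only: the element fields and the `extract_observation` helper are assumptions about one reasonable encoding of accessibility-tree nodes, not the paper's released implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UIElement:
    role: str          # e.g. "button", "textbox"
    text: str          # textual content of the node
    x: int; y: int     # top-left corner of the bounding box
    w: int; h: int     # width / height of the bounding box

def extract_observation(state: List[UIElement],
                        viewport: Tuple[int, int, int, int]) -> List[UIElement]:
    """Return the elements whose bounding boxes intersect the viewport
    (vx, vy, vw, vh), i.e. the observation o_t derived from state s_t."""
    vx, vy, vw, vh = viewport
    visible = []
    for el in state:
        overlaps_x = el.x < vx + vw and el.x + el.w > vx
        overlaps_y = el.y < vy + vh and el.y + el.h > vy
        if overlaps_x and overlaps_y:
            visible.append(el)
    return visible
```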

Multi-Step Simulation Pipeline

The simulation process is decomposed into three stages:

  1. Overview Prediction: The LLM generates a high-level summary of the next state conditioned on the current state and action.
  2. Rich Draft Generation: Based on the overview, the LLM produces a semantically rich, unstructured draft describing UI elements and their attributes.
  3. Structured Conversion: The draft is converted into a structured format, assigning coordinates and hierarchical relationships, suitable for agent training.

Few-shot CoT prompting is employed to guide the LLM at each stage, enhancing coherence and diversity.

Figure 2: Overall process of how the retrieval-free/-augmented simulators predict the next UI state.
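As a rough illustration of this three-stage decomposition, the sketch below assumes a generic `llm(prompt)` text-completion callable and uses placeholder prompt templates; the paper's actual few-shot CoT prompts are not reproduced here.

```python
def simulate_next_state(llm, current_state: str, action: str,
                        fewshot_examples: str) -> str:
    """Three-stage next-state prediction: overview -> rich draft -> structured state.
    `llm` is any text-completion callable; prompts are illustrative placeholders."""
    # Stage 1: high-level overview of the next state, conditioned on (s_t, a_t).
    overview = llm(
        f"{fewshot_examples}\n"
        f"Current UI state:\n{current_state}\n"
        f"Action taken: {action}\n"
        "Think step by step, then summarize what the next page should contain."
    )

    # Stage 2: semantically rich but unstructured draft of elements and attributes.
    draft = llm(
        f"Next-page overview:\n{overview}\n"
        "List the UI elements this page should contain, with their text and roles."
    )

    # Stage 3: convert the draft into a structured accessibility tree with
    # coordinates and hierarchical relationships, ready for agent training.
    structured_state = llm(
        f"Unstructured draft:\n{draft}\n"
        "Convert this into an accessibility tree: one element per line with "
        "role, text, bounding box [x, y, w, h], and indentation for hierarchy."
    )
    return structured_state
```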

Retrieval-Augmented Simulation

To improve adaptation to new environments, UI-Simulator supports retrieval-augmented simulation. A small offline corpus $\mathcal{D}$ of real environment transitions is indexed. During simulation, the most relevant prior state is retrieved using a hybrid BM25 and semantic retriever pipeline, and the LLM is prompted with both the current context and the retrieved state. This grounds the simulation in real experience while maintaining diversity.
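A minimal sketch of such a hybrid retriever is shown below. The use of `rank_bm25` and `sentence-transformers`, the specific encoder checkpoint, and the linear score fusion with weight `alpha` are assumptions made for illustration; the paper does not prescribe these libraries.

```python
import numpy as np
from rank_bm25 import BM25Okapi                         # assumed lexical retriever
from sentence_transformers import SentenceTransformer   # assumed dense retriever

def build_hybrid_retriever(corpus):
    """Index an offline corpus of real transitions with BM25 and embeddings."""
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    encoder = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder model choice
    embeddings = encoder.encode(corpus, normalize_embeddings=True)
    return bm25, encoder, embeddings

def retrieve(query, corpus, bm25, encoder, embeddings, alpha=0.5):
    """Fuse normalized BM25 scores with cosine similarity; return the best state."""
    lexical = np.array(bm25.get_scores(query.split()))
    lexical = lexical / (lexical.max() + 1e-8)
    q_emb = encoder.encode([query], normalize_embeddings=True)[0]
    semantic = embeddings @ q_emb                        # cosine (unit vectors)
    scores = alpha * lexical + (1 - alpha) * semantic
    return corpus[int(scores.argmax())]
```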

Scalable Trajectory Collection and Guided Rollouts

Instruction-Free Rollouts and Trajectory Wrapping

Trajectory synthesis proceeds via instruction-free rollouts, where a teacher agent interacts with the simulated environment, sampling actions until a coherent task is completed. The trajectory is retrospectively summarized into a user instruction $G$, and step-wise reasoning is reconstructed to align with $G$. This process yields training instances with user instructions, ground-truth actions, and step-wise reasoning.
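The rollout-then-wrap loop can be sketched as follows. All helper methods on the `teacher` and `simulator` objects are hypothetical names introduced only to illustrate the control flow, not the paper's API.

```python
def instruction_free_rollout(teacher, simulator, initial_state, max_steps=15):
    """Roll out a teacher agent in the simulated environment without a fixed goal,
    then retrospectively wrap the trajectory into a training instance."""
    trajectory, state = [], initial_state
    for _ in range(max_steps):
        action, done = teacher.propose_action(state, trajectory)
        trajectory.append((state, action))
        if done:                      # teacher decides a coherent task is finished
            break
        state = simulator.next_state(state, action)

    # Trajectory wrapping: summarize the whole rollout into an instruction G,
    # then reconstruct step-wise reasoning aligned with that instruction.
    instruction = teacher.summarize_into_instruction(trajectory)
    reasoning = [teacher.reconstruct_reasoning(instruction, s, a)
                 for (s, a) in trajectory]
    return {"instruction": instruction,
            "steps": [{"state": s, "action": a, "reasoning": r}
                      for (s, a), r in zip(trajectory, reasoning)]}
```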

Step-Wise Guided Rollout

To mitigate LLM bias and enhance diversity, a step-wise guided rollout process is introduced. At each step, the teacher agent proposes high-level task controls, updating them as sub-goals are completed. Actions are generated with explicit reasoning, and trajectory termination is autonomously decided. This iterative control mechanism increases the diversity and validity of synthesized trajectories.
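A sketch of this guided control loop, again with hypothetical helper names standing in for the teacher's prompting steps, might look like:

```python
def guided_rollout(teacher, simulator, state, max_steps=15):
    """Step-wise guided rollout: the teacher maintains a high-level task control
    and revises it whenever a sub-goal completes; termination is decided by the
    teacher itself.  Helper methods are illustrative, not the paper's API."""
    control = teacher.propose_task_control(state)            # initial high-level plan
    steps = []
    for _ in range(max_steps):
        reasoning, action = teacher.act_with_reasoning(state, control)
        steps.append({"control": control, "state": state,
                      "reasoning": reasoning, "action": action})
        if teacher.should_terminate(state, control, steps):  # autonomous stop
            break
        state = simulator.next_state(state, action)
        if teacher.subgoal_completed(state, control):
            control = teacher.update_task_control(state, control)  # revise plan
    return steps
```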

UI-Simulator-Grow: Targeted Scaling Paradigm

Blindly scaling trajectory volume is inefficient. UI-Simulator-Grow implements targeted scaling by iteratively selecting tasks with maximal learning potential, based on teacher-forcing loss signals. Tasks in the 25–75% loss percentile are prioritized, avoiding trivial or infeasible tasks. For each selected task, diverse variants are synthesized via lightweight rewriting, maintaining logical structure but varying content. Continual learning is supported via replay of representative tasks, selected using RoBERTa-based instruction embeddings and cosine similarity.

Figure 3: Target task selection for web tasks.
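The two selection mechanisms can be illustrated with the sketch below: percentile-based filtering on teacher-forcing losses, and an embedding-based pick of representative replay tasks. The greedy farthest-point heuristic and the specific encoder checkpoint are illustrative assumptions; the paper specifies only RoBERTa-based instruction embeddings and cosine similarity.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in for the RoBERTa encoder

def select_target_tasks(tasks, teacher_forcing_losses):
    """Keep tasks whose loss lies in the 25th-75th percentile: informative,
    but neither trivial nor infeasible."""
    lo, hi = np.percentile(teacher_forcing_losses, [25, 75])
    return [t for t, loss in zip(tasks, teacher_forcing_losses) if lo <= loss <= hi]

def select_replay_tasks(old_instructions, k=100):
    """Pick representative earlier tasks for replay using instruction embeddings
    and cosine similarity (greedy farthest-point selection as an illustrative rule)."""
    encoder = SentenceTransformer("all-roberta-large-v1")   # assumed checkpoint
    embs = encoder.encode(old_instructions, normalize_embeddings=True)
    chosen = [0]                                            # arbitrary seed task
    while len(chosen) < min(k, len(old_instructions)):
        sims = embs @ embs[chosen].T                        # cosine to chosen set
        next_idx = int(sims.max(axis=1).argmin())           # most dissimilar task
        chosen.append(next_idx)
    return [old_instructions[i] for i in chosen]
```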

Experimental Results and Analysis

Benchmarks and Setup

Experiments are conducted on WebArena (web navigation) and AndroidWorld (mobile usage), using Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct as base models. UI-Simulator is powered by GPT-4o-mini for simulation and rollouts. Retrieval-augmented simulation uses only a fraction of the real environment experience compared to baselines.

Performance Highlights

  • UI-Simulator-F (retrieval-free) achieves a success rate of 6.28% on WebArena and 8.6% on AndroidWorld, outperforming OS-Genesis, which uses stronger teachers and real environment trajectories.
  • UI-Simulator-R (retrieval-augmented) matches or surpasses proprietary models (Gemini-Pro, GPT-4o) despite using smaller LLMs and less real environment exposure.
  • UI-Simulator-Grow matches Llama-3-70B-Instruct performance using only Llama-3-8B-Instruct and 66% of the training data, demonstrating strong data efficiency.

Figure 4: Successful task numbers across the 5 main task categories through the three iterations of the UI-Simulator-Grow scaling.

Robustness and Ablation

Agents trained on UI-Simulator trajectories exhibit greater robustness to UI perturbations and outperform agents trained directly on real environments with similar trajectory counts. Removing the step-wise task controls or the multi-step simulation leads to significant performance drops and reduced diversity, as quantified by the PCA effective dimension of task embeddings.
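One common way to compute a PCA effective dimension is sketched below; the exact definition used in the paper (e.g., the variance threshold) is an assumption made for illustration.

```python
import numpy as np

def pca_effective_dimension(task_embeddings: np.ndarray, threshold: float = 0.9) -> int:
    """Number of principal components needed to explain `threshold` of the variance
    in the task-instruction embeddings; higher values indicate a more diverse task
    distribution.  The 90% threshold is an illustrative choice."""
    X = task_embeddings - task_embeddings.mean(axis=0, keepdims=True)
    # Singular values of the centered matrix give the principal variances.
    s = np.linalg.svd(X, compute_uv=False)
    explained = (s ** 2) / (s ** 2).sum()
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, threshold) + 1)
```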

Qualitative Evaluation

Human evaluation across eight dimensions (realism, reasonability, validity, consistency, completion, etc.) yields satisfaction rates exceeding 90% for both UI-Simulator-F and UI-Simulator-R, confirming the high quality of synthesized trajectories.

Figure 5: The front-end web interface for trajectory human evaluation.

Failure Modes

Analysis reveals that UI-Simulator-F may fuse irrelevant context, while UI-Simulator-R can overly depend on retrieved states, leading to simulation errors. These cases highlight areas for future improvement in context management and retrieval integration.

Figure 6: A case of failed simulation where UI-Simulator-F generates the new page based on irrelevant context.

Figure 7: A case of failed simulation where UI-Simulator-R overly depends on the reference state to generate the new page.

Implications and Future Directions

The results demonstrate that LLM-based digital world simulation is a viable and efficient alternative to real environment data collection for agent training. The targeted scaling paradigm enables rapid, data-efficient agent improvement, and the simulation-driven approach yields agents with superior robustness and adaptability. The framework is extensible to other UI domains and potentially to pixel-level simulation, narrowing the sim-to-real gap.

Theoretical implications include the validation of LLMs as general-purpose world models for structured environments, and the effectiveness of loss-based task selection for continual agent improvement. Practically, the paradigm reduces infrastructure and annotation costs, and supports scalable agent development in domains with limited real environment access.

Conclusion

UI-Simulator and UI-Simulator-Grow establish a scalable, efficient paradigm for digital agent training via LLM-based world simulation and targeted trajectory synthesis. The approach achieves competitive or superior performance to real-environment training, with strong robustness and data efficiency. Future work may extend the paradigm to multimodal and pixel-level environments, further enhancing the generalization and adaptability of digital agents.
