- The paper introduces a novel paradigm that uses renderable HTML code to simulate GUI state transitions, addressing the limitations of both text- and pixel-based world models.
- The methodology employs a two-stage training process with supervised fine-tuning and render-aware reinforcement learning, leveraging the AndroidCode dataset.
- Experimental results show enhanced performance and planning capabilities for GUI agents, validating its effectiveness over existing modalities.
Code2World: A Renderable Code Paradigm for Next UI Prediction in GUI World Models
Introduction
"Code2World: A GUI World Model via Renderable Code Generation" (2602.09856) introduces a novel approach to world modeling for GUI agents, operationalizing state transitions via structured HTML code generation rather than traditional pixel- or text-level prediction. The motivation stems from limitations in extant modalities; whereas text-based world models lack visual fidelity and pixel-based models struggle with structural and semantic controllability, renderable code offers an optimal representation, faithfully capturing both visual appearance and explicit interface structure. Code2World is implemented as a vision-language coder that, given a current screenshot, user action, and task goal, predicts the next state by synthesizing HTML code subsequently rendered into an interface image. The paper details a robust data curation pipeline, leveraging an automated code synthesis and visual-feedback revision protocol to construct the AndroidCode dataset (80K+ screen-action pairs), and an advanced training regime employing supervised fine-tuning (SFT) and Render-Aware Reinforcement Learning (RARL) with composite rewards for visual and logical alignment.
Methodology
Data Synthesis and Dataset Construction
The foundation of Code2World is AndroidCode, a paired dataset of GUI trajectories and corresponding renderable code. Initial HTML code is generated from screenshots using GPT-5 under strict constraints (fixed root container, semantic placeholders for images/icons), followed by automated visual-feedback revision driven by SigLIP-based alignment scores. Samples that fail to reach a threshold visual similarity undergo iterative refinement by multimodal coders, ensuring high-fidelity structural and visual alignment. This dataset addresses the acute scarcity of high-quality code-grounded GUI data in prior work.
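The curation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `render_html`, `siglip_similarity`, and `revise_code` are hypothetical stand-ins for the actual renderer, SigLIP scorer, and multimodal coder, and the threshold and round limit are illustrative values.

```python
SIM_THRESHOLD = 0.85  # illustrative acceptance threshold, not from the paper
MAX_ROUNDS = 3        # illustrative cap on revision rounds

def render_html(html: str) -> str:
    """Placeholder renderer: stands in for rendering HTML to an image."""
    return f"render({html})"

def siglip_similarity(rendered: str, screenshot: str) -> float:
    """Placeholder for a SigLIP-based alignment score in [0, 1]."""
    return 0.9 if "revised" in rendered else 0.6  # dummy scoring for the sketch

def revise_code(html: str, screenshot: str) -> str:
    """Placeholder for the multimodal coder's revision step."""
    return html + " <!-- revised -->"

def curate_sample(html: str, screenshot: str):
    """Accept a sample once its rendering aligns with the source screenshot,
    revising the code up to MAX_ROUNDS times; otherwise discard it."""
    for _ in range(MAX_ROUNDS):
        score = siglip_similarity(render_html(html), screenshot)
        if score >= SIM_THRESHOLD:
            return html, score
        html = revise_code(html, screenshot)
    return None  # sample never reached the threshold
```

The key design point is that acceptance is gated on the *rendered* output, not on the code text itself, so the filter measures exactly what the downstream model is trained to reproduce.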
Model Architecture and Training
The model architecture builds on Qwen3-VL-8B-Instruct as backbone, with training structured in two phases:
- Stage 1: Supervised fine-tuning teaches syntactic HTML structure and layout semantics.
- Stage 2: Render-Aware Reinforcement Learning introduces outcome-driven policy optimization. Two rewards are computed from the rendered outcome: visual semantic fidelity (Rsem, scored by a VLM judge that emphasizes structural correctness over pixel similarity) and action consistency (Ract, scored by a VLM judge that checks logical adherence to the executed action). Group Relative Policy Optimization (GRPO) is employed for robust credit assignment.
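The reward and advantage computation in Stage 2 can be sketched as below. The equal reward weights and the simple group-normalization form are assumptions for illustration; the paper's exact weighting and GRPO hyperparameters are not reproduced here.

```python
import statistics

def composite_reward(r_sem: float, r_act: float,
                     w_sem: float = 0.5, w_act: float = 0.5) -> float:
    """Combine the visual-fidelity (Rsem) and action-consistency (Ract)
    judge scores into one scalar reward. Weights are illustrative."""
    return w_sem * r_sem + w_act * r_act

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward against
    the mean and std of its sampled group, as in GRPO."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

Because advantages are computed relative to the group of rollouts for the same input, no separate value network is needed; rollouts that render better than their siblings get positive advantage regardless of absolute reward scale.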
Evaluation Protocol
The evaluation protocol leverages VLM-as-a-Judge frameworks, defining metrics that explicitly decouple functional logic from visual quality:
- Functional Logic: Action Adherence (Sad) and Action Identifiability (Sid), quantifying logical consequence and causal clarity.
- Visual Quality: Element Alignment (Sele) and Layout Integrity (Slay), measuring fine-grained correspondence and structural preservation.
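The two evaluation axes above can be expressed as a small aggregator. Note that collapsing each axis by a simple mean is an assumption made for this sketch; the paper reports the four judge metrics individually.

```python
def aggregate_scores(scores: dict[str, float]) -> dict[str, float]:
    """Collapse the four VLM-judge metrics into the two evaluation axes.
    Averaging within each axis is an illustrative choice, not the paper's."""
    return {
        # Functional logic: Action Adherence (S_ad) + Action Identifiability (S_id)
        "functional": (scores["S_ad"] + scores["S_id"]) / 2,
        # Visual quality: Element Alignment (S_ele) + Layout Integrity (S_lay)
        "visual": (scores["S_ele"] + scores["S_lay"]) / 2,
    }
```

Decoupling the axes this way makes domain-shift behavior legible: a model can preserve functional logic (what should happen next) even when visual reproduction degrades.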
Experimental Results
Across in-domain (Android Control) and OOD (GUI Odyssey) benchmarks, Code2World-8B rivals the closed-source proprietary baselines GPT-5 and Gemini-3-Pro-Image, and substantially outperforms open-source competitors of much larger scale (e.g., InternVL3-78B, GLM-4.6V-106B). On Android Control (ID), Code2World records Sad=94.28, Sid=88.64, Sele=71.35, Slay=70.32, SigLIP=79.44. Under OOD domain shift, it sustains robust functional generalization (e.g., Sad=92.73), indicating genuine internalization of GUI interaction dynamics rather than mere memorization.
Analysis shows that large, generic VLMs frequently fail in UI-to-code alignment, while pixel-based image generators lack the structural flexibility required for complex interaction simulation. Code2World’s renderable code paradigm elegantly bridges these deficits.
Agent Enhancement and Practical Application
Embedding Code2World as a plug-and-play module in state-of-the-art agents (Mobile-Agent-v3, Qwen2.5-VL-7B, Gemini-2.5-Flash) yields consistent performance gains, culminating in a +9.5% success-rate improvement for Gemini-2.5-Flash on AndroidWorld navigation. Offline and online evaluations illustrate its utility for both single-step action selection and long-horizon planning. The "Propose, Simulate, Select" pipeline, enabled by Code2World, evaluates multiple candidate actions via future-state simulation, mitigating hallucination and erroneous trial-and-error policies. Ablation studies further reveal that integrating both visual and logical rewards during RARL is critical: SFT alone grounds basic structural prediction, while rendering rewards refine visual fidelity and action rewards enhance dynamic transition logic.
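The "Propose, Simulate, Select" loop can be sketched as follows. The function names, the toy world model, and the length-based scorer are hypothetical stand-ins chosen only to make the control flow concrete; in the paper the world model is Code2World's rendered next-UI prediction and the scorer is a goal-progress judgment.

```python
def propose_simulate_select(state, candidate_actions, world_model, scorer):
    """For each proposed action, simulate the predicted next state with the
    world model, score it, and select the action with the best outcome."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_state = world_model(state, action)  # e.g. rendered next UI
        score = scorer(predicted_state)               # goal-progress estimate
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_world_model(state, action):
    return f"{state}->{action}"

def toy_scorer(predicted_state):
    return len(predicted_state)  # demo rule: longer trace counts as progress
```

The point of the pattern is that the agent commits to an action only after inspecting its simulated consequence, which is what allows risky or irreversible steps to be vetted before execution.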
Implications and Future Directions
Theoretical Impact
Code2World represents a fundamental paradigm shift in GUI world modeling, advocating code-native simulation over pixel- or text-level abstraction. Renderable code not only bestows explicit structural controllability but also scales to novel environments with minimal domain-specific adaptation. This approach aligns with recent trends in embodied AI and agentic navigation, where high-fidelity, deterministic simulation is critical for robust planning and safe exploration.
Practical Impact
Practically, Code2World's sandbox enables autonomous agents to simulate outcomes of irreversible actions (e.g., payment confirmation, data deletion) before execution, dramatically reducing operational risk and facilitating error recovery. Its plug-and-play design democratizes world modeling for a broad spectrum of agents, promising enhanced accessibility in human-computer interaction and digital inclusivity for users with disabilities.
Risks and Recommendations
The chief risks involve potential hallucination of safety cues—misleading agents into hazardous actions if the world model’s predictions are not sufficiently accurate. Automated action generation at scale could also enable malicious automation (e.g., cyber-attacks, interface spamming). Robust verification mechanisms and rigorous evaluation of safety-critical behavior are imperative.
Future Developments
Subsequent research may focus on:
- Scaling code-based world models to complex, multi-window and cross-app scenarios
- Hybridizing code-native and pixel-based models for dynamic asset generation
- Refining RL-based alignment with higher-order semantic reward signals
- Designing agent-world co-adaptation frameworks that iteratively optimize both policy and environment models
Conclusion
Code2World introduces a robust, code-native paradigm for GUI world modeling, combining high visual fidelity with fine-grained interface structure via renderable HTML code generation. Empirical results validate superior performance across next UI prediction and agent enhancement benchmarks, establishing the paradigm’s scalability and utility. The approach significantly advances foresight and planning for autonomous GUI agents, and offers promising avenues for safer and more inclusive digital automation.