- The paper introduces a novel paradigm that uses renderable HTML code to simulate GUI state transitions, addressing the limitations of both text- and pixel-based world models.
- The methodology employs a two-stage training process with supervised fine-tuning and render-aware reinforcement learning, leveraging the AndroidCode dataset.
- Experimental results show enhanced performance and planning capabilities for GUI agents, validating its effectiveness over existing modalities.
Code2World: A Renderable Code Paradigm for Next UI Prediction in GUI World Models
Introduction
"Code2World: A GUI World Model via Renderable Code Generation" (2602.09856) introduces a novel approach to world modeling for GUI agents, operationalizing state transitions via structured HTML code generation rather than traditional pixel- or text-level prediction. The motivation stems from limitations in extant modalities; whereas text-based world models lack visual fidelity and pixel-based models struggle with structural and semantic controllability, renderable code offers an optimal representation, faithfully capturing both visual appearance and explicit interface structure. Code2World is implemented as a vision-language coder that, given a current screenshot, user action, and task goal, predicts the next state by synthesizing HTML code subsequently rendered into an interface image. The paper details a robust data curation pipeline, leveraging an automated code synthesis and visual-feedback revision protocol to construct the AndroidCode dataset (80K+ screen-action pairs), and an advanced training regime employing supervised fine-tuning (SFT) and Render-Aware Reinforcement Learning (RARL) with composite rewards for visual and logical alignment.
Methodology
Data Synthesis and Dataset Construction
The foundation of Code2World is AndroidCode, a paired dataset of GUI trajectories and corresponding renderable code. Initial HTML code is generated from screenshots using GPT-5 under strict constraints (fixed root container, semantic placeholders for images/icons), followed by automated visual-feedback revision driven by SigLIP-based alignment scores. Samples that fail to reach a threshold visual similarity undergo iterative refinement by multimodal coders, ensuring high-fidelity structural and visual alignment. This dataset addresses the acute scarcity of high-quality code-grounded GUI data in prior work.
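The curation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `render_html`, `siglip_similarity`, and `revise_code` are hypothetical stand-ins for the actual renderer, SigLIP scorer, and multimodal coder, and the threshold and round limit are illustrative values.

```python
SIM_THRESHOLD = 0.85  # illustrative acceptance threshold, not from the paper
MAX_ROUNDS = 3        # illustrative cap on revision rounds

def render_html(html: str) -> str:
    """Placeholder renderer: stands in for rendering HTML to an image."""
    return f"render({html})"

def siglip_similarity(rendered: str, screenshot: str) -> float:
    """Placeholder for a SigLIP-based alignment score in [0, 1]."""
    return 0.9 if "revised" in rendered else 0.6  # dummy scoring for the sketch

def revise_code(html: str, screenshot: str) -> str:
    """Placeholder for the multimodal coder's revision step."""
    return html + " <!-- revised -->"

def curate_sample(html: str, screenshot: str):
    """Accept a sample once its rendering aligns with the source screenshot,
    revising the code up to MAX_ROUNDS times; otherwise discard it."""
    for _ in range(MAX_ROUNDS):
        score = siglip_similarity(render_html(html), screenshot)
        if score >= SIM_THRESHOLD:
            return html, score
        html = revise_code(html, screenshot)
    return None  # sample never reached the threshold
```

The key design point is that acceptance is gated on the *rendered* output, not on the code text itself, so the filter measures exactly what the downstream model is trained to reproduce.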
Model Architecture and Training
The model architecture builds on Qwen3-VL-8B-Instruct as backbone, with training structured in two phases:
- Stage 1: Supervised fine-tuning teaches syntactic HTML structure and layout semantics.
- Stage 2: Render-Aware Reinforcement Learning introduces outcome-driven policy optimization. Two rewards are computed from the rendered outcome: visual semantic fidelity (Rsem, scored by a VLM judge that emphasizes structural correctness over pixel similarity) and action consistency (Ract, scored by a VLM judge that checks logical adherence to the executed action). Group Relative Policy Optimization (GRPO) is employed for robust credit assignment.
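The reward and advantage computation in Stage 2 can be sketched as below. The equal reward weights and the simple group-normalization form are assumptions for illustration; the paper's exact weighting and GRPO hyperparameters are not reproduced here.

```python
import statistics

def composite_reward(r_sem: float, r_act: float,
                     w_sem: float = 0.5, w_act: float = 0.5) -> float:
    """Combine the visual-fidelity (Rsem) and action-consistency (Ract)
    judge scores into one scalar reward. Weights are illustrative."""
    return w_sem * r_sem + w_act * r_act

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward against
    the mean and std of its sampled group, as in GRPO."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

Because advantages are computed relative to the group of rollouts for the same input, no separate value network is needed; rollouts that render better than their siblings get positive advantage regardless of absolute reward scale.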
Evaluation Protocol
The evaluation protocol leverages VLM-as-a-Judge frameworks, defining metrics that explicitly decouple functional logic from visual quality:
- Functional Logic: Action Adherence (Sad) and Action Identifiability (Sid), quantifying logical consequence and causal clarity.
- Visual Quality: Element Alignment (Sele) and Layout Integrity (Slay), measuring fine-grained correspondence and structural preservation.
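The two evaluation axes above can be expressed as a small aggregator. Note that collapsing each axis by a simple mean is an assumption made for this sketch; the paper reports the four judge metrics individually.

```python
def aggregate_scores(scores: dict[str, float]) -> dict[str, float]:
    """Collapse the four VLM-judge metrics into the two evaluation axes.
    Averaging within each axis is an illustrative choice, not the paper's."""
    return {
        # Functional logic: Action Adherence (S_ad) + Action Identifiability (S_id)
        "functional": (scores["S_ad"] + scores["S_id"]) / 2,
        # Visual quality: Element Alignment (S_ele) + Layout Integrity (S_lay)
        "visual": (scores["S_ele"] + scores["S_lay"]) / 2,
    }
```

Decoupling the axes this way makes domain-shift behavior legible: a model can preserve functional logic (what should happen next) even when visual reproduction degrades.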
Experimental Results
Across in-domain (Android Control) and OOD (GUI Odyssey) benchmarks, Code2World-8B rivals the closed-source proprietary baselines GPT-5 and Gemini-3-Pro-Image, and substantially outperforms open-source competitors of much larger scale (e.g., InternVL3-78B, GLM-4.6V-106B). On Android Control (ID), Code2World records Sad=94.28, Sid=88.64, Sele=71.35, Slay=70.32, SigLIP=79.44. Under OOD domain shift, it sustains robust functional generalization (e.g., Sad=92.73), indicating genuine internalization of GUI interaction dynamics rather than mere memorization.
Analysis shows that large, generic VLMs frequently fail in UI-to-code alignment, while pixel-based image generators lack the structural flexibility required for complex interaction simulation. Code2World’s renderable code paradigm elegantly bridges these deficits.
Agent Enhancement and Practical Application
Embedding Code2World as a plug-and-play module in state-of-the-art agents (Mobile-Agent-v3, Qwen2.5-VL-7B, Gemini-2.5-Flash) yields consistent performance gains, culminating in a +9.5% success-rate improvement for Gemini-2.5-Flash on AndroidWorld navigation. Offline and online evaluations illustrate its utility for both single-step action selection and long-horizon planning. The "Propose, Simulate, Select" pipeline, enabled by Code2World, evaluates multiple candidate actions via future-state simulation, mitigating hallucination and erroneous trial-and-error policies. Ablation studies further reveal that integrating both visual and logical rewards during RARL is critical: SFT alone grounds basic structural prediction, while rendering rewards refine visual fidelity and action rewards enhance dynamic transition logic.
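The "Propose, Simulate, Select" loop can be sketched as follows. The function names, the toy world model, and the length-based scorer are hypothetical stand-ins chosen only to make the control flow concrete; in the paper the world model is Code2World's rendered next-UI prediction and the scorer is a goal-progress judgment.

```python
def propose_simulate_select(state, candidate_actions, world_model, scorer):
    """For each proposed action, simulate the predicted next state with the
    world model, score it, and select the action with the best outcome."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_state = world_model(state, action)  # e.g. rendered next UI
        score = scorer(predicted_state)               # goal-progress estimate
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_world_model(state, action):
    return f"{state}->{action}"

def toy_scorer(predicted_state):
    return len(predicted_state)  # demo rule: longer trace counts as progress
```

The point of the pattern is that the agent commits to an action only after inspecting its simulated consequence, which is what allows risky or irreversible steps to be vetted before execution.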
Implications and Future Directions
Theoretical Impact
Code2World represents a fundamental paradigm shift in GUI world modeling, advocating code-native simulation over pixel- or text-level abstraction. Renderable code not only bestows explicit structural controllability but also scales to novel environments with minimal domain-specific adaptation. This approach aligns with recent trends in embodied AI and agentic navigation, where high-fidelity, deterministic simulation is critical for robust planning and safe exploration.
Practical Impact
Practically, Code2World's sandbox enables autonomous agents to simulate outcomes of irreversible actions (e.g., payment confirmation, data deletion) before execution, dramatically reducing operational risk and facilitating error recovery. Its plug-and-play design democratizes world modeling for a broad spectrum of agents, promising enhanced accessibility in human-computer interaction and digital inclusivity for users with disabilities.
Risks and Recommendations
The chief risks involve potential hallucination of safety cues—misleading agents into hazardous actions if the world model’s predictions are not sufficiently accurate. Automated action generation at scale could also enable malicious automation (e.g., cyber-attacks, interface spamming). Robust verification mechanisms and rigorous evaluation of safety-critical behavior are imperative.
Future Developments
Subsequent research may focus on:
- Scaling code-based world models to complex, multi-window and cross-app scenarios
- Hybridizing code-native and pixel-based models for dynamic asset generation
- Refining RL-based alignment with higher-order semantic reward signals
- Designing agent-world co-adaptation frameworks that iteratively optimize both policy and environment models
Conclusion
Code2World introduces a robust, code-native paradigm for GUI world modeling, combining high visual fidelity with fine-grained interface structure via renderable HTML code generation. Empirical results validate superior performance across next UI prediction and agent enhancement benchmarks, establishing the paradigm’s scalability and utility. The approach significantly advances foresight and planning for autonomous GUI agents, and offers promising avenues for safer and more inclusive digital automation.