
Code-Based GUI in Mobile World Models

Updated 26 February 2026
  • Code-based GUI representation is a paradigm that encodes mobile UI states as executable HTML/CSS, ensuring precise visual fidelity and structural accuracy.
  • The methodology leverages LLMs to synthesize code from GUI screenshots, using pipelines like AndroidCode and gWorld for high renderability and robust compositional reasoning.
  • Comparative evaluations show that code-based models outperform pixel and text-based approaches in controllability, efficiency, and downstream simulation for autonomous agents.

Code-based GUI representation in mobile world models refers to a paradigm in which the next GUI state is predicted as executable code (typically HTML and CSS) rather than as raw pixels or abstract tokens. This approach supports high visual fidelity, robust compositional reasoning, and efficient downstream simulation for autonomous GUI agents. By leveraging the symbolic, structured, and renderable nature of code, vision-language models (VLMs) can predict UI state transitions in a form directly usable for high-fidelity replay and action-conditioned planning. Recent work reports that code-based world models outperform both pure pixel-based and text-based models in fidelity, controllability, and downstream agent performance, while retaining computational efficiency and scalability (Zheng et al., 10 Feb 2026, Koh et al., 2 Feb 2026).

1. GUI Code Representation: Syntax and Structure

The core insight is that most mobile GUIs are ultimately described by structured markup. Code-based world models operationalize this by translating each screen–action tuple into a self-contained HTML snippet that, when rendered, produces the precise visual state corresponding to the next UI frame.

Key principles of code-based representation include:

  • Structured Markup Backbone: Each state is encoded as valid HTML5, augmented with CSS (often with utility frameworks such as Tailwind or Bootstrap) to reproduce layouts. Abstract containers like <div id="render-target"> of fixed size (e.g., 1080×2400 px) centralize coordinate frames (Zheng et al., 10 Feb 2026).
  • Direct Encoding of UI Elements: Buttons, inputs, and text are mapped to <button>, <input>, and <span>/<p> tags respectively; iconography is handled with inline SVGs or textual placeholders.
  • Layout Mirroring: Flexbox/Grid or absolute positioning—mirroring screenshot coordinates—ensures that spatial relationships and element alignment remain faithful to source GUIs.
  • Semantic Placeholders: Images and complex icons are replaced with placeholder tags or structured inner content (e.g., [IMG: Chicken Soup] within a <div>), supporting both visual accuracy and semantic interpretability.
  • Composable JSON Format (Koh et al., 2 Feb 2026): Several systems output JSON objects encapsulating both a natural-language “reasoning trace” and the HTML/CSS code for the next state:
    {
      "reasoning": "The click taps the 'Next' button, so the form advances",
      "html": "<!DOCTYPE html><html>...<button>Next</button>...</html>"
    }
    This representation is modular, supporting easy extension to other markup dialects (SwiftUI, React Native) and downstream integration.
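
As a minimal sketch of the principles above, the following Python wraps UI markup in a fixed-size render-target container and validates a JSON prediction carrying the two fields shown in the example. The helper names `build_state_html` and `parse_prediction` are hypothetical and not taken from either paper; the 1080×2400 size is the example value quoted in the text.

```python
import json

RENDER_WIDTH, RENDER_HEIGHT = 1080, 2400  # example size quoted in the text


def build_state_html(body: str) -> str:
    """Wrap UI markup in a fixed-size render-target container (hypothetical helper)."""
    return (
        "<!DOCTYPE html><html><body>"
        f'<div id="render-target" style="width:{RENDER_WIDTH}px;height:{RENDER_HEIGHT}px;">'
        f"{body}</div></body></html>"
    )


def parse_prediction(raw: str) -> dict:
    """Parse a model output and check that both expected fields are non-empty strings."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("prediction must be a JSON object")
    for key in ("reasoning", "html"):
        if not isinstance(record.get(key), str) or not record[key]:
            raise ValueError(f"missing or empty field: {key}")
    return record


# Build a prediction in the JSON shape shown above and parse it back.
raw_output = json.dumps({
    "reasoning": "The click taps the 'Next' button, so the form advances",
    "html": build_state_html("<button>Next</button>"),
})
prediction = parse_prediction(raw_output)
```

Validating the two fields before rendering keeps malformed outputs from reaching the browser stage, where failures are harder to attribute.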

2. Dataset Construction and Data Generation Pipelines

Creating high-quality code-based training data for world models requires translating large-scale GUI trajectories into code-screenshot-action pairs. Two representative pipelines are AndroidCode (Zheng et al., 10 Feb 2026) and gWorld (Koh et al., 2 Feb 2026):

AndroidCode

  • Source: AndroidControl trajectories (100K+ screenshot–action pairs)
  • Synthesis via LLM prompting: GPT-5 is prompted to synthesize HTML for each after-state, enforcing pre-defined layout constraints and using placeholders for images/icons.
  • Visual-feedback Revision Loop: Each synthesized HTML is rendered, and its image compared with the reference via SigLIP. If the alignment score is below 0.90, the LLM is prompted to revise the code. This yields 82K high-fidelity pairs, all passing human spot-checks, with mean HTML length ≈2500 tokens (Zheng et al., 10 Feb 2026).
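
The visual-feedback revision loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the AndroidCode implementation: `score_fn` stands in for "render the HTML and compare it with the reference screenshot via SigLIP", `revise_fn` stands in for the LLM revision prompt, and `max_rounds` is an assumed cap that the text does not specify. Only the 0.90 threshold comes from the description above.

```python
from typing import Callable

ALIGNMENT_THRESHOLD = 0.90  # threshold stated in the text


def synthesize_with_revision(
    draft_html: str,
    score_fn: Callable[[str], float],        # hypothetical: render + similarity score
    revise_fn: Callable[[str, float], str],  # hypothetical: ask the LLM to revise
    max_rounds: int = 3,                     # assumed cap; not given in the text
) -> tuple[str, float, bool]:
    """Revise HTML until its alignment score reaches the threshold or rounds run out."""
    html = draft_html
    score = score_fn(html)
    rounds = 0
    while score < ALIGNMENT_THRESHOLD and rounds < max_rounds:
        html = revise_fn(html, score)
        score = score_fn(html)
        rounds += 1
    return html, score, score >= ALIGNMENT_THRESHOLD


# Toy stand-ins so the loop can run without a renderer or a model.
TOY_SCORES = [0.72, 0.85, 0.93]
REVISION_MARK = "<!--rev-->"


def toy_score(html: str) -> float:
    return TOY_SCORES[min(html.count(REVISION_MARK), len(TOY_SCORES) - 1)]


def toy_revise(html: str, score: float) -> str:
    return html + REVISION_MARK


final_html, final_score, accepted = synthesize_with_revision(
    "<div>draft</div>", toy_score, toy_revise
)
```

Returning the acceptance flag alongside the score lets a caller drop samples that never reach the threshold instead of silently keeping them.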

gWorld

  • Source: Aggregates four major mobile GUI trajectory datasets (AndroidInTheWild, GUIOdyssey, AndroidControl, AMEX), totaling over 3 million usable transitions.
  • Procedure:
  1. Extract triplets (pre-state image, action, post-state image).
  2. Use a strong VLM (Gemini 3 Flash) to generate matching HTML code for the post-state image.
  3. Generate a reasoning trace providing context for the action.
  • Lifted Targets: Each training example is a mapping from (image, action) to (reasoning, HTML code) (Koh et al., 2 Feb 2026).
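
A rough sketch of how a transition triplet might be lifted into a training example is shown below. The class and function names are hypothetical, and `to_html` and `explain` are placeholders for the VLM calls described in the procedure; no real API is implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Transition:
    pre_image: str   # path to the pre-state screenshot
    action: str      # serialized action, e.g. "click(540, 1200)"
    post_image: str  # path to the post-state screenshot


@dataclass(frozen=True)
class TrainingExample:
    image: str      # model input: pre-state screenshot
    action: str     # model input: action
    reasoning: str  # model target: reasoning trace
    html: str       # model target: code for the post-state


def lift(
    t: Transition,
    to_html: Callable[[str], str],            # hypothetical: post-state image -> HTML
    explain: Callable[[str, str, str], str],  # hypothetical: triplet -> reasoning trace
) -> TrainingExample:
    """Map (pre image, action, post image) to an (image, action) -> (reasoning, html) pair."""
    html = to_html(t.post_image)
    reasoning = explain(t.pre_image, t.action, t.post_image)
    return TrainingExample(t.pre_image, t.action, reasoning, html)


# Stub callables stand in for the VLM so the sketch runs offline.
example = lift(
    Transition("s0.png", "click(540, 1200)", "s1.png"),
    to_html=lambda img: f"<html><!-- generated from {img} --></html>",
    explain=lambda pre, act, post: f"{act} moves the UI from {pre} to {post}",
)
```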

Both pipelines achieve near-perfect renderability (>99% valid code), high visual alignment, and enable the scaling of training data required for large VLM SFT.

3. Model Architectures and Optimization Protocols

Modern code-based world models utilize VLMs with architectural and training modifications suited for symbolic code prediction:

Architectures

  • Vision-Language Transformer Backbone: Qwen3-VL (8B and 32B) is used as the primary architecture, with outputs conditioned on visual encoding of the current GUI and action tokens (Zheng et al., 10 Feb 2026, Koh et al., 2 Feb 2026).
  • Output Heads: Custom heads produce code tokens, with some architectures interleaving “reasoning” traces followed by HTML/CSS (Koh et al., 2 Feb 2026).
  • Frozen Vision Encoders: To maximize efficiency and avoid overfitting, the visual backbone is frozen post-pretraining, with only code-generation heads, MLP projectors, and LLM layers fine-tuned (Koh et al., 2 Feb 2026).

Optimization

  • Stage 1: Supervised Fine-Tuning (SFT): The model is trained with an autoregressive cross-entropy loss, i.e., to maximize $\log P(\text{Code} \mid \text{state}, \text{action})$, and is constrained to produce faithful, valid HTML (Zheng et al., 10 Feb 2026, Koh et al., 2 Feb 2026).
  • Stage 2: Render-Aware Reinforcement Learning (RARL) (Zheng et al., 10 Feb 2026):
    • Multiple code hypotheses are rendered.
    • Dual reward structure combining:
      • $R_{\text{sem}}$: VLM-judged visual semantic fidelity (image-level)
      • $R_{\text{act}}$: VLM-judged action consistency (whether the transformed UI corresponds to the action semantics)
    • Combined with policy KL penalties to enforce stability and faithfulness to initial SFT parameters.
  • Autoregressive Decoding: Decoding is performed greedily, with maximum output lengths up to 16K tokens for complex UIs (Koh et al., 2 Feb 2026).
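
The dual reward with a KL penalty could be combined along the following lines. This is only a sketch: the text does not state how $R_{\text{sem}}$ and $R_{\text{act}}$ are weighted, what the penalty coefficient is, or which KL estimator is used, so the equal weights, `beta`, and the summed log-ratio estimate below are all assumptions.

```python
from typing import Sequence


def combined_reward(r_sem: float, r_act: float,
                    w_sem: float = 0.5, w_act: float = 0.5) -> float:
    """Weighted sum of the two VLM-judged rewards (weights are assumed, not from the paper)."""
    return w_sem * r_sem + w_act * r_act


def kl_penalized_reward(reward: float,
                        logp_policy: Sequence[float],
                        logp_ref: Sequence[float],
                        beta: float = 0.1) -> float:
    """Subtract a KL penalty that keeps the policy close to the SFT reference.

    The KL term is estimated for one sampled sequence as the sum over tokens of
    log pi(token) - log pi_ref(token); beta is an assumed coefficient.
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


# One rendered hypothesis judged fully faithful visually but inconsistent with the action.
r = combined_reward(r_sem=1.0, r_act=0.0)
shaped = kl_penalized_reward(r, logp_policy=[-1.0], logp_ref=[-2.0], beta=0.5)
```

With identical policy and reference log-probabilities the penalty vanishes, so the shaped reward reduces to the combined reward.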

4. Evaluation Protocols and Comparative Performance

Rigorous evaluation covers both next-state generation accuracy and downstream agent navigation metrics. Standard benchmarks include MWMBench, AndroidControl (in-distribution, ID), GUIOdyssey (out-of-distribution, OOD), AndroidWorld (online), and KApps.

Key metrics:

  • Instruction Accuracy (IAcc): “Pass/fail” verdict, averaged across three VLM judges, indicating whether the generated code accurately reflects both initial state and action (Koh et al., 2 Feb 2026).
  • Similarity Metrics: Cosine similarity between DINO and SigLIP embeddings of rendered prediction vs. ground truth (Zheng et al., 10 Feb 2026, Koh et al., 2 Feb 2026).
  • Visual Quality: Fine-grained scores such as Action Adherence ($S_{ad}$), Action Identifiability ($S_{id}$), Element Alignment ($S_{ele}$), and Layout Integrity ($S_{tay}$).
  • Render Fail Rate: Fraction of code predictions that fail to render in a headless browser.
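
The metrics above reduce to simple aggregations once judge verdicts, embeddings, and render outcomes are available. The sketch below assumes those inputs are already computed (no embedding model or browser is invoked) and reads "averaged across three VLM judges" as a mean of pass/fail votes per sample, which is an assumption about the exact aggregation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-norm embedding")
    return dot / norm


def instruction_accuracy(verdicts: Sequence[Sequence[bool]]) -> float:
    """Mean pass rate: average over judges per sample, then over samples."""
    per_sample = [sum(v) / len(v) for v in verdicts]
    return sum(per_sample) / len(per_sample)


def render_fail_rate(rendered_ok: Sequence[bool]) -> float:
    """Fraction of predictions that failed to render."""
    return sum(1 for ok in rendered_ok if not ok) / len(rendered_ok)


# Toy inputs: two samples with three judge verdicts each, four render outcomes.
iacc = instruction_accuracy([[True, True, True], [False, False, False]])
fail = render_fail_rate([True, True, False, True])
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```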

Summary of quantitative results:

| Model | IAcc (%) | Similarity (%) | Render Fail (%) | S_ad / S_id / S_ele / S_tay |
|---|---|---|---|---|
| gWorld 8B | 74.9 | 70.3 | 1.4 | - |
| gWorld 32B | 79.6 | 71.4 | 0.6 | - |
| Qwen3 VL 8B baseline | 29.2 | 40.1 | - | - |
| Code2World-8B | - | 79.4 (SigLIP) | - | 94.3 / 88.6 / 71.4 / 70.3 |
| GPT-5 | - | 78.1 (SigLIP) | - | 94.0 / 90.2 / 74.1 / 69.8 |
| Gemini-3 Pro-Image | - | 84.9 (SigLIP) | - | 92.6 / 83.7 / 68.5 / 63.7 |

  • Pareto Frontier: gWorld 8B/32B and Code2World-8B achieve top scores among open-weight models, often surpassing baselines 50× their size (Koh et al., 2 Feb 2026, Zheng et al., 10 Feb 2026).
  • Navigation Success Rates: Addition of Code2World improves Gemini-2.5-Flash success by +9.5 pp (from 41.4% to 50.9%) on AndroidWorld, while Code2World-guided “Propose → Simulate → Select” pipelines yield consistent gains (Zheng et al., 10 Feb 2026).
  • Throughput and Render Latency: gWorld 8B generates ~20,000 tokens/s and gWorld 32B ~5,000 tokens/s, with per-state render + screenshot latencies of <1 s (Koh et al., 2 Feb 2026).
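
A "Propose → Simulate → Select" step can be sketched as below. This is a schematic under assumptions rather than the Code2World pipeline: `simulate` stands in for the world model predicting next-state HTML, `score` stands in for whatever judges progress toward the goal, and both are supplied here as toy functions.

```python
from typing import Callable, Sequence


def select_action(
    state: str,
    candidates: Sequence[str],            # proposed actions from the agent policy
    simulate: Callable[[str, str], str],  # hypothetical: (state, action) -> predicted HTML
    score: Callable[[str], float],        # hypothetical: predicted HTML -> goal score
) -> tuple[str, float]:
    """Simulate each proposed action and return the one with the best predicted outcome."""
    if not candidates:
        raise ValueError("no candidate actions proposed")
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        predicted_html = simulate(state, action)
        s = score(predicted_html)
        if s > best_score:  # strict comparison: the earliest candidate wins ties
            best_action, best_score = action, s
    return best_action, best_score


# Toy world model and scorer so the loop runs without a VLM or a browser.
def toy_simulate(state: str, action: str) -> str:
    return f"<div data-from='{state}'>{action}</div>"


def toy_score(html: str) -> float:
    return 1.0 if "open_settings" in html else 0.0


chosen, chosen_score = select_action(
    "home", ["scroll_down", "open_settings", "go_back"], toy_simulate, toy_score
)
```

Because the agent only commits to the selected action, a mispredicted candidate costs one simulation rather than a real step in the app.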

5. Comparative Analysis with Prior World Model Paradigms

Distinct world model paradigms include:

  • Text-only: Represent UIs as sequences of tokens or “sketches.” Reported to lose spatial/visual fidelity and to yield poor policy improvement (Koh et al., 2 Feb 2026).
  • Pixel-based: Rely on generative diffusion pipelines, integrating OCR and GPT-based recognition for text rendering and layout synthesis. Suffer from legibility failures, complex multi-stage pipelines, high compute latency, and closed weights (Zheng et al., 10 Feb 2026, Koh et al., 2 Feb 2026).
  • Code-based (present paradigm): Achieve precise text rendering, structurally accurate layouts, high visual fidelity, low error rate and render latency, and strong data-efficiency. Empirical evidence shows gWorld and Code2World outperform pixel-diffusion models in both visual quality and semantic correctness (Zheng et al., 10 Feb 2026, Koh et al., 2 Feb 2026).

Ablation studies confirm that data scalability ($\text{IAcc} \propto (\text{data size})^{b}$, with $R^2 \geq 0.94$) and pipeline relabeling (synthesized code via “frontier” VLM → 100% renderable, IAcc=100%) are critical for model performance (Koh et al., 2 Feb 2026).
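
A power-law relationship of this kind is commonly checked with a linear fit in log-log space. The sketch below shows that procedure on synthetic points generated from an exact power law; the data are made up for illustration and are not results from the paper, and computing $R^2$ in log space is an assumption about how the reported value was obtained.

```python
import math
from typing import Sequence


def fit_power_law(sizes: Sequence[float], accs: Sequence[float]) -> tuple[float, float, float]:
    """Fit acc = a * size**b by least squares on (log size, log acc).

    Returns (a, b, r_squared), with r_squared computed in log space.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(a) for a in accs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic data from an exact power law (a = 2.0, b = 0.25); not measured values.
synthetic_sizes = [1e3, 1e4, 1e5, 1e6]
synthetic_accs = [2.0 * s ** 0.25 for s in synthetic_sizes]
a_hat, b_hat, r_squared = fit_power_law(synthetic_sizes, synthetic_accs)
```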

6. Implications, Advantages, and Future Extensions

The code-based approach confers several advantages:

  • Visual Fidelity: Deterministic rendering with exact reproduction of screenshot geometry and typography via browser engines (Zheng et al., 10 Feb 2026).
  • Structural Controllability: Symbolic, editable code representations support programmatic manipulation and long-horizon planning (e.g., inserting buttons, rearranging containers).
  • Sample Efficiency: Utilization of language modeling inductive biases improves data-efficiency over pixel-space diffusion (Zheng et al., 10 Feb 2026).
  • Modularity and Extensibility: Facilitates support for multiple UI frameworks (HTML, SwiftUI, React Native); action vocabularies including gestures, multi-phase interactions, and cross-device adaptation are straightforward extensions (Zheng et al., 10 Feb 2026).
  • Unified Simulation & Policy Interface: Enables “world-in-the-loop” learning setups with efficient simulate/select control cycles, accelerating RL training and evaluation.

A plausible implication is that such models will enable joint high-level task planning (instructed by LLMs) with low-level UI simulation for look-ahead and error correction, ultimately improving the autonomy and robustness of agentic goal navigation in complex mobile app environments (Zheng et al., 10 Feb 2026).

7. Open Challenges and Prospective Directions

While code-based world models surpass prior approaches on key technical axes, several challenges remain open for investigation:

  • Cross-framework Portability: Generalizing code synthesis to non-HTML mobile frameworks (e.g., native Android XML, SwiftUI DSLs, or React Native JSX) is necessary for full ecosystem coverage (Zheng et al., 10 Feb 2026).
  • Richer Semantics: Incorporation of text input, drag/gesture navigation, and non-visual states.
  • Robustness to Style and Device Heterogeneity: Adapting code generation for differing device DPIs, dark mode/branding styles, and accessibility guidance.
  • Minimizing Hallucination and Unfaithfulness: Ensuring the generated code faithfully mirrors hidden or partially occluded elements requires further refinement of vision-language alignment.
  • Scaling: Power-law behavior in data scaling suggests ongoing returns to dataset growth, but also raises questions of compute/dataset boundaries and diminishing returns (Koh et al., 2 Feb 2026).

These directions are poised to further consolidate code-based GUI state prediction as the canonical approach for interpretable, efficient, and high-fidelity mobile world modeling.
