Visual World Modeling via Code Generation
- This line of work develops approaches in which systems generate renderable code to model, predict, and reconstruct visual environments with high levels of interpretability and compositionality.
- It introduces methodologies like Im2Sim and code-based GUI representation that transform images into executable code, offering precise control over simulation and rendering processes.
- Empirical evaluations show that code-generated outputs achieve strong instruction accuracy and render fidelity, highlighting the method's practical benefits in simulation, UI prediction, and procedural scene synthesis.
Visual world modeling via renderable code generation denotes a family of methods in which a system models, predicts, or reconstructs aspects of a visual environment not as direct pixels, but by emitting structured code that, when executed, produces visual renderings consistent with the world. This paradigm unifies advances in vision-LLMs, procedural world generation, and structured output transduction, enabling precise, interpretable, and compositional representations. Across applications from scientific simulation to UI state prediction and virtual environment authoring, renderable code becomes the canonical medium through which world models express generative processes, structural regularities, and interactive state transitions.
1. Conceptual Foundations and Motivation
Visual world modeling aims to build internal models capable of understanding, predicting, or reconstructing visual environments. Renderable code offers a high-level, interpretable, and executable formalism for expressing these models. The approach stands in contrast to traditional pixel-based generative pipelines, facilitating several key properties:
- Compositionality: Code can describe rich, multi-part systems (e.g., urban layouts, physical systems) by encapsulating generative mechanisms and component relationships.
- Interpretability: Code outputs expose explicit representations of system structure and generative logic, useful for diagnosis and transfer across domains.
- Precision and Fidelity: Especially in domains with linguistic or symbolic structure (e.g., text, GUI states), code enables perfect rendering of textual/numeric content and structural details via domain-specific primitives (e.g., CSS, geometry calls).
- Efficiency and Scalability: Code-based rendering often circumvents the inefficiencies of generative diffusion, and, when paired with appropriate renderers, supports high-throughput synthesis and manipulation.
Visual world modeling via renderable code thus provides a framework with broad utility in domains requiring accurate reconstruction, simulation, or prediction—whether as output of vision-LLMs or as the backend for procedural interfaces (Eppel, 8 Jan 2026, Koh et al., 2 Feb 2026, Lucanin, 2012).
2. Methodologies and System Architectures
A. Image-to-Simulation-to-Image (Im2Sim)
In the Im2Sim workflow, a vision-LLM (VLM) processes an input image I_in, encodes it using a vision backbone (e.g., ViT), and generates two outputs: (i) a descriptive textual summary of the underlying generative process, and (ii) an executable code block in a designated language or API. This code, when executed in a sandboxed interpreter, synthesizes an image Î for direct comparison with the original input. The process formalizes as

Î = Render(f_θ(I_in)),

where θ parameterizes the code-generating VLM conditioned on latent vision features. Rendering targets typically include Processing (Java), Python with NumPy/Matplotlib, and the Blender Python API. This methodology has demonstrated high-level conceptual mapping of complex systems—waves, urban layouts, reaction-diffusion vegetation—by leading multimodal transformers (GPT, Gemini) (Eppel, 8 Jan 2026).
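The execute-and-compare half of this loop can be sketched as a small harness. Everything here is an illustrative assumption rather than the paper's actual pipeline: `execute_render`, the restricted namespace, and the stand-in "generated" snippet are hypothetical, and a real system would run the code in a proper sandboxed interpreter.

```python
import numpy as np

def execute_render(code_str, size=(64, 64)):
    """Execute model-emitted rendering code in a restricted namespace.
    The snippet is expected to assign a 2-D array named `image`.
    (Real systems would use a sandboxed interpreter, not bare exec.)"""
    ns = {"np": np, "size": size}
    exec(code_str, ns)
    return np.asarray(ns["image"], dtype=float)

def l2_distance(a, b):
    """Pixel-level root-mean-square difference between two renders."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Stand-in for VLM-emitted code: a simple interference-like pattern.
generated_code = """
x, y = np.meshgrid(np.linspace(0, 1, size[0]), np.linspace(0, 1, size[1]))
image = np.sin(20 * x) * np.sin(20 * y)
"""
rendered = execute_render(generated_code)
print(rendered.shape)                       # (64, 64)
print(l2_distance(rendered, rendered))      # 0.0 for identical renders
```

In a full Im2Sim loop, `l2_distance` would be replaced or supplemented by perceptual metrics (SSIM, DINO features) when comparing the render against the original input image.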
B. Code-Based GUI Representation in Mobile World Models
In mobile GUI world modeling, models predict the next state not as pixels but as a code block c_{t+1} (HTML/CSS), rendering the new state by

ŝ_{t+1} = Render(c_{t+1}),  where  c_{t+1} = f_θ(s_t, a_t).

gWorld models utilize a ViT-based encoder, a Transformer-XL–style decoder with cross-attention, and BPE tokenization specialized for HTML/CSS tokens. The output concatenates a reasoning trace and code. This code-based approach maintains high precision for text and layout compared to pixel-level prediction and allows for compositional UI specification. Quantitative evaluations demonstrate superior instruction accuracy and render fidelity across distribution splits; gWorld 8B and 32B outperform models up to 50× their size (Koh et al., 2 Feb 2026).
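A toy sketch of the code-as-state idea, assuming a tiny hypothetical settings screen: the GUI state is an HTML string, and a stand-in "model" maps (state, action) to next-state code. The real gWorld decoder is a learned sequence model, not a string rewrite; this only illustrates why code-level prediction preserves text and structure exactly.

```python
# Hypothetical toy state: a settings row with a toggle (names illustrative).
STATE = '<div class="screen"><p>Wi-Fi</p><div class="toggle off"></div></div>'

def predict_next_state(html_state: str, action: str) -> str:
    """Stand-in for a learned world model f_theta(s_t, a_t) -> c_{t+1}.
    Emits the next GUI state as edited HTML rather than pixels."""
    if action == "tap_toggle":
        if 'class="toggle off"' in html_state:
            return html_state.replace('class="toggle off"', 'class="toggle on"')
        return html_state.replace('class="toggle on"', 'class="toggle off"')
    return html_state  # unrecognized actions leave the state unchanged

next_state = predict_next_state(STATE, "tap_toggle")
print('class="toggle on"' in next_state)   # True: the toggle flipped
print("Wi-Fi" in next_state)               # True: text preserved exactly
```

Note how the textual content ("Wi-Fi") survives the transition verbatim, which is exactly the property pixel-space world models struggle to guarantee.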
C. Visual Programming and Procedural Virtual Scene Generation
Lučanin et al. explore visual programming languages (VPLs) that let users construct flowchart-style diagrams describing procedural generation pipelines for virtual environments. The VPL translates visual graphs into structured code via a GOTO→WHILE→Python multistage compiler, ultimately invoking C++ procedural APIs (via Boost.Python) bound to a 3D rendering engine. Key API components—e.g., ManhattanLayout, ProceduralBuildingGenerator—expose stochastic and parametric primitives for scene synthesis (Lucanin, 2012). The resulting code is directly executable and integrates with graphical renderers.
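The flavor of such a stochastic-plus-parametric primitive can be sketched as follows. This is a hypothetical stand-in for a ManhattanLayout-style call, not the actual C++ API: the function name, parameters, and lot representation are all assumptions.

```python
import random

def manhattan_layout(blocks_x, blocks_y, block_size, seed=0):
    """Hypothetical stand-in for a ManhattanLayout-style primitive:
    lays out a regular street grid and assigns each lot a stochastic
    building height, returning (x, y, footprint, height) tuples."""
    rng = random.Random(seed)  # seeding keeps the stochastic scene repeatable
    lots = []
    for i in range(blocks_x):
        for j in range(blocks_y):
            height = rng.uniform(10.0, 40.0)  # stochastic parameter
            lots.append((i * block_size, j * block_size, block_size, height))
    return lots

scene = manhattan_layout(3, 4, 20)
print(len(scene))  # 12 lots in a 3x4 grid
```

A downstream renderer would consume these tuples to instantiate geometry; fixing the seed makes the "stochastic" scene reproducible, which is what lets such code serve as a stable world representation.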
3. Training, Data Generation, and Code Synthesis
Supervised Code Generation from Image/Action Pairs
Code-generating VLMs are fine-tuned on datasets comprising triplets (image, action, code/rendered next state). In the case of gWorld, offline policy trajectories are repurposed and relabeled using high-capacity VLMs prompted to map images to code and provide reasoning traces. This process yields high-diversity training corpora with tens to hundreds of thousands of synthetic HTML/CSS exemplars, covering broad application domains (productivity, navigation, multimedia, settings) (Koh et al., 2 Feb 2026). The loss function is standard autoregressive maximum likelihood over the concatenated reasoning and code tokens, with optional regularization.
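The training objective described above—autoregressive maximum likelihood over the concatenated reasoning and code tokens—amounts to a standard token-level negative log-likelihood. A minimal sketch with synthetic logits (shapes and numbers are illustrative, not from the paper):

```python
import numpy as np

def autoregressive_nll(logits, targets):
    """Mean negative log-likelihood of target tokens under next-token
    logits (teacher forcing over concatenated reasoning + code tokens).
    logits: (T, V) array; targets: (T,) integer token ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Tiny example: 3 positions, vocabulary of 4 tokens.
logits = np.full((3, 4), -10.0)
targets = np.array([0, 2, 3])
logits[np.arange(3), targets] = 10.0   # model is confident in the targets
loss = autoregressive_nll(logits, targets)
print(loss < 1e-6)   # True: near-zero loss when the model matches the data
```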
Code Rendering and Execution
Generated code is executed in isolated environments: Python scripts in Jupyter/Matplotlib, Processing, or Blender for scientific/physical domains (Eppel, 8 Jan 2026); headless Chromium-based browsers for HTML/CSS mobile GUIs (Koh et al., 2 Feb 2026); real-time 3D engines (Ogre) for procedural scene generation (Lucanin, 2012). Synthetic outputs are then evaluated against reference images using both automated metrics (Lp norms, SSIM, DINO cosine similarity) and qualitative expert judgment.
4. Evaluation Metrics and Empirical Findings
Evaluation examines both pixel-level and structural congruence between rendered code outputs and reference data.
| Model/System | Param Count | Instr. Accuracy / Match Rate | Visual Sim. | Render Fail Rate |
|---|---|---|---|---|
| gWorld 32B | 32B | 79.6% | 71.4% | 0.6% |
| GPT-5 (Color, Im2Sim) | — | 80% (match rate) | — | — |
| Llama 4 402B | 402B | 55.7% | 62.4% | 9.2% |
| Human (Im2Sim) | — | 70% (match rate) | — | — |

(Im2Sim rows report the rate at which simulated renders are correctly matched to their originals; GUI rows report instruction accuracy.)
Metrics:
- Instruction Accuracy ("IAcc."): VLM-judged, action-consistency of predicted render versus ground-truth GUI (Koh et al., 2 Feb 2026).
- Visual Similarity: DINO v1/v2 cosine similarity between predicted and reference images.
- Render Fail Rate: Fraction where code fails to render a valid image/state.
- SSIM and Lp Norms: For image-level Im2Sim comparison, SSIM and Lp-norm pixel differences are standard (Eppel, 8 Jan 2026).
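The visual-similarity metric above reduces to cosine similarity between feature embeddings of the predicted and reference renders. A minimal sketch (the toy vectors stand in for DINO features, which in practice come from a pretrained vision backbone):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two feature embeddings, the comparison
    applied to DINO features of predicted vs. reference renders."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])          # stand-in embedding
print(round(cosine_similarity(a, a), 6))    # 1.0: identical embeddings
print(round(cosine_similarity(a, -a), 6))   # -1.0: opposite embeddings
```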
In Im2Sim, VLMs correctly match simulated to original images in 50–80% of cases across physical, organic, and urban systems; humans achieve 70%. For mobile GUIs, code-based WM architectures like gWorld achieve top IAcc. (79.6% at 32B parameters) and low render-fail rates, outperforming much larger baselines (Koh et al., 2 Feb 2026). A power-law scaling relation is observed between data corpus size and IAcc., with further gains implied by access to larger behavioral data.
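The power-law relation between corpus size and instruction accuracy can be recovered by a linear fit in log-log space. The numbers below are synthetic, chosen only to illustrate the fitting procedure—they are not figures from the cited paper.

```python
import numpy as np

# Illustrative recovery of a power law IAcc ~ a * N^b between corpus
# size N and instruction accuracy (synthetic data, assumed exponent 0.05).
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
acc = 0.4 * N ** 0.05
b, log_a = np.polyfit(np.log(N), np.log(acc), 1)   # linear fit in log-log
print(round(b, 3))               # 0.05: the exponent is recovered
print(round(np.exp(log_a), 3))   # 0.4: the prefactor is recovered
```

On real (noisy) measurements the same fit yields the scaling exponent, which is what supports extrapolating gains from larger behavioral corpora.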
5. Application Domains and Practical Benefits
Renderable code generation enables world modeling in domains characterized by structured generative regularities or heavy symbolic content, providing:
- Physical and Emergent Systems Modeling: Im2Sim demonstrates code-based modeling of caustics, reaction-diffusion systems, and urban networks.
- Graphical User Interface State Prediction: GUI world models leverage code for crisp reproduction of text, layout, and interaction outcomes, critical for UI agent rollouts and automated evaluation (Koh et al., 2 Feb 2026).
- Procedural Scene Synthesis: Visual programming methodologies empower non-programmers to author complex 3D cityscapes and architectural scenes using deterministic and stochastic code, fully integrated into rendering pipelines (Lucanin, 2012).
Benefits include interpretability, editability, and direct mapping to downstream simulation/rendering engines. In policy learning, richer world models directly improve agent performance, with demonstrated step-wise accuracy gains proportional to world-model quality (Koh et al., 2 Feb 2026).
6. Limitations and Asymmetries
A recurring empirical finding is the discrepancy between high-level mechanistic/conceptual fidelity and low-level visual matching:
- High-Level Successes: VLMs often infer correct underlying generative principles (Snell’s law, L-systems, Perlin noise) and can decompose complex visual scenes into sub-processes, accurately simulating the domain’s broad structure (Eppel, 8 Jan 2026).
- Low-Level Failures: Fine spatial detail, parameter precision, and pixel-exact patterning remain out of reach for most code-first systems; rendered outputs frequently miss subtle spatial correspondences and textures.
- “Cheating” Effects: Pure pattern copying or memorization can sometimes outperform physically faithful code, especially if the matching objective undervalues mechanistic correctness. This suggests a fundamental tension: current renderable code models effectively internalize systemic and generative abstractions but remain limited both by code expressivity and by the granularity of their visual inductive biases.
7. Historical Perspective and Relation to Prior Work
Predecessors to contemporary approaches include pix2code (GUI screenshots → HTML/CSS), InverseCSG/DeepCAD (3D geometry recovery as program induction), and graphical world-modeling in visual programming environments (Lucanin, 2012). The novelty in recent work lies in targeting unconstrained, natural images (not just clean diagrams), multi-component code synthesis, and closed-loop evaluation via rendered simulations. The integration of vision-language pretraining and structured code emission across diverse domains marks a shift toward general, unified world-modeling architectures (Eppel, 8 Jan 2026, Koh et al., 2 Feb 2026).
In summary, visual world modeling via renderable code generation unifies advances in multimodal learning, program synthesis, and procedural generation. It has demonstrated interpretive power and functional benefits across physical, symbolic, and interactive domains, while also exposing the frontiers of current model abstraction and granularity. Future directions likely include scaling to increasingly complex environments, further integration of code and pixel representations, and extending generalization to higher-order compositionality.