
Code2World: Modeling Code in Action

Updated 17 February 2026
  • Code2World is a research paradigm that grounds code understanding in executable world models, enabling simulation of state transitions and agentic reasoning.
  • It leverages trace-based architectures, large-context transformers, and render-aware reinforcement learning to predict outcomes across software, GUI, and 4D simulation domains.
  • Empirical benchmarks show improved simulation fidelity and planning accuracy, despite challenges in token budgets, string state handling, and long-horizon tracking.

Code2World encompasses a research paradigm, technical methodology, and a class of models that translate code into interactive or grounded predictions of world state or dynamics. The unifying characteristic of Code2World approaches is the explicit grounding of code generation or understanding in executable, semantics-rich world models, enabling agentic reasoning, step-by-step simulation, and actuation across software, physics, and GUI domains. Canonical instantiations range from LLMs trained to simulate program execution through structured traces, to vision-language agents that generate renderable interface code, to 4D scene generators producing executable simulation scripts. This article synthesizes the development, core principles, and frontiers in Code2World methodologies, with a focus on representative research from CWM (team et al., 30 Sep 2025), Code2Worlds (Zhang et al., 12 Feb 2026), Debugging Code World Models (Rahmani, 7 Feb 2026), and Code2World for GUIs (Zheng et al., 10 Feb 2026).

1. Foundations and Definitions

Code2World denotes the capability of LLMs or code agents to “think” in code by simulating, predicting, or generating effects within an explicit world model. This approach diverges from conventional code LLMs trained solely on static corpora, instead grounding code understanding in dynamic, executable semantics. In the execution-based paradigm, a model receives code (or state/action history) and must autoregressively output the next “state” (e.g., runtime variable snapshot, simulated screen layout, or physical geometry), often conditioned on explicit actions and environmental feedback. Representative manifestations include:

  • CWM (Code World Model): An open-weights 32B-parameter LLM trained in phases to predict interpreter states, Docker environment transitions, and to perform agentic planning, with context windows capable of holding long execution traces or repository-scale state (team et al., 30 Sep 2025).
  • Code2World (GUI): A vision-language code generator that predicts future GUI states via renderable HTML conditioned on the current screenshot and action, optimized with a “render-aware” closed-loop RL signal (Zheng et al., 10 Feb 2026).
  • Code2Worlds (4D Simulation): A language-to-simulation framework that generates Blender/Python scripts to instantiate and actuate physics-aware 4D worlds from text, using dual-stream architectures and VLM-driven self-refinement (Zhang et al., 12 Feb 2026).

The goal is to endow AI agents with a “neural world model”: a learned function that predicts the results of code actions on world state, thereby supporting internal verification, planning, debugging, and foresight.
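The shared abstraction across these systems can be sketched as a minimal state-transition interface: a serialized world state, an action, and a predicted next state. The names below are hypothetical and illustrative, not drawn from any of the cited systems:

```python
import json
from dataclasses import dataclass

@dataclass
class WorldModelStep:
    """One (state, action, next-state) transition in a Code2World-style trace."""
    state: dict       # serialized world state, e.g. interpreter variables
    action: str       # code or UI action applied to that state
    next_state: dict  # predicted (or observed) resulting state

def serialize_transition(step: WorldModelStep) -> str:
    """Render a transition as the kind of JSON text a trace-based model consumes."""
    return json.dumps(
        {"state": step.state, "action": step.action, "next_state": step.next_state},
        sort_keys=True,
    )

step = WorldModelStep(state={"x": 1}, action="x += 2", next_state={"x": 3})
print(serialize_transition(step))
```

A trained world model replaces the observed `next_state` with a learned prediction, which is what enables planning without an interpreter in the loop.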

2. Model Architectures and Training Protocols

Code2World models typically extend decoder-only Transformers, adapting architectural and training strategies for explicit state-action prediction and long-horizon reasoning:

  • Trace-Based Architectures (CWM, Debugging CWM): The model processes sequences of (action, state) pairs. At each step $t$, with action $a_t$ and serialized state $s_t$ (often as JSON), the model autoregressively predicts $s_{t+1}$, sometimes jointly with $a_{t+1}$ (team et al., 30 Sep 2025, Rahmani, 7 Feb 2026).

The principal objective is

$$L(\theta) = -\sum_{t=1}^{T} \left[\log p_\theta(a_t \mid s_{<t}, a_{<t}) + \log p_\theta(s_t \mid s_{<t}, a_{\leq t})\right]$$

  • Large-Context Transformers: CWM achieves 131k token context through a composite attention scheme (3:1 ratio of local 8k and global 131k sliding-window patterns) with Grouped-Query Attention, RoPE to 1M tokens, SwiGLU activations, and RMSNorm, enabling repository- and trace-scale inference (team et al., 30 Sep 2025).
  • Render-Aware and RL-Enhanced Training: In Code2World (Zheng et al., 10 Feb 2026), a VLM backbone undergoes supervised fine-tuning (SFT) on HTML layout synthesis, then is further optimized with Render-Aware Reinforcement Learning (RARL) where rewards are computed from semantic visual alignment and action consistency, using Group Relative Policy Optimization.
  • Dual-Stream and Closed-Loop Architectures (Code2Worlds): Generation is decomposed into object streams (retrieval-augmented per-entity code synthesis) and scene streams (hierarchical environment orchestration), unified and subject to iterative post-processing with VLM-Motion Critic self-reflection to enforce dynamic plausibility (Zhang et al., 12 Feb 2026).
  • Dataset Curation: Training data encompasses synthetic and real-world execution traces (Python, Docker), renderable GUI trajectories (AndroidCode), and procedural simulation libraries. Visual feedback and post-hoc correction are integrated in data pipelines (Zheng et al., 10 Feb 2026).
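The interleaved action/state objective above factorizes into an action term and a state term per step. A toy pure-Python computation of that loss, with hand-set per-step probabilities standing in for actual model outputs, looks like this:

```python
import math

# Toy per-step model probabilities for a 3-step trace:
# p(a_t | s_<t, a_<t) and p(s_t | s_<t, a_<=t), matching the two terms of the objective.
trace_probs = [
    {"p_action": 0.9,  "p_state": 0.8},
    {"p_action": 0.7,  "p_state": 0.85},
    {"p_action": 0.95, "p_state": 0.6},
]

def trace_loss(steps):
    """Negative log-likelihood summed over action and state terms at each step."""
    return -sum(math.log(s["p_action"]) + math.log(s["p_state"]) for s in steps)

loss = trace_loss(trace_probs)
print(round(loss, 4))  # sums -log p over both terms at every step
```

In a real system each probability comes from the decoder's softmax over the next token(s) of the serialized action or state, not from a lookup table.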

3. Methodological Advances in "Code2World" Learning

The Code2World paradigm introduces methodological novelties distinct from both static code generation and generic tool-use agents:

  • World-Modeling via Execution Trace Prediction: Unlike chain-of-thought or unstructured code generation, models learn system dynamics by minimizing cross-entropy over next-frame prediction in execution traces: $L_\mathrm{trace} = -\sum_t \log P(o_t, a_t \mid C, \tau_{0:t-1})$. Empirically, millions of real interpreter transitions are encoded as sequences of custom tokens framing local state and action, with trajectory-based datasets for mid-training (team et al., 30 Sep 2025).
  • Simulation, Planning, and Agentic Reasoning: Rollouts in the latent world model enable stepwise simulation of code effects, filtering of candidate edits via predicted test outcomes, and “simulate-rank-generate” cycles for patch synthesis. This unlocks agentic coding, tool-augmented planning, and offline debugging without an interpreter (team et al., 30 Sep 2025).
  • Render-Aware GUI Prediction: In GUI domains, next-state generation is formulated as conditional code synthesis, $C_{t+1} = M_\theta(I_t, a_t, G)$, rendering a predicted HTML DOM and using browser feedback for iterative refinement. Supervised and reinforcement objectives combine, enforcing both syntactic (via code loss) and semantic/visual correctness (via VLM-based rewards) (Zheng et al., 10 Feb 2026).
  • Physics-Aware 4D World Generation: 4D scene construction is solved via dual-stream code synthesis (per-object and per-environment), followed by Blender-scripted actuation of physics effects (rigid body, fluid, particles), with closed-loop VLM-Motion Critic feedback ensuring compliance with physical and semantic expectations. Code2Worlds uniquely demonstrates correction of unphysical “hallucinations” with iterative pseudocode cycles (Zhang et al., 12 Feb 2026).
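The simulate-rank-generate idea can be sketched as ranking candidate patches by the world model's predicted test pass rates. The predictor interface below is a hypothetical stand-in, not the CWM API:

```python
from typing import Callable

def simulate_rank_generate(
    candidates: list[str],
    tests: list[str],
    predict_pass: Callable[[str, str], float],
) -> str:
    """Rank candidate patches by predicted test outcomes and return the best,
    with no interpreter in the loop: the world model stands in for execution."""
    def score(patch: str) -> float:
        return sum(predict_pass(patch, t) for t in tests) / len(tests)
    return max(candidates, key=score)

# Toy stand-in for a learned predictor of P(test passes | patch).
def toy_predictor(patch: str, test: str) -> float:
    return 0.9 if "fix" in patch else 0.2

best = simulate_rank_generate(
    ["patch_noop", "patch_fix_off_by_one"], ["test_a", "test_b"], toy_predictor
)
print(best)
```

In practice the top-ranked patch would then be verified against the real test suite, with the world model serving only to prune the candidate set cheaply.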

4. Evaluation, Benchmarks, and Empirical Results

Code2World research is characterized by rigorous benchmark evaluation across software, GUI, and simulation environments:

| Model | SWE-bench Verified (pass@1) | LiveCodeBench | Math-500 | Code4D SGS | AndroidWorld Navigation |
|---|---|---|---|---|---|
| CWM (Base) | 53.9% | 68.6% | 96.6% | | |
| CWM (Scaled) | 65.8% | | | | |
| Code2Worlds | | | | 61.4 | |
| 3D-GPT (static) | | | | 41.7 | |
| ImmerseGen | | | | 35.5 | |
| AnimateDiff | | | | | |
| Code2World (8B) | | | | | +9.5% vs. baseline |

CWM establishes new open-weights baselines for verifiable code and math generation, outperforming prior static-code LLMs (team et al., 30 Sep 2025). Code2Worlds demonstrates 41% SGS (Scene Grammar Score) and 49% Richness gains over static 3D or video models, and reduces dynamic failure rates to 10% (vs. >50% for AnimateDiff) (Zhang et al., 12 Feb 2026). Code2World for GUIs achieves Action Adherence ($S_\mathrm{ad}$) = 94.28 and Element Alignment ($S_\mathrm{ele}$) = 71.35 on Android Control, and boosts downstream RL agent navigation by +9.5% on AndroidWorld (Zheng et al., 10 Feb 2026).

Ablation studies indicate that adding explicit execution trace data or agentic environment interaction trajectories elevates code reasoning accuracy and pass rates (e.g., GitHub PR trajectories, Python execution traces, and ForagerAgent data improve SWE-bench performance in CWM) (team et al., 30 Sep 2025).

5. Analysis of Limitations and Failure Modes

Research identifies several bottlenecks and failure regimes in Code2World models:

  • Token-Budget Exhaustion: Dense execution traces (action-state pairs per step) quickly saturate Transformer context windows. On programs involving structured data or long histories, token-budget limits cause trace truncation and reduce performance, with truncation error rates of 0.8%-4.1% observed (Rahmani, 7 Feb 2026).
  • String-Valued State Failure: String manipulation is disproportionately error-prone due to subword tokenization discontinuities. Empirical over-representation ratios for string output errors reach up to 2.56× baseline rates on HumanEval, with context-sensitive BPE tokenization causing unreliable split or index predictions (Rahmani, 7 Feb 2026).
  • Long-Horizon State Tracking: Controlled S₅-permutation benchmarks reveal that error propagation over long horizons is primarily due to hallucinated action generation, not state carry-forward; teacher forcing with ground-truth actions maintains >90% accuracy over 128 steps (Rahmani, 7 Feb 2026).
  • GUI Modeling Constraints: HTML-only GUI world modeling cannot produce novel bitmaps or capture dynamic scripts/animations, and agent behavior can be misled by hallucinated but “valid” code (Zheng et al., 10 Feb 2026).
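The S₅-permutation benchmark composes permutations of five elements over many steps; teacher forcing evaluates state tracking by feeding the model ground-truth actions instead of its own generations. A minimal sketch of the ground-truth rollout such an evaluation compares against (illustrative setup, not the paper's harness):

```python
import random

def apply_perm(state: tuple, perm: tuple) -> tuple:
    """Apply one permutation action to a 5-element state, as in an S5 tracking task."""
    return tuple(state[i] for i in perm)

def rollout(actions: list) -> tuple:
    """Ground-truth long-horizon rollout. Under teacher forcing, the model
    receives these actions rather than its own (possibly hallucinated) ones."""
    state = tuple(range(5))
    for perm in actions:
        state = apply_perm(state, perm)
    return state

random.seed(0)
actions = [tuple(random.sample(range(5), 5)) for _ in range(128)]
print(rollout(actions))
```

Because each step is a bijection, the final state is always a permutation of the start state; divergence in a model's free-running rollout therefore isolates action hallucination from state carry-forward error.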

6. Directions for Future Research

Several open challenges and avenues are emphasized:

  • Efficient State Supervision: Sparse-reveal or delta supervision, where only every Rth state is shown or deltas are modeled, could reduce tokenization overhead (Rahmani, 7 Feb 2026). Structured or byte-level representations (e.g., ByT5-style byte encoding) could improve string operations.
  • Hybrid and Hierarchical Architectures: Integration of explicit symbolic representations (ASTs, IRs) and neural world models, or the use of structured state-space sequence architectures (e.g., SSMs, DeltaNetworks), may enhance generalization and context efficiency (team et al., 30 Sep 2025, Rahmani, 7 Feb 2026).
  • Multi-Domain and Multi-Language Extensions: Expanding execution-based world modeling to other programming languages (JavaScript, C++, theorem provers), dynamic and script-heavy GUIs, and multimodal simulation environments is a recognized need (team et al., 30 Sep 2025, Zheng et al., 10 Feb 2026).
  • Self-Reflective and Closed-Loop Learning: Iterative refinement using environment-grounded critics (e.g., VLM-Motion Critic, render-alignment judges) appears central to maintaining high-fidelity, physically consistent predictions in both simulation and GUI domains (Zhang et al., 12 Feb 2026, Zheng et al., 10 Feb 2026).
  • Continual Learning and Robustness: As software, simulation environments, and device interfaces evolve, Code2World agents require mechanisms for continual updating and for adversarial robustness to hallucinated or distributionally-shifted scenarios (team et al., 30 Sep 2025, Zheng et al., 10 Feb 2026).
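Sparse-reveal or delta supervision can be sketched as emitting only changed keys at most steps, with a full state snapshot revealed only every Rth step. The framing convention below is illustrative, not taken from the cited papers:

```python
def delta_supervise(states: list[dict], reveal_every: int = 4) -> list[dict]:
    """Replace dense state frames with deltas, revealing a full snapshot
    only every `reveal_every` steps to cut tokenization overhead."""
    frames = []
    prev: dict = {}
    for t, state in enumerate(states):
        if t % reveal_every == 0:
            frames.append({"full": state})          # sparse full reveal
        else:
            delta = {k: v for k, v in state.items() if prev.get(k) != v}
            frames.append({"delta": delta})         # only changed keys
        prev = state
    return frames

states = [{"x": 0}, {"x": 1}, {"x": 1}, {"x": 2}, {"x": 3}]
print(delta_supervise(states, reveal_every=4))
```

The trade-off is that errors in delta prediction compound between reveals, so the reveal period R bounds how far a mistaken carry-forward can propagate.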

A plausible implication is that the practical utility and scalability of Code2World systems will depend on innovations in efficient supervision, cross-domain integration, and dynamic self-correction mechanisms.

7. Significance and Impact

Code2World establishes a new paradigm for code intelligence grounded not merely in syntax or static data, but in executable, semantics-rich world models spanning software execution, UI visual transitions, and physical simulation. The paradigm demonstrates that learning consequences—not just compositions—of code actions is feasible at scale (demonstrated at the 32B-parameter level) and beneficial to agentic reasoning, planning, and robust code and scene generation (team et al., 30 Sep 2025, Zhang et al., 12 Feb 2026, Zheng et al., 10 Feb 2026).

By providing open-weights systems, standardized datasets (AndroidCode, Python/Docker traces), and new RL and evaluation methodologies, Code2World research forms a foundation for the next generation of AI systems that seamlessly simulate, envision, and act across digital and physical environments.
