World-Model Interfaces: Concepts & Design

Updated 4 July 2026

World-model interfaces are formal contracts that link perception, internal state, action, and planning in dynamic environments.
They decompose processes into modular APIs, such as encode, transition, and decode, optimizing control and facilitating simulation.
They balance predictive fidelity with controllability by exposing structured representations that enhance closed-loop planning and interventions.

World-model interfaces are the formal, algorithmic, and software contracts through which a world model links perception, internal state, action, prediction, memory, and intervention. In recent work, these contracts appear at multiple levels: as probabilistic operators such as $p(s_{t+1}\mid s_t,a_t)$ and $p(o_t\mid s_t)$, as callable APIs such as encode, transition, decode, and get_cost, and as higher-level protocols for simulation, querying, rendering, planning, and memory updates. Contemporary research increasingly treats the interface itself as a first-class design object, because closed-loop utility depends not only on predictive fidelity but also on controllability, modularity, and the ability to expose internal structure to downstream agents and planners (Zidan et al., 28 May 2026, Maes et al., 9 Feb 2026, Zhang et al., 20 Oct 2025).

1. Formal definitions and core interface primitives

A general formalization treats the world model as a transition model together with an observation or emission model. The survey literature writes this in the compact form $p_\theta(s_{t+1}\mid s_t,a_t)$ and $q_\theta(o_t\mid s_t)$, or, in a factored control setting, $p_\theta(s_{t+1},o_{t+1},r_t\mid s_t,a_t)$ (Zidan et al., 28 May 2026). This is the minimal interface: a current internal state, an action-conditioned transition, and a map back to observable consequences.

OpenWorldLib makes this abstraction explicit at the software level by defining an advanced world model as a model or framework centered on building internal representations from perception, equipped with action-conditioned simulation and long-term memory (Team et al., 6 Apr 2026). Its interface decomposition is correspondingly modular: Perception is exposed through BaseOperator, implicit interaction through BaseSynthesis, explicit 3D state through BaseRepresentation, long-term storage through BaseMemory, and orchestration through BasePipeline. The pipeline follows a uniform from_pretrained → process → infer → update_memory lifecycle, which turns architectural capabilities into stable callable contracts rather than ad hoc task code (Team et al., 6 Apr 2026).

The same logic appears in stable-worldmodel-v1, which assumes a world model implements

$z_{t+1}=f(z_t,a_t;\theta_f),\qquad \hat o_t=g(z_t;\theta_g),$

with an optional encoder $h(o_t)\to z_t$ (Maes et al., 9 Feb 2026). Its required methods—encode, transition, decode, and get_cost—show a particularly clear separation between representation, dynamics, rendering, and planning cost. Because planners repeatedly call transition and accumulate get_cost, the world-model interface is directly optimized for MPC-style use rather than merely for offline prediction (Maes et al., 9 Feb 2026).
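The four-method contract described above can be sketched as a small Python protocol. This is a minimal illustration, not the stable-worldmodel-v1 API: only the method names encode, transition, decode, and get_cost come from the text, while the scalar signatures, the `ToyLinearWorld` class, and the `rollout_cost` helper are assumptions made for the example.

```python
from typing import Protocol, Sequence


class WorldModel(Protocol):
    """Hypothetical contract using the four method names from the text."""

    def encode(self, obs: float) -> float: ...
    def transition(self, z: float, action: float) -> float: ...
    def decode(self, z: float) -> float: ...
    def get_cost(self, z: float, goal_z: float) -> float: ...


class ToyLinearWorld:
    """1-D stand-in: the latent equals the observation and actions add to it."""

    def encode(self, obs: float) -> float:
        return float(obs)

    def transition(self, z: float, action: float) -> float:
        return z + action

    def decode(self, z: float) -> float:
        return z

    def get_cost(self, z: float, goal_z: float) -> float:
        return (z - goal_z) ** 2


def rollout_cost(world: WorldModel, obs: float, goal_obs: float,
                 actions: Sequence[float]) -> float:
    """Accumulate get_cost along an imagined trajectory, as a planner would."""
    z = world.encode(obs)
    goal_z = world.encode(goal_obs)
    total = 0.0
    for action in actions:
        z = world.transition(z, action)
        total += world.get_cost(z, goal_z)
    return total
```

A planner only ever touches `transition` and `get_cost` inside its inner loop, which is why keeping `decode` separate lets rendering be skipped during search.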

A distinct but related decomposition is introduced by Web World Models, which split state into a deterministic physics component and a model-generated imagination component:

$S_t=\bigl(S_t^\phi,S_t^\psi\bigr).$

The update rule is correspondingly bifurcated: stepPhysics deterministically computes $S_{t+1}^\phi=f_{\mathrm{code}}(S_t^\phi,a_t)$, while stepImagination samples $S_{t+1}^\psi\sim \pi_\theta(\cdot\mid S_{t+1}^\phi)$ under typed web interfaces and runtime validation (Feng et al., 29 Dec 2025). This separation makes the interface itself a correctness mechanism: logical invariants live in code, while open-ended content generation is constrained by schemas.
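The split between a deterministic physics step and a schema-checked imagination step can be illustrated with a short Python sketch. Everything here is hypothetical: the grid-world state, the `SCHEMA` dictionary, and the `generate` callable (a stand-in for the generative model) are assumptions, and the original system is described as using typed web interfaces rather than Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PhysicsState:
    x: int
    y: int


MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

# Hypothetical schema for imagination objects: field name -> required type.
SCHEMA = {"title": str, "danger": int}


def step_physics(state: PhysicsState, action: str) -> PhysicsState:
    """Deterministic update: the invariants live in ordinary code."""
    if action not in MOVES:
        raise ValueError(f"unknown action: {action}")
    dx, dy = MOVES[action]
    return PhysicsState(state.x + dx, state.y + dy)


def validate(candidate: Dict[str, object]) -> Dict[str, object]:
    """Runtime validation: reject generated content that breaks the schema."""
    if set(candidate) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(candidate)}")
    for key, expected in SCHEMA.items():
        if not isinstance(candidate[key], expected):
            raise TypeError(f"field {key!r} must be {expected.__name__}")
    return candidate


def step_imagination(
    physics: PhysicsState,
    generate: Callable[[PhysicsState], Dict[str, object]],
) -> Dict[str, object]:
    """Generate content conditioned on the updated physics state, then check it."""
    return validate(generate(physics))


def step(
    state: PhysicsState,
    action: str,
    generate: Callable[[PhysicsState], Dict[str, object]],
) -> Tuple[PhysicsState, Dict[str, object]]:
    next_physics = step_physics(state, action)
    return next_physics, step_imagination(next_physics, generate)
```

Because imagination is computed from the already-updated physics state, a malformed generation can be rejected or retried without ever corrupting the logical state.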

2. Representation as interface: from explicit scene structure to latent control

One major axis of world-model interface design is the representation exposed between perception and dynamics. VDAWorld instantiates an explicitly structured interface in which an image-caption pair is mapped to an abstract scene representation, with each object parameterized by shape type, pose, orientation, mass, friction, restitution, and inferred initial velocity (O'Mahony et al., 11 Dec 2025). The representation is stored as native Python lists, dicts, or a small class, and is then passed to a simulator selector that chooses among RigidBodyEngine, FluidEngine, SoftBodyEngine, or LogicEngine based on, among other signals, caption keywords such as “water,” “lava,” “smoke,” or “Game of Life” (O'Mahony et al., 11 Dec 2025). The paper explicitly notes that no end-to-end abstraction loss is reported; perception modules are pre-trained and fixed (O'Mahony et al., 11 Dec 2025). The interface is therefore deliberately transparent and simulator-oriented rather than end-to-end latent.
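A keyword-driven simulator selector of the kind described above might look like the following sketch. The engine names and example keywords come from the text; the `SceneObject` fields, the matching order, and the rule that a `"soft"` shape selects SoftBodyEngine are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneObject:
    """One entry of the abstract scene; field names are illustrative."""
    shape: str
    pose: Vec3
    orientation: Vec3
    mass: float
    friction: float
    restitution: float
    velocity: Vec3 = (0.0, 0.0, 0.0)


FLUID_KEYWORDS = ("water", "lava", "smoke")
LOGIC_KEYWORDS = ("game of life",)


def select_engine(objects: Sequence[SceneObject], caption: str) -> str:
    """Return the engine name for a scene (hypothetical selection rules)."""
    text = caption.lower()
    if any(keyword in text for keyword in LOGIC_KEYWORDS):
        return "LogicEngine"
    if any(keyword in text for keyword in FLUID_KEYWORDS):
        return "FluidEngine"
    # Assumption: a "soft" shape type marks a deformable body.
    if any(obj.shape == "soft" for obj in objects):
        return "SoftBodyEngine"
    return "RigidBodyEngine"
```

Keeping the selector as plain code over plain data is what makes the interface inspectable: the choice of engine can be read, tested, and overridden without touching the perception modules.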

ViMo introduces a different abstraction layer for GUI dynamics. Its Symbolic Text Representation replaces dynamic text regions with placeholder rectangles while preserving graphics.

This interface decouples graphic prediction from text generation: a diffusion-based STR Predictor models layout, color, and style, while a GUI-text Predictor restores content token by token (Luo et al., 15 Apr 2025). The representation is neither purely pixel-level nor purely textual; it is a structured intermediate contract tuned to GUI semantics.

WorldAct moves the interface into 3D scene decomposition. Starting from a monolithic 3D Gaussian Splatting world model, it produces a background Gaussian set, a collision mesh, and per-object tuples (Hu et al., 15 May 2026). The interface is dual-form: 3DGS primitives for differentiable rendering and textured meshes for collision, physics, and editing. This duality is central to the framework’s purpose, because editability and embodied manipulation require more than a visually coherent renderer (Hu et al., 15 May 2026).

A more latent formulation appears in BRICKS-WM, which factorizes state into an actuated Agent slot, a Background slot, and a scalar latent interface. The transition factorization is unidirectional, so background dynamics never directly observe the agent action (Zhang et al., 15 Jun 2026). Here the interface is explicitly functional rather than visual: it is a bottleneck through which one subsystem can influence another.
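The bottleneck idea can be made concrete with a toy sketch in which the background update receives only a scalar interface code and never the action. The update rules, the clipping used to form the code, and all names below are assumptions chosen for illustration; they are not the BRICKS-WM architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredState:
    agent: float       # actuated slot
    background: float  # slot that never sees the action
    interface: float   # scalar bottleneck between the two


def agent_step(agent: float, action: float) -> float:
    """Only the agent dynamics take the action as input."""
    return agent + action


def interface_code(agent: float) -> float:
    """Compress the agent state into a bounded scalar (toy choice: clipping)."""
    return max(-1.0, min(1.0, agent))


def background_step(background: float, code: float) -> float:
    """The background is a function of its own state and the code only."""
    return 0.5 * background + code


def step(state: FactoredState, action: float) -> FactoredState:
    agent = agent_step(state.agent, action)
    code = interface_code(agent)
    background = background_step(state.background, code)
    return FactoredState(agent, background, code)
```

In this toy, two different actions that produce the same interface code are indistinguishable to the background, which is the property that makes a background module reusable across agents.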

World modeling through Lie Action offers another latent-control interface. It encodes each frame into object slots and models action in a continuous Lie-group-structured latent space, with composition of actions enforced by matrix multiplication (Hayashi et al., 13 Mar 2025). Olaf-World addresses a related problem at the level of latent action identifiability: Seq$\Delta$-REPA aligns the sequence-integrated latent action with a frozen video encoder’s effect direction, so that latent actions acquire a shared coordinate system across contexts (Jiang et al., 10 Feb 2026). Taken together, these systems suggest a spectrum in which interfaces range from explicit geometry and typed objects to low-dimensional, semantically aligned latent control variables.

3. Action channels and control formats

A world-model interface is also defined by how actions enter the system. World-in-World formalizes this point by requiring a standardized action API for heterogeneous models. Its controller supports three principal control formats: text prompts for text-to-video models, camera trajectories for trajectory-conditioned models, and low-level action vectors for action-conditioned models (Zhang et al., 20 Oct 2025). Observations are image tensors or panoramas, while actions may be discrete integers or continuous vectors for a 7-DoF gripper (Zhang et al., 20 Oct 2025). The interface therefore abstracts over model family without collapsing action semantics.
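One way to picture a controller that serves three control formats is a tagged union with a single translation function. The sketch below is an assumption-laden illustration: the class names, the `(x, y, yaw)` pose layout, the two-element action vector, and the `model_kind` strings are all hypothetical and do not reproduce the World-in-World API.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class TextPrompt:
    text: str


@dataclass(frozen=True)
class CameraTrajectory:
    # Sequence of (x, y, yaw) camera poses; the layout is an assumption.
    poses: Tuple[Tuple[float, float, float], ...]


@dataclass(frozen=True)
class ActionVector:
    values: Tuple[float, ...]


Control = Union[TextPrompt, CameraTrajectory, ActionVector]


def to_control(forward_m: float, model_kind: str) -> Control:
    """Translate one agent-level 'move forward' step into a model-specific control."""
    if model_kind == "text":
        return TextPrompt(f"move forward {forward_m:.1f} meters")
    if model_kind == "camera":
        return CameraTrajectory(((0.0, 0.0, 0.0), (forward_m, 0.0, 0.0)))
    if model_kind == "action":
        return ActionVector((forward_m, 0.0))
    raise ValueError(f"unknown model kind: {model_kind}")
```

The agent-level action stays the same across model families; only its encoding changes, which is what allows heterogeneous world models to be compared under one closed-loop protocol.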

ActWorld extends the action channel beyond navigation-centric control. It conditions chunk-autoregressive video generation on a starting frame, per-chunk semantic captions, low-level keyboard/mouse controls, high-level action commands such as pick up or pour, and hierarchical memory (Xiong et al., 16 Jun 2026). Its low-level camera control vocabulary is formed from combinations of keyboard and mouse inputs, yielding a discrete set of combos, while the high-level action vocabulary is a fixed set of classes (Xiong et al., 16 Jun 2026). The same user action enters two conditioning streams: a symbolic stream encoded by a frozen UMT5 and, for camera motion, a geometric stream encoded as a Plücker-ray FiLM signal (Xiong et al., 16 Jun 2026). The interface is therefore multimodal even before prediction begins.

GUI-oriented work exposes still other control regimes. ViMo models a discrete app action space that includes taps, swipes, and typing, and learns an approximate transition over GUI images (Luo et al., 15 Apr 2025). “How Mobile World Model Guides GUI Agents?” compares four downstream-facing interfaces for the same transition prediction problem: delta text, full text, diffusion-based images, and renderable code (Xu et al., 11 May 2026). In that formulation, the interface is not only the action representation but also the representation of the predicted future state. A short natural-language delta can serve as a robust semantic feedback channel, whereas code can be rendered back into an image and also inspected as structured output (Xu et al., 11 May 2026).

Robotic interfaces introduce another layer of heterogeneity. The World-Language-Action model takes an instruction, recent visual observations, proprioceptive state, and a memory buffer of past subtasks, then autoregressively predicts a window of textual subtasks, a compact physical-dynamics vector, and a multi-step action chunk (Yang et al., 4 Jun 2026). The World Expert imagines a future static frame, and the Action Expert synthesizes the action chunk (Yang et al., 4 Jun 2026). A plausible implication is that world-model interfaces are becoming increasingly factorized: one channel carries semantic intent, another physical dynamics, and another executable control.

4. Planning, querying, and intervention

The practical importance of a world-model interface becomes most visible in closed-loop planning. World-in-World organizes planning into a three-stage cycle at each time step: Proposal, Simulation, and Revision and Execution (Zhang et al., 20 Oct 2025). A proposal policy samples multiple candidate action-sequence plans, the world model rolls out predicted futures, and a revision policy or score-and-select operator chooses the best plan.

This interface is intentionally model-agnostic: any model that can accept the controller’s control input and return simulated observations can participate in closed-loop planning (Zhang et al., 20 Oct 2025).
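The propose, simulate, and select cycle can be written as a few lines of Python that treat the world model as an opaque callable. This is a generic sketch of the pattern rather than the benchmark's implementation: the function name `plan_step`, the scalar observation type, and the three callables are assumptions introduced for the example.

```python
from typing import Callable, List, Sequence

Plan = Sequence[float]


def plan_step(
    obs: float,
    propose: Callable[[float], Plan],
    simulate: Callable[[float, Plan], float],
    score: Callable[[float], float],
    num_candidates: int,
) -> Plan:
    """One planning cycle: propose plans, imagine futures, keep the best plan."""
    # Proposal: sample candidate action sequences from the current observation.
    candidates: List[Plan] = [propose(obs) for _ in range(num_candidates)]
    # Simulation: the world model predicts the outcome of each candidate.
    futures = [simulate(obs, plan) for plan in candidates]
    # Revision: score-and-select over the imagined futures.
    best_index = max(range(len(candidates)), key=lambda i: score(futures[i]))
    return candidates[best_index]
```

Nothing in the loop depends on how `simulate` is implemented, so a video generator, a latent dynamics model, or a physics engine can be swapped in as long as it accepts the control input and returns something `score` can evaluate.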

ViMo embeds the same logic into GUI agents. Given a planning horizon, the agent rolls out candidate action sequences and scores them against the goal via an LLM or learned reward model, with beam search used as a concrete planning procedure (Luo et al., 15 Apr 2025). Here the interface must preserve sufficient visual detail for action readiness, not merely coarse semantic plausibility.

Stable-worldmodel-v1 exposes planning even more explicitly. Its CEMSolver, MPPI, SGD, and Adam solvers all consume a world-model instance plus a PlanConfig, and expose solve(initial_z, goal, **kwargs) → action_sequence (Maes et al., 9 Feb 2026). Because world.evaluate and world.evaluate_from_dataset are also standardized, the same interface supports online MPC, offline zero-shot evaluation, and controlled factor-of-variation studies (Maes et al., 9 Feb 2026).
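To show how a solver can consume nothing but a transition function and a cost function, here is a minimal cross-entropy-method sketch in pure Python. It is not the library's CEMSolver: the function name `cem_solve`, its keyword arguments, the scalar latent, and the default hyperparameters are assumptions chosen to keep the example self-contained.

```python
import random
import statistics
from typing import Callable, List


def cem_solve(
    initial_z: float,
    goal: float,
    transition: Callable[[float, float], float],
    get_cost: Callable[[float, float], float],
    *,
    horizon: int = 3,
    samples: int = 64,
    elites: int = 8,
    iterations: int = 10,
    seed: int = 0,
) -> List[float]:
    """Return the lowest-cost action sequence found by a basic CEM search."""

    def plan_cost(plan: List[float]) -> float:
        z, total = initial_z, 0.0
        for action in plan:
            z = transition(z, action)
            total += get_cost(z, goal)
        return total

    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    best_plan = list(mean)
    best_cost = plan_cost(best_plan)

    for _ in range(iterations):
        # Sample plans from a per-step Gaussian and rank them by imagined cost.
        population = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(samples)
        ]
        population.sort(key=plan_cost)
        elite = population[:elites]
        if plan_cost(elite[0]) < best_cost:
            best_plan, best_cost = elite[0], plan_cost(elite[0])
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean([p[t] for p in elite]) for t in range(horizon)]
        std = [statistics.pstdev([p[t] for p in elite]) for t in range(horizon)]

    return best_plan
```

With the toy dynamics `transition = lambda z, a: z + a` and cost `get_cost = lambda z, g: (z - g) ** 2`, the solver searches for actions that move the latent toward the goal; swapping in MPPI or gradient-based optimization would change only the update inside the loop, not the interface.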

Some systems make intervention itself part of the interface. VDAWorld routes an external query, such as a request to apply a force to object #3 at a given time, into generated Python, for example by injecting self.engine.apply_force(body_id=3, force=(5,0,0)) inside update_simulation (O'Mahony et al., 11 Dec 2025). Because the world model is exposed as transparent Python code, arbitrary interventions (changing gravity, editing object shapes, redefining cellular automaton rules) can be encoded as small edits to the simulator, and re-running the edited code immediately yields new simulation results (O'Mahony et al., 11 Dec 2025). WorldAct provides a comparable intervention surface in 3D through APIs such as get_objects(), set_pose(obj_id, pose), remove(obj_id), and insert(obj_asset, pose), enabling pick-and-place, rearrangement, or navigation tasks in reconstructed scenes (Hu et al., 15 May 2026).
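An intervention surface with the four calls named above can be mocked with an in-memory scene. The sketch below only mirrors the method names get_objects, set_pose, remove, and insert; the `EditableScene` class, the integer object ids, the `(x, y, z)` pose tuple, and the `pose_of` helper are hypothetical and say nothing about how WorldAct stores Gaussians or meshes.

```python
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float]


class EditableScene:
    """Toy scene store exposing an edit API (illustrative only)."""

    def __init__(self) -> None:
        self._objects: Dict[int, Dict[str, object]] = {}
        self._next_id = 0

    def insert(self, obj_asset: str, pose: Pose) -> int:
        """Add an asset at a pose and return its new object id."""
        obj_id = self._next_id
        self._next_id += 1
        self._objects[obj_id] = {"asset": obj_asset, "pose": pose}
        return obj_id

    def get_objects(self) -> List[int]:
        return sorted(self._objects)

    def set_pose(self, obj_id: int, pose: Pose) -> None:
        self._objects[obj_id]["pose"] = pose

    def remove(self, obj_id: int) -> None:
        del self._objects[obj_id]

    def pose_of(self, obj_id: int) -> Pose:
        """Hypothetical read helper used to check the result of an edit."""
        return self._objects[obj_id]["pose"]  # type: ignore[return-value]


# Pick-and-place expressed as two edits: insert a mug, then move it.
scene = EditableScene()
mug = scene.insert("mug", (0.0, 0.0, 0.0))
scene.set_pose(mug, (1.0, 0.0, 0.5))
```

Exposing edits as explicit calls, rather than as prompts to a renderer, is what lets a downstream agent treat the reconstructed scene as a manipulable environment.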

Web World Models provide a different intervention model: actions deterministically update the physics state via ordinary web code, after which imagination is regenerated from the updated typed state (Feng et al., 29 Dec 2025). This suggests that the same high-level interface pattern—observe, update structured state, imagine consequences, select action—can be realized either by learned latent transitions or by deterministic code-backed world engines.

5. Memory, modularity, and reusable protocols

Long-horizon behavior requires interfaces for memory as well as transition. ActWorld argues that interactive world models suffer from an action-forgetting pathology when recency-biased compression discards event-transition frames (Xiong et al., 16 Jun 2026). Its response is a hierarchical action-aware memory interface with three components: Event-Aware Frame Re-assignment, an Action-Conditioned History Amplifier, and a Persistent Action-Aware Memory Bank (Xiong et al., 16 Jun 2026). Each past chunk is assigned an importance score, and the persistent bank stores a bounded number of tokens with pinning of interaction frames (Xiong et al., 16 Jun 2026). Memory here is not a generic cache; it is an interface specialized to causal interaction state.
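A bounded memory with pinning can be sketched as follows. This is a generic illustration and not ActWorld's memory bank: the importance value is supplied by the caller rather than computed from the paper's score, capacity is counted in entries rather than tokens, and the eviction rule (drop the least important unpinned entry) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    chunk_id: int
    importance: float
    pinned: bool = False  # interaction frames are pinned and evicted last


class ActionAwareMemory:
    """Fixed-capacity store that protects pinned entries from eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries: List[MemoryEntry] = []

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)
        while len(self.entries) > self.capacity:
            unpinned = [e for e in self.entries if not e.pinned]
            # Fall back to all entries only if everything is pinned.
            pool = unpinned or self.entries
            victim = min(pool, key=lambda e: e.importance)
            self.entries.remove(victim)

    def chunk_ids(self) -> List[int]:
        return [e.chunk_id for e in self.entries]
```

A plain recency-based cache would discard the oldest entry regardless of content; pinning is the mechanism that keeps an early interaction event available many chunks later.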

BRICKS-WM treats modularity itself as an interface problem. After source-task training, it freezes the background RSSM core and reuses it across agents, re-initializing only the Agent RSSM, the interface policy, the agent slot query, and task-specific heads (Zhang et al., 15 Jun 2026). To accommodate protocol shifts in the interface code, it inserts a zero-initialized residual adapter, so the frozen background initially sees the same interface it was trained on (Zhang et al., 15 Jun 2026). Proposition 4.6 states that if the conditional interface distribution matches between source and target agents, the frozen background dynamics produce valid transitions in the new setting (Zhang et al., 15 Jun 2026). Reusability is thus reduced to protocol matching.
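The reason zero initialization preserves the original interface can be seen in a scalar toy. The sketch assumes a residual adapter of the form "input plus a learned correction", with the correction here being a single affine map; the class name and parameterization are illustrative and are not taken from the paper.

```python
class ResidualAdapter:
    """Scalar residual adapter: output = code + (weight * code + bias)."""

    def __init__(self) -> None:
        # Zero initialization makes the correction vanish at the start.
        self.weight = 0.0
        self.bias = 0.0

    def __call__(self, code: float) -> float:
        return code + (self.weight * code + self.bias)


adapter = ResidualAdapter()
unchanged = adapter(0.75)   # identical to the input before any training

adapter.weight = 0.5        # stand-in for a learned update
shifted = adapter(0.75)     # the adapter now rescales the interface code
```

Before any update the adapter is an exact identity, so the frozen downstream module receives the same codes it was trained on, and any later deviation is attributable to the adapter's learned parameters alone.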

Web World Models approach modularity through typed interfaces and deterministic generation. Physics objects and imagination objects are defined in TypeScript or JSON Schema, runtime validation enforces structural validity, and hash-based seed pinning yields object permanence without storing every state explicitly (Feng et al., 29 Dec 2025). This is a software-engineering answer to the same problem addressed by latent bottlenecks in BRICKS-WM: how to keep a world open-ended while preserving stable contracts.
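Hash-based seed pinning can be illustrated with the standard library: derive a seed from an object's stable identity, then use it to drive generation so that revisiting the object reproduces the same content. The coordinate-based key, the `BIOMES` list, and the returned fields are assumptions, and a seeded random generator stands in for the generative model.

```python
import hashlib
import random
from typing import Dict

BIOMES = ["forest", "desert", "tundra", "reef"]


def seed_for(world_id: str, x: int, y: int) -> int:
    """Derive a stable 64-bit seed from the object's identity."""
    key = f"{world_id}:{x}:{y}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def imagine(world_id: str, x: int, y: int) -> Dict[str, object]:
    """Regenerate a location's content on demand instead of storing it."""
    rng = random.Random(seed_for(world_id, x, y))
    return {"biome": rng.choice(BIOMES), "elevation": rng.randint(0, 100)}


first_visit = imagine("demo-world", 3, -2)
second_visit = imagine("demo-world", 3, -2)  # same content, nothing was stored
```

The trade-off is that permanence holds only as long as the generation procedure and the key format stay fixed; changing either silently changes every regenerated object.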

OpenWorldLib and stable-worldmodel-v1 generalize this concern into reusable research infrastructure. OpenWorldLib standardizes BaseOperator, BaseReasoning, BaseSynthesis, BaseRepresentation, BaseMemory, and BasePipeline, with dynamic loading from manifests and uniform Dict[str,Tensor]-style message passing (Team et al., 6 Apr 2026). stable-worldmodel-v1 standardizes environment wrappers, policy attachment, data collection, factors of variation, and planner/model integration under a single Python-based API (Maes et al., 9 Feb 2026). A plausible implication is that, as world models diversify, software-level interface stabilization becomes a prerequisite for meaningful comparison and composition.

6. Evaluation, benchmarks, and recurrent design tensions

Evaluation results across recent systems indicate that interface design affects downstream success as much as raw generative quality. World-in-World makes this point most directly: its benchmark prioritizes task success over open-loop visual quality and reports three central findings: visual quality alone does not guarantee task success, and controllability matters more; scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and allocating more inference-time compute substantially improves closed-loop performance (Zhang et al., 20 Oct 2025). It also fits an embodied scaling law for Active Recognition success rate (Zhang et al., 20 Oct 2025).

ViMo evaluates its GUI interface by GUI consistency via DINO-feature cosine similarity, instructional accuracy by an LLM judge, and action readiness; it reports a harmonic-mean score and a relative gain over baselines (Luo et al., 15 Apr 2025). In multi-step trajectory synthesis it is compared against the best vision baseline, and augmenting T3A and M3A with ViMo improves single-step action accuracy for both (Luo et al., 15 Apr 2025). These results tie interface quality to agent utility rather than to screenshot realism alone.

WorldAct similarly evaluates whether decomposition interfaces support interaction without collapsing scene quality. On its six-scene benchmark it reports overall Interactable Object Recall against a variant without the agent, while world-level fidelity in the user study drops only marginally overall and object-level quality rises (Hu et al., 15 May 2026). VDAWorld is evaluated with PhysicsIQ metrics (Spatial IoU, Weighted-Spatial IoU, and Spatiotemporal IoU) alongside other measures and, for Game of Life, with per-frame scores (O'Mahony et al., 11 Dec 2025). These evaluations reflect the fact that simulator-facing interfaces are judged by physical and logical consistency as much as by appearance.

ActWorld evaluates an interaction-heavy video interface on I-Bench and reports Subject-Consistency, Background-Consistency, Motion-Smoothness, IF, Succ. (%), ≥2 (%), and Acc_full (%) (Xiong et al., 16 Jun 2026). In the user study it ranks first on Action-Following, Key/Mouse-Following, and Overall Quality (Xiong et al., 16 Jun 2026). The evaluated object is not merely a generated video but a control-sensitive interface with both navigation and object interaction.

GUI-agent studies expose a different tension: high-fidelity structured output is not always the most robust interface at execution time. “How Mobile World Model Guides GUI Agents?” reports Overall scores for delta text and full text on MobileWorldBench, while renderable code performs strongly in-distribution on Code2WorldBench (Xu et al., 11 May 2026). On AndroidWorld, adding delta-text guidance improves end-to-end success rate for Qwen3-VL-8B (M3A), for Gemini-3-Flash, and for GPT-5.4 (Xu et al., 11 May 2026). The same study also reports very low mean action entropy, which limits the gains of posterior self-reflection (Xu et al., 11 May 2026). This suggests that a world-model interface may function more effectively as prior perception or training supervision than as a universal post-hoc verifier.

The survey literature characterizes these findings as part of a broader evaluation problem: perceptual metrics, task-performance metrics, and physics or consistency benchmarks often diverge, and fragmented evaluation remains a persistent challenge (Zidan et al., 28 May 2026). Across current systems, the recurring lesson is that world-model interfaces are not neutral wrappers around a learned dynamics model. They determine which variables are controllable, which interventions are expressible, which histories remain accessible, and which downstream tasks can treat the model as a usable world rather than as a visually impressive but operationally opaque predictor.