Test-time Mixture of World Models (TMoW)
- Test-time Mixture of World Models (TMoW) is an adaptive framework that assembles pre-trained, domain-specific world models at inference to enhance generalization in novel environments.
- It employs prototype-based routing and feature-level fusion mechanisms, such as compound attention and sigmoid gating, to dynamically integrate expert knowledge.
- Empirical results demonstrate significant improvements in zero-shot success rates and continual learning on benchmarks like VirtualHome, ALFWorld, RLBench, and CALVIN.
Test-time Mixture of World Models (TMoW) is a framework for adaptive embodied agents operating in novel or dynamic domains. Rather than relying on a single monolithic world model, TMoW assembles a sparse mixture of multiple pre-trained, domain-specific world models—or “experts”—at inference. These models are composed in real time based on the agent’s current observations, utilizing prototype-based retrieval, dynamic routing, and feature-level fusion mechanisms. TMoW enables robust zero-shot and few-shot generalization to new environments, as demonstrated across benchmarks including VirtualHome, ALFWorld, RLBench, and CALVIN (Yoo et al., 4 Sep 2025, Jang et al., 30 Jan 2026, Shang et al., 26 Sep 2025).
1. Architectural Foundations
TMoW treats each world model as an expert corresponding to a source domain. Each expert may comprise a full model or an adapter module augmenting a base policy network such as a transformer or LLM. The central design involves three main architectural innovations:
- Prototype-based expert routing: At each step or layer, the agent’s trajectory or observation is encoded into a latent representation and compared against stored per-domain prototype embeddings, facilitating fast retrieval and weighting of relevant experts.
- Feature-level mixture or gating: Retrieved world models’ intermediate features are fused, either through multi-stage attention (world-wise compound attention; Yoo et al., 4 Sep 2025) or spatially varying gating (latent-to-pixel mixture; Shang et al., 26 Sep 2025).
- Test-time and continual learning extensions: The routing prototypes and mixtures can be refined online, and new domain-specific experts can be synthesized via distillation from mixtures of existing experts, based on few-shot demonstrations (Jang et al., 30 Jan 2026).
This modularity allows agents to dynamically recompose and update their internal models without full retraining, enabling continual adaptation.
2. Prototype-based Retrieval and Routing
At test time, TMoW encodes recent experience (e.g., a trajectory of observations and actions) through a representation extractor, producing a latent embedding (Yoo et al., 4 Sep 2025, Jang et al., 30 Jan 2026). For each pre-trained world model, a set of prototype vectors is constructed offline by clustering object- or state-level embeddings from that model’s training data in its source domain.
TMoW performs nearest-prototype or cloud-to-cloud distance evaluation to select the K most relevant world models. This retrieval is performed at various abstraction levels—spanning object- to scene-level features—by comparing embeddings at different transformer layers (Jang et al., 30 Jan 2026). The result is a sparse set of world models, each associated with a normalized routing weight computed via temperature-scaled softmax over similarity scores.
Routing in TMoW is adaptive: as new domains are encountered, prototypes are refined at test time by interpolating with newly observed feature vectors, ensuring that the mixture remains aligned with the encountered data manifold. Routing weights can be learned per layer, enabling multi-granular and spatially aware expert composition.
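The retrieval and refinement steps above can be sketched in NumPy. The function names (`route`, `refine_prototype`), the max-cosine-similarity scoring, and the specific temperature and interpolation values are illustrative assumptions, not details from the papers:

```python
import numpy as np

def route(z, prototypes, k=3, temperature=0.1):
    """Select the k most relevant experts for a trajectory embedding z.

    z          : (d,) query embedding of the recent trajectory
    prototypes : list of (m_i, d) arrays, one prototype set per expert
    Returns (indices, weights): top-k expert ids and their
    temperature-scaled, softmax-normalized routing weights.
    """
    z = z / np.linalg.norm(z)
    sims = []
    for P in prototypes:
        # Similarity of z to an expert = max cosine similarity over its prototypes.
        P = P / np.linalg.norm(P, axis=1, keepdims=True)
        sims.append(np.max(P @ z))
    sims = np.array(sims)

    top = np.argsort(sims)[::-1][:k]       # k nearest experts
    logits = sims[top] / temperature       # temperature-scaled similarities
    w = np.exp(logits - logits.max())      # numerically stable softmax
    return top, w / w.sum()

def refine_prototype(p, z_new, alpha=0.05):
    """Test-time refinement: interpolate a prototype toward a new feature."""
    return (1 - alpha) * p + alpha * z_new
```

At deployment, `refine_prototype` would be applied to the prototype nearest each newly observed embedding, keeping the prototype set aligned with the current data manifold.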
3. Mixture and Fusion Mechanisms
TMoW employs several approaches to fuse the retrieved experts' predictions or intermediate features, depending on the concrete architecture:
- World-wise compound attention (Yoo et al., 4 Sep 2025): Fuses the intermediate-layer outputs of the K retrieved world models and the base LLM policy using two cascaded cross-attention modules. Projected features from each world model are stacked as the key-value set, attended by the base model’s current hidden state. The fused world-model feature is then injected back into the LLM via another attention block, effectively aligning and mixing expert knowledge at both “world” and “reasoning” stages.
- Sigmoid gating over concatenated latent and pixel-space features (Shang et al., 26 Sep 2025): In hybrid latent–pixel settings, e.g., robotics with high-dimensional observations, the latent model provides motion-aware priors, while the pixel model offers fine-grained spatial detail. Features from both are linearly projected, concatenated, and passed through a lightweight gating network, producing a per-location mixture coefficient:

  g = σ(W [h_lat; h_pix] + b),  h_fused = g ⊙ h_lat + (1 − g) ⊙ h_pix,

  where σ is the elementwise sigmoid, [·; ·] denotes concatenation, and ⊙ is elementwise multiplication.
The resulting fused features condition the agent’s action policy (e.g., a diffusion-based decoder), integrating multiple world perspectives.
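The gating fusion above admits a minimal NumPy sketch. The shapes and the single-layer gating network are illustrative assumptions about the lightweight gate, not the exact MoWM architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_latent, h_pixel, W, b):
    """Per-location sigmoid gating over concatenated latent/pixel features.

    h_latent, h_pixel : (L, d) feature maps (L spatial locations, d channels)
    W                 : (1, 2*d) gating weights; b : scalar bias
    Returns the fused (L, d) features g * h_latent + (1 - g) * h_pixel.
    """
    h = np.concatenate([h_latent, h_pixel], axis=-1)  # (L, 2d)
    g = sigmoid(h @ W.T + b)                          # (L, 1) mixture coefficient
    return g * h_latent + (1.0 - g) * h_pixel
```

Because g lies in (0, 1), each fused feature is a convex combination of the corresponding latent and pixel features, so the fusion can emphasize motion priors or spatial detail per location without leaving the span of either.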
4. Test-Time Inference and Expert Expansion
TMoW's test-time algorithm proceeds as follows: at each step, the recent observations and actions form a trajectory, which is encoded into a latent embedding. The K nearest experts (by prototype distance) are retrieved from the world model bank. The composite fusion module (compound attention or gating) produces a mixed hidden representation or action distribution, from which the action is sampled and executed (Yoo et al., 4 Sep 2025, Jang et al., 30 Jan 2026).
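One inference step of this loop can be sketched end to end. Every component here (the encoder, expert interfaces, and fusion operator) is a stand-in for the real modules, passed as callables to keep the sketch self-contained:

```python
import numpy as np

def tmow_step(traj, encode, prototypes, experts, fuse, k=3, temperature=0.1):
    """One TMoW inference step: encode the trajectory, retrieve the top-k
    experts by prototype similarity, and fuse their features.

    traj       : recent observation/action history (any format encode accepts)
    encode     : callable mapping traj -> (d,) embedding
    prototypes : list of (m_i, d) prototype arrays, one per expert
    experts    : list of callables mapping embedding -> feature vector
    fuse       : callable mapping (features, weights) -> action distribution
    """
    z = encode(traj)                                       # trajectory embedding
    sims = np.array([np.max(P @ z) for P in prototypes])   # per-expert similarity
    top = np.argsort(sims)[::-1][:k]                       # top-k experts
    logits = sims[top] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # routing weights
    feats = [experts[i](z) for i in top]                   # expert features
    return fuse(feats, w)                                  # mixed output
```

A dummy run wires in trivial components, e.g. `encode = lambda traj: traj.mean(axis=0)` and `fuse = lambda feats, w: sum(wi * f for wi, f in zip(w, feats))`; in the full system, `fuse` would be the compound-attention or gating module.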
If a novel domain is encountered for which existing models are insufficient, TMoW can synthesize a new domain expert by distilling a weighted mixture of the closest existing adapters. This mixture is initialized according to the top-K routing weights and fine-tuned on few-shot demonstrations using teacher-forcing loss. The new adapter and its prototype embeddings are appended to the expert bank, facilitating continual expansion without catastrophic forgetting (Jang et al., 30 Jan 2026).
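The initialization step of this expansion can be sketched as a routing-weighted average of the closest adapters' parameters; this averaging is one plausible reading of "initialized according to the top-K routing weights," and the subsequent fine-tuning with the teacher-forcing distillation loss is omitted here:

```python
import numpy as np

def init_new_expert(adapter_params, routing_weights):
    """Initialize a new adapter as the weighted mixture of the top-k adapters.

    adapter_params  : list of dicts {param_name: ndarray}, one per retrieved adapter
    routing_weights : (k,) normalized weights from the router
    Returns a dict of parameters for the new adapter, to be fine-tuned on
    few-shot demonstrations before being appended to the expert bank.
    """
    new = {}
    for name in adapter_params[0]:
        new[name] = sum(w * p[name] for w, p in zip(routing_weights, adapter_params))
    return new
```

Because the new adapter starts near the convex hull of its parents, few-shot fine-tuning only needs to correct the residual domain gap rather than learn the dynamics from scratch.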
Removal of experts (un-implanting) is immediate, performed by omitting a model from the mixture at the next time step (Yoo et al., 4 Sep 2025).
5. Empirical Results and Ablations
TMoW demonstrates superior adaptability and sample efficiency on challenging embodied benchmarks. On VirtualHome (16 seen/62 unseen tasks, 20 scenes), TMoW achieves 80.16% zero-shot SR on unseen domains, outperforming SayCanPay by over 30 percentage points (49.53%) and reducing average pending steps. Similar gains (20–30 percentage points in SR, multi-step reductions in Pending Steps) are observed on ALFWorld, RLBench, and real-world robotic setups (Jang et al., 30 Jan 2026, Yoo et al., 4 Sep 2025).
Ablation studies in (Jang et al., 30 Jan 2026) indicate:
| Variant | Zero-shot SR (%) | Notes |
|---|---|---|
| Single expert (K=1) | 65.43 | Lower diversity, fails on novel cases |
| Multi-granular (object+scene) prototypes | 80.74 | Best overall, needs full hierarchy |
| No prototype refinement | 73.30 | Lags due to drift in feature space |
| Distilled expert expansion (few-shot) | 81.56 | +18.39pp over training from scratch |
In MoWM on CALVIN, simple gating-based fusion surpasses both pixel- or latent-only baselines and cross-attention fusion, with marked gains on long-horizon manipulation tasks (Shang et al., 26 Sep 2025).
6. Insights, Limitations, and Practicalities
The main drivers of TMoW’s cross-domain robustness are:
- Selective retrieval and sparsity: Only the most relevant domain knowledge is activated for each episode, minimizing negative transfer.
- Compound attention and mixture routing: Dynamic reweighting aligns partial similarities across domains and tasks.
- Prototype refinement: Continual update of routing prototypes at deployment is crucial for handling distribution shift and rare events (Jang et al., 30 Jan 2026).
- Expert distillation: Rapid few-shot expansion to new domains leverages compositionality and avoids costly from-scratch learning.
Key limitations include:
- Computational and memory overhead scale with the number of retrieved experts (K) and the size of the expert bank (N).
- Dependence on sufficient base coverage by pre-trained world models; inadequate diversity in the model bank limits adaptation.
- The base reasoning model (e.g., LLM) quality directly affects mixture performance, since it anchors the fusion and output layers (Yoo et al., 4 Sep 2025).
- For multi-modal or high-dimensional domains (vision+language+control), synchronization of latent and pixel-level representations and the choice of fusion operator remain critical (Shang et al., 26 Sep 2025).
7. Applications and Extension Pathways
TMoW is applicable wherever rapid adaptation in embodied, dynamic, or open-ended environments is paramount. Practical recommendations for porting TMoW to new domains include pre-training hybrid world models on domain video data, learning only the fusion/policy head per target task, and monitoring gating statistics to investigate the mixture’s focus during inference (Shang et al., 26 Sep 2025). Prototype routing with both object- and scene-level features is essential for generalization, and moderate top-K sparsity (e.g., K=3) achieves the best robustness–diversity trade-off (Jang et al., 30 Jan 2026). TMoW’s modular expansion further enables robust continual learning: as new tasks arise, distilled adapters provide strong data efficiency and mitigate catastrophic forgetting.
Across multiple research groups and experimental setups, TMoW defines a unified, extensible paradigm for test-time model composition in embodied AI, substantially advancing zero- and few-shot adaptation capabilities (Yoo et al., 4 Sep 2025, Shang et al., 26 Sep 2025, Jang et al., 30 Jan 2026).