WorldMM: Multimodal Models & Memory Agents
- WorldMM is a framework of multimodal world models and memory agents that integrate perceptual and symbolic data for dynamic reasoning, acting, and learning.
- It employs modular tokenization and shared-plus-private latent decomposition to reconcile modality and granularity disparities in cross-modal representations.
- Adaptive retrieval from episodic, semantic, and visual memories enables WorldMM to support long-horizon video reasoning, robotic control, and interactive simulation.
WorldMM refers to a class of multimodal world models and memory agents designed for reasoning, acting, and learning in dynamic, heterogeneous environments that contain both perceptual and high-level symbolic information. These systems unify diverse modalities—visual, audio, text, geometry, and more—within structured memory and representation architectures to enable sample-efficient policy learning, robust scene understanding, and long-horizon video reasoning. Recent state-of-the-art WorldMM implementations combine modular tokenization, adaptive memory, unified 3D representations, and iterative cross-modal retrieval, supporting flexible downstream tasks such as robotic control, interactive simulation, and complex video question answering.
1. Core Architectural Principles
WorldMM architectures are characterized by compositional modules tailored for distinct but interleaved roles:
- Multimodal Representation: Raw inputs of arbitrary modality (images, continuous vectors, discrete symbols, categorical grids) are tokenized by per-modality encoders (e.g., a VQ-VAE for images, uniform binning for continuous data, a learned table for symbols). The resulting tokens are concatenated and mapped into a common latent space through modality-specific embedding lookup tables (Cohen et al., 17 Feb 2025).
- World Model: A sequence model, typically a transformer variant (e.g., RetNet+POP or autoregressive GPT), operates on token streams and predicts next-token distributions, rewards, and termination signals. In reinforcement learning contexts, an LSTM-based actor-critic controller consumes latent tokens to issue actions or to reason in "imagination" (Cohen et al., 17 Feb 2025, Zhang et al., 10 Oct 2025).
- Memory and Retrieval: For long-horizon or high-capacity settings, external multimodal memories—episodic (temporal KGs), semantic (evolving KGs), and visual (feature-indexed archives)—are constructed. An adaptive retrieval agent selects among these memories and queries relevant substructures iteratively during inference (Yeo et al., 2 Dec 2025).
This modularity enables scaling to additional modalities and temporal horizons, while allowing fine-grained decoupling of representation, modeling, and retrieval policies.
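The per-modality tokenization and embedding lookup described above can be illustrated with a minimal sketch. It covers only continuous vectors and discrete symbols; the function names, the bin count, the clipping limit, and the randomly initialized tables are all assumptions made for illustration and are not taken from the cited systems.

```python
import math
import random

def symlog(x):
    """Sign-preserving log compression: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

def tokenize_continuous(values, num_bins=16, limit=5.0):
    """Symlog-compress each value, clip to [-limit, limit], then bin uniformly."""
    tokens = []
    for v in values:
        z = max(-limit, min(limit, symlog(v)))
        idx = int((z + limit) / (2 * limit) * num_bins)
        tokens.append(min(idx, num_bins - 1))  # keep z == limit inside the last bin
    return tokens

def tokenize_symbols(symbols, vocab):
    """Map each discrete symbol to its integer id in a (hypothetical) vocabulary."""
    return [vocab[s] for s in symbols]

def make_table(num_tokens, dim, rng):
    """Random table standing in for a learned embedding table."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

def embed_observation(obs, tables, vocab):
    """Tokenize each modality, look tokens up in that modality's own table,
    and concatenate the embeddings into one sequence in a common space."""
    sequence = [tables["vector"][t] for t in tokenize_continuous(obs["vector"])]
    sequence += [tables["symbols"][t] for t in tokenize_symbols(obs["symbols"], vocab)]
    return sequence

rng = random.Random(0)
vocab = {"door": 0, "key": 1, "wall": 2}
tables = {"vector": make_table(16, 4, rng), "symbols": make_table(len(vocab), 4, rng)}
obs = {"vector": [0.0, 250.0, -3.5], "symbols": ["key", "door"]}
sequence = embed_observation(obs, tables, vocab)
print(len(sequence), len(sequence[0]))
```

Each modality keeps its own token ids and its own table, so adding a modality in this sketch only requires a new tokenizer and a new table; the sequence model downstream sees a single stream of same-width vectors.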
2. Multimodal Tokenization and Scene Representation
Tokenization and scene encoding in WorldMM are modality-aware. Key strategies include:
- Modular Multi-modality Tokenization: Each modality is independently encoded, quantized, and converted into discrete token streams. Images are processed by VQ-VAE, with spatial latents quantized to a codebook; continuous vectors are symlog-compressed and uniformly quantized; categorical and grid data are embedded or averaged across channels (Cohen et al., 17 Feb 2025, Zhang et al., 10 Oct 2025).
- Shared-Plus-Private Latent Decomposition: In spatial representations (e.g., MMOne, WorldMirror), each spatial primitive (e.g., a 3D Gaussian) stores a shared latent and a set of modality-specific residues, with orthogonality regularization between them. This enables disentanglement of cross-modal and private cues, improving both expressiveness and transfer (Gu et al., 15 Jul 2025).
- Gradient-Driven Modality Resolution: During training, conflicting gradients across modalities can trigger splitting of multi-modal entities into single-modality components. For each spatial element, if the gradient difference between modalities exceeds a threshold, separate modality-specific copies are formed, each with its own indicator and feature branch (Gu et al., 15 Jul 2025).
- Unified MMTokenizers: For interactive domains (e.g., robot manipulation), MMTokenizers aggregate RGB, depth, and segmentation mask streams into compact, discrete codes. Cross-attention and masking emphasize dynamic regions while controlling token budget (Zhang et al., 10 Oct 2025).
These strategies resolve "property disparity" (differing modality geometries, units) and "granularity disparity" (spatial versus coarse modalities), and are directly extensible to new modalities such as audio or tactile sensors.
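The shared-plus-private decomposition and the gradient-driven split can be sketched in a few lines. This is an illustration under stated assumptions, not the formulation of the cited work: the squared-cosine orthogonality penalty, the L2 norm used as the gradient difference, the threshold value, and the rule that a split copy's feature is the shared latent plus its residue are all hypothetical choices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def orthogonality_penalty(shared, residues):
    """Sum of squared cosine similarities between the shared latent and each
    modality-specific residue; zero when every residue is orthogonal to it."""
    total = 0.0
    for residue in residues.values():
        denom = norm(shared) * norm(residue)
        if denom > 0:
            total += (dot(shared, residue) / denom) ** 2
    return total

def should_split(grad_a, grad_b, threshold):
    """Assumed criterion: split when the per-modality gradients on a primitive
    differ by more than `threshold` in L2 norm."""
    diff = [x - y for x, y in zip(grad_a, grad_b)]
    return norm(diff) > threshold

def split_primitive(primitive):
    """Form one single-modality copy per modality, each carrying its own
    modality indicator and its own feature (assumed: shared + residue)."""
    return [
        {
            "position": primitive["position"],
            "modality": name,
            "feature": [s + r for s, r in zip(primitive["shared"], residue)],
        }
        for name, residue in primitive["residues"].items()
    ]

primitive = {
    "position": [0.0, 1.0, 2.0],
    "shared": [1.0, 0.0],
    "residues": {"rgb": [0.0, 1.0], "thermal": [0.0, -2.0]},
}
print(orthogonality_penalty(primitive["shared"], primitive["residues"]))
if should_split([1.0, 0.0], [-1.0, 0.0], threshold=1.5):
    print(split_primitive(primitive))
```

In this toy primitive both residues are already orthogonal to the shared latent, so the penalty is zero, while the opposing gradients exceed the threshold and produce two single-modality copies at the same position.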
3. Memory-Augmented Reasoning and Retrieval
Long-context reasoning over dynamic or extended video necessitates scalable and adaptive memory modules:
- Episodic Memory: Textual knowledge-graphs indexed at multiple temporal scales (seconds to hours), constructed from captions and fact triplets via LLM prompting. These graphs index factual events, supporting queries at variable durations.
- Semantic Memory: An evolving, coarser high-level knowledge graph updated via triplet extraction, consolidation (removal of outdated/conflicting edges), and aggregation of persistent scene or task knowledge.
- Visual Memory: A feature-indexed archive of segment embeddings (e.g., VLM2Vec) and a dense frame-wise index for direct visual grounding. Supports both similarity search and timestamped retrieval (Yeo et al., 2 Dec 2025).
- Adaptive Retrieval Agent: At inference time, a retrieval agent (implemented as a prompted LLM) selects which memory to query and with what prompt, iterating until it chooses to stop. Retrieval is based on Personalized PageRank, embedding similarity, or direct segment fetch; results are passed to a response agent LLM for answer synthesis (Yeo et al., 2 Dec 2025).
This approach enables dynamic selection of granularities and modalities, addressing complex queries that may span multiple temporal resolutions or require multi-modal evidence.
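The iterative retrieval loop described above can be sketched as follows. The sketch is deliberately simplified and rests on assumptions: `keyword_policy` is a hypothetical rule-based stand-in for the prompted LLM retrieval agent, keyword overlap replaces Personalized PageRank and embedding similarity, and the three memories are plain lists of strings rather than knowledge graphs or feature indexes.

```python
def retrieve(memories, name, query):
    """Return items of the chosen memory that share at least one word with the
    query (a stand-in for graph-based or embedding-based retrieval)."""
    words = set(query.lower().split())
    return [item for item in memories[name] if words & set(item.lower().split())]

def keyword_policy(question, evidence, step):
    """Hypothetical retrieval agent: try episodic, then semantic, then visual
    memory, and stop as soon as any evidence has been collected."""
    order = ["episodic", "semantic", "visual"]
    if evidence or step >= len(order):
        return None  # the agent chooses to stop
    return order[step], question

def answer_with_memory(question, memories, policy, max_steps=5):
    """Iterate: the policy picks a memory and a query, results accumulate as
    evidence, and the loop ends when the policy returns None."""
    evidence, trace = [], []
    for step in range(max_steps):
        decision = policy(question, evidence, step)
        if decision is None:
            break
        name, query = decision
        trace.append(name)
        evidence.extend(retrieve(memories, name, query))
    return evidence, trace

memories = {
    "episodic": ["09:00 user made coffee"],
    "semantic": ["user prefers oat milk"],
    "visual": ["frame_0412 kitchen counter"],
}
evidence, trace = answer_with_memory("oat milk", memories, keyword_policy)
print(trace, evidence)
```

In a full system the collected evidence would then be handed to a response model for answer synthesis; that step is omitted here.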
4. Training Objectives and Optimization
WorldMM models combine modality-specific, cross-modal, and auxiliary objectives:
- VQGAN-based Reconstruction Losses: VQ-VAE or VQGAN encoders reconstruct each modality with modality-adapted loss functions: L1 or L2 for pixels/frames, cross-entropy for masks, perceptual (LPIPS) losses, and adversarial terms as needed.
- Transformer Cross-Entropy: For sequence modeling, cross-entropy is minimized over next-token or masked token prediction, typically restricted to dynamic/object tokens.
- Reward/Return Regression via Classification: Reward and return prediction is framed as classification rather than direct MSE regression, using a softmax over exponentially spaced bins in symlog space, with targets defined by smoothed "half-life Gaussian" distributions and trained with a cross-entropy loss. This yields more stable training for unbounded targets (Cohen et al., 17 Feb 2025).
- Contrastive and Ranking Losses: Alignment of query embeddings and memory segment features uses InfoNCE-style losses. Optional margin ranking can enforce semantic memory consistency if retrieval is end-to-end trained (Yeo et al., 2 Dec 2025).
- Composite Multi-task Losses: When incorporating geometry (WorldMirror), all prediction heads—points, depth, normals, poses, 3D Gaussians—are supervised jointly with task-specialized losses, including cross-view depth consistency and 3DGS rendering consistency (Liu et al., 12 Oct 2025).
This multi-layered loss design enables effective scaling across modalities and tasks, and supports plug-and-play extension to novel sensors and objectives.
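The regression-as-classification objective listed above can be made concrete with a small sketch. It assumes bins spaced uniformly in symlog space (and therefore exponentially in the original scale) and uses a plain Gaussian as the smoothed target; the exact "half-life Gaussian" definition, the bin count, the range, and `sigma` are not given in the text, so those choices are illustrative only.

```python
import math

def symlog(x):
    return math.copysign(math.log1p(abs(x)), x)

def symexp(z):
    """Inverse of symlog."""
    return math.copysign(math.expm1(abs(z)), z)

def make_bins(num_bins=41, limit=10.0):
    """Bin centers spaced uniformly in symlog space over [-limit, limit]."""
    step = 2 * limit / (num_bins - 1)
    return [-limit + i * step for i in range(num_bins)]

def soft_target(value, bins, sigma=0.5):
    """Gaussian-smoothed target distribution centered on symlog(value)."""
    z = max(bins[0], min(bins[-1], symlog(value)))  # clamp to the bin range
    weights = [math.exp(-0.5 * ((b - z) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def cross_entropy(logits, target):
    """Cross-entropy between the soft target and the predicted bin distribution."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))

def decode(logits, bins):
    """Point estimate: expected bin center under the softmax, mapped back by symexp."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return symexp(sum(p * b for p, b in zip(probs, bins)))

bins = make_bins()
target = soft_target(100.0, bins)
logits = [0.0] * len(bins)
print(cross_entropy(logits, target))
```

Because the network predicts bounded logits rather than the raw scalar, very large returns shift probability mass toward outer bins instead of producing large squared errors, which is the stability argument made above.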
5. Empirical Evaluation and Benchmarking
WorldMM methodologies have been empirically validated across diverse domains and benchmarks:
- Planning-free RL Agents: On Atari-100K, WorldMM (Simulus/M³) achieves a human-normalized median score of 0.982—surpassing prior work by over 30%, and becoming the first planning-free world model to reach human-level performance on the benchmark. Similarly, strong gains are observed on DMC Proprioception-500K and the multimodal Craftax-1M, especially when intrinsic motivation, prioritized replay, and regression-as-classification are active (Cohen et al., 17 Feb 2025).
- Multimodal Scene Representation: MMOne delivers 0.4–0.5 dB PSNR improvements versus strong single- or naive-joint baselines in RGB-thermal and RGB-language tasks, while reducing the number of required Gaussians by two-thirds. Adding modalities improves all targets, demonstrating genuine cross-modal synergy (Gu et al., 15 Jul 2025).
- Interactive Robotic Manipulation: iMoWM attains state-of-the-art video prediction quality (e.g., PSNR 23.82 vs. iVideoGPT 23.40 on BAIR), improved absolute relative depth error, and yields +15–20% higher MBRL success rates after 200k steps. The MMTokenizer dramatically reduces per-frame synthesis cost (10 s vs. 860 s per frame) (Zhang et al., 10 Oct 2025).
- 3D World Reconstruction and Novel View Synthesis: WorldMirror demonstrates up to 58% improvement in point map reconstruction across priors, and over 2.4 dB PSNR improvement in novel view synthesis (RealEstate10K), delivering all 3D predictions in a single forward pass (Liu et al., 12 Oct 2025).
- Long-video Reasoning: On five QA benchmarks, WorldMM yields a mean 8.4% accuracy gain over state-of-the-art (WorldMM-GPT: 69.5% vs. HippoRAG: 57.0%), with ablation showing all memory types contribute, and multi-turn retrieval conferring an additional ≈9% gain (Yeo et al., 2 Dec 2025).
A major community benchmark, MMWorld, probes multiple reasoning facets (explanation, counterfactual, prediction, etc.) across seven disciplines. SOTA models (e.g., GPT-4V) achieve only ≈52.3% accuracy, highlighting the open challenge in general-purpose multimodal world modeling (He et al., 2024).
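For readers unfamiliar with the human-normalized median score reported for Atari-100K above, the following sketch shows how such a statistic is conventionally computed: each game's score is rescaled so that a random policy maps to 0 and the human reference maps to 1, and the median is taken across games. The game names and numbers are made up for illustration and are not results from any cited paper.

```python
from statistics import median

def human_normalized(agent, random_score, human):
    """(agent - random) / (human - random): 0 is random play, 1 is human level."""
    return (agent - random_score) / (human - random_score)

def median_human_normalized(scores):
    """Median of per-game human-normalized scores.
    `scores` maps game name -> (agent, random, human) raw scores."""
    return median(human_normalized(*triple) for triple in scores.values())

# Hypothetical raw scores: (agent, random, human)
scores = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (30.0, 10.0, 20.0),
    "game_c": (5.0, 1.0, 9.0),
}
print(median_human_normalized(scores))
```

The median is less sensitive than the mean to a few games with very large normalized scores, which is why both statistics are often reported side by side.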
6. Limitations and Future Directions
Current WorldMM models exhibit several limitations:
- Static Scene Bias: Most scene representation frameworks operate on static or slowly changing environments. Dynamic and highly non-stationary scenes (e.g., crowds, autonomous driving) are underrepresented and require architecture scaling and new training regimes (Liu et al., 12 Oct 2025, Gu et al., 15 Jul 2025).
- Heuristic Hyperparameters: Some decomposition and splitting thresholds (e.g., the gradient-difference threshold) are tuned heuristically. Learning or adapting these thresholds would improve robustness (Gu et al., 15 Jul 2025).
- Memory Construction Overhead: WorldMM’s memory modules often rely on semi-offline LLM pipelines for captioning, triplet extraction, and semantic consolidation, limiting real-time or streaming deployment and raising privacy concerns for personalized long-horizon egocentric traces (Yeo et al., 2 Dec 2025).
- End-to-End Retrieval Training: Retrieval policies in current systems are predominantly prompt-based LLMs rather than learned modules, limiting their optimization for retrieval efficacy and latency (Yeo et al., 2 Dec 2025).
- Modality Conflict and Coverage: While decomposition and disentanglement approaches reduce negative interference, antagonistic gradients and under-optimized modality heads can persist, especially as the number of modalities grows (Gu et al., 15 Jul 2025).
Future research aims at end-to-end retrieval learning, automated semantic consolidation, explicit handling of rapidly varying scenes, and hybrid optimization/fine-tuning pipelines that seamlessly integrate feed-forward and post-hoc refinement (Yeo et al., 2 Dec 2025, Gu et al., 15 Jul 2025, Liu et al., 12 Oct 2025).
7. Benchmarking and Open Research Problems
MMWorld is the preeminent benchmark for multi-discipline, multi-faceted video world modeling. It covers 1,910 videos in 69 subdisciplines and 6,627 QA pairs, evaluating models across explanation, counterfactual, prediction, expert domain knowledge, temporal ordering, attribution, and procedure understanding. Key findings:
- SOTA MLLMs (e.g., GPT-4V) plateau at ≈52.3% accuracy—well above chance but with substantial room for improvement.
- Audio-only and vision-only ablations reveal heterogeneous modality reliance, with models often brittle when a single modality is withheld.
- Domain-specific skills (e.g., medical, finance) remain weak for current models.
- Hallucination, temporal chain-of-thought, and open-ended reasoning are identified as crucial failure modes.
The authors highlight the need for improved temporal reasoning, better integration of expert knowledge bases, richer video–policy pipelines, and community-driven benchmark expansion (He et al., 2024).
In sum, WorldMM denotes a neural architecture and agent paradigm that unifies modular, cross-modal representation with explicit, scalable memory and flexible retrieval to tackle large, complex, and multi-scale perception, reasoning, and policy tasks. Its development and evaluation span RL, robotics, 3D scene understanding, and long-context video reasoning, and establish foundational tools for general-purpose embodied intelligence (Cohen et al., 17 Feb 2025, Gu et al., 15 Jul 2025, Zhang et al., 10 Oct 2025, Liu et al., 12 Oct 2025, Yeo et al., 2 Dec 2025, He et al., 2024).