MMBench2 Visual World Modeling Benchmark

Updated 4 July 2026

MMBench2 is a comprehensive benchmark and dataset for visual world modeling, featuring 65,600 trajectories, 23M frames, and 210 continuous-control tasks across 10 domains.
It integrates full action and reward labels with mixed-quality behaviors and live simulators to enable controlled training and precise hallucination diagnosis.
The benchmark introduces internal predictors and coverage-aware sampling strategies that link offline reconstruction metrics to improved online control performance.

MMBench2 is a large-scale, massively multitask benchmark and dataset for visual world modeling, introduced to support both the training of high-capacity action-conditioned video world models and the systematic analysis and mitigation of hallucination in their rollouts (Hansen et al., 25 Jun 2026). It consists of 65,600 trajectories (23M frames, or 427 hours of RGB video at 224×224 resolution and 15 fps) spanning 210 continuous-control tasks in 10 domains, with ground-truth actions and rewards and live simulators for all tasks. Its stated goals are to understand hallucination in world models at a fine-grained, stage-by-stage level; to develop and validate predictors that tell when and where a world model will hallucinate; and to use those predictors for mitigation through coverage-aware sampling during pretraining and curiosity-based online data collection plus finetuning.

1. Definition and distinguishing characteristics

MMBench2 is positioned as both a dataset and a benchmark protocol for visual world modeling. Compared to prior datasets and benchmarks (offline RL datasets, robot imitation sets, video corpora, and the original MMBench), it is described as the first to jointly provide broad multi-domain continuous-control tasks, mixed-quality behaviors, full action and reward labels, and live environments tightly matched to the dataset. The mixed-quality behaviors explicitly include random, noisy, expert, human, and curiosity-driven behaviors.

This design supports three linked uses. First, it enables tightly controlled training corpora for action-conditioned world models. Second, it supports offline and online probing of coverage gaps. Third, it connects offline hallucination analysis to downstream control performance. A common misconception addressed by the benchmark design is that visually fluent rollouts are necessarily dynamically faithful. In the benchmark’s framing, hallucination refers precisely to cases where outputs remain visually plausible and fluent while becoming decoupled from the ground-truth dynamics.

Component	Value	Role
Trajectories	65,600	Offline corpus
Frames	23M	Training and evaluation data
Video duration	427 hours	Scale of visual coverage
Tasks	210	Multitask scope
Domains	10	Cross-domain diversity
Labels	Actions, rewards	Action-conditioned and control evaluation
Simulators	Live for all tasks	Online collection and MPC evaluation

The benchmark is therefore not limited to static reconstruction quality. It is structured to evaluate whether model fidelity, hallucination diagnostics, and mitigation strategies transfer to control in live environments.

2. Dataset composition, task structure, and splits

The MMBench2 corpus comprises 65,600 trajectories, 427 hours of video at 15 fps, 224×224 RGB observations, and 210 tasks covering 10 domains. Action spaces are continuous, with dimensionality from 1 to 16, and are zero-padded to 16 dimensions with a validity mask. Each timestep exposes a visual observation $o_t \in \mathbb{R}^{224\times 224 \times 3}$, an action $a_t \in \mathbb{R}^{d_a}$ with $1 \le d_a \le 16$, a reward $r_t \in \mathbb{R}$, and a live simulator implementing the transition dynamics

$s_{t+1} \sim p(s_{t+1} \mid s_t, a_t),\quad r_t = r(s_t, a_t).$

The low-dimensional state $s_t$ is available but is not used for world model training; the setup is image-based.
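The padding convention described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's loader: the function name `pad_action` and the choice to place the padding at the trailing positions are assumptions, since the text states only that actions are zero-padded to 16 dimensions with a validity mask.

```python
from typing import List, Sequence, Tuple

MAX_ACTION_DIM = 16  # padded action width stated in the text


def pad_action(action: Sequence[float]) -> Tuple[List[float], List[bool]]:
    """Zero-pad an action to MAX_ACTION_DIM and return (padded, validity_mask).

    Assumption: padding is appended after the valid entries.
    """
    d_a = len(action)
    if not 1 <= d_a <= MAX_ACTION_DIM:
        raise ValueError(f"action dimension must be in [1, {MAX_ACTION_DIM}], got {d_a}")
    n_pad = MAX_ACTION_DIM - d_a
    padded = list(action) + [0.0] * n_pad
    mask = [True] * d_a + [False] * n_pad  # True marks a real action dimension
    return padded, mask


# A 2-dimensional action becomes a 16-dimensional vector with 2 valid entries.
padded, mask = pad_action([0.3, -0.7])
print(len(padded), sum(mask))
```

A model consuming these vectors would typically use the mask to ignore padded dimensions, for example when computing a behavior-cloning loss.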

The 10 domains are DMControl, DMControl Extended, Meta-World, ManiSkill3, MuJoCo, MiniArcade, Box2D, RoboDesk, OGBench, and Continuous Atari (CALE). These domains cover locomotion, dexterous and tabletop manipulation, goal-conditioned navigation, arcade-style games and Atari-like tasks, and physics-driven control. This breadth matters because the dataset is intended to expose both perceptual variability and action-conditioned dynamics variability across heterogeneous environments.

The split structure is central to the benchmark. The 210 tasks are divided into 200 pretraining tasks and 10 held-out “unseen” tasks for transfer and finetuning experiments. The pretraining corpus has approximately 260 episodes per task on average, but episode lengths are highly heterogeneous, ranging from 25 steps to 1000 steps. The resulting frame distribution is heavy-tailed: the top 20 tasks contribute approximately 26% of all frames, while the bottom 20 tasks contribute only 0.7%. This non-uniformity is explicitly treated as central to the argument that coverage gaps drive hallucination.

The benchmark also specifies protocol splits. Pretraining uses ~20M frames from the 200 tasks for tokenizer and dynamics training. Testing uses the remaining ~3M frames from the same 200 tasks for evaluation of reconstructions, rollouts, and hallucination predictors. Transfer and finetuning experiments use 10 “seen” tasks and 10 unseen tasks. For finetuning, new data are collected using expert, random, no-op, curiosity, and human policies, with 50 trajectories per task.

3. Hallucination taxonomy and evaluation protocol

The benchmark formalizes world modeling as a three-stage pipeline:

Encoder: $z_t = \mathrm{Encode}(o_t)$
Dynamics: $\hat z_{t+1} = f_\theta(z_{\le t}, a_{\le t})$
Decoder: $\hat o_{t+1} = \mathrm{Decode}(\hat z_{t+1})$
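The three stages listed above compose into an autoregressive rollout: context frames are encoded once, and each predicted latent is fed back into the dynamics model before being decoded. The sketch below illustrates only that control flow. The `rollout` helper and the toy `encode`, `dynamics`, and `decode` functions are hypothetical stand-ins, not the paper's models.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Latent = List[float]
Action = List[float]


def rollout(
    encode: Callable[[Frame], Latent],
    dynamics: Callable[[Sequence[Latent], Sequence[Action]], Latent],
    decode: Callable[[Latent], Frame],
    context: Sequence[Frame],
    actions: Sequence[Action],
    horizon: int,
) -> List[Frame]:
    """Predict `horizon` frames after `context`.

    actions[i] is the action taken at step i, so predicting `horizon` frames
    needs len(context) + horizon - 1 actions.
    """
    latents = [encode(o) for o in context]
    if len(actions) < len(latents) + horizon - 1:
        raise ValueError("not enough actions for the requested horizon")
    predicted = []
    for _ in range(horizon):
        # z_hat_{t+1} = f(z_{<=t}, a_{<=t}); the prediction is appended so the
        # next step conditions on it (autoregressive rollout).
        z_next = dynamics(latents, actions[: len(latents)])
        latents.append(z_next)
        predicted.append(decode(z_next))
    return predicted


# Toy stand-ins (hypothetical): a scaling tokenizer and additive dynamics.
def encode(o: Frame) -> Latent:
    return [x / 2.0 for x in o]


def decode(z: Latent) -> Frame:
    return [2.0 * x for x in z]


def dynamics(zs: Sequence[Latent], acts: Sequence[Action]) -> Latent:
    return [x + 0.5 * acts[-1][0] for x in zs[-1]]


frames = rollout(encode, dynamics, decode, [[0.0, 1.0]], [[1.0], [1.0]], horizon=2)
print(frames)
```

Because predictions are fed back in, any one-step error is carried into later steps, which is the mechanism behind the multi-step failure mode discussed in this section.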

Within this pipeline, “hallucination” denotes any failure where the output $\hat o_{t+1:t+H}$ is visually plausible and fluent but decoupled from the true dynamics. The paper distinguishes three hallucination modes, each tied to a stage of the pipeline (Hansen et al., 25 Jun 2026).

Perceptual hallucination occurs in the encoder–decoder stage alone, at horizon $H = 0$. The reconstruction of a single observation $o_t$ is already wrong before any rollout dynamics are applied. Examples given include an unseen maze layout reconstructed as a different but plausible layout, or a novel object mapped onto a similar in-distribution object. Formally, if $z_t = \mathrm{Encode}(o_t)$ and $\hat o_t = \mathrm{Decode}(z_t)$, perceptual hallucination is a large perceptual discrepancy between $o_t$ and $\hat o_t$ that changes semantic structure.

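Action-marginalized hallucination occurs in one-step dynamics prediction when the predicted next latent $\hat z_{t+1}$ is essentially insensitive to the input action $a_t$. The benchmark diagnoses this by comparing teacher-forced one-step prediction error using true actions versus batch-shuffled actions. It defines the action shuffle ratio (ASR) as the ratio of the two errors,

$\mathrm{ASR} = \dfrac{\mathcal{E}_{\text{shuffled}}}{\mathcal{E}_{\text{true}}},$

where $\mathcal{E}_{\text{true}}$ and $\mathcal{E}_{\text{shuffled}}$ denote the teacher-forced one-step prediction error under true and batch-shuffled actions, so that a ratio near 1 means shuffling the actions barely changes the prediction. It declares actions “ignored” when the ASR falls below a fixed threshold.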
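The action shuffle diagnostic can be illustrated with a small sketch. It assumes the ratio is taken as shuffled-action error over true-action error (so larger means more action-sensitive) and uses mean squared error in latent space. Both are assumptions, as is the use of a one-position roll of the batch in place of a random permutation, chosen here so that every sample deterministically receives another sample's action.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def action_shuffle_ratio(
    predict: Callable[[Vector, Vector], Vector],
    latents: Sequence[Vector],
    actions: Sequence[Vector],
    targets: Sequence[Vector],
    eps: float = 1e-12,
) -> float:
    """One-step error with shuffled actions divided by error with true actions."""
    # Deterministic stand-in for a batch shuffle: roll the batch by one.
    shuffled = list(actions[1:]) + list(actions[:1])
    n = len(latents)
    err_true = sum(mse(predict(z, a), y) for z, a, y in zip(latents, actions, targets)) / n
    err_shuf = sum(mse(predict(z, a), y) for z, a, y in zip(latents, shuffled, targets)) / n
    return err_shuf / (err_true + eps)


# Toy data (hypothetical): the true next latent is z + a.
latents = [[0.0], [1.0], [2.0], [3.0]]
actions = [[-1.0], [-0.5], [0.5], [1.0]]
targets = [[z[0] + a[0]] for z, a in zip(latents, actions)]


def ignores_action(z: Vector, a: Vector) -> Vector:
    return list(z)  # prediction does not depend on the action


def uses_action(z: Vector, a: Vector) -> Vector:
    return [z[0] + 0.9 * a[0]]  # slightly imperfect but action-sensitive


asr_ignoring = action_shuffle_ratio(ignores_action, latents, actions, targets)
asr_sensitive = action_shuffle_ratio(uses_action, latents, actions, targets)
print(asr_ignoring, asr_sensitive)
```

A model that ignores its action input produces identical predictions under both conditions, so its ratio sits at 1, while an action-sensitive model is hurt by the shuffle and scores well above 1.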

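Scene-diverging hallucination arises in multi-step autoregressive rollout when error accumulation produces physically implausible events. The benchmark compares rollout quality against a trivial baseline that repeats the last true frame over the horizon. For rollout $\hat o_{t+1:t+H}$ and ground truth $o_{t+1:t+H}$, it defines the rollout PSNR gain

$\Delta\mathrm{PSNR} = \mathrm{PSNR}(\hat o_{t+1:t+H},\, o_{t+1:t+H}) - \mathrm{PSNR}(\bar o_{t+1:t+H},\, o_{t+1:t+H}),$

where $\bar o_{t+1:t+H}$ repeats the last true frame $o_t$ at every step, and labels a rollout as scene-diverging when $\Delta\mathrm{PSNR}$ falls below a fixed threshold.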
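The comparison against the repeat-last-frame baseline can be sketched as a rollout PSNR gain. The sketch assumes pixel values in [0, 1] and a per-frame PSNR averaged over the horizon. The averaging scheme and the function names are assumptions, not details taken from the source.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # flattened pixel values in [0, 1]


def psnr(pred: Frame, target: Frame, max_val: float = 1.0, eps: float = 1e-10) -> float:
    """Peak signal-to-noise ratio in dB; eps avoids log of zero for exact matches."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return 10.0 * math.log10(max_val ** 2 / max(mse, eps))


def rollout_psnr_gain(
    rollout: Sequence[Frame], truth: Sequence[Frame], last_true_frame: Frame
) -> float:
    """Mean PSNR of the rollout minus mean PSNR of repeating the last true frame."""
    model = sum(psnr(p, t) for p, t in zip(rollout, truth)) / len(truth)
    baseline = sum(psnr(last_true_frame, t) for t in truth) / len(truth)
    return model - baseline


# Toy scene (hypothetical): brightness rises over two steps.
truth = [[0.2, 0.2], [0.4, 0.4]]
last_true_frame = [0.0, 0.0]
good_rollout = [[0.3, 0.3], [0.5, 0.5]]  # tracks the motion with a small offset

gain = rollout_psnr_gain(good_rollout, truth, last_true_frame)
copy_gain = rollout_psnr_gain([last_true_frame, last_true_frame], truth, last_true_frame)
print(round(gain, 2), copy_gain)
```

A rollout that merely copies the last frame has zero gain by construction, so a negative gain means the model did worse than not predicting at all.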

The benchmark protocol evaluates world models at several levels. Recon PSNR (dB) measures single-frame reconstruction quality. LPIPS is used in tokenizer comparisons. Rollout PSNR gain $\Delta\mathrm{PSNR}$ measures rollout fidelity relative to the repeat-last-frame baseline. ASR measures action sensitivity. Binary hallucination labels are defined by the thresholds above for action-ignored and scene-diverging events. To connect model quality to decision making, the benchmark evaluates downstream control using MPC with CEM, with a fixed planning horizon and replanning every 16 steps, and reports a normalized task score.

4. Internal predictors and the coverage hypothesis

A central contribution associated with MMBench2 is the proposal of three internal signals that can be computed from the world model itself, without labels or extra training, and that correlate strongly with hallucination. The raw signals are normalized by scene motion, defined as the RMS latent change at that step and estimated per task over the dataset or online.

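The first signal is the tokenizer round-trip residual. A predicted latent $\hat z_{t+1}$ is decoded to a frame and re-encoded, and the residual is the distance between the re-encoded latent and the original prediction,

$\big\| \mathrm{Encode}(\mathrm{Decode}(\hat z_{t+1})) - \hat z_{t+1} \big\|,$

normalized by scene motion.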

Its stated intuition is that if the decoded frame is off-manifold for the tokenizer, re-encoding it will push it back toward the latent manifold, producing a large residual. It is therefore associated with perceptual hallucination.
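A minimal sketch of the round-trip idea follows. It assumes the residual is an RMS distance in latent space divided by the scene-motion scale. The exact norm and normalization are assumptions, and the clipping tokenizer is a hypothetical toy in which latents outside [-1, 1] play the role of off-manifold predictions.

```python
import math
from typing import Callable, List

Frame = List[float]
Latent = List[float]


def rms(values: List[float]) -> float:
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def round_trip_residual(
    encode: Callable[[Frame], Latent],
    decode: Callable[[Latent], Frame],
    z_hat: Latent,
    scene_motion: float,
    eps: float = 1e-8,
) -> float:
    """RMS distance between z_hat and Encode(Decode(z_hat)), over scene motion."""
    z_back = encode(decode(z_hat))
    raw = rms([b - a for a, b in zip(z_hat, z_back)])
    return raw / (scene_motion + eps)


# Toy tokenizer (hypothetical): decoding is the identity and encoding clips to
# [-1, 1], so re-encoding pulls out-of-range latents back into range.
def toy_decode(z: Latent) -> Frame:
    return list(z)


def toy_encode(o: Frame) -> Latent:
    return [max(-1.0, min(1.0, x)) for x in o]


on_manifold = round_trip_residual(toy_encode, toy_decode, [0.2, -0.5], scene_motion=0.1)
off_manifold = round_trip_residual(toy_encode, toy_decode, [1.6, -0.5], scene_motion=0.1)
print(on_manifold, off_manifold)
```

The signal needs only the tokenizer and a predicted latent, which is why it can be computed without labels or extra training.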

The second signal is flow instability. The dynamics model is trained as a shortcut flow-matching model that predicts a clean latent through multiple Euler substeps. If $\hat z^{(k)}$ denotes the predicted clean latent at substep $k$, then schematically the signal measures how much $\hat z^{(k)}$ changes from one substep to the next, normalized by scene motion.

Low values indicate rapid convergence to a stable prediction; high values indicate oscillation or drift.

The third signal is inter-seed variance. For fixed context and action, the model runs $N$ independent denoising trajectories with different seeds, producing predictions $\hat z_{t+1}^{(1)}, \dots, \hat z_{t+1}^{(N)}$, and defines the signal as the variance of these predictions across seeds, normalized by scene motion.

Its stated interpretation is epistemic uncertainty: if seeds diverge, multi-step rollouts will fan out.
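The seed-disagreement idea can be sketched as follows. The sampler interface, the number of seeds, and the use of an RMS deviation divided by scene motion (so the result is dimensionless) are all assumptions made for illustration. The source specifies only that independent denoising trajectories with different seeds are compared for a fixed context and action.

```python
import math
import random
from typing import Callable, List

Latent = List[float]


def inter_seed_spread(
    sample_next_latent: Callable[[random.Random], Latent],
    n_seeds: int,
    scene_motion: float,
    eps: float = 1e-8,
) -> float:
    """RMS deviation of sampled next latents around their mean, over scene motion."""
    samples = [sample_next_latent(random.Random(seed)) for seed in range(n_seeds)]
    dim = len(samples[0])
    mean = [sum(s[j] for s in samples) / n_seeds for j in range(dim)]
    variance = sum(
        (s[j] - mean[j]) ** 2 for s in samples for j in range(dim)
    ) / (n_seeds * dim)
    return math.sqrt(variance) / (scene_motion + eps)


# Toy samplers (hypothetical): context and action are fixed, only the seed varies.
def confident_model(rng: random.Random) -> Latent:
    return [0.5, -0.25]  # every seed agrees


def uncertain_model(rng: random.Random) -> Latent:
    return [0.5 + rng.gauss(0.0, 0.3), -0.25 + rng.gauss(0.0, 0.3)]


low = inter_seed_spread(confident_model, n_seeds=4, scene_motion=0.1)
high = inter_seed_spread(uncertain_model, n_seeds=4, scene_motion=0.1)
print(low, high)
```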

For approximately 9000 held-out 24-frame sequences, the Spearman correlation between the rollout PSNR gain and each predictor is reported as strongly negative. As binary classifiers, the reported AUROC values for detecting action-ignored events are 0.887, 0.868, and 0.873 for the three predictors. For detecting scene-diverging events, the corresponding AUROCs are 0.919, 0.939, and 0.934. All three predictors are reported to outperform their raw unnormalized versions, latent scene motion, kNN distance in latent space, and per-task frame count (Hansen et al., 25 Jun 2026).
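For readers unfamiliar with how a continuous predictor is scored as a binary detector, the sketch below computes AUROC directly from its pairwise definition: the probability that a randomly chosen hallucinated sequence receives a higher predictor value than a randomly chosen non-hallucinated one, with ties counted as half. The scores and labels are made-up toy values, not benchmark data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (label 1 = hallucination)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictor values and binary hallucination labels (hypothetical).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auroc(scores, labels))  # 3 of the 4 positive/negative pairs are ordered correctly
```

The quadratic pairwise loop is fine for small evaluations. A rank-based formulation would be preferable at the scale of thousands of sequences.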

These signals are then embedded in a broader data-centric perspective. The paper’s central claim is that hallucination in world models is primarily a data coverage issue. Low-coverage regions of the state–action space are those where the empirical visitation density is small. Figure-based evidence is described for tasks such as point maze, cup catch, and lunar lander, where hallucination predictors cluster around the periphery of the visited state distribution. The article’s interpretation is therefore stage-specific: limited scene diversity degrades perceptual generalization, narrow action coverage encourages collapse toward average transitions, and one-step errors in poorly covered regions compound into scene divergence.

5. Coverage-aware training and curiosity-driven adaptation

MMBench2 is not only a diagnostic benchmark; it is also used to validate mitigation strategies. The first strategy is coverage-aware training, which changes how existing data are sampled. Because the dataset is highly imbalanced across tasks, frame-uniform sampling gives tasks probability proportional to frame count, while task-uniform sampling upweights rare tasks. With $N_k$ frames in task $k$ and $K$ tasks in total, frame-uniform sampling draws every frame with equal probability, so that $p_{\text{frame}}(k) = N_k / \sum_{j} N_j$, whereas task-uniform sampling uses

$p_{\text{task}}(k) = \dfrac{1}{K},$

with the frame then sampled uniformly within task $k$. Explicit loss reweighting was also tested, but sampling rebalancing was found more effective.
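The difference between the two sampling schemes is easiest to see on a deliberately imbalanced toy corpus. The sketch below is illustrative only: the task names and frame counts are invented, and `sample_task_uniform` is a hypothetical helper showing the two-stage draw (task first, then a frame within it).

```python
import random
from typing import Dict, List, Tuple


def frame_uniform_probs(frame_counts: Dict[str, int]) -> Dict[str, float]:
    """Task probability when every frame in the corpus is equally likely."""
    total = sum(frame_counts.values())
    return {task: n / total for task, n in frame_counts.items()}


def task_uniform_probs(frame_counts: Dict[str, int]) -> Dict[str, float]:
    """Task probability when every task is equally likely."""
    k = len(frame_counts)
    return {task: 1.0 / k for task in frame_counts}


def sample_task_uniform(
    frames_by_task: Dict[str, List[int]], rng: random.Random
) -> Tuple[str, int]:
    """Draw a task uniformly, then a frame uniformly within that task."""
    task = rng.choice(sorted(frames_by_task))
    return task, rng.choice(frames_by_task[task])


# Invented toy corpus: one large task and one rare task.
frame_counts = {"large_task": 900, "rare_task": 100}
p_frame = frame_uniform_probs(frame_counts)
p_task = task_uniform_probs(frame_counts)

# Per-frame upweighting of the rare task under task-uniform sampling.
rare_upweight = (p_task["rare_task"] / frame_counts["rare_task"]) / (
    p_task["large_task"] / frame_counts["large_task"]
)
print(p_frame, p_task, rare_upweight)
```

In this toy case each frame of the rare task is drawn nine times as often as a frame of the large task, which is the rebalancing effect the text attributes to task-uniform sampling.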

Starting from a pretrained base model, the study extends training by 30k steps for the tokenizer and/or 30k steps for the dynamics under coverage-aware sampling. The reported variants are Tok ft, Dyn ft, and Both. Relative to the base model, the quantitative changes are as follows. For Recon PSNR, Tok ft yields +0.46 dB, Dyn ft −0.01 dB, and Both +0.44 dB. For ASR, Tok ft yields +0.02, Dyn ft +0.27, and Both +0.29. For rollout PSNR gain, Tok ft yields +0.42 dB, Dyn ft +0.68 dB, and Both +0.88 dB. The three hallucination predictors also improve, with reported changes of −0.20, −0.07, and −0.14. The stated interpretation is that coverage-aware sampling improves perceptual reconstructions, action sensitivity, and multi-step rollout quality simultaneously, with finetuning both tokenizer and dynamics performing best.

The second strategy is online mitigation via curiosity rewards and finetuning. The benchmark mainly uses the tokenizer round-trip residual as a curiosity score. For a candidate action sequence $a_{t:t+H-1}$, the model imagines a trajectory and computes a curiosity score by aggregating the residual over the imagined steps, then plans actions that maximize this score. The planner is CEM, with a fixed planning horizon and replanning interval, population size 32, 3 iterations, 2 rollouts per candidate, and warm-start from the BC prior.
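The planning loop can be sketched with a generic cross-entropy method. Population size, iteration count, and rollouts per candidate follow the values stated above. The elite fraction, initial standard deviation, planning length, one-scalar-action-per-step simplification, and the toy score function are all assumptions. In the benchmark the score would come from imagined trajectories of the world model, which is not reproduced here.

```python
import math
import random
from typing import Callable, List

Plan = List[float]  # one scalar action per planning step, for simplicity


def cem_plan(
    score_fn: Callable[[Plan, random.Random], float],
    init_mean: Plan,
    rng: random.Random,
    population: int = 32,             # stated in the text
    iterations: int = 3,              # stated in the text
    rollouts_per_candidate: int = 2,  # stated in the text
    elite_frac: float = 0.25,         # assumption
    init_std: float = 0.5,            # assumption
) -> Plan:
    """Return the mean action plan after a few CEM refinement rounds."""
    mean = list(init_mean)  # warm start, e.g. from a behavior-cloning prior
    std = [init_std] * len(mean)
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(population)
        ]
        # Average the (possibly stochastic) score over several imagined rollouts.
        scores = [
            sum(score_fn(c, rng) for _ in range(rollouts_per_candidate))
            / rollouts_per_candidate
            for c in candidates
        ]
        ranked = sorted(range(population), key=lambda i: scores[i], reverse=True)
        elites = [candidates[i] for i in ranked[:n_elite]]
        # Refit the sampling distribution to the highest-scoring candidates.
        mean = [sum(e[j] for e in elites) / n_elite for j in range(len(mean))]
        std = [
            max(1e-3, math.sqrt(sum((e[j] - mean[j]) ** 2 for e in elites) / n_elite))
            for j in range(len(mean))
        ]
    return mean


def toy_curiosity(plan: Plan, rng: random.Random) -> float:
    """Hypothetical stand-in for an imagined curiosity score; peaks at a = 1."""
    return -sum((a - 1.0) ** 2 for a in plan) + rng.gauss(0.0, 0.01)


bc_prior = [0.0, 0.0, 0.0, 0.0]
plan = cem_plan(toy_curiosity, bc_prior, random.Random(0))
print([round(a, 2) for a in plan])
```

In a receding-horizon setup only the first part of the returned plan would be executed before replanning from the new observation.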

For the transfer and finetuning experiments, data collection on 10 seen + 10 unseen tasks uses No-op actions, Random policy, Expert policy, Human play, and Curiosity policy, with 50 trajectories per task. Tokenizer and dynamics are then finetuned on these new trajectories, typically for 50k tokenizer and 30k dynamics steps, and evaluated both offline and in closed loop using MPC.

On the 10 unseen tasks, the reported downstream normalized performance is 0.118 for the random policy baseline and 0.276 for the base pretrained model without extra data. Finetuning on curiosity data, with both tokenizer and dynamics finetuned, yields Recon PSNR 36.05 dB, rollout PSNR gain +3.00 dB, ASR 2.00, and task performance 0.325. For comparison, expert-policy finetuning gives 0.362, human-play finetuning gives 0.362, and finetuning on all data combined gives Recon PSNR 37.91 dB, rollout PSNR gain +4.02 dB, and task performance 0.390. The paper states that curiosity-based data reaches about 90% of the performance achievable with expert or human trajectories, without human supervision or privileged policies.
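The “about 90%” figure can be checked from the task scores quoted above, along with the relative improvement over the base model:

$\dfrac{0.325}{0.362} \approx 0.898, \qquad \dfrac{0.325 - 0.276}{0.276} \approx 0.178.$

That is, curiosity-collected data recovers roughly 90% of the expert or human score and improves on the base pretrained model by about 18% in relative terms.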

6. Baseline model, findings, limitations, and research significance

The benchmark is instantiated with a 350M-parameter world model, architecturally similar to Dreamer 4, comprising a tokenizer of approximately 100M parameters and a dynamics model of approximately 250M parameters (Hansen et al., 25 Jun 2026). The tokenizer is an encoder–decoder Transformer that patchifies each $224\times 224$ RGB frame with stride 14 into 256 patch tokens, appends 64 learnable latent queries, uses a Transformer encoder with 8 heads, 12 layers, and MLP ratio 4, and produces 64 bounded latent tokens, each 64-dimensional. Its training objective is masked autoencoding with a randomly sampled mask ratio, reconstructing masked pixels only, with RMS-normalized MSE and LPIPS loss.
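The patch count follows directly from the frame size and stride:

$\left(\dfrac{224}{14}\right)^2 = 16^2 = 256, \qquad 256 + 64 = 320.$

So, assuming the 64 latent queries are simply concatenated with the patch tokens, the tokenizer encoder would process 320 tokens per frame and compress them to 64 latent tokens.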

The dynamics model consumes one action token, one shortcut-conditioning token, 32 packed spatial latent tokens, 4 register tokens, and optional agent tokens for reward and BC heads. Its Transformer backbone uses block-causal layers with spatial self-attention, temporal causal self-attention, and MLP, with 8 heads, 16 layers, MLP ratio 4, RoPE, QK-normalization, and RMSNorm. It is trained with the shortcut flow-matching objective, with sampled noise levels and a self-consistency bootstrap fraction, and uses shortcut Euler inference with 8 substeps per frame. After pretraining, a reward predictor and a deterministic Gaussian BC policy over the 16-d padded action are added. Training details include 300k tokenizer steps, 180k dynamics pretraining steps, a fixed sequence length, AdamW, learning rate 1e-4, weight decay 1e-2, and effective batch sizes 96 for tokenizer and 512 for dynamics.

The principal findings associated with MMBench2 are presented as follows. Hallucination is predictable: the three internal signals are highly correlated with rollout error and produce AUROC values above 0.86 across hallucination modes. Hallucination is coverage-driven: visual analyses show high predictor values concentrated in low-density regions of state space. Coverage-aware sampling works: a single data-centric sampling modification improves reconstructions, action sensitivity, rollout fidelity, and all three hallucination predictors. Curiosity-driven data collection is effective and data-efficient: using hallucination predictors as curiosity rewards to collect 50 trajectories per unseen task improves unseen-task rollout quality and MPC performance. Pretrained world models transfer zero-shot, but finetuning helps: the base model reaches 0.276 on unseen tasks versus 0.118 for the random baseline, and targeted finetuning closes part of the remaining gap. The paper also reports that off-the-shelf tokenizers such as Wan 2.1 VAE, SD-VAE, and Cosmos perform worse than the in-domain tokenizer on training tasks, while Wan 2.1 performs better on unseen tasks unless the in-domain tokenizer is finetuned, after which the in-domain tokenizer matches or outperforms it.

The stated limitations are equally important. The study operates at approximately 350M parameters on 210 simulated tasks, leaving open whether the findings scale to billion-parameter models or real-world robotics with noisy sensors and partial observability. Although MMBench2 spans many simulated domains, it does not include real-world data. Training requires substantial computation, reported as approximately 58 GPU days on 8×H100. Downstream evaluation is centered on MPC-based control over relatively short horizons and specific task sets.

These design choices and findings suggest that MMBench2 functions as a benchmark-and-methodology package for studying hallucination in world models. It standardizes large-scale multitask visual world modeling with full action and reward labels and live simulators, formalizes hallucination into perceptual, action-marginalized, and scene-diverging modes, provides internal label-free predictors, and links offline fidelity metrics to online control performance. A plausible implication is that its main scientific contribution lies not only in scale, but in making coverage a measurable and actionable variable in world-model reliability.


References

Hansen et al. Hallucination in World Models is Predictable and Preventable (25 Jun 2026).
