Trajectory-Conditioned 3D World Models
- Trajectory-conditioned 3D world models are generative systems that synthesize future 3D environments from explicit action or camera trajectories, ensuring spatial and temporal consistency.
- They leverage multi-view fusion and specialized trajectory encoding—using joint angles, pose tokens, or camera paths—to achieve precise, robust scene generation.
- Key applications span robotic control, autonomous driving, and digital twin simulation, while challenges include long-horizon stability and accurate geometric representation.
Trajectory-conditioned 3D world models are generative systems that synthesize future states of a three-dimensional environment, conditioned explicitly on action or camera trajectories. These models are foundational in robotic control, embodied AI, autonomous driving, 3D simulation, digital twin construction, and controllable 3D scene generation. The core technical advance is to move beyond naive autoregressive rollouts or uni-modal, single-view predictions by leveraging trajectory-conditioned input—whether explicit action trajectories, camera paths, or interaction histories—across multiple synchronized viewpoints or spatial representations, thereby ensuring geometric and physical consistency in both image and world coordinates.
1. Definitions and Core Concepts
Trajectory-conditioned 3D world models synthesize sequences of multimodal observations—video frames, 3D occupancy grids, or other spatial signals—based on a specified trajectory. The "trajectory" may refer to low-level robot joint actions, egocentric hand motions, camera paths, or ego-vehicle navigation plans. Conditioning explicitly injects this trajectory information at every step, coupling the model's temporal evolution directly to user-specified or policy-generated actions. This approach stands in contrast to unconditional or solely observation-conditioned models, which model the world’s future evolution without direct control of its dynamical path.
Key properties of a trajectory-conditioned 3D world model include:
- Multi-view synthesis: robust prediction across multiple synchronized cameras or virtual viewpoints.
- Action fidelity: explicit, frame-synchronous incorporation of action/control signals into the generative process.
- Spatial and temporal consistency: maintaining world geometry over long rollouts and under varying viewpoint or action.
- Grounded evaluation: assessment via 3D- or action-grounded metrics, such as object-mask alignment or trajectory pixel overlap.
Representative model families include MTV-World (Su et al., 17 Nov 2025), Ctrl-World (Guo et al., 11 Oct 2025), OccSora (Wang et al., 2024), PreWorld (Li et al., 11 Feb 2025), ANWM (Zhang et al., 26 Dec 2025), and Matrix-3D (Yang et al., 11 Aug 2025).
2. Model Architectures and Trajectory Encoding
World model architectures vary in their choice of signal (pixel, latent, occupancy), action representation, and fusion strategy. However, all high-performing systems share certain trajectory-encoding mechanisms.
- Low-level action to spatial signal (e.g., MTV-World (Su et al., 17 Nov 2025)): Joint angles are mapped to 3D end-effector positions via forward kinematics, then projected (using camera intrinsics/extrinsics) into per-view 2D trajectories. These 2D trails are rendered as "trajectory videos," forming a per-view visual control signal (a projection sketch follows this list).
- Pose-conditioned tokenization (Ctrl-World (Guo et al., 11 Oct 2025)): Actions are mapped to 7D Cartesian-space poses and embedded as tokens, concatenated with visual tokens, and linked to observations via frame-wise cross-attention, enabling precise spatio-temporal alignment.
- Planner-based and ego trajectory conditioning (OccSora (Wang et al., 2024), PreWorld (Li et al., 11 Feb 2025), ANWM (Zhang et al., 26 Dec 2025)): Future trajectory plans (sequences of waypoints or control actions) are embedded and provided as context vectors through MLPs or attention layers, dictating spatial-temporal occupancy dynamics or long-horizon egocentric view synthesis.
- Camera trajectory conditioning in generative 3D (Matrix-3D (Yang et al., 11 Aug 2025), Director3D (Li et al., 2024)): Trajectories over SE(3) are provided via mesh-render videos or explicit camera parameter sequences, with dense or sparse geometry serving as cross-view conditioning.
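The first of these mechanisms reduces to forward kinematics followed by a standard pinhole projection and rasterization. The NumPy sketch below is a minimal illustration under assumed conventions (a 4x4 world-to-camera extrinsics matrix, hypothetical function names), not MTV-World's actual implementation:

```python
import numpy as np

def project_trajectory(points_world, K, T_world_to_cam):
    """Project 3D end-effector positions into 2D pixel coordinates.

    points_world:   (N, 3) positions from forward kinematics.
    K:              (3, 3) camera intrinsics.
    T_world_to_cam: (4, 4) camera extrinsics (assumed world -> camera).
    """
    n = points_world.shape[0]
    pts_h = np.hstack([points_world, np.ones((n, 1))])   # homogeneous (N, 4)
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]        # camera frame (N, 3)
    pts_img = (K @ pts_cam.T).T                          # image plane (N, 3)
    return pts_img[:, :2] / pts_img[:, 2:3]              # perspective divide

def render_trajectory_video(pixels, num_frames, h, w, radius=2):
    """Rasterize the growing 2D trail into a (T, H, W) binary 'trajectory video'."""
    video = np.zeros((num_frames, h, w), dtype=np.float32)
    for t in range(num_frames):
        for u, v in pixels[: t + 1]:                     # trail up to frame t
            ui, vi = int(round(u)), int(round(v))
            if 0 <= vi < h and 0 <= ui < w:
                video[t, max(vi - radius, 0): vi + radius + 1,
                         max(ui - radius, 0): ui + radius + 1] = 1.0
    return video
```

Because the rendered trail accumulates over time, each frame of the control video encodes the action history up to that timestep in image coordinates.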
Fusion strategies are typically early (channel-wise concatenation of video and trajectory features), mid-level (cross-attention of pose embeddings into transformer blocks), or late (residual modulation of predicted frames by trajectory context).
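As a concrete instance of the mid-level strategy, the PyTorch sketch below injects projected pose tokens into a transformer block via cross-attention. Dimensions, the 7D action interface, and the block layout are illustrative assumptions, not the architecture of any cited model:

```python
import torch
import torch.nn as nn

class TrajectoryCrossAttentionBlock(nn.Module):
    """Transformer block in which visual tokens attend to trajectory tokens."""

    def __init__(self, dim=512, num_heads=8, action_dim=7):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, dim)  # e.g. 7D Cartesian pose -> token
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, actions):
        # vis_tokens: (B, N, dim) visual latents; actions: (B, T, action_dim)
        traj_tokens = self.action_proj(actions)          # (B, T, dim) pose embeddings
        h = self.norm1(vis_tokens)
        x = vis_tokens + self.self_attn(h, h, h)[0]      # spatio-temporal self-attention
        x = x + self.cross_attn(self.norm2(x), traj_tokens, traj_tokens)[0]  # action fusion
        return x + self.mlp(self.norm3(x))
```

Early fusion would instead concatenate trajectory-video features channel-wise before the backbone, and late fusion would residually modulate the predicted frames with the trajectory context.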
3. Multi-View and 3D Representation Learning
True 3D understanding in world models mandates multi-view learning, geometric regularization, and sometimes explicit 3D spatial parameterization.
- Multi-view latent stacking and cross-view attention (MTV-World (Su et al., 17 Nov 2025), Ctrl-World (Guo et al., 11 Oct 2025)): Latent feature sequences from each view are concatenated, often with reference tokens (e.g., an initial frame) prepended to maintain per-view appearance constancy. Multi-view transformers employing self- and cross-attention link trajectory and observation information spatially and temporally.
- Occupancy and volumetric representations (OccSora, PreWorld): The environment is represented as a 4D tensor (3D space × time), tokenized and quantized into codebooks (OccSora), or preserved as continuous features (PreWorld), which can be directly forecasted conditioned on state or action.
- Geometric manifold regularization (GRWM (Xia et al., 30 Oct 2025)): Latent spaces are explicitly regularized to ensure that nearby points along true physical trajectories remain close in latent space, using all-pairs temporal slowness and uniformity losses, producing a manifold that preserves the topology of the environment (a loss sketch follows this list).
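The precise GRWM objectives are given in (Xia et al., 30 Oct 2025); the sketch below is an assumed reimplementation of the two stated ingredients, all-pairs temporal slowness and latent uniformity, with a Gaussian temporal weighting and hyperparameters chosen purely for illustration:

```python
import torch

def all_pairs_slowness(z, sigma=2.0):
    """Penalize latent distance between every pair of timesteps, weighted to decay
    with temporal gap, so physically adjacent states stay close in latent space.
    z: (B, T, D) latent trajectory; sigma is an assumed bandwidth."""
    d_lat = torch.cdist(z, z).pow(2)                     # (B, T, T) latent distances
    idx = torch.arange(z.shape[1], device=z.device, dtype=z.dtype)
    w = torch.exp(-(idx[None] - idx[:, None]).pow(2) / (2 * sigma ** 2))
    return (w * d_lat).mean()

def uniformity(z, t=2.0):
    """Log mean pairwise Gaussian potential on the unit hypersphere; spreads
    latents out and prevents the slowness term from collapsing them to a point."""
    z = torch.nn.functional.normalize(z.reshape(-1, z.shape[-1]), dim=-1)
    return torch.pdist(z).pow(2).mul(-t).exp().mean().log()

def manifold_regularizer(z, lam=0.1):
    # Combined plug-and-play regularizer, added to the base reconstruction loss.
    return all_pairs_slowness(z) + lam * uniformity(z)
```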
Explicit 3D lifting is not always necessary: in sufficiently expressive multi-view architectures, geometric coherence can emerge through cross-view fusion and attention alone.
4. Training Objectives, Diffusion, and Loss Formulations
Modern trajectory-conditioned world models primarily employ latent diffusion techniques and geometric or perceptual losses for both fidelity and consistency.
- Diffusion modeling (MTV-World, Ctrl-World, OccSora, Matrix-3D): Latent tokens (embedding videos or occupancy sequences) are corrupted with noise and denoised through transformer or U-Net backbones, with the loss formulated as the error between the true and predicted noise (or clean tokens), e.g.,

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}\left[ \big\| \epsilon - \epsilon_\theta(z_t, t, c) \big\|_2^2 \right],$$

where $z_t$ is the noised latent at diffusion step $t$ and the conditioning $c$ includes trajectory and context (a corresponding training-step sketch follows this list).
- Spatial/geometric consistency: object-location matching via the Jaccard Index (MTV-World) or occupancy IoU (OccSora, PreWorld), applied at evaluation time or as an auxiliary loss.
- Reconstruction and perceptual losses: ℓ2 for video reconstruction, FID/FVD for generative quality, and LPIPS for perceptual similarity.
- Geometric regularizers: Additional losses such as temporal slowness and latent uniformity (GRWM) directly target the topology preservation of latent manifolds.
- Volume rendering losses: For occupancy field models, volume-rendered predictions are supervised via rendered depth/semantic/RGB fields to leverage cheaper 2D annotation (PreWorld).
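For concreteness, a single training step under the noise-prediction loss above might look as follows; the model signature, conditioning interface, and schedule handling are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, z0, cond, alphas_cumprod):
    """One denoising-diffusion training step on latent tokens.

    model:          epsilon-predictor taking (noisy latents, timestep, conditioning).
    z0:             (B, N, D) clean latent tokens (video or occupancy sequence).
    cond:           conditioning c, e.g. trajectory plus context embeddings.
    alphas_cumprod: (T,) cumulative noise schedule.
    """
    b = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=z0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps   # forward corruption
    eps_pred = model(z_t, t, cond)                       # conditional denoiser
    return F.mse_loss(eps_pred, eps)                     # || eps - eps_theta ||^2
```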
5. Evaluation Metrics and Automated Assessment
Evaluation of trajectory-conditioned 3D world models spans raw pixel fidelity, semantic/occupancy accuracy, action-to-effect consistency, and spatial interaction metrics.
- Mask-based spatial consistency: Object masks extracted by vision-language and video object segmentation models allow for rigorous comparison of model-predicted object locations to ground truth via the Jaccard Index (MTV-World (Su et al., 17 Nov 2025)); a minimal computation is sketched after this list.
- Perceptual and generative quality: LPIPS, FID, FVD, DreamSim, and SSIM/PSNR are universally employed.
- Task-anchored metrics: Absolute trajectory error (ATE), success rate (SR), navigation error (NE), and object interaction precision.
- Automated pipelines: Mask extraction (VLM+RVOS) and referential queries automate evaluation of both physical motion and interaction in complex, multi-arm or navigation scenarios.
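The mask-based metric in the first item reduces to a per-frame Jaccard Index between predicted and ground-truth object masks, averaged over the rollout; the sketch below assumes binary mask arrays and a simple mean as the averaging convention:

```python
import numpy as np

def jaccard_index(pred_mask, gt_mask, eps=1e-8):
    """J = |pred ∩ gt| / |pred ∪ gt| for one pair of binary (H, W) masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + eps)

def sequence_jaccard(pred_masks, gt_masks):
    """Mean Jaccard over a rollout; masks are (T, H, W) binary arrays, e.g.
    extracted by a VLM + video-object-segmentation pipeline."""
    return float(np.mean([jaccard_index(p, g) for p, g in zip(pred_masks, gt_masks)]))
```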
6. Applications and Benchmarks
Trajectory-conditioned 3D world models underpin numerous embodied AI applications:
- Robotic manipulation: MTV-World (Su et al., 17 Nov 2025) and Ctrl-World (Guo et al., 11 Oct 2025) demonstrate precise control over dual-arm or manipulator tasks, generating consistent multi-camera rollouts and ranking or improving policy performance via imagination.
- Autonomous driving: OccSora (Wang et al., 2024) and PreWorld (Li et al., 11 Feb 2025) forecast complete 4D occupancy volumes conditioned on candidate ego-trajectories, enabling model-predictive planning and closed-loop simulation.
- Embodied digital twin simulation: Dexterous World Models (DWM, (Kim et al., 19 Dec 2025)) fuse egocentric hand-mesh motion and scene context, synthesizing human-object interaction videos that preserve static geometry and plausible dynamics.
- Aerial navigation forecasting: ANWM (Zhang et al., 26 Dec 2025) introduces physics-inspired Future Frame Projection to improve long-range generation and trajectory selection for UAV navigation in large-scale 3D environments.
- Omnidirectional and open-world 3D generation: Matrix-3D (Yang et al., 11 Aug 2025) and Director3D (Li et al., 2024) condition panoramic or multi-view video diffusion models on explicit camera trajectories and scene meshes, supporting user-driven exploration and controllable 3D scene synthesis.
Leading models are evaluated on DROID (robotic), nuScenes (AV), Matrix-Pano (panoramic), JRDB-GlobMultiPose (human social scenes), and custom synthetic or real-world multi-view benchmarks.
7. Current Challenges and Directions
Despite substantial empirical and practical advances, several challenges remain:
- Long-horizon stability: Even the best models exhibit drift or degraded physical interaction beyond 16–32 seconds without recurrent anchoring (e.g., pose-conditioned memory retrieval in Ctrl-World).
- Geometric representation quality: Poor latent manifold structure significantly degrades rollout fidelity; plug-and-play regularizers (GRWM) robustly address this by decoupling reconstruction from trajectory topology (Xia et al., 30 Oct 2025).
- Sparse and noisy annotation: Semi-/self-supervised pipelines (PreWorld) that leverage 2D volume-rendering supervision rather than full 3D labels address the annotation bottleneck with only a minor accuracy trade-off (Li et al., 11 Feb 2025).
- Multi-agent, multi-object modeling: Coordinated multi-agent prediction (Trajectory2Pose (Jeong et al., 2024)) remains a challenge, requiring reciprocal, graph-based attention over trajectory/pose embeddings for global and local consistency.
- Extension to open-world and generative scenarios: Systems such as Director3D require scaling to more compositional, articulated, or open-domain prompts and scenes (Li et al., 2024).
- Computational efficiency: Diffusion-based models yield high-quality, consistent rollouts but incur significant compute for inference, motivating research on acceleration and model distillation.
As research progresses, a shift toward unified, plug-and-play latent regularization, scalable semi/self-supervised training, and domain-adaptive multi-view control is anticipated, with mainstream world models transitioning toward closed-loop, trajectory-aware, long-horizon 3D reasoning.