Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Abstract: Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix
Explain it Like I'm 14
What this paper is about (big idea)
The paper introduces Re2Pix, a new way to predict what will happen next in a video—especially driving videos. Instead of trying to guess future pictures all at once, Re2Pix first predicts the “meaning” or structure of the scene (like where cars, people, and roads are), and only then turns that into realistic images. Think of it like making a storyboard (semantics) before painting the final frames (pixels).
What questions the researchers asked
- Can we predict future video frames more reliably if we separate “what is happening” (semantics) from “how it looks” (appearance)?
- Will this separation make videos look more consistent over time (so objects don’t flicker or change identity)?
- Can it also make training faster and require less data?
- How do we keep the system strong when its own predicted “meanings” aren’t perfect?
How the method works (in simple steps)
Here’s the basic two-stage plan, using an everyday analogy.
Like planning before painting:
- Plan the scene (Semantics)
- Paint the scene (Pixels)
More concretely:
- Stage 1: Predict scene structure (the “plan”)
- A powerful pre-trained vision model (like a very smart image reader) turns each video frame into a map of semantic features—this captures “what’s in the scene” without worrying about colors and textures.
- A small transformer model learns to predict the next frames’ semantic features one step at a time (autoregressive: predict the next, feed it back in, predict the one after, and so on).
- Stage 2: Turn structure into pictures (the “painting”)
- A diffusion model (a modern technique for generating images and videos) works in a compact space called a VAE latent space (think of it as a simplified version of the video that’s easier to handle).
- It takes the predicted semantics from Stage 1 and uses them as guidance to generate future frames that look realistic and match the predicted scene structure.
- Finally, a decoder converts those compact latents back into full-color frames.
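The autoregressive loop in Stage 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout` and `toy_predictor` are hypothetical names, the toy predictor stands in for the transformer, and letting the history grow without a fixed context window is an assumption made for brevity.

```python
import numpy as np

def rollout(context, predict_next, num_future):
    """Autoregressively extend a sequence of semantic feature maps.

    context: array of shape (T, H, W, C), features of the observed frames.
    predict_next: callable mapping a (T', H, W, C) history to one (H, W, C) map.
    Returns an array of shape (num_future, H, W, C).
    """
    history = list(context)
    futures = []
    for _ in range(num_future):
        nxt = predict_next(np.stack(history))
        futures.append(nxt)
        history.append(nxt)  # feed the prediction back in as context
    return np.stack(futures)

def toy_predictor(history):
    # Hypothetical stand-in for the learned transformer:
    # linear extrapolation from the last two feature maps.
    return 2.0 * history[-1] - history[-2]

# Four observed "frames" whose features are constant maps 0, 1, 2, 3.
context = np.stack([np.full((2, 2, 3), t, dtype=np.float32) for t in range(4)])
future = rollout(context, toy_predictor, num_future=3)
print(future.shape)        # (3, 2, 2, 3)
print(future[:, 0, 0, 0])  # values 4, 5, 6 for this toy input
```

Because each predicted map becomes input to the next step, any error in an early prediction is carried forward, which is the source of the train-test mismatch discussed next.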
Key challenges and fixes:
- The training–testing mismatch problem:
- During training, the model can use perfect semantic features (from real future frames).
- During testing, it must rely on its own predicted semantic features, which are noisier.
- Two strategies to fix this:
- Nested dropout: randomly “hide” the fine details in the semantic features during training so the generator learns to handle imperfect inputs and focus on the most important, coarse structure first.
- Mixed supervision: train mostly with ground-truth semantics but sometimes (about 10% of the time) with the predicted semantics, so the generator adjusts to both perfect and imperfect inputs.
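The two strategies above can be sketched in a few lines. This is an illustration under stated assumptions rather than the released code: the function names are hypothetical, dropped channels are zeroed (the paper may truncate or mask them differently), the number of kept channels is drawn uniformly, and channels are assumed to be ordered from coarse to fine, for example by PCA.

```python
import numpy as np

def nested_dropout(features, rng, min_keep=1):
    """Keep the first k channels and zero out the rest.

    features: array of shape (..., C) with channels ordered coarse to fine.
    k is sampled uniformly from [min_keep, C] (an assumed distribution).
    """
    num_channels = features.shape[-1]
    k = int(rng.integers(min_keep, num_channels + 1))  # upper bound is exclusive
    out = features.copy()
    out[..., k:] = 0.0
    return out, k

def pick_conditioning(gt_features, pred_features, rng, p_pred=0.1):
    """Mixed supervision: condition on predicted features with probability p_pred."""
    use_pred = bool(rng.random() < p_pred)
    return (pred_features if use_pred else gt_features), use_pred

rng = np.random.default_rng(0)
gt = np.ones((4, 4, 8), dtype=np.float32)          # stand-in for ground-truth features
pred = gt + 0.1 * rng.standard_normal(gt.shape)    # stand-in for noisier predictions

cond, used_pred = pick_conditioning(gt, pred, rng)  # roughly 90/10 GT/predicted
cond, k = nested_dropout(cond, rng)                 # then hide the finer channels
print(used_pred, k, cond.shape)
```

Keeping a prefix of the channels, rather than a random subset, is what makes the dropout "nested": the generator always sees the coarsest components and only sometimes the finer ones.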
Helpful translations of technical terms:
- Vision Foundation Model (VFM): a large, pre-trained image understanding network (e.g., DINOv2) that produces rich “meaning maps” of images.
- VAE (Variational Autoencoder): a tool that shrinks videos into a compact form (latents) and can reconstruct them back to images.
- Diffusion model: a generator that starts from noisy data and learns to remove noise step by step to create realistic frames.
- Autoregressive: predicting one future step at a time and using that prediction to help with the next step.
What the researchers found
Across several driving datasets (like Cityscapes, nuScenes, and KITTI), Re2Pix:
- Kept scene semantics more consistent over time
- Objects stayed stable across frames (better segmentation scores), and estimated depth was more reliable.
- Produced better-looking and smoother videos
- Scored better on common quality measures (FID and FVD, where lower scores mean the video looks more realistic and more coherent over time).
- Trained much faster
- Reached a given level of generation quality (FID/FVD) up to about 7× sooner in training.
- Reached a given level of semantic consistency (segmentation) up to about 14× sooner in training.
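For reference, FID and FVD are both Fréchet distances between Gaussian fits to feature statistics of real and generated samples (image features for FID, video features for FVD). The standard definition, which this summary does not restate, is given below; here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of real and generated data.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The distance is zero when the two feature distributions share the same mean and covariance, which is why lower scores are better.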
The method also outperformed strong baselines, including versions that try to align internal features (like REPA/VideoREPA), and even did well compared to much larger, heavily pre-trained systems. Importantly, the gains weren’t just because of having more model parameters—the two-stage, “semantics first” design made the difference.
They also showed:
- The approach is robust to which vision model provides the semantics (it worked with both DINOv2 and SigLIP-2).
- Training tricks (nested dropout and mixed supervision) were crucial for handling the imperfect predictions used at test time.
- The model generalized well, even to new datasets it wasn’t directly trained on (zero-shot tests on KITTI).
Why this matters
- Safer, more reliable predictions: For tasks like autonomous driving, predicting the near future accurately is vital. Stable semantics (keeping track of where each object is and what it’s doing) can help planning and decision-making.
- Clearer control and understanding: By separating “understand the scene” from “render the picture,” engineers can improve each part independently, making systems easier to tune and debug.
- Faster development: Because training converges faster, researchers and practitioners can iterate more quickly, using less compute and time.
- A general blueprint: This “representations before pixels” idea can inspire other video tasks—any time you need both good understanding and good visuals, structuring the problem in two stages can help.
In short, Re2Pix shows that thinking about meaning first and pixels second leads to future video predictions that are more consistent, more realistic, and faster to train—an encouraging step toward dependable video forecasting in complex, real-world settings like driving.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of the paper’s unresolved issues, limitations, and concrete open questions to guide future research.
- Domain generality beyond driving: How well does Re2Pix transfer to non-driving domains (e.g., egocentric, sports, indoor scenes) and multi-camera or multi-sensor settings (e.g., LiDAR, radar, maps, text prompts)?
- Dependence on specific VFMs: Re2Pix relies on frozen 2D image VFMs (DINOv2, SigLIP-2). How do results change with larger backbone variants (e.g., ViT-L/G), video-native VFMs (e.g., V-JEPA), or modest VFM fine-tuning?
- PCA and channel ordering assumptions: Nested dropout assumes PCA-ordered channels map from coarse-to-fine semantics. Is this consistent across datasets/encoders, and what happens if PCA is computed on a different domain or not used at all?
- Stage-1 forecasting uncertainty: The semantic predictor is trained with Smooth L1 regression, yielding a deterministic forecast. Can probabilistic or diffusion-based feature forecasting capture multi-modal futures and reduce averaging?
- Long-horizon error compounding: The paper predicts a short horizon (12 future frames). How do autoregressive feature errors accumulate for longer horizons, and what strategies (scheduled sampling, noise injection, curriculum) best mitigate drift?
- Train-test mismatch strategy scope: The fixed 90/10 mixed-supervision ratio is chosen empirically. Are different ratios, schedules (increasing predicted-features share), or adaptive schemes more effective across datasets and horizons?
- Nested dropout design choices: Equal-probability channel truncation is a heuristic. How sensitive is performance to the selection distribution, channel counts, and alternative perturbations (token/patch dropout, additive noise, feature masking)?
- Fusion mechanism optimality: Early channel-wise summation is simple but may limit conditioning capacity. Do cross-attention, FiLM/AdaLN conditioning, learned gating, or multi-scale fusion yield better alignment and controllability?
- Temporal subsampling effects: Stage-1 operates on every r-th frame to match the VAE’s temporal rate. How does varying r affect fast motion, fine temporal details, and identity stability?
- Geometry and 3D consistency: Re2Pix uses per-frame 2D VFM features and does not model ego-motion or 3D constraints. Can adding depth/pose inputs, geometric priors, or 3D feature spaces improve physical plausibility and stability?
- Evaluation scope of semantics: Semantic consistency is reported at a single future frame (frame 19). How do per-frame and sequence-level metrics (mean mIoU over all predicted frames, ID consistency, tracking metrics like IDF1) compare?
- Pseudo-depth reliability: Depth evaluation uses pseudo-labels (Depth Anything V2). How do conclusions change with ground-truth depth or other geometry metrics (e.g., flow error, multi-view consistency)?
- Metric bias and circularity: Segmentation metrics use a DINOv2-based head, which may align with the conditioning space used by Re2Pix. Does evaluation with diverse, independent segmentation/depth models alter conclusions?
- Diversity and multi-modality: The paper does not measure sample diversity (e.g., min-FVD, LPIPS diversity, multi-hypothesis metrics). How diverse are the futures when sampling the diffusion model, and can semantic conditioning constrain or enable controllable diversity?
- Action-/control-conditioned prediction: Re2Pix is unconditional with respect to driver actions. How does it perform in closed-loop settings or when conditioned on planned trajectories, and how does it compare on control-centric metrics (ADE/FDE, driving scores)?
- Inference efficiency and latency: The number of diffusion steps and end-to-end inference times are not reported. Can the approach meet real-time constraints (e.g., ≤100 ms/frame), or be accelerated via distillation/consistency models?
- Data/sample efficiency: Claims of faster convergence are shown for one setup. How does the method scale with less data, different dataset sizes, larger models, and pretraining vs. from-scratch regimes?
- VAE dependence and bottlenecks: The WAN2.1 VAE appears fixed; its effect on reconstruction quality, temporal coherence, and information bottlenecks is unstudied. Would alternative VAEs or joint VAE fine-tuning improve results?
- Joint training of stages: Stage-1 and Stage-2 are trained separately with limited coupling (mixed supervision). Can end-to-end co-training or bi-level optimization (backpropagating pixel losses into the feature predictor) further reduce mismatch?
- Robustness and OOD conditions: Performance under challenging conditions (night, rain, fog), strong occlusions, camera jitter, or sensor noise is not evaluated. How robust is semantic forecasting and synthesis to such perturbations?
- Failure mode analysis: The paper does not quantify where/why the model fails (e.g., small/rare classes, fast-moving objects, long occlusions). Which scenarios most degrade semantic fidelity and temporal coherence?
- Physical plausibility and consistency: No metrics enforce or evaluate adherence to dynamics/kinematics. Can physics-informed losses or constraints reduce implausible yet semantically consistent predictions?
- Contribution attribution: The relative impact of Stage-1 quality vs. Stage-2 conditioning is not disentangled. How do controlled degradations in predicted features translate to generation metrics, and where are the breakpoints?
- Control over semantics: It is unclear how editable the intermediate VFM features are for user control (e.g., moving objects, changing layout). What tools or learned mappings could enable interpretable, fine-grained semantic editing?
- Sensitivity to model size and context length: Beyond a single baseline enlargement, scaling laws for the feature predictor and diffusion transformer with longer spatial/temporal contexts remain unexplored.
- KITTI zero-shot semantics: KITTI evaluation reports only generation metrics (no semantics). How does semantic consistency transfer to new domains without labels, and can unsupervised measures (e.g., self-consistency) be used?
- Safety and deployment: No uncertainty calibration or reliability estimates are provided. How can predictive confidence over semantics and pixels be quantified for safety-critical decision-making?
Practical Applications
Immediate Applications
Below are concrete ways the paper’s semantics-first, hierarchical video prediction framework (Re2Pix) can be applied today, leveraging its open-source implementation, training efficiency, and demonstrated gains in temporal semantic consistency and perceptual quality.
- Semantics-first video prediction for autonomous driving and ADAS
- Sectors: Automotive, Robotics
- What: Integrate Re2Pix as a predictive perception module to anticipate near-future frames around the ego-vehicle, improving tracking stability, occlusion handling, and short-horizon planning (e.g., 0.5–2 s).
- Tools/workflows/products:
- “Predictive Perception SDK” that feeds predicted VAE latents/frames to planners;
- A plug-in for existing world-model stacks (e.g., replacing baseline latent-prediction modules with Stage-1+Stage-2).
- Assumptions/dependencies: Availability of domain video data for fine-tuning Stage 1; GPU for real-time inference; safety validation for deployment; alignment with vehicle sensor timing.
- Simulation-ready future frame synthesis for training and testing autonomy stacks
- Sectors: Automotive, Robotics, Simulation Software
- What: Use the predicted semantics to condition photorealistic future frames for scenario replay and rare case amplification in simulators.
- Tools/workflows/products:
- Scenario augmentation tool that perturbs predicted semantic maps and re-renders variants;
- Data curation pipeline that mines sequences with high prediction uncertainty for human review.
- Assumptions/dependencies: Simulator integration; careful evaluation of domain shift; compute for batch generation.
- Predictive video analytics and anomaly anticipation in CCTV/traffic cameras
- Sectors: Public Safety, Smart Cities, Transportation
- What: Reduce false alarms by comparing observed vs. predicted future frames to flag deviations (e.g., stalled vehicles, pedestrian intrusion).
- Tools/workflows/products:
- “Future-Frame Divergence” monitor that scores deviations per region;
- Dashboard for operators with predicted overlays.
- Assumptions/dependencies: Legal/privacy compliance; calibration to camera viewpoints; edge-accelerated inference or server offloading.
- Warehouse and factory mobile robotics: short-horizon predictive perception
- Sectors: Industrial Robotics, Logistics
- What: Anticipate human/robot trajectories and occlusions for safer navigation using semantics-first forecasts that reduce flicker/identity drift.
- Tools/workflows/products:
- ROS-compatible module providing predicted feature maps and synthesized views;
- Integration with MPC controllers that ingest predictions at fixed latency.
- Assumptions/dependencies: Domain adaptation of Stage 1 features; verified latency budgets; GPU/accelerator on-board or low-latency edge link.
- Video-based quality and safety monitoring on production lines
- Sectors: Manufacturing
- What: Predict imminent visual states (e.g., misalignment, object collisions) to trigger early interventions.
- Tools/workflows/products:
- Alerting service comparing predicted vs. actual frames;
- Operator UI with “next-frame” previews.
- Assumptions/dependencies: Stationary camera setups; labeled exemplars for threshold calibration; robust lighting-invariance via VFM features.
- Sports broadcast and analytics overlays
- Sectors: Media, Sports Tech
- What: Generate short-horizon anticipated views to enhance replays, coaching tools, or on-screen guidance (e.g., likely ball path or player motion).
- Tools/workflows/products:
- Broadcast plugin that composites predicted frames;
- Analyst tool exporting predicted semantics for downstream stats.
- Assumptions/dependencies: Camera calibration; model tuning for sport domain; editorial review workflows.
- Predictive video compression and streaming enhancements (prototype)
- Sectors: Telecom, Media Streaming
- What: Use predicted frames as high-quality references for inter-frame coding or packet loss concealment to improve perceived quality.
- Tools/workflows/products:
- Experimental codec module that replaces standard motion-compensated prediction with semantics-guided generative prediction.
- Assumptions/dependencies: Standards compliance; rigorous subjective testing; computational overhead may require model distillation.
- Video editing: future extension and temporally consistent inpainting
- Sectors: Creative Software
- What: Extend clips forward with consistent scene semantics, or fill gaps while preserving identities across frames.
- Tools/workflows/products:
- NLE plugin exposing “Predict Next N Frames” and “Semantic-Guided Fill”;
- Controls to adjust semantic components for a compute–quality trade-off.
- Assumptions/dependencies: User-tunable latency; rights to use source footage; GPU requirements.
- Research and ML engineering: robust conditioning strategies for generative models
- Sectors: Academia, AI/ML Platforms
- What: Adopt nested dropout across semantic channels and mixed supervision (90/10 GT/predicted) to close train-test gaps in any teacher-forced conditional generator.
- Tools/workflows/products:
- Library components implementing nested dropout over PCA/feature channels;
- Training recipes showing faster convergence (7× FID/FVD, 14× segmentation).
- Assumptions/dependencies: Access to VFM features; baseline diffusion/autoencoder pipeline; reproducibility across domains.
- Adaptive compute–quality knobs for edge deployment
- Sectors: Edge AI, Embedded Systems
- What: Dynamically reduce the number of semantic components during inference (e.g., from 1152 to 256) under tight compute budgets, with a graceful performance trade-off.
- Tools/workflows/products:
- Runtime policy that adjusts PCA component count by load;
- QoS controller targeting FID/FVD budgets.
- Assumptions/dependencies: Pretraining with nested dropout; accurate load monitors; predictable latency profiles.
- Education and training in vision world models
- Sectors: Education
- What: Use the open-source code to teach hierarchical forecasting vs. direct pixel prediction, and study the impact of semantics on generative quality.
- Tools/workflows/products:
- Course labs comparing baselines, REPA-style alignment, and Re2Pix;
- Assignments exploring VFM swaps (DINOv2, SigLIP-2).
- Assumptions/dependencies: GPU access in labs; dataset licenses (e.g., Cityscapes/nuScenes).
- Policy and safety auditing for autonomy
- Sectors: Policy, Standards, Safety Engineering
- What: Use divergence between predicted and observed futures as a diagnostic to surface risky edge cases for audit and retraining.
- Tools/workflows/products:
- Automated test suites generating prediction–reality discrepancy reports;
- Curated benchmark of failure modes.
- Assumptions/dependencies: Access to operational data; privacy-preserving processing; agreed-upon thresholds and metrics.
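Several of the applications above rest on comparing predicted and observed frames region by region. A minimal sketch of such a divergence score follows; the function names, the grid layout, the mean-absolute-error measure, and the threshold are all illustrative assumptions, and a real system would more likely compare semantic features than raw pixels.

```python
import numpy as np

def region_divergence(predicted, observed, grid=(4, 4)):
    """Mean absolute error per grid cell between two frames of shape (H, W, C).

    Returns an array of shape grid. Assumes H and W are at least as large
    as the grid; rows and columns that do not divide evenly are cropped.
    """
    diff = predicted.astype(np.float32) - observed.astype(np.float32)
    err = np.abs(diff).mean(axis=-1)              # per-pixel error, shape (H, W)
    gh, gw = grid
    h, w = err.shape
    err = err[: h - h % gh, : w - w % gw]         # crop to a multiple of the grid
    cells = err.reshape(gh, err.shape[0] // gh, gw, err.shape[1] // gw)
    return cells.mean(axis=(1, 3))

def flag_regions(scores, threshold):
    """Return (row, col) indices of grid cells whose score exceeds the threshold."""
    return np.argwhere(scores > threshold)

predicted = np.zeros((8, 8, 3), dtype=np.float32)
observed = predicted.copy()
observed[0:2, 0:2] = 1.0                          # something unexpected in one corner

scores = region_divergence(predicted, observed)
print(flag_regions(scores, threshold=0.5))        # only the top-left cell is flagged
```

The threshold would need calibration per camera and scene, as the assumptions listed for these applications already note.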
Long-Term Applications
These opportunities leverage the same semantics-first principles but will likely require additional research in scaling, latency, domain transfer, or regulation before deployment.
- Closed-loop predictive control in autonomous vehicles and mobile robots
- Sectors: Automotive, Robotics
- What: Tight integration of Stage-1 semantics and Stage-2 synthesis into planners (e.g., MPC, RL) for longer-horizon, uncertainty-aware control.
- Potential products/workflows: Planning stacks consuming multi-step semantic forecasts with calibrated uncertainty; risk-aware controllers.
- Assumptions/dependencies: Real-time guarantees with diffusion or distilled surrogates; certified safety; robust out-of-distribution handling.
- City-scale digital twins and traffic optimization
- Sectors: Smart Cities, Transportation Planning
- What: Use semantics-guided prediction on dense camera networks to simulate future traffic states for signal timing, routing, and infrastructure planning.
- Potential products/workflows: “Semantics-forecast layer” for digital twin platforms; policy simulators testing interventions.
- Assumptions/dependencies: Data integration across sensors; governance and privacy; sustained compute budgets.
- AR/VR latency reduction via predictive rendering
- Sectors: AR/VR, Gaming
- What: Predict near-future frames/structures to mask motion-to-photon latency and stabilize dynamic occlusions.
- Potential products/workflows: Predictive compositor that blends rendered frames with semantics-guided predictions.
- Assumptions/dependencies: Ultra-low latency inference (on-headset); lightweight distilled models; user comfort and safety validation.
- Surgical and clinical video anticipation
- Sectors: Healthcare
- What: Anticipate instrument motion and tissue interactions during minimally invasive surgery to assist or warn clinicians.
- Potential products/workflows: OR assistance tools overlaying predicted semantics; training simulators with realistic future synthesis.
- Assumptions/dependencies: Medical-grade validation; domain-specific VFM adaptation; explainability and liability frameworks.
- Drone-based predictive perception for inspection and agriculture
- Sectors: Aerial Robotics, Energy, Agriculture
- What: Forecast visual states in flight (e.g., impending occlusions, moving obstacles) to adjust trajectory and sensing.
- Potential products/workflows: Flight controllers integrating semantic forecasts; predictive anomaly detection for turbine/line inspections.
- Assumptions/dependencies: Robustness to motion blur and rapid viewpoint changes; tight power/compute budgets.
- Predictive streaming and resilient video conferencing
- Sectors: Communications
- What: Use predicted future frames to conceal packet loss and prefetch content during network jitter.
- Potential products/workflows: “Predict-and-correct” modules in RTC stacks leveraging semantics-guided synthesis.
- Assumptions/dependencies: Standardization; privacy; efficient lightweight models for client devices.
- Generative simulation for training embodied AI and game engines
- Sectors: Gaming, Simulation, Robotics
- What: World models that evolve semantic states and render consistent photorealistic futures for RL and content generation.
- Potential products/workflows: Game engine plugins where designers manipulate semantics to steer long-horizon outcomes.
- Assumptions/dependencies: Multi-agent semantics; long-horizon stability; controllability APIs.
- Predictive maintenance from vision feeds at scale
- Sectors: Energy, Industrial IoT
- What: Forecast machine states and incipient faults from camera feeds (e.g., overheating signs, vibration-induced motion patterns).
- Potential products/workflows: Plant-wide monitors using divergence from predicted semantics as early warnings.
- Assumptions/dependencies: Multimodal fusion with sensors; domain adaptation; low false-alarm rates.
- Retail analytics and crowd flow forecasting
- Sectors: Retail, Facilities Management
- What: Predict short-term movement patterns to optimize staffing, layout, and safety (e.g., congestion, queuing).
- Potential products/workflows: Store dashboards with predictive overlays; automated interventions (e.g., queue opening).
- Assumptions/dependencies: Strong privacy protections; signage and consent; generalization to varied environments.
- Standards and benchmarks for semantics-first video prediction
- Sectors: Standards Bodies, Academia, Industry Consortia
- What: Establish evaluation suites and metrics emphasizing temporal semantic consistency and robust conditioning.
- Potential products/workflows: Open benchmarks, leaderboards, and certification protocols for predictive perception modules.
- Assumptions/dependencies: Community adoption; agreed-upon metrics (e.g., mIoU over moving classes, FVD/FID trade-offs).
- Green AI through accelerated training and distillation
- Sectors: AI Infrastructure, Sustainability
- What: Leverage the reported 7×–14× convergence speed-ups to lower training emissions; distill diffusion to fast deterministic predictors for deployment.
- Potential products/workflows: Training pipelines with nested-dropout/mixed-supervision defaults; distillation recipes for edge models.
- Assumptions/dependencies: Reproducible gains across domains; hardware-aware optimization.
Cross-cutting assumptions and dependencies
- Foundation dependencies: Frozen Vision Foundation Models (e.g., DINOv2, SigLIP-2), a 3D VAE (e.g., WAN2.1), and a diffusion transformer backbone (DiT) are required; licenses and hardware must be available.
- Domain adaptation: Stage-1 semantic predictor must generalize to target domains; performance degrades with large domain shift without adaptation.
- Latency and compute: Diffusion inference is costly; practical deployments may require distillation, quantization, and component reduction (leveraging nested dropout).
- Safety-critical use: Extensive validation, monitoring for train–test mismatch, and calibrated uncertainty are necessary before real-world operation.
- Data governance: Continuous video prediction may implicate privacy and retention policies; ensure compliance and ethical use.
Glossary
- 3D VAE: A variational autoencoder that processes video with explicit temporal (3D) structure for compression and decoding. "For the 3D VAE, we employ the WAN2.1 VAE [74]"
- Absolute Relative Error (AbsRel): A depth estimation metric measuring relative error between predicted and ground-truth depth. "Depth evaluation uses Absolute Relative Error (AbsRel) and threshold accuracy (δ1)"
- AdaLN-LoRA: A LoRA-based adaptive normalization layer used to condition diffusion models on noise level efficiently. "Noise-level conditioning is implemented via LoRA-based adaptive normalization (AdaLN-LoRA) [32]"
- autoregressive: A modeling approach that predicts future elements using previously predicted outputs as inputs. "At inference time, the model operates autoregressively:"
- causal variational autoencoder: A VAE designed to encode/decode sequences respecting temporal causality, often compressing along time. "a causal variational autoencoder that compresses videos along both spatial and temporal dimensions."
- channel-wise summation: An operation that fuses feature maps by adding channels elementwise to combine modalities. "Both feature maps are then embedded independently and combined by channel-wise summation, ensuring joint conditioning from the outset."
- Classifier-Free Guidance (CFG): A technique to steer diffusion generation by contrasting guided and unguided predictions; here used as inspiration for representation guidance. "a CFG-inspired representation guidance scheme that contrasts predictions at different levels of semantic granularity."
- cross-attention: Attention mechanism that conditions one sequence on another; removed here since the model is not text-conditioned. "we remove the cross-attention layers"
- denoising objective: The training loss for diffusion models that learns to invert a noise-adding process and predict clean latents. "The model is trained to minimize the denoising objective:"
- DINOv2: A vision foundation model providing robust semantic features used as the predicted representation space. "Specifically, we employ the DINOv2 [53] image encoder En (.)"
- Diffusion Transformer (DiT): A transformer architecture tailored for diffusion models, here used for video latent denoising. "built upon the Diffusion Transformer (DiT) framework [54]"
- EDM formulation: The Elucidated Diffusion Model training/sampling setup providing stable diffusion objectives and schedules. "Our diffusion transformer follows the EDM formulation [36] used in Cosmos-Predict [1]"
- early fusion: Conditioning strategy that merges semantic features with VAE latents at the input level for stable guidance. "We adopt an early fusion strategy that token-wise merges VFM features with VAE latents at the input level"
- early semantic alignment: Aligning semantic and generative latent spaces at the model input to guide synthesis from the outset. "This design enables early semantic alignment between the predicted scene structure and the generative latent space"
- FID: Fréchet Inception Distance; measures perceptual realism of generated frames. "We compute FID [27] and FVD [69] over all predicted frames"
- FVD: Fréchet Video Distance; measures spatiotemporal coherence and realism of generated videos. "We compute FID [27] and FVD [69] over all predicted frames"
- ground-truth features: Semantic features extracted from true future frames used during training (but not inference). "models trained exclusively with ground-truth features tend to overfit to these ideal representations"
- latent space: A compressed representation space (e.g., from a VAE) where prediction and generation are performed. "operate in the latent space of a learned variational autoencoder [41, 59, 74]"
- latent video diffusion model: A diffusion generator that operates in VAE latent space to synthesize future frames. "In the second stage, a latent video diffusion model generates future frames"
- LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning technique used within normalization here. "Noise-level conditioning is implemented via LoRA-based adaptive normalization (AdaLN-LoRA) [32]"
- masked transformer: A transformer trained to predict masked tokens/features, used here for semantic feature forecasting. "We adopt the masked transformer architecture from [38]"
- mean Intersection-over-Union (mIoU): A segmentation metric averaging IoU across classes. "We report mIoU over all classes (A) and over moving-object classes (M)"
- mixed supervision: Training strategy mixing clean ground-truth and predicted semantic features to reduce train–test mismatch. "mixed supervision exposes the generator to both ground-truth and predicted features (90/10 mixture)"
- Nested Dropout: A stochastic truncation of feature channels that retains leading components to encourage robustness. "we apply nested dropout [43, 58] to all semantic feature maps h1:K."
- PCA: Principal Component Analysis; dimensionality reduction used to form ordered semantic channels. "projecting them to Ch = 1152 channels via PCA"
- patchified: Process of splitting latent feature maps into non-overlapping patches for transformer input. "The VAE latents z are patchified with a spatial size of 2 × 2"
- pseudo-depth: Depth maps produced by a pretrained estimator rather than ground-truth sensors/labels. "we generate pseudo-depth using Depth Anything V2 [83]"
- query/key normalization: Normalization applied to attention queries/keys to stabilize training. "query/key normalization before attention [14, 17, 79]"
- RMSNorm: Root Mean Square Layer Normalization, a normalization variant used in transformer blocks. "and RMSNorm [90] with learnable scales in all self-attention blocks."
- Rotary Position Embeddings (RoPE): Positional encoding technique enabling efficient relative positional representations. "including 3D-factorized Rotary Position Embeddings (RoPE) [65]"
- semantic segmentation: Pixel-wise classification of an image into semantic classes. "We evaluate semantic segmentation (mIoU) and depth estimation on the generated frame 19."
- teacher forcing: Training approach feeding ground-truth context to stabilize sequence learning. "During training, we follow a teacher-forcing approach:"
- temporal subsampling ratio: The factor by which frames are subselected to match temporal resolutions across stages. "where r is the temporal subsampling ratio of the VAE encoder."
- threshold accuracy (δ1): A depth accuracy metric measuring the fraction of pixels whose ratio between predicted and reference depth stays within a threshold (conventionally 1.25). "and threshold accuracy (δ1)"
- VAE latent space: The compressed space produced by a VAE in which diffusion operates. "in the VAE latent space [74]"
- Variational Autoencoder (VAE): A generative model that encodes inputs into a probabilistic latent space and decodes them back. "a Variational Autoencoder (VAE), which preserves sufficient detail for high-fidelity reconstruction."
- Vision Foundation Model (VFM): A large pretrained vision model providing general-purpose features. "Vision Foundation Model (VFM)"
- WAN2.1 VAE: A specific causal 3D VAE architecture used for video compression in this work. "we employ the WAN2.1 VAE [74]"
- world modeling: Learning predictive models of environments for planning and control. "In controllable or world-modeling settings"
- zero-shot generalization: Evaluating on datasets or domains not seen during training. "zero-shot generalization on KITTI"
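The two depth metrics defined in this glossary, AbsRel and δ1, can be computed as in the sketch below. It follows the conventional definitions (δ1 with a threshold of 1.25); the function name and the masking of non-positive depths are assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def depth_metrics(pred, target, threshold=1.25):
    """Return (AbsRel, delta1) over pixels where both depths are positive.

    AbsRel: mean of |pred - target| / target.
    delta1: fraction of pixels with max(pred/target, target/pred) < threshold.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    valid = (target > 0) & (pred > 0)             # ignore missing or invalid depth
    pred, target = pred[valid], target[valid]
    abs_rel = float(np.mean(np.abs(pred - target) / target))
    ratio = np.maximum(pred / target, target / pred)
    delta1 = float(np.mean(ratio < threshold))
    return abs_rel, delta1

target = np.array([1.0, 2.0, 4.0, 0.0])           # last pixel has no valid depth
pred = np.array([1.0, 3.0, 4.0, 5.0])
abs_rel, delta1 = depth_metrics(pred, target)
print(abs_rel, delta1)                            # about 0.167 and 0.667
```

Lower AbsRel and higher δ1 indicate better depth. In this summary's setting the reference depth is itself a pseudo-label, so both numbers measure agreement with the depth estimator rather than with sensor ground truth.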