HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

Published 30 Apr 2026 in cs.CV | (2604.28196v1)

Abstract: Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while LLMs demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at https://github.com/H-EmbodVis/HERMESV2.


Authors (7)

Summary

The paper introduces a unified BEV-based framework that fuses semantic reasoning and geometric prediction, outperforming specialist models in autonomous driving benchmarks.
It leverages LLM-enhanced world queries and joint geometric optimization to ensure spatial consistency and semantic interpretability in future scene predictions.
Empirical results show a 41.6% Chamfer Distance reduction and improvements in CIDEr, METEOR, and ROUGE-L scores, validating its effectiveness.

Unified Driving World Modeling with HERMES++

Motivation and Problem Analysis

Current driving world models for autonomous systems exhibit a research dichotomy: generative models excel at forecasting scene evolution but lack semantic interpretability, while large vision-language models (VLMs) provide semantic reasoning but lack predictive geometric fidelity. This structural gap impedes the deployment of holistic, interpretable autonomous driving agents, particularly for safety-critical applications that demand both deep understanding and anticipation of complex, dynamic scenes.

The HERMES++ framework addresses this by unifying 3D scene understanding and future geometry prediction into a single model. The approach leverages the Bird’s-Eye View (BEV) representation to consolidate high-dimensional multi-view perceptual information into a spatial format amenable to LLM processing, introducing components for semantic-geometric fusion and robust geometric consistency enforcement.

Figure 1: Problem landscape and comparative results. HERMES++ achieves superior unification of scene generation and understanding via BEV, outperforming both prior generative and specialist architectures.

Architectural Overview

HERMES++ consists of several tightly integrated modules:

Visual Tokenizer: Uses a convolutional backbone (OpenCLIP ConvNeXt-L) to encode multi-view images and projects the features into a compact BEV grid. Downsampling and flattening then produce a token sequence suitable for the LLM, keeping the token count manageable while preserving critical spatial semantics.
LLM-Enhanced World Queries: These are contextually anchored queries, initialized from BEV tokens and temporally modulated via ego-motion embeddings. The design enables cross-attention with language instructions, facilitating the injection of semantic priors into geometric evolution.
Current-to-Future Link: A stack of transformer blocks conditions future BEV features on both LLM-processed instructions and world queries. Textual Injection and Ego Modulation mechanisms allow the future scene prediction to be steered by both language and planned agent trajectory.
Differentiable BEV-to-Point Render: Generates future point cloud reconstructions from BEV features using volumetric upsampling and neural SDF modeling, allowing end-to-end differentiability for geometric supervision.
Joint Geometric Optimization: Combines an explicit rendering loss with implicit alignment (cosine similarity and Gram matrix terms) against a frozen, geometry-aware latent feature prior from a pre-trained 3D encoder, enforcing both local and global structural integrity of predicted scenes.
Figure 2: HERMES++ pipeline. Multi-view BEV tokenization, language fusion, temporally-bridged geometric evolution, and joint geometric supervision are integrated for unified understanding and generation.
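
The "downsampling and flattening" step of the Visual Tokenizer can be sketched as follows. This is a minimal illustration, not the paper's implementation: average pooling as the downsampling operator, the toy grid size, and the helper names `bev_to_tokens` and `tokens_to_bev` are all assumptions made here for the example.

```python
import numpy as np

def bev_to_tokens(bev, stride):
    """Average-pool a (C, H, W) BEV grid by `stride`, then flatten it
    into a (H/stride * W/stride, C) token sequence in row-major order.
    Average pooling is an assumption; a learned downsampler is equally possible."""
    c, h, w = bev.shape
    if h % stride or w % stride:
        raise ValueError("H and W must be divisible by stride")
    pooled = bev.reshape(c, h // stride, stride, w // stride, stride).mean(axis=(2, 4))
    return pooled.reshape(c, -1).T

def tokens_to_bev(tokens, h, w):
    """Inverse of the flattening step: (h * w, C) tokens back to a (C, h, w) grid."""
    return tokens.T.reshape(tokens.shape[1], h, w)

# Toy example with hypothetical sizes: 8 channels on a 16 x 16 grid.
rng = np.random.default_rng(0)
bev = rng.normal(size=(8, 16, 16))
tokens = bev_to_tokens(bev, stride=4)
print(tokens.shape)                       # one token per pooled BEV cell
print(tokens_to_bev(tokens, 4, 4).shape)  # spatial layout is recoverable
```

Because each token corresponds to one BEV cell, the sequence can be reshaped back into a grid after the LLM, which is what makes a flattened BEV layout usable for downstream geometric prediction.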

Numerical Results

HERMES++ demonstrates consistently strong quantitative results across both prediction and understanding benchmarks:

3-Second Point Cloud Prediction: Achieves a Chamfer Distance (CD) reduction of 41.6% over ViDAR and outperforms DriveX while using only single-frame input.
3D Scene Understanding: Shows a CIDEr gain of 9.2% over the specialist Omni-Q on OmniDrive-nuScenes, despite requiring no auxiliary detection/lane supervision, and maintains competitive METEOR and ROUGE-L scores.
Unified Performance: Scaling the model from 1.8B to 3.8B parameters yields further improvements, and the architecture generalizes across LLM backbones (InternVL2, Llama-3.2, Qwen3), indicating that the design is not tied to a particular backbone.
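
For readers unfamiliar with the metric, the sketch below shows one common form of the symmetric Chamfer Distance and how a relative reduction is computed from two scores. Conventions vary (squared versus unsquared distances, sums versus means), and this summary does not state which one the benchmark uses, so the squared-distance, mean-reduced form here is an assumption. The numbers passed to `relative_reduction` are hypothetical, not results from the paper.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).
    Uses squared Euclidean distances averaged over each set (one common convention)."""
    diff = a[:, None, :] - b[None, :, :]   # (N, M, 3) pairwise differences
    d2 = np.sum(diff ** 2, axis=-1)        # (N, M) squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def relative_reduction(baseline, ours):
    """Fractional reduction of `ours` relative to `baseline` (lower is better)."""
    return (baseline - ours) / baseline

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
print(chamfer_distance(pred, gt))      # 1.0
print(relative_reduction(2.0, 1.5))    # 0.25, i.e. a 25% reduction (hypothetical scores)
```

The brute-force pairwise matrix is fine for small point sets; full LiDAR sweeps would normally use a nearest-neighbor structure or a GPU kernel instead.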

Additional ablations validate design choices, showing:

BEV tokenization drastically outperforms naïve multi-view flattening for future scene generation due to preserved geometric structure.
The proposed Joint Geometric Optimization regime reduces CD by over 12% relative to explicit constraints alone and produces features that are geometrically faithful and artifact-free.
Figure 3: Qualitative results. HERMES++ delivers fine-grained semantic recognition (e.g., signage text, object identification) while rigorously tracking physical scene geometry in future predictions.

Figure 4: Comparison of BEV-based and multi-view-based input. Only BEV input maintains structural coherence in predicted geometry; flattened multi-view inputs lead to collapse artifacts.

Figure 5: Internal representations. Joint geometric optimization yields compact, geometry-conformant features (c) compared to explicit-only supervision (a, showing camera projection biases).
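
The implicit half of Joint Geometric Optimization is described above as cosine and Gram matrix alignment to a frozen feature prior. A minimal sketch of what such a loss could look like is given below, assuming token features of shape (N, D) that are already matched one-to-one with the prior's features. The exact formulation, the loss weights `w_cos` and `w_gram`, and all function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_alignment_loss(pred, prior):
    """Mean (1 - cosine similarity) per token: a local, per-feature term."""
    p, q = l2_normalize(pred), l2_normalize(prior)
    return float(np.mean(1.0 - np.sum(p * q, axis=-1)))

def gram_alignment_loss(pred, prior):
    """Mean squared difference between token-token Gram matrices:
    a global term that compares pairwise structure across the scene."""
    p, q = l2_normalize(pred), l2_normalize(prior)
    return float(np.mean((p @ p.T - q @ q.T) ** 2))

def implicit_alignment_loss(pred, prior, w_cos=1.0, w_gram=1.0):
    """Weighted sum of the local and global terms (weights are hypothetical)."""
    return w_cos * cosine_alignment_loss(pred, prior) + w_gram * gram_alignment_loss(pred, prior)

rng = np.random.default_rng(0)
prior = rng.normal(size=(16, 32))                # stands in for frozen 3D-encoder features
pred = prior + 0.1 * rng.normal(size=(16, 32))   # stands in for the model's latent features
print(implicit_alignment_loss(prior, prior))     # close to zero for identical features
print(implicit_alignment_loss(pred, prior))      # larger once the features drift
```

In training, the prior would be held fixed and gradients would flow only into the predicted features; an automatic-differentiation framework would replace NumPy for that purpose.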

Theoretical Implications and Practical Significance

HERMES++ addresses fundamental limitations of treating world modeling and scene understanding in isolation. It establishes explicit, semantically conditioned spatiotemporal bridges within a shared BEV substrate, enabling bidirectional cross-task supervision. Importantly:

The world query mechanism provides a high-throughput channel for semantic information transfer into geometric prediction, which is essential for interpretable forecasting and safe planning.
Joint geometric optimization ensures that learned representations are not only semantically meaningful but also physically plausible and spatially consistent, closing the gap between simulation and real-world reliability.
Advancement to multi-modal, multi-task unified modeling establishes a scalable research direction for foundational world models capable of supporting flexible downstream agent behaviors, planning, and interaction.

Future Directions

Several open research avenues remain:

How best to leverage multi-modal large-model priors (e.g., video-language pretraining) for BEV-based input remains an open question.
Extension of the unified modeling framework to broader modalities (LiDAR, radar, event cameras) and action spaces (planning/control) is plausible.
Near-term work may involve efficient adaptation of HERMES++ for real-time interactive inference and active exploration settings.

Conclusion

HERMES++ substantiates the feasibility and advantages of a unified architecture for 3D scene understanding and future geometry prediction in autonomous driving scenarios. By combining compressed geometric representations, deep language-model reasoning, explicit semantic-to-geometric linkage, and robust geometric regularization, the framework sets a new benchmark for interpretable, predictive, and generalizable driving world models.

Key claims supported by the empirical results and ablation studies:

Unified BEV-Language modeling achieves stronger joint performance than specialist baselines across both semantic and geometric tasks, even without detection/map supervision.
Joint geometric optimization is critical for correct structural prediction; explicit-only approaches yield artifacts and inferior accuracy.
Cross-task interaction (world queries, textual injection) is essential for bridging understanding and prediction in temporal progression.

The architectural design and results of HERMES++ are likely to inform a new generation of interpretable, predictive foundational models for autonomous vehicles and embodied agents—systems that must not only “see” and “describe,” but also accurately simulate and anticipate complex real-world scene evolutions.