
VLM Trajectory Planner for Autonomous Driving

Updated 27 December 2025
  • The VLM trajectory planner integrates a high-level Vision-Language Model for semantic context with a low-level Model Predictive Controller for real-time trajectory optimization.
  • It leverages multimodal inputs—including visual scenes, vehicle states, and reference memory—to dynamically adjust safety and comfort parameters using chain-of-thought reasoning.
  • Empirical evaluations demonstrate improved Post Encroachment Time, reduced RMS acceleration, and enhanced controller stability compared to baseline methods.

A Vision-Language Model (VLM) trajectory planner for autonomous driving is a hierarchical control architecture that combines the semantic and reasoning capabilities of VLMs with the real-time optimization strength of model-based controllers, typically Model Predictive Control (MPC). Such frameworks are motivated by the need for interpretable, adaptive, and safety-critical driving decisions derived from rich multimodal context, including visual inputs, vehicle states, and historical reference knowledge. The paradigm is exemplified by approaches such as VLM-MPC, which demonstrates closed-loop integration of high-level reasoning and low-level control across diverse driving scenarios (Long et al., 2024).

1. System Architecture and Hierarchical Integration

VLM trajectory planners are fundamentally structured as layered systems. The canonical two-level design consists of:

  • Upper Layer: A Vision-Language Model operating asynchronously at low frequency (e.g., 0.2 Hz), responsible for reasoning over front-camera images, ego and traffic state, and reference memory. The VLM outputs control parameters such as the prediction horizon ($N$), speed-tracking weight ($Q$), control-effort weight ($R$), headway weight ($Q_h$), desired velocity ($v^d$), and desired time headway ($h^d$).
  • Lower Layer: A real-time MPC running at higher rates (2 Hz or faster), which minimizes a quadratic cost over predicted future states and control inputs, executing trajectory optimization based on both dynamic feedback and the high-level parameters produced by the VLM.

This loose coupling lets the VLM interpret the scene and past experience to adapt control objectives, while the MPC ensures robust execution under physical constraints (engine lag, actuator limits, etc.). State feedback links the two layers: the VLM updates parameters at slower intervals, and the MPC maintains continuity between updates.
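
The timing logic can be made concrete with a short sketch. The VLMParams container, the callables vlm_query, mpc_step, and get_state, and the simulation-style loop are hypothetical names of our own; only the 0.2 Hz / 2 Hz rates come from the description above:

```python
from dataclasses import dataclass

# Hypothetical parameter bundle; the field names follow the paper's
# notation (N, Q, R, Q_h, v^d, h^d) but the container is our own sketch.
@dataclass
class VLMParams:
    N: int       # prediction horizon (steps)
    Q: float     # speed-tracking weight
    R: float     # control-effort weight
    Q_h: float   # headway weight
    v_d: float   # desired velocity [m/s]
    h_d: float   # desired time headway [s]

VLM_PERIOD = 5.0   # upper layer at 0.2 Hz
MPC_PERIOD = 0.5   # lower layer at 2 Hz

def control_loop(vlm_query, mpc_step, get_state, horizon_s=60.0):
    """Two-rate loop: the VLM refreshes parameters every VLM_PERIOD
    seconds; the MPC holds them constant in between (simulated time)."""
    params = vlm_query(get_state())          # initial high-level reasoning
    last_vlm_t = t = 0.0
    while t < horizon_s:
        if t - last_vlm_t >= VLM_PERIOD:     # asynchronous slow update
            params = vlm_query(get_state())  # state feedback to the VLM
            last_vlm_t = t
        mpc_step(get_state(), params)        # fast trajectory optimization
        t += MPC_PERIOD
```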

2. VLM Reasoning and Input/Output Structure

The reasoning component leverages modern multimodal foundation models (e.g., GPT-4o, Llama 3.1) with chain-of-thought (CoT) prompt engineering, alongside a CLIP ViT-B/32 scene encoder. Inputs include:

  • Visual Scene: Encoded via CLIP to produce environment descriptors (weather, lighting, road condition, obstacles).
  • Vehicle State: $s_t = [x, v, a]$ for the ego vehicle, $s^\text{pre}_t = [x^\text{pre}, v^\text{pre}, a^\text{pre}]$ for the preceding vehicle, plus the stop-line position $x^L$.
  • Reference Memory ($M$): Aggregated safety and comfort parameters from matched prior scenarios, allowing parameter adaptation based on context similarity.

The prompt includes $M$, the environment descriptor, current states, and the last output $O_{t-\Delta T}$. The VLM then executes logical steps: assess the environment, set the horizon $N$, balance safety ($Q_h$) against comfort ($R$, $Q$), and choose $v^d$ and $h^d$ within allowable bounds.
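
A minimal sketch of the prompt assembly and output handling follows, assuming the VLM answers in JSON and illustrative parameter bounds; build_prompt, parse_output, and the BOUNDS values are hypothetical, not the paper's exact interface:

```python
import json

# Hypothetical bounds; the paper constrains v^d and h^d to allowable
# ranges, but these specific numbers are illustrative only.
BOUNDS = {"N": (5, 30), "Q": (0.0, 10.0), "R": (0.0, 10.0),
          "Q_h": (0.0, 10.0), "v_d": (0.0, 20.0), "h_d": (1.0, 4.0)}

def build_prompt(memory, env_descriptor, ego_state, prev_output):
    # Chain-of-thought prompt: ask the model to reason step by step
    # before emitting a JSON parameter set.
    return (
        f"Reference memory: {json.dumps(memory)}\n"
        f"Environment: {env_descriptor}\n"
        f"Ego state [x, v, a]: {ego_state}\n"
        f"Previous output: {json.dumps(prev_output)}\n"
        "Think step by step: assess the environment, set the horizon N, "
        "balance safety (Q_h) and comfort (R, Q), then choose v_d and h_d. "
        "Answer with a JSON object with keys N, Q, R, Q_h, v_d, h_d."
    )

def parse_output(raw_text, fallback):
    """Parse the VLM reply; clamp each value to its bounds and fall back
    to the last valid output if the reply is malformed (N is kept
    numeric here; cast to int before use)."""
    try:
        out = json.loads(raw_text)
        return {k: min(max(float(out[k]), lo), hi)
                for k, (lo, hi) in BOUNDS.items()}
    except (ValueError, KeyError):
        return fallback
```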

3. Model Predictive Control Formulation

The MPC typically solves a finite-horizon optimization problem in continuous or discrete time:

  • State Model: $x_k = [s_k; v_k; a_k]$, with continuous dynamics

$$\frac{ds}{dt} = v, \qquad \frac{dv}{dt} = a, \qquad \frac{da}{dt} = \frac{u(t) - a(t)}{T_a},$$

where $T_a$ is the engine time lag.

  • Cost Function:

$$J = \sum_{i=0}^{N-1} \Big[ Q\,(v_{k+i} - v^{d})^2 + R\,u_{k+i}^2 + Q_h\,(\Delta s_{k+i} - h^{d} v_{k+i})^2 \Big] + Q_f\,(v_{k+N} - v^{d})^2,$$

subject to inequality constraints on velocity, acceleration, control effort, and spacing relative to other vehicles; here $\Delta s_{k+i}$ is the gap to the preceding vehicle and $Q_f$ is the terminal speed-tracking weight.

  • Parameter Injection: All weights, references, and the prediction horizon $N$ are set by the VLM reasoning outputs; the MPC uses these parameters for optimization and real-time execution in each control cycle (a minimal solver sketch follows this list).
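
The following is a minimal sketch of one MPC cycle as a convex QP, assuming a forward-Euler discretization with sampling time dt and illustrative actuator and spacing limits; the use of cvxpy and the constant-speed leader prediction are our own simplifications, not the paper's exact solver:

```python
import cvxpy as cp
import numpy as np

def solve_mpc(x0, s_pre0, v_pre, p, dt=0.5, T_a=0.5, Q_f=1.0):
    """One MPC cycle. x0 = [s, v, a] is the ego state; the preceding
    vehicle is assumed to keep constant speed v_pre over the horizon.
    p is the VLM parameter dict with keys N, Q, R, Q_h, v_d, h_d."""
    N = int(p["N"])
    s = cp.Variable(N + 1); v = cp.Variable(N + 1); a = cp.Variable(N + 1)
    u = cp.Variable(N)
    s_pre = s_pre0 + v_pre * dt * np.arange(N + 1)  # predicted leader position

    cons = [s[0] == x0[0], v[0] == x0[1], a[0] == x0[2]]
    cost = 0
    for i in range(N):
        cons += [s[i+1] == s[i] + dt * v[i],
                 v[i+1] == v[i] + dt * a[i],
                 a[i+1] == a[i] + dt * (u[i] - a[i]) / T_a,  # engine lag
                 v[i] >= 0, cp.abs(u[i]) <= 3.0,             # example limits
                 s_pre[i] - s[i] >= 2.0]                     # minimum spacing
        cost += (p["Q"] * cp.square(v[i] - p["v_d"])
                 + p["R"] * cp.square(u[i])
                 + p["Q_h"] * cp.square((s_pre[i] - s[i]) - p["h_d"] * v[i]))
    cost += Q_f * cp.square(v[N] - p["v_d"])                 # terminal term

    cp.Problem(cp.Minimize(cost), cons).solve(warm_start=True)
    return u.value  # planned control sequence; apply u[0]
```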

4. Workflow and Inter-layer Interaction

A typical VLM-MPC workflow proceeds as:

  1. Reference memory $M$ is initialized and updated per scenario using nuScenes data.
  2. For each execution window:
    • The VLM aggregates the current image, environmental encoding, vehicle and traffic state, memory, and prior outputs to compute the high-level parameter output $O_t$.
    • The MPC holds $O_t$ constant, solving at 2 Hz using a warm-started quadratic program.
    • State feedback is provided to the VLM for the next reasoning cycle, enabling adaptivity to rapid contextual changes.

Pseudocode for this interaction is provided in (Long et al., 2024).
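
As an illustration of the memory-matching in step 1, the sketch below retrieves parameters from reference memory by cosine similarity of CLIP scene embeddings; the memory layout (a list of dicts with 'embedding' and 'params' keys) and the top-k averaging are our own assumptions, not the paper's exact procedure:

```python
import numpy as np

def retrieve_reference(memory, query_embedding, k=3):
    """Match the current scene against stored scenarios by cosine
    similarity of their CLIP embeddings and aggregate the control
    parameters of the top-k matches."""
    embs = np.stack([m["embedding"] for m in memory])
    q = query_embedding / np.linalg.norm(query_embedding)
    sims = embs @ q / np.linalg.norm(embs, axis=1)
    top = np.argsort(sims)[-k:]
    # Average each control parameter over the retrieved scenarios.
    keys = memory[0]["params"].keys()
    return {key: float(np.mean([memory[i]["params"][key] for i in top]))
            for key in keys}
```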

5. Empirical Evaluation and Safety Metrics

Evaluation utilizes the nuScenes dataset, including scenarios with adverse weather, intersections, and parking. Key metrics:

  • Post Encroachment Time (PET): Direct measure of collision risk (threshold > 1 s); VLM-MPC consistently exceeds this across settings, outperforming both hand-tuned MPC and simpler LLM-based controllers.
  • Driving Comfort (RMS Acceleration): Lower is preferred for smooth driving; VLM-MPC delivers 0.33–0.50 m/s², superior to baselines.
  • Stability: No invalid VLM outputs across hundreds of environments.

Baseline comparisons and ablation studies reveal that reference memory and environmental encoding are essential. Removing the memory increases VLM output failure rates to 12% and degrades PET by 15%.
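
To make the two headline metrics concrete, here is a minimal sketch of their computation; the conflict-area timestamps are assumed to be extracted upstream, and the definitions are the standard ones rather than code from the paper:

```python
import numpy as np

def post_encroachment_time(t_leave_first, t_arrive_second):
    """PET: time between the first road user leaving the conflict area
    and the second one entering it (safety threshold: PET > 1 s)."""
    return t_arrive_second - t_leave_first

def rms_acceleration(accel):
    """RMS of an acceleration trace; lower values indicate smoother,
    more comfortable driving (VLM-MPC reports 0.33-0.50 m/s^2)."""
    a = np.asarray(accel)
    return float(np.sqrt(np.mean(a ** 2)))
```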

6. Contextual Adaptation and Failure Modes

The VLM layer is responsible for context-driven reconfiguration of planning objectives, directly enabling adaptation to environmental variations (e.g., rain, night driving, dynamic traffic). The use of visual scene encoders and memory modules provides resilience to unfamiliar scenarios. Ablations indicate that losing historical context (reference memory) or scene parsing (environment encoder) degrades safety and comfort, with increased failure rates and collision risk.

7. Comparisons and Broader Significance

The VLM-MPC architecture defines a paradigm for interpretable trajectory planning with explicit semantic reasoning. By leveraging VLMs for scene assessment and logical inference while entrusting MPC with dynamic feasibility and optimality, the architecture produces significant improvements in safety (as measured by PET), comfort, and controller robustness. The approach is extensible to multi-agent settings, and further generalizes to hierarchical controllers operating with asynchronous high-level reasoning and low-level actuation. Its design principles are reflected in recent related work integrating VLMs with MPCs, memory modules, and chain-of-thought prompting (Long et al., 2024).

References

  • Long, K., Shi, H., Liu, J., & Li, X. (2024). VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving. arXiv:2408.04821.
