Transferability of CSN to Other VLA Architectures

Determine whether Causal Scene Narration (CSN), the text-input framework evaluated with LMDrive that aligns navigation intent with environmental constraints, adds quantitative grounding, and separates information into structured components, improves closed-loop driving performance when integrated with other Vision-Language-Action (VLA) architectures such as DriveVLM.

Background

The paper introduces Causal Scene Narration (CSN), which restructures text inputs to LMDrive by aligning navigation intent with environmental constraints, adding quantitative grounding, and separating information into structured components. Multi-town CARLA evaluations show substantial improvements in Driving Score for LMDrive and a preference-aligned variant.
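To make the three properties concrete, here is a minimal Python sketch of a CSN-style text input. It is not the authors' implementation. The `Constraint` record, the `build_narration` function, the `INTENT:`/`CONSTRAINT:` labels, and the sentence template are all hypothetical, chosen only to show intent-constraint alignment, quantitative grounding, and structured separation in one place.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Constraint:
    """Hypothetical record for one environmental constraint."""
    obj: str            # what constrains the manoeuvre, e.g. "pedestrian"
    distance_m: float   # quantitative grounding: distance in metres
    response: str       # how the constraint modifies the intent, e.g. "yield"


def build_narration(intent: str, constraints: List[Constraint]) -> str:
    """Compose a structured text input with labelled sections (illustrative format)."""
    # Structured separation: the intent gets its own labelled line.
    lines = [f"INTENT: {intent}"]
    # Nearest constraint first, so the most urgent one is read first.
    for c in sorted(constraints, key=lambda c: c.distance_m):
        # Intent-constraint alignment: each constraint is tied back to the intent.
        lines.append(
            f"CONSTRAINT: {c.obj} at {c.distance_m:.1f} m; "
            f"{c.response} before you {intent}"
        )
    if not constraints:
        lines.append("CONSTRAINT: none")
    return "\n".join(lines)


narration = build_narration(
    "turn left at the next intersection",
    [Constraint("red traffic light", 30.0, "stop"),
     Constraint("pedestrian", 12.5, "yield")],
)
print(narration)
```

In a transfer experiment, a string of this kind would replace the default instruction text given to the target architecture. How DriveVLM or any other VLA model would consume it is not specified in the source and is left open here.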

While the authors argue that CSN should be architecture-agnostic in principle, their experiments are limited to the LMDrive family. They explicitly state that they have not tested CSN on other VLA architectures such as DriveVLM, leaving open whether the observed improvements generalize beyond LMDrive.

References

"We have not tested whether this transfers to other architectures such as DriveVLM."

Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving (2604.01723 - Li et al., 2 Apr 2026) in Discussion, Subsubsection "Robustness Across Weight Configurations"