Tri-Modal Driving Module
- Tri-modal driving modules are integrated systems that combine visual, LiDAR, and an auxiliary modality to enhance contextual perception and decision-making in autonomous vehicles.
- They employ advanced fusion architectures like token fusion and cross-modal attention to align multi-sensor data for improved hazard detection and behavioral planning.
- Experimental results demonstrate that incorporating a third modality boosts driving scores, reduces infractions, and improves explainability compared to bi- or uni-modal systems.
A tri-modal driving module is a system architecture that processes and fuses three distinct input modalities—such as vision, LiDAR, and a third domain (often language, audio, driver state, or HD-map)—in order to enhance situational awareness, behavioral decision-making, safety, and explainability in autonomous or assisted driving. These modules are central to modern autonomous driving stacks and advanced hazard detection frameworks, providing richer context and improved robustness compared to bi-modal or uni-modal approaches.
1. Modalities and Input Representations
Tri-modal driving modules support diverse modality triplets, selected according to the system’s operational intent. Major instantiations include:
- Vision–LiDAR–Language (DriveMLM): Surround camera images and LiDAR point clouds are jointly processed with textual system messages and user instructions. Visual streams are encoded by a frozen ViT-g/14 followed by a Temporal Q-Former, generating compact spatio-temporal embeddings; LiDAR is processed with a Sparse Pyramid Transformer (SPT). Text inputs are tokenized via standard LLaMA embeddings, yielding separate token sequences for the system and user messages (Wang et al., 2023).
- Road Video–Face Video–Driver Audio (Multimodal Hazard Detection): Road-condition video, facial video, and audio captured over identical temporal windows are processed by separate depthwise-separable CNN backbones and 1D-conv temporal encoders, extracting feature sequences (Zhouxiang et al., 5 Feb 2025).
- Vision–LiDAR–Driver Attention (M2DA): Multi-camera RGB images, BEV LiDAR grids, and a dynamically predicted driver-attention mask for the front view. Visual features are reweighted by this mask before encoding (Xu et al., 2024).
- Vision–LiDAR–HD Map (MMFN): RGB images, BEV LiDAR histograms, and vectorized HD-maps parsed from OpenDRIVE, with map features structured by VectorNet and spatially aligned via lightweight ResNet modules (Zhang et al., 2022).
This multi-sensor input structure allows for the extraction and alignment of complementary geometric, semantic, rule-based, or cognitive signals required for robust closed-loop reasoning.
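As a minimal illustration of such a multi-sensor input structure, a synchronized tri-modal sample can be bundled in a simple container; the field names and shapes below are illustrative assumptions, not the data format of any of the cited systems.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TriModalSample:
    """One synchronized training sample; shapes are illustrative."""
    camera: np.ndarray      # (T, H, W, 3) surround/front RGB frames
    lidar_bev: np.ndarray   # (T, H_bev, W_bev, C) rasterized BEV grid
    aux: np.ndarray         # third modality, e.g. (T, D) audio/attention/map features
    timestamp: float        # shared capture time for temporal alignment

    def check_alignment(self) -> bool:
        # All three modalities must cover the same temporal window.
        return self.camera.shape[0] == self.lidar_bev.shape[0] == self.aux.shape[0]

sample = TriModalSample(
    camera=np.zeros((90, 128, 256, 3)),   # 3 s at 30 fps
    lidar_bev=np.zeros((90, 64, 64, 8)),
    aux=np.zeros((90, 16)),
    timestamp=0.0,
)
```

Keeping the modalities in one object with a shared timestamp makes the temporal-synchronization requirement explicit before any fusion is attempted.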
2. Fusion Architectures and Mathematical Formulation
Fusion mechanisms are designed to align and synthesize modality-encoded features at either the token, feature, or latent stage, supporting both spatial and temporal integration.
- Token/Sequence Fusion (DriveMLM): Outputs from the visual, LiDAR, and text encoders are concatenated into a single token sequence and processed by a unified causal Transformer decoder (Wang et al., 2023).
- Pairwise Attention Fusion (Multimodal Hazard Detection): Intermediate features are fused with cross-modal attention computed for every ordered pair of modalities. Each modality's representation is then updated residually with the attended features from the other two modalities, scaled by a learned balancing parameter (Zhouxiang et al., 5 Feb 2025).
- Local-Global Cross-Attention (M2DA): Features are decomposed into local tokens (with 2D positional encoding) and global tokens (pooled statistics per modality). Two cross-attention stages are applied, a LiDAR-guided stage followed by an image-guided stage, in which the guiding modality's tokens attend to the concatenated tokens of the remaining modalities. Layer-normalized outputs are then batch-fused and propagated (Xu et al., 2024).
- Multi-Stage Multi-Head Attention (MMFN): After each CNN/attention downsampling stage, spatial features from all modalities are concatenated into a joint sequence and passed through cross-modal multi-head attention. The attended features are then split and routed back to their respective branches, allowing multi-level context blending (Zhang et al., 2022).
This formalism supports direct examination of cross-modal correlations, alignment of geometric and semantic cues, and integrated uncertainty management.
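The pairwise cross-modal attention pattern described above can be sketched in a few lines of NumPy. This is a minimal sketch: projection matrices, multiple heads, and layer normalization are omitted, and the balancing parameter is fixed here rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_feats, d):
    # Scaled dot-product attention: queries from one modality,
    # keys/values from another (learned projections omitted for brevity).
    scores = query_feats @ key_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ key_feats

def pairwise_fusion(feats, alpha=0.5):
    """feats: dict modality -> (tokens, d) array. Returns residually
    updated features where each modality attends to every other one."""
    fused = {}
    for i, f_i in feats.items():
        attended = sum(
            cross_attention(f_i, f_j, f_i.shape[-1])
            for j, f_j in feats.items() if j != i
        )
        fused[i] = f_i + alpha * attended  # residual update; alpha is learned in practice
    return fused

rng = np.random.default_rng(0)
feats = {m: rng.normal(size=(4, 8)) for m in ("road", "face", "audio")}
out = pairwise_fusion(feats)
```

Because every ordered pair contributes an attention term, each modality's tokens are refined by evidence from both of the others before classification or planning heads are applied.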
3. Training Protocols and Datasets
Tri-modal driving modules are trained end-to-end with multimodal datasets encompassing all supported sensors and, where applicable, natural language, attention, or driver state:
- Behavioral Cloning with Rule-Based or Human Annotations (DriveMLM, MMFN, M2DA): Supervision derives from either human driving logs (280 h in CARLA with rich sensor and language annotations (Wang et al., 2023)), or rule-based experts that consider time-to-collision and lane semantics (MMFN (Zhang et al., 2022)). For M2DA, driver attention maps are generated and subsampled for alignment loss.
- Hazard Detection Dataset (Driver Assistance, (Zhouxiang et al., 5 Feb 2025)): Contains 4,000 synchronized road, face, and audio clips (3 s, 30 fps), labeled as safe/dangerous with both automatic and manual annotation. Data augmentation includes random modality mask-outs to foster robustness to missing or corrupted inputs.
- Imitation and Auxiliary Losses (M2DA): Total loss consists of waypoint (L1), perception heatmap, traffic-state, and driver-attention alignment terms, with Kullback–Leibler divergence, correlation coefficient, and SIM metrics for the attention mask (Xu et al., 2024).
Optimization employs variants of AdamW or SGD, dynamic learning rate decay, and dropout regularization. Batch sizes and epochs are selected according to dataset size and memory footprint, with all submodules co-trained unless noted.
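A hedged sketch of an M2DA-style composite objective, restricted to the L1 waypoint term and the KL attention-alignment term; the perception-heatmap and traffic-state terms, and the actual loss weights, are omitted, and the weight values here are assumptions.

```python
import numpy as np

def l1_waypoint_loss(pred_wp, gt_wp):
    # Mean absolute error over the predicted future waypoints.
    return np.abs(pred_wp - gt_wp).mean()

def kl_attention_loss(pred_map, gt_map, eps=1e-8):
    # KL divergence between the normalized predicted and ground-truth
    # driver-attention maps.
    p = gt_map / (gt_map.sum() + eps)
    q = pred_map / (pred_map.sum() + eps)
    return float((p * np.log((p + eps) / (q + eps))).sum())

def total_loss(pred_wp, gt_wp, pred_att, gt_att, w_wp=1.0, w_att=0.1):
    # Weighted sum of imitation and auxiliary alignment terms.
    return w_wp * l1_waypoint_loss(pred_wp, gt_wp) + w_att * kl_attention_loss(pred_att, gt_att)

loss = total_loss(np.ones((4, 2)), np.ones((4, 2)),
                  np.ones((8, 8)), np.ones((8, 8)))
```

When predictions match the targets exactly, both terms vanish, which is a useful sanity check when wiring up the auxiliary heads.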
4. Decision State Alignment, Output Heads, and Behavioral Integration
- DriveMLM Decision States: The head predicts discrete path and speed decisions, closely aligned with Apollo’s behavioral planning interface. The system message enumerates the decision vocabulary for transparency. Once produced, these decisions are directly consumed by the motion planner and controller (Wang et al., 2023).
- Hazard Detection Binary Output: The final classifier in (Zhouxiang et al., 5 Feb 2025) produces a binary safe/dangerous prediction, trained with cross-entropy supervision.
- Imitation-based Waypoint Regression: Both MMFN and M2DA output a sequence of waypoints from fused features, which are then translated (by GRUs or direct decoding) into low-level controls (Zhang et al., 2022, Xu et al., 2024).
- Explanatory Output: DriveMLM generates an explanation token stream after path and speed decision tokens, providing rationale for each driving action in natural language (Wang et al., 2023).
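As a rough stand-in for the waypoint-to-control decoding step, the sketch below converts ego-frame waypoints into steering and throttle commands. The geometry and gains are illustrative assumptions, not the GRU decoders or PID controllers of the cited stacks.

```python
import numpy as np

def waypoints_to_controls(waypoints, current_speed, max_throttle=0.75):
    """Convert ego-frame waypoints (x forward, y left) into (steer, throttle).
    Simplified stand-in for learned/PID waypoint decoding."""
    aim = waypoints[1]                           # steer toward the second waypoint
    steer = np.arctan2(aim[1], aim[0]) / np.pi   # normalize heading error to [-1, 1]
    # Desired speed approximated from spacing between consecutive waypoints.
    desired_speed = np.linalg.norm(waypoints[1] - waypoints[0]) * 2.0
    throttle = float(np.clip(desired_speed - current_speed, 0.0, max_throttle))
    return float(steer), throttle

wps = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.1], [4.0, 0.3]])
steer, throttle = waypoints_to_controls(wps, current_speed=1.0)
```

For a straight-ahead waypoint sequence the steering command is near zero and the throttle saturates toward the desired speed, which matches the qualitative behavior expected of the decoding heads.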
5. Experimental Results and Ablation Studies
Tri-modal architectures consistently outperform bi-modal or uni-modal variants across metrics:
| Model/Modality | Driving Score (DS) | Route Completion (RC, %) | Infraction Score/Rate | Hazard Acc. (%) |
|---|---|---|---|---|
| DriveMLM (MV+TQ) | 76.1 | 98.1 | 0.78 | n/a |
| Apollo FSM | 71.4 | 92.2 | 0.80 | n/a |
| M2DA | 72.6 (Town05) | 89.7 | 0.80 | n/a |
| MMFN (VectorMap) | 75.6 | 82.2 | 0.50 (/km) | n/a |
| Driver Assist (A+V+R) | n/a | n/a | n/a | 96.9 |
Ablations reveal that:
- Removing the third modality (audio, attention, or HD-map) reduces accuracy or driving score by 3–16 points, depending on stack and dataset (Wang et al., 2023, Xu et al., 2024, Zhang et al., 2022, Zhouxiang et al., 5 Feb 2025).
- For hazard detection, dropping audio reduces accuracy from 96.9% to 93.8% (A+V), and any uni-modal solution falls below 88% (Zhouxiang et al., 5 Feb 2025).
- Tri-modal fusion achieves lower infraction rates and more human-like failure modes; for example, collision rates drop from 0.10 to 0.02, and red-light infractions become negligible once driver attention is included (Xu et al., 2024).
- Explanatory text and decision accuracy in DriveMLM (decision accuracy 75.2% vs. Apollo 18.5%; BLEU-4 40.5) demonstrate state-of-the-art interpretability (Wang et al., 2023).
6. Integration, Latency, and Practical Considerations
Tri-modal stacks are typically designed for plug-and-play deployment within modular autonomous driving systems:
- DriveMLM supports direct substitution for the conventional Apollo FSM planner; the only interface change is at the Python API that maps decision symbols to planning calls. Real-time performance (60 ms end-to-end on A100–40GB, <32 GB VRAM requirement) enables practical deployment (Wang et al., 2023).
- MMFN and M2DA fuse features at multiple depths, allowing for spatial alignment across modalities with minimal latency—critical for online closed-loop evaluation in simulation or real vehicles (Zhang et al., 2022, Xu et al., 2024).
- Dynamic Handling of Missing Modalities: Models with cross-modal attention can degrade gracefully when one input is missing or noisy, as the attention mechanism naturally re-weights trust in available evidence (Zhouxiang et al., 5 Feb 2025).
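This graceful-degradation behavior can be approximated by masking the tokens of an unavailable modality before the softmax, so that attention weight redistributes over the remaining evidence. The following is a mechanism sketch under assumed shapes, not the cited model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(query, keys, available):
    """Attend over tokens from several source modalities, masking any
    modality flagged unavailable so weight shifts to the rest."""
    kv = np.concatenate(list(keys.values()), axis=0)
    scores = query @ kv.T / np.sqrt(query.shape[-1])
    # Token-level mask: -inf for every token of a missing modality.
    mask = np.concatenate([
        np.zeros(k.shape[0]) if available[m] else np.full(k.shape[0], -np.inf)
        for m, k in keys.items()
    ])
    weights = softmax(scores + mask, axis=-1)
    return weights @ kv, weights

rng = np.random.default_rng(1)
q = rng.normal(size=(2, 8))
keys = {"face": rng.normal(size=(3, 8)), "audio": rng.normal(size=(3, 8))}
out, w = masked_cross_attention(q, keys, {"face": True, "audio": False})
```

With the audio stream masked, all attention weight concentrates on the face tokens while the rows still sum to one, i.e., the fusion output remains a valid convex combination of the available evidence.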
System messages or pre-defined templates ensure traffic-rule-aligned decision spaces, facilitating regulatory compliance and human-in-the-loop debuggability.
7. Significance and Theoretical Implications
Tri-modal driving modules formalize the process of integrating visual, geometric, semantic, and cognitive cues, pushing driving policy inference closer to robust, interpretable, and human-aligned decision-making. Empirically, they:
- Improve robustness to adverse data conditions and rare events by leveraging complementary sensors or cognitive priors.
- Outperform bi-modal designs in safety-critical metrics, route completion, and infraction rates.
- Enable transparent reasoning (via natural language explanations or attention maps), which aids both developer insight and regulatory audit.
A plausible implication is that further scaling or refinement of these modules—with richer fusion schemas or additional cognitive streams—will be crucial for reliable generalization and deployment at scale in complex, real-world environments.
Key references include DriveMLM (Wang et al., 2023), M2DA (Xu et al., 2024), MMFN (Zhang et al., 2022), and the multimodal driver assistance hazard detector (Zhouxiang et al., 5 Feb 2025), which collectively establish tri-modal fusion as a technical standard for high-performance autonomous and assisted driving.