TransFuser v6: End-to-End Driving Policy

Updated 30 December 2025
  • TransFuser v6 is an end-to-end imitation-learning policy that integrates rich sensor inputs using a token-based transformer, narrowing the gap between expert demonstrations and student observations.
  • It fuses multi-modal data from surrounding cameras, LiDAR, radar, and GNSS waypoints, enabling precise continuous vehicular control through advanced self- and cross-attention mechanisms.
  • TFv6 achieves significant performance gains on simulated benchmarks and demonstrates effective sim-to-real transfer, highlighting its potential for scalable autonomous driving deployment.

TransFuser v6 (TFv6) is an end-to-end imitation-learning policy for autonomous driving that addresses Learner–Expert Asymmetry by narrowing gaps between privileged expert demonstrations and student agent observations. Developed as part of the LEAD framework, TFv6 establishes new performance benchmarks across simulated and real-world autonomous driving tasks through architectural advances, targeted interventions, and multi-modal fusion strategies (Nguyen et al., 23 Dec 2025).

1. Model Architecture and Sensor Modalities

TFv6 maps rich multi-modal sensor data to continuous vehicular control via a token-based transformer backbone. Its input modalities comprise:

  • Surround-view RGB Cameras: Up to six synchronized automotive-grade images per timestep, encoded by a ResNet-34 or RegNetY-032 backbone and lifted to $N_\mathrm{bev}$ two-dimensional BEV tokens $T_\mathrm{img}^{(i)} \in \mathbb{R}^{N_\mathrm{bev} \times d_\mathrm{model}}$.
  • Spinning LiDAR Point Clouds: Voxelized and embedded to yield tokens $T_\mathrm{lidar} \in \mathbb{R}^{N_\mathrm{bev} \times d_\mathrm{model}}$.
  • Automotive Radars: Four units providing up to 75 detections per frame, encoded as $T_\mathrm{radar} \in \mathbb{R}^{M_\mathrm{radar} \times d_\mathrm{model}}$.
  • GNSS “Target Points”: Three navigation waypoints (previous, current, future), tokenized as $T_\mathrm{TP} \in \mathbb{R}^{3 \times d_\mathrm{model}}$.

All tokens are assigned learned modality- and position-specific positional embeddings (PE). The fusion module employs $L = 6$ layers of multi-head self- and cross-attention, stacking the input tokens $E = [T_\mathrm{img}^{(1)}; \dots; T_\mathrm{img}^{(\#\mathrm{cams})}; T_\mathrm{lidar}; T_\mathrm{radar}; T_\mathrm{TP}]$ into a latent representation $Z \in \mathbb{R}^{N_\mathrm{tot} \times d_\mathrm{model}}$.
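
The following is a minimal PyTorch sketch of this token-stacking and fusion step. The module name `MultiModalFusion`, the token counts, and the use of `nn.TransformerEncoder` are illustrative assumptions; only the overall shape of the computation (concatenate modality tokens, add learned positional embeddings, apply $L = 6$ attention layers) follows the description above.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Minimal sketch of TFv6-style token fusion (names and dimensions illustrative)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=6,
                 n_bev=64, n_radar=75, n_tp=3, n_cams=6):
        super().__init__()
        n_tot = n_cams * n_bev + n_bev + n_radar + n_tp
        # Learned modality- and position-specific embeddings over the full token stack.
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tot, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, img_tokens, lidar_tokens, radar_tokens, tp_tokens):
        # img_tokens: (B, n_cams * n_bev, d_model) camera BEV tokens, already projected
        # lidar_tokens: (B, n_bev, d_model); radar_tokens: (B, n_radar, d_model)
        # tp_tokens: (B, 3, d_model) previous / current / future target points
        E = torch.cat([img_tokens, lidar_tokens, radar_tokens, tp_tokens], dim=1)
        return self.encoder(E + self.pos_emb)  # Z: (B, N_tot, d_model)

# Usage with dummy tensors (batch of 2).
fusion = MultiModalFusion()
Z = fusion(torch.randn(2, 6 * 64, 256), torch.randn(2, 64, 256),
           torch.randn(2, 75, 256), torch.randn(2, 3, 256))
print(Z.shape)  # torch.Size([2, 526, 256])
```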

Trajectory and speed queries cross-attend to $Z$ to decode future waypoints $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_K\}$ and a target speed $\hat{v}$, which are fed to a PID controller for fine-grained steering, throttle, and brake outputs. TFv6 omits GRU layers in its route-decoding stage.
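
Below is a minimal sketch of how predicted waypoints and a target speed could drive such a controller, assuming standard PID equations; the gains, the choice of aim waypoint, and the helper names (`PID`, `waypoints_to_control`) are illustrative, not the paper's tuned controller.

```python
import numpy as np

class PID:
    """Simple PID controller; gains are illustrative, not the paper's tuned values."""
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def waypoints_to_control(waypoints, target_speed, current_speed,
                         steer_pid, speed_pid):
    # Lateral error: heading toward a near-future waypoint in the ego frame
    # (x forward, y lateral); waypoints is a (K, 2) array predicted by the policy.
    aim = waypoints[min(1, len(waypoints) - 1)]
    heading_err = np.arctan2(aim[1], aim[0])
    steer = float(np.clip(steer_pid.step(heading_err), -1.0, 1.0))

    # Longitudinal control from the speed error; positive -> throttle, negative -> brake.
    accel = speed_pid.step(target_speed - current_speed)
    throttle = float(np.clip(accel, 0.0, 0.75))
    brake = float(np.clip(-accel, 0.0, 1.0))
    return steer, throttle, brake

steer_pid, speed_pid = PID(1.0, 0.0, 0.1), PID(0.5, 0.05, 0.0)
print(waypoints_to_control(np.array([[1.0, 0.1], [2.0, 0.3]]), 6.0, 4.5,
                           steer_pid, speed_pid))
```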

2. Core Mathematical Formulation

TFv6's perception, fusion, and policy-prediction procedures follow standard transformer practice. The key formulations are as follows (a minimal numerical sketch appears after the equations):

  • Multi-Head Self-Attention: For query $Q$, key $K$, and value $V$:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

For $h$ heads: $h_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$ and $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(h_1,\dots,h_h) W^O$.

  • Positional Encoding: Each modality $m$ receives learned embeddings $PE_m \in \mathbb{R}^{N \times d_\mathrm{model}}$:

$$X' = X + PE$$

and tokens are computed as $T_m = \phi_m(\mathrm{input}) W_m + PE_m$, with $\phi_m$ the modality encoder and $W_m$ the projection.

  • Fusion and Route Decoding: The stacked tokens $E$ pass through $L$ residual attention layers,

$$Z = \mathrm{Transformer}_L(E) = \mathrm{LayerNorm}(E + \mathrm{MultiHead}(E,E,E)) \rightarrow \dots$$

and route decoding via cross-attention:

$$A_\mathrm{route} = \mathrm{MultiHead}(Q_\mathrm{route}, Z, Z), \qquad \hat{Y} = \mathrm{MLP}(A_\mathrm{route})$$
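
The sketch below works through these formulas in NumPy: scaled dot-product attention, followed by route queries cross-attending to fused tokens $Z$. The dimensions, the single-head simplification, and the small MLP head are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in the self-/cross-attention formula above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, n_tokens, n_queries, K_wp = 256, 526, 8, 4

# Z stands in for the fused latent tokens; Q_route for learned trajectory queries.
Z = rng.standard_normal((n_tokens, d_model))
Q_route = rng.standard_normal((n_queries, d_model))

# Cross-attention: queries attend to the fused representation (single head for brevity).
A_route = attention(Q_route, Z, Z)                      # (n_queries, d_model)

# Illustrative MLP head mapping attended features to K_wp future (x, y) waypoints.
W1 = rng.standard_normal((d_model, 64)) * 0.02
W2 = rng.standard_normal((64, 2 * K_wp)) * 0.02
Y_hat = (np.maximum(A_route.mean(axis=0) @ W1, 0.0) @ W2).reshape(K_wp, 2)
print(Y_hat.shape)  # (4, 2)
```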

3. Interventions for Learner–Expert Asymmetry

TFv6 explicitly targets three principal asymmetries between expert and learner:

  • Visibility Asymmetry: The expert's input is constrained to exactly the student's camera frusta, including occlusion, weather, and time-of-day effects. Traffic lights and speed limits are considered only when visible in this context, and the expert's speed is capped as $v_\mathrm{cap} = \min(v_\mathrm{posted}, v_\mathrm{flow})$, where $v_\mathrm{flow}$ is the inferred median traffic speed.
  • Uncertainty Asymmetry: To match perception-induced safety margins:
    • The expert brakes not only for predicted collisions but whenever any actor is within $d_\mathrm{safe}$ of the ego vehicle.
    • Cruising speed is reduced under visibility degradation by a factor $\alpha_\mathrm{vis} < 1$ ($v_\mathrm{expert} \leftarrow \alpha_\mathrm{vis} \cdot v_\mathrm{expert}$).
    • Actor bounding boxes are expanded by $\Delta_\mathrm{bbox}$ (e.g., $+20\%$) in unprotected-turn scenarios.
  • Intent Asymmetry: TFv6 replaces the single target point with three temporally ordered target points, facilitating multi-lane and ambiguous maneuvers. During training, the active target point switches once the ego vehicle is within 3 meters of the current one, exposing the policy to future navigational goals earlier. These tokens are fused with the BEV representation at the earliest model stage (a minimal sketch of these expert-side interventions follows this list).
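
The sketch below illustrates these expert-side interventions in plain Python. The specific thresholds ($d_\mathrm{safe} = 4$ m, $\alpha_\mathrm{vis} = 0.7$) and function names are illustrative assumptions; only the speed cap, proximity braking, visibility scaling, bounding-box inflation, and 3 m target-point switching follow the description above.

```python
import math

def capped_speed(v_posted, v_flow):
    # Visibility asymmetry: the expert never exceeds the posted or inferred flow speed.
    return min(v_posted, v_flow)

def expert_target_speed(v_expert, nearest_actor_dist, visibility_degraded,
                        d_safe=4.0, alpha_vis=0.7):
    # Uncertainty asymmetry: brake whenever any actor is within d_safe of the ego,
    # and scale the cruising speed by alpha_vis < 1 under degraded visibility.
    if nearest_actor_dist < d_safe:
        return 0.0
    return alpha_vis * v_expert if visibility_degraded else v_expert

def inflate_bbox(extent_xy, unprotected_turn, delta=0.20):
    # Expand actor bounding boxes (e.g., +20%) in unprotected-turn scenarios.
    scale = 1.0 + delta if unprotected_turn else 1.0
    return (extent_xy[0] * scale, extent_xy[1] * scale)

def active_target_point(tp_list, ego_xy, idx, switch_dist=3.0):
    # Intent asymmetry: switch to the next of three ordered target points
    # once the ego vehicle is within 3 m of the current one.
    tx, ty = tp_list[idx]
    if math.hypot(tx - ego_xy[0], ty - ego_xy[1]) < switch_dist and idx + 1 < len(tp_list):
        idx += 1
    return idx

print(capped_speed(13.9, 11.0))                                   # 11.0 m/s
print(expert_target_speed(11.0, 3.2, visibility_degraded=True))   # 0.0 (actor too close)
print(active_target_point([(5, 0), (30, 2), (60, 5)], (4.0, 0.5), 0))  # 1
```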

4. Training Objectives and Optimization

TFv6 optimization employs both supervised imitation and auxiliary perception losses:

  • Imitation Loss:

$$L_\mathrm{imit} = \mathbb{E}_{(s, a^*)} \left[ \| \hat{Y}_\theta(s) - Y^* \|^2 + \| \hat{v}_\theta(s) - v^* \|^2 \right]$$

where $s$ is the observation (sensor data and target points) and $a^* = (Y^*, v^*)$ is the expert action.

  • Auxiliary Sim-to-Real Perception Losses:

$$L_\mathrm{det} = \mathbb{E}_x \left[ \mathrm{SmoothL1}(b_\mathrm{pred}, b_\mathrm{gt}) + \mathrm{CE}(c_\mathrm{pred}, c_\mathrm{gt}) \right]$$

for detection and semantic segmentation, where $b_\mathrm{pred}$ and $b_\mathrm{gt}$ are the predicted and ground-truth boxes, $c_\mathrm{pred}$ the predicted class logits, and $c_\mathrm{gt}$ the ground-truth classes.

  • Total Loss:

$$L_\mathrm{total} = \lambda_1 L_\mathrm{imit} + \lambda_2 L_\mathrm{det} + \ldots$$

with primary weighting on the imitation term (a minimal sketch of the combined objective is given below).
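
The following is a minimal PyTorch sketch of the combined objective; the loss weights and tensor shapes are illustrative assumptions, and the segmentation term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def tfv6_loss(wp_pred, wp_gt, speed_pred, speed_gt,
              box_pred, box_gt, cls_logits, cls_gt,
              lambda_imit=1.0, lambda_det=0.1):
    # Imitation loss: squared error on future waypoints and target speed.
    l_imit = F.mse_loss(wp_pred, wp_gt) + F.mse_loss(speed_pred, speed_gt)

    # Auxiliary detection loss: SmoothL1 on boxes + cross-entropy on classes.
    l_det = (F.smooth_l1_loss(box_pred, box_gt)
             + F.cross_entropy(cls_logits.flatten(0, 1), cls_gt.flatten()))

    # Total loss with primary weighting on imitation (weights are illustrative).
    return lambda_imit * l_imit + lambda_det * l_det

# Dummy shapes: 4 waypoints (x, y), scalar speed, 10 boxes with 4 params, 5 classes.
loss = tfv6_loss(torch.randn(2, 4, 2), torch.randn(2, 4, 2),
                 torch.randn(2, 1), torch.randn(2, 1),
                 torch.randn(2, 10, 4), torch.randn(2, 10, 4),
                 torch.randn(2, 10, 5), torch.randint(0, 5, (2, 10)))
print(loss.item())
```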

5. Evaluation on Simulated Driving Benchmarks

TFv6 demonstrates statistically significant improvements on key closed-loop CARLA benchmarks:

| Task/Benchmark | Baseline (TFv5 / prior SoTA) | TFv6 Performance | Relative Gain vs. TFv5 |
| --- | --- | --- | --- |
| Bench2Drive | TFv5: DS 83.5, SR 67.3; HiP-AD: DS 86.8, SR 69.1 | DS 95.2 (+11.7), SR 86.8 (+19.5) | +14% DS, +29% SR |
| Longest6 v2 | TFv5: DS 23, RC 70%; SimLingo: DS 22, RC 70%; HiP-AD: DS 7, RC 56% | DS 62 (+39), RC 91% (+21) | +169% DS, +30% RC |
| Town13 (Validation) | TFv5: DS 1.08, NDS 2.12; Expert: DS 36.3, NDS 58.5 | DS 2.65 (+1.57), NDS 4.04 (+1.92) | +145% DS, +91% NDS |

(DS: driving score; SR: success rate; RC: route completion; NDS: normalized driving score.)

Infraction breakdowns (Figure 1) indicate that visibility and uncertainty alignment reduce collision rates, while multi-point intent representation decreases target-fixation incidents, with a modest rise in route deviations.

6. Sim-to-Real Transfer and Curriculum Strategy

TFv6 adapts to real-world open-loop driving tasks (NAVSIM v1/v2, WOD-E2E) via the LTFv6 (latent TransFuser v6) variant, which omits LiDAR and radar, replacing the LiDAR input with a positional encoding. Training follows a curriculum that co-trains on mixed synthetic (CARLA) and real data (a minimal scheduling sketch follows the epoch list):

  • Epochs 1–30: Mixed real + synthetic
  • Epochs 31–120: Real only
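
Below is a minimal sketch of this epoch-based schedule; the function name, the 50/50 synthetic mixing fraction, and the per-epoch re-sampling are illustrative assumptions, with only the epoch boundaries (mixed through epoch 30, real-only thereafter) taken from the curriculum above.

```python
import random

def build_epoch_samples(epoch, real_samples, synthetic_samples,
                        mixed_until_epoch=30, synth_fraction=0.5, seed=0):
    """Return the training pool for one epoch of the co-training curriculum.

    Epochs 1-30: mixed real + synthetic (CARLA); epochs 31-120: real only.
    The 50/50 mixing fraction is an illustrative assumption.
    """
    rng = random.Random(seed + epoch)
    if epoch <= mixed_until_epoch:
        n_synth = int(synth_fraction * len(real_samples))
        pool = list(real_samples) + rng.sample(list(synthetic_samples),
                                               min(n_synth, len(synthetic_samples)))
    else:
        pool = list(real_samples)
    rng.shuffle(pool)
    return pool

real = [f"real_{i}" for i in range(8)]
synth = [f"carla_{i}" for i in range(20)]
print(len(build_epoch_samples(5, real, synth)))    # 12 (mixed phase)
print(len(build_epoch_samples(60, real, synth)))   # 8  (real only)
```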

Empirical results show consistent gains (Table 6):

| Benchmark | Metric | Base LTFv6 | + LEAD Data | Expert/Human Upper Bound |
| --- | --- | --- | --- | --- |
| NAVSIM v1 | PDMS | 85.4 | 86.4 | 94.5 |
| NAVSIM v2 | EPDMS | 28.3 | 31.4 | 51.3 (planner) |
| WOD-E2E | RFS | 7.51 | 7.76 | 8.10 (human) |

This suggests that architectural and data-level interventions, as instantiated in TFv6 and LEAD, result in real-world gains under substantial domain shift.

7. Context and Significance

TFv6 sets a new state of the art in CARLA closed-loop driving while also demonstrating sim-to-real transferability. By minimizing Learner–Expert Asymmetry in visibility, uncertainty, and intent representation, TFv6 closes performance gaps endemic to sensor-based imitation learning. The framework bridges the data and architecture requirements between simulated and operational domains, supporting scalable deployment and cross-domain adaptation (Nguyen et al., 23 Dec 2025).

References

  • Nguyen et al., 23 December 2025.
