TransFuser v6: End-to-End Driving Policy

Updated 30 December 2025
  • TransFuser v6 is an end-to-end imitation-learning policy that integrates rich sensor inputs using a token-based transformer, narrowing the gap between expert demonstrations and student observations.
  • It fuses multi-modal data from surrounding cameras, LiDAR, radar, and GNSS waypoints, enabling precise continuous vehicular control through advanced self- and cross-attention mechanisms.
  • TFv6 achieves significant performance gains on simulated benchmarks and demonstrates effective sim-to-real transfer, highlighting its potential for scalable autonomous driving deployment.

TransFuser v6 (TFv6) is an end-to-end imitation-learning policy for autonomous driving that addresses Learner–Expert Asymmetry by narrowing gaps between privileged expert demonstrations and student agent observations. Developed as part of the LEAD framework, TFv6 establishes new performance benchmarks across simulated and real-world autonomous driving tasks through architectural advances, targeted interventions, and multi-modal fusion strategies (Nguyen et al., 23 Dec 2025).

1. Model Architecture and Sensor Modalities

TFv6 maps rich multi-modal sensor data to continuous vehicular control via a token-based transformer backbone. Its input modalities comprise:

  • Surround-view RGB Cameras: Up to six synchronized automotive-grade images per timestep, encoded by a ResNet-34 or RegNetY-032 backbone and lifted to $N_\mathrm{bev}$ two-dimensional BEV tokens $T_\mathrm{img}^{(i)} \in \mathbb{R}^{N_\mathrm{bev} \times d_\mathrm{model}}$.
  • Spinning LiDAR Point Clouds: Voxelized and embedded to yield tokens $T_\mathrm{lidar} \in \mathbb{R}^{N_\mathrm{bev} \times d_\mathrm{model}}$.
  • Automotive Radars: Four units providing up to 75 detections per frame, encoded as $T_\mathrm{radar} \in \mathbb{R}^{M_\mathrm{radar} \times d_\mathrm{model}}$.
  • GNSS “Target Points”: Three navigation waypoints (previous, current, future), tokenized as $T_\mathrm{TP} \in \mathbb{R}^{3 \times d_\mathrm{model}}$.

All tokens are assigned learned modality- and position-specific positional embeddings (PE). The fusion module employs $L = 6$ layers of multi-head self- and cross-attention, stacking the input tokens $E = [T_\mathrm{img}^{(1)}; \dots; T_\mathrm{img}^{(\#\mathrm{cams})}; T_\mathrm{lidar}; T_\mathrm{radar}; T_\mathrm{TP}]$ into a latent representation $Z \in \mathbb{R}^{N_\mathrm{tot} \times d_\mathrm{model}}$.
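
The following is a minimal PyTorch sketch of this token-stacking and fusion step. The module name `MultiModalFusion`, the token counts, and the use of `nn.TransformerEncoder` are illustrative assumptions; only the overall shape of the computation (concatenate modality tokens, add learned positional embeddings, apply $L = 6$ attention layers) follows the description above.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Minimal sketch of TFv6-style token fusion (names and dimensions illustrative)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=6,
                 n_bev=64, n_radar=75, n_tp=3, n_cams=6):
        super().__init__()
        n_tot = n_cams * n_bev + n_bev + n_radar + n_tp
        # Learned modality- and position-specific embeddings over the full token stack.
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tot, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, img_tokens, lidar_tokens, radar_tokens, tp_tokens):
        # img_tokens: (B, n_cams * n_bev, d_model) camera BEV tokens, already projected
        # lidar_tokens: (B, n_bev, d_model); radar_tokens: (B, n_radar, d_model)
        # tp_tokens: (B, 3, d_model) previous / current / future target points
        E = torch.cat([img_tokens, lidar_tokens, radar_tokens, tp_tokens], dim=1)
        return self.encoder(E + self.pos_emb)  # Z: (B, N_tot, d_model)

# Usage with dummy tensors (batch of 2).
fusion = MultiModalFusion()
Z = fusion(torch.randn(2, 6 * 64, 256), torch.randn(2, 64, 256),
           torch.randn(2, 75, 256), torch.randn(2, 3, 256))
print(Z.shape)  # torch.Size([2, 526, 256])
```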

Trajectory and speed queries cross-attend to $Z$ to decode future waypoints $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_K\}$ and a target speed $\hat{v}$, which are fed to a PID controller for fine-grained steering, throttle, and brake outputs. TFv6 omits GRU layers in its route-decoding stage.
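
Below is a minimal sketch of how predicted waypoints and a target speed could drive such a controller, assuming standard PID equations; the gains, the choice of aim waypoint, and the helper names (`PID`, `waypoints_to_control`) are illustrative, not the paper's tuned controller.

```python
import numpy as np

class PID:
    """Simple PID controller; gains are illustrative, not the paper's tuned values."""
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def waypoints_to_control(waypoints, target_speed, current_speed,
                         steer_pid, speed_pid):
    # Lateral error: heading toward a near-future waypoint in the ego frame
    # (x forward, y lateral); waypoints is a (K, 2) array predicted by the policy.
    aim = waypoints[min(1, len(waypoints) - 1)]
    heading_err = np.arctan2(aim[1], aim[0])
    steer = float(np.clip(steer_pid.step(heading_err), -1.0, 1.0))

    # Longitudinal control from the speed error; positive -> throttle, negative -> brake.
    accel = speed_pid.step(target_speed - current_speed)
    throttle = float(np.clip(accel, 0.0, 0.75))
    brake = float(np.clip(-accel, 0.0, 1.0))
    return steer, throttle, brake

steer_pid, speed_pid = PID(1.0, 0.0, 0.1), PID(0.5, 0.05, 0.0)
print(waypoints_to_control(np.array([[1.0, 0.1], [2.0, 0.3]]), 6.0, 4.5,
                           steer_pid, speed_pid))
```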

2. Core Mathematical Formulation

TFv6's perception, fusion, and policy-prediction procedures follow standard transformer practice. The key formulations are as follows (a minimal numerical sketch appears after the equations):

  • Multi-Head Self-Attention: For query $Q$, key $K$, and value $V$:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

For $h$ heads: $h_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$ and $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(h_1,\dots,h_h) W^O$.

  • Positional Encoding: Each modality $m$ receives learned embeddings $PE_m \in \mathbb{R}^{N \times d_\mathrm{model}}$:

$$X' = X + PE$$

and tokens are computed as $T_m = \phi_m(\mathrm{input}) W_m + PE_m$, with $\phi_m$ the modality encoder and $W_m$ the projection.

  • Fusion and Route Decoding: The stacked tokens $E$ pass through $L$ residual attention layers,

$$Z = \mathrm{Transformer}_L(E) = \mathrm{LayerNorm}(E + \mathrm{MultiHead}(E,E,E)) \rightarrow \dots$$

and route decoding via cross-attention:

$$A_\mathrm{route} = \mathrm{MultiHead}(Q_\mathrm{route}, Z, Z), \qquad \hat{Y} = \mathrm{MLP}(A_\mathrm{route})$$
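
The sketch below works through these formulas in NumPy: scaled dot-product attention, followed by route queries cross-attending to fused tokens $Z$. The dimensions, the single-head simplification, and the small MLP head are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in the self-/cross-attention formula above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, n_tokens, n_queries, K_wp = 256, 526, 8, 4

# Z stands in for the fused latent tokens; Q_route for learned trajectory queries.
Z = rng.standard_normal((n_tokens, d_model))
Q_route = rng.standard_normal((n_queries, d_model))

# Cross-attention: queries attend to the fused representation (single head for brevity).
A_route = attention(Q_route, Z, Z)                      # (n_queries, d_model)

# Illustrative MLP head mapping attended features to K_wp future (x, y) waypoints.
W1 = rng.standard_normal((d_model, 64)) * 0.02
W2 = rng.standard_normal((64, 2 * K_wp)) * 0.02
Y_hat = (np.maximum(A_route.mean(axis=0) @ W1, 0.0) @ W2).reshape(K_wp, 2)
print(Y_hat.shape)  # (4, 2)
```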

3. Interventions for Learner–Expert Asymmetry

TFv6 explicitly targets three principal asymmetries between expert and learner:

  • Visibility Asymmetry: The expert's input is constrained to exactly the student's camera frusta, including occlusion, weather, and time-of-day effects. Traffic lights and speed limits are considered only when visible in this context, and the expert's speed is capped as $v_\mathrm{cap} = \min(v_\mathrm{posted}, v_\mathrm{flow})$, where $v_\mathrm{flow}$ is the inferred median traffic speed.
  • Uncertainty Asymmetry: To match perception-induced safety margins:
    • The expert brakes not only for predicted collisions but whenever any actor is within $d_\mathrm{safe}$ of the ego vehicle.
    • Cruising speed is reduced under visibility degradation by a factor $\alpha_\mathrm{vis} < 1$ ($v_\mathrm{expert} \leftarrow \alpha_\mathrm{vis} \cdot v_\mathrm{expert}$).
    • Actor bounding boxes are expanded by $\Delta_\mathrm{bbox}$ (e.g., $+20\%$) in unprotected-turn scenarios.
  • Intent Asymmetry: TFv6 replaces the single target point with three temporally ordered target points, facilitating multi-lane and ambiguous maneuvers. During training, the active target point switches once the ego vehicle is within 3 meters of the current one, exposing the policy to future navigational goals earlier. These tokens are fused with the BEV representation at the earliest model stage (a minimal sketch of these expert-side interventions follows this list).
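
The sketch below illustrates these expert-side interventions in plain Python. The specific thresholds ($d_\mathrm{safe} = 4$ m, $\alpha_\mathrm{vis} = 0.7$) and function names are illustrative assumptions; only the speed cap, proximity braking, visibility scaling, bounding-box inflation, and 3 m target-point switching follow the description above.

```python
import math

def capped_speed(v_posted, v_flow):
    # Visibility asymmetry: the expert never exceeds the posted or inferred flow speed.
    return min(v_posted, v_flow)

def expert_target_speed(v_expert, nearest_actor_dist, visibility_degraded,
                        d_safe=4.0, alpha_vis=0.7):
    # Uncertainty asymmetry: brake whenever any actor is within d_safe of the ego,
    # and scale the cruising speed by alpha_vis < 1 under degraded visibility.
    if nearest_actor_dist < d_safe:
        return 0.0
    return alpha_vis * v_expert if visibility_degraded else v_expert

def inflate_bbox(extent_xy, unprotected_turn, delta=0.20):
    # Expand actor bounding boxes (e.g., +20%) in unprotected-turn scenarios.
    scale = 1.0 + delta if unprotected_turn else 1.0
    return (extent_xy[0] * scale, extent_xy[1] * scale)

def active_target_point(tp_list, ego_xy, idx, switch_dist=3.0):
    # Intent asymmetry: switch to the next of three ordered target points
    # once the ego vehicle is within 3 m of the current one.
    tx, ty = tp_list[idx]
    if math.hypot(tx - ego_xy[0], ty - ego_xy[1]) < switch_dist and idx + 1 < len(tp_list):
        idx += 1
    return idx

print(capped_speed(13.9, 11.0))                                   # 11.0 m/s
print(expert_target_speed(11.0, 3.2, visibility_degraded=True))   # 0.0 (actor too close)
print(active_target_point([(5, 0), (30, 2), (60, 5)], (4.0, 0.5), 0))  # 1
```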

4. Training Objectives and Optimization

TFv6 optimization employs both supervised imitation and auxiliary perception losses:

  • Imitation Loss:

$$L_\mathrm{imit} = \mathbb{E}_{(s, a^*)} \left[ \| \hat{Y}_\theta(s) - Y^* \|^2 + \| \hat{v}_\theta(s) - v^* \|^2 \right]$$

where $s$ is the observation (sensor data and target points) and $a^* = (Y^*, v^*)$ is the expert action.

  • Auxiliary Sim-to-Real Perception Losses:

$$L_\mathrm{det} = \mathbb{E}_x \left[ \mathrm{SmoothL1}(b_\mathrm{pred}, b_\mathrm{gt}) + \mathrm{CE}(c_\mathrm{pred}, c_\mathrm{gt}) \right]$$

for detection and semantic segmentation, where $b_\mathrm{pred}$ and $b_\mathrm{gt}$ are the predicted and ground-truth boxes, $c_\mathrm{pred}$ the predicted class logits, and $c_\mathrm{gt}$ the ground-truth classes.

  • Total Loss:

$$L_\mathrm{total} = \lambda_1 L_\mathrm{imit} + \lambda_2 L_\mathrm{det} + \ldots$$

with primary weighting on the imitation term (a minimal sketch of the combined objective is given below).
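
The following is a minimal PyTorch sketch of the combined objective; the loss weights and tensor shapes are illustrative assumptions, and the segmentation term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def tfv6_loss(wp_pred, wp_gt, speed_pred, speed_gt,
              box_pred, box_gt, cls_logits, cls_gt,
              lambda_imit=1.0, lambda_det=0.1):
    # Imitation loss: squared error on future waypoints and target speed.
    l_imit = F.mse_loss(wp_pred, wp_gt) + F.mse_loss(speed_pred, speed_gt)

    # Auxiliary detection loss: SmoothL1 on boxes + cross-entropy on classes.
    l_det = (F.smooth_l1_loss(box_pred, box_gt)
             + F.cross_entropy(cls_logits.flatten(0, 1), cls_gt.flatten()))

    # Total loss with primary weighting on imitation (weights are illustrative).
    return lambda_imit * l_imit + lambda_det * l_det

# Dummy shapes: 4 waypoints (x, y), scalar speed, 10 boxes with 4 params, 5 classes.
loss = tfv6_loss(torch.randn(2, 4, 2), torch.randn(2, 4, 2),
                 torch.randn(2, 1), torch.randn(2, 1),
                 torch.randn(2, 10, 4), torch.randn(2, 10, 4),
                 torch.randn(2, 10, 5), torch.randint(0, 5, (2, 10)))
print(loss.item())
```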

5. Evaluation on Simulated Driving Benchmarks

TFv6 demonstrates statistically significant improvements on key closed-loop CARLA benchmarks:

| Task/Benchmark | Baseline (TFv5 / prior SoTA) | TFv6 Performance | Relative Gain vs. TFv5 |
| --- | --- | --- | --- |
| Bench2Drive | TFv5: DS 83.5, SR 67.3; HiP-AD: DS 86.8, SR 69.1 | DS 95.2 (+11.7), SR 86.8 (+19.5) | +14% DS, +29% SR |
| Longest6 v2 | TFv5: DS 23, RC 70%; SimLingo: DS 22, RC 70%; HiP-AD: DS 7, RC 56% | DS 62 (+39), RC 91% (+21) | +169% DS, +30% RC |
| Town13 (Validation) | TFv5: DS 1.08, NDS 2.12; Expert: DS 36.3, NDS 58.5 | DS 2.65 (+1.57), NDS 4.04 (+1.92) | +145% DS, +91% NDS |

(DS: driving score; SR: success rate; RC: route completion; NDS: normalized driving score.)

Infraction breakdowns (Figure 1) indicate that visibility and uncertainty alignment reduce collision rates, while multi-point intent representation decreases target-fixation incidents, with a modest rise in route deviations.

6. Sim-to-Real Transfer and Curriculum Strategy

TFv6 adapts to real-world open-loop driving tasks (NAVSIM v1/v2, WOD-E2E) via the LTFv6 (latent TransFuser v6) variant, which omits LiDAR and radar, replacing the LiDAR input with a positional encoding. Training follows a curriculum that co-trains on mixed synthetic (CARLA) and real data (a minimal scheduling sketch follows the epoch list):

  • Epochs 1–30: Mixed real + synthetic
  • Epochs 31–120: Real only
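
Below is a minimal sketch of this epoch-based schedule; the function name, the 50/50 synthetic mixing fraction, and the per-epoch re-sampling are illustrative assumptions, with only the epoch boundaries (mixed through epoch 30, real-only thereafter) taken from the curriculum above.

```python
import random

def build_epoch_samples(epoch, real_samples, synthetic_samples,
                        mixed_until_epoch=30, synth_fraction=0.5, seed=0):
    """Return the training pool for one epoch of the co-training curriculum.

    Epochs 1-30: mixed real + synthetic (CARLA); epochs 31-120: real only.
    The 50/50 mixing fraction is an illustrative assumption.
    """
    rng = random.Random(seed + epoch)
    if epoch <= mixed_until_epoch:
        n_synth = int(synth_fraction * len(real_samples))
        pool = list(real_samples) + rng.sample(list(synthetic_samples),
                                               min(n_synth, len(synthetic_samples)))
    else:
        pool = list(real_samples)
    rng.shuffle(pool)
    return pool

real = [f"real_{i}" for i in range(8)]
synth = [f"carla_{i}" for i in range(20)]
print(len(build_epoch_samples(5, real, synth)))    # 12 (mixed phase)
print(len(build_epoch_samples(60, real, synth)))   # 8  (real only)
```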

Empirical results show consistent gains (Table 6):

| Benchmark | Metric | Base LTFv6 | + LEAD Data | Expert/Human Upper Bound |
| --- | --- | --- | --- | --- |
| NAVSIM v1 | PDMS | 85.4 | 86.4 | 94.5 |
| NAVSIM v2 | EPDMS | 28.3 | 31.4 | 51.3 (planner) |
| WOD-E2E | RFS | 7.51 | 7.76 | 8.10 (human) |

This suggests that architectural and data-level interventions, as instantiated in TFv6 and LEAD, result in real-world gains under substantial domain shift.

7. Context and Significance

TFv6 sets a new state of the art in CARLA closed-loop driving while also demonstrating sim-to-real transferability. By minimizing Learner–Expert Asymmetry in visibility, uncertainty, and intent representation, TFv6 closes performance gaps endemic to sensor-based imitation learning. The framework bridges the data and architecture requirements between simulated and operational domains, supporting scalable deployment and cross-domain adaptation (Nguyen et al., 23 Dec 2025).

References

  • Nguyen et al., 23 December 2025.
