TransFuser v6: End-to-End Driving Policy
- TransFuser v6 is an end-to-end imitation-learning policy that integrates rich sensor inputs using a token-based transformer, narrowing the gap between expert demonstrations and student observations.
- It fuses multi-modal data from surround-view cameras, LiDAR, radar, and GNSS waypoints, mapping them to precise continuous vehicular control via self- and cross-attention.
- TFv6 achieves significant performance gains on simulated benchmarks and demonstrates effective sim-to-real transfer, highlighting its potential for scalable autonomous driving deployment.
TransFuser v6 (TFv6) is an end-to-end imitation-learning policy for autonomous driving that addresses Learner–Expert Asymmetry by narrowing gaps between privileged expert demonstrations and student agent observations. Developed as part of the LEAD framework, TFv6 establishes new performance benchmarks across simulated and real-world autonomous driving tasks through architectural advances, targeted interventions, and multi-modal fusion strategies (Nguyen et al., 23 Dec 2025).
1. Model Architecture and Sensor Modalities
TFv6 maps rich multi-modal sensor data to continuous vehicular control via a token-based transformer backbone. Its input modalities comprise:
- Surround-view RGB Cameras: Up to six synchronized automotive-grade images per timestep, encoded by a ResNet-34 or RegNetY-032 backbone, then lifted to two-dimensional BEV tokens.
- Spinning LiDAR Point Clouds: Voxelized and embedded into BEV tokens.
- Automotive Radars: Four units with up to 75 detections per frame, each detection encoded as a token.
- GNSS “Target Points”: Three navigation waypoints (previous, current, future), each tokenized individually.
All tokens are assigned learned modality- and position-specific positional embeddings (PE). The fusion module applies stacked layers of multi-head self- and cross-attention to the token sequence, producing a latent representation $Z$.
Trajectory and speed queries cross-attend to $Z$ to decode future waypoints $\hat{W}$ and a target speed $\hat{v}$, which are fed to a PID controller for fine-grained steering, throttle, and brake outputs. TFv6 omits GRU layers in its route-decoding stage.
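A minimal PyTorch sketch of this fusion-and-decoding pipeline follows; the module name, token counts, and dimensions are illustrative assumptions rather than the published configuration, and the per-modality encoders are abstracted into pre-computed token features.

```python
# Minimal sketch of TFv6-style token fusion and query-based decoding in PyTorch.
# All dimensions, token counts, and module names are illustrative assumptions,
# not the paper's exact configuration.
import torch
import torch.nn as nn

class TokenFusionPolicy(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 n_cam=6 * 64, n_lidar=128, n_radar=4 * 75, n_tp=3,
                 n_waypoints=8):
        super().__init__()
        n_tokens = n_cam + n_lidar + n_radar + n_tp
        # Learned modality- and position-specific embeddings for every token slot.
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, d_model))
        # Self-attention fusion over the stacked multi-modal token sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned queries: one per future waypoint, plus one for target speed.
        self.queries = nn.Parameter(torch.zeros(1, n_waypoints + 1, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.waypoint_head = nn.Linear(d_model, 2)   # (x, y) per waypoint
        self.speed_head = nn.Linear(d_model, 1)      # scalar target speed

    def forward(self, tokens):
        # tokens: (B, n_tokens, d_model), pre-encoded camera/LiDAR/radar/TP features.
        z = self.fusion(tokens + self.pos_embed)      # latent representation Z
        q = self.queries.expand(tokens.size(0), -1, -1)
        dec, _ = self.cross_attn(q, z, z)             # queries cross-attend to Z
        waypoints = self.waypoint_head(dec[:, :-1])   # (B, n_waypoints, 2)
        target_speed = self.speed_head(dec[:, -1])    # (B, 1)
        return waypoints, target_speed

# Usage: fused tokens would come from per-modality encoders (ResNet/RegNet,
# voxel nets, radar/GNSS MLPs); random features here just exercise the module.
policy = TokenFusionPolicy()
tokens = torch.randn(2, 6 * 64 + 128 + 4 * 75 + 3, 256)
wps, v = policy(tokens)
print(wps.shape, v.shape)  # torch.Size([2, 8, 2]) torch.Size([2, 1])
```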
2. Core Mathematical Formulation
TFv6's perception, fusion, and policy prediction procedures follow transformer best practices, with formulation details as follows:
- Multi-Head Self-Attention: For query $Q$, key $K$, and value $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

For $h$ heads: $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ and $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$.
- Positional Encoding: Each modality $m$ receives learned embeddings $e_m$, and tokens are computed as $t_m = W_m f_m(x_m) + e_m$, with $f_m$ the encoder and $W_m$ the projection.
- Joint Fusion and Decoding: Given the token stack $T = [t_{\text{cam}}; t_{\text{lidar}}; t_{\text{radar}}; t_{\text{TP}}]$:

$$Z = \mathrm{SelfAttn}_{\times L}(T + \mathrm{PE})$$

and route decoding via cross-attention:

$$(\hat{W}, \hat{v}) = \mathrm{CrossAttn}(Q_{\text{dec}}, Z, Z),$$

where $Q_{\text{dec}}$ are the learned trajectory and speed queries, $\hat{W}$ the future waypoints, and $\hat{v}$ the target speed.
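The attention equations above can be checked with a short NumPy implementation; the sequence length, model width, and head count are illustrative.

```python
# NumPy sketch of scaled dot-product and multi-head attention as defined above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, broadcast over any leading head dimension.
    d_k = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    n, d = X.shape
    d_h = d // h  # per-head dimension
    # Project once, then split into h heads: (h, n, d_h).
    def split(W):
        return (X @ W).reshape(n, h, d_h).transpose(1, 0, 2)
    heads = attention(split(W_q), split(W_k), split(W_v))  # head_1..head_h
    concat = heads.transpose(1, 0, 2).reshape(n, d)        # Concat(...)
    return concat @ W_o                                    # output projection W^O

rng = np.random.default_rng(0)
n, d, h = 10, 64, 8
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)]
print(multi_head_attention(X, *W, h=h).shape)  # (10, 64)
```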
3. Interventions for Learner–Expert Asymmetry
TFv6 explicitly targets three principal asymmetries between expert and learner:
- Visibility Asymmetry: The expert's input is constrained to precisely the student camera frusta, including occlusion, weather, and time-of-day effects. Traffic-light inclusion and speed-limit readings are bounded to the visible context, and the expert's speed is capped relative to the inferred median traffic speed.
- Uncertainty Asymmetry: To match perception-induced safety margins:
- The expert brakes not only for predicted collisions but whenever any actor comes within a threshold distance of the ego.
- Cruising speed is reduced under degraded visibility.
- Actor bounding boxes are expanded by a fixed margin in unprotected-turn scenarios.
- Intent Asymmetry: TFv6 replaces the single target-point specification with three temporally ordered points (previous, current, future), facilitating multi-lane and ambiguous maneuvers. During training, the TP triple advances once the ego is within 3 meters of the current target point, giving earlier exposure to future navigational goals (see the sketch after this list). These tokens are fused with BEV representations at the earliest model stage.
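The following is a minimal sketch of this TP-advancement logic; the 3 m switching radius comes from the text, while the function name, route representation, and boundary handling are illustrative assumptions.

```python
# Sketch of the (previous, current, future) target-point bookkeeping: the
# triple advances once the ego is within the switching radius of the current
# TP. The 3 m radius follows the text; names and route format are assumed.
import math

SWITCH_RADIUS_M = 3.0  # advance TPs when ego is within 3 m of the current TP

def update_target_points(ego_xy, route, idx):
    """Return the (previous, current, future) TPs and a possibly advanced index.

    ego_xy: (x, y) ego position; route: list of (x, y) waypoints; idx: index
    of the current target point along the route.
    """
    if math.dist(ego_xy, route[idx]) < SWITCH_RADIUS_M and idx + 1 < len(route):
        idx += 1  # advance early, exposing the policy to the next goal sooner
    prev = route[max(idx - 1, 0)]
    cur = route[idx]
    fut = route[min(idx + 1, len(route) - 1)]
    return (prev, cur, fut), idx

# Usage: the three temporally ordered TPs would then be tokenized and fused
# with the BEV tokens at the earliest model stage.
route = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0), (30.0, 5.0)]
tps, idx = update_target_points(ego_xy=(8.0, 0.5), route=route, idx=1)
print(tps, idx)
```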
4. Training Objectives and Optimization
TFv6 optimization employs both supervised imitation and auxiliary perception losses:
- Imitation Loss:

$$\mathcal{L}_{\mathrm{imit}} = \mathbb{E}_{(o,\, a^{*}) \sim \mathcal{D}}\big[\lVert \pi_\theta(o) - a^{*} \rVert\big]$$

where $o$ is the observation (sensors and TPs) and $a^{*}$ the expert action.
- Auxiliary Sim-to-Real Perception Losses:

$$\mathcal{L}_{\mathrm{aux}} = \mathcal{L}_{\mathrm{det}}(\hat{b}, b) + \mathcal{L}_{\mathrm{seg}}(\hat{c}, c)$$

for detection and semantic segmentation; $\hat{b}$ and $b$ are predicted and ground-truth boxes, $\hat{c}$ and $c$ class logits.
- Total Loss:

$$\mathcal{L} = \lambda_{\mathrm{imit}}\,\mathcal{L}_{\mathrm{imit}} + \lambda_{\mathrm{aux}}\,\mathcal{L}_{\mathrm{aux}}$$

with primary weighting on imitation ($\lambda_{\mathrm{imit}} \gg \lambda_{\mathrm{aux}}$).
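A hedged sketch of this objective is given below; the use of an L1 imitation term and the specific weights are illustrative assumptions, and segmentation stands in for the full auxiliary suite.

```python
# Sketch of the combined objective: imitation on waypoints/speed plus an
# auxiliary perception loss, with imitation weighted highest. The L1 choice
# and the lambda values are assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F

def total_loss(pred_wps, gt_wps, pred_speed, gt_speed,
               pred_seg, gt_seg, lambda_imit=1.0, lambda_aux=0.1):
    # Imitation loss: distance between predicted and expert actions.
    l_imit = F.l1_loss(pred_wps, gt_wps) + F.l1_loss(pred_speed, gt_speed)
    # Auxiliary perception loss (segmentation shown; detection is analogous).
    l_aux = F.cross_entropy(pred_seg, gt_seg)
    return lambda_imit * l_imit + lambda_aux * l_aux

# Usage with dummy tensors: (B, N, 2) waypoints, (B, C, H, W) seg logits.
B, N, C, H, W = 2, 8, 5, 32, 32
loss = total_loss(torch.randn(B, N, 2), torch.randn(B, N, 2),
                  torch.randn(B, 1), torch.randn(B, 1),
                  torch.randn(B, C, H, W), torch.randint(0, C, (B, H, W)))
print(loss.item())
```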
5. Evaluation on Simulated Driving Benchmarks
TFv6 demonstrates statistically significant improvements on key closed-loop CARLA benchmarks:
| Task/Benchmark | Baselines / Reference | TFv6 Performance | Relative Gain (vs. TFv5) |
|---|---|---|---|
| Bench2Drive | TFv5: DS 83.5, SR 67.3; HiP-AD: DS 86.8, SR 69.1 | DS 95.2 (+11.7), SR 86.8 (+19.5) | +14% DS, +29% SR |
| Longest6 v2 | TFv5: DS 23, RC 70%; SimLingo: DS 22, RC 70%; HiP-AD: DS 7, RC 56% | DS 62 (+39), RC 91% (+21) | +169% DS, +30% RC |
| Town13 (Validation) | TFv5: DS 1.08, NDS 2.12; Expert: DS 36.3, NDS 58.5 | DS 2.65 (+1.57), NDS 4.04 (+1.92) | +145% DS, +91% NDS |
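As a worked example of the relative-gain column (all gains are computed against the TFv5 baseline): on Longest6 v2, DS rises from 23 to 62, i.e. $(62-23)/23 \approx +169\%$, and RC from 70% to 91%, i.e. $(91-70)/70 = +30\%$.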
Infraction breakdowns (Figure 1) indicate that visibility and uncertainty alignment reduce collision rates, while multi-point intent representation decreases target-fixation incidents, with a modest rise in route deviations.
6. Sim-to-Real Transfer and Curriculum Strategy
TFv6 adapts to real-world open-loop driving tasks (NAVSIM v1/v2, WOD-E2E) via the LTFv6 (latent TransFuser v6) variant, which drops the radar branch and substitutes a positional encoding for the LiDAR input. A curriculum of mixed synthetic (CARLA) and real-data co-training is employed (sketched after the epoch list below):
- Epochs 1–30: Mixed real + synthetic
- Epochs 31–120: Real only
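A minimal sketch of this curriculum, assuming standard PyTorch data utilities and stand-in datasets (the epoch split follows the text, the names and sizes are illustrative):

```python
# Mixed-then-real co-training curriculum: epochs 1-30 train on the union of
# real and synthetic (CARLA) data, epochs 31-120 on real data only.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

real_ds = TensorDataset(torch.randn(100, 3))       # stand-in for real driving data
synthetic_ds = TensorDataset(torch.randn(300, 3))  # stand-in for CARLA data

def build_loader(epoch, batch_size=32):
    # Select the training pool according to the curriculum phase.
    dataset = ConcatDataset([real_ds, synthetic_ds]) if epoch <= 30 else real_ds
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

for epoch in (1, 31):
    print(epoch, len(build_loader(epoch).dataset))  # 400 samples, then 100
```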
Empirical results show consistent gains (Table 6):
| Benchmark | Base LTFv6 | +LEAD Data | Expert/Human Upper Bound |
|---|---|---|---|
| NAVSIM v1 PDMS | 85.4 | 86.4 | 94.5 |
| NAVSIM v2 EPDMS | 28.3 | 31.4 | 51.3 (planner) |
| WOD-E2E RFS | 7.51 | 7.76 | 8.10 (human) |
This suggests that architectural and data-level interventions, as instantiated in TFv6 and LEAD, result in real-world gains under substantial domain shift.
7. Context and Significance
TFv6 sets a new state of the art in CARLA closed-loop driving while also demonstrating sim-to-real transferability. By minimizing Learner–Expert Asymmetry in visibility, uncertainty, and intent representation, TFv6 closes performance gaps endemic to sensor-based imitation learning. The framework bridges the data and architecture requirements between simulated and operational domains, supporting scalable deployment and cross-domain adaptation (Nguyen et al., 23 Dec 2025).