Latent TransFuser v6 (LTFv6) for Autonomous Driving
- The paper introduces LTFv6, an end-to-end policy that integrates camera, LiDAR, and radar data with a 3-point route snippet using a Transformer encoder for improved driving control.
- LTFv6 is designed to minimize learner–expert asymmetry through early route token injection and sensor fusion in a shared latent space, achieving state-of-the-art performance on CARLA benchmarks.
- Practical training optimizations, including behavior cloning and auxiliary perception losses, contribute to enhanced robustness and sim2real transfer, setting new performance records on CARLA closed-loop benchmarks.
Latent TransFuser v6 (LTFv6) is an end-to-end neural policy architecture designed for autonomous driving, with a specific focus on robust multi-sensor fusion, compact route conditioning, and addressing the limitations introduced by the asymmetry between privileged expert demonstrators and sensor-based student agents. Introduced as part of the LEAD framework, LTFv6 advances the state-of-the-art on public CARLA driving benchmarks and demonstrates promising generalization to real-world tasks via architecture, data, and training interventions that minimize learner–expert discrepancies (Nguyen et al., 23 Dec 2025).
1. Model Architecture and Sensor Fusion
LTFv6 processes synchronized multi-modal sensor streams—including cameras, LiDAR, and optionally radar—together with route information to generate low-level driving controls (steering, throttle, brake). Sensor modalities are encoded as fixed-size token sequences:
- Camera Inputs: Each of the C camera images is processed by a convolutional backbone (ResNet-34 or RegNetY-032); the resulting spatial feature map is flattened into a sequence of tokens per camera.
- LiDAR Inputs: Point clouds are voxelized/projected into a canonical bird's-eye-view (BEV) grid and processed by a 2D CNN, whose output feature map is likewise flattened into tokens.
- Radar Inputs: Each of the R radar units produces a bounded number of detections per frame, each mapped via an MLP to a D-dimensional token.
Tokens from all modalities are linearly projected into a shared D-dimensional latent space, and each token is augmented with a modality embedding and a 2D positional encoding to retain spatial structure.
Route conditioning departs from the conventional single-target-point scheme: instead of one goal point, a 3-point route snippet (previous, current, and future waypoints in normalized coordinates) is embedded via an MLP into three D-dimensional tokens that are prepended to the sensor tokens.
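The tokenization and fusion input described above can be sketched as follows. This is a minimal illustration of the token-assembly pattern, not the paper's implementation; all shapes, embedding schemes, and the toy linear projection are assumptions.

```python
import numpy as np

# Illustrative LTFv6-style token assembly (shapes and embeddings are assumptions).
D = 256  # shared latent dimension
rng = np.random.default_rng(0)

def project(x, d_out, rng):
    """Toy linear projection into the shared latent space."""
    w = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return x @ w

# Per-modality features after the backbones (flattened grids / detections).
cam_feats   = rng.standard_normal((3 * 8 * 8, 512))  # 3 cameras, 8x8 feature map each
lidar_feats = rng.standard_normal((16 * 16, 256))    # flattened BEV grid
radar_feats = rng.standard_normal((2 * 32, 64))      # 2 radars, 32 detections each

tokens = []
for m_id, feats in enumerate([cam_feats, lidar_feats, radar_feats]):
    t = project(feats, D, rng)
    modality_emb = np.zeros((1, D))
    modality_emb[0, m_id] = 1.0                       # toy modality embedding
    pos_emb = rng.standard_normal((t.shape[0], D)) * 0.02  # stand-in positional encoding
    tokens.append(t + modality_emb + pos_emb)

# 3-point route snippet (previous/current/future waypoints), embedded and prepended.
route_xy = np.array([[-5.0, 0.0], [0.0, 0.0], [5.0, 1.0]])
route_tokens = project(route_xy, D, rng)

sequence = np.concatenate([route_tokens] + tokens, axis=0)
print(sequence.shape)  # (3 + 192 + 256 + 64, 256)
```

The key structural point is that the three route tokens sit at the front of the sequence, so self-attention can mix route intent with perception from the first Transformer layer.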
A standard 6-layer Transformer encoder with hidden dimension D fuses all tokens, integrating route intent with perception from the earliest layers via self-attention. Output decoders utilize sets of learned queries:
- Lateral control: A small set of learned queries cross-attends to the fused tokens to predict a sequence of future waypoints in vehicle-centric coordinates.
- Longitudinal control: A single query decodes a scalar target speed.
A lightweight PID-style controller subsequently translates predicted waypoints and velocity into actionable low-level commands.
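A minimal sketch of how a PID-style controller might turn predicted waypoints and a target speed into low-level commands; the gains, the steering heuristic, and the waypoint choice are assumptions, not the paper's controller.

```python
import math

def waypoints_to_controls(waypoints, target_speed, current_speed,
                          k_steer=0.5, k_speed=0.3):
    """Toy controller: proportional steering toward a near-term waypoint,
    proportional throttle/brake on the speed error (assumed gains)."""
    # Vehicle frame: x forward, y left. Aim at the second waypoint if present.
    x, y = waypoints[min(1, len(waypoints) - 1)]
    heading_error = math.atan2(y, max(x, 1e-3))
    steer = max(-1.0, min(1.0, k_steer * heading_error))

    # Proportional longitudinal control: throttle if too slow, brake if too fast.
    speed_error = target_speed - current_speed
    throttle = max(0.0, min(1.0, k_speed * speed_error))
    brake = max(0.0, min(1.0, -k_speed * speed_error))
    return steer, throttle, brake

steer, throttle, brake = waypoints_to_controls(
    waypoints=[(2.0, 0.1), (4.0, 0.4)], target_speed=6.0, current_speed=4.0)
print(steer, throttle, brake)  # slight left steer, moderate throttle, no brake
```

A real deployment would add integral/derivative terms and rate limits; the point here is only the interface: waypoints plus target speed in, (steer, throttle, brake) out.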
2. Training Objectives and Optimization
LTFv6 employs pure behavior cloning on paired sensor observations and expert actions, with auxiliary perception losses when ground truth is available:
- Imitation Loss: The mean-squared error between predicted and expert controls (steering and throttle/brake), $\mathcal{L}_{\text{imit}} = \lVert \hat{a}_t - a^{*}_t \rVert_2^2$.
- Auxiliary Perception Losses (optional):
- Detection (bounding-box regression) and segmentation (cross-entropy) losses when CARLA synthetic labels are available.
- Asymmetry Regularizer: A consistency regularizer enforces robustness by penalizing the discrepancy between standard expert actions and those recomputed with masked-out state information, $\mathcal{L}_{\text{cons}} = \lVert \pi_E(s) - \pi_E(m(s)) \rVert_2^2$, where the masking operator $m(\cdot)$ simulates reduced expert observability.
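The consistency regularizer can be illustrated with a toy stand-in expert; the expert policy, state layout, and masking operator below are invented for illustration and are not the paper's implementation.

```python
import numpy as np

def expert_policy(state):
    """Toy expert: action is the mean feature of the actors it can see
    (rows containing NaN are treated as unobserved and dropped)."""
    visible = state[~np.isnan(state).any(axis=1)]
    return visible.mean(axis=0) if len(visible) else np.zeros(state.shape[1])

def mask_state(state, visible_mask):
    """Simulate reduced observability by hiding occluded actors."""
    masked = state.copy()
    masked[~visible_mask] = np.nan
    return masked

def consistency_loss(state, visible_mask):
    """Penalize the gap between full-state and masked-state expert actions."""
    a_full = expert_policy(state)
    a_masked = expert_policy(mask_state(state, visible_mask))
    return float(np.sum((a_full - a_masked) ** 2))

state = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])  # one row per actor
loss = consistency_loss(state, np.array([True, True, False]))
print(loss)  # 2.0: hiding the third actor shifts the toy expert's action
```

A low value of this loss means the expert's behavior does not hinge on state the student cannot perceive, which is exactly the alignment the regularizer promotes.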
Optimization uses AdamW with weight decay, a cosine learning-rate decay schedule, batch size 64 (sequences of length 4 s at 10 Hz), and mixed-precision training on 4×A100 GPUs for roughly one week (~200 epochs on 73 hours of data).
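The cosine decay schedule mentioned above has a simple closed form; the peak learning rate and step count below are placeholders, not the paper's values.

```python
import math

def cosine_lr(step, total_steps, peak_lr):
    """Cosine learning-rate decay from peak_lr down to 0 over total_steps."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Placeholder schedule: peak LR 1e-3 over 100 steps.
lrs = [cosine_lr(s, 100, 1e-3) for s in range(101)]
print(lrs[0], lrs[50], lrs[100])  # peak, half of peak, ~0
```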
3. Minimizing Learner–Expert Asymmetry
Addressing "learner–expert asymmetry" is central: real-time expert policies in CARLA simulations possess privileged information—including bird's-eye visibility and perfect actor state estimates—that student policies (restricted to sensor data) cannot access.
- Visibility Asymmetry: During data collection, a dynamic actor is only considered by the expert policy if it is visible within the student's camera frustum and not occluded. Traffic lights and signs must similarly project into at least one camera image.
- Uncertainty Asymmetry: The expert's access to perfect velocities/accelerations is neutralized by inflating actor bounding boxes at unprotected turns, scaling down target speeds in adverse conditions, and enforcing safety braking on conservative collision predictions.
- Intent Asymmetry and Target-Point Bias: LTFv6 supplies a 3-point waypoint snippet as input tokens to the Transformer, eliminating the late-stage GRU bottleneck and early "goal-fixation" pathologies associated with single-point conditioning.
Together, these interventions drive tighter alignment between the states accessible to expert and student, closing the sim2real gap endemic in standard imitation pipelines.
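The visibility filter above amounts to a camera-frustum test. A minimal pinhole-projection sketch follows; the intrinsics and the near-plane threshold are illustrative assumptions, and a full implementation would also handle occlusion.

```python
def visible_in_camera(point_cam, fx, fy, cx, cy, width, height):
    """Return True if a point (in the camera frame: x right, y down,
    z forward) projects inside the image via a pinhole model."""
    x, y, z = point_cam
    if z <= 0.1:  # behind the camera or closer than an assumed near plane
        return False
    u = fx * x / z + cx
    v = fy * y / z + cy
    return 0 <= u < width and 0 <= v < height

# Illustrative 640x480 camera with a 90-degree horizontal FOV (fx = width / 2).
fx = fy = 320.0
cx, cy = 320.0, 240.0
ahead  = visible_in_camera((0.0, 0.0, 10.0), fx, fy, cx, cy, 640, 480)
behind = visible_in_camera((0.0, 0.0, -5.0), fx, fy, cx, cy, 640, 480)
print(ahead, behind)  # True False
```

During data collection, an actor failing this test for every student camera would simply be dropped from the expert's state, keeping the expert's effective observability matched to the student's.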
4. Experimental Protocols and Benchmarks
LTFv6 is evaluated on CARLA 0.9.15 with the Leaderboard 2.0 protocol, using the following scenarios:
- Longest6 v2: 36 routes (~2 km each, Towns 1–6).
- Bench2Drive (B2D): 220 short routes (~150 m, all 12 towns).
- Town13 Validation: 20 long routes (~12.4 km) on an entirely unseen town.
Standard driving metrics include Route Completion (RC), Infraction Score (IS), Driving Score (DS = RC × IS), Normalized Driving Score (NDS), and Success Rate (SR).
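The composite metric DS = RC × IS can be computed directly; note that leaderboard-style protocols typically compute DS per route and then average, so aggregate RC and IS reported for a benchmark need not multiply exactly to the aggregate DS. The values below are illustrative, not benchmark results.

```python
def driving_score(route_completion, infraction_score):
    """Driving Score as defined in the text: DS = RC x IS,
    with RC in percent and IS a multiplier in [0, 1]."""
    return route_completion * infraction_score

# Illustrative single-route example: 90% completion with an infraction
# multiplier of 0.8 yields a driving score of 72.
score = driving_score(90.0, 0.8)
print(score)
```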
| Method | Cameras | LiDAR | Radar | B2D DS | SR | Longest6 DS | RC |
|---|---|---|---|---|---|---|---|
| TFv5 (RegNet-32) | 1× | 51 | 55 | 83.5 | 67.3 | 23 | 70 |
| TFv6 (6× cam) | 6× | 55 | 55 | 91.6 | 79.5 | 43 | 85 |
| TFv6 (best config) | 3× | 51 | 51 | 95.2 | 86.8 | 62 | 91 |
On Town13, TFv6 achieves:
- RC: 39.7
- IS: 0.28
- DS: 2.65
- NDS: 4.04
Key improvements over TFv5:
- +39 DS on Longest6 v2, +21 RC
- +8 DS on B2D
- +1.9 NDS on Town13
These constitute state-of-the-art performance on all CARLA closed-loop benchmarks at the time of publication (Nguyen et al., 23 Dec 2025).
5. Real-World Open-Loop Transfer
The camera-only variant of LTFv6 demonstrates generalization on real-world perception-driven driving benchmarks:
| Method | NavSim v1 (PDMS) | NavSim v2 (EPDMS) | WOD-E2E (RFS) |
|---|---|---|---|
| LTF v1 | 83.8 | 23.1 | — |
| LTF v6 | 85.4 | 28.3 | 7.51 |
| LTF v6 + LEAD pretrain | 86.4 | 31.4 | 7.76 |
Performance consistently improves across NavSim and Waymo datasets when using LEAD-aligned supervision, suggesting that LTFv6’s interventions for sensor–expert alignment benefit beyond simulation.
6. Summary and Significance
LTFv6 unifies multi-sensor fusion and route intent inference in a Transformer-based policy trained under tightly aligned expert supervision. Architectural advances—early route token introduction, elimination of the GRU bottleneck, and conservative closed-loop regularization—address the primary visibility, uncertainty, and intent asymmetries hampering prior sensorimotor policies.
The empirical results establish new CARLA closed-loop records and validate sim2real robustness improvements, marking LTFv6 as a reference model for sensor, intent, and expert–student alignment in end-to-end driving (Nguyen et al., 23 Dec 2025).