
Latent TransFuser v6 (LTFv6) for Autonomous Driving

Updated 25 February 2026
  • The paper introduces LTFv6, an end-to-end policy that integrates camera, LiDAR, and radar data with a 3-point route snippet using a Transformer encoder for improved driving control.
  • LTFv6 is designed to minimize learner–expert asymmetry through early route token injection and sensor fusion in a shared latent space, achieving state-of-the-art performance on CARLA benchmarks.
  • Practical training optimizations, including behavior cloning and auxiliary perception losses, contribute to enhanced robustness and sim2real transfer, setting new performance records in autonomous driving.

Latent TransFuser v6 (LTFv6) is an end-to-end neural policy architecture designed for autonomous driving, with a specific focus on robust multi-sensor fusion, compact route conditioning, and addressing the limitations introduced by the asymmetry between privileged expert demonstrators and sensor-based student agents. Introduced as part of the LEAD framework, LTFv6 advances the state-of-the-art on public CARLA driving benchmarks and demonstrates promising generalization to real-world tasks via architecture, data, and training interventions that minimize learner–expert discrepancies (Nguyen et al., 23 Dec 2025).

1. Model Architecture and Sensor Fusion

LTFv6 processes synchronized multi-modal sensor streams—including cameras, LiDAR, and optionally radar—together with route information to generate low-level driving controls (steering, throttle, brake). Sensor modalities are encoded as fixed-size token sequences:

  • Camera Inputs: Each of $C$ images $I_c \in \mathbb{R}^{3 \times H \times W}$ is processed via a convolutional backbone (ResNet-34 or RegNetY-032) to produce feature maps $F^{cam}_c \in \mathbb{R}^{D \times H' \times W'}$. These are flattened into $T_{cam}$ tokens per camera.
  • LiDAR Inputs: Point clouds are voxelized/projected into a canonical $H_b \times W_b$ BEV grid, processed by a 2D CNN yielding $F^{lidar} \in \mathbb{R}^{D \times H_b \times W_b}$ and flattened to $T_{lidar}$ tokens.
  • Radar Inputs: Each of $R$ radar units produces up to $N_{det}$ detections per frame, each mapped via an MLP to a $D$-dimensional token, totaling $T_{radar}$ tokens.

Tokens across modalities are linearly projected into a shared latent space $\mathbb{R}^D$, each augmented with a modality embedding $e^{mod}_m$ and a 2D positional encoding $p^{pos}$ for spatial structure retention.

Route conditioning departs from the conventional single target point; instead, a 3-point route snippet—previous, current, and future waypoints, normalized to $[-1, +1]^2$—is embedded via an MLP into three $D$-dimensional tokens, prepended to the sensor tokens.
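
The token construction described above can be sketched end to end. Everything below—token counts, latent width, the stand-in projections, MLP, and positional encoding—is an illustrative assumption, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared latent width (illustrative; the paper's value is not restated here)

def project(tokens, W):
    """Linear projection of per-modality features into the shared latent space R^D."""
    return tokens @ W

def pos_encoding(h, w, d):
    """Stand-in positional encoding: sinusoids over a flattened h x w grid."""
    pos = np.arange(h * w)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
    return np.concatenate([np.sin(pos * freqs), np.cos(pos * freqs)], axis=1)  # (h*w, d)

# Hypothetical features for one camera, the LiDAR BEV grid, and radar detections.
cam_feats   = rng.normal(size=(8 * 8, 128))   # flattened H' x W' feature map
lidar_feats = rng.normal(size=(16 * 16, 96))  # flattened H_b x W_b BEV grid
radar_feats = rng.normal(size=(12, 32))       # N_det = 12 detections

mod_embed = rng.normal(size=(4, D))  # learned modality embeddings: cam, lidar, radar, route

cam_tok   = project(cam_feats,   rng.normal(size=(128, D))) + mod_embed[0] + pos_encoding(8, 8, D)
lidar_tok = project(lidar_feats, rng.normal(size=(96, D)))  + mod_embed[1] + pos_encoding(16, 16, D)
radar_tok = project(radar_feats, rng.normal(size=(32, D)))  + mod_embed[2]  # detections: no grid encoding

# 3-point route snippet (previous, current, future waypoint), normalized to [-1, 1]^2,
# embedded by a stand-in MLP into three D-dim tokens and prepended to the sensor tokens.
route = np.array([[-0.1, 0.0], [0.0, 0.2], [0.1, 0.5]])
W1, W2 = rng.normal(size=(2, 32)), rng.normal(size=(32, D))
route_tok = np.tanh(route @ W1) @ W2 + mod_embed[3]

sequence = np.concatenate([route_tok, cam_tok, lidar_tok, radar_tok], axis=0)
print(sequence.shape)  # (3 + T_cam + T_lidar + T_radar, D)
```

The resulting sequence is what the shared Transformer encoder would consume, with route intent available from the first self-attention layer.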

A standard 6-layer Transformer encoder with hidden dimension $D$ fuses all tokens, integrating route intent with perception from the earliest layers via self-attention. Output decoders use sets of learned queries:

  • Lateral control: $L$ queries (typically $L = 12$) cross-attend to fused tokens to predict a sequence of waypoints in vehicle-centric coordinates.
  • Longitudinal control: A single query decodes to a scalar target speed $v_{pred}$.

A lightweight PID-style controller subsequently translates predicted waypoints and velocity into actionable low-level commands.
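
Since the summary does not specify the controller's gains or lookahead rule, the following is a minimal stand-in for the PID-style stage, assuming proportional control on heading error and speed error:

```python
import numpy as np

def pid_style_controller(waypoints, v_pred, v_now, kp_lat=1.0, kp_lon=0.5):
    """Minimal stand-in for the low-level controller; the real gains and
    lookahead logic of LTFv6 are not specified in the summary above.

    waypoints: (L, 2) vehicle-centric xy predictions, x forward, y left.
    Returns (steer, throttle, brake), each clipped to its usual range.
    """
    # Lateral: steer toward the heading of an early lookahead waypoint.
    lookahead = waypoints[min(2, len(waypoints) - 1)]
    heading_err = np.arctan2(lookahead[1], max(lookahead[0], 1e-3))
    steer = float(np.clip(kp_lat * heading_err, -1.0, 1.0))

    # Longitudinal: proportional control on the speed error v_pred - v_now.
    err = v_pred - v_now
    throttle = float(np.clip(kp_lon * err, 0.0, 1.0))
    brake = float(np.clip(-kp_lon * err, 0.0, 1.0))
    return steer, throttle, brake

wps = np.array([[1.0, 0.05], [2.0, 0.15], [3.0, 0.30]])  # L = 3 for brevity (the paper uses L = 12)
print(pid_style_controller(wps, v_pred=6.0, v_now=4.0))  # gentle left steer, full throttle, no brake
```

The key property this preserves is the division of labor: the network predicts geometry and speed, while a simple feedback controller handles actuation.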

2. Training Objectives and Optimization

LTFv6 employs pure behavior cloning on paired sensor and expert-action data $(o_t, a^*_t)$, with auxiliary perception losses when ground truth is available:

  • Imitation Loss: The mean squared error between predicted $(\hat\delta_t, \hat u_t)$ and expert $(\delta^*_t, u^*_t)$ steering and throttle/brake:

$$\mathcal{L}_{\mathrm{IL}} = \frac{1}{T} \sum_{t=1}^{T} \left( \|\hat\delta_t - \delta^*_t\|_2^2 + \|\hat u_t - u^*_t\|_2^2 \right)$$

  • Auxiliary Perception Losses (optional):
    • Detection ($L_1$ bounding-box regression) and segmentation (cross-entropy) losses when CARLA synthetic labels are available.
  • Asymmetry Regularizer: A consistency term enforces robustness by penalizing the discrepancy between standard expert actions and those recomputed with masked-out state information:

$$\mathcal{L}_{\mathrm{align}} = \frac{1}{T} \sum_t \|a^*_t - a^{*\,\mathrm{masked}}_t\|_2^2,$$

where $a^{*\,\mathrm{masked}}_t$ simulates reduced expert observability.
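
As a minimal sketch, the imitation and alignment terms above can be computed as follows. The 0.1 weight on the alignment term is an illustrative assumption; the paper's loss weights are not restated in this summary:

```python
import numpy as np

def imitation_loss(pred_steer, pred_ctrl, exp_steer, exp_ctrl):
    """L_IL: mean over T of squared steering plus throttle/brake errors."""
    return float(np.mean((pred_steer - exp_steer) ** 2 + (pred_ctrl - exp_ctrl) ** 2))

def align_loss(a_star, a_star_masked):
    """L_align: mean squared discrepancy between full- and masked-observability
    expert actions, averaged over the T timesteps."""
    return float(np.mean(np.sum((a_star - a_star_masked) ** 2, axis=-1)))

T = 4
rng = np.random.default_rng(1)
pred_s, pred_u = rng.normal(size=T), rng.normal(size=T)
exp_s, exp_u = pred_s + 0.1, pred_u - 0.1  # constant 0.1 error -> L_IL = 0.02
a, a_masked = rng.normal(size=(T, 2)), rng.normal(size=(T, 2))

total = imitation_loss(pred_s, pred_u, exp_s, exp_u) + 0.1 * align_loss(a, a_masked)
print(total)
```

The auxiliary detection and segmentation losses would be added to `total` in the same way when synthetic labels are available.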

Optimization uses AdamW (initial learning rate $3 \times 10^{-4}$, weight decay $10^{-2}$), batch size 64 (sequences of length 4 s at 10 Hz), cosine learning-rate decay, and mixed-precision training on 4×A100 GPUs for roughly one week (~200 epochs on 73 hours of data).
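
A cosine-decay schedule matching the stated initial learning rate can be sketched as follows (whether LTFv6 uses warmup or a nonzero floor is not stated here):

```python
import math

def cosine_lr(step, total_steps, lr0=3e-4, lr_min=0.0):
    """Cosine decay from lr0 to lr_min over total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0, 1000))     # 3e-4 at the start
print(cosine_lr(500, 1000))   # half the initial LR at the midpoint
print(cosine_lr(1000, 1000))  # decays to the floor (here 0) at the end
```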

3. Minimizing Learner–Expert Asymmetry

Addressing "learner–expert asymmetry" is central: real-time expert policies in CARLA simulations possess privileged information—including bird's-eye visibility and perfect actor state estimates—that student policies (restricted to sensor data) cannot access.

  • Visibility Asymmetry: During data collection, a dynamic actor is considered by the expert policy only if it lies within the student's camera frustum and is not occluded. Traffic lights and signs must likewise project into at least one camera.
  • Uncertainty Asymmetry: Experts' access to perfect velocities/accelerations is neutralized by inflating bounding boxes ($\alpha > 1$) at unprotected turns, scaling target speeds in adverse conditions ($\beta \in [0.6, 0.9]$), and enforcing safety braking on conservative collision predictions.
  • Intent Asymmetry and Target-Point Bias: LTFv6 supplies a 3-point waypoint snippet $[p_{t-1}, p_t, p_{t+1}]$ as input tokens to the Transformer, eliminating the late-stage GRU bottleneck and early "goal-fixation" pathologies associated with single-point conditioning.

Together, these interventions drive tighter alignment between the states accessible to expert and student, closing the sim2real gap endemic in standard imitation pipelines.
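
The camera-frustum visibility filter can be sketched with a standard pinhole projection. The intrinsics and pose below are illustrative, and the paper's occlusion test (whatever form it takes) is omitted:

```python
import numpy as np

def in_camera_frustum(p_world, world_to_cam, K, img_w, img_h):
    """Return True if a 3D world point projects inside the image of a pinhole
    camera. Occlusion checking is intentionally omitted in this sketch.

    p_world: (3,) point; world_to_cam: 4x4 rigid transform; K: 3x3 intrinsics.
    """
    p_cam = (world_to_cam @ np.append(p_world, 1.0))[:3]
    if p_cam[2] <= 0:  # behind the image plane
        return False
    uvw = K @ p_cam
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    return bool(0 <= u < img_w and 0 <= v < img_h)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
eye = np.eye(4)  # camera at the origin, looking down +z
print(in_camera_frustum(np.array([0.0, 0.0, 10.0]), eye, K, 640, 480))  # True: straight ahead
print(in_camera_frustum(np.array([0.0, 0.0, -5.0]), eye, K, 640, 480))  # False: behind the camera
```

An actor passing this check for at least one camera (and surviving the occlusion test) would be retained in the expert's state; otherwise it is dropped, so the expert never reacts to objects the student cannot see.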

4. Experimental Protocols and Benchmarks

LTFv6 is evaluated on CARLA 0.9.15 with the Leaderboard 2.0 protocol, using the following scenarios:

  • Longest6 v2: 36 routes (~2 km each, Towns 1–6).
  • Bench2Drive (B2D): 220 short routes (~150 m, all 12 towns).
  • Town13 Validation: 20 long routes (~12.4 km) on an entirely unseen town.

Standard driving metrics include Route Completion (RC), Infraction Score (IS), Driving Score (DS = RC × IS), Normalized Driving Score (NDS), and Success Rate (SR).
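
The composite DS = RC × IS metric can be sketched as a multiplicative penalty product. The penalty factors below are illustrative placeholders, not the official Leaderboard values:

```python
def driving_score(rc, infractions, penalties):
    """Driving Score as RC x IS, where IS is a product of per-infraction
    penalty factors raised to their occurrence counts."""
    is_score = 1.0
    for kind, count in infractions.items():
        is_score *= penalties.get(kind, 1.0) ** count
    return rc * is_score, is_score

# Placeholder penalty factors (assumed for illustration only).
penalties = {"collision_vehicle": 0.6, "red_light": 0.7}
ds, is_ = driving_score(rc=80.0, infractions={"red_light": 1}, penalties=penalties)
print(ds, is_)
```

A route completed fully but with repeated infractions can thus score lower than a shorter, clean run, which is why RC and DS are reported separately above.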

| Method | Cameras | LiDAR | Radar | B2D DS | B2D SR | Longest6 DS | Longest6 RC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TFv5 (RegNet-32) | 51 | 55 | | 83.5 | 67.3 | 23 | 70 |
| TFv6 (6× cam) | 55 | 55 | | 91.6 | 79.5 | 43 | 85 |
| TFv6 (best config) | 51 | 51 | | 95.2 | 86.8 | 62 | 91 |

On Town13, TFv6 achieves:

  • RC: 39.7
  • IS: 0.28
  • DS: 2.65
  • NDS: 4.04

Key improvements over TFv5:

  • +39 DS on Longest6 v2, +21 RC
  • +8 DS on B2D
  • +1.9 NDS on Town13

These constitute state-of-the-art performance on all CARLA closed-loop benchmarks at the time of publication (Nguyen et al., 23 Dec 2025).

5. Real-World Open-Loop Transfer

The camera-only variant of LTFv6 demonstrates generalization on real-world perception-driven driving benchmarks:

| Method | NavSim v1 (PDMS) | NavSim v2 (EPDMS) | WOD-E2E (RFS) |
| --- | --- | --- | --- |
| LTF v1 | 83.8 | 23.1 | — |
| LTF v6 | 85.4 | 28.3 | 7.51 |
| LTF v6 + LEAD pretrain | 86.4 | 31.4 | 7.76 |

Performance consistently improves across NavSim and Waymo datasets when using LEAD-aligned supervision, suggesting that LTFv6’s interventions for sensor–expert alignment benefit beyond simulation.

6. Summary and Significance

LTFv6 unifies multi-sensor fusion and route intent inference in a Transformer-based policy trained under tightly aligned expert supervision. Architectural advances—early route token introduction, elimination of the GRU bottleneck, and conservative closed-loop regularization—address the primary visibility, uncertainty, and intent asymmetries hampering prior sensorimotor policies.

The empirical results establish new CARLA closed-loop records and validate sim2real robustness improvements, marking LTFv6 as a reference model for sensor, intent, and expert–student alignment in end-to-end driving (Nguyen et al., 23 Dec 2025).
