DrivoR: Transformer for Autonomous Driving

Updated 15 January 2026
  • DrivoR is a transformer-based autonomous driving architecture that leverages pretrained Vision Transformers and camera-aware register tokens for effective multi-camera perception.
  • It decouples trajectory generation and scoring using lightweight decoder stacks to produce and evaluate candidate trajectories with interpretable sub-scores.
  • Benchmark evaluations on NAVSIM and HUGSIM highlight its competitive performance and computational efficiency with a minimalist design.

DrivoR is a transformer-based, end-to-end autonomous driving architecture that leverages pretrained Vision Transformers (ViTs) and camera-aware register tokens to achieve accurate, efficient, and behavior-conditioned driving. Its design combines multi-camera perception, compact feature compression, decoupled generation and evaluation of candidate trajectories, and interpretable sub-score prediction for decision-making. DrivoR performs on par with or better than contemporary baselines across the NAVSIM-v1, NAVSIM-v2, and HUGSIM closed-loop evaluation benchmarks, while maintaining a minimalist architectural footprint and high computational efficiency (Kirby et al., 8 Jan 2026).

1. Architecture Overview

DrivoR processes raw sensory inputs from four surround cameras (front, front-left, front-right, rear), utilizing a pretrained ViT (typically ViT-S or larger). Each input image is divided into $16 \times 16$ patches; these patches are embedded and, in parallel, supplemented with $R$ learnable camera-specific “register” tokens $Q_\textrm{reg}^c \in \mathbb{R}^{R \times D_{\textrm{ViT}}}$. The entire set of image and register tokens undergoes $L$ transformer layers, after which only the register tokens are retained for downstream modules; patch tokens are discarded.

The architecture is composed of two primary lightweight transformer decoder stacks:

  • Trajectory Generator: Proposes $K$ future action trajectories given the compact, multi-camera scene encoding.
  • Scoring Decoder: Assigns interpretable, aspect-specific sub-scores—learned to mimic an external oracle—to each candidate trajectory via cross-modal attention.

2. Register Tokens and Scene Representation

DrivoR’s camera-aware register tokens provide a significant reduction in sequence length and computational overhead. For each camera view:

  • Patch tokens $X_{\textrm{patch}}^c \in \mathbb{R}^{N_p \times D_{\textrm{ViT}}}$ and register tokens $Q_\textrm{reg}^{c(L)} \in \mathbb{R}^{R \times D_{\textrm{ViT}}}$ are produced by the final ViT layer.
  • The input to the planning stack is $S = \textrm{concat}_{c=1..4}\, Q_\textrm{reg}^{c(L)} \in \mathbb{R}^{(4R) \times D_{\textrm{ViT}}}$.
  • Register tokens are camera-specific, allowing the system to learn specialization by viewpoint.

Compared with spatial average pooling or cross-attention-based fusion, register tokens permit focused extraction of planning-relevant information, compressing $\sim$4K tokens per camera to $R$ (e.g., 16) and expediting all downstream computation.
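The compression mechanism above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the embedding dimension, patch count, register count, and encoder depth are illustrative assumptions, and a generic `nn.TransformerEncoder` stands in for the pretrained ViT.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (not the paper's exact values).
D, R, N_PATCH, N_CAM = 384, 16, 256, 4  # embed dim, registers/camera, patches/camera, cameras

# One learnable register bank per camera, so each view can specialize.
registers = nn.ParameterList(
    nn.Parameter(torch.randn(R, D) * 0.02) for _ in range(N_CAM)
)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=6, batch_first=True),
    num_layers=2,
)

def encode_scene(patch_tokens_per_cam):
    """patch_tokens_per_cam: list of N_CAM tensors, each (B, N_PATCH, D)."""
    reg_out = []
    for c, patches in enumerate(patch_tokens_per_cam):
        B = patches.shape[0]
        reg = registers[c].unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([patches, reg], dim=1)   # (B, N_PATCH + R, D)
        x = encoder(x)                         # registers attend to all patches
        reg_out.append(x[:, N_PATCH:])         # keep register tokens, drop patches
    return torch.cat(reg_out, dim=1)           # S: (B, 4R, D)

tokens = [torch.randn(2, N_PATCH, D) for _ in range(N_CAM)]
S = encode_scene(tokens)
print(S.shape)  # torch.Size([2, 64, 384])
```

The key property is visible in the shapes: the planning stack sees only $4R$ tokens (here 64) regardless of image resolution, since the patch tokens never leave the encoder.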

3. Candidate Trajectory Generation and Scoring

3.1 Trajectory Generation

  • The trajectory generator decoder accepts:
    • Concatenated register tokens (scene encoding) SS.
    • $K$ learnable trajectory queries $Q_{\textrm{traj}} \in \mathbb{R}^{K \times D_{\textrm{traj}}}$.
    • Ego vehicle state (pose, velocity, acceleration, command), which is embedded and added to each query.
  • Four attention blocks combine trajectory self-attention with cross-attention to the scene.
  • Output: $K$ tokens $T_i$, each mapped via an MLP to a trajectory $\tau_i \in \mathbb{R}^{n_p \times 3}$ over $n_p$ future steps (position, heading).
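A minimal sketch of this generator, assuming illustrative dimensions and using standard PyTorch decoder layers in place of the paper's exact blocks (the ego-state feature layout and head widths are assumptions):

```python
import torch
import torch.nn as nn

D, K, N_P, EGO_DIM = 384, 8, 10, 8  # token dim, candidates, future steps, ego features

traj_queries = nn.Parameter(torch.randn(K, D) * 0.02)  # K learnable queries
ego_embed = nn.Linear(EGO_DIM, D)                      # embed ego state, add to queries
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D, nhead=6, batch_first=True),
    num_layers=4,  # four attention blocks, as described in the text
)
traj_head = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, N_P * 3))

def generate(S, ego_state):
    """S: (B, 4R, D) scene tokens; ego_state: (B, EGO_DIM). Returns (B, K, N_P, 3)."""
    B = S.shape[0]
    # Each query gets the same ego-state embedding added before decoding.
    q = traj_queries.unsqueeze(0).expand(B, -1, -1) + ego_embed(ego_state).unsqueeze(1)
    T = decoder(tgt=q, memory=S)            # self-attention over queries + cross-attention to scene
    return traj_head(T).view(B, K, N_P, 3)  # (x, y, heading) per future step

trajs = generate(torch.randn(2, 64, D), torch.randn(2, EGO_DIM))
print(trajs.shape)  # torch.Size([2, 8, 10, 3])
```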

The learning objective is a minimum-of-N (MoN) or winner-take-all (WTA) L1 regression to the human reference trajectory:

$\mathcal{L}_{\textrm{traj}} = \min_{i=1..K} \|\tau_i - \hat{\tau}\|_1 \,.$
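The winner-take-all objective is simple to state in code: only the candidate closest to the human reference contributes to the loss. A NumPy sketch with illustrative shapes:

```python
import numpy as np

def wta_l1_loss(candidates, reference):
    """Minimum-of-N (winner-take-all) L1 loss.

    candidates: (K, n_p, 3) proposed trajectories; reference: (n_p, 3) human trajectory.
    """
    per_candidate = np.abs(candidates - reference).sum(axis=(1, 2))  # L1 per candidate
    return per_candidate.min()  # only the best candidate is penalized

ref = np.zeros((10, 3))
cands = np.stack([np.full((10, 3), 0.125),  # close candidate
                  np.full((10, 3), 1.0)])   # far candidate
print(wta_l1_loss(cands, ref))  # 3.75  (30 entries x 0.125 from the best candidate)
```

Because the minimum routes gradient to a single candidate, different queries are free to specialize to different driving modes rather than collapsing to an average trajectory.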

3.2 Scoring Decoder

  • Each candidate trajectory $\tau_i$ is re-embedded via an MLP to provide the input query $q_i^{sc}$.
  • The scoring decoder processes these queries with the same scene tokens and self-/cross-attention structure.
  • Six independent MLP heads predict sub-scores $G_{\theta_c}(\tau_i) \in [0,1]$ corresponding to safety, comfort, efficiency, and traffic compliance metrics (e.g., NC, DAC, TTC, EP, Comf).
  • The scoring loss is the binary cross-entropy with oracle-provided sub-scores.
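The scoring path can be sketched as follows, again with illustrative dimensions and generic PyTorch modules standing in for the paper's exact blocks (the number of decoder layers and head widths are assumptions):

```python
import torch
import torch.nn as nn

D, K, N_P, N_HEADS = 384, 8, 10, 6  # token dim, candidates, future steps, sub-scores

# Re-embed each flattened candidate trajectory into a scoring query.
traj_embed = nn.Sequential(nn.Linear(N_P * 3, D), nn.ReLU(), nn.Linear(D, D))
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D, nhead=6, batch_first=True),
    num_layers=2,
)
# Six independent MLP heads, each emitting one sub-score in [0, 1].
heads = nn.ModuleList(
    nn.Sequential(nn.Linear(D, 96), nn.ReLU(), nn.Linear(96, 1), nn.Sigmoid())
    for _ in range(N_HEADS)
)

def score(S, trajs):
    """S: (B, 4R, D) scene tokens; trajs: (B, K, N_P, 3). Returns (B, K, N_HEADS)."""
    B = trajs.shape[0]
    q = traj_embed(trajs.reshape(B, K, -1))            # one query per candidate
    h = decoder(tgt=q, memory=S)                       # cross-modal attention to the scene
    return torch.cat([head(h) for head in heads], dim=-1)

sub = score(torch.randn(2, 64, D), torch.randn(2, K, N_P, 3))
# Training target: binary cross-entropy against oracle-provided sub-scores.
loss = nn.functional.binary_cross_entropy(sub, torch.rand_like(sub))
print(sub.shape)  # torch.Size([2, 8, 6])
```

Note that this stack shares no queries or weights with the generator; the text below reports why that separation matters.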

Structural disentanglement between trajectory generation and scoring (as opposed to sharing queries/weights) is empirically critical: combining both into a single decoder degrades the primary driving metric (PDMS) by 5–6 points.

4. Training, Behavior Conditioning, and Inference

The total loss combines the trajectory regression and scoring components (with $\lambda_s = 1$ in practice):

$\mathcal{L} = \mathcal{L}_{\textrm{traj}} + \lambda_s \mathcal{L}_{\textrm{score}}$

At inference, per-trajectory sub-scores are synthesized into a global meta-score using user-configurable weights $\lambda_c^{\textrm{test}}$:

  • PDMS-style (product of penalties, sum of behavior metrics) or simple weighted sum.
  • The selected trajectory is $\tau^* = \arg\max_{i} \mathrm{Score}(\tau_i)$.

This reward re-weighting at inference time enables flexible behavior conditioning—aggressive, defensive, or otherwise—without retraining.
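As a small worked example of inference-time behavior conditioning, the sketch below combines per-candidate sub-scores with a simple weighted sum (the weight values and two-metric setup are illustrative; a PDMS-style product of penalty terms slots in the same way):

```python
import numpy as np

def select_trajectory(sub_scores, weights):
    """sub_scores: (K, C) sub-scores in [0, 1]; weights: (C,) user-chosen lambdas.

    Returns the index of the trajectory maximizing the weighted meta-score.
    """
    meta = sub_scores @ weights  # Score(tau_i) = sum_c lambda_c * G_c(tau_i)
    return int(np.argmax(meta))

# Two candidates scored on (safety, efficiency):
sub = np.array([[0.9, 0.2],    # candidate 0: safe but slow
                [0.6, 0.9]])   # candidate 1: less safe, more efficient
defensive  = np.array([1.0, 0.1])  # weight safety heavily
aggressive = np.array([0.2, 1.0])  # weight efficiency heavily

print(select_trajectory(sub, defensive), select_trajectory(sub, aggressive))  # 0 1
```

The same trained model picks a different trajectory under the two weightings, which is exactly the retraining-free behavior conditioning described above.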

5. Benchmark Evaluation and Performance

DrivoR matches or exceeds strong neural and classical baselines across a range of closed-loop and open-loop end-to-end driving benchmarks:

| Benchmark | Metric | DrivoR (ViT-S) | Best Baselines |
|---|---|---|---|
| NAVSIM-v1 | PDMS ↑ | 93.7 | DriveSuprim 93.5; Human 94.8 |
| NAVSIM-v2 | EPDMS ↑ | 48.3 | ZTRS (ViT-99) 48.1 |
| HUGSIM | RC / HD-Score | 49.8 / 35.7 | UniAD 45.9 / 32.7 |
| NAVSIM-v2 | Latency (ms/forward, A100) ↓ | 110 | GTRS-Dense 400 |
| NAVSIM-v2 | Peak memory (GB) ↓ | 0.5 | GTRS-Dense 1.6 |

DrivoR achieves state-of-the-art performance on navigation and closed-loop metrics while operating with at most 40 million parameters and delivering 3–4× faster throughput than competitive models (Kirby et al., 8 Jan 2026).

6. Design Principles and Comparative Analysis

DrivoR’s empirical successes derive from several core principles:

  • Pure-transformer, query-based computation allows flexible conditioning.
  • Token reduction via camera-aware register tokens enables high efficiency without significant sacrifice in accuracy.
  • Decoupling candidate generation and scoring prevents mode collapse and ensures interpretable trajectory selection.
  • Sub-score prediction offers transparency and control, supporting requirement-aware and context-sensitive decision-making.

A plausible implication is that similar register-based compression mechanisms could benefit other high-dimensional multi-camera perception and planning tasks. The lightweight, modular design contrasts with heavily fused, monolithic encoding approaches that are more resource-intensive.

7. Limitations and Directions for Future Research

While DrivoR demonstrates that performance and interpretability can be achieved with a small, modular transformer, several open challenges remain:

  • Further scaling of camera coverage, scene complexity, and behavior diversity may require larger ViTs or alternative fusion strategies.
  • The approach is predicated on oracle sub-scores for training; generalization under domain shift or degraded sensor input is unstudied.
  • Scenario-driven stress-testing and integration into large-scale search-based evaluation infrastructures (e.g., Drivora (Cheng et al., 9 Jan 2026)) remain to be thoroughly explored.

This suggests extensibility toward multi-agent, multi-modal, and language-conditioned settings using similar register-based abstractions, as well as broader adoption in efficiency-critical robotics pipelines.
