DrivoR: Transformer for Autonomous Driving
- DrivoR is a transformer-based autonomous driving architecture that leverages pretrained Vision Transformers and camera-aware register tokens for effective multi-camera perception.
- It decouples trajectory generation and scoring using lightweight decoder stacks to produce and evaluate candidate trajectories with interpretable sub-scores.
- Benchmark evaluations on NAVSIM and HUGSIM highlight its competitive performance and computational efficiency with a minimalist design.
DrivoR is a transformer-based, end-to-end autonomous driving architecture that leverages pretrained Vision Transformers (ViTs) and camera-aware register tokens to achieve accurate, efficient, and behavior-conditioned driving. Its design combines multi-camera perception, compact feature compression, decoupled generation and evaluation of candidate trajectories, and interpretable sub-score prediction for decision-making. DrivoR demonstrates performance on par with or superior to contemporary baselines across NAVSIM-v1, NAVSIM-v2, and HUGSIM closed-loop evaluation benchmarks, while maintaining a minimalist architectural footprint and high computational efficiency (Kirby et al., 8 Jan 2026).
1. Architecture Overview
DrivoR processes raw sensory inputs from four surround cameras (front, front-left, front-right, rear), utilizing a pretrained ViT (typically ViT-S or larger). Each input image is divided into patches; these patch embeddings are supplemented with learnable, camera-specific "register" tokens. The combined set of image and register tokens passes through the transformer layers, after which only the register tokens are retained for downstream modules; the patch tokens are discarded.
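The per-camera register mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token counts, embedding width, and layer count are assumptions, and a standard `nn.TransformerEncoder` stands in for the pretrained ViT backbone.

```python
# Sketch of camera-aware register tokens (hypothetical shapes). Each camera
# gets its own learnable register bank, which is concatenated with the ViT
# patch tokens, co-attended, and then kept as the sole per-camera output.
import torch
import torch.nn as nn

class CameraRegisterEncoder(nn.Module):
    def __init__(self, n_cams=4, n_reg=16, d=384, n_layers=2):
        super().__init__()
        # One register bank per camera -> viewpoint specialization.
        self.registers = nn.Parameter(torch.randn(n_cams, n_reg, d) * 0.02)
        layer = nn.TransformerEncoderLayer(d, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_reg = n_reg

    def forward(self, patch_tokens):          # (B, n_cams, P, d)
        B, C, P, d = patch_tokens.shape
        outs = []
        for c in range(C):
            reg = self.registers[c].unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([patch_tokens[:, c], reg], dim=1)  # (B, P+n_reg, d)
            x = self.encoder(x)
            outs.append(x[:, -self.n_reg:])   # keep only register tokens
        return torch.cat(outs, dim=1)         # (B, n_cams * n_reg, d)

enc = CameraRegisterEncoder()
scene = enc(torch.randn(2, 4, 196, 384))
print(scene.shape)  # torch.Size([2, 64, 384])
```

Note the output retains only 16 tokens per camera regardless of the input patch count, which is what makes the downstream decoders cheap.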
The architecture is composed of two primary lightweight transformer decoder stacks:
- Trajectory Generator: Proposes future action trajectories given the compact, multi-camera scene encoding.
- Scoring Decoder: Assigns interpretable, aspect-specific sub-scores—learned to mimic an external oracle—to each candidate trajectory via cross-modal attention.
2. Register Tokens and Scene Representation
DrivoR’s camera-aware register tokens yield a significant reduction in sequence length and computational overhead. For each camera view:
- Patch tokens and register tokens are produced by the final ViT layer.
- Only the register tokens, concatenated across cameras, form the input to the planning stack.
- Register tokens are camera-specific, allowing the system to learn specialization by viewpoint.
Compared with spatial average pooling or cross-attention-based fusion, register tokens permit focused extraction of planning-relevant information, compressing the roughly 4K patch tokens per camera down to a small set of register tokens (e.g., 16) and expediting all downstream computation.
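The magnitude of this saving follows directly from the figures above (≈4K patch tokens vs. 16 registers per camera); a quick back-of-envelope check, using the quadratic cost of self-attention in sequence length:

```python
# Token counts taken from the text above; the quadratic-cost comparison is
# a rough estimate, ignoring per-token feedforward and constant factors.
patch_tokens_per_cam = 4096
registers_per_cam = 16
n_cams = 4

full_seq = n_cams * patch_tokens_per_cam   # 16384 tokens across 4 cameras
reg_seq = n_cams * registers_per_cam       # 64 tokens across 4 cameras

# Self-attention cost scales quadratically with sequence length.
speedup = (full_seq / reg_seq) ** 2
print(reg_seq, speedup)  # 64 65536.0
```

Even allowing for large constant factors, a ~256× shorter sequence makes the planning decoders' attention cost negligible next to the ViT backbone.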
3. Candidate Trajectory Generation and Scoring
3.1 Trajectory Generation
- The trajectory generator decoder accepts:
- The concatenated register tokens (scene encoding).
- A set of learnable trajectory queries, one per candidate trajectory.
- Ego vehicle state (pose, velocity, acceleration, command), which is embedded and added to each query.
- Four attention blocks combine trajectory self-attention with cross-attention to the scene.
- Output: one token per query, each mapped via an MLP to a trajectory of positions and headings over the future planning horizon.
The learning objective is a minimum-of-N (MoN), or winner-take-all (WTA), L1 regression to the human reference trajectory $\tau^{\text{ref}}$:

$$\mathcal{L}_{\text{traj}} = \min_{i \in \{1, \dots, N\}} \left\| \hat{\tau}_i - \tau^{\text{ref}} \right\|_1$$
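The generator and its winner-take-all objective can be sketched as below. The query count, horizon, and widths are illustrative assumptions, not the paper's exact settings, and `mon_l1_loss` is a hypothetical helper name.

```python
# Minimal sketch of the trajectory generator decoder and the minimum-of-N
# (winner-take-all) L1 objective described above. Shapes are assumptions.
import torch
import torch.nn as nn

N_QUERIES, HORIZON, D = 20, 8, 384

class TrajectoryGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_QUERIES, D) * 0.02)
        self.ego_embed = nn.Linear(8, D)       # pose/vel/accel/command -> D
        layer = nn.TransformerDecoderLayer(D, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D, HORIZON * 3)  # (x, y, heading) per step

    def forward(self, scene_tokens, ego_state):
        B = scene_tokens.size(0)
        # Ego embedding is added to every trajectory query.
        q = self.queries.unsqueeze(0).expand(B, -1, -1) \
            + self.ego_embed(ego_state)[:, None]
        out = self.decoder(q, scene_tokens)    # self-attn + cross-attn to scene
        return self.head(out).view(B, N_QUERIES, HORIZON, 3)

def mon_l1_loss(candidates, human):
    # candidates: (B, N, T, 3); human: (B, T, 3).
    per_cand = (candidates - human[:, None]).abs().mean(dim=(2, 3))
    return per_cand.min(dim=1).values.mean()   # winner-take-all over N

gen = TrajectoryGenerator()
trajs = gen(torch.randn(2, 64, D), torch.randn(2, 8))
loss = mon_l1_loss(trajs, torch.randn(2, HORIZON, 3))
```

Taking the minimum over candidates means only the best-matching trajectory receives gradient, which encourages the query set to cover diverse driving modes.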
3.2 Scoring Decoder
- Each candidate trajectory is re-embedded via an MLP to form its input query.
- The scoring decoder processes these queries with the same scene tokens and self-/cross-attention structure.
- Six independent MLP heads predict sub-scores corresponding to safety, comfort, efficiency, and traffic compliance metrics (e.g., NC, DAC, TTC, EP, Comf).
- The scoring loss is the binary cross-entropy with oracle-provided sub-scores.
Structural disentanglement between trajectory generation and scoring (as opposed to sharing queries/weights) is empirically critical: combining both into a single decoder degrades the primary driving metric (PDMS) by 5–6 points.
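The scoring path can be sketched as follows, with weights entirely separate from the generator in line with the disentanglement finding. The six heads come from the text; the embedding dimensions and depth are assumptions.

```python
# Sketch of the scoring decoder: candidates are re-embedded, cross-attend
# to the scene tokens, and six independent heads predict sub-scores in
# [0, 1], trained with binary cross-entropy against oracle sub-scores.
import torch
import torch.nn as nn

D, HORIZON, N_SCORES = 384, 8, 6

class ScoringDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.traj_embed = nn.Sequential(        # re-embed each candidate
            nn.Linear(HORIZON * 3, D), nn.ReLU(), nn.Linear(D, D))
        layer = nn.TransformerDecoderLayer(D, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Six independent heads, e.g. NC, DAC, TTC, EP, Comf, ...
        self.heads = nn.ModuleList(nn.Linear(D, 1) for _ in range(N_SCORES))

    def forward(self, trajs, scene_tokens):     # trajs: (B, N, T, 3)
        q = self.traj_embed(trajs.flatten(2))   # (B, N, D)
        h = self.decoder(q, scene_tokens)       # self-/cross-attention
        return torch.sigmoid(torch.cat([head(h) for head in self.heads], -1))

scorer = ScoringDecoder()
subscores = scorer(torch.randn(2, 20, HORIZON, 3), torch.randn(2, 64, D))
# Training target: oracle-provided sub-scores in [0, 1].
loss = nn.functional.binary_cross_entropy(subscores, torch.rand(2, 20, 6))
```

Keeping `ScoringDecoder` and the generator as separate modules (rather than shared queries/weights) mirrors the 5–6 point PDMS gap reported above.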
4. Training, Behavior Conditioning, and Inference
The total loss combines the trajectory regression and scoring components, with a fixed weighting coefficient $\lambda$ in practice:

$$\mathcal{L} = \mathcal{L}_{\text{traj}} + \lambda \, \mathcal{L}_{\text{score}}$$
At inference, per-trajectory sub-scores are synthesized into a global meta-score using user-configurable weights:
- PDMS-style (product of penalties, sum of behavior metrics) or simple weighted sum.
- The selected trajectory is the candidate that maximizes the meta-score.
This reward re-weighting at inference time enables flexible behavior conditioning—aggressive, defensive, or otherwise—without retraining.
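The inference-time selection above can be sketched as a PDMS-style combination: a product of hard penalty terms multiplied by a weighted sum of the remaining behavior metrics. Which sub-scores act as penalties and the weight values are illustrative assumptions, not the paper's configuration.

```python
# Inference-time behavior conditioning: combine predicted sub-scores into
# a meta-score with user-chosen weights and pick the argmax candidate.
import torch

def select_trajectory(subscores, penalty_idx=(0, 1),
                      weights=(0.4, 0.3, 0.2, 0.1)):
    # subscores: (N, 6) in [0, 1]; columns in penalty_idx are multiplied
    # as penalties, the rest enter a weighted sum (PDMS-style).
    penalties = subscores[:, list(penalty_idx)].prod(dim=1)
    w = torch.tensor(weights)
    behavior = (subscores[:, len(penalty_idx):] * w).sum(dim=1)
    meta = penalties * behavior
    return int(meta.argmax())

scores = torch.rand(20, 6)   # 20 candidates, 6 predicted sub-scores
best = select_trajectory(scores)
```

Changing `weights` (e.g., up-weighting comfort vs. progress) re-ranks the same candidate set, which is what allows aggressive or defensive behavior without retraining.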
5. Benchmark Evaluation and Performance
DrivoR matches or exceeds strong neural and classical baselines across a range of closed-loop and open-loop end-to-end driving benchmarks:
| Benchmark | Metric | DrivoR (ViT-S) | Best Baselines |
|---|---|---|---|
| NAVSIM-v1 | PDMS | 93.7 | DriveSuprim 93.5, Human 94.8 |
| NAVSIM-v2 | EPDMS | 48.3 | ZTRS (ViT-99) 48.1 |
| HUGSIM | RC / HD-Score | 49.8 / 35.7 | UniAD 45.9 / 32.7 |
| NAVSIM-v2 Speed | ms/fwd (A100) | 110 | GTRS-Dense 400 |
| NAVSIM-v2 Memory | GB peak | 0.5 | GTRS-Dense 1.6 |
DrivoR achieves state-of-the-art performance on navigation and closed-loop metrics, while operating with a small parameter budget and at least 3× faster throughput than competitive models (Kirby et al., 8 Jan 2026).
6. Design Principles and Comparative Analysis
DrivoR’s empirical successes derive from several core principles:
- Pure-transformer, query-based computation allows flexible conditioning.
- Token reduction via camera-aware register tokens enables high efficiency without significant sacrifice in accuracy.
- Decoupling candidate generation and scoring prevents mode collapse and ensures interpretable trajectory selection.
- Sub-score prediction offers transparency and control, supporting requirement-aware and context-sensitive decision-making.
A plausible implication is that similar register-based compression mechanisms could benefit other high-dimensional multi-camera perception and planning tasks. The lightweight, modular design contrasts with heavily fused, monolithic encoding approaches that are more resource-intensive.
7. Limitations and Directions for Future Research
While DrivoR demonstrates that performance and interpretability can be achieved with a small, modular transformer, several open challenges remain:
- Further scaling of camera coverage, scene complexity, and behavior diversity may require larger ViTs or alternative fusion strategies.
- The approach is predicated on oracle sub-scores for training; generalization under domain shift or degraded sensor input is unstudied.
- Scenario-driven stress-testing and integration into large-scale search-based evaluation infrastructures (e.g., Drivora (Cheng et al., 9 Jan 2026)) remain to be thoroughly explored.
This suggests extensibility toward multi-agent, multi-modal, and language-conditioned settings using similar register-based abstractions, as well as broader adoption in efficiency-critical robotics pipelines.