DrivoR: Transformer for Autonomous Driving
- DrivoR is a transformer-based autonomous driving architecture that leverages pretrained Vision Transformers and camera-aware register tokens for effective multi-camera perception.
- It decouples trajectory generation and scoring using lightweight decoder stacks to produce and evaluate candidate trajectories with interpretable sub-scores.
- Benchmark evaluations on NAVSIM and HUGSIM highlight its competitive performance and computational efficiency with a minimalist design.
DrivoR is a transformer-based, end-to-end autonomous driving architecture that leverages pretrained Vision Transformers (ViTs) and camera-aware register tokens to achieve accurate, efficient, and behavior-conditioned driving. Its design combines multi-camera perception, compact feature compression, decoupled generation and evaluation of candidate trajectories, and interpretable sub-score prediction for decision-making. DrivoR demonstrates performance on par with or superior to contemporary baselines across NAVSIM-v1, NAVSIM-v2, and HUGSIM closed-loop evaluation benchmarks, while maintaining a minimalist architectural footprint and high computational efficiency (Kirby et al., 8 Jan 2026).
1. Architecture Overview
DrivoR processes raw sensory inputs from four surround cameras (front, front-left, front-right, rear), utilizing a pretrained ViT (typically ViT-S or larger). Each input image is divided into patches; these patch embeddings are supplemented with learnable, camera-specific "register" tokens. The combined set of image and register tokens passes through the transformer layers, after which only the register tokens are retained for downstream modules; the patch tokens are discarded.
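The per-camera register mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token counts, embedding width, and layer count are assumptions, and a standard `nn.TransformerEncoder` stands in for the pretrained ViT backbone.

```python
# Sketch of camera-aware register tokens (hypothetical shapes). Each camera
# gets its own learnable register bank, which is concatenated with the ViT
# patch tokens, co-attended, and then kept as the sole per-camera output.
import torch
import torch.nn as nn

class CameraRegisterEncoder(nn.Module):
    def __init__(self, n_cams=4, n_reg=16, d=384, n_layers=2):
        super().__init__()
        # One register bank per camera -> viewpoint specialization.
        self.registers = nn.Parameter(torch.randn(n_cams, n_reg, d) * 0.02)
        layer = nn.TransformerEncoderLayer(d, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_reg = n_reg

    def forward(self, patch_tokens):          # (B, n_cams, P, d)
        B, C, P, d = patch_tokens.shape
        outs = []
        for c in range(C):
            reg = self.registers[c].unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([patch_tokens[:, c], reg], dim=1)  # (B, P+n_reg, d)
            x = self.encoder(x)
            outs.append(x[:, -self.n_reg:])   # keep only register tokens
        return torch.cat(outs, dim=1)         # (B, n_cams * n_reg, d)

enc = CameraRegisterEncoder()
scene = enc(torch.randn(2, 4, 196, 384))
print(scene.shape)  # torch.Size([2, 64, 384])
```

Note the output retains only 16 tokens per camera regardless of the input patch count, which is what makes the downstream decoders cheap.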
The architecture is composed of two primary lightweight transformer decoder stacks:
- Trajectory Generator: Proposes future action trajectories given the compact, multi-camera scene encoding.
- Scoring Decoder: Assigns interpretable, aspect-specific sub-scores—learned to mimic an external oracle—to each candidate trajectory via cross-modal attention.
2. Register Tokens and Scene Representation
DrivoR’s camera-aware register tokens yield a significant reduction in sequence length and computational overhead. For each camera view:
- Patch tokens and register tokens are produced by the final ViT layer.
- Only the register tokens, concatenated across cameras, form the input to the planning stack.
- Register tokens are camera-specific, allowing the system to learn specialization by viewpoint.
Compared with spatial average pooling or cross-attention-based fusion, register tokens permit focused extraction of planning-relevant information, compressing the roughly 4K patch tokens per camera down to a small set of register tokens (e.g., 16) and expediting all downstream computation.
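The magnitude of this saving follows directly from the figures above (≈4K patch tokens vs. 16 registers per camera); a quick back-of-envelope check, using the quadratic cost of self-attention in sequence length:

```python
# Token counts taken from the text above; the quadratic-cost comparison is
# a rough estimate, ignoring per-token feedforward and constant factors.
patch_tokens_per_cam = 4096
registers_per_cam = 16
n_cams = 4

full_seq = n_cams * patch_tokens_per_cam   # 16384 tokens across 4 cameras
reg_seq = n_cams * registers_per_cam       # 64 tokens across 4 cameras

# Self-attention cost scales quadratically with sequence length.
speedup = (full_seq / reg_seq) ** 2
print(reg_seq, speedup)  # 64 65536.0
```

Even allowing for large constant factors, a ~256× shorter sequence makes the planning decoders' attention cost negligible next to the ViT backbone.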
3. Candidate Trajectory Generation and Scoring
3.1 Trajectory Generation
- The trajectory generator decoder accepts:
- The concatenated register tokens (scene encoding).
- A set of learnable trajectory queries, one per candidate trajectory.
- Ego vehicle state (pose, velocity, acceleration, command), which is embedded and added to each query.
- Four attention blocks combine trajectory self-attention with cross-attention to the scene.
- Output: one token per query, each mapped via an MLP to a trajectory of positions and headings over the future planning horizon.
The learning objective is a minimum-of-N (MoN), or winner-take-all (WTA), L1 regression to the human reference trajectory $\tau^{\text{ref}}$:

$$\mathcal{L}_{\text{traj}} = \min_{i \in \{1, \dots, N\}} \left\| \hat{\tau}_i - \tau^{\text{ref}} \right\|_1$$
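The generator and its winner-take-all objective can be sketched as below. The query count, horizon, and widths are illustrative assumptions, not the paper's exact settings, and `mon_l1_loss` is a hypothetical helper name.

```python
# Minimal sketch of the trajectory generator decoder and the minimum-of-N
# (winner-take-all) L1 objective described above. Shapes are assumptions.
import torch
import torch.nn as nn

N_QUERIES, HORIZON, D = 20, 8, 384

class TrajectoryGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_QUERIES, D) * 0.02)
        self.ego_embed = nn.Linear(8, D)       # pose/vel/accel/command -> D
        layer = nn.TransformerDecoderLayer(D, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D, HORIZON * 3)  # (x, y, heading) per step

    def forward(self, scene_tokens, ego_state):
        B = scene_tokens.size(0)
        # Ego embedding is added to every trajectory query.
        q = self.queries.unsqueeze(0).expand(B, -1, -1) \
            + self.ego_embed(ego_state)[:, None]
        out = self.decoder(q, scene_tokens)    # self-attn + cross-attn to scene
        return self.head(out).view(B, N_QUERIES, HORIZON, 3)

def mon_l1_loss(candidates, human):
    # candidates: (B, N, T, 3); human: (B, T, 3).
    per_cand = (candidates - human[:, None]).abs().mean(dim=(2, 3))
    return per_cand.min(dim=1).values.mean()   # winner-take-all over N

gen = TrajectoryGenerator()
trajs = gen(torch.randn(2, 64, D), torch.randn(2, 8))
loss = mon_l1_loss(trajs, torch.randn(2, HORIZON, 3))
```

Taking the minimum over candidates means only the best-matching trajectory receives gradient, which encourages the query set to cover diverse driving modes.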
3.2 Scoring Decoder
- Each candidate trajectory is re-embedded via an MLP to form its input query.
- The scoring decoder processes these queries with the same scene tokens and self-/cross-attention structure.
- Six independent MLP heads predict sub-scores corresponding to safety, comfort, efficiency, and traffic compliance metrics (e.g., NC, DAC, TTC, EP, Comf).
- The scoring loss is the binary cross-entropy with oracle-provided sub-scores.
Structural disentanglement between trajectory generation and scoring (as opposed to sharing queries/weights) is empirically critical: combining both into a single decoder degrades the primary driving metric (PDMS) by 5–6 points.
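The scoring path can be sketched as follows, with weights entirely separate from the generator in line with the disentanglement finding. The six heads come from the text; the embedding dimensions and depth are assumptions.

```python
# Sketch of the scoring decoder: candidates are re-embedded, cross-attend
# to the scene tokens, and six independent heads predict sub-scores in
# [0, 1], trained with binary cross-entropy against oracle sub-scores.
import torch
import torch.nn as nn

D, HORIZON, N_SCORES = 384, 8, 6

class ScoringDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.traj_embed = nn.Sequential(        # re-embed each candidate
            nn.Linear(HORIZON * 3, D), nn.ReLU(), nn.Linear(D, D))
        layer = nn.TransformerDecoderLayer(D, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Six independent heads, e.g. NC, DAC, TTC, EP, Comf, ...
        self.heads = nn.ModuleList(nn.Linear(D, 1) for _ in range(N_SCORES))

    def forward(self, trajs, scene_tokens):     # trajs: (B, N, T, 3)
        q = self.traj_embed(trajs.flatten(2))   # (B, N, D)
        h = self.decoder(q, scene_tokens)       # self-/cross-attention
        return torch.sigmoid(torch.cat([head(h) for head in self.heads], -1))

scorer = ScoringDecoder()
subscores = scorer(torch.randn(2, 20, HORIZON, 3), torch.randn(2, 64, D))
# Training target: oracle-provided sub-scores in [0, 1].
loss = nn.functional.binary_cross_entropy(subscores, torch.rand(2, 20, 6))
```

Keeping `ScoringDecoder` and the generator as separate modules (rather than shared queries/weights) mirrors the 5–6 point PDMS gap reported above.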
4. Training, Behavior Conditioning, and Inference
The total loss combines the trajectory regression and scoring components, with a fixed weighting coefficient $\lambda$ in practice:

$$\mathcal{L} = \mathcal{L}_{\text{traj}} + \lambda \, \mathcal{L}_{\text{score}}$$
At inference, per-trajectory sub-scores are synthesized into a global meta-score using user-configurable weights:
- PDMS-style (product of penalties, sum of behavior metrics) or simple weighted sum.
- The selected trajectory is the candidate that maximizes the meta-score.
This reward re-weighting at inference time enables flexible behavior conditioning—aggressive, defensive, or otherwise—without retraining.
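The inference-time selection above can be sketched as a PDMS-style combination: a product of hard penalty terms multiplied by a weighted sum of the remaining behavior metrics. Which sub-scores act as penalties and the weight values are illustrative assumptions, not the paper's configuration.

```python
# Inference-time behavior conditioning: combine predicted sub-scores into
# a meta-score with user-chosen weights and pick the argmax candidate.
import torch

def select_trajectory(subscores, penalty_idx=(0, 1),
                      weights=(0.4, 0.3, 0.2, 0.1)):
    # subscores: (N, 6) in [0, 1]; columns in penalty_idx are multiplied
    # as penalties, the rest enter a weighted sum (PDMS-style).
    penalties = subscores[:, list(penalty_idx)].prod(dim=1)
    w = torch.tensor(weights)
    behavior = (subscores[:, len(penalty_idx):] * w).sum(dim=1)
    meta = penalties * behavior
    return int(meta.argmax())

scores = torch.rand(20, 6)   # 20 candidates, 6 predicted sub-scores
best = select_trajectory(scores)
```

Changing `weights` (e.g., up-weighting comfort vs. progress) re-ranks the same candidate set, which is what allows aggressive or defensive behavior without retraining.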
5. Benchmark Evaluation and Performance
DrivoR matches or exceeds strong neural and classical baselines across a range of closed-loop and open-loop end-to-end driving benchmarks:
| Benchmark | Metric | DrivoR (ViT-S) | Best Baselines |
|---|---|---|---|
| NAVSIM-v1 | PDMS | 93.7 | DriveSuprim 93.5, Human 94.8 |
| NAVSIM-v2 | EPDMS | 48.3 | ZTRS (ViT-99) 48.1 |
| HUGSIM | RC / HD-Score | 49.8 / 35.7 | UniAD 45.9 / 32.7 |
| NAVSIM-v2 Speed | ms/fwd (A100) | 110 | GTRS-Dense 400 |
| NAVSIM-v2 Memory | GB peak | 0.5 | GTRS-Dense 1.6 |
DrivoR achieves state-of-the-art performance on navigation and closed-loop metrics, while operating with a small parameter budget and at least 3× faster throughput than competitive models (Kirby et al., 8 Jan 2026).
6. Design Principles and Comparative Analysis
DrivoR’s empirical successes derive from several core principles:
- Pure-transformer, query-based computation allows flexible conditioning.
- Token reduction via camera-aware register tokens enables high efficiency without significant sacrifice in accuracy.
- Decoupling candidate generation and scoring prevents mode collapse and ensures interpretable trajectory selection.
- Sub-score prediction offers transparency and control, supporting requirement-aware and context-sensitive decision-making.
A plausible implication is that similar register-based compression mechanisms could benefit other high-dimensional multi-camera perception and planning tasks. The lightweight, modular design contrasts with heavily fused, monolithic encoding approaches that are more resource-intensive.
7. Limitations and Directions for Future Research
While DrivoR demonstrates that performance and interpretability can be achieved with a small, modular transformer, several open challenges remain:
- Further scaling of camera coverage, scene complexity, and behavior diversity may require larger ViTs or alternative fusion strategies.
- The approach is predicated on oracle sub-scores for training; generalization under domain shift or degraded sensor input is unstudied.
- Scenario-driven stress-testing and integration into large-scale search-based evaluation infrastructures (e.g., Drivora (Cheng et al., 9 Jan 2026)) remain to be thoroughly explored.
This suggests extensibility toward multi-agent, multi-modal, and language-conditioned settings using similar register-based abstractions, as well as broader adoption in efficiency-critical robotics pipelines.