NuScenes Detection Score (NDS)
- NuScenes Detection Score (NDS) is a comprehensive metric designed to evaluate 3D object detection in autonomous driving by combining mAP with five distinct error metrics.
- It calculates performance using translation, scale, orientation, velocity, and attribute errors, ensuring balanced evaluation of both detection coverage and quality.
- Empirical studies show that NDS correlates strongly with real-world driving safety, making it a vital benchmark for multi-sensor autonomous systems.
The nuScenes Detection Score (NDS) is a comprehensive scalar metric for evaluating 3D object detection performance in autonomous driving contexts. NDS aggregates mean average precision (mAP) with five orthogonal true-positive (TP) error metrics—translation (ATE), scale (ASE), orientation (AOE), velocity (AVE), and attribute (AAE) errors—each computed over recall-sampled true positives. This metric, designed expressly for multi-sensor 3D detection, balances detection coverage and geometric, kinematic, and semantic quality, yielding a unified assessment for ranking and benchmarking detectors. NDS has demonstrated superior correlation with real-world driving outcomes compared to conventional metrics, especially when deployed in closed-loop driving stacks (Schreier et al., 2023, Caesar et al., 2019, Ji et al., 17 Apr 2025).
1. Formal Definition and Components
NDS is explicitly defined as follows. For a set of detector outputs and ground-truth annotations, let:
- mAP: Mean Average Precision, evaluated using center-distance match criteria.
- mATE: Average Translation Error (meters), the mean Euclidean distance between predicted and ground-truth box centers.
- mASE: Average Scale Error (unitless), defined as 1 − IoU after aligning centers and orientation.
- mAOE: Average Orientation Error (radians), the mean absolute difference in yaw.
- mAVE: Average Velocity Error (m/s), the mean L2 error of the velocity vector.
- mAAE: Average Attribute Error (unitless), defined as 1 − attribute classification accuracy (e.g., “moving” vs. “stopped”).
The canonical NDS formula (per the nuScenes devkit and its research context) is:

$$\mathrm{NDS} = \frac{1}{10}\left[5 \cdot \mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \big(1 - \min(1, \mathrm{mTP})\big)\right]$$

or equivalently,

$$\mathrm{NDS} = \frac{1}{2}\,\mathrm{mAP} + \frac{1}{10} \sum_{\mathrm{mTP} \in \mathbb{TP}} \big(1 - \min(1, \mathrm{mTP})\big),$$

where $\mathbb{TP} = \{\mathrm{mATE}, \mathrm{mASE}, \mathrm{mAOE}, \mathrm{mAVE}, \mathrm{mAAE}\}$. Each error component is capped at 1 before subtraction, so larger errors are proportionally down-weighted, and perfect performance yields NDS = 1. In some experimental settings, mAAE may be omitted if attribute annotations are unavailable (Schreier et al., 2023, Caesar et al., 2019, Ji et al., 17 Apr 2025).
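The aggregation itself is a few lines of arithmetic: mAP carries half the weight, and each capped TP error contributes a score in [0, 1] to the other half. A minimal sketch (the function name and dict keys are illustrative, not the devkit API):

```python
def nds(map_score, tp_errors):
    """Compute NDS from mAP and the five mean-TP errors.

    tp_errors: dict with keys 'mATE', 'mASE', 'mAOE', 'mAVE', 'mAAE',
    each a class-averaged mean true-positive error.
    """
    # Cap each error at 1 and convert it to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    # mAP carries half the total weight; the five TP scores share the rest.
    return (5.0 * map_score + sum(tp_scores)) / 10.0

# Perfect detector: mAP = 1 and all TP errors 0 give NDS = 1.
perfect = nds(1.0, {'mATE': 0, 'mASE': 0, 'mAOE': 0, 'mAVE': 0, 'mAAE': 0})
```

Note that any error at or above 1 saturates: it contributes exactly zero, so a detector cannot be penalized below 0 on any single component.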
2. Calculation Protocol and Metric Implementation
NDS is computed using a two-phase procedure:
- PR Curve Aggregation for mAP: Detection outputs are matched by center-distance (e.g., thresholds at 0.5, 1, 2, 4 m), generating per-class and per-threshold average precision curves. The final mAP averages these across classes and thresholds.
- Quality Error Metrics:
- For each true-positive (TP) detection (as matched under a center-distance criterion), compute ATE, ASE, AOE, AVE, and AAE. Each is averaged over TPs at recall levels above 10%, then across classes to yield mean-TP errors.
- Each metric’s contribution to NDS is computed as $1 - \min(1, \mathrm{mTP})$, ensuring that excessive errors are saturated at zero contribution (Caesar et al., 2019).
The nuScenes development kit automates this process, enabling consistent leaderboard ranking. Use of center-distance rather than IoU for TP matching addresses cases where box overlap is ill-defined or velocity/attribute information is critical for evaluation (Caesar et al., 2019).
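The center-distance matching step can be sketched as a greedy assignment: predictions are visited in descending confidence order and matched to the nearest unmatched ground truth within the distance threshold. This is a simplified illustration (function name and signature are hypothetical; the real devkit also handles classes, scenes, and per-threshold bookkeeping):

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, gt_centers, threshold):
    """Greedy center-distance matching, in the spirit of nuScenes TP matching.

    Each prediction, in descending confidence order, is matched to the
    nearest unmatched ground truth whose planar (x, y) center distance is
    below `threshold` metres. Returns a list of (pred_idx, gt_idx) pairs.
    """
    order = np.argsort(-np.asarray(pred_scores))  # highest confidence first
    unmatched = set(range(len(gt_centers)))
    matches = []
    for i in order:
        best, best_d = None, threshold
        for j in unmatched:
            # 2D distance in the ground plane, ignoring height.
            d = np.linalg.norm(np.asarray(pred_centers[i])[:2]
                               - np.asarray(gt_centers[j])[:2])
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            matches.append((int(i), best))
            unmatched.discard(best)
    return matches
```

Matched pairs then feed the TP error computations (e.g., ATE is the mean of the matched center distances), while unmatched predictions and ground truths become false positives and false negatives for the PR curves.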
3. Motivations and Design Rationale
NDS addresses the limitations of legacy 2D detection metrics by providing:
- Holistic 3D Evaluation: Conventional IoU-based mAP conflates translation, scale, and orientation while omitting dynamic and attribute information. NDS explicitly separates these, reflecting the nuanced demands of multi-sensor, 3D-perception for autonomous vehicles.
- Balanced Trade-Off: By splitting the score equally between detection coverage and five error modes, NDS avoids overfitting to a single aspect (such as spatial precision), instead requiring models to demonstrate all-around competence.
- Practical Scalar: A single number, normalized to the interval [0, 1], enables straightforward model comparison while still offering breakdowns for diagnostic purposes (Caesar et al., 2019).
This design facilitates targeted model improvements (e.g., focusing on kinematic or semantic fidelity) and reflects operational needs in real-world driving environments.
4. Empirical Correlation with Driving Performance
Closed-loop experiments using CARLA urban driving simulation show that NDS exhibits superior predictive power with respect to actual driving outcomes:
| Metric | Corr. (Driving Score) | Corr. (Collisions) |
|---|---|---|
| nuScenes Detection Score (NDS) | 0.852 | 0.907 |
| Average Precision (mAP) | 0.805 | 0.903 |
| ADE (planner-centric) | 0.784 | 0.770 |
| AOS | 0.742 | 0.894 |
| FDE (planner-centric) | 0.703 | 0.653 |
NDS, particularly with quality components included, outperforms mAP and planner-centric metrics both in completion rate and safety proxies (collision count). Center-distance-based matching and the inclusion of kinematic and semantic cues were observed to be better correlated with safe driving behavior than IoU-based or planner-centric metrics (Schreier et al., 2023).
5. Component Significance and Ablation
Ablation studies reveal:
- Full NDS (with quality terms) has a substantially higher correlation to driving score than NDS computed from mAP@1m alone.
- Inverse distance–weighted variants—which emphasize close-distance errors—achieve higher correlation to collision rates (ID-mAP 0.955, ID-NDS 0.920), corroborating the importance of near-field detection for safety-critical evaluation (Schreier et al., 2023).
Removal or reweighting of individual error components can degrade correlation with practical driving outcomes, underscoring the intentional balance encoded in the canonical NDS weighting.
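The inverse-distance idea—weighting each object's contribution by its proximity to the ego vehicle—can be sketched as follows. This is illustrative only; the exact ID-mAP/ID-NDS weighting is defined in Schreier et al. (2023), and the function name and `eps` smoothing term here are assumptions:

```python
import numpy as np

def inverse_distance_weights(centers, ego_xy=(0.0, 0.0), eps=1.0):
    """Per-object weights that emphasize near-field detections.

    Each object's weight is proportional to the inverse of its planar
    distance to the ego vehicle; `eps` avoids division by zero for objects
    at the ego position. Weights are normalized to sum to 1.
    """
    centers = np.asarray(centers, dtype=float)[:, :2]  # keep (x, y) only
    d = np.linalg.norm(centers - np.asarray(ego_xy), axis=1)
    w = 1.0 / (d + eps)
    return w / w.sum()
```

Under such a scheme, a missed pedestrian 2 m ahead of the ego vehicle costs far more than one 50 m away, which is consistent with the reported gain in correlation to collision counts.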
6. Practical Benchmarking and State-of-the-Art Use
The nuScenes leaderboard ranks submissions by NDS and reports per-component metrics. This encourages well-rounded detector optimization:
- High NDS scores in camera-only and multi-modal settings are used as state-of-the-art (SOTA) benchmarks (e.g., RoPETR achieves state-of-the-art NDS with a ViT-L backbone, with a substantial reduction in velocity error) (Ji et al., 17 Apr 2025).
- Improvements in velocity estimation (lower mAVE) have disproportionate impact on mobile agent detection, as confirmed by error breakdowns and NDS component analysis (Ji et al., 17 Apr 2025).
In typical workflows, both mAP and NDS (with individual error metrics) are reported to fully characterize performance (Caesar et al., 2019, Ji et al., 17 Apr 2025).
7. Strengths, Limitations, and Considerations
Strengths:
- Strong offline–online fidelity enables benchmarking without the need for expensive closed-loop evaluation; offline metric computation completes in minutes per model versus multi-day driving tests (Schreier et al., 2023).
- Unified evaluation across diverse modalities and detector types.
- Directly applicable as an offline proxy for real-world safety and operational performance.
Limitations:
- Experiments to date often fix the driving stack (single planner/controller) and driving metric; alternative downstream planners or richer behavioral metrics could alter correlations.
- The attribute error term (mAAE) is often omitted when attribute annotations are incomplete, potentially underestimating performance variance in attribute-rich or heavily semantic classes (Schreier et al., 2023, Caesar et al., 2019).
A plausible implication is that refinements or extensions of NDS may be warranted as detection and planning stacks diversify and as importance weights shift across application domains.
References:
- "On Offline Evaluation of 3D Object Detection for Autonomous Driving" (Schreier et al., 2023)
- "nuScenes: A multimodal dataset for autonomous driving" (Caesar et al., 2019)
- "RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding" (Ji et al., 17 Apr 2025)