
NuScenes Detection Score (NDS)

Updated 7 December 2025
  • NuScenes Detection Score (NDS) is a comprehensive metric designed to evaluate 3D object detection in autonomous driving by combining mAP with five distinct error metrics.
  • It calculates performance using translation, scale, orientation, velocity, and attribute errors, ensuring balanced evaluation of both detection coverage and quality.
  • Empirical studies show that NDS correlates strongly with closed-loop driving performance and safety, making it a standard benchmark for multi-sensor autonomous systems.

The nuScenes Detection Score (NDS) is a comprehensive scalar metric for evaluating 3D object detection performance in autonomous driving contexts. NDS aggregates mean average precision (mAP) with five orthogonal true-positive (TP) error metrics—translation (ATE), scale (ASE), orientation (AOE), velocity (AVE), and attribute (AAE) errors—each computed over recall-sampled true positives. This metric, designed expressly for multi-sensor 3D detection, balances detection coverage against geometric, kinematic, and semantic quality, yielding a unified assessment for ranking and benchmarking detectors. NDS has demonstrated stronger correlation with closed-loop driving outcomes than conventional metrics when detectors are deployed within a full driving stack (Schreier et al., 2023, Caesar et al., 2019, Ji et al., 17 Apr 2025).

1. Formal Definition and Components

NDS is explicitly defined as follows. For a set of detector outputs and ground-truth annotations, let:

  • mAP: Mean Average Precision, evaluated using center-distance matching criteria.
  • ATE: Average Translation Error (meters), the mean Euclidean distance between predicted and ground-truth box centers.
  • ASE: Average Scale Error (unitless), defined as $1-\mathrm{IoU}_{3D}$ after alignment.
  • AOE: Average Orientation Error (radians), the mean absolute yaw difference.
  • AVE: Average Velocity Error (m/s), the mean $\ell_2$ error of the predicted velocity vector.
  • AAE: Average Attribute Error (unitless), defined as 1 minus attribute classification accuracy (e.g., “moving” vs. “stopped”).

The canonical NDS formula (per the nuScenes devkit and its research context) is:

\mathrm{NDS} = 0.5\,\mathrm{mAP} \;+\; 0.1\,(1-\min(1,\mathrm{ATE})) \;+\; 0.1\,(1-\min(1,\mathrm{ASE})) \;+\; 0.1\,(1-\min(1,\mathrm{AOE})) \;+\; 0.1\,(1-\min(1,\mathrm{AVE})) \;+\; 0.1\,(1-\min(1,\mathrm{AAE}))

or equivalently,

\mathrm{NDS} = \frac{1}{10}\left(5\,\mathrm{mAP} + \sum_{e\in\{\mathrm{ATE},\,\mathrm{ASE},\,\mathrm{AOE},\,\mathrm{AVE},\,\mathrm{AAE}\}} \bigl(1-\min(1,e)\bigr) \right)

Each error component is capped at 1 before subtraction, so any error of 1 or greater contributes zero to the score, and perfect performance yields $\mathrm{NDS} = 1$. In some experimental settings, $\mathrm{AAE}$ may be omitted if attribute annotations are unavailable (Schreier et al., 2023, Caesar et al., 2019, Ji et al., 17 Apr 2025).
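To make the aggregation concrete, the following is a minimal sketch in Python of the scalar combination above, assuming mAP and the five mean-TP errors have already been computed; the function name and example values are illustrative and do not reproduce the nuScenes devkit API.

```python
# Minimal sketch of the NDS aggregation (illustrative; not the nuScenes devkit API).
# Inputs: mAP in [0, 1] and the five mean-TP errors in their native units.

def nds(mAP: float, mATE: float, mASE: float, mAOE: float,
        mAVE: float, mAAE: float) -> float:
    """Combine mAP with the five TP error terms into the scalar NDS."""
    tp_errors = [mATE, mASE, mAOE, mAVE, mAAE]
    # Each error is capped at 1, so an error of 1 or more contributes nothing.
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * mAP + sum(tp_scores)) / 10.0

# Example: a detector with mAP = 0.45 and moderate TP errors.
print(nds(0.45, mATE=0.35, mASE=0.27, mAOE=0.45, mAVE=0.35, mAAE=0.19))  # ~0.564
```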

2. Calculation Protocol and Metric Implementation

NDS is computed using a two-phase procedure:

  1. PR Curve Aggregation for mAP: Detection outputs are matched to ground truth by center distance (thresholds of 0.5, 1, 2, and 4 m), producing per-class, per-threshold precision-recall curves. The final mAP averages the resulting average precision values across classes and distance thresholds.
  2. Quality Error Metrics:
    • For each true-positive (TP) detection (matched under a center-distance criterion), compute ATE, ASE, AOE, AVE, and AAE. Each is averaged over TPs at recall levels $\geq 0.1$, then across classes, to yield the mean-TP errors.
    • Each metric’s contribution to NDS is computed as $1-\min(1,e)$, so excessive errors saturate at zero contribution (Caesar et al., 2019).

The nuScenes development kit automates this process, enabling consistent leaderboard ranking. Use of center-distance rather than IoU for TP matching addresses cases where box overlap is ill-defined or velocity/attribute information is critical for evaluation (Caesar et al., 2019).
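As a rough illustration of the matching step, the sketch below greedily assigns predictions to ground-truth boxes of a single class by center distance at one threshold and then averages the translation error over the resulting true positives; recall-level filtering, multi-threshold AP, and the remaining TP errors are omitted, and nothing here is taken from the devkit itself.

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, gt_centers, dist_thresh=2.0):
    """Greedy center-distance matching for one class at one distance threshold.

    Predictions are visited in descending confidence order; each ground-truth
    box can be matched at most once. Returns matched (pred_idx, gt_idx) pairs.
    Simplified sketch only: recall filtering and AP computation are omitted.
    """
    order = np.argsort(-np.asarray(pred_scores))
    unmatched_gt = set(range(len(gt_centers)))
    matches = []
    for i in order:
        if not unmatched_gt:
            break
        dists = {j: float(np.linalg.norm(np.asarray(pred_centers[i]) - np.asarray(gt_centers[j])))
                 for j in unmatched_gt}
        j_best = min(dists, key=dists.get)
        if dists[j_best] <= dist_thresh:
            matches.append((i, j_best))
            unmatched_gt.remove(j_best)
    return matches

def mean_translation_error(pred_centers, gt_centers, matches):
    """ATE over true positives: mean Euclidean center distance in meters."""
    if not matches:
        return 1.0  # no TPs: the error saturates, contributing zero to NDS
    return float(np.mean([np.linalg.norm(np.asarray(pred_centers[i]) - np.asarray(gt_centers[j]))
                          for i, j in matches]))
```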

3. Motivations and Design Rationale

NDS addresses the limitations of legacy 2D detection metrics by providing:

  • Holistic 3D Evaluation: Conventional IoU-based mAP conflates translation, scale, and orientation while omitting dynamic and attribute information. NDS separates these explicitly, reflecting the nuanced demands of multi-sensor 3D perception for autonomous vehicles.
  • Balanced Trade-Off: By splitting the score equally between detection coverage and five error modes, NDS avoids overfitting to a single aspect (such as spatial precision), instead requiring models to demonstrate all-around competence.
  • Practical Scalar: A single number, normalized to the interval $[0,1]$, enables straightforward model comparison while still offering per-component breakdowns for diagnostic purposes (Caesar et al., 2019).

This design facilitates targeted model improvements (e.g., focusing on kinematic or semantic fidelity) and reflects operational needs in real-world driving environments.

4. Empirical Correlation with Driving Performance

Closed-loop experiments using CARLA urban driving simulation show that NDS exhibits superior predictive power with respect to actual driving outcomes:

Metric                          | r (Driving Score) | r (Collisions)
--------------------------------|-------------------|---------------
nuScenes Detection Score (NDS)  | 0.852             | 0.907
Average Precision (mAP)         | 0.805             | 0.903
ADE (planner-centric)           | 0.784             | 0.770
AOS                             | 0.742             | 0.894
FDE (planner-centric)           | 0.703             | 0.653

NDS, particularly with the quality components included, outperforms mAP and planner-centric metrics in its correlation with both driving score and safety proxies (collision count). Center-distance-based matching and the inclusion of kinematic and semantic cues were observed to correlate better with safe driving behavior than IoU-based or planner-centric metrics (Schreier et al., 2023).
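The r values in the table are correlation coefficients computed across a set of detectors between an offline metric and a closed-loop outcome. A minimal sketch of one such computation (using a Pearson coefficient, on hypothetical values rather than the study's data) is:

```python
import numpy as np

# Hypothetical offline NDS values and closed-loop driving scores for five
# detectors (illustrative numbers only; not data from the cited study).
nds_values     = np.array([0.41, 0.48, 0.53, 0.60, 0.66])
driving_scores = np.array([0.52, 0.58, 0.61, 0.70, 0.74])

# Pearson correlation coefficient between the offline metric and the outcome.
r = np.corrcoef(nds_values, driving_scores)[0, 1]
print(f"r = {r:.3f}")
```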

5. Component Significance and Ablation

Ablation studies reveal:

  • Full NDS (with quality terms) has a substantially higher correlation with driving score ($r=0.852$) than NDS computed from mAP@1m alone ($r=0.820$).
  • Inverse distance-weighted variants, which emphasize close-range errors, achieve higher correlation with collision rates (ID-mAP $0.955$, ID-NDS $0.920$), corroborating the importance of near-field detection for safety-critical evaluation (Schreier et al., 2023).

Removal or reweighting of individual error components can degrade correlation with practical driving outcomes, underscoring the intentional balance encoded in the canonical NDS weighting.
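The precise weighting behind the inverse-distance variants is defined in the cited study; purely as a sketch of the idea, one could weight each true positive's error by the reciprocal of its distance to the ego vehicle so that near-field mistakes dominate the average. The function below is an assumption for illustration, not the ID-mAP/ID-NDS definition.

```python
import numpy as np

def inverse_distance_weighted_error(errors, ego_distances, eps=1.0):
    """Illustrative inverse-distance weighting of per-TP errors.

    Each true positive's error is weighted by 1 / (ego distance + eps), so
    errors on nearby objects dominate the mean. This is a sketch of the idea
    only; the cited study defines its own ID-mAP / ID-NDS weighting.
    """
    errors = np.asarray(errors, dtype=float)
    weights = 1.0 / (np.asarray(ego_distances, dtype=float) + eps)
    return float(np.sum(weights * errors) / np.sum(weights))

# Example: a 0.8 m error at 5 m from the ego outweighs a 0.3 m error at 50 m.
print(inverse_distance_weighted_error([0.8, 0.3], ego_distances=[5.0, 50.0]))  # ~0.75
```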

6. Practical Benchmarking and State-of-the-Art Use

The nuScenes leaderboard ranks submissions by NDS and reports per-component metrics. This encourages well-rounded detector optimization:

  • High NDS scores in camera-only and multi-modal settings are used as state-of-the-art (SOTA) benchmarks (e.g., RoPETR achieves NDS = 70.9% using a ViT-L backbone, with a substantial reduction in velocity error) (Ji et al., 17 Apr 2025).
  • Improvements in velocity estimation (lower mAVE) have disproportionate impact on mobile agent detection, as confirmed by error breakdowns and NDS component analysis (Ji et al., 17 Apr 2025).

In typical workflows, both mAP and NDS (with individual error metrics) are reported to fully characterize performance (Caesar et al., 2019, Ji et al., 17 Apr 2025).

7. Strengths, Limitations, and Considerations

Strengths:

  • Strong offline–online fidelity enables benchmarking without the need for expensive closed-loop evaluation; offline metric computation completes in minutes per model versus multi-day driving tests (Schreier et al., 2023).
  • Unified evaluation across diverse modalities and detector types.
  • Directly applicable as an offline proxy for real-world safety and operational performance.

Limitations:

  • Experiments to date often fix the driving stack (single planner/controller) and driving metric; alternative downstream planners or richer behavioral metrics could alter correlations.
  • The attribute error term (AAE\mathrm{AAE}) is often omitted when attribute annotations are incomplete, potentially underestimating performance variance in attribute-rich or heavily semantic classes (Schreier et al., 2023, Caesar et al., 2019).

A plausible implication is that refinements or extensions of NDS may be warranted as detection and planning stacks diversify and as importance weights shift across application domains.

