Safety Assessment Score for Object Perception
- Safety Assessment Score is a quantitative metric that integrates classical detection measures with context-aware, criticality-weighted safety indicators.
- It employs methodologies such as entropy-based uncertainty, collision-probability aggregates, and planner-aligned scores to evaluate perception risk.
- The approach is validated through real-world benchmarks and is essential for system certification, runtime monitoring, and safety assurance.
A Safety Assessment Score (SAS) for object perception quantitatively expresses the ability of a perception system to detect, localize, and correctly characterize objects that are safety-relevant, in a manner that is directly tied to the probability and severity of adverse outcomes (e.g., collisions, mission aborts, surgical mistakes). SAS frameworks unify classical perception metrics, such as precision and recall, with scenario-aware and criticality-weighted measures that prioritize failures according to their safety impact. Over the past years, a diverse set of scoring methodologies has been developed—spanning uncertainty-driven detectors, risk-based reachability, criticality aggregation, cyber-physical rankings, safety-zone analysis, and cross-domain reliability-weighting—each with a well-defined mapping from perception outputs to a scalar or structured evaluation interpretable for safety assurance and system certification.
1. Metric Foundations: Beyond Classical Detection Measures
Traditional object detection metrics (Precision, Recall, mAP) treat all objects uniformly, obscuring the safety-critical context of each detection or miss. Research consistently demonstrates that missing a pedestrian crossing at close range and missing a distant static vehicle yield radically different risk profiles, yet count equally under classical metrics.
Safety assessment scores depart from this baseline by incorporating:
- Criticality or relevance ranking based on proximity, relative velocity, time-to-collision (TTC), and predicted injury/damage severity (Bansal et al., 2021, Gamerdinger et al., 17 Dec 2025, Ceccarelli et al., 2022).
- Scenario-driven weights capturing the likelihood and impact of a miss or false alarm for downstream decision-making (Volk et al., 16 Dec 2025, Gamerdinger et al., 17 Dec 2025, Peng et al., 2022).
- Planner- or mission-linked cost functions, which penalize perception errors that are most likely to cause a hazardous or ineffective system behavior (Topan et al., 2022, Khandal et al., 2021, Bernhard et al., 2021).
This foundational shift is exemplified by methods such as risk-ranked recall (Bansal et al., 2021), safety-weighted recall and reliability-weighted precision (Ceccarelli et al., 2022), and criticality metric aggregation (Gamerdinger et al., 17 Dec 2025), which modulate the influence of each detection outcome according to quantified risk.
2. Formulations and Scoring Methodologies
2.1 Entropy- and Uncertainty-Based Scores
The PeSOTIF perception-SOTIF entropy explicitly quantifies the epistemic uncertainty of a probabilistic object detector, incorporating stochastic ensemble variability and penalizing disagreement across ensemble members (Peng et al., 2022):
- $H = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c$ (total multi-class entropy over the ensemble-averaged class probabilities $\bar{p}_c$).
- $H_{\mathrm{SOTIF}} = H + \lambda\,(1 - m/M)$, where $m$ is the number of agreeing ensemble members (out of $M$) and $\lambda$ a penalty for missed/ghost detections.
Key-object annotations act as ground truth for must-warn situations. An uncertainty threshold $H_{\mathrm{th}}$ designates detections as "uncertain" (i.e., potentially unsafe) if $H_{\mathrm{SOTIF}} > H_{\mathrm{th}}$, enabling computation of safety-linked metrics such as Alert Coverage Rate (ACR) and Uncertainty Quality Score (UQS).
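A minimal sketch of this style of score, assuming an ensemble of $M$ detectors that each emit class probabilities for a matched detection; the function name, the penalty form, and the threshold value are illustrative assumptions, not the exact definitions from (Peng et al., 2022):

```python
import numpy as np

def sotif_entropy(member_probs, n_missed_or_ghost=0, lam=0.5):
    """Ensemble-entropy uncertainty score (illustrative form).

    member_probs      : (M, C) array, per-member class probabilities
                        for one matched detection.
    n_missed_or_ghost : members that missed the object or produced a
                        ghost box (assumed penalty input).
    lam               : penalty weight (assumed).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    M = member_probs.shape[0]
    p_bar = member_probs.mean(axis=0)                 # ensemble-averaged probs
    entropy = -np.sum(p_bar * np.log(p_bar + 1e-12))  # total multi-class entropy
    return entropy + lam * n_missed_or_ghost / max(M, 1)

# Flag a detection as "uncertain" (potentially unsafe) above a threshold.
H_TH = 0.5  # illustrative threshold
probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
h = sotif_entropy(probs, n_missed_or_ghost=1)
print(h, "uncertain" if h > H_TH else "certain")
```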
2.2 Credibility and Collision-Probability Aggregates
Extending beyond predictive uncertainty, a practical SAS combines operationally meaningful quantities:
- Average Label Confidence (ALC)
- Mean Corruption mAP (MCmAP)
- Average Misclassification Error (AME)
- Flip Probability ($P_{\mathrm{flip}}$) under perturbation
as in (Khandal et al., 2021), combined as a weighted sum of the form $\mathrm{SAS} = w_1\,\mathrm{ALC} + w_2\,\mathrm{MCmAP} + w_3\,(1 - \mathrm{AME}) + w_4\,(1 - P_{\mathrm{flip}})$.
Weights may be tuned to emphasize online robustness or offline resilience under adverse conditions.
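A minimal sketch of such an aggregation, assuming all four inputs are normalized to $[0, 1]$; the equal default weights and the complement form for the error terms are assumptions, not the published parameterization:

```python
def safety_assessment_score(alc, mc_map, ame, p_flip,
                            w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of credibility and robustness terms.

    ALC and MCmAP reward good behavior; AME and the flip
    probability are error terms, so they enter as complements.
    """
    w1, w2, w3, w4 = w
    return (w1 * alc + w2 * mc_map
            + w3 * (1.0 - ame) + w4 * (1.0 - p_flip))

# Emphasize online robustness by up-weighting the flip probability.
print(safety_assessment_score(0.9, 0.6, 0.1, 0.05, w=(0.2, 0.2, 0.2, 0.4)))
```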
2.3 Criticality-Indexed and Planner-Aligned Scores
Several frameworks employ reachability theory, kinematic collision ranking, or Hamilton-Jacobi safety zones to define the set of "truly safety-critical" objects, assigning scores such as:
- Criticality score for each object (Ceccarelli et al., 2022), of the general form $\kappa = f(\kappa_d, \kappa_{\hat d}, \kappa_t)$, where $\kappa_d$ penalizes missed close objects, $\kappa_{\hat d}$ future close approaches, and $\kappa_t$ short time to encounter.
- Risk-ranked recall for discrete risk-ranks (imminent/potential/low) (Bansal et al., 2021).
- Dynamically-computed safety zones via backward reachable sets (Topan et al., 2022), with Safety-$F_1$ as the harmonized measure of correct detection and minimal "phantom braking."
In these approaches, scoring is direct: missing a critical object (collision course, inside the dynamic safety zone) causes the score to drop sharply, while missing distant or benign objects has negligible effect.
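The pattern can be illustrated with a small sketch; the exponential decay shapes and the scale parameters `d_safe` and `t_safe` are assumptions, and the recall shown is a simplified single-threshold variant of the cited discrete-rank formulation:

```python
import numpy as np

def criticality(dist, dist_min_pred, ttc, d_safe=10.0, t_safe=3.0):
    """Per-object criticality in [0, 1]: objects that are close now,
    predicted to come close, or arriving soon score high (assumed shapes)."""
    k_now = np.exp(-dist / d_safe)               # currently close
    k_future = np.exp(-dist_min_pred / d_safe)   # predicted close approach
    k_time = np.exp(-ttc / t_safe)               # short time to encounter
    return float(max(k_now, k_future * k_time))

def risk_ranked_recall(detected, crit, threshold=0.5):
    """Recall restricted to the safety-critical subset of objects."""
    critical = [d for d, k in zip(detected, crit) if k >= threshold]
    return sum(critical) / len(critical) if critical else 1.0

# A near pedestrian (missed) and a distant parked car (detected):
crit = [criticality(5, 2, 1.0), criticality(80, 75, 20.0)]
print(risk_ranked_recall([False, True], crit))  # 0.0 -- the score drops sharply
```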
3. Multi-Modal and Scalar Aggregation Schemes
A recurring design objective is to distill heterogeneous detection, classification, and tracking uncertainties into one interpretable scalar or a few context-aware summary indices.
3.1 Weighted Linear/Geometric Aggregation
Normalized metrics (e.g., detection rate, classification accuracy, localization RMSE, risk-weighted recall) are mapped to $[0, 1]$ and linearly or geometrically combined, with higher weights assigned to safety-critical errors or zones (Hoss et al., 2021, Volk et al., 16 Dec 2025): $S = \sum_i w_i m_i$ or $S = \prod_i m_i^{w_i}$, with $\sum_i w_i = 1$.
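A compact sketch of both variants (metric values and weights are illustrative):

```python
import numpy as np

def aggregate(metrics, weights, geometric=False):
    """Combine normalized metrics m_i in [0, 1].

    Linear:    S = sum_i w_i * m_i
    Geometric: S = prod_i m_i ** w_i  (a single zero metric zeroes S)
    """
    m = np.asarray(metrics, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # enforce sum_i w_i = 1
    return float(np.prod(m ** w)) if geometric else float(np.dot(w, m))

# Up-weight risk-weighted recall relative to localization accuracy.
print(aggregate([0.95, 0.80, 0.60], [0.5, 0.3, 0.2]))
print(aggregate([0.95, 0.80, 0.60], [0.5, 0.3, 0.2], geometric=True))
```

The geometric form is the stricter choice: strong performance on other metrics cannot rescue the score when one safety-critical metric collapses.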
3.2 Min/Power-Mean and Score-Bucket Approaches
To avoid concealment of hazardous failures by many innocuous successes, composite scores often use minimum or power-mean functions, exponentially penalizing the worst misses (e.g., the EPSM metric) (Gamerdinger et al., 17 Dec 2025): $S = S_{\mathrm{base}} \cdot e^{-\lambda\,\kappa(o^{*})}$, where $o^{*}$ is the highest-criticality missed object.
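Both mechanisms in a short sketch; the exponent $p$ and penalty weight $\lambda$ are illustrative, and `penalized_score` is a stand-in for the EPSM-style exponential penalty rather than its published form:

```python
import numpy as np

def power_mean(scores, p=-4.0):
    """Power mean; as p -> -inf this approaches the minimum, so one
    hazardous failure dominates many innocuous successes."""
    s = np.asarray(scores, dtype=float)
    return float(np.mean(s ** p) ** (1.0 / p))

def penalized_score(base_score, worst_missed_criticality, lam=3.0):
    """Exponential penalty keyed to the highest-criticality missed object."""
    return base_score * np.exp(-lam * worst_missed_criticality)

print(power_mean([0.9, 0.9, 0.2]))  # ~0.26, dragged toward the worst case
print(penalized_score(0.85, 0.7))   # sharp drop after a critical miss
```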
3.3 Normative/Regulatory Bands
Multiple works recommend banded interpretation schemes, with thresholds demarcating "insufficient," "very bad," "good," or "excellent" perception safety (Volk et al., 16 Dec 2025, Gamerdinger et al., 17 Dec 2025).
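A minimal banding helper; the band edges and the ordering of the labels are assumptions, since each cited work defines its own thresholds:

```python
import bisect

EDGES = [0.25, 0.5, 0.75]  # illustrative thresholds
BANDS = ["very bad", "insufficient", "good", "excellent"]  # assumed ordering

def band(score):
    """Map a scalar safety score in [0, 1] to an interpretation band."""
    return BANDS[bisect.bisect_right(EDGES, score)]

print(band(0.62))  # -> "good"
```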
4. Empirical Validation, Tuning, and Comparative Analysis
Empirical studies demonstrate that safety assessment scores re-rank detection architectures relative to classical mAP/recall-based rankings:
- On nuScenes, models with similar conventional AP diverge substantially under safety-weighted AP, especially when critical objects are rare or occluded (Ceccarelli et al., 2022).
- The mAUSC (mean Average Uncompromising Spatial Constraints score) and USC-NDS (USC-domain average with nuScenes Detection Score) correlate more strongly with real-world collision frequency than mAP or even NDS, making them better suited to mission-critical system selection (Liao et al., 2022).
- Perception-SOTIF entropy achieves high alert coverage on key objects, with moderate false alert rates, supporting its use for perception SOTIF problem detection in long-tail scenarios (Peng et al., 2022).
- EPSM, combining criticality and severity through logistic mapping and task-aware aggregation, identifies high-fatality risk scenarios undetected by classic F1 or MODA/MODP metrics (Gamerdinger et al., 17 Dec 2025).
5. Calibration, Uncertainty Propagation, and Guidewords
Credible SAS deployment demands that perception model confidences are well-calibrated; miscalibration (e.g., overconfidence on out-of-distribution objects) biases all downstream assessment. Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and real-time flip probabilities anchor calibration checks in online pipelines (Khandal et al., 2021, Hoss et al., 2021).
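ECE and MCE themselves are conventional binned statistics, sketched below (the binning scheme and bin count are the usual defaults, not specific to the cited works):

```python
import numpy as np

def calibration_errors(conf, correct, n_bins=10):
    """Binned ECE (occupancy-weighted mean |accuracy - confidence|)
    and MCE (the maximum such gap over bins)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
            mce = max(mce, gap)
    return ece, mce

print(calibration_errors([0.9, 0.8, 0.95, 0.6, 0.7], [1, 1, 0, 1, 0], n_bins=5))
```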
Systematic safety analysis adapts classical process guidewords (e.g., “No,” “More,” “Part of,” “Other than,” “Early,” “Late,” “Intermittent”) from HAZOP/FMEA to machine-learning-specific perception errors, guiding the identification and scoring of hazards related to object perception (Molloy et al., 2022).
6. Safety-Score Application: Benchmarks, Pipeline Integration, and Acceptance Criteria
In practice, a perception SAS is employed at both the benchmarking (model selection, A/B testing) and system-assurance (runtime monitoring, certification gating) levels. Integration steps include:
- Offline analysis: Evaluate on large, scenario-diverse datasets with safety-driven ground truth and scenario exposure frequencies (Hoss et al., 2021, Gamerdinger et al., 17 Dec 2025).
- Online usage: Apply lightweight, rapid computations of core metrics (e.g., ALC, SOTIF entropy threshold, per-object criticality) with hard-coded hand-off criteria, failover, or thresholding for real-time decision-making (Khandal et al., 2021, Peng et al., 2022); see the gating sketch after this list.
- Certification: Demonstrate, via aggregated SAS and scenario-stratified reporting, that the residual risk is below the requisite threshold, prioritized by regulator or standard-defined severity bands (Molloy et al., 2022, Volk et al., 16 Dec 2025).
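A hypothetical gating sketch for the online usage step above; every threshold, input name, and fallback action is an assumption illustrating the hand-off pattern, not any cited system's logic:

```python
def runtime_gate(alc, h_sotif, crit_of_uncertain,
                 alc_min=0.6, h_max=0.5, crit_max=0.3):
    """Hand-off decision from lightweight online indicators.

    alc               : average label confidence of current detections
    h_sotif           : SOTIF entropy of the most uncertain detection
    crit_of_uncertain : criticality of the most critical uncertain object
    """
    if h_sotif > h_max and crit_of_uncertain > crit_max:
        return "failover"   # e.g., trigger a minimal-risk maneuver
    if alc < alc_min or h_sotif > h_max:
        return "degraded"   # e.g., reduce speed, widen safety margins
    return "nominal"

print(runtime_gate(alc=0.8, h_sotif=0.7, crit_of_uncertain=0.4))  # failover
```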
Safety assessment scores thus provide a crucial analytic link between probabilistic perception outputs and formal safety claims—supporting both system development and post-deployment operational safety management.
7. Limitations, Open Issues, and Ongoing Developments
Despite significant advances, several challenges remain in the definition, validation, and universal adoption of safety assessment scores for object perception:
- Calibration and completeness: Uncertainty measures are only as good as the detector’s out-of-distribution calibration (Peng et al., 2022, Khandal et al., 2021).
- Scenario coverage: Real-world scenario completeness and the inclusion of rare long-tail hazards—the focal point of PeSOTIF—are essential yet remain an ongoing data challenge (Peng et al., 2022).
- Safety case integration: Mapping SAS to system-level claims, and linking perception errors to actual harm via formal assurance arguments, still requires careful scenario linking and (often) planner co-design (Topan et al., 2022, Hoss et al., 2021).
- Domain generality: While medical CVS assessment follows similar safety-score strategies, domain adaptation (weights, thresholds, normalization) is required for direct cross-use (Murali et al., 2023).
Continuous evolution of risk-weighted, context-aware safety assessment metrics is therefore central to robust, regulation-compliant deployment of perception-based autonomous systems.