Expert Saliency Scoring
- Expert saliency scoring is a framework that quantifies the alignment between algorithm-generated saliency maps and human expert judgments.
- It employs diverse methodologies—including dot-product alignment, Bayesian inference, and pairwise expert comparisons—to enhance interpretability and reliability.
- Applications span visual attention modeling, explainable AI in speech emotion recognition, credit risk scoring, and security prioritization to bolster model trust.
Expert saliency scoring refers to the family of metrics, methodologies, and algorithms designed to quantify the degree of agreement between saliency predictions—typically generated by computational models—and ground-truth signals or judgments that are privileged by domain experts. Unlike traditional performance metrics that evaluate saliency using purely mathematical or empirical criteria, expert saliency scoring explicitly integrates either subjective expert assessments, task-specific domain knowledge, or referential standards constructed from human data. This approach has become central in areas such as visual attention modeling, explainable machine learning (especially in tabular domains), multi-model saliency integration, interpretable speech emotion recognition, and security risk prioritization.
1. Foundations and Taxonomy of Expert Saliency Metrics
Expert saliency scoring encompasses several diverse instantiations, unified by the goal of aligning model-derived saliency maps or explanations with human expertise or perceptual ground truths. The foundation can be categorized along two principal axes:
- Reference standard: What constitutes “expert” ground truth—crowdsourced perceptual judgments, explicit expert feature rankings, task-driven cues, or consensus-driven fusion.
- Methodological basis: Direct metric construction (e.g., dot-product alignment, weighted normalizations), learning-based regression to expert ratings, probabilistic and Bayesian weighting schemes, or graph-theoretic score aggregation.
Key domains and methodologies include:
- Perceptual similarity learning via crowdsourced pairwise comparison (e.g., CPJ metric) (Xia et al., 2018).
- Feature attribution alignment with expert weights (ESS score in credit risk) (Qadi et al., 2021).
- Model-expertise estimation in saliency map integration (arbitrator model, AM) (Xu et al., 2016).
- Expert-referenced acoustic cue quantification for explainable AI (SER explanation) (Nasr et al., 12 Nov 2025).
- Pairwise judgment aggregation in DAG-based prioritization for security and privacy (Mell, 2021).
- Local fixation density weighting for enhanced agreement with expert visual quality ratings (Gide et al., 2017).
2. Subjective and Perceptual Alignment: Learning from Human Ratings
A central paradigm involves directly quantifying or learning saliency map similarity according to expert or crowd perceptual ratings. In "Learning a Saliency Evaluation Metric Using Crowdsourced Perceptual Judgments," a large-scale subjective study is conducted where 16 observers compare pairs of estimated saliency maps (ESMs) against a ground-truth saliency map (GSM) and indicate which ESM better matches GSM. Key perceptual factors extracted from explanations include peak alignment, energy distribution, salient region count and shape, and semantic target alignment.
The resulting data supports the learning of CPJ—an end-to-end two-stream CNN that predicts which of a pair of ESMs is perceptually more similar to GSM. Training is supervised by the relative saliency score $p$, where $p$ is the fraction of observers favoring one ESM over the other. CPJ achieves 88.8% agreement with human judgments on unseen images, surpassing classic metrics such as sAUC and rAUC (71.6% and 67.3%, respectively), and generalizes across datasets and models (Xia et al., 2018).
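A minimal sketch of how such crowdsourced votes can be turned into pairwise supervision labels, and how agreement figures like the 88.8% above are computed (function names are illustrative, not from the paper):

```python
def pairwise_label(votes_for_a, votes_for_b):
    """Relative saliency label for one (ESM_A, ESM_B, GSM) triple:
    the fraction of observers judging ESM_A the better match to GSM.
    p > 0.5 means ESM_A is preferred; p < 0.5 means ESM_B is."""
    return votes_for_a / (votes_for_a + votes_for_b)

def agreement_rate(pred_prefers_a, frac_votes_a):
    """Fraction of pairs where a metric's preference matches the crowd
    majority (the kind of agreement statistic reported for CPJ)."""
    hits = sum(int(p == (f > 0.5)) for p, f in zip(pred_prefers_a, frac_votes_a))
    return hits / len(pred_prefers_a)
```

The actual CPJ metric replaces the hand-built preference with a two-stream CNN's output, but its supervision and evaluation reduce to exactly these quantities.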
3. Expert-Aligned Explanations for Feature Attribution
In domains such as credit risk scoring, "expert saliency scoring" formalizes the alignment between post-hoc feature attributions (e.g., SHAP values) and explicit expert-assigned feature relevances. Here, each model prediction's SHAP vector is normalized to a probability simplex and compared—via dot-product—to a similarly normalized expert weight vector. The Expert Saliency Score (ESS) for instance $x$ is

$$\mathrm{ESS}(x) = \sum_{j} \hat{\phi}_j(x)\, \hat{w}_j,$$

where $\hat{\phi}_j(x)$ is the normalized local SHAP attribution for feature $j$, and $\hat{w}_j$ is the normalized expert-aggregated relevance for feature $j$. ESS scores near 1 indicate high model–expert alignment, while scores near 0 denote strong divergence (Qadi et al., 2021).
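The dot-product alignment can be sketched in a few lines; this assumes absolute SHAP values are taken before normalization, which is one common convention and may differ in detail from the paper:

```python
import numpy as np

def expert_saliency_score(shap_values, expert_weights):
    """Dot-product alignment between normalized |SHAP| attributions and
    normalized expert feature relevances (both projected onto the simplex)."""
    phi = np.abs(np.asarray(shap_values, dtype=float))
    phi = phi / phi.sum()
    w = np.asarray(expert_weights, dtype=float)
    w = w / w.sum()
    return float(np.dot(phi, w))

# Toy instance: the model leans on the same features the expert values,
# so the score is high; ESS is maximal when both vectors concentrate
# their mass on the same features.
shap = [0.40, 0.35, 0.15, 0.10]
expert = [0.45, 0.30, 0.15, 0.10]
print(expert_saliency_score(shap, expert))
```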
Across N≈30,000 validation instances, models with higher mean(ESS) not only achieve higher AUC but are also rated as more trustworthy by experts, establishing a quantitative link between actionable explainability, expert agreement, and predictive performance.
4. Model-Expertise Estimation in Saliency Fusion and Arbitration
Saliency integration frameworks must often weight or select from a pool of candidate predictions without access to ground-truth saliency. The arbitrator model (AM) assigns an "expertise score" to each model by estimating its conditional likelihood to predict true saliency versus background in a Bayesian framework. Expertise values (denoted here $\alpha_i$ for salient regions and $\beta_i$ for background) are estimated online via:
- Statistics-based proxy: using a reference map generated from model consensus and external knowledge, compute

$$\alpha_i = \frac{\left|\{s \in \mathcal{S} : B_i(s) = 1\}\right|}{|\mathcal{S}|},$$

where $\mathcal{S}$ is the set of superpixels deemed salient in the reference, and $B_i$ is the binarized input saliency of model $i$; $\beta_i$ is computed analogously over background superpixels.
- Latent variable (EM): treat the true saliency as a latent variable, jointly estimate the expertise values and the image's challenge level using EM, and assign the converged estimates in practice.
These scores are then used within a Bayesian integration routine to guide fusion toward reference-consistent and expertise-weighted outputs, thus improving the robustness of the ensemble to systematic errors in the candidate pool (Xu et al., 2016).
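The statistics-based proxy amounts to two agreement fractions per model; a sketch under the notation above (the $\alpha_i$/$\beta_i$ naming is an assumption, not necessarily the paper's):

```python
import numpy as np

def expertise_proxy(reference_salient, model_binary):
    """Statistics-based expertise proxy for one candidate model:
    alpha = fraction of reference-salient superpixels the model also
    marks salient; beta = analogous agreement on reference-background
    superpixels. Inputs are 0/1 labels per superpixel."""
    ref = np.asarray(reference_salient, dtype=bool)
    pred = np.asarray(model_binary, dtype=bool)
    alpha = float((ref & pred).sum() / max(ref.sum(), 1))
    beta = float((~ref & ~pred).sum() / max((~ref).sum(), 1))
    return alpha, beta
```

A model that agrees with the consensus reference on both salient and background regions receives high $\alpha_i$ and $\beta_i$ and is weighted up in the subsequent Bayesian fusion.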
5. Fixation-Weighted Metrics and Subjective Validation
Evaluation of saliency maps can be improved by weighting individual fixations according to their local density ("fixation density weighting"). The Locally Weighted Fixation-Density (LWFD) metric, introduced by Gide & Karam, incorporates DBSCAN clustering of eye-tracking fixations, assigning each fixation $f$ a cluster-size weight $w_f$. The final weighted NSS metric is

$$\mathrm{WNSS} = \frac{1}{\sum_{f \in F} w_f} \sum_{f \in F} w_f \cdot \frac{S(x_f, y_f) - \mu_S}{\sigma_S},$$

where $S$ is the predicted saliency map, $\mu_S$ and $\sigma_S$ its mean and standard deviation, and $F$ the fixation set. This weighting increases correlation with expert mean opinion scores (MOS) collected in the VAQ database, outperforming all previously standard metrics, including CC, NSS, and AUC variants (e.g., WNSS SROCC=0.7858 vs. CC SROCC=0.7461) (Gide et al., 2017).
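A minimal sketch of the weighted-NSS idea; for self-containment a simple radius-neighbor count stands in for the paper's DBSCAN cluster sizes, so the weights approximate, rather than reproduce, LWFD:

```python
import numpy as np

def weighted_nss(sal_map, fixations, radius=5.0):
    """Fixation-density-weighted NSS: each fixation's standardized
    saliency value is weighted by its local fixation density (here the
    count of fixations within `radius` pixels, including itself)."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-12)
    fix = np.asarray(fixations, dtype=float)          # (N, 2) rows of (row, col)
    d = np.linalg.norm(fix[:, None, :] - fix[None, :, :], axis=-1)
    w = (d <= radius).sum(axis=1).astype(float)       # local density weights
    vals = s[fix[:, 0].astype(int), fix[:, 1].astype(int)]
    return float((w * vals).sum() / w.sum())
```

Because dense fixation clusters dominate the weighted average, a prediction that captures the main attended region scores higher than under the unweighted NSS, mirroring the reported gain in agreement with MOS ratings.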
6. Expert-Referenced and Task-Aligned Saliency for XAI
In explainable AI for domains such as speech emotion recognition, expert saliency scoring extends to measuring the presence and magnitude of theory-driven, expert-referenced cues—such as loudness, F₀ variation, or HNR—inside salient regions identified by XAI methods (e.g., CRP, OS). The framework involves:
- Segmenting top-K most salient windows (e.g., 0.15s) using a model's saliency map.
- Extracting vector-valued expert cues from each window via an established toolkit (OpenSMILE).
- Aggregating these into mean and optionally relevance-weighted expert scores for each cue over the K windows.
By comparing these values to expected expert patterns (such as higher loudness and pitch for anger), one obtains a saliency score that is both interpretable and faithful to the domain's psychoacoustic theory (Nasr et al., 12 Nov 2025).
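The three steps above can be sketched end-to-end; the window-selection heuristic and the per-frame cue track are illustrative stand-ins (a real pipeline would extract cues such as loudness with OpenSMILE):

```python
import numpy as np

def topk_windows(saliency, k=3, win=15):
    """Greedily pick start indices of the k non-overlapping windows
    with highest summed saliency (win frames ~ 0.15 s)."""
    scores = np.convolve(saliency, np.ones(win), mode="valid")
    order = np.argsort(scores)[::-1]
    starts, taken = [], np.zeros(len(scores), dtype=bool)
    for i in order:
        if not taken[max(0, i - win + 1): i + win].any():
            starts.append(int(i)); taken[i] = True
        if len(starts) == k:
            break
    return starts

def expert_cue_scores(saliency, cue_track, k=3, win=15):
    """Mean and relevance-weighted mean of a per-frame expert cue
    (e.g., a loudness contour) over the top-k salient windows."""
    starts = topk_windows(saliency, k, win)
    means = np.array([cue_track[s:s + win].mean() for s in starts])
    rel = np.array([saliency[s:s + win].sum() for s in starts])
    return float(means.mean()), float((rel * means).sum() / rel.sum())
```

Comparing the returned scores against the theory-predicted direction (e.g., elevated loudness in anger) yields the interpretable saliency check described above.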
7. Pairwise Expert Judgment Aggregation and Graph Scoring
In domains lacking direct ground truth but requiring prioritization (e.g., security risk scoring), expert saliency scoring is achieved by systematically eliciting pairwise comparisons between items, encoding relationships in a directed acyclic graph (DAG) whose edges carry signed degree weights (e.g., −2 for "much less", 0 for "equal", +2 for "much greater"). By unifying multiple expert judgment graphs via voting and conflict resolution, and then propagating constraints topologically, this methodology produces robust prioritization or scoring systems faithful to aggregated expert opinion (Mell, 2021).
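Topological propagation of degree-weighted judgments can be sketched as a longest-weighted-path computation over the merged DAG (a simplification, not Mell's exact algorithm):

```python
from collections import defaultdict

def dag_scores(edges, nodes):
    """Propagate degree-weighted pairwise judgments through a DAG.
    An edge (a, b, d) encodes 'b exceeds a by degree d' (d >= 0, with
    d = 0 for 'equal'). Each node's score is the maximum accumulated
    degree along any path reaching it, computed in topological order."""
    succ, indeg = defaultdict(list), {n: 0 for n in nodes}
    for a, b, d in edges:
        succ[a].append((b, d)); indeg[b] += 1
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:                      # Kahn's algorithm
        n = frontier.pop()
        order.append(n)
        for m, _ in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                frontier.append(m)
    score = {n: 0 for n in nodes}
    for n in order:
        for m, d in succ[n]:
            score[m] = max(score[m], score[n] + d)
    return score
```

Conflict resolution across multiple experts' graphs (voting on edge direction and degree) would precede this propagation step.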
Summary Table of Representative Approaches
| Metric/System | Domain/Reference | Scoring Mechanism |
|---|---|---|
| CPJ | Visual saliency (Xia et al., 2018) | CNN trained on perceptual scores |
| ESS | Credit risk (Qadi et al., 2021) | Dot-product of SHAP/expert weights |
| AM | Saliency integration (Xu et al., 2016) | Bayesian/proxy/EM-based expertise estimation |
| LWFD (WNSS/sWNSS) | Eye-tracking (Gide et al., 2017) | Fixation-density weighted NSS |
| SER cue quantification | Explainable AI (Nasr et al., 12 Nov 2025) | Aggregated expert feature cues within saliency windows |
| DAG expert voting | Security scoring (Mell, 2021) | Aggregated degree-weighted DAGs |
Expert saliency scoring thus provides a rigorously quantified bridge between computational predictions and human or domain-expert priorities, enabling both the evaluation and refinement of interpretable AI systems across diverse scientific and engineering domains.