Contrastive Feature Attribution on Spectrograms
- Contrastive feature attribution on spectrograms is an explainability method that localizes and quantifies the time-frequency regions driving speech-to-text outputs by contrasting target and foil hypotheses.
- The approach integrates perturbation-based segmentation, aggregated word-boundary scoring, and relative contrastive measures to generate interpretable, model-agnostic saliency maps.
- Case studies demonstrate its efficacy in gender translation tasks, revealing input regions such as pitch bands that critically affect speech model decisions.
Contrastive feature attribution on spectrograms is an explainability technique designed to localize and quantify the time–frequency input regions that drive a speech-to-text (S2T) model’s preference between two candidate outputs—a target word and its foil. Unlike standard feature attribution, which asks which spectrogram regions make a prediction likely, contrastive attribution addresses the subtler question of why the model prefers the target hypothesis instead of a specific alternative. This approach has been established for S2T generative models where inputs are two-dimensional log-Mel spectrograms and outputs are token sequences. The technique combines large-scale input perturbation, principled aggregation over subword token probabilities, and contrastive saliency scoring to generate interpretable heatmaps tied to specific prediction alternatives (Conti et al., 30 Sep 2025).
1. Conceptual Foundation of Contrastive Explanations
Contrastive explanations seek to answer "Why did the model choose output P instead of alternative Q?": here, P is the target word (e.g., curiosa) and Q is the foil (e.g., curioso). In the S2T domain, the input is a spectrogram and the model output is a text token sequence. Contrastive feature attribution produces a saliency map assigning each time–frequency bin a score that measures how much its evidence sways the model toward P and away from Q. Perturbing a region by occlusion (zeroing) and observing how the probabilities of P and Q change quantifies that region's contrastive influence: if occluding a low-frequency pitch band lowers the probability of curiosa while raising that of curioso, the band carries contrastive evidence for the feminine form. This contrasts with traditional saliency, which considers no specific alternative and may therefore conflate evidence shared by both hypotheses.
2. Mathematical Formulation and Scoring Functions
Contrastive feature attribution expands on perturbation-based approaches such as SPES (Fucci et al., 2024). Each spectrogram region's saliency is computed from the change in model probabilities under masking. Word-level probabilities are derived from subword probabilities using the Word-Boundary method (Pimentel & Meister, 2024).

Let $p_t$ and $p_f$ denote the original probabilities of the target $t$ and foil $f$, and $\tilde{p}_t$ and $\tilde{p}_f$ their probabilities on the masked input. Three scorers are defined:

| Name | Formula | Behavior when $p_t \gg p_f$ |
|---|---|---|
| Base ($S_{\text{base}}$) | $p_t - \tilde{p}_t$ | Non-contrastive |
| Diff ($S_{\text{diff}}$) | $(p_t - p_f) - (\tilde{p}_t - \tilde{p}_f)$ | Collapses to $S_{\text{base}}$; not contrastive |
| Relative ($S_{\text{rel}}$) | $\frac{p_t}{p_t + p_f} - \frac{\tilde{p}_t}{\tilde{p}_t + \tilde{p}_f}$ | Remains contrastive under imbalance |
The ratio-based scorer ensures that both candidate probabilities influence the saliency, maintaining contrastiveness even when the model highly favors the target.
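The following minimal Python sketch implements the three scorers as reconstructed above; the function names and the `eps` stabilizer are illustrative choices, not taken from the original implementation.

```python
def base_score(p_t, p_t_masked):
    """Non-contrastive Base scorer: drop in the target's probability."""
    return p_t - p_t_masked


def diff_score(p_t, p_f, p_t_masked, p_f_masked):
    """Diff scorer: change in the target-foil probability gap.
    When the foil's probability is negligible before and after masking,
    this collapses to base_score and loses contrastiveness."""
    return (p_t - p_f) - (p_t_masked - p_f_masked)


def rel_score(p_t, p_f, p_t_masked, p_f_masked, eps=1e-12):
    """Relative scorer: change in the target's share of the renormalized
    two-way distribution over {target, foil}, which stays sensitive to
    the foil even when the model strongly favors the target."""
    share = p_t / (p_t + p_f + eps)
    share_masked = p_t_masked / (p_t_masked + p_f_masked + eps)
    return share - share_masked
```

For instance, with a dominant target (p_t = 0.9) and a near-zero foil before and after masking, `diff_score` returns essentially the same value as `base_score`, whereas `rel_score` still registers any shift in the foil's relative share, which is what preserves contrastiveness under imbalance.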
3. Procedural Workflow
The full pipeline involves the following steps (a minimal code sketch follows the list):
- Segmentation: Divide the spectrogram into superpixels using SLIC at multiple granularities.
- Perturbation: For $i = 1, \dots, N$ iterations, draw a binary mask $m_i$ that occludes each region independently with probability 0.5, and produce the masked input $\tilde{x}_i = x \odot (1 - m_i)$.
- Model Inference: Pass $\tilde{x}_i$ through the S2T model to obtain the masked probabilities $\tilde{p}_t$ and $\tilde{p}_f$.
- Scoring: Compute the Relative contrastive score $S_{\text{rel}}$ defined in Section 2.
- Aggregation: For each region $r$, calculate its saliency as the average of $S_{\text{rel}}$ over the masks in which $r$ was occluded: $A(r) = \sum_i S_{\text{rel}}^{(i)} m_i(r) / \sum_i m_i(r)$.
- Resolution Upsampling: Project region-wise scores back to full spectrogram resolution for high-fidelity saliency localization.
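The sketch below wires these steps together, assuming a hypothetical `get_word_probs` wrapper that runs the S2T model and returns the Word-Boundary probabilities of the target and foil; the segment and mask counts are placeholders, not the paper's exact settings.

```python
import numpy as np
from skimage.segmentation import slic

def contrastive_saliency(spec, get_word_probs, n_segments=100,
                         n_masks=5000, mask_prob=0.5, seed=0):
    """Perturbation-based contrastive saliency for one (target, foil) pair.

    spec:           (n_mels, n_frames) log-Mel spectrogram
    get_word_probs: callable mapping a spectrogram to (p_target, p_foil)
    """
    rng = np.random.default_rng(seed)

    # 1. Segmentation: SLIC superpixels on the normalized spectrogram.
    norm = (spec - spec.min()) / (np.ptp(spec) + 1e-9)
    labels = slic(norm, n_segments=n_segments, channel_axis=None)
    labels = np.unique(labels.ravel(), return_inverse=True)[1].reshape(spec.shape)
    n_reg = labels.max() + 1

    p_t, p_f = get_word_probs(spec)           # original probabilities
    score_sum = np.zeros(n_reg)
    hits = np.zeros(n_reg)

    for _ in range(n_masks):
        # 2. Perturbation: occlude each region independently w.p. mask_prob.
        occluded = rng.random(n_reg) < mask_prob
        masked = np.where(occluded[labels], 0.0, spec)
        # 3. Inference on the masked input.
        q_t, q_f = get_word_probs(masked)
        # 4. Scoring with the Relative contrastive scorer.
        s = p_t / (p_t + p_f + 1e-12) - q_t / (q_t + q_f + 1e-12)
        # 5. Aggregation: credit the score to the occluded regions only.
        score_sum[occluded] += s
        hits[occluded] += 1

    # 6. Upsampling: average per region, then project back to bin level.
    return (score_sum / np.maximum(hits, 1))[labels]
```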
4. From Raw Spectrograms to Saliency Maps
Inputs are 80-dimensional log-Mel filterbanks computed every 10 ms, reduced in length by two stride-2 1D convolutions, and processed by a Transformer encoder–decoder. Occlusion (setting a region to zero) is empirically found to approximate removal of the underlying acoustic signal. The approach is purely perturbation-based and requires no model gradients, so it remains model-agnostic. SLIC segmentation, combined with a cap of 750 frames per input, controls the computational burden and the granularity of the maps. Inference uses beam size 5 and a no-repeat-ngram size of 5.
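As a concrete illustration, the snippet below computes matching 80-dimensional log-Mel features with torchaudio's Kaldi-compatible front-end and occludes one block; the file name and block bounds are arbitrary, and the per-utterance normalization applied in fairseq preprocessing is omitted.

```python
import torchaudio

# Load 16 kHz mono audio (file name is illustrative).
wav, sr = torchaudio.load("utterance.wav")

# 80-dim log-Mel filterbanks with a 10 ms frame shift, matching the
# fairseq S2T front-end.
spec = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sr)                     # shape: (n_frames, 80)

# Occlusion: zero a time-frequency block, approximating removal of the
# underlying acoustic evidence (block bounds chosen arbitrarily here).
masked = spec.clone()
masked[100:200, 20:40] = 0.0                 # frames 100-200, Mel bins 20-40
```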
5. Case Study: Gender Assignment in Speech Translation
Contrastive explanations are evaluated on gender prediction tasks using the MuST-SHE dataset (Bentivogli et al., 2020), comprising English TED talks with ambiguous speaker-referring terms translated into languages with grammatical gender. Only examples where the model predicts either the correct gender form or its gender-swapped counterpart are retained. Targets and foils are mapped to the annotated gender forms (e.g., target curiosa, feminine, vs. foil curioso, masculine).
Two faithfulness metrics quantify attribution quality (a sketch of both follows the list):
- Coverage: Percentage of examples where the model's top prediction remains either the target or the foil after the top-k% most salient regions are occluded.
- Flip Rate: Among covered cases, the percentage where occlusion flips the output from target to foil.
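A minimal sketch of both metrics follows, assuming per-example records of the model's prediction before and after occlusion; computing the flip rate over covered cases that originally produced the target is our assumed convention.

```python
def coverage_and_flip_rate(examples):
    """Faithfulness metrics after occluding the top-k% salient regions.

    Each example is a dict with keys 'target', 'foil', 'pred_before'
    (the model's original word, one of target/foil by construction)
    and 'pred_after' (its top word on the occluded input).
    """
    covered = [e for e in examples
               if e["pred_after"] in (e["target"], e["foil"])]
    coverage = 100.0 * len(covered) / len(examples)

    # Flip rate: among covered cases that originally produced the
    # target, how often occlusion switches the output to the foil.
    base = [e for e in covered if e["pred_before"] == e["target"]]
    flips = sum(e["pred_after"] == e["foil"] for e in base)
    flip_rate = 100.0 * flips / len(base) if base else 0.0
    return coverage, flip_rate
```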
Experimental findings on en→fr with a multilingual Transformer indicate:
- With the Diff scorer, coverage plunges below 20% after occluding just 2% of the most salient features, and its saliency maps correlate strongly with non-contrastive maps.
- With the Relative scorer, coverage exceeds 30% up to 20% occlusion, and its saliency maps correlate only weakly with non-contrastive maps.
- For feminine predictions, occluding the top 5% most salient regions flips over 70% of cases to masculine, pinpointing pitch/formant bands as the decisive cues.
- For masculine predictions, flip rates plateau around 30%, consistent with masculine-default bias in current S2T models.
- Visualizations show that the maps sharply localize the pitch/formant regions relevant to gender assignment.
6. Implementation, Evaluation, and Limitations
Hyperparameters include multiple SLIC segmentation granularities, a large number of random perturbation masks with occlusion probability 0.5, and a cap of 750 frames per input. The model is the Fairseq S2T multilingual Transformer (Wang et al., 2020a) with 72M parameters, 80-dimensional Mel-filterbank inputs, and beam size 5. Word probabilities are aggregated using the Word-Boundary method (Section 2), which outperforms Length-Norm for French, while Chain-Rule yields statistically similar performance. Source code is available under Apache 2.0 at github.com/hlt-mt/FBK-fairseq.
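For concreteness, the sketch below contrasts the three aggregation schemes over a word's subword log-probabilities; the `boundary_logprob` term reflects our reading of the Word-Boundary correction (the log-probability that the next token starts a new word) and is an assumption, not the verbatim implementation.

```python
def word_logprob_chain_rule(subword_logprobs):
    """Chain-Rule: sum the log-probabilities of the word's subword tokens."""
    return sum(subword_logprobs)

def word_logprob_length_norm(subword_logprobs):
    """Length-Norm: mean subword log-probability, discounting word length."""
    return sum(subword_logprobs) / len(subword_logprobs)

def word_logprob_word_boundary(subword_logprobs, boundary_logprob):
    """Word-Boundary (after Pimentel & Meister, 2024): chain-rule score
    plus the log-probability that the next token opens a new word, so
    tokenizations that would extend the word are excluded from its mass."""
    return sum(subword_logprobs) + boundary_logprob
```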
Limitations include restriction to binary gender assignment; extension to contrasts involving homophones, coreference, or politeness is not yet explored. Faithfulness metrics saturate beyond 20% input occlusion, where outputs degenerate. Ethical issues arise given the reliance on acoustic gender signals, which may bias against non-binary or vocal-impaired speakers and erase non-binary identities under binary labels.
7. Significance and Outlook
Contrastive feature attribution on spectrograms for S2T models achieves interpretable, model-agnostic, and contrast-specific saliency maps—advancing the capacity to diagnose and audit model decision processes at the acoustic level. The combination of perturbation-based segmentation (via SPES), relative contrastive scoring, and rigorous word probability aggregation realizes explanations not just of what input features support a given prediction, but of what tips the model toward one alternative over another. Further research into broader linguistic contrasts and mitigation of bias is an open direction (Conti et al., 30 Sep 2025).