Contrastive Feature Attribution on Spectrograms
- Contrastive feature attribution on spectrograms is an explainability method that localizes and quantifies the time-frequency regions driving speech-to-text outputs by contrasting target and foil hypotheses.
- The approach integrates perturbation-based segmentation, aggregated word-boundary scoring, and relative contrastive measures to generate interpretable, model-agnostic saliency maps.
- Case studies demonstrate its efficacy in gender translation tasks, revealing input regions such as pitch bands that critically affect speech model decisions.
Contrastive feature attribution on spectrograms is an explainability technique designed to localize and quantify the time–frequency input regions that drive a speech-to-text (S2T) model’s preference between two candidate outputs—a target word and its foil. Unlike standard feature attribution, which asks which spectrogram regions make a prediction likely, contrastive attribution addresses the subtler question of why the model prefers the target hypothesis instead of a specific alternative. This approach has been established for S2T generative models where inputs are two-dimensional log-Mel spectrograms and outputs are token sequences. The technique combines large-scale input perturbation, principled aggregation over subword token probabilities, and contrastive saliency scoring to generate interpretable heatmaps tied to specific prediction alternatives (Conti et al., 30 Sep 2025).
1. Conceptual Foundation of Contrastive Explanations
Contrastive explanations seek to answer "Why did the model choose output P instead of alternative Q?": here, P is the target word (e.g., curiosa) and Q is the foil (e.g., curioso). In the S2T domain, the input is a spectrogram and the model output is a text token sequence. Contrastive feature attribution produces a saliency map assigning each time–frequency bin a score that measures how much its evidence sways the model toward P and away from Q. Perturbing a region by occlusion (zeroing) and observing how the probabilities of P and Q change quantifies that region's contrastive influence: if occluding a low-frequency pitch band lowers the probability of curiosa while raising that of curioso, the band carries contrastive evidence for the feminine form. This contrasts with traditional saliency, which considers no specific alternative and may therefore conflate evidence shared by both hypotheses.
2. Mathematical Formulation and Scoring Functions
Contrastive feature attribution expands on perturbation-based approaches such as SPES (Fucci et al., 2024). Each spectrogram region's saliency is computed from the change in model probabilities under masking. Word-level probabilities are derived from subword probabilities using the Word-Boundary method (Pimentel & Meister, 2024).

Let $p_t$ and $p_f$ denote the original probabilities of the target $t$ and foil $f$, and $\tilde{p}_t$ and $\tilde{p}_f$ their probabilities on the masked input. Three scorers are defined:

| Name | Formula | Behavior when $p_t \gg p_f$ |
|---|---|---|
| Base ($S_{\text{base}}$) | $p_t - \tilde{p}_t$ | Non-contrastive |
| Diff ($S_{\text{diff}}$) | $(p_t - p_f) - (\tilde{p}_t - \tilde{p}_f)$ | Collapses to $S_{\text{base}}$; not contrastive |
| Relative ($S_{\text{rel}}$) | $\frac{p_t}{p_t + p_f} - \frac{\tilde{p}_t}{\tilde{p}_t + \tilde{p}_f}$ | Remains contrastive under imbalance |
The ratio-based scorer ensures that both candidate probabilities influence the saliency, maintaining contrastiveness even when the model highly favors the target.
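The following minimal Python sketch implements the three scorers as reconstructed above; the function names and the `eps` stabilizer are illustrative choices, not taken from the original implementation.

```python
def base_score(p_t, p_t_masked):
    """Non-contrastive Base scorer: drop in the target's probability."""
    return p_t - p_t_masked


def diff_score(p_t, p_f, p_t_masked, p_f_masked):
    """Diff scorer: change in the target-foil probability gap.
    When the foil's probability is negligible before and after masking,
    this collapses to base_score and loses contrastiveness."""
    return (p_t - p_f) - (p_t_masked - p_f_masked)


def rel_score(p_t, p_f, p_t_masked, p_f_masked, eps=1e-12):
    """Relative scorer: change in the target's share of the renormalized
    two-way distribution over {target, foil}, which stays sensitive to
    the foil even when the model strongly favors the target."""
    share = p_t / (p_t + p_f + eps)
    share_masked = p_t_masked / (p_t_masked + p_f_masked + eps)
    return share - share_masked
```

For instance, with a dominant target (p_t = 0.9) and a near-zero foil before and after masking, `diff_score` returns essentially the same value as `base_score`, whereas `rel_score` still registers any shift in the foil's relative share, which is what preserves contrastiveness under imbalance.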
3. Procedural Workflow
The full pipeline involves the following steps (a minimal code sketch follows the list):
- Segmentation: Divide the spectrogram into superpixels using SLIC at multiple granularities.
- Perturbation: For $i = 1, \dots, N$ iterations, draw a binary mask $m_i$ that occludes each region independently with probability 0.5, and produce the masked input $\tilde{x}_i = x \odot (1 - m_i)$.
- Model Inference: Pass $\tilde{x}_i$ through the S2T model to obtain the masked probabilities $\tilde{p}_t$ and $\tilde{p}_f$.
- Scoring: Compute the Relative contrastive score $S_{\text{rel}}$ defined in Section 2.
- Aggregation: For each region $r$, calculate its saliency as the average of $S_{\text{rel}}$ over the masks in which $r$ was occluded: $A(r) = \sum_i S_{\text{rel}}^{(i)} m_i(r) / \sum_i m_i(r)$.
- Resolution Upsampling: Project region-wise scores back to full spectrogram resolution for high-fidelity saliency localization.
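The sketch below wires these steps together, assuming a hypothetical `get_word_probs` wrapper that runs the S2T model and returns the Word-Boundary probabilities of the target and foil; the segment and mask counts are placeholders, not the paper's exact settings.

```python
import numpy as np
from skimage.segmentation import slic

def contrastive_saliency(spec, get_word_probs, n_segments=100,
                         n_masks=5000, mask_prob=0.5, seed=0):
    """Perturbation-based contrastive saliency for one (target, foil) pair.

    spec:           (n_mels, n_frames) log-Mel spectrogram
    get_word_probs: callable mapping a spectrogram to (p_target, p_foil)
    """
    rng = np.random.default_rng(seed)

    # 1. Segmentation: SLIC superpixels on the normalized spectrogram.
    norm = (spec - spec.min()) / (np.ptp(spec) + 1e-9)
    labels = slic(norm, n_segments=n_segments, channel_axis=None)
    labels = np.unique(labels.ravel(), return_inverse=True)[1].reshape(spec.shape)
    n_reg = labels.max() + 1

    p_t, p_f = get_word_probs(spec)           # original probabilities
    score_sum = np.zeros(n_reg)
    hits = np.zeros(n_reg)

    for _ in range(n_masks):
        # 2. Perturbation: occlude each region independently w.p. mask_prob.
        occluded = rng.random(n_reg) < mask_prob
        masked = np.where(occluded[labels], 0.0, spec)
        # 3. Inference on the masked input.
        q_t, q_f = get_word_probs(masked)
        # 4. Scoring with the Relative contrastive scorer.
        s = p_t / (p_t + p_f + 1e-12) - q_t / (q_t + q_f + 1e-12)
        # 5. Aggregation: credit the score to the occluded regions only.
        score_sum[occluded] += s
        hits[occluded] += 1

    # 6. Upsampling: average per region, then project back to bin level.
    return (score_sum / np.maximum(hits, 1))[labels]
```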
4. From Raw Spectrograms to Saliency Maps
Inputs are 80-dimensional log-Mel filterbanks computed every 10 ms, reduced in length by two stride-2 1D convolutions, and processed by a Transformer encoder–decoder. Occlusion (setting a region to zero) is empirically found to approximate removal of the underlying acoustic signal. The approach is purely perturbation-based and requires no model gradients, so it remains model-agnostic. SLIC segmentation, combined with a cap of 750 frames per input, controls the computational burden and the granularity of the maps. Inference uses beam size 5 and a no-repeat-ngram size of 5.
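As a concrete illustration, the snippet below computes matching 80-dimensional log-Mel features with torchaudio's Kaldi-compatible front-end and occludes one block; the file name and block bounds are arbitrary, and the per-utterance normalization applied in fairseq preprocessing is omitted.

```python
import torchaudio

# Load 16 kHz mono audio (file name is illustrative).
wav, sr = torchaudio.load("utterance.wav")

# 80-dim log-Mel filterbanks with a 10 ms frame shift, matching the
# fairseq S2T front-end.
spec = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sr)                     # shape: (n_frames, 80)

# Occlusion: zero a time-frequency block, approximating removal of the
# underlying acoustic evidence (block bounds chosen arbitrarily here).
masked = spec.clone()
masked[100:200, 20:40] = 0.0                 # frames 100-200, Mel bins 20-40
```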
5. Case Study: Gender Assignment in Speech Translation
Contrastive explanations are evaluated on gender prediction tasks using the MuST-SHE dataset (Bentivogli et al., 2020), comprising English TED talks with ambiguous speaker-referring terms translated into languages with grammatical gender. Only examples where the model predicts either the correct gender form or its gender-swapped counterpart are retained. Targets and foils are mapped to the annotated gender forms (e.g., target curiosa, feminine, vs. foil curioso, masculine).
Two faithfulness metrics quantify attribution quality (a sketch of both follows the list):
- Coverage: Percentage of examples where the model's top prediction remains either the target or the foil after the top-k% most salient regions are occluded.
- Flip Rate: Among covered cases, the percentage where occlusion flips the output from target to foil.
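A minimal sketch of both metrics follows, assuming per-example records of the model's prediction before and after occlusion; computing the flip rate over covered cases that originally produced the target is our assumed convention.

```python
def coverage_and_flip_rate(examples):
    """Faithfulness metrics after occluding the top-k% salient regions.

    Each example is a dict with keys 'target', 'foil', 'pred_before'
    (the model's original word, one of target/foil by construction)
    and 'pred_after' (its top word on the occluded input).
    """
    covered = [e for e in examples
               if e["pred_after"] in (e["target"], e["foil"])]
    coverage = 100.0 * len(covered) / len(examples)

    # Flip rate: among covered cases that originally produced the
    # target, how often occlusion switches the output to the foil.
    base = [e for e in covered if e["pred_before"] == e["target"]]
    flips = sum(e["pred_after"] == e["foil"] for e in base)
    flip_rate = 100.0 * flips / len(base) if base else 0.0
    return coverage, flip_rate
```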
Experimental findings on en→fr with a multilingual Transformer indicate:
- With the Diff scorer, coverage plunges below 20% after occluding just 2% of the most salient features, and its saliency maps correlate strongly with non-contrastive maps.
- With the Relative scorer, coverage exceeds 30% up to 20% occlusion, and its saliency maps correlate only weakly with non-contrastive maps.
- For feminine predictions, occluding the top 5% most salient regions flips over 70% of cases to masculine, pinpointing pitch/formant bands as the decisive cues.
- For masculine predictions, flip rates plateau around 30%, consistent with masculine-default bias in current S2T models.
- Visualizations show that the maps sharply localize the pitch/formant regions relevant to gender assignment.
6. Implementation, Evaluation, and Limitations
Hyperparameters include multiple SLIC segmentation granularities, a large number of random perturbation masks with occlusion probability 0.5, and a cap of 750 frames per input. The model is the Fairseq S2T multilingual Transformer (Wang et al., 2020a) with 72M parameters, 80-dimensional Mel-filterbank inputs, and beam size 5. Word probabilities are aggregated using the Word-Boundary method (Section 2), which outperforms Length-Norm for French, while Chain-Rule yields statistically similar performance. Source code is available under Apache 2.0 at github.com/hlt-mt/FBK-fairseq.
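For concreteness, the sketch below contrasts the three aggregation schemes over a word's subword log-probabilities; the `boundary_logprob` term reflects our reading of the Word-Boundary correction (the log-probability that the next token starts a new word) and is an assumption, not the verbatim implementation.

```python
def word_logprob_chain_rule(subword_logprobs):
    """Chain-Rule: sum the log-probabilities of the word's subword tokens."""
    return sum(subword_logprobs)

def word_logprob_length_norm(subword_logprobs):
    """Length-Norm: mean subword log-probability, discounting word length."""
    return sum(subword_logprobs) / len(subword_logprobs)

def word_logprob_word_boundary(subword_logprobs, boundary_logprob):
    """Word-Boundary (after Pimentel & Meister, 2024): chain-rule score
    plus the log-probability that the next token opens a new word, so
    tokenizations that would extend the word are excluded from its mass."""
    return sum(subword_logprobs) + boundary_logprob
```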
Limitations include restriction to binary gender assignment; extension to contrasts involving homophones, coreference, or politeness is not yet explored. Faithfulness metrics saturate beyond 20% input occlusion, where outputs degenerate. Ethical issues arise given the reliance on acoustic gender signals, which may bias against non-binary or vocal-impaired speakers and erase non-binary identities under binary labels.
7. Significance and Outlook
Contrastive feature attribution on spectrograms for S2T models achieves interpretable, model-agnostic, and contrast-specific saliency maps—advancing the capacity to diagnose and audit model decision processes at the acoustic level. The combination of perturbation-based segmentation (via SPES), relative contrastive scoring, and rigorous word probability aggregation realizes explanations not just of what input features support a given prediction, but of what tips the model toward one alternative over another. Further research into broader linguistic contrasts and mitigation of bias is an open direction (Conti et al., 30 Sep 2025).