
Explainable Audio Deepfake Detection

Updated 5 March 2026
  • Explainable Audio Deepfake Detection (ADD) is a field developing models that not only classify audio as real or fake but also provide clear, human-interpretable reasoning such as temporal localization and traceability.
  • Architectures like Audio Language Models with Chain-of-Thought and diffusion-based artifact segmentation deliver transparent, frame-level evidence, enhancing both detection accuracy and forensic auditability.
  • Benchmarks such as FakeSound2 reveal that although in-domain detection shows high accuracy, challenges remain in achieving robust traceability and out-of-domain generalization under adversarial conditions.

Explainable Audio Deepfake Detection (ADD) encompasses the development and analysis of algorithms that not only discriminate between genuine and fake audio signals, but also provide transparent, human-interpretable evidence and reasoning underlying their decisions. This field intersects explainable artificial intelligence (XAI), adversarial machine learning, forensic analysis, and audio representation learning. Fundamental concerns are not only detection accuracy and generalization, but the localization of manipulations, attribution of forgery mechanisms, and robust, auditable reasoning even under adversarial conditions.

1. Definitions, Scope, and Core Dimensions

Explainable Audio Deepfake Detection is defined as the suite of models and tools that output, in addition to a binary “real” or “fake” label, explicit evidence supporting their predictions:

  • Temporal Localization: Pinpointing the regions of an audio clip that contain manipulated content.
  • Traceability: Inferring both the manipulation operation (“how”) and the underlying generative or editing source (“by whom”).
  • Human-Readable Reasoning: Producing explicit, auditable traces (linguistic or visual) of the decision process, allowing human analysts to verify the basis of the model’s outputs.
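These three dimensions can be carried together in one structured output. The sketch below shows one way to do so; all class and field names are illustrative and not drawn from any cited system:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Segment:
    start: float   # seconds
    end: float

@dataclass
class ExplainableVerdict:
    label: str                          # "real" or "fake"
    segments: List[Segment]             # temporal localization of manipulations
    manipulation: Optional[str] = None  # traceability: "how"
    source: Optional[str] = None        # traceability: "by whom"
    reasoning: List[str] = field(default_factory=list)  # auditable rationale

v = ExplainableVerdict(
    label="fake",
    segments=[Segment(1.2, 2.8)],
    manipulation="regeneration",
    source="unknown-vocoder",
    reasoning=["prosody flattens between 1.2 s and 2.8 s"],
)
```

A verdict of this shape gives a human analyst the label, the "where", the "how/by whom", and the "why" in one record, which is exactly what AUC-only pipelines omit.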

Conventional benchmarks reporting only AUC or EER do not address these objectives. The FakeSound2 benchmark formalizes all three via segment-level precision/recall (localization), manipulation/source classification (traceability), and separate in-domain vs. out-of-domain generalization tests (Xie et al., 21 Sep 2025).

Explainability mechanisms operate both as intrinsic design features (reasoning traces, interpretable features) and as post-hoc analyses (saliency, relevance maps), constrained throughout by the need for faithfulness and utility in real-world forensic workflows (Grinberg et al., 23 Jan 2025).

2. Architectures Enabling Explainability

2.1 Explicit Reasoning via Audio LLMs with Chain-of-Thought

Audio LLMs (ALMs) equipped with Chain-of-Thought (CoT) mechanisms represent a structural advance:

  • Input/Output Formalism:

$$\mathcal{F}: X \mapsto Y = \{r_1, r_2, \ldots, r_N, c\}$$

where $X \in \mathbb{R}^L$ is the waveform, $r_k$ are text-based reasoning aspects (e.g., prosody, liveliness), and $c \in \{\text{fake}, \text{real}\}$ is the decision.

  • Decoding Process: The model autoregressively emits reasoning steps $r_1, \dots, r_N$ based on audio perception, followed by the label $c$.
  • Glass-box Transparency: Each reasoning step can be audited to determine both its grounding in perceptual evidence and its logical entailment to $c$, surpassing black-box baselines (Nguyen et al., 7 Jan 2026).
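The mapping above can be illustrated by parsing a finished ALM transcript back into its reasoning steps and verdict. The numbered-line/"Verdict:" serialization here is a hypothetical format, not the output format of any cited model:

```python
import re

def parse_cot_output(text: str):
    """Split an ALM transcript into reasoning steps r_1..r_N and verdict c.

    Assumes a hypothetical format: numbered reasoning lines followed by a
    'Verdict: real|fake' line, emitted autoregressively in that order.
    """
    steps = re.findall(r"^\d+\.\s*(.+)$", text, flags=re.MULTILINE)
    m = re.search(r"Verdict:\s*(real|fake)", text, flags=re.IGNORECASE)
    if m is None:
        raise ValueError("no verdict found after reasoning trace")
    return steps, m.group(1).lower()

transcript = """\
1. Prosody: pitch contour is unnaturally flat across phrase boundaries.
2. Liveliness: breath noises are absent between sentences.
Verdict: fake"""
steps, label = parse_cot_output(transcript)
```

Because the reasoning steps are recovered as discrete text units, each one can be audited individually for perceptual grounding before trusting the label.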

2.2 Modular Feature-Based Explainability

SLIM separates paralinguistic style and linguistic content representations, learning their dependency exclusively on real speech and identifying fakes via violations of this dependency (Zhu et al., 2024). EAI-ADD integrates acoustic and emotional feature streams, exposing cross-level inconsistencies particularly evident in synthetic audio (Zhang et al., 20 Jan 2026).
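SLIM's dependency-violation idea can be sketched with a toy linear stand-in: a mapping fitted on real speech predicts the style stream from the content stream, and fakes are flagged when the per-frame residual grows. The linear model and cosine score are simplifications of the paper's learned dependency:

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_mismatch(style, content, W):
    """Per-frame mismatch between style and linguistic streams.

    W (toy stand-in for SLIM's learned dependency) maps content features
    into style space; a large per-frame cosine residual signals a
    violation of the style-content dependency learned on real speech.
    """
    predicted = content @ W                      # (T, d) dependency model
    num = np.sum(style * predicted, axis=1)
    den = np.linalg.norm(style, axis=1) * np.linalg.norm(predicted, axis=1) + 1e-9
    return 1.0 - num / den                       # 0 = consistent, 2 = opposed

T, d = 50, 8
content = rng.normal(size=(T, d))
W = rng.normal(size=(d, d))
real_style = content @ W + 0.01 * rng.normal(size=(T, d))  # follows the dependency
fake_style = rng.normal(size=(T, d))                       # violates it
real_m = frame_mismatch(real_style, content, W).mean()
fake_m = frame_mismatch(fake_style, content, W).mean()
```

Real speech yields near-zero mean mismatch while the synthetic stand-in does not, which is the signal SLIM thresholds per frame.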

2.3 Faithful Post-hoc Explanations

GATR, AttnLRP, Grad-CAM, SHAP, and domain-specific adaptations are used for transformer-based models, producing relevance heatmaps in the time (or time-frequency) domain. The faithfulness of such explanations is rigorously benchmarked on large datasets and partial-spoof tests, with GATR establishing best alignment to classifier focus (Grinberg et al., 23 Jan 2025).
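A minimal model-agnostic stand-in for such relevance maps is occlusion analysis: zero out each time window and record the classifier's score drop. The cited methods use gradients or relevance propagation instead, but the deletion-test intuition is the same:

```python
import numpy as np

def occlusion_relevance(x, score_fn, win=400):
    """Relevance per time window: the score drop when that window is zeroed.

    A model-agnostic proxy for gradient/LRP heatmaps; a larger drop means
    the classifier relied more on that region of the waveform.
    """
    base = score_fn(x)
    rel = np.zeros(len(x) // win)
    for i in range(len(rel)):
        xo = x.copy()
        xo[i * win:(i + 1) * win] = 0.0
        rel[i] = base - score_fn(xo)
    return rel

# Toy "fake score": energy in the second half of the clip.
score = lambda x: float(np.sum(x[4000:] ** 2))
x = np.zeros(8000)
x[6000:6400] = 1.0                  # simulated artifact burst
rel = occlusion_relevance(x, score)
```

The relevance peaks exactly at the window containing the burst, which is the alignment property the faithfulness benchmarks quantify.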

2.4 Artifact Region Segmentation via Diffusion

Supervised diffusion models (SegDiff, ADDSegDiff) are trained to reconstruct “artifact masks” derived from bona fide vs. vocoded audio pairs, providing high-resolution, time-frequency localization of deepfake artifacts and aligning explanations to ground-truth perturbations (Grinberg et al., 3 Jun 2025).
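The supervision signal for such segmenters can be sketched as follows: given a paired bona fide and vocoded log-spectrogram, mark the time-frequency bins where they diverge beyond a threshold. The exact masking rule of the cited pipeline may differ:

```python
import numpy as np

def artifact_mask(spec_bonafide, spec_vocoded, thresh=3.0):
    """Toy ground-truth artifact mask for supervising a diffusion segmenter.

    Assumption: a bin is "artifact" where the vocoded log-spectrogram
    deviates from the paired bona fide one by more than `thresh` dB.
    """
    diff = np.abs(spec_vocoded - spec_bonafide)   # per-bin dB difference
    return (diff > thresh).astype(np.float32)

F, T = 80, 100
bona = np.zeros((F, T))
voco = bona.copy()
voco[60:, 40:60] += 6.0        # vocoder smears high frequencies mid-clip
mask = artifact_mask(bona, voco)
```

Training against masks derived this way is what lets the segmenter's heatmaps align with ground-truth perturbations rather than with incidental classifier attention.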

3. Benchmarking and Quantitative Evaluation

Benchmarks such as FakeSound2 systematically evaluate:

  • Localization: Segment-level F1-score and Intersection-over-Union (IoU) for manipulated intervals.
  • Traceability: Accuracy in manipulation type and source generator attribution.
  • Generalization: Quantified via in-domain vs. out-of-domain accuracy gaps.
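The localization metrics can be computed as in this sketch, using greedy one-to-one matching at an IoU threshold; the benchmark's exact matching protocol may differ:

```python
def interval_iou(pred, gold):
    """IoU between a predicted and a reference manipulated interval (seconds)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def segment_f1(preds, golds, iou_thresh=0.5):
    """Segment-level F1: a prediction is a hit if it matches an unused
    reference interval with IoU above the threshold (greedy matching)."""
    matched, tp = set(), 0
    for p in preds:
        for j, g in enumerate(golds):
            if j not in matched and interval_iou(p, g) >= iou_thresh:
                matched.add(j)
                tp += 1
                break
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(golds) if golds else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# One of two predictions overlaps a reference interval well enough.
f1 = segment_f1([(1.0, 2.0), (5.0, 5.5)], [(0.9, 2.1), (3.0, 4.0)])
```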

Results indicate that, although models achieve high identification accuracy ($\sim$95%) and very high localization F1 ($\sim$97%) in-domain, traceability accuracy remains limited ($\sim$72% for manipulation, 85% for source attribution), and generalization to new generators leads to substantial performance drops (manipulation accuracy 41.8%, F1 79.3% out-of-domain) (Xie et al., 21 Sep 2025). Diffusion-based artifact segmentation far surpasses generic XAI methods such as SHAP and LRP in both Dice and F1-score (50–57% vs. 11–13% Dice) (Grinberg et al., 3 Jun 2025).

Faithfulness metrics (Grinberg et al., 23 Jan 2025) include:

  • Average Drop (AD), Average Gain (AG), Input Fidelity, and Average Increase (AI).
  • Partial-Spoof RCQ and Relevance Mass/Rank Accuracy (RMA/RRA) for alignment with ground truth.
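Average Drop and Average Gain can be sketched from classifier confidences on the full input versus the explanation-masked input. The definitions below follow common XAI usage; the benchmark's exact formulas may vary:

```python
import numpy as np

def average_drop(scores_full, scores_masked):
    """Average Drop (%): relative confidence lost when keeping only the
    regions the explanation marks as relevant (lower = more faithful)."""
    f, m = np.asarray(scores_full), np.asarray(scores_masked)
    return float(np.mean(np.maximum(0.0, f - m) / f) * 100.0)

def average_gain(scores_full, scores_masked):
    """Average Gain (%): relative confidence gained on the
    explanation-masked input (higher = more faithful)."""
    f, m = np.asarray(scores_full), np.asarray(scores_masked)
    return float(np.mean(np.maximum(0.0, m - f) / (1.0 - f)) * 100.0)

ad = average_drop([0.9, 0.8], [0.81, 0.8])   # only the first sample drops
ag = average_gain([0.5], [0.75])             # masking raises confidence
```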

4. Explainability Mechanisms and Reasoning Diagnostics

Advanced explainability frameworks assess the internal reasoning robustness under adversarial settings:

  • Acoustic Perception Robustness ($R_{\text{acoustic}}$): Quantifies the model’s true perception of forensic features on curated binary question banks (Nguyen et al., 7 Jan 2026).
  • Cognitive Coherence: Measures logical entailment between reasoning steps and the final verdict, and its shift under attack.
  • Cognitive Dissonance: Captures internal conflicts between reasoning and verdict, which may serve as “silent alarms” when models make incorrect but unjustified predictions.
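A toy version of these diagnostics, assuming a hypothetical entailment scorer has already labeled each reasoning step as supporting "fake" (+1) or "real" (-1):

```python
def reasoning_diagnostics(step_polarities, verdict):
    """Toy coherence/dissonance diagnostics over a reasoning trace.

    Coherence is the fraction of steps whose (hypothetically scored)
    polarity entails the verdict; dissonance is the contradicting
    fraction, the "silent alarm" described above.
    """
    target = 1 if verdict == "fake" else -1
    agree = sum(1 for p in step_polarities if p == target)
    coherence = agree / len(step_polarities)
    return coherence, 1.0 - coherence

# Two of three steps entail the "fake" verdict; one contradicts it.
coh, dis = reasoning_diagnostics([1, 1, -1], "fake")
```

The actual frameworks measure entailment with learned models rather than fixed polarities, but the coherence/dissonance split has the same shape.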

Under adversarial influences:

  • Acoustic attacks often erode coherence but amplify dissonance, alerting to potential manipulation even if the output label is incorrect.
  • Linguistic attacks may increase coherence (by inducing overconfident rationalization) but suppress dissonance, masking detection errors (Nguyen et al., 7 Jan 2026).

EAI-ADD leverages frame-level inconsistency scores ($\gamma^{\mathrm{mis}}_t$) and heterogeneous attention graphs to isolate precisely which segments show unnatural emotion-acoustic decoupling, supplementing global binary decisions with frame-wise, human-readable reasoning (Zhang et al., 20 Jan 2026).

SLIM produces per-frame mismatch scores between style and linguistics ($S_{\text{mismatch}}$), giving localized, interpretable justifications for fake classifications based on temporal spikes (Zhu et al., 2024).
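Turning such per-frame scores into localized justifications amounts to merging consecutive above-threshold frames into time intervals, e.g.:

```python
def spikes_to_intervals(scores, thresh, hop=0.02):
    """Convert per-frame mismatch scores into intervals (seconds) by
    merging consecutive above-threshold frames; hop is the frame step."""
    intervals, start = [], None
    for i, s in enumerate(scores + [thresh - 1.0]):   # sentinel closes open run
        if s > thresh and start is None:
            start = i
        elif s <= thresh and start is not None:
            intervals.append((start * hop, i * hop))
            start = None
    return intervals

# Frames 1-2 and frame 4 exceed the threshold.
ivals = spikes_to_intervals([0.1, 0.9, 0.8, 0.2, 0.7], 0.5)
```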

5. Model Vulnerabilities, Robustness, and Design Guidelines

ALMs with high $R_{\text{acoustic}}$ (e.g., Qwen2-Audio, granite-3.3-8b) exhibit a “reasoning shield”: explicit reasoning reduces adversarial susceptibility and maintains high performance under perturbation ($\Delta\mathrm{ASR} < 0$ at $p = 0.0027$) (Nguyen et al., 7 Jan 2026). Models with poor perceptual grounding incur a “reasoning tax”: increased vulnerability and rationalization of errors, particularly under semantic (linguistic) attacks.

Key recommendations:

  • Pre-finetune encoders to maximize $R_{\text{acoustic}}$.
  • Augment training with adversarial examples and regularize for dissonance preservation.
  • Deploy internal dissonance ($\Psi_{\mathrm{Diss}}$) as a risk score to trigger human audits.
  • Balance Chain-of-Thought trace length to avoid “verbal overshadowing.”
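The dissonance-as-risk-score recommendation reduces to a simple gate; the threshold and routing labels here are illustrative:

```python
def triage(verdict, dissonance, risk_thresh=0.4):
    """Route a detection through a dissonance gate: high internal conflict
    between reasoning and verdict triggers a human audit instead of
    automatic acceptance of the model's label."""
    return "human_audit" if dissonance >= risk_thresh else f"auto_{verdict}"

r1 = triage("fake", 0.55)   # conflicted trace -> escalate to an analyst
r2 = triage("real", 0.10)   # coherent trace -> accept automatically
```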

6. Empirical Insights and Large-Scale Analysis

Large-scale XAI studies reveal that model attention may concentrate on unexpected audio regions:

  • For bona fide audio, background non-speech or low-energy regions may dominate relevance in some datasets.
  • For spoofed audio, unstressed vowels and low-energy onsets (in-domain) or steady speech (out-of-domain) are often the most relevant (Grinberg et al., 23 Jan 2025).
  • Post-hoc explainer selection is critical; GATR provides the most faithful alignment to classifier logic on large datasets, outperforming Grad-CAM, SHAP, and DeepSHAP across deletion/insertion tests, input fidelity, and spoof-segment concentration.

Diffusion-based artifact segmenters yield heatmaps that tightly match vocoder-specific artifacts, achieving substantial improvements over classical XAI in faithfulness and localization accuracy (Grinberg et al., 3 Jun 2025).

7. Open Challenges and Future Directions

Significant open challenges persist:

  • Artifact-Agnostic Generalization: Existing models overfit to manipulation artifacts present in training data, limiting out-of-domain detection and traceability (Xie et al., 21 Sep 2025).
  • Explainability under Multiple Manipulations: Handling audio that contains overlapping manipulations or cross-modal (e.g., audio-video) forgeries.
  • Interpretability-Versus-Robustness Tradeoffs: Increasing CoT detail can mask fine-grained spectral cues (“verbal overshadowing”) (Nguyen et al., 7 Jan 2026).
  • Extending Style and Emotion Libraries: Expanding SLIM or EAI-ADD to handle a broader range of real speech variation (pathologies, accents) is critical for coverage (Zhu et al., 2024).

Directions include prototype-based and symbolic reasoning modules, adversarial domain adaptation, better joint modeling of speaker-linguistic dependencies, and enhanced explainable architectures linking feature localization to human-interpretable categories (phoneme, prosody, emotional transitions).


Key references: (Nguyen et al., 7 Jan 2026, Xie et al., 21 Sep 2025, Zhu et al., 2024, Grinberg et al., 3 Jun 2025, Zhang et al., 20 Jan 2026, Grinberg et al., 23 Jan 2025).
