Frame-level anomaly maps for segment-level localization in partial audio deepfake detection

Develop frame-level anomaly mapping methods that operate on frozen speech foundation model embeddings and localize manipulated regions in partially spoofed audio at the segment level, addressing the weakness on short spoofed segments observed on the HAD and ADD 2023 benchmarks.

Background

TRACE is a training-free framework that detects partial audio deepfakes by analyzing first-order embedding dynamics from frozen speech foundation models. While it performs strongly at the utterance level, the paper notes reduced performance on datasets with short, densely packed spoof segments (e.g., HAD and ADD 2023), suggesting a need for finer temporal localization.

The authors explicitly state that developing frame-level anomaly maps remains an open direction to enable segment-level localization and mitigate the dilution of localized anomalies by global statistics.
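To make the idea of a frame-level anomaly map concrete, the following is a minimal sketch under stated assumptions, not the TRACE method itself. It assumes frame embeddings are already available as a (T, D) array, scores each frame by the magnitude of its first-order embedding step, normalizes with a median/MAD robust z-score (a swapped-in, generic choice), and groups above-threshold frames into segments. The function names, the synthetic data, and the threshold of 8.0 are all hypothetical and chosen only for illustration.

```python
import numpy as np

def frame_anomaly_map(emb, eps=1e-8):
    """Robust z-score of the first-order step magnitude for each frame.

    emb: (T, D) array of frame embeddings (assumed to come from a frozen model).
    Returns a (T,) array; larger values mean a more unusual frame-to-frame change.
    """
    emb = np.asarray(emb, dtype=float)
    # First-order dynamics: L2 norm of the difference between consecutive frames.
    step = np.linalg.norm(np.diff(emb, axis=0), axis=1)  # shape (T-1,)
    step = np.concatenate([step[:1], step])              # repeat first value to get T scores
    # Median/MAD normalization is an illustrative choice, not taken from the paper.
    med = np.median(step)
    mad = np.median(np.abs(step - med))
    return (step - med) / (1.4826 * mad + eps)

def segments_from_map(scores, threshold):
    """Group consecutive frames above threshold into (start, end) frame ranges, end exclusive."""
    mask = np.asarray(scores) > threshold
    padded = np.concatenate([[False], mask, [False]])
    edges = np.flatnonzero(padded[1:] != padded[:-1])    # alternating run starts and ends
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]

# Synthetic stand-in for real embeddings: a smooth random walk with one short
# stretch (frames 100..109) whose frame-to-frame steps are ten times larger.
rng = np.random.default_rng(0)
T, D = 200, 16
noise = rng.normal(scale=0.01, size=(T, D))
noise[100:110] *= 10.0
emb = np.cumsum(noise, axis=0)

scores = frame_anomaly_map(emb)
segments = segments_from_map(scores, threshold=8.0)  # hypothetical threshold
print("segments (frame indices):", segments)
# The utterance-level mean stays far below the peak: a short anomaly is diluted
# when frame scores are pooled into a single global statistic.
print("mean score:", scores.mean(), "max score:", scores.max())
```

Frame indices can be converted to seconds by multiplying by the frame hop of the embedding model (20 ms for several common speech foundation models, though this should be checked for the model in use). Comparing the printed mean and max scores illustrates the dilution effect described above: the per-frame map exposes the short segment, while a single pooled statistic largely hides it.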

References

Several directions remain open: frame-level anomaly maps could enable segment-level localization, directly addressing the short-spoof-segment weakness on HAD and ADD 2023; multi-layer fusion across layers 15–21 may improve robustness beyond the single optimal layer; and the same paradigm could extend beyond audio to deepfake face detection via vision transformers, machine-generated text detection via LLMs, or cross-modal consistency verification in multimodal foundation models.

TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models (2604.01083 - Khan et al., 1 Apr 2026) in Supplementary Material, Section: Extended Discussion