Multi-layer fusion across intermediate transformer layers to improve robustness

Investigate whether fusing representations across transformer layers 15–21 of WavLM-Large (and the analogous intermediate layers of related speech foundation models) improves robustness and detection performance beyond a single optimal layer in training-free partial audio deepfake detection based on embedding trajectory dynamics.
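
As a concrete starting point, the sketch below shows one way such fusion could be prototyped with the Hugging Face `transformers` WavLM implementation: frame-level hidden states from layers 15–21 are mean-pooled across layers, and a frame-to-frame cosine-distance score is computed along the embedding trajectory. The mean fusion and the cosine-distance score are illustrative assumptions for this note, not the scoring reported in the paper.

```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, WavLMModel

# Load WavLM-Large and request all intermediate hidden states.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained(
    "microsoft/wavlm-large", output_hidden_states=True
).eval()

def fused_embeddings(waveform_16k: np.ndarray, layers=range(15, 22)) -> torch.Tensor:
    """Frame-level embeddings, mean-fused over the chosen intermediate layers.

    Mean fusion over layers 15-21 is an illustrative, training-free choice;
    the reported system uses a single optimal layer (e.g., layer 18).
    """
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the feature-projection output; hidden_states[i] is
    # the output of transformer layer i (1..24 for WavLM-Large).
    stacked = torch.stack([out.hidden_states[i] for i in layers], dim=0)  # (L, 1, T, D)
    return stacked.mean(dim=0).squeeze(0)                                 # (T, D)

def trajectory_score(emb: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame cosine distance along the embedding trajectory.

    Peaks in this score are candidate splice boundaries (an assumed proxy
    for the paper's trajectory-dynamics scoring).
    """
    return 1.0 - F.cosine_similarity(emb[:-1], emb[1:], dim=-1)           # (T-1,)
```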

Background

Ablation studies in the paper show that intermediate layers (e.g., layer 18 of WavLM-Large) are more informative for detecting splice boundaries than final layers. However, the reported system uses only single-layer representations.

The authors explicitly identify multi-layer fusion across a range of intermediate layers (15–21) as an open direction to potentially improve robustness over single-layer selection.
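
To make this comparison concrete, a minimal harness could pit the single-layer baseline against two training-free fusion variants: fusing layer representations before scoring, and averaging per-layer trajectory scores. Both strategies are assumptions for illustration and reuse the hypothetical fused_embeddings and trajectory_score helpers sketched above.

```python
import torch

def score_single_layer(waveform_16k, layer: int = 18) -> torch.Tensor:
    """Baseline: trajectory score from the single optimal layer."""
    return trajectory_score(fused_embeddings(waveform_16k, layers=[layer]))

def score_embedding_fusion(waveform_16k, layers=range(15, 22)) -> torch.Tensor:
    """Fuse layer representations first, then score the fused trajectory."""
    return trajectory_score(fused_embeddings(waveform_16k, layers=layers))

def score_score_fusion(waveform_16k, layers=range(15, 22)) -> torch.Tensor:
    """Score each layer's trajectory separately, then average the scores."""
    per_layer = [
        trajectory_score(fused_embeddings(waveform_16k, layers=[l])) for l in layers
    ]
    return torch.stack(per_layer, dim=0).mean(dim=0)
```

Note that score-level fusion keeps each layer's trajectory on its own scale, so some per-layer normalization would likely be needed before averaging; representation-level fusion avoids this but may dilute a strongly informative layer such as 18.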

References

Several directions remain open: frame-level anomaly maps could enable segment-level localization, directly addressing the short-spoof-segment weakness on HAD and ADD 2023; multi-layer fusion across layers 15–21 may improve robustness beyond the single optimal layer; and the same paradigm could extend beyond audio to deepfake face detection via vision transformers, machine-generated text detection via LLMs, or cross-modal consistency verification in multimodal foundation models.

TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models (2604.01083 - Khan et al., 1 Apr 2026) in Supplementary Material, Section: Extended Discussion