Multi-layer fusion across intermediate transformer layers to improve robustness
Investigate whether fusing intermediate layers 15–21 of WavLM-Large (and analogous layers in related speech foundation models), rather than relying on a single optimal layer, improves the robustness and performance of training-free partial audio deepfake detection based on embedding trajectory dynamics.
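A minimal sketch of the proposed training-free fusion, assuming per-layer frame embeddings are already extracted from WavLM-Large (here random arrays stand in for the real hidden states; the function names and the mean-fusion choice are illustrative assumptions, not a committed design):

```python
import numpy as np

def fuse_layers(layer_embs):
    """Training-free fusion: average frame embeddings across layers.

    layer_embs: mapping of layer index -> (T, D) array of frame embeddings.
    """
    stacked = np.stack(list(layer_embs.values()), axis=0)  # (L, T, D)
    return stacked.mean(axis=0)  # (T, D)

def trajectory_dynamics(embs):
    """Frame-to-frame cosine distance along the embedding trajectory;
    large jumps are treated as anomaly evidence."""
    a, b = embs[:-1], embs[1:]
    cos = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    return 1.0 - cos  # (T-1,) frame-level anomaly profile

# Synthetic stand-ins: 50 frames, hidden size 1024 (WavLM-Large),
# one (T, D) array per candidate layer 15..21.
rng = np.random.default_rng(0)
T, D = 50, 1024
layer_embs = {l: rng.standard_normal((T, D)) for l in range(15, 22)}

fused = fuse_layers(layer_embs)
profile = trajectory_dynamics(fused)
score = float(profile.max())  # utterance-level detection score
```

Unweighted averaging keeps the method training-free; layer-weighted variants would reintroduce tunable parameters and are left out of this sketch.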
Several directions remain open: frame-level anomaly maps could enable segment-level localization, directly addressing the short-spoof-segment weakness on HAD and ADD 2023; multi-layer fusion across layers 15–21 may improve robustness beyond the single optimal layer; and the same paradigm could extend beyond audio to deepfake face detection via vision transformers, machine-generated text detection via LLMs, or cross-modal consistency verification in multimodal foundation models.
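The first direction, turning a frame-level anomaly map into segment-level localization, can be sketched with simple thresholding and run extraction (the threshold rule, `min_len` parameter, and synthetic profile below are illustrative assumptions):

```python
import numpy as np

def localize_segments(frame_scores, k=2.0, min_len=3):
    """Flag frames whose anomaly score exceeds mean + k*std and return
    contiguous above-threshold runs as (start, end) frame indices."""
    thr = frame_scores.mean() + k * frame_scores.std()
    mask = frame_scores > thr
    segments, start = [], None
    for i, above in enumerate(mask):
        if above and start is None:
            start = i                      # run begins
        elif not above and start is not None:
            if i - start >= min_len:       # keep only runs long enough
                segments.append((start, i))
            start = None
    if start is not None and len(mask) - start >= min_len:
        segments.append((start, len(mask)))  # run extends to the end
    return segments

# Synthetic profile: a spike standing in for a spliced spoof region.
scores = np.zeros(100)
scores[40:55] = 5.0
print(localize_segments(scores))  # -> [(40, 55)]
```

Frame indices would map back to time via the encoder's frame rate (roughly 20 ms per frame for WavLM), which is what makes this usable against the short-spoof-segment weakness noted above.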