Spectral-Contrastive Audio Residuals (SONAR)
- Spectral-Contrastive Audio Residuals (SONAR) is a deep learning framework that enhances audio deepfake detection by separating low-frequency content from subtle high-frequency residuals.
- It employs a dual-path architecture with a content feature extractor and a noise feature extractor, integrating features via frequency cross-attention for robust latent space discrimination.
- The model uses a frequency-aware contrastive Jensen–Shannon loss to align LF and HF embeddings, achieving state-of-the-art performance with accelerated convergence.
Spectral-Contrastive Audio Residuals (SONAR) is a frequency-guided deep learning framework designed for generalizable detection of audio deepfakes. SONAR addresses the spectral bias in neural network training, which leads traditional detectors to primarily exploit low-frequency (LF) cues while neglecting subtle but forensic high-frequency (HF) artifacts often left by audio deepfake generators. By explicitly disentangling LF content and HF residuals via learnable filters, integrating the representations through frequency cross-attention, and aligning them using a frequency-aware Jensen–Shannon contrastive loss, SONAR achieves improved generalization, sharper decision boundaries in latent space, and accelerated convergence over prior baselines (HIdekel et al., 26 Nov 2025).
1. Motivation: Spectral Bias and High-Frequency Artifacts
Spectral bias, also referenced as the “frequency principle” or “F-principle,” denotes the tendency of neural networks to fit LF components of signals before HF details during training (Rahaman et al. 2019, Fridovich-Keil et al. 2022). In audio deepfake detection, this bias results in models under-exploiting HF information, despite HF artifacts being critical cues for detecting synthetic speech. Analysis reveals strong LF–HF co-modulation in genuine speech (a high Pearson correlation between energies in the 0–4 kHz and 7–8 kHz bands), a property that collapses to zero or negative correlation in deepfakes. Systematic shifts in the HF/LF energy contrast further distinguish real from synthetic audio. SONAR is proposed to bridge this frequency exploitation gap by constructing learning signals that explicitly target faint HF audio residuals alongside LF content (HIdekel et al., 26 Nov 2025).
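To make the co-modulation analysis concrete, the following sketch estimates the Pearson correlation between per-frame band energies of a 16 kHz utterance. The band edges follow the 0–4 kHz / 7–8 kHz split described above; the framing parameters and function name are illustrative assumptions, not the paper's exact analysis pipeline.

```python
# Illustrative sketch: correlate frame-level LF and HF band energies.
import numpy as np


def band_energy_correlation(wav: np.ndarray, sr: int = 16000,
                            n_fft: int = 512, hop: int = 256) -> float:
    """Pearson correlation between per-frame 0-4 kHz and 7-8 kHz energies."""
    n_frames = 1 + (len(wav) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))          # (frames, n_fft//2+1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)

    lf_energy = (spec[:, (freqs >= 0) & (freqs < 4000)] ** 2).sum(axis=1)
    hf_energy = (spec[:, (freqs >= 7000) & (freqs <= 8000)] ** 2).sum(axis=1)

    # Genuine speech tends to show strong positive correlation; deepfakes
    # reportedly collapse toward zero or negative values.
    return float(np.corrcoef(lf_energy, hf_energy)[0, 1])
```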
2. Architecture: Dual-Path Representation and SRM-Based Extraction
SONAR processes each input utterance through two complementary branches, each with a distinct frequency emphasis but architecturally identical backbone:
- Content Feature Extractor (CFE, LF Path): The input waveform is processed by a pretrained Wav2Vec 2.0 XLSR encoder (a 24-layer transformer with hidden size 1024) to extract LF-dominated features $X_{LF} \in \mathbb{R}^{T \times 1024}$, where $T$ is the number of frames.
- Noise Feature Extractor (NFE, HF Path): The signal first passes through a Rich Feature Extractor (RFE), a bank of learnable 1D filters of SRM (Steganalysis Rich Model)-inspired design. Each filter is constrained so that its central tap is fixed (by convention $-1$ for SRM-style filters) and its taps sum to zero, enforcing strict high-pass behavior; after each optimizer step, the filters are projected back onto these constraints via scaling and mean normalization (a minimal sketch follows after this list). The filter-bank outputs are concatenated and reduced by a convolution into a single residual signal, which is then encoded by an independent XLSR encoder (weights not shared with the CFE) to produce HF-representative features $X_{HF} \in \mathbb{R}^{T \times 1024}$.
This dual-path architecture explicitly separates LF content and HF residuals for joint downstream modeling (HIdekel et al., 26 Nov 2025).
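A minimal PyTorch sketch of the noise-path front end, assuming the SRM constraint fixes the central tap to $-1$ with the remaining taps scaled to sum to $1$ (zero DC response), restored by a hard projection after every update. The class name, filter-bank size, and kernel length are illustrative placeholders.

```python
import torch
import torch.nn as nn


class RichFeatureExtractor(nn.Module):
    """Bank of learnable SRM-style high-pass 1D filters with a 1x1 reduction."""

    def __init__(self, num_filters: int = 8, kernel_size: int = 7):
        super().__init__()
        self.bank = nn.Conv1d(1, num_filters, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.reduce = nn.Conv1d(num_filters, 1, kernel_size=1, bias=False)
        self.project_()  # start from filters that satisfy the constraint

    @torch.no_grad()
    def project_(self):
        """Hard projection: side taps scaled to sum to 1, central tap set to -1."""
        w = self.bank.weight.data                      # (num_filters, 1, k)
        center = w.shape[-1] // 2
        w[:, :, center] = 0.0                          # exclude center from scaling
        side_sum = w.sum(dim=-1, keepdim=True)
        safe = torch.where(side_sum.abs() < 1e-6,
                           torch.ones_like(side_sum), side_sum)
        w /= safe                                      # side taps now sum to ~1
        w[:, :, center] = -1.0                         # zero overall DC response

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> high-frequency residual of the same length
        residual = self.reduce(self.bank(wav.unsqueeze(1)))
        return residual.squeeze(1)


# Usage: the residual feeds the noise-path XLSR encoder (weights not shared
# with the content path, which sees the raw waveform directly).
rfe = RichFeatureExtractor()
hf_residual = rfe(torch.randn(2, 16000))               # (2, 16000)
```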
3. Frequency Cross-Attention and Latent Fusion
The outputs of both branches, $X_{LF}$ and $X_{HF}$, are integrated by a multi-head frequency cross-attention module with $h$ heads:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad Q = X_{LF} W^{Q}, \quad K = X_{HF} W^{K}, \quad V = X_{HF} W^{V},$$

where $W^{Q}$, $W^{K}$, $W^{V}$ are learned projections and $d_k$ is the per-head key dimension. This mechanism allows LF content frames to attend to HF noise frames (and vice versa in symmetric variants), capturing temporally long- and short-range LF–HF dependencies. This late-stage feature fusion achieves a disentangled yet complementary joint representation, facilitating more robust fake/real discrimination and sharpening the separation in the model’s latent manifold (HIdekel et al., 26 Nov 2025).
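The fusion step can be sketched with a standard multi-head cross-attention layer in which LF frames serve as queries and HF residual frames supply keys and values, matching the reconstructed formula above. The residual connection, layer norm, hidden size, and head count are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FrequencyCrossAttention(nn.Module):
    """LF frames query HF residual frames via multi-head cross-attention."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_lf: torch.Tensor, x_hf: torch.Tensor) -> torch.Tensor:
        # x_lf, x_hf: (batch, frames, dim)
        fused, _ = self.attn(query=x_lf, key=x_hf, value=x_hf)
        return self.norm(x_lf + fused)     # residual connection + layer norm


# Dummy frame-level features from the two branches.
x_lf = torch.randn(2, 200, 1024)
x_hf = torch.randn(2, 200, 1024)
joint = FrequencyCrossAttention()(x_lf, x_hf)           # (2, 200, 1024)
```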
4. Frequency-Aware Contrastive Jensen–Shannon Loss
SONAR’s learning objective includes a frequency-aware contrastive term designed to align or differentiate LF and HF embeddings depending on class (real or fake):
- Per-frame Softmax Conversion: frame-level embeddings from the two streams are converted into probability distributions, $p_t = \mathrm{softmax}(x_t^{LF})$ and $q_t = \mathrm{softmax}(x_t^{HF})$ for frame $t$.
- Jensen–Shannon Divergence: $\mathrm{JSD}(p_t \,\|\, q_t) = \tfrac{1}{2}\,\mathrm{KL}(p_t \,\|\, m_t) + \tfrac{1}{2}\,\mathrm{KL}(q_t \,\|\, m_t)$, with mixture $m_t = \tfrac{1}{2}(p_t + q_t)$.
- Alignment Loss: $\mathcal{L}_{\mathrm{align}} = \frac{1}{T} \sum_{t=1}^{T} s(y)\,\mathrm{JSD}(p_t \,\|\, q_t)$, where $s(y) = +1$ for real utterances and $s(y) = -1$ for fakes.
For real audio, minimizing the divergence pulls the LF and HF embeddings together; for synthetic audio, the flipped sign maximizes the divergence and pushes the embeddings apart.
The full objective combines weighted cross-entropy with the alignment term:
$$\mathcal{L} = \mathcal{L}_{\mathrm{WCE}} + \lambda\,\mathcal{L}_{\mathrm{align}},$$
where $\lambda$ weights the contrastive term.
Weighted cross-entropy (WCE) addresses dataset imbalance. This contrastive alignment exploits the collapsed LF–HF relationship in deepfakes, substantially clarifying the real/fake decision boundary in latent space (HIdekel et al., 26 Nov 2025).
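A sketch of the frequency-aware objective as reconstructed above: per-frame softmax distributions, a symmetric Jensen–Shannon divergence, and a sign term that pulls the streams together for real audio and pushes them apart for fakes. The label convention (1 = real), the $\lambda$ weighting, and the numerical clamping are assumptions.

```python
import torch
import torch.nn.functional as F


def js_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """JSD(p || q) per frame; p, q: (batch, frames, dim) probability vectors."""
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(1e-8).log() - m.clamp_min(1e-8).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-8).log() - m.clamp_min(1e-8).log())).sum(-1)
    return 0.5 * (kl_pm + kl_qm)           # (batch, frames)


def frequency_js_loss(x_lf: torch.Tensor, x_hf: torch.Tensor,
                      y: torch.Tensor) -> torch.Tensor:
    """y: (batch,) with 1 = real, 0 = fake (assumed convention)."""
    p = F.softmax(x_lf, dim=-1)            # per-frame distributions, LF stream
    q = F.softmax(x_hf, dim=-1)            # per-frame distributions, HF stream
    jsd = js_divergence(p, q).mean(dim=1)  # average over frames -> (batch,)
    sign = 2.0 * y - 1.0                   # +1 pulls together, -1 pushes apart
    return (sign * jsd).mean()


def total_loss(logits, x_lf, x_hf, y, class_weights, lam: float = 1.0):
    """Weighted cross-entropy plus the alignment term (lambda assumed)."""
    wce = F.cross_entropy(logits, y.long(), weight=class_weights)
    return wce + lam * frequency_js_loss(x_lf, x_hf, y.float())
```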
5. Training Procedures and Optimization
SONAR is trained on the ASVspoof 2019 LA dataset (highly imbalanced, 1:9 real:fake), with validation on its LA dev set and evaluation on ASVspoof 2021 (LA & DF) and in-the-wild benchmarks. The AdamW optimizer is employed with a cosine learning-rate decay schedule, and training is distributed across four NVIDIA L40 GPUs. Early stopping with a patience of three epochs is enforced, and all experiments are run with three random seeds. After each update, the SRM filters undergo a hard projection to maintain their high-pass constraints (see the training-loop sketch below). SONAR-Full stabilizes in approximately 12 epochs, and SONAR-Finetune in only 4–6 epochs, representing roughly fourfold faster convergence compared to strong baselines requiring up to 100 epochs (HIdekel et al., 26 Nov 2025).
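A training-loop skeleton consistent with the procedure above: AdamW with cosine learning-rate decay and a hard projection of the SRM filters after every optimizer step. The `model.rfe.project_()` call refers to the front-end sketch earlier; the learning-rate values, loss interface, and attribute names are placeholders rather than the paper's exact settings.

```python
import torch


def train(model, loader, criterion, epochs: int = 12, lr: float = 1e-4):
    """Sketch of one epoch loop; `model.rfe` is the RichFeatureExtractor above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * len(loader))
    for _ in range(epochs):
        for wav, label in loader:
            logits, x_lf, x_hf = model(wav)              # dual-path forward pass
            loss = criterion(logits, x_lf, x_hf, label)  # WCE + JS alignment
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            # Hard projection restoring the high-pass constraints on the
            # learnable SRM filters after every optimizer update.
            model.rfe.project_()
```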
6. Empirical Performance and Ablation Analysis
SONAR exhibits state-of-the-art Equal Error Rate (EER) reductions on standard and in-the-wild testbeds, summarized as follows:
| Benchmark | State-of-the-Art Baseline EER (%) | SONAR-Full EER (%) | SONAR-Finetune EER (%) |
|---|---|---|---|
| ASVspoof 2021 DF | 3.69 (XLSR+AASIST) | 1.57 | 1.45 |
| ASVspoof 2021 LA | 1.90 (XLSR+AASIST) | 1.55 | 1.20 |
| In-The-Wild (ITW) | 6.71 (XLSR-Mamba) | 6.00 | 5.43 |
Ablations demonstrate that removing the RFE (HF noise branch) or the JS loss term results in substantial EER degradation (e.g., removing the JS term raises ITW EER from 6.00% to 8.50%). Non-learnable (fixed) filters are outperformed by their learnable counterparts, and performance depends on the number of SRM filters and their kernel length, with a clear optimum reported in the ablation. SONAR is robust to resampling and common codecs (e.g., MP3, Opus), showing only minor shifts in softmax outputs. Latent analyses reveal that the cosine similarity between LF and HF embeddings concentrates at high values for real speech and at markedly lower values for fakes; t-SNE projections indicate clearer class separation than a single-stream XLSR baseline (HIdekel et al., 26 Nov 2025).
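The latent similarity analysis can be reproduced in spirit with a few lines: mean-pool each stream over time and compute the per-utterance cosine similarity between the pooled LF and HF embeddings. The pooling choice and variable names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def lf_hf_cosine(x_lf: torch.Tensor, x_hf: torch.Tensor) -> torch.Tensor:
    """x_lf, x_hf: (batch, frames, dim) -> per-utterance cosine similarity."""
    lf_vec = x_lf.mean(dim=1)              # temporal mean pooling
    hf_vec = x_hf.mean(dim=1)
    return F.cosine_similarity(lf_vec, hf_vec, dim=-1)   # (batch,)
```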
7. Generalizability and Future Extensions
Operating purely at the latent representation level, SONAR remains architecture-agnostic, permitting integration with any backbone network (e.g., alternative self-supervised learning encoders, CNNs) without structural modification. The modality-agnostic design principle implies that the SONAR methodology—frequency-guided contrastive alignment—can extend beyond audio. For data types such as images or video, appropriate high-pass filters or frequency decompositions (e.g., DCT bands) can be substituted for the SRM filters to yield similar disentanglement and discrimination of subtle generative artifacts.
A plausible implication is that this approach may generalize to any domain where discriminative high-frequency cues, typically masked by network bias, are critical for distinguishing real and synthetic data. Future work is likely to explore such modality transfers and more diverse backbone integrations, leveraging SONAR’s architecture-agnostic and frequency-aware contrastive paradigm (HIdekel et al., 26 Nov 2025).