
Spectral-Contrastive Audio Residuals (SONAR)

Updated 3 December 2025
  • Spectral-Contrastive Audio Residuals (SONAR) is a deep learning framework that enhances audio deepfake detection by separating low-frequency content from subtle high-frequency residuals.
  • It employs a dual-path architecture with a content feature extractor and a noise feature extractor, integrating features via frequency cross-attention for robust latent space discrimination.
  • The model uses a frequency-aware contrastive Jensen–Shannon loss to align LF and HF embeddings, achieving state-of-the-art performance with accelerated convergence.

Spectral-Contrastive Audio Residuals (SONAR) is a frequency-guided deep learning framework designed for generalizable detection of audio deepfakes. SONAR addresses the spectral bias in neural network training, which leads traditional detectors to primarily exploit low-frequency (LF) cues while neglecting subtle but forensic high-frequency (HF) artifacts often left by audio deepfake generators. By explicitly disentangling LF content and HF residuals via learnable filters, integrating the representations through frequency cross-attention, and aligning them using a frequency-aware Jensen–Shannon contrastive loss, SONAR achieves improved generalization, sharper decision boundaries in latent space, and accelerated convergence over prior baselines (HIdekel et al., 26 Nov 2025).

1. Motivation: Spectral Bias and High-Frequency Artifacts

Spectral bias, also referenced as the “frequency principle” or “F-principle,” denotes the tendency of neural networks to fit LF components of signals before HF details during training (Rahaman et al. 2019, Fridovich-Keil et al. 2022). In audio deepfake detection, this bias results in models under-exploiting HF information, despite HF artifacts being critical cues for detecting synthetic speech. Analysis revealed strong LF–HF co-modulation in genuine speech (Pearson $r \approx 0.6$ between the 0–4 kHz and 7–8 kHz bands), a property that collapses to zero or negative correlation in deepfakes. Systematic shifts in the HF/LF energy contrast $\Delta E = E_{HF} - E_{LF}$ further distinguish real from synthetic audio. SONAR is proposed to bridge this frequency exploitation gap by constructing learning signals that explicitly target faint HF audio residuals alongside LF content (HIdekel et al., 26 Nov 2025).
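The band statistics above can be estimated with a short analysis script. The following is a minimal sketch, assuming 16 kHz audio and an STFT-based per-frame band-energy estimate; the function name, STFT parameters, and the frame averaging of $\Delta E$ are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal band-energy analysis sketch (not the authors' code): estimates the
# LF-HF co-modulation (Pearson r between per-frame band energies) and the
# energy contrast Delta E = E_HF - E_LF described above, assuming 16 kHz audio.
import numpy as np
from scipy.signal import stft

def lf_hf_statistics(wav: np.ndarray, sr: int = 16000):
    """Return (pearson_r, delta_e) for the 0-4 kHz and 7-8 kHz bands."""
    f, _, Z = stft(wav, fs=sr, nperseg=512, noverlap=384)
    power = np.abs(Z) ** 2                                  # (freq_bins, frames)
    e_lf = power[(f >= 0) & (f < 4000)].sum(axis=0)         # per-frame LF energy
    e_hf = power[(f >= 7000) & (f <= 8000)].sum(axis=0)     # per-frame HF energy
    pearson_r = np.corrcoef(e_lf, e_hf)[0, 1]               # LF-HF co-modulation
    delta_e = e_hf.mean() - e_lf.mean()                     # energy contrast E_HF - E_LF
    return pearson_r, delta_e

# Genuine speech tends to give r near 0.6; deepfakes drift toward zero or below.
```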

2. Architecture: Dual-Path Representation and SRM-Based Extraction

SONAR processes each input utterance $x$ through two complementary branches, each with a distinct frequency emphasis but an architecturally identical backbone:

  • Content Feature Extractor (CFE, LF Path): The input $x$ is processed by a pretrained Wav2Vec 2.0 XLSR encoder—a 24-layer transformer with hidden size 1024—to extract LF-dominated features $z_{content} \in \mathbb{R}^{F \times D}$, where $F$ is the number of frames.
  • Noise Feature Extractor (NFE, HF Path): The signal first passes through a Rich Feature Extractor (RFE), which consists of a bank of $M$ learnable 1D filters of SRM (Steganalysis Rich Model)-inspired design. Each filter $w_i$ is constrained such that its central tap $w_i[c] = -1$ and $\sum_k w_i[k] = 0$, enforcing strict high-pass behavior. After each optimizer step, the filters are projected to restore these constraints via scaling and mean normalization (a sketch of such a constrained filter bank is given below). Outputs from the filter bank are concatenated and reduced through a $1 \times 1$ convolution, resulting in $x_{noise}$, which is then encoded by an independent XLSR encoder (weights not shared) to produce HF-representative features $z_{noise} \in \mathbb{R}^{F \times D}$.

This dual-path architecture explicitly separates LF content and HF residuals for joint downstream modeling (HIdekel et al., 26 Nov 2025).
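A constrained filter bank of this kind can be sketched in PyTorch as follows. The kernel size, the reduction to a single waveform-like channel, and the uniform-shift projection are illustrative assumptions; the paper describes the projection as scaling and mean normalization, so the exact recipe below is not the authors' implementation.

```python
# Minimal PyTorch sketch of an SRM-inspired constrained filter bank (RFE):
# M learnable 1D high-pass filters with central tap -1 and zero-sum taps,
# re-projected onto the constraint set after every optimizer step.
import torch
import torch.nn as nn

class SRMFilterBank(nn.Module):
    def __init__(self, num_filters: int = 30, kernel_size: int = 7):
        super().__init__()
        self.filters = nn.Conv1d(1, num_filters, kernel_size,
                                 padding=kernel_size // 2, bias=False)
        self.reduce = nn.Conv1d(num_filters, 1, kernel_size=1)  # 1x1 reduction
        self.project_()  # start from a valid high-pass configuration

    @torch.no_grad()
    def project_(self):
        """Hard projection: central tap = -1, all taps sum to 0."""
        w = self.filters.weight                       # (M, 1, K)
        K, c = w.shape[-1], w.shape[-1] // 2
        w[..., c] = -1.0
        off_sum = w.sum(dim=-1, keepdim=True) + 1.0   # sum of non-central taps
        delta = (1.0 - off_sum) / (K - 1)             # shift so they sum to +1
        mask = torch.ones_like(w)
        mask[..., c] = 0.0
        w.add_(delta * mask)                          # kernel now sums to 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples) raw waveform -> (batch, 1, samples) HF residual x_noise
        return self.reduce(self.filters(x.unsqueeze(1)))

# During training, call project_() after every optimizer.step() to restore
# the high-pass constraints, as described above.
```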

3. Frequency Cross-Attention and Latent Fusion

The outputs of both branches, $z_{content}$ and $z_{noise}$, are integrated by a multi-head frequency cross-attention module with $H = 8$ heads:

$$Q = z_c W^Q, \quad K = z_n W^K, \quad V = z_n W^V$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T}/\sqrt{d_k}\right)V$$

where $z_c = z_{content}$ and $z_n = z_{noise}$. This mechanism allows LF content frames to attend to HF noise frames (and vice versa in symmetric variants), capturing temporally long- and short-range LF–HF dependencies. This late-stage feature fusion achieves a disentangled yet complementary joint representation, facilitating more robust fake/real discrimination and sharpening the separation in the model’s latent manifold (HIdekel et al., 26 Nov 2025).
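A minimal PyTorch sketch of this fusion step is given below, with queries from the LF path and keys/values from the HF path as in the equations above. The residual connection, layer norm, and module name are assumptions added for illustration, not details from the paper.

```python
# Minimal sketch of the frequency cross-attention fusion: LF content frames
# act as queries over HF noise frames (K, V), per the equations above.
import torch
import torch.nn as nn

class FrequencyCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, z_content: torch.Tensor, z_noise: torch.Tensor) -> torch.Tensor:
        # z_content, z_noise: (batch, F frames, D) from the two XLSR branches
        fused, _ = self.attn(query=z_content, key=z_noise, value=z_noise)
        return self.norm(z_content + fused)       # residual + norm (assumed)

z_c = torch.randn(2, 200, 1024)                   # (batch, F, D = 1024)
z_n = torch.randn(2, 200, 1024)
out = FrequencyCrossAttention()(z_c, z_n)         # (2, 200, 1024) fused representation
```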

4. Frequency-Aware Contrastive Jensen–Shannon Loss

SONAR’s learning objective includes a frequency-aware contrastive term designed to align or differentiate LF and HF embeddings depending on class (real or fake):

  1. Per-frame Softmax Conversion:

$$p_i^c = \mathrm{softmax}(z_c[i]), \quad p_i^n = \mathrm{softmax}(z_n[i])$$

  2. Jensen–Shannon Divergence:

$$\mathrm{JS}(z_c, z_n) = \frac{1}{F} \sum_{i=1}^{F} \mathrm{JS}\left(p_i^c \parallel p_i^n\right)$$

  3. Alignment Loss:

$$L_{JS}(x, y) = y \cdot \mathrm{JS}(z_c, z_n) + (1 - y) \cdot \bigl(1 - \mathrm{JS}(z_c, z_n)\bigr)$$

For real audio ($y = 1$), the objective minimizes divergence, pulling LF and HF embeddings together. For synthetic audio ($y = 0$), maximizing divergence pushes embeddings apart.

The full loss is:

$$L(x, y) = \mathrm{WCE}(\hat{y}, y) + \lambda_{JS} \cdot L_{JS}(x, y)$$

with $\lambda_{JS} = 1$ in the optimal configuration.

Weighted cross-entropy (WCE) addresses dataset imbalance. This contrastive alignment exploits the collapsed LF–HF relationship in deepfakes, substantially clarifying the real/fake decision boundary in latent space (HIdekel et al., 26 Nov 2025).
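The objective above can be written compactly as in the following hedged PyTorch sketch. The JS divergence is normalized to $[0, 1]$ (division by $\ln 2$) so that $1 - \mathrm{JS}$ is well defined; that normalization, the variable names, and the class-weight handling are assumptions, not the authors' code.

```python
# Sketch of the frequency-aware contrastive JS objective plus weighted CE.
import math
import torch
import torch.nn.functional as F

def js_alignment_loss(z_c: torch.Tensor, z_n: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """z_c, z_n: (batch, frames, dim) branch embeddings; y: (batch,) with 1 = real, 0 = fake."""
    p = F.softmax(z_c, dim=-1)                       # per-frame softmax, LF path
    q = F.softmax(z_n, dim=-1)                       # per-frame softmax, HF path
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-12).log() - m.clamp_min(1e-12).log())).sum(-1)
    js = 0.5 * (kl_pm + kl_qm)                       # (batch, frames), bounded by ln 2
    js = js.mean(dim=1) / math.log(2.0)              # frame average, scaled to [0, 1]
    # Real (y = 1): minimize JS (align LF/HF); fake (y = 0): maximize it (push apart).
    return (y * js + (1.0 - y) * (1.0 - js)).mean()

def sonar_loss(logits, labels, z_c, z_n, class_weights, lambda_js: float = 1.0):
    wce = F.cross_entropy(logits, labels, weight=class_weights)   # weighted cross-entropy
    return wce + lambda_js * js_alignment_loss(z_c, z_n, labels.float())
```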

5. Training Procedures and Optimization

SONAR is trained on the ASVspoof 2019 LA dataset (highly imbalanced, 1:9 real:fake), with validation on its LA dev set and evaluation on ASVspoof 2021 (LA & DF) and in-the-wild benchmarks. The AdamW optimizer is employed, with the learning rate decayed from $1 \times 10^{-5}$ to $1 \times 10^{-8}$ using cosine scheduling. The batch size is $28 \times 4$ on four NVIDIA L40 GPUs. Early stopping with a patience of three epochs is enforced, and all experiments are run with three random seeds. After each update, the SRM filters undergo a hard projection to maintain the high-pass constraints. SONAR-Full stabilizes in approximately 12 epochs, and SONAR-Finetune in only 4–6 epochs, representing roughly fourfold faster convergence compared to strong baselines requiring up to 100 epochs (HIdekel et al., 26 Nov 2025).
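An optimizer/scheduler fragment matching this schedule might look as follows; the tiny stand-in model and batch are placeholders, `SRMFilterBank` refers to the sketch in Section 2, and per-epoch scheduler stepping is an assumption rather than a detail from the paper.

```python
# Illustrative training setup: AdamW, cosine decay from 1e-5 toward 1e-8, and
# hard SRM projection after every update. Not the authors' training script.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                        # placeholder for the full SONAR model
rfe = SRMFilterBank()                           # constrained filter bank (Section 2 sketch)
params = list(model.parameters()) + list(rfe.parameters())

num_epochs = 100                                # upper bound; early stopping (patience 3) cuts this short
optimizer = torch.optim.AdamW(params, lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-8)

for epoch in range(num_epochs):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))   # placeholder batch
    loss = nn.functional.cross_entropy(model(x), y)         # in SONAR: WCE + lambda_JS * L_JS
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    rfe.project_()        # restore the high-pass filter constraints after the update
    scheduler.step()      # cosine decay of the learning rate
```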

6. Empirical Performance and Ablation Analysis

SONAR exhibits state-of-the-art Equal Error Rate (EER) reductions on standard and in-the-wild testbeds, summarized as follows:

| Benchmark | State-of-the-Art Baseline EER (%) | SONAR-Full EER (%) | SONAR-Finetune EER (%) |
|---|---|---|---|
| ASVspoof 2021 DF | 3.69 (XLSR+AASIST) | 1.57 | 1.45 |
| ASVspoof 2021 LA | 1.90 (XLSR+AASIST) | 1.55 | 1.20 |
| In-The-Wild (ITW) | 6.71 (XLSR-Mamba) | 6.00 | 5.43 |

Ablations demonstrate that removing the RFE (HF noise branch) or the JS loss term results in substantial EER degradation (e.g., JS removal on ITW: 8.50% vs. 6.00%). Non-learnable (fixed) filters are outperformed by learnable versions. Optimal performance is achieved with $M = 30$ filters and $\lambda_{JS} = 1$. SONAR is robust to resampling and common codecs (e.g., MP3, Opus), showing only minor softmax output shifts. Latent analyses reveal that cosine similarity between real LF–HF embeddings clusters near $+1$, while for fakes it centers around $-0.2$; t-SNE projections indicate clearer class separation than single-stream XLSR (HIdekel et al., 26 Nov 2025).

7. Generalizability and Future Extensions

Operating purely at the latent representation level, SONAR remains architecture-agnostic, permitting integration with any backbone network (e.g., alternative self-supervised learning encoders, CNNs) without structural modification. The modality-agnostic design principle implies that the SONAR methodology—frequency-guided contrastive alignment—can extend beyond audio. For data types such as images or video, appropriate high-pass filters or frequency decompositions (e.g., DCT bands) can be substituted for the SRM filters to yield similar disentanglement and discrimination of subtle generative artifacts.

A plausible implication is that this approach may generalize to any domain where discriminative high-frequency cues, typically masked by network bias, are critical for distinguishing real and synthetic data. Future work is likely to explore such modality transfers and more diverse backbone integrations, leveraging SONAR’s architecture-agnostic and frequency-aware contrastive paradigm (HIdekel et al., 26 Nov 2025).
