PA-TFM: Perceptual Adaptive Time-Frequency Multiplexing

Updated 9 November 2025

PA-TFM is a neural audio watermarking approach that dynamically assigns embedding capacity based on psychoacoustic masking to ensure imperceptibility while resisting diverse attacks.
The method integrates heterogeneous watermark algorithms by routing each into distinct time–frequency tiles, yielding improved detection rates and audio quality metrics under varied corruptions.
It leverages precomputed perceptual masks and black-box watermark systems without additional training, enabling efficient real-time embedding and robust extraction.

Perceptual Adaptive Time-Frequency Multiplexing (PA-TFM) is a multiplexing paradigm introduced for neural audio watermarking that leverages perceptual masking and heterogeneity of watermark designs to enhance robustness against diverse dilution attacks. Unlike static or naïve multiplexing strategies, PA-TFM allocates embedding capacity dynamically across time–frequency regions in a psychoacoustically adaptive fashion, ensuring that at least one component watermark survives under a wide array of signal corruptions—all while maintaining imperceptibility.

1. Problem Scope and Conceptual Overview

Audio watermarking aims to embed hidden, resilient information into digital speech, such that it can be reliably extracted post-distribution or even after targeted tampering. Existing neural watermarking systems (notably AudioSeal [A] and PerTh [P]) exhibit complementary susceptibility: a spread-spectrum approach may survive lossy compression but not fineband neural codecs, whereas a phase-based approach can resist neural tokenization but be vulnerable to temporal cropping.

PA-TFM addresses these vulnerabilities by combining multiple, heterogeneous watermarking algorithms and routing each into time–frequency “tiles” where that algorithm’s robustness aligns with the psychoacoustic landscape (as indicated by a perceptual mask $m(t,f)$ ). This allocation is dynamic and content-dependent, achieved without retraining the constituent watermarking networks. In this sense, PA-TFM is distinguished from:

Parallel multiplexing: Uniform addition of multiple watermark perturbations.
Sequential multiplexing: Output of one watermarking embedder serves as input to the next, leading to inter-perturbation interference.
Frequency-division/time-division multiplexing: Static allocation of frequency bands or time frames, respectively.
PA-TFM: Dynamic, mask-adaptive allocation with mutually-exclusive routing, exploiting both temporal and spectral psychoacoustic redundancies.

2. Mathematical Architecture and Mask Adaptation

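In PA-TFM, let $x[n]$ denote the clean audio, and $X(t,f)=\text{STFT}\{x[n]\}$ the short-time Fourier transform. Watermark systems $W_A$ and $W_P$ individually produce perturbations $\delta_A[n]$ and $\delta_P[n]$ with STFTs $\Delta_A(t,f)$ and $\Delta_P(t,f)$ . Note that the symbols do double duty: without arguments, $W_A$ and $W_P$ name the watermark systems, whereas $W_A(t,f)$ and $W_P(t,f)$ denote the routing weights defined below. The PA-TFM watermarked spectrogram is: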

$\widetilde{X}(t,f) = X(t,f) + W_A(t,f)\, \Delta_A(t,f) + W_P(t,f)\, \Delta_P(t,f)$

$\tilde{x}[n] = \text{ISTFT}\{\widetilde{X}(t,f)\}$

The perceptual mask $m(t,f)\in[0,1]$ estimates local auditory masking via, for example, Bark-scale masking and SNR:

$T(t,f) = \sum_{f'} |X(t,f')|\, H(\mathrm{Bark}(f) - \mathrm{Bark}(f'))$

$\mathrm{SNR}(t,f) = 20\log_{10} \frac{|X(t,f)|}{T(t,f)}$

$m(t,f) = \mathrm{clip}\biggl(\frac{\mathrm{SNR}(t,f) - \theta_{\min}}{\theta_{\max}-\theta_{\min}}, 0, 1\biggr)$
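The three mask equations above can be sketched in NumPy/SciPy as follows. This is a minimal illustration, not the authors' implementation: the source does not specify the spreading function $H$, the Hz-to-Bark conversion, or the thresholds, so the Gaussian kernel, the Zwicker-Terhardt Bark approximation, and the default values of `theta_min`, `theta_max`, and `spread_bark` are all assumptions. Note that SciPy returns the STFT with shape (frequency, time).

```python
import numpy as np
from scipy.signal import stft

def hz_to_bark(f_hz):
    # Zwicker-Terhardt approximation of the Bark scale (assumed conversion).
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def perceptual_mask(x, fs=16000, nperseg=512, theta_min=-40.0, theta_max=0.0,
                    spread_bark=1.0, eps=1e-10):
    """Return (freqs, times, m) with m in [0, 1] and shape (n_freqs, n_frames)."""
    freqs, times, X = stft(x, fs=fs, nperseg=nperseg)
    mag = np.abs(X)
    bark = hz_to_bark(freqs)
    # H[f, f'] = H(Bark(f) - Bark(f')); a Gaussian kernel is an assumption.
    dz = bark[:, None] - bark[None, :]
    H = np.exp(-0.5 * (dz / spread_bark) ** 2)
    # T(t, f) = sum_{f'} |X(t, f')| H(Bark(f) - Bark(f'))
    T = H @ mag
    snr = 20.0 * np.log10((mag + eps) / (T + eps))
    m = np.clip((snr - theta_min) / (theta_max - theta_min), 0.0, 1.0)
    return freqs, times, m
```

Because the sum defining $T(t,f)$ includes the bin $f' = f$ with $H(0) = 1$ in this sketch, the SNR is never positive, which is why the assumed thresholds span a negative range.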

The STFT domain is partitioned into time slots $T_i$ and frequency bands $B_i$ for each watermark, with exclusive routing functions ensuring non-overlap:

$W_i(t,f) = \alpha_i\, m(t,f)\, \mathbf{1}_{t\in T_i}\, \mathbf{1}_{f\in B_i}$

where $\alpha_i$ sets the strength per watermark and $\sum_i \mathbf{1}_{T_i}(t) = 1$ , $\sum_i \mathbf{1}_{B_i}(f) = 1$ (the time slots partition the time axis and the bands partition the frequency axis, so no two watermarks share a tile).
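A small sketch of the routing weights for two watermarks may make the exclusivity concrete. The function name and the boolean-vector encoding of $T_A$ and $B_A$ are illustrative assumptions; $T_P$ and $B_P$ are taken as the complements, which is what the partition constraints imply for two systems.

```python
import numpy as np

def routing_weights(m, time_is_A, freq_is_A, alpha_A=1.0, alpha_P=1.0):
    """Compute W_A(t,f) and W_P(t,f) for a mask m of shape (n_freqs, n_frames).

    time_is_A: boolean vector of length n_frames, True where t is in T_A.
    freq_is_A: boolean vector of length n_freqs, True where f is in B_A.
    """
    time_is_A = np.asarray(time_is_A, dtype=bool)
    freq_is_A = np.asarray(freq_is_A, dtype=bool)
    tile_A = freq_is_A[:, None] & time_is_A[None, :]      # B_A x T_A
    tile_P = ~freq_is_A[:, None] & ~time_is_A[None, :]    # B_P x T_P
    return alpha_A * m * tile_A, alpha_P * m * tile_P
```

Under this reading of the formula, the cross tiles (frames of $T_A$ in band $B_P$, and vice versa) receive no watermark at all, so the two perturbations never overlap in either time or frequency.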

The approach can be framed as maximizing the expected detector confidence (Score) subject to a distortion constraint such as SNR or PESQ:

$\max_{\{\alpha_i\}} \sum_{i \in \{A,P\}} E_x \left[\mathrm{Score}_i(\tilde{x})\right] \quad \text{s.t.} \quad D(x,\tilde{x}) \leq \epsilon$

where $D$ measures distortion. Practically, $\alpha_i$ are hand-tuned (e.g., to SNR ≈ 17 dB and PESQ > 4.3).
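The source states only that the strengths are hand-tuned. As one hedged illustration of the SNR side of the constraint: because the composed perturbation is linear in the $\alpha_i$, scaling all strengths by a common gain scales the time-domain perturbation by the same gain, so a target SNR can be met in closed form. The helper names below are hypothetical, and PESQ would still need to be checked separately.

```python
import numpy as np

def snr_db(x, x_tilde):
    # Signal-to-noise ratio of the watermark perturbation, in dB.
    noise = x_tilde - x
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

def rescale_to_target_snr(x, x_tilde, target_snr_db):
    """Scale the perturbation (x_tilde - x) by one common gain to hit the target SNR.

    Returns the rescaled signal and the gain that should multiply every alpha_i.
    """
    gain = 10.0 ** ((snr_db(x, x_tilde) - target_snr_db) / 20.0)
    return x + gain * (x_tilde - x), gain
```

For example, calling `rescale_to_target_snr(x, x_tilde, 17.0)` returns a signal whose SNR relative to `x` is 17 dB, matching the operating point quoted above.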

3. Embedding and Detection Algorithms

Embedding Algorithm Overview

Input: x[n] (clean audio), STFT params, watermarkers W_A, W_P, strengths α_A, α_P

1. Compute STFT: X(t,f) = STFT{x}
2. Compute perceptual mask: m(t,f) via local SNR/spectral flatness
3. Define time slots T_A, T_P and frequency bands B_A, B_P
4. For each (t,f):
    W_A(t,f) = α_A * m(t,f) * 1_{t∈T_A} * 1_{f∈B_A}
    W_P(t,f) = α_P * m(t,f) * 1_{t∈T_P} * 1_{f∈B_P}
5. Compute watermarked STFT deltas:
    Δ_A(t,f) = STFT{W_A.embed(x) - x}
    (similarly for Δ_P)
6. Compose:
    X̃(t,f) = X(t,f) + W_A(t,f)Δ_A(t,f) + W_P(t,f)Δ_P(t,f)
7. Inverse STFT: x̃[n] = ISTFT{X̃(t,f)}
Output: x̃
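The embedding steps above can be sketched end to end with SciPy's STFT. Everything specific here is an assumption made for illustration: `embed_A` and `embed_P` are hypothetical callables standing in for the black-box embedders (each maps a waveform to a watermarked waveform of the same length), the tiling assigns even frames and the band below `split_hz` to A and the complement to P, and the mask defaults to all ones unless a `mask_fn` is supplied.

```python
import numpy as np
from scipy.signal import stft, istft

def pa_tfm_embed(x, embed_A, embed_P, fs=16000, nperseg=512,
                 alpha_A=1.0, alpha_P=1.0, split_hz=4000.0, mask_fn=None):
    # Step 1: STFT of the clean audio, shape (n_freqs, n_frames).
    freqs, _, X = stft(x, fs=fs, nperseg=nperseg)
    # Step 2: perceptual mask (all ones if no mask function is given).
    m = np.ones(X.shape) if mask_fn is None else mask_fn(X)
    # Step 3: hypothetical tiling (even frames and low band go to A).
    time_A = np.arange(X.shape[1]) % 2 == 0
    freq_A = freqs < split_hz
    # Step 4: mutually exclusive routing weights.
    W_A = alpha_A * m * (freq_A[:, None] & time_A[None, :])
    W_P = alpha_P * m * (~freq_A[:, None] & ~time_A[None, :])
    # Step 5: STFTs of each black-box perturbation.
    _, _, D_A = stft(embed_A(x) - x, fs=fs, nperseg=nperseg)
    _, _, D_P = stft(embed_P(x) - x, fs=fs, nperseg=nperseg)
    # Steps 6-7: compose and invert, trimming padding added by the STFT.
    X_tilde = X + W_A * D_A + W_P * D_P
    _, x_tilde = istft(X_tilde, fs=fs, nperseg=nperseg)
    return x_tilde[:len(x)]
```

With both strengths set to zero the function reduces to an STFT round trip and returns the input up to numerical precision, which is a convenient sanity check before plugging in real embedders.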

Extraction Algorithm Overview

Input: y[n] (possibly attacked audio)
1. Apply each watermark detector D_A, D_P to y[n]
2. Retrieve confidence scores s_A, s_P
3. Fuse via weighted log-odds:
    L = w_A * log(s_A / (1-s_A)) + w_P * log(s_P / (1-s_P))
4. If L > τ: declare watermark present; else absent
Output: Binary detection result
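The fusion rule in step 3 is small enough to write out directly. In this sketch the default weights, the threshold `tau`, and the clamping constant `eps` (which keeps the logarithm finite when a detector returns exactly 0 or 1) are assumptions, not values from the source.

```python
import math

def fuse_log_odds(s_A, s_P, w_A=1.0, w_P=1.0, tau=0.0, eps=1e-6):
    """Fuse two detector confidences in (0, 1) via weighted log-odds.

    Returns (decision, L), where decision is True if the watermark is declared present.
    """
    def logit(s):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0) and division by zero
        return math.log(s / (1.0 - s))

    L = w_A * logit(s_A) + w_P * logit(s_P)
    return L > tau, L
```

Because log-odds grow without bound as a score approaches 1, a single confident detector can outweigh an uncertain or mildly negative one, which fits the goal of detection succeeding when only one component watermark survives.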

4. Constituent Watermark Systems and Network Components

PA-TFM operates as a training-free framework atop existing neural watermark systems. Specifically:

AudioSeal (A): Utilizes a U-Net-based encoder–decoder generator and convolutional-transformer detector, optimized with a masking-aware loss.
PerTh (P): Employs a deep autoencoder embedder and a lightweight classifier detector, with perceptual transparency constraints.

Crucially, PA-TFM itself performs no additional training. The black-box APIs of each constituent watermark system are sufficient for both embedding and detection. Mask computation and tile routing are preprocessing steps, not learned modules.

5. Evaluation Methodology and Quantitative Performance

Dataset, Attacks, and Metrics

Dataset: LibriSpeech test-clean (2620 utterances, 16 kHz)
Attack suite (11 total):

Gaussian noise
Uniform noise
Zero-masking (time-domain dropouts)
FFT-masking (frequency-domain dropout)
Echo addition
MP3 compression (128 kbps)
Opus compression
EnCodec (neural codec)
SpeechTokenizer (tokenization + resynthesis)
DAC (Descript Audio Codec)
Room impulse response (RIR)
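Two of the simpler attacks in this list can be sketched in a few lines. The source does not give attack parameters, so the noise level, the dropout fraction, and the segment length below are illustrative assumptions, as are the function names.

```python
import numpy as np

def add_gaussian_noise(x, snr_db, rng):
    # Additive white Gaussian noise at a chosen signal-to-noise ratio (dB).
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def zero_mask(x, drop_fraction, segment_len, rng):
    # Time-domain dropout: zero out a random subset of fixed-length segments.
    y = x.copy()
    n_seg = len(x) // segment_len
    n_drop = int(round(drop_fraction * n_seg))
    for k in rng.choice(n_seg, size=n_drop, replace=False):
        y[k * segment_len:(k + 1) * segment_len] = 0.0
    return y
```

Both functions take a `numpy.random.Generator` so that an evaluation run can be made reproducible by fixing the seed.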

Metrics: PESQ (↑), STOI (↑), SNR (↑), AUC (detector, ↑), TPR@FPR=0.05 (↑)
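The two detection metrics can be computed from raw detector scores as sketched below. This is a generic formulation, not the evaluation code behind the reported numbers: the threshold is taken as an empirical quantile of the scores on unwatermarked audio, and AUC is computed as the Mann-Whitney statistic with ties counted as one half.

```python
import numpy as np

def tpr_at_fpr(scores_pos, scores_neg, fpr=0.05):
    # Threshold chosen so that roughly a fraction `fpr` of negatives exceed it.
    threshold = np.quantile(np.asarray(scores_neg, dtype=float), 1.0 - fpr)
    return float(np.mean(np.asarray(scores_pos, dtype=float) > threshold))

def auc(scores_pos, scores_neg):
    # Probability that a random positive scores above a random negative.
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
```

Here `scores_pos` holds detector outputs on watermarked (possibly attacked) audio and `scores_neg` holds outputs on unwatermarked audio. The pairwise AUC uses memory proportional to the product of the two set sizes, which is acceptable at the scale of a few thousand utterances.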

Summary Results

Method	PESQ ↑	STOI ↑	SNR (dB) ↑	AUC ↑	TPR@FPR=0.05 ↑
Single A	4.50	0.998	28.0	0.917	0.832
Single P	4.34	0.994	15.7	0.966	0.893
Parallel	4.32	0.993	15.2	0.958	0.885
Seq. A→P	4.31	0.993	15.0	0.963	0.880
Seq. P→A	4.30	0.992	15.1	0.938	0.723
FDM	4.34	0.994	16.3	0.965	0.891
TDM	4.40	0.995	16.7	0.961	0.884
PA-TFM	4.33	0.993	17.0	0.974	0.905

Figures in the source illustrate that the two constituent watermarks exhibit complementary TPR curves under varied attack strengths (e.g., Gaussian noise, RIR), and that PA-TFM distributes watermark energy so as to maximize survival across attacks (e.g., partial retrieval post-SpeechTokenizer attack).

6. Robustness Mechanisms and Ablation Insights

PA-TFM achieves superior robustness through the following mechanisms:

Complementarity: The framework routes watermark A into time–frequency regions where it is empirically robust (it decays slowly under temporal distortion but is vulnerable to spectral erasures) and routes watermark P into the regions where its complementary strengths apply. This selective allocation, modulated by $m(t,f)$ , is designed so that at least one watermark remains intact across unknown attack types.
Perceptual Mask Utilization: By localizing strong embedding to high-masking regions—auditory zones where perturbation is least perceptible—PA-TFM is able to increase $\alpha_i$ without sacrificing perceptual quality, thereby improving detector TPR relative to static schemes.
Ablation Findings: The absence of a perceptual mask leads to ~2% loss in TPR; omission of time division (frequency bands only) results in ~1.5% loss; spectral flatness–based masking yields ~0.5% lower TPR than using local-SNR.

7. Constraints, Extensions, and Deployment Considerations

Limitations

The efficacy of PA-TFM is contingent on accurate perceptual masking; misspecified $m(t,f)$ (especially on non-speech signals) can undermine robustness.
Only two watermark algorithms are considered; scaling multiplexing to more than two systems complicates mutually-exclusive tile routing.
Unknown generative attacks that precisely target the allocated tiles may bypass detection.

Extension Opportunities

Incorporating a learnable mask predictor, co-trained with a proxy detector, may further optimize allocation.
Multi-scale tiling strategies could counteract attacks operating at different granularities.
Adversarially optimizing the embedding strengths ( $\alpha_i$ ) against differentiable attack models could maximize worst-case detection performance.

Practical Deployment

PA-TFM is compatible with any pair of black-box watermark APIs; it requires no retraining, only STFT-based preprocessing and mask computation.
Real-time implementation is feasible; for 16 kHz speech, CPU/GPU runtimes are ~2× faster than real time.
Calibration of $\alpha_i$ and mask threshold parameters should be performed on development data to respect SNR/PESQ/imperceptibility constraints.

In summary, PA-TFM operationalizes the joint strengths of structurally distinct watermarking techniques, dynamically guided by psychoacoustic masking, to set a new empirical benchmark for neural audio watermarking in the face of both classical and neural reconstruction attacks (Yuan et al., 4 Nov 2025).


References

Yuan et al. (2025). Multiplexing Neural Audio Watermarks.
