Wavelet Prompt Tuning: Multi-Resolution Adaptation

Updated 5 April 2026

Wavelet Prompt Tuning is a parameter-efficient adaptation paradigm that uses discrete wavelet transforms to decompose prompt tokens into frequency subbands, enhancing multi-resolution artifact detection.
It injects wavelet-enhanced tokens into transformer layers, allowing models to selectively focus on scale-specific cues in applications like deepfake detection, speaker verification, and image restoration.
Reported evaluations show that WPT achieves lower equal error rates (EER) than vanilla prompt tuning, for example 0.16% versus 0.45% on WildSpoof, while training only the prompt-related parameters of a frozen backbone.

Wavelet Prompt Tuning (WPT) is a parameter-efficient adaptation paradigm that leverages wavelet-domain transformations to inject multi-resolution, frequency-aware inductive biases into prompt tokens for large pre-trained models. Initially developed for deepfake audio detection and subsequently extended to speaker verification and image restoration, WPT fuses ideas from classical signal processing (notably the discrete wavelet transform, DWT) with prompt-based adaptation of frozen transformer backbones. By transforming subsets of the learnable prompt tokens with wavelet decompositions before injecting them at each transformer block, WPT enables models to capture subtle, scale-specific artifacts that typically occur in synthetic or corrupted data, while requiring orders of magnitude fewer trainable parameters than standard fine-tuning.

1. Core Principles and Motivation

Wavelet Prompt Tuning extends prompt tuning to the wavelet domain. Prompt tuning is a lightweight transfer learning approach in which a small set of trainable tokens ("soft prompts") is prepended to the input sequence of a frozen transformer. While standard prompt tuning treats all prompt tokens as generic, WPT decomposes some or all prompt tokens using DWT, producing subbands (e.g., LL, LH, HL, HH in 2D Haar DWT) that explicitly encode information at different frequency scales (Xie et al., 9 Apr 2025, Farhadipour et al., 24 Jan 2026, Xuan et al., 6 Oct 2025). This aligns with the observation that many artifacts and discriminative cues (such as those in deepfakes or degraded images) are localized in particular spectral or spatial bands.

The method leverages established properties of wavelets—multi-resolution analysis and localized support—allowing the few learnable parameters to attend selectively to features corresponding to genuine or synthetic artifacts across scales. This frequency-awareness is not available to standard prompt tuning, which operates in the native (token) domain.
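To make the subband idea concrete, the following is a minimal NumPy sketch of a one-level 2D Haar DWT applied to a prompt matrix of shape (tokens, features). The function names `haar_dwt2` and `subband_energy` are illustrative and not taken from the cited papers, and the assignment of the labels LH and HL to the two mixed bands follows one common convention (conventions differ between libraries).

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One-level orthonormal 2D Haar DWT of a (tokens, features) matrix.

    Both dimensions must be even. Returns (LL, LH, HL, HH), each of shape
    (tokens // 2, features // 2).
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass along tokens and features
    lh = (a - b + c - d) / 2.0  # low-pass along tokens, high-pass along features
    hl = (a + b - c - d) / 2.0  # high-pass along tokens, low-pass along features
    hh = (a - b - c + d) / 2.0  # high-pass along both axes
    return ll, lh, hl, hh

def subband_energy(x: np.ndarray) -> dict:
    """Sum of squared coefficients in each subband."""
    names = ("LL", "LH", "HL", "HH")
    return {n: float(np.sum(band ** 2)) for n, band in zip(names, haar_dwt2(x))}

# A constant matrix is purely low-frequency; a +/-1 checkerboard is purely
# high-frequency along both axes.
smooth = np.ones((4, 8))
checker = (np.indices((4, 8)).sum(axis=0) % 2) * 2.0 - 1.0
smooth_energy = subband_energy(smooth)    # all energy in LL
checker_energy = subband_energy(checker)  # all energy in HH
```

Because the transform is orthonormal, the subband energies always sum to the energy of the input, so the split shows how a prompt's content is distributed across scales.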

2. Methodological Formulation

WPT has several closely related instantiations, varying in the implementation of the wavelet transformation, learning paradigm, and target domain.

Audio: WPT-SSL, WPT-XLSR, and WSPT-XLSR

Audio backbone: Most work employs large frozen SSL audio models, such as wav2vec2-XLSR (24 transformer layers, hidden size $d=1024$ ).
Prompt allocation: At each transformer layer $k = 1 \dots \ell$ , a learnable prompt matrix $P_k \in \mathbb{R}^{p \times d}$ is introduced. In WPT, a subset ( $w$ of $p$ ) of the prompt tokens is subjected to a wavelet transform.
Discrete Wavelet Transform: The prompt sub-matrix or vector is decomposed via DWT. In (Xie et al., 9 Apr 2025), 2D Haar wavelet transforms are applied to both the token and feature dimensions, yielding four subbands $T_{LL}, T_{LH}, T_{HL}, T_{HH}$ . Their concatenation forms the set of wavelet-enhanced prompt tokens.
Partial Wavelet Prompting (Partial-WSPT-XLSR): Only the last $m < p$ prompt tokens are wavelet-enhanced, promoting parameter efficiency while maintaining access to generic prompts (Xuan et al., 6 Oct 2025).
Learnable vs. Fixed Wavelet Filters: While early WPT variants use fixed Haar or Daubechies-4 filters, learnable analysis ( $F_0, F_1$ ) and synthesis ( $H_0, H_1$ ) filters can be adopted, with learnable DWT shown to further enhance discriminative power for artifact localization (Xuan et al., 6 Oct 2025).
Wavelet-Domain Sparsification: Stochastic or learnable binary masks $M$ (with a parameterizable sparsity level) can be applied to wavelet coefficients to sparsify wavelet prompts (Xuan et al., 6 Oct 2025); this regulates capacity and injects a structural prior.
Token Fusion and Insertion:
- In WSPT-XLSR, all prompt tokens are wavelet-processed; in Partial-WSPT-XLSR, enhanced and untouched tokens are fused.
- Prompt tokens are prepended at each transformer layer.
- Post-layer, the prompt outputs are discarded/replaced with the original prompts.
Backward Compatibility: The main SSL backbone and tokenization remain frozen, maintaining inference efficiency and preserving generalization.
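The prompt allocation, partial waveletization, and prepend-then-discard insertion described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: `toy_frozen_layer` is a hypothetical stand-in for a frozen transformer layer, the function names are invented for illustration, and tiling the four subbands into one block of the original prompt shape is an assumed layout for the "concatenation" step.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT over the (tokens, features) axes."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a - b + c - d) / 2.0,
            (a + b - c - d) / 2.0, (a - b - c + d) / 2.0)

def build_layer_prompts(prompts, m):
    """Keep the first p - m prompts untouched; wavelet-enhance the last m (m even)."""
    p = prompts.shape[0]
    plain, selected = prompts[: p - m], prompts[p - m:]
    ll, lh, hl, hh = haar_dwt2(selected)
    # Assumed layout: tile the four (m/2, d/2) subbands into one (m, d) block.
    enhanced = np.block([[ll, lh], [hl, hh]])
    return np.concatenate([plain, enhanced], axis=0)

def toy_frozen_layer(tokens):
    """Hypothetical frozen layer: mixes information across all tokens."""
    return tokens + tokens.mean(axis=0, keepdims=True)

def forward_with_prompts(x, layer_prompts, m, layer=toy_frozen_layer):
    """Prepend fresh prompts at every layer and drop their outputs afterwards."""
    for prompts in layer_prompts:
        p = prompts.shape[0]
        tokens = np.concatenate([build_layer_prompts(prompts, m), x], axis=0)
        x = layer(tokens)[p:]  # discard the p prompt outputs
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                                   # 6 input tokens, d = 8
layer_prompts = [rng.normal(size=(4, 8)) for _ in range(3)]   # p = 4, 3 layers
y = forward_with_prompts(x, layer_prompts, m=2)               # same shape as x
```

In a real setup only `layer_prompts` would receive gradients; the layer weights stay fixed, which is what keeps the number of trainable parameters small.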

Image: Wavelet Prompt Block (WPB)

In all-in-one image restoration, the Wavelet Prompt Block fits into U-Net style encoder-decoder architectures (Wang et al., 4 Mar 2026).

Skip Connection Injection: The WPB operates on encoder-decoder skip links. It applies a 1-level Haar DWT, injects prompts (as learnable subband-specific maps) into the high-frequency bands (LH, HL, HH) via Spatial Feature Transform (SFT), and reweights the bands via a degradation-based estimator (DWE).
Inverse Wavelet Reconstruction: The prompted, subband-weighted features are recombined by IDWT before decoder consumption.
Causal Deconfounding: The WPB, in conjunction with DWE, realizes back-door adjustment for causal deconfounding, mitigating spurious correlations between semantic and degradation features.
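For reference, the standard back-door adjustment from causal inference has the form below. The symbols $X$ (input), $Y$ (output), and $Z$ (confounder) are generic; how CWP-Net maps them onto semantic and degradation features is not specified in this article, so any such reading is an assumption.

$$P\big(Y \mid \mathrm{do}(X)\big) = \sum_{z} P\big(Y \mid X, Z = z\big)\, P(Z = z)$$

The sum over $z$ replaces the observed conditional $P(Y \mid X)$ with an average over confounder values weighted by their marginal probability, which removes the spurious path through $Z$.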

3. Mathematical Details

The operational pipeline in the audio domain can be summarized:

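In the notation of Section 2, let $P_k^{(w)} \in \mathbb{R}^{w \times d}$ denote the $w$ prompt tokens of layer $k$ selected for waveletization, $P_k^{(p-w)}$ the remaining untouched tokens, and $X_{k-1}$ the hidden states entering layer $k$.

Wavelet Transformation (on token axis or prompt matrix):

$$(T_{LL}, T_{LH}, T_{HL}, T_{HH}) = \mathrm{DWT}\big(P_k^{(w)}\big)$$

Prompt Construction:

$$\tilde{P}_k = \big[\, P_k^{(p-w)};\ \mathrm{Concat}(T_{LL}, T_{LH}, T_{HL}, T_{HH}) \,\big]$$

Masking (WDS, Optional):

$$\hat{T}_b = M_b \odot T_b, \quad b \in \{LL, LH, HL, HH\}$$

where $M_b$ is the binary mask $M$ restricted to subband $b$.

Inverse Transformation (if necessary):

$$\hat{P}_k^{(w)} = \mathrm{IDWT}\big(\hat{T}_{LL}, \hat{T}_{LH}, \hat{T}_{HL}, \hat{T}_{HH}\big)$$

When masking and inversion are used, $\hat{P}_k^{(w)}$ takes the place of the concatenated subbands in $\tilde{P}_k$.

Prompt Injection at Layer $k$:

$$[\,\cdot\,;\ X_k] = \mathrm{Layer}_k\big([\, \tilde{P}_k;\ X_{k-1} \,]\big)$$

where the first slot holds the prompt outputs, which are discarded before layer $k+1$.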
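The masking and inverse-transformation steps can be illustrated with a small NumPy sketch. It uses a stochastic Bernoulli mask with a hypothetical `keep_ratio` parameter as one possible reading of "parameterizable sparsity"; a learnable mask would replace the random draw. The function names are illustrative and not taken from the cited work.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT over the (tokens, features) axes."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a - b + c - d) / 2.0,
            (a + b - c - d) / 2.0, (a - b - c + d) / 2.0)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (tokens, features) matrix."""
    rows, cols = ll.shape
    x = np.empty((2 * rows, 2 * cols), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def sparsify_prompts(prompts, keep_ratio, rng):
    """Mask wavelet coefficients at random, then map back to the token domain."""
    coeffs = np.stack(haar_dwt2(prompts))          # shape (4, w/2, d/2)
    mask = rng.random(coeffs.shape) < keep_ratio   # binary mask M
    masked = coeffs * mask                         # elementwise M * T
    return haar_idwt2(*masked), mask

rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 8))
sparse_prompts, mask = sparsify_prompts(prompts, keep_ratio=0.5, rng=rng)
```

With `keep_ratio=1.0` the mask is all ones and the prompts are recovered exactly, which is a convenient sanity check that the transform pair is consistent.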

In the image domain, WPB operates as:

1-level DWT on encoder features,
learnable prompt map injection (high-frequency bands only, via weighted combinations and SFT),
degradation-based band reweighting,
IDWT to reconstruct prompted skip feature.
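The four image-domain steps can be sketched in NumPy as follows. This is a simplified illustration under assumptions: the SFT step is written in its usual scale-and-shift form, the prompt maps `scales` and `shifts` and the per-band weights `band_weights` are hypothetical inputs (in the described method the weights would come from the degradation-based estimator), and reweighting all four bands including LL is an assumed choice.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT over the last two (H, W) axes."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a - b + c - d) / 2.0,
            (a + b - c - d) / 2.0, (a - b - c + d) / 2.0)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 over the last two axes."""
    x = np.empty(ll.shape[:-2] + (2 * ll.shape[-2], 2 * ll.shape[-1]))
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def wavelet_prompt_block(feat, scales, shifts, band_weights):
    """feat: (C, H, W); scales, shifts: (3, C, H/2, W/2); band_weights: (4,)."""
    ll, lh, hl, hh = haar_dwt2(feat)                        # step 1: DWT
    high = [g * band + s                                    # step 2: SFT-style
            for band, g, s in zip((lh, hl, hh), scales, shifts)]
    bands = [w * band                                       # step 3: reweighting
             for w, band in zip(band_weights, [ll] + high)]
    return haar_idwt2(*bands)                               # step 4: IDWT

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 4, 6))                 # skip feature, C=2, H=4, W=6
scales = np.ones((3, 2, 2, 3))                    # identity modulation
shifts = np.zeros((3, 2, 2, 3))
out = wavelet_prompt_block(feat, scales, shifts, np.ones(4))   # equals feat
low_only = wavelet_prompt_block(feat, scales, shifts,
                                np.array([1.0, 0.0, 0.0, 0.0]))
```

Setting the three high-frequency weights to zero leaves only the LL band, so `low_only` is constant within each 2x2 block, which shows how the band weights control how much fine detail reaches the decoder.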

All methods consistently keep the main backbone parameters frozen and learn only the wavelet prompt tokens (and, where applicable, the filter/mask parameters or prompt maps).
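As a rough illustration of the trainable-parameter budget, the prompt parameters alone amount to one $p \times d$ matrix per layer. In the sketch below the layer count and hidden size are those stated for wav2vec2-XLSR, while the prompt length `PROMPTS_PER_LAYER = 10` and the backbone size of 300 million parameters are assumed values chosen only for the arithmetic; filter, mask, and back-end classifier parameters are not counted.

```python
def prompt_param_count(num_layers: int, prompts_per_layer: int, hidden_size: int) -> int:
    """Trainable prompt parameters: one (p x d) matrix per transformer layer."""
    return num_layers * prompts_per_layer * hidden_size

NUM_LAYERS = 24                   # from the text (wav2vec2-XLSR)
HIDDEN_SIZE = 1024                # from the text
PROMPTS_PER_LAYER = 10            # hypothetical value of p
ASSUMED_BACKBONE_PARAMS = 300e6   # assumed round figure, not from the text

trainable = prompt_param_count(NUM_LAYERS, PROMPTS_PER_LAYER, HIDDEN_SIZE)
fraction = trainable / ASSUMED_BACKBONE_PARAMS

print(trainable)           # 245760
print(f"{fraction:.4%}")   # share of the assumed backbone size
```

Under these assumptions the prompts account for well under 0.1% of the backbone, which is the sense in which the approach is parameter-efficient.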

4. Empirical Evaluation and Performance

WPT consistently yields strong gains in parameter efficiency and generalization across detection tasks characterized by multi-resolution artifacts.

Speech Deepfake Detection:
- WPT-XLSR-AASIST outperforms PT and FT baselines in cross-type deepfake detection with an average EER of 3.58% vs. PT (6.74%) and FT (4.98%) (Xie et al., 9 Apr 2025).
- In (Xuan et al., 6 Oct 2025), Partial-WSPT-XLSR achieves EER of 10.58% on Deepfake-Eval-2024 and 0.13% on SpoofCeleb; the full WaveSP-Net (with Mamba back-end) surpasses XLSR-1B at 1.298% parameter budget.
Speaker Verification—Spoofing-Aware:
- On WildSpoof, WPT-XLSR-AASIST reduces EER from 0.45% (vanilla prompts) to 0.16% (Farhadipour et al., 24 Jan 2026).
- Cross-domain robustness is enhanced: out-of-domain EER decreases from 0.75% to 0.40%.
Image Restoration:
- Causal-deconfounding Wavelet-Disentangled Prompt Network (CWP-Net) deploying WPB delivers superior AiOIR performance relative to state-of-the-art methods, as validated on diverse benchmark settings (Wang et al., 4 Mar 2026).
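To put the audio EER figures quoted above on a common footing, the relative reductions can be computed directly from the reported pairs. The snippet only does arithmetic on the numbers in this section; the dictionary labels are informal descriptions, not names from the cited papers.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline - improved) / baseline

# (baseline EER %, WPT EER %) pairs as reported in this section.
reported = {
    "cross-type deepfake detection, PT vs. WPT": (6.74, 3.58),
    "cross-type deepfake detection, FT vs. WPT": (4.98, 3.58),
    "WildSpoof, vanilla prompts vs. WPT": (0.45, 0.16),
    "out-of-domain speaker verification": (0.75, 0.40),
}

reductions = {name: relative_reduction(b, w) for name, (b, w) in reported.items()}
for name, r in reductions.items():
    print(f"{name}: {r:.1%} relative EER reduction")
```

These work out to roughly 47%, 28%, 64%, and 47% relative reductions, respectively.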

The ablation studies confirm that wavelet prompt tokens, particularly the highest-frequency (HH) subband, consistently dominate discriminative attention in real-vs-fake settings (Xie et al., 9 Apr 2025). Learnable analysis/synthesis filters, prompt sparsification, and subband weighting further improve effectiveness.

5. Architectural Variants and Extensions

Variant	Domain	Waveletization Scope
WPT-SSL (PT-SSL with Wavelets)	Audio (ADD)	Fixed Haar, tokens
WSPT-XLSR	Audio (DD)	Learnable, all tokens
Partial-WSPT-XLSR	Audio (DD)	Learnable, last $m$ tokens
WPB in CWP-Net	Image Restoration	Skip connection, subbands

Learnable Filters vs. Fixed: Earlier works use fixed Haar/Daubechies DWT; later ones learn analysis/synthesis filters.
Partial vs. Full Prompt Waveletization: Tuning the number of wavelet-enhanced tokens ( $m$ in Partial-WSPT-XLSR) controls parameter efficiency and expressivity.
Sparsification: Introduced as Wavelet-Domain Sparsification (WDS), it allows only a fraction of the wavelet coefficients to influence the model, acting as a regularizer and further reducing the number of active parameters (Xuan et al., 6 Oct 2025).
Wavelet Prompt Block (WPB) for Images: Extends WPT to visual data, with prompt maps injected into wavelet subbands of feature maps (Wang et al., 4 Mar 2026).
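The difference between fixed and learnable filters can be illustrated with a two-channel filter bank along the token axis, written in block (polyphase) form for length-2 filters. This is a sketch under assumptions: the filter length, the function names, and the idea of deriving synthesis filters by inverting the analysis matrix are illustrative choices, and the cited work may parameterize or constrain its filters $F_0, F_1, H_0, H_1$ differently.

```python
import numpy as np

def analyze(x, f0, f1):
    """Length-2 analysis filters applied to token pairs, downsampling by 2."""
    even, odd = x[0::2], x[1::2]
    return f0[0] * even + f0[1] * odd, f1[0] * even + f1[1] * odd

def synthesize(low, high, h0, h1):
    """Length-2 synthesis filters, upsampling by 2."""
    x = np.empty((2 * low.shape[0],) + low.shape[1:])
    x[0::2] = h0[0] * low + h1[0] * high
    x[1::2] = h0[1] * low + h1[1] * high
    return x

def matching_synthesis(f0, f1):
    """Synthesis filters that exactly invert the given analysis pair."""
    inv = np.linalg.inv(np.array([f0, f1]))
    return inv[:, 0], inv[:, 1]

s = 2 ** -0.5
haar_f0, haar_f1 = np.array([s, s]), np.array([s, -s])   # fixed Haar filters

rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 8))

# Fixed Haar: analysis and synthesis share the same filters.
haar_rec = synthesize(*analyze(prompts, haar_f0, haar_f1), haar_f0, haar_f1)

# "Learned" analysis filters, simulated here by a small perturbation.
f0, f1 = haar_f0 + 0.1, haar_f1 - 0.05
stale_rec = synthesize(*analyze(prompts, f0, f1), haar_f0, haar_f1)
h0, h1 = matching_synthesis(f0, f1)
exact_rec = synthesize(*analyze(prompts, f0, f1), h0, h1)

haar_error = float(np.abs(haar_rec - prompts).max())    # numerically zero
stale_error = float(np.abs(stale_rec - prompts).max())  # clearly nonzero
exact_error = float(np.abs(exact_rec - prompts).max())  # numerically zero
```

The sketch shows that once the analysis filters move away from Haar, the synthesis filters have to move with them if the round trip is to stay lossless, which is why both sets are treated as learnable.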

6. Applications and Interpretability

Audio Deepfake Detection and Speaker Verification: WPT families (WPT-SSL, WPT-XLSR-AASIST, WSPT-XLSR) provide robust detection by adapting to spectral cues left by synthesis artifacts—a context where classical frequency filters provide direct interpretability (Xie et al., 9 Apr 2025, Xuan et al., 6 Oct 2025, Farhadipour et al., 24 Jan 2026).
Image Restoration: WPB in CWP-Net utilizes prompt tokens for causal deconfounding, adjusting restoration based on estimated degradation type without spurious feature entanglement (Wang et al., 4 Mar 2026).
Type-invariance in Audio: WPT enables models to map multi-type audio (speech, singing, music, sounds) deepfakes into a common authenticity-based space, erasing confounding type-specific clusters present in backbone embeddings (Xie et al., 9 Apr 2025).
Ablation Insights: The HH band prompts dominate in discriminative attention for fake detection across types; omitting waveletization or sparsification degrades performance (Xie et al., 9 Apr 2025, Xuan et al., 6 Oct 2025).

7. Limitations and Future Directions

Domain Adaptation: While WPT strengthens multi-resolution artifact sensitivity, cross-domain generalization remains challenging. Integration with adversarial or meta-learning strategies is proposed (Farhadipour et al., 24 Jan 2026).
Learnable Wavelet Bases: Most WPT variants use fixed wavelet bases; parameterizing the mother wavelet or filter structure is an open area, with early learnable-filter results indicating benefits (Xuan et al., 6 Oct 2025).
Prompt Placement: Optimal placement of WPT tokens—layer selection and mapping to scales—remains empirical. Automation via attention modules or dynamic routing is a plausible research direction (Farhadipour et al., 24 Jan 2026).
Extension to New Domains: The extension from audio to image domains in CWP-Net suggests a domain-agnostic framework for multi-resolution prompt-based adaptation.

References: