WaveSP-Net: Wavelet Prompt Tuning
- The paper demonstrates that integrating wavelet transforms with prompt tuning significantly enhances artifact detection and restoration performance while updating only a minimal fraction of parameters.
- WaveSP-Net is a framework that injects multiresolution, frequency-localized prompt embeddings via both discrete and continuous wavelet constructions to capture subtle signal variations.
- Experimental results reveal that WaveSP-Net outperforms vanilla prompt tuning in audio deepfake detection, speaker verification, and image restoration while maintaining a highly efficient parameter budget.
Wavelet Prompt Tuning is a family of methodologies for parameter-efficient adaptation of frozen deep neural models, in which learnable prompt embeddings are structured or injected via wavelet transforms to encode explicit multiresolution, frequency-localized information. The approach extends classical prompt tuning—which prepends or inserts trainable tokens into transformer-based models—by introducing wavelet-domain structure or fusion at the token or feature level. This enables detection and characterization of subtle, scale-dependent artifacts (as in deepfake audio or image restoration) while updating only a small fraction of the total model parameters, often using no learnable convolution or wavelet filter weights. Key variants include discrete or continuous wavelet prompt constructions, with applications spanning audio deepfake detection, speaker verification, and all-in-one image restoration (Xuan et al., 6 Oct 2025, Farhadipour et al., 24 Jan 2026, Wang et al., 4 Mar 2026, Xie et al., 9 Apr 2025).
1. Foundations of Prompt Tuning and Wavelet Representations
Prompt tuning, in the context of large frozen language or audio models, typically learns a small set of input tokens or intermediate embeddings (“prompts”) while keeping pre-trained model weights fixed. Vanilla prompt tuning treats all token dimensions and positions in a flat and non-frequency-aware manner. In many tasks, however—including detection of synthetic media artifacts—task-critical features span both temporal and frequency dimensions at multiple resolutions.
The wavelet transform serves as a multiresolution analysis tool, decomposing signals (1D or 2D) into components corresponding to distinct frequency bands and locations. In the context of deep models, injecting prompt embeddings determined by or filtered through wavelet transforms provides explicit, localized access to both coarse and fine-grained variations in the input or intermediate representations (Xie et al., 9 Apr 2025, Xuan et al., 6 Oct 2025, Wang et al., 4 Mar 2026).
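As a minimal illustration of the multiresolution idea, the sketch below applies a multi-level orthonormal Haar DWT to a toy 1D signal. The function names (`haar_dwt1`, `haar_multilevel`) and the toy signal are illustrative assumptions, not code from the cited works.

```python
import numpy as np

def haar_dwt1(x):
    """One level of the orthonormal Haar DWT along the last axis (even length)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: scaled local sums
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: scaled local differences
    return approx, detail

def haar_multilevel(x, levels):
    """Return [approx_coarsest, detail_coarsest, ..., detail_finest]."""
    details = []
    approx = x
    for _ in range(levels):
        approx, d = haar_dwt1(approx)
        details.append(d)
    return [approx] + details[::-1]

# Toy signal: a slow ramp with one sharp glitch at index 5.
signal = np.arange(8, dtype=float)
signal[5] += 4.0
coeffs = haar_multilevel(signal, levels=3)

# The finest detail band localizes the glitch to the sample pair (4, 5).
glitch_pair = int(np.argmax(np.abs(coeffs[-1])))
```

The coarse coefficients summarize the overall trend, while the finest detail band isolates the glitch in both scale and position, which is the property wavelet prompts aim to expose to a frozen model.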
2. Core Methods: Wavelet Prompt Tuning Variants
Several implementations of Wavelet Prompt Tuning have emerged:
- Discrete Wavelet Prompt Tuning (WPT-SSL, WPT-XLSR-AASIST): A subset of prompt tokens per transformer layer is decomposed using the Discrete Wavelet Transform (DWT), usually with fixed Haar filters, to form localized sub-band tokens (LL, LH, HL, HH). These wavelet prompt tokens can be prepended to each transformer layer alongside conventional (vanilla) prompts (Xie et al., 9 Apr 2025).
- Wavelet-Domain Sparse Prompt Tuning (WSPT): In WSPT-XLSR and Partial-WSPT-XLSR, a portion of each prompt token matrix ($P^{(l)}$) at transformer layer $l$ is processed through learnable analysis wavelet filters (for low- and high-pass decomposition), followed by sparsification via stochastic or learnable binary masks. The enhanced tokens are then reconstructed by learnable synthesis filters, providing both locality and parameter efficiency (Xuan et al., 6 Oct 2025).
- Wavelet Prompt Block in Vision: In image restoration, the Wavelet Prompt Block (WPB) performs a one-level DWT on skip features in a U-Net backbone, synthesizes high-frequency prompt maps for the LH, HL, HH subbands, merges them via spatial feature transforms, and finally reconstructs the feature map using the inverse DWT. Reweighting guided by a degradation-based estimator provides adaptation to specific types and degrees of image corruption (Wang et al., 4 Mar 2026).
- Continuous Wavelet Construction: Some variants, especially in theoretical treatments, refer to the Continuous Wavelet Transform (CWT) and subsequently discretize for implementation, ensuring dyadic multiresolution coverage and orthogonal sub-band representations (Farhadipour et al., 24 Jan 2026).
The following table summarizes the principal design elements of Wavelet Prompt Tuning variants:
| Variant | Wavelet Transform | Learnable Filters | Sparsification |
|---|---|---|---|
| WPT-SSL, WPT-XLSR-AASIST | Fixed Haar DWT | No | No |
| WSPT-XLSR | Learnable DWT | Yes | Yes (mask) |
| Partial-WSPT-XLSR | Learnable DWT | Yes | Yes (partial mask) |
| Vision WPB | Fixed Haar (1-level) | Yes (prompt maps) | No |
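The Wavelet Prompt Block row can be made concrete with a rough NumPy sketch: a one-level Haar DWT on a skip feature map, modulation of the three high-frequency subbands by prompt maps, and an inverse DWT. Several parts are assumptions made only for illustration: the scale-and-shift form used for the spatial feature transform, the per-subband scalar `weights` standing in for the degradation-based estimator, and all names and shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT over the last two axes (even sizes)."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a + b - c - d) / 2.0,
            (a - b + c - d) / 2.0, (a - b - c + d) / 2.0)

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    out = np.empty(ll.shape[:-2] + (2 * ll.shape[-2], 2 * ll.shape[-1]))
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[..., 0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[..., 1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def wavelet_prompt_block(skip, prompts, weights):
    """skip: (C, H, W); prompts: {'lh'|'hl'|'hh': (gamma, beta)}; weights: 3 scalars."""
    ll, lh, hl, hh = haar_dwt2(skip)
    modulated = []
    for band, name, w in zip((lh, hl, hh), ("lh", "hl", "hh"), weights):
        gamma, beta = prompts[name]
        # Assumed SFT-style scale and shift, reweighted per subband.
        modulated.append(band * (1.0 + w * gamma) + w * beta)
    return haar_idwt2(ll, *modulated)   # the low-frequency band passes through unchanged

skip = rng.standard_normal((2, 4, 6))                     # hypothetical skip features
prompts = {name: (rng.standard_normal((2, 2, 3)), rng.standard_normal((2, 2, 3)))
           for name in ("lh", "hl", "hh")}
weights = np.array([0.5, 0.3, 0.2])                       # stand-in for estimator output
restored = wavelet_prompt_block(skip, prompts, weights)
```

With all weights set to zero the block reduces to the identity, so in this sketch the prompts act purely as a learned correction on the high-frequency content.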
3. Mathematical Formulation
Discrete Wavelet Decomposition of Prompts
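Consider a frozen transformer (e.g., XLSR) with $L$ layers. Each layer $l$ receives a set of learnable prompt tokens $P^{(l)} \in \mathbb{R}^{N \times d}$ (vanilla prompt). In wavelet prompt tuning:
- Wavelet Processing: For the $N_w$ wavelet tokens, the corresponding sub-matrix $P_w^{(l)}$ is subject to a 1D DWT using low- and high-pass analysis filters $h$ and $g$. The wavelet coefficients are:
$$c_A = \left(P_w^{(l)} * h\right)\downarrow 2, \qquad c_D = \left(P_w^{(l)} * g\right)\downarrow 2,$$
concatenated to form $C = [c_A;\, c_D]$ (Xuan et al., 6 Oct 2025).
- Sparsification: A binary mask $M$ (random or learnable) with the same shape as $C$ is applied to enforce parameter-efficient updates:
$$\tilde{C} = M \odot C$$
- Reconstruction: The masked coefficients $\tilde{C}$ are inverted by convolution with synthesis filters $\tilde{h}$ and $\tilde{g}$ to yield $N_w$ enhanced prompt tokens $\hat{P}_w^{(l)}$, then merged with vanilla prompts depending on the variant (full or partial).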
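The three steps above (analysis, sparsification, reconstruction) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: filters have length 2 and are initialized to Haar, filtering runs along the embedding axis with stride 2, the mask is random, and all sizes and names are hypothetical. In a learnable variant the four filter vectors would be trained along with the prompts.

```python
import numpy as np

rng = np.random.default_rng(0)

def analysis(p, h, g):
    """Stride-2 filtering with length-2 filters along the last axis."""
    pairs = p.reshape(*p.shape[:-1], -1, 2)        # (..., d/2, 2)
    return pairs @ h, pairs @ g                    # low-pass, high-pass coefficients

def synthesis(c_low, c_high, h_s, g_s):
    """Inverts `analysis` when (h_s, g_s) match the analysis filters (Haar here)."""
    pairs = c_low[..., None] * h_s + c_high[..., None] * g_s
    return pairs.reshape(*c_low.shape[:-1], -1)

# Haar initialization of analysis and synthesis filters.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)
h_s, g_s = h.copy(), g.copy()

n_wavelet, d_model = 4, 16                         # hypothetical sizes
p_w = rng.standard_normal((n_wavelet, d_model))    # wavelet prompt tokens

# 1) Wavelet processing: decompose and concatenate the coefficients.
c_low, c_high = analysis(p_w, h, g)
coeffs = np.concatenate([c_low, c_high], axis=-1)

# 2) Sparsification: keep a random subset of coefficients.
keep_ratio = 0.1                                   # expected fraction of active coefficients
mask = (rng.random(coeffs.shape) < keep_ratio).astype(coeffs.dtype)
masked = mask * coeffs

# 3) Reconstruction: map masked coefficients back to token space.
m_low, m_high = np.split(masked, 2, axis=-1)
p_hat = synthesis(m_low, m_high, h_s, g_s)         # enhanced prompt tokens
```

Without the mask, the Haar analysis and synthesis pair reconstructs the input exactly, so any change in `p_hat` comes from the sparsification step alone.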
DWT on Prompt Matrices
In WPT-SSL (Xie et al., 9 Apr 2025):
- 2D DWT (fixed Haar filters) is applied to a wavelet prompt matrix $P_w^{(l)}$, resulting in four subbands $\{LL, LH, HL, HH\}$. Each sub-band is re-embedded and concatenated for input at layer $l$.
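A minimal sketch of this decomposition is given below. The fixed Haar transform follows the description above; the re-embedding step is an assumption (a single shared linear map, `w_embed`, back to the model width), since the source does not specify its form. Sub-band naming conventions for LH and HL differ between libraries.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT of a matrix with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-pass in both directions
    lh = (a + b - c - d) / 2.0            # difference between adjacent rows
    hl = (a - b + c - d) / 2.0            # difference between adjacent columns
    hh = (a - b - c + d) / 2.0            # diagonal difference
    return ll, lh, hl, hh

n_wavelet, d_model = 4, 16                                # hypothetical sizes
p_w = rng.standard_normal((n_wavelet, d_model))           # wavelet prompt matrix
subbands = haar_dwt2(p_w)                                 # four (2, 8) matrices

# Hypothetical re-embedding: one shared linear map back to the model width.
w_embed = rng.standard_normal((d_model // 2, d_model)) / np.sqrt(d_model // 2)
wavelet_tokens = np.concatenate([s @ w_embed for s in subbands], axis=0)
```

Because each subband has half as many rows and columns as the prompt matrix, concatenating the four re-embedded subbands along the token axis yields twice as many tokens as the original matrix had in this sketch.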
Injection Strategy
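At each transformer layer $l$, the input is constructed as:
$$X^{(l)} = \left[\,P_v^{(l)};\; P_w^{(l)};\; H^{(l-1)}\,\right]$$
where $P_v^{(l)}$ and $P_w^{(l)}$ are the vanilla and wavelet prompt tokens and $H^{(l-1)}$ denotes the hidden features from the previous layer, and processed through the frozen block $f^{(l)}$. The prompt outputs are not propagated deeper; prompt tokens are re-inserted at the input of every layer.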
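The injection loop can be sketched as below. The `frozen_block` function is a toy stand-in (fixed weights, a single softmax attention step) rather than a real transformer layer, and the sketch assumes a separate prompt set per layer; sharing one set across layers would only change the list indexing. All sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_block(x, w):
    """Toy frozen layer: softmax self-attention plus a residual connection."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return x + attn @ x @ w

num_layers, seq_len, d_model = 3, 10, 8                    # hypothetical sizes
n_vanilla, n_wavelet = 2, 4
n_prompt = n_vanilla + n_wavelet

weights = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(num_layers)]                     # frozen, never updated
p_vanilla = [rng.standard_normal((n_vanilla, d_model)) for _ in range(num_layers)]
p_wavelet = [rng.standard_normal((n_wavelet, d_model)) for _ in range(num_layers)]

hidden = rng.standard_normal((seq_len, d_model))           # features entering layer 0
for layer in range(num_layers):
    # Prepend this layer's prompts to the current hidden sequence.
    x = np.concatenate([p_vanilla[layer], p_wavelet[layer], hidden], axis=0)
    out = frozen_block(x, weights[layer])
    # Drop the prompt positions; only the sequence features move on.
    hidden = out[n_prompt:]
```

The prompts still influence `hidden` through attention inside each block, even though their own output positions are discarded before the next layer.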
4. Empirical Results and Application Domains
Wavelet Prompt Tuning exhibits robust empirical performance for deepfake detection and image restoration:
- Audio Deepfake Detection: WPT-XLSR-AASIST achieves an average EER of 3.58% in all-type (speech, sound, singing, music) audio deepfake detection, outperforming both full fine-tuning (4.98% EER) and vanilla prompt tuning (6.74% EER) while training about 0.2% of the parameters updated in full fine-tuning and converging in a fraction of the epochs (Xie et al., 9 Apr 2025). On SpoofCeleb, WPT achieves an EER of 0.16%, a 64% reduction compared to generic prompt tuning (Farhadipour et al., 24 Jan 2026).
- Speech Deepfake Detection: WaveSP-Net, employing Partial-WSPT-XLSR plus a bidirectional Mamba back-end, attains 10.58% EER on Deepfake-Eval-2024 and 0.13% on SpoofCeleb, both outperforming state-of-the-art baselines while updating only about 1.3% of the XLSR parameter count. Ablations confirm critical contributions by learnable wavelet filters and sparsification (Xuan et al., 6 Oct 2025).
- Image Restoration: The Wavelet Prompt Block in CWP-Net for all-in-one image restoration injects learnable prompt maps into the high-frequency DWT subbands at each skip connection of a U-Net. This leads to superior restoration fidelity by disentangling degradation and semantic features, compensating for unknown degradation patterns (Wang et al., 4 Mar 2026).
- Speaker Verification: In spoofing-aware speaker verification, incorporating wavelet prompt tuning into the anti-spoofing subsystem within a multi-model ensemble framework produces competitive SASV EER and macro a-DCF, with substantial improvement in cross-domain generalization compared to standard prompt approaches (Farhadipour et al., 24 Jan 2026).
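Most of the figures above are equal error rates (EER), the operating point at which the false acceptance and false rejection rates coincide. The sketch below shows one simple way to estimate it from detector scores; the function name, the score convention (higher means more likely bona fide), and the toy scores are assumptions for illustration, not the evaluation code of the cited works.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by scanning every observed score as a threshold."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: bona fide samples scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance: spoofed samples scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))      # threshold where the two rates are closest
    return (far[idx] + frr[idx]) / 2.0

bonafide = np.array([0.2, 0.6, 0.8, 0.9])   # toy scores
spoof = np.array([0.1, 0.3, 0.4, 0.7])
eer = equal_error_rate(bonafide, spoof)
```

On these toy scores, one bona fide sample in four is rejected and one spoofed sample in four is accepted at the balancing threshold, so the estimate is 25%.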
5. Design Considerations, Hyperparameters, and Ablation Insights
Token Allocation and Placement: The numbers of vanilla and wavelet prompt tokens per layer, and whether prompts are inserted into every transformer layer or only the upper ones (PT-DEEP schema), are established empirically (Xie et al., 9 Apr 2025, Farhadipour et al., 24 Jan 2026).
Wavelet Basis: Most studies fix the DWT to Haar or Daubechies wavelets. There is evidence that learnable wavelet bases further improve performance, e.g., replacing fixed Haar filters with trainable analysis/synthesis kernels, especially when combined with sparsification (Xuan et al., 6 Oct 2025).
Sparsification: Applying binary masks (random or learnable) over wavelet coefficients enables extremely low parameter budgets while preserving discriminative capacity; the best-performing setting keeps 10% of the coefficients active (Xuan et al., 6 Oct 2025).
Prompt-Subband Assignment: Manually mapping prompt tokens to sub-band or scale remains the default; automated assignment via attention is suggested as an open direction (Farhadipour et al., 24 Jan 2026). High-frequency (e.g., HH) prompts consistently dominate for artifact discrimination.
Regularization and Optimization: The Adam optimizer, moderate learning rates, and weight decay are standard. Training converges rapidly (10–20 epochs for WPT-SSL) compared to full fine-tuning (>50 epochs) (Xie et al., 9 Apr 2025).
Ablation Highlights:
- Ablations removing wavelet structure, sparsification, or learnable filters each lead to distinct EER increases (1–6%), supporting each component’s necessity (Xuan et al., 6 Oct 2025).
- Fixed vs. learnable filters show substantial gaps (e.g., EER 16.55% for fixed, 10.58% for learnable wavelet in WSPT) (Xuan et al., 6 Oct 2025).
- Optimal token numbers and assignment schemas are non-trivial; too few or too many tokens degrade performance (Xie et al., 9 Apr 2025).
6. Limitations and Future Directions
Domain Generalization: Despite enhanced artifact sensitivity and cross-type invariance, out-of-domain and cross-corpus generalization remain imperfect (e.g., marked performance drops on unseen datasets), motivating hybridization with domain-adversarial or meta-learning approaches (Farhadipour et al., 24 Jan 2026, Xie et al., 9 Apr 2025).
Learnable Wavelet Bases: Most present implementations fix the wavelet family. Allowing the mother wavelet to be learned—either as parametric or neural filters—is identified as a promising research direction for adaptively matching unknown artifact distributions (Farhadipour et al., 24 Jan 2026).
Placement and Routing of Wavelet Prompts: The optimal transformer layers and mapping of prompt tokens to subbands are not fully determined. Future work may leverage attention-based or automatic routing to further improve parameter efficiency and expressivity (Farhadipour et al., 24 Jan 2026).
A plausible implication is that expanding Wavelet Prompt Tuning to other modalities (e.g., vision, multimodal fusion) and tasks requiring fine-scale discrimination could leverage its parameter-efficient, locality-preserving, and frequency-aware design.
7. Summary Table: Notable Results Across Applications
| Application | Architecture & Setting | Reported EER (%) | Parameter Budget |
|---|---|---|---|
| All-type audio deepfake (ADD) | WPT-XLSR-AASIST (PT-DEEP, all-type) | 3.58 (avg) | 0.2% of FT (Xie et al., 9 Apr 2025) |
| Speech deepfake detection | WaveSP-Net (Partial-WSPT, Mamba) | 10.58 (DE24) | 1.3% of XLSR (Xuan et al., 6 Oct 2025) |
| Speaker verification–spoofing | WPT-XLSR-AASIST, ensemble SMV | 0.16 (SpoofCeleb) | ~1 M of 300 M (Farhadipour et al., 24 Jan 2026) |
| All-in-one image restoration | CWP-Net, U-Net+WPB (AiOIR) | SOTA* | Multi-scale, low-budget (Wang et al., 4 Mar 2026) |
*Performance metric for restoration given as superior to SOTA alternatives (see original for dataset specifics).
Wavelet Prompt Tuning synthesizes efficiency, expressivity, and multiresolution feature encoding within prompt-based adaptation frameworks. It enables strong artifact localization and robust, generalized discrimination in both audio and vision, marking an important step in parameter-efficient deep model adaptation (Xuan et al., 6 Oct 2025, Wang et al., 4 Mar 2026, Xie et al., 9 Apr 2025, Farhadipour et al., 24 Jan 2026).