SpecPL: Disentangling Spectral Granularity for Prompt Learning

Published 6 May 2026 in cs.CV, cs.AI, cs.CL, and cs.LG | (2605.04504v1)

Abstract: Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for fine-grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.


Authors (4)

Summary

The paper introduces a spectral granularity factorization mechanism that decomposes visual cues into stable low-frequency and discriminative high-frequency components using a frozen VAE.
The paper employs a Visual Semantic Bank and counterfactual granule supervision to align multimodal features, significantly improving base-to-novel generalization and robustness to domain shifts.
The paper demonstrates that SpecPL offers plug-and-play integration with minimal overhead, achieving enhanced accuracy and resilience in fine-grained and cross-domain benchmarks.

SpecPL: Spectral Granularity Disentanglement for Robust Prompt Learning in Vision-Language Models

Introduction and Motivation

Vision-Language Models (VLMs) have achieved compelling performance in few-shot and zero-shot transfer through prompt learning, in which the backbone is kept frozen and only a small set of prompt parameters is optimized. However, conventional prompt learning frameworks suffer from a pronounced modality asymmetry: adaptation is predominantly text-centric, while the visual encoder remains a static, holistic feature extractor during both training and inference. Hierarchical and fine-grained visual cues are therefore underexploited, which weakens discrimination on challenging visual domains and hurts generalization, particularly on fine-grained or distribution-shifted tasks.

To address this critical bottleneck, the paper "SpecPL: Disentangling Spectral Granularity for Prompt Learning" (2605.04504) introduces a spectral granularity factorization mechanism for prompt learning. The central hypothesis asserts that effective VLM adaptation demands explicit modeling of both stable, low-frequency semantic structure and high-frequency instance-level detail. The paper posits that modeling granularity via latent-space spectral decomposition, rather than convoluted pixel-space frequency analysis, allows for controllable separation of global invariants and local discriminative cues, bridging the modality asymmetry in prompt learning.

Methodology

Spectral Disentanglement via Frozen VAE

SpecPL reframes prompt learning as a dual-path spectral processing problem. Central to the design is a frozen Variational Autoencoder (VAE) teacher that serves as a latent domain for spatial-spectral decomposition. The VAE encodes input imagery into a spatially aligned latent representation. A lightweight spatial-spectral proxy (average pooling and residual computation) decomposes each latent image $z$ into:

Base (low-frequency): Captures semantic invariances, robust to instance-level noise, using local smoothing.
Detail (high-frequency): Extracts residuals corresponding to fine-grained, instance-specific cues such as texture, pose, and local structure.
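
The base/detail split described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the "average pooling" is a stride-1 box filter with edge padding, and the function name `spectral_split`, the kernel size `k`, and the latent shape `(C, H, W)` are all hypothetical choices.

```python
import numpy as np

def spectral_split(z: np.ndarray, k: int = 3):
    """Split a latent map z of shape (C, H, W) into base and detail bands.

    base   = k x k local average of z (low-frequency proxy, assumed stride 1)
    detail = z - base                 (high-frequency residual)
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the window is centred")
    pad = k // 2
    # Edge-replicate padding keeps the output the same size as the input.
    zp = np.pad(z, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    H, W = z.shape[1:]
    base = np.zeros_like(z, dtype=float)
    # Sum the k*k shifted views, then divide: a plain box filter.
    for dy in range(k):
        for dx in range(k):
            base += zp[:, dy:dy + H, dx:dx + W]
    base /= k * k
    detail = z - base
    return base, detail

# Stand-in for a VAE latent; a real latent would come from the frozen encoder.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 8))
base, detail = spectral_split(z)
```

By construction `base + detail` reproduces `z` exactly, so the split loses no information; it only routes smooth structure and residual texture to different branches.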

Both bands are projected into the shared VLM embedding space by learned MLP projection heads. Only the prompts and these auxiliary modules are trainable; the VAE and the core CLIP backbone remain frozen throughout.

Visual Semantic Bank and Text Refinement

A key architectural contribution is the Visual Semantic Bank—a resource pool of low-frequency semantic prototypes initialized from the base band projections. During training, new base-band prototypes are used to update this bank via online nearest-neighbor EMA assignment; during inference, the bank is frozen.

Refined text features are constructed by compositional retrieval: original class text embeddings softly attend over the bank, yielding fused multimodal anchors that are more aligned with invariant visual structure. These enriched text features are then used as anchors for image-text contrastive losses.
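
One way to picture the bank update and the retrieval step is the sketch below. It is written under stated assumptions: cosine similarity for the nearest-neighbor assignment, a softmax with temperature for the soft attention, and a residual fusion weight `alpha`. The names `ema_update`, `refine_text`, `momentum`, `temperature`, and `alpha` are hypothetical; the source does not specify these details.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def ema_update(bank, feats, momentum=0.99):
    """Training only: move each feature's nearest bank slot towards it (EMA)."""
    bank = bank.copy()
    nearest = (l2norm(feats) @ l2norm(bank).T).argmax(axis=1)
    for f, j in zip(feats, nearest):
        bank[j] = momentum * bank[j] + (1.0 - momentum) * f
    return bank

def refine_text(text, bank, temperature=0.1, alpha=0.5):
    """Text embeddings softly attend over the bank, then fuse residually."""
    logits = (l2norm(text) @ l2norm(bank).T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    retrieved = attn @ bank                      # weighted mix of prototypes
    return l2norm(text + alpha * retrieved), attn

rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))     # 3 class text embeddings (toy data)
bank = rng.normal(size=(5, 8))     # 5 low-frequency prototypes (toy data)
feats = rng.normal(size=(4, 8))    # projected base-band features of a batch

bank = ema_update(bank, feats)     # skipped at inference: the bank is frozen
refined, attn = refine_text(text, bank)
```

At inference only `refine_text` would run, against the frozen bank.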

Granule Modulation and Counterfactual Supervision

To ensure models are sensitive to high-frequency, discriminative details (combatting the well-documented "shape bias" of deep vision models), SpecPL introduces factual and counterfactual granule supervision:

Factual: The model is conditioned and supervised to leverage the detail component of its own image.
Counterfactual: Instance granules (high-frequency features) are randomly permuted/sampled from other images in the batch, with the model explicitly supervised to predict the identity from the granule source. This mechanism compels the model to recognize that class identity can be critically determined by local fine-grained visual evidence.

Granule modulation is implemented with a FiLM-style conditioning module, but only during training—this branch is discarded at inference, preserving fast deployment and test-time efficiency.
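
The factual and counterfactual branches can be illustrated with a toy FiLM layer. This is a sketch under assumptions, not the paper's module: the scale and shift are predicted by single linear maps (`W_gamma`, `W_beta`, both hypothetical), the counterfactual granules come from a plain batch permutation, and the counterfactual targets are taken to be the labels of the images that supplied the granules.

```python
import numpy as np

def film(feat, granule, W_gamma, W_beta):
    """FiLM: scale and shift `feat` with parameters predicted from `granule`."""
    gamma = 1.0 + granule @ W_gamma   # centred on identity scaling
    beta = granule @ W_beta
    return gamma * feat + beta

def counterfactual_batch(granules, labels, rng):
    """Permute granules across the batch; targets follow the granule source."""
    perm = rng.permutation(len(granules))
    return granules[perm], labels[perm], perm

rng = np.random.default_rng(0)
B, D = 6, 16
feat = rng.normal(size=(B, D))        # image features (toy data)
granules = rng.normal(size=(B, D))    # projected detail-band features
labels = np.array([0, 1, 2, 0, 1, 2])
W_gamma = 0.01 * rng.normal(size=(D, D))
W_beta = 0.01 * rng.normal(size=(D, D))

# Factual: each image is modulated by its own detail and keeps its own label.
factual = film(feat, granules, W_gamma, W_beta)

# Counterfactual: details are swapped in from other batch items and the
# supervision target becomes the label of the granule's source image.
cf_granules, cf_labels, perm = counterfactual_batch(granules, labels, rng)
counterfactual = film(feat, cf_granules, W_gamma, W_beta)
```

A random permutation can leave some items paired with their own granule; a stricter sampler would exclude such fixed points. Because this branch exists only for training, none of it would run at inference.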

Training Objective

The loss is a composite of:

Main classification loss (cross-entropy with refined text features)
Semantic alignment loss (forcing text features to align with base visual anchors)
Granule supervision objectives (factual and counterfactual)
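
Written out, the three groups above suggest a weighted sum of the following form. The weights $\lambda_{\text{sem}}$, $\lambda_{\text{fact}}$, and $\lambda_{\text{cf}}$ are hypothetical placeholders; this summary does not state how the terms are balanced, so the expression is only a plausible reading of the list.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{cls}}
  + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}}
  + \lambda_{\text{fact}}\,\mathcal{L}_{\text{fact}}
  + \lambda_{\text{cf}}\,\mathcal{L}_{\text{cf}}
```

Here $\mathcal{L}_{\text{cls}}$ is the cross-entropy against the refined text features, $\mathcal{L}_{\text{sem}}$ is the semantic alignment term, and $\mathcal{L}_{\text{fact}}$ and $\mathcal{L}_{\text{cf}}$ are the factual and counterfactual granule objectives.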

Empirical Evaluation

Base-to-Novel Generalization

Evaluated across 11 standard classification benchmarks (including ImageNet, EuroSAT, FGVC-Aircraft, DTD, Flowers-102), SpecPL consistently increases the harmonic mean (HM) of base-class and novel-class accuracy on base-to-novel splits. For instance, with CoOp as the base method, HM improves from 71.66% to 76.52% (+4.86 points), with particularly pronounced gains on datasets characterized by fine-grained or textural class distinctions (e.g., FGVC-Aircraft, DTD). The model also significantly reduces the generalization gap (the accuracy drop from base to novel classes), e.g., by 31.6% for CoOp.
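
For readers unfamiliar with the metric, the harmonic mean used in base-to-novel evaluation can be computed as below. The two HM values are the ones reported above; the base/novel pair passed to `harmonic_mean` is a made-up example, not a result from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

# Reported in the text: CoOp HM without and with SpecPL.
hm_before, hm_after = 71.66, 76.52
gain = round(hm_after - hm_before, 2)   # 4.86 points

# Hypothetical accuracies, only to show how HM penalises imbalance.
example_hm = harmonic_mean(80.0, 60.0)
print(f"gain = {gain} points, example HM = {example_hm:.2f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a method cannot raise HM by improving base accuracy alone while novel accuracy falls behind.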

Cross-Dataset Transfer and Robustness

SpecPL yields increased robustness in cross-dataset and domain generalization evaluation (ImageNet→{V2, Sketch, A, R}), with consistent improvements in average transfer accuracy and resilience to natural/adversarial distribution shifts. Notably, improvements are most salient on targets with large domain or style shifts—strong evidence of the value in explicitly modeling visual granularity factors.

Ablation and Diagnostic Analyses

Ablation studies confirm that each component in SpecPL contributes and that the components are complementary: the Visual Semantic Bank stabilizes adaptation but is insufficient for generalization to novel classes by itself. Semantic alignment and granule modulation act synergistically, improving factual representation when combined, while counterfactual supervision regularizes and enhances detail-sensitivity provided the underlying semantic anchor is stable.

Spectral diagnostics reveal that the frozen VAE manifold provides a substantially cleaner separation (lower spectral energy overlap) between base (invariant) and detail (granular) components, as compared to non-disentangled CLIP representations, supporting the core design claim.

Architectural Generalizability

The method is not CLIP-specific; SpecPL also provides substantial HM improvements when tested with BLIP-ITC as the VLM backbone, demonstrating its general potential.

Efficiency

SpecPL adds only a minor overhead: the VAE teacher is used exclusively at training, and the granule modulation branch is likewise disabled at inference. Thus, SpecPL maintains parameter efficiency and fast inference, at the cost of marginal increases in trainable parameters and memory plus a one-time VAE cache construction.

Implications and Prospects

Practical Implications

Plug-and-play style: SpecPL can be integrated almost universally with prompt-learning methods, providing immediate improvements (especially on harder, detail-sensitive tasks) without major changes to model architecture or base optimization protocol.
Distribution shift robustness: By encoding both stable global structure and detail granularity, SpecPL improves resilience to domain shifts and accuracy on fine-grained recognition tasks, both important for real-world deployments.
Interpretability: Explicit spectral disentanglement offers greater transparency into which visual factors underpin prompt learning adaptation—enabling better failure analysis and principled prompt engineering.

Theoretical Implications

Modeling inductive biases: SpecPL strengthens the case that prompt learning's generalization/stability limitations stem from implicit, poorly structured visual representations—and that explicit factorization of frequency/granularity components is a principled, powerful regularizer in the VLM adaptation regime.
Generalizable adaptation principles: The effectiveness of spectral disentanglement in both CLIP and BLIP-ITC settings suggests that emergent multi-modal contrastive representations share latent manifold properties that can be systematically exploited.

Future Directions

The paper outlines several avenues:

Spatiotemporal extension: Temporal spectral decomposition, especially in video understanding where motion granularity is crucial.
Learnable spectral filters: Dynamically adapting the frequency separation via learnable filters, increasing flexibility across domains.
Generative modeling applications: Leveraging disentangled representations for controllable text-to-image generation (e.g., structurally consistent generation with varying textures).
Intrinsic spectral decomposition in frozen VLMs: Minimizing dependence on external VAE teachers.

Conclusion

SpecPL addresses a crucial and previously underexplored limitation in VLM prompt learning by disentangling visual spectral granularity within a latent VAE manifold. By explicitly modeling both semantic invariance (low-frequency) and discriminative detail (high-frequency), and introducing counterfactual supervision for instance-level granules, SpecPL harmonizes stability and generalization—a persistent trade-off in prior approaches. Empirical results substantiate its design, showing state-of-the-art performance in fine-grained, cross-domain, and base-to-novel generalization. These findings spotlight spectral disentanglement as a core adaptation principle for robust, efficient VLMs and open research trajectories for its deployment in more complex, detail-dependent visual reasoning domains.
