Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics

Published 5 Apr 2026 in cs.CV and cs.AI | (2604.03980v1)

Abstract: Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL) for Vision-Language Models via Second-Order Statistics, a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via Gram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features, and show compelling performances of GAPL on various benchmarks.

Authors (4)

Summary

The paper introduces GAPL, a framework that leverages Gram matrix-derived second-order statistics to enhance prompt learning in vision-language models under domain shifts.
It integrates three complementary streams—global invariant, Gram-anchored, and contextual—to achieve improved out-of-distribution and base-to-novel generalization.
Empirical results show that GAPL attains higher accuracy than existing methods on both in-domain and OOD benchmarks while keeping prompt adaptation parameter-efficient.

Introduction

This work introduces Gram-Anchored Prompt Learning (GAPL), a parameter-efficient prompt learning framework targeting robust adaptation of vision-language models (VLMs) under pronounced domain shifts. Rather than relying exclusively on first-order visual cues, GAPL incorporates second-order statistical information, specifically Gram-matrix-derived style descriptors, to construct more stable and generalizable prompt conditioning mechanisms for vision-language adaptation.

Prompt learning has established itself as a viable approach to efficiently adapt large, frozen VLMs such as CLIP for downstream recognition without updating the model backbone. The predominant approach, as instantiated in CoOp, CoCoOp, and their derivatives, relies mainly on first-order visual features (e.g., pooled global tokens or localized spatial embeddings) to produce dynamic prompts. However, these first-order cues inadequately capture image-level distributional structure and are highly sensitive to variations in visual style and local appearance, which limits out-of-distribution (OOD) robustness.

The motivation underlying GAPL is grounded in the observation that Gram matrices, known from the style transfer and domain adaptation literature, encode global texture and appearance statistics that are significantly more invariant across style, domain, and distributional shifts than first-order representations. While previous prompt learning methods have extended context regularization, region-awareness, and cross-modal interactions (e.g., HiCroPL [Zheng_2025_ICCV], CoPrompt [roy2023consistency], MaPLe [Khattak_2023_CVPR]), no prior work has systematically integrated Gram-based second-order signals into prompt generation.

GAPL Framework and Methodology

GAPL augments prompt learning for VLMs by coupling three distinct but complementary streams:

Global Invariant Stream: Preserves pre-trained VLM semantic priors via template-based ensemble text features, maintaining stable semantic alignment and mitigating overfitting to the training domain.
Gram-Anchored Stream: Derives a concise image-level descriptor from the diagonal of the patch-token Gram matrix, log-transforms it, compresses it with a lightweight MLP followed by a sigmoid, and uses the resulting gating vector to modulate the prompted text features in a style-responsive manner. This channel-wise second-order cue enhances prompt adaptability and cross-domain robustness.
Contextual-Anchored Stream: Leverages learnable local signals and attention over patch tokens for fine-grained local visual-textual alignment, enhancing discriminativity for spatially localized features.
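
The Gram-Anchored Stream described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the Gram matrix is taken over channels ($G = X^\top X$ for patch tokens $X$ of shape patches x channels), that the log transform is `log1p`, that the MLP has one ReLU hidden layer, and that "modulate" means element-wise multiplication. All function names, dimensions, and the random weights are hypothetical.

```python
import math
import random


def gram_diagonal(patch_tokens):
    """Diagonal of the channel Gram matrix G = X^T X (assumed channel-wise).

    G[d][d] is the sum over patches of the squared activation in channel d,
    so the full D x D matrix never has to be materialized.
    """
    num_channels = len(patch_tokens[0])
    return [sum(tok[d] ** 2 for tok in patch_tokens) for d in range(num_channels)]


def linear(x, weight, bias):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gram_gate(patch_tokens, w1, b1, w2, b2):
    """Log-compress the Gram diagonal and map it to one gate in (0, 1) per text channel."""
    descriptor = [math.log1p(g) for g in gram_diagonal(patch_tokens)]  # log1p: assumed choice
    hidden = [max(0.0, h) for h in linear(descriptor, w1, b1)]         # ReLU: assumed choice
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2, b2)]  # sigmoid


def modulate(text_feature, gate):
    """Element-wise gating of a prompted text feature (assumed form of modulation)."""
    return [t * g for t, g in zip(text_feature, gate)]


# Toy usage with hypothetical sizes: 4 patches, 6 visual channels, 5 text channels.
random.seed(0)
num_patches, vis_dim, hid_dim, txt_dim = 4, 6, 3, 5
patches = [[random.gauss(0, 1) for _ in range(vis_dim)] for _ in range(num_patches)]
w1 = [[random.gauss(0, 0.1) for _ in range(vis_dim)] for _ in range(hid_dim)]
b1 = [0.0] * hid_dim
w2 = [[random.gauss(0, 0.1) for _ in range(hid_dim)] for _ in range(txt_dim)]
b2 = [0.0] * txt_dim
text_feature = [random.gauss(0, 1) for _ in range(txt_dim)]

gate = gram_gate(patches, w1, b1, w2, b2)
gated_text = modulate(text_feature, gate)
```

Because the gate depends only on per-channel energy summed over all patches, it is unchanged if the patches are shuffled, which is one way to read the claim that the descriptor captures image-level style rather than spatial layout.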

The final prediction fuses the three branches with input-adaptive weights produced by an MLP, and the model is trained with a joint objective comprising classification and regularization terms.
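
One plausible reading of the input-adaptive fusion is a softmax over three branch scores computed from the image feature. The sketch below assumes exactly that; the single linear layer standing in for the weighting MLP, the softmax normalization, and every name and number are illustrative assumptions rather than details confirmed by the summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(branch_logits, image_feature, weight, bias):
    """Mix per-branch class logits with weights that depend on the input image.

    branch_logits: one list of class logits per branch
                   (global invariant, Gram-anchored, contextual).
    weight, bias:  a single linear layer standing in for the weighting MLP (assumed).
    """
    scores = [sum(w * x for w, x in zip(row, image_feature)) + b
              for row, b in zip(weight, bias)]
    mix = softmax(scores)
    num_classes = len(branch_logits[0])
    fused = [sum(mix[k] * branch_logits[k][c] for k in range(len(branch_logits)))
             for c in range(num_classes)]
    return fused, mix


# Toy usage: hypothetical 2-dimensional image feature and 2 classes.
branch_logits = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
image_feature = [2.0, 0.0]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
fused, mix = fuse_logits(branch_logits, image_feature, weight, bias)
```

In this toy case the first branch receives the largest weight because its score is highest for the given image feature, so the fused logits lean toward that branch's prediction.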

Empirical Results

Cross-Domain Generalization

On a standard suite of ImageNet variants representing varying OOD scenarios, GAPL achieves the top accuracy (72.9%) on the source domain and the highest average accuracy (61.12%) on the OOD variants (ImageNet-V2, -Sketch, -A, -R). Notably, it outperforms MMRL, TCP, and PromptSRC by absolute margins of 0.61 to 1.83 percentage points in key OOD settings. The Gram-Anchored Stream provides clear gains, particularly under severe style mismatches (e.g., +3.15 percentage points on ImageNet-A over the Global Invariant baseline), supporting the robustness conferred by second-order conditioning.

Base-to-Novel Generalization

Under the base-to-novel transfer protocol across 11 benchmarks, GAPL achieves the highest average novel accuracy (78.12%) and harmonic mean (81.77%), maintaining competitive base accuracy. On datasets with large style and distributional gaps (Flowers102, SUN397, DTD), GAPL sets new state-of-the-art performance. These results confirm that introducing second-order anchors does not compromise in-distribution fit while substantially boosting transfer to unseen classes.
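
For readers unfamiliar with the metric, the harmonic mean in base-to-novel evaluation is conventionally $H = 2BN / (B + N)$ for base accuracy $B$ and novel accuracy $N$. The sketch below assumes this standard definition applies here; the input values are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy, H = 2BN / (B + N)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies, chosen only to show how imbalance is penalized.
example_hm = harmonic_mean(90.0, 60.0)      # 72.0
example_arith = (90.0 + 60.0) / 2.0         # 75.0
```

The harmonic mean falls below the arithmetic mean whenever base and novel accuracy differ, which is why it is used to reward methods that do not trade novel-class performance for base-class fit.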

Ablation Studies

Component-wise ablations demonstrate:

The Gram-Anchored Stream delivers the most significant improvement in challenging OOD scenarios.
The diagonal-only Gram descriptor achieves equivalent results to more complex variants (diagonal+variance or full Gram), with superior efficiency and no GPU memory bottlenecks.
Deep visual prompting (injecting learnable prompts into all visual encoder layers) causes a catastrophic degradation in OOD robustness, which suggests that prompt adaptation should remain on the textual side when targeting cross-domain generalization.
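
The efficiency argument for the diagonal-only descriptor can be made concrete: the diagonal of $X^\top X$ is just the per-channel sum of squares, so it costs $D$ values instead of $D^2$. The following sketch checks this on toy data, assuming a channel-wise Gram matrix; the channel count of 512 is a hypothetical example, not a figure from the paper.

```python
def full_gram(patch_tokens):
    """Full channel Gram matrix G = X^T X, of shape D x D."""
    dim = len(patch_tokens[0])
    return [[sum(tok[i] * tok[j] for tok in patch_tokens) for j in range(dim)]
            for i in range(dim)]


def gram_diagonal(patch_tokens):
    """Diagonal of G computed directly as per-channel sums of squares."""
    dim = len(patch_tokens[0])
    return [sum(tok[i] * tok[i] for tok in patch_tokens) for i in range(dim)]


def descriptor_sizes(num_channels):
    """Number of values in the diagonal descriptor vs. the full Gram matrix."""
    return num_channels, num_channels * num_channels


tokens = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]   # 2 patches, 3 channels (toy data)
gram = full_gram(tokens)
diag = gram_diagonal(tokens)
diag_size, full_size = descriptor_sizes(512)  # hypothetical channel count
```

With 512 channels the full matrix holds 262,144 entries per image against 512 for the diagonal, which is consistent with the reported absence of GPU memory bottlenecks for the diagonal variant.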

Fusing learned and fixed template text features is essential; an experimental sweep shows that a fusion weight of $\alpha=0.7$ is optimal for robustness.
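
A simple way to picture this fusion is a convex combination of the two text features followed by re-normalization. The summary does not say which side $\alpha$ weights, so the sketch below assumes $\alpha$ multiplies the learned feature and $1-\alpha$ the template feature; that direction, the L2 normalization, and the function names are all assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (returned unchanged if it is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_text_features(learned, template, alpha=0.7):
    """Convex combination of learned and template text features.

    Assumption: alpha weights the learned feature; the paper may define it
    the other way around.
    """
    mixed = [alpha * l + (1.0 - alpha) * t for l, t in zip(learned, template)]
    return l2_normalize(mixed)


# Toy usage with orthogonal unit vectors standing in for the two features.
learned_feat = [1.0, 0.0]
template_feat = [0.0, 1.0]
fused_feat = fuse_text_features(learned_feat, template_feat, alpha=0.7)
```

Setting `alpha` to 0 or 1 recovers the pure template or pure learned feature, so a sweep over `alpha` interpolates between the frozen prior and the fully adapted prompt.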

Analysis of Latent Representations and Alignment

Qualitative and quantitative analyses via t-SNE and inter-domain Euclidean distances show that the Gram-based anchor substantially reduces domain-induced variance in the feature space. Clusters of the same class from different domains merge, and the distance between cross-domain centroids shrinks by roughly two orders of magnitude (from $\sim5$ to $<0.05$ in Euclidean distance), providing strong evidence of improved latent homogeneity and transfer.
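
The inter-domain distance can be illustrated as the Euclidean distance between per-domain centroids of one class. The sketch below is an assumed formulation: the summary does not specify whether distances are averaged over domain pairs or classes, and the domain names and feature values are toy data.

```python
import math


def centroid(features):
    """Mean feature vector of a list of equally sized vectors."""
    count = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / count for d in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_domain_centroid_distance(features_by_domain):
    """Mean pairwise distance between domain centroids for a single class.

    Expects at least two domains. Averaging over pairs is an assumption.
    """
    cents = [centroid(feats) for feats in features_by_domain.values()]
    pairs = [(i, j) for i in range(len(cents)) for j in range(i + 1, len(cents))]
    return sum(euclidean(cents[i], cents[j]) for i, j in pairs) / len(pairs)


# Toy features for one class seen in two hypothetical domains.
toy_features = {
    "photo": [[0.0, 0.0], [2.0, 0.0]],    # centroid (1, 0)
    "sketch": [[3.0, 4.0], [5.0, 4.0]],   # centroid (4, 4)
}
toy_distance = cross_domain_centroid_distance(toy_features)  # 5.0
```

A drop in this quantity after adding the Gram-based anchor would mean that features of the same class land in nearly the same place regardless of the domain they come from.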

Furthermore, patch-level similarity maps indicate improved semantic specificity: the Gram-based anchor suppresses non-discriminative activations and concentrates correspondence on salient object parts under both appearance and style variation.
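
A patch-level similarity map of this kind is typically obtained by comparing every patch token with a class text feature and reshaping the scores onto the patch grid. The sketch below assumes cosine similarity and a row-major patch order; both are assumptions, and the vectors are toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def similarity_map(patch_tokens, text_feature, grid_h, grid_w):
    """Cosine similarity of each patch to the text feature, as a grid_h x grid_w grid.

    Assumes patch_tokens are listed in row-major order.
    """
    sims = [cosine(tok, text_feature) for tok in patch_tokens]
    return [sims[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy usage: a 2 x 2 patch grid with 2-dimensional tokens.
toy_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_text = [1.0, 0.0]
sim_map = similarity_map(toy_patches, toy_text, grid_h=2, grid_w=2)
```

High values mark patches whose features point in the same direction as the class text feature, so a map that concentrates on object parts indicates tighter patch-to-text correspondence.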

Implications and Future Directions

The integration of second-order statistics into prompt learning presents a viable direction for robust transfer learning with VLMs. The empirical evidence establishes that simple, computationally feasible Gram-diagonal descriptors suffice to significantly enhance cross-domain adaptation without sacrificing in-distribution or base-class performance. The GAPL design substantiates the claim that structure-aware visual cues, beyond first-order statistics, should be core to prompt-based adaptation regimes when facing substantial distributional shifts.

Future research avenues include:

Online test-time adaptation and continual domain generalization (to relax the fixed offline parameterization constraint).
Extension of Gram-anchored prompt conditioning to dense prediction settings, including zero-shot segmentation, where spatial alignment is as critical as semantic generalization.
Investigation of higher-order or richer statistics, conditional on tractability, for tasks involving texture or material recognition.

Conclusion

Gram-Anchored Prompt Learning demonstrates principled and practical advances in parameter-efficient adaptation of VLMs. By anchoring prompt learning in second-order Gram-based descriptors, GAPL delivers robust generalization under distribution shift, substantiated by strong empirical results across 15 public benchmarks. The findings offer actionable guidance for future work in robust vision-language transfer and inform the evolution of prompt learning strategies in large-scale, foundation model settings.

Reference: "Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics" (2604.03980)