DVLA-RL: Dual-Level Vision-Language Alignment
- The paper introduces a dual-level fusion framework that leverages both local attribute cues and global class descriptions to improve few-shot image classification.
- A reinforcement learning–driven gating mechanism dynamically fuses textual signals with multi-level visual features, enabling progressive alignment across Transformer layers.
- Benchmark results and ablation studies support DVLA-RL's state-of-the-art performance across standard, fine-grained, and cross-domain few-shot benchmarks, and its modular design lets it plug into existing pipelines.
Dual-level Vision-Language Alignment with Reinforcement Learning Gating (DVLA-RL) is a plug-in framework for few-shot image classification that explicitly addresses the challenge of aligning visual and linguistic representations at both local (attribute) and global (description) semantic levels. Its architecture comprises two principal components: Dual-level Semantic Construction (DSC), which conditions LLMs on both class names and support examples to generate and select discriminative attributes as well as synthesize holistic class descriptions; and RL-gated Attention (RLA), which dynamically fuses these textual signals with multi-level visual features via a policy trained by episodic REINFORCE. By introducing a layer-wise, reinforcement learning–driven gating mechanism between self-attention and cross-modal attention, DVLA-RL achieves progressive and adaptive alignment between visual and language streams, resulting in improved class discrimination in few-shot regimes across standard, fine-grained, and cross-domain settings (Li et al., 31 Jan 2026).
1. Architectural Overview and Objectives
DVLA-RL is designed as a modular add-on for few-shot image classification pipelines, where the goal is rapid adaptation to novel categories given only a handful of support samples. The framework systematically fuses vision and language representations at two semantic levels:
- Attribute level: Fine-grained textual attributes are generated for each class by prompting an LLM with class names and support images. These attributes provide local, easily visualized cues (e.g., “corded white coat”).
- Description level: Selected attributes are synthesized by the LLM into coherent, global class descriptions that capture higher-level, holistic semantics.
To integrate these dual-level semantics, DVLA-RL introduces a layer-wise stochastic gating strategy, where each Transformer layer’s vision-language fusion is modulated by a lightweight policy network trained via the REINFORCE algorithm. This enables shallow layers to attend more to attributes and deep layers to leverage descriptions, yielding cross-modal alignments that adapt to both spatial granularity and semantic abstraction (Li et al., 31 Jan 2026).
2. Dual-level Semantic Construction (DSC)
The DSC module generates the textual cues that drive hierarchical fusion. Given an $N$-way, $K$-shot support set and a class name $c$, DSC operates in three phases:
2.1 Low-level Attribute Discovery
An LLM (e.g., Qwen2.5-VL-32B) is prompted with both the class name and the support images to elicit key distinguishing attributes. The attribute prompt instructs the model to enumerate concise, distinguishing attributes relevant to the class in the provided images.
2.2 Progressive Top-$k$ Attribute Selection
Candidate attributes are iteratively filtered in a CLIP-based text embedding space. Starting from an initial template, the process refines the template by selecting, at each step, the candidate attribute with the highest cosine similarity to the current template. The top-$k$ attributes are used for further alignment, with each converted to a canonical textual format for encoding.
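The greedy selection loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are toy vectors standing in for CLIP text embeddings, the function name `progressive_top_k` is hypothetical, and the template refinement rule (a running mean of the initial template and the selected attributes) is an assumption, since the exact update is not given above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def progressive_top_k(template_emb, candidates, k):
    """Greedily pick k attributes, each maximizing cosine similarity
    with the current template embedding.

    candidates: dict mapping attribute text -> embedding (list of floats).
    ASSUMPTION: after each pick, the template embedding becomes the running
    mean of the initial template and all attributes selected so far.
    """
    remaining = dict(candidates)
    selected = []
    current = list(template_emb)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda name: cosine(remaining[name], current))
        best_emb = remaining.pop(best)
        selected.append(best)
        n = len(selected)
        # Running mean over (initial template + n selected attributes).
        current = [(c * n + b) / (n + 1) for c, b in zip(current, best_emb)]
    return selected

# Toy 3-d embeddings (illustrative values only, not real CLIP outputs).
toy_candidates = {
    "corded white coat": [0.9, 0.1, 0.0],
    "long drooping ears": [0.6, 0.6, 0.1],
    "grassy background": [0.0, 0.2, 0.9],
}
picked = progressive_top_k([1.0, 0.0, 0.0], toy_candidates, k=2)
```

In this toy setup the background-related candidate is never selected, because it stays nearly orthogonal to the template throughout.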
2.3 High-level Description Summarization
To capture inter-attribute context, the LLM is prompted to synthesize the selected attributes into a fluent summary. The description prompt requests a concise scientific class description built from the top-$k$ attributes.
Outputs from DSC—the attribute sentences and global description—form two token streams for later fusion.
3. RL-gated Attention (RLA)
RLA is responsible for the dynamic, layer-wise fusion of text and vision embeddings inside the Transformer backbone.
3.1 Attention Pathways
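For normalized visual tokens $X_v$ and textual tokens $X_t$, the framework computes (learned query, key, and value projections omitted for brevity):
- Cross-attention from text queries to image keys/values: $A_{\text{cross}} = \mathrm{Attn}(X_t, X_v, X_v)$
- Self-attention in the text modality: $A_{\text{self}} = \mathrm{Attn}(X_t, X_t, X_t)$
- Standard scaled-dot-product attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d}\right) V$, where $d$ is the key dimension.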
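The two pathways differ only in where the keys and values come from. The following pure-Python sketch shows this on toy tokens; it uses identity projections instead of learned ones, which is a simplification, and the variable names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of row vectors; d is the key dimension."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

text_tokens = [[1.0, 0.0], [0.0, 1.0]]                 # 2 text tokens, d = 2
visual_tokens = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]  # 3 visual tokens

# Cross-attention: text queries attend to image keys/values.
cross_out = attention(text_tokens, visual_tokens, visual_tokens)
# Self-attention: text queries attend to text keys/values.
self_out = attention(text_tokens, text_tokens, text_tokens)
```

Both outputs have one row per text token, so they can be combined token by token in the gating step.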
3.2 RL-driven Stochastic Gating
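The fused token stream is produced by mixing the self-attention and cross-attention outputs according to a scalar gate $g \in (0, 1)$. The gate is drawn from a Beta distribution whose mean is predicted by a policy network $\pi_\theta$, conditioned on a state computed with global average pooling (GAP) over token features.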
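A minimal sketch of the stochastic gate follows, under several loudly stated assumptions: the fusion is taken to be a convex combination `gate * cross + (1 - gate) * self`, the state is the concatenation of the pooled self-attention and cross-attention outputs, the policy is a single sigmoid unit, and the Beta parameters are derived from the mean with a hypothetical concentration `kappa`. None of these specifics is confirmed by the text above.

```python
import math
import random

def global_avg_pool(tokens):
    """Average token vectors over the token axis (GAP)."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

def policy_mean(state, weights, bias):
    """Hypothetical one-unit policy: sigmoid(w . s + b), a mean in (0, 1)."""
    z = sum(w * s for w, s in zip(weights, state)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sample_gate(mean, kappa, rng):
    """Draw g ~ Beta(alpha, beta) with E[g] = mean.
    ASSUMPTION: alpha = mean * kappa and beta = (1 - mean) * kappa."""
    return rng.betavariate(mean * kappa, (1.0 - mean) * kappa)

def fuse(self_out, cross_out, gate):
    """ASSUMED convex mix: gate * cross + (1 - gate) * self, per token."""
    return [[gate * c + (1.0 - gate) * s for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(self_out, cross_out)]

rng = random.Random(0)
self_tokens = [[1.0, 0.0], [0.0, 1.0]]
cross_tokens = [[0.0, 2.0], [2.0, 0.0]]

# State: concatenated pooled features (an assumption about the state layout).
state = global_avg_pool(self_tokens) + global_avg_pool(cross_tokens)
mu = policy_mean(state, weights=[0.1, -0.2, 0.3, 0.05], bias=0.0)
gate = sample_gate(mu, kappa=10.0, rng=rng)
fused = fuse(self_tokens, cross_tokens, gate)
```

Sampling the gate rather than using its mean directly is what makes the choice a stochastic action that a policy-gradient method can optimize.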
3.3 Layer-wise Sequential Decision and Rewards
The process is formulated as a sequential decision problem over the Transformer layers. At each layer $l$:
- The policy outputs a gate $g_l$ given the state $s_l$.
- The reward $r_l$ combines two terms: the alignment, computed through a linear map, between the fused features and the CLIP text embedding of the ground-truth class, and the gain in query accuracy relative to the current query accuracy.
Optimization employs REINFORCE to maximize expected cumulative reward, including an entropy bonus.
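For reference, the generic REINFORCE objective with an entropy bonus has the following textbook form. The notation is assumed here rather than taken from the paper: $\theta$ for the policy parameters, $g_l$, $s_l$, and $r_l$ for the gate, state, and reward at layer $l$, $L$ for the number of layers, $R_l$ for the undiscounted return-to-go, and $\beta_{\mathrm{ent}}$ for a hypothetical entropy weight. Whether DVLA-RL discounts rewards or subtracts a baseline is not stated above.

$$
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{l=1}^{L}\Big(r_l + \beta_{\mathrm{ent}}\,\mathcal{H}\big[\pi_\theta(\cdot \mid s_l)\big]\Big)\right],
\qquad
\nabla_\theta J(\theta) \approx \sum_{l=1}^{L}\Big(R_l\,\nabla_\theta \log \pi_\theta(g_l \mid s_l) + \beta_{\mathrm{ent}}\,\nabla_\theta \mathcal{H}\big[\pi_\theta(\cdot \mid s_l)\big]\Big),
\qquad
R_l = \sum_{l'=l}^{L} r_{l'}.
$$

The first gradient term raises the probability of gates that were followed by a high return, while the entropy term discourages the gate distribution from collapsing prematurely.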
3.4 Auxiliary Losses
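After fusion, the fused token stream is reinjected into the visual stream via residual addition and concatenation. Prototypical Network prototypes are formed by averaging support features, and queries are classified via cosine-softmax. The total loss is $\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \mathcal{L}_{\text{RL}}$, where $\mathcal{L}_{\text{CE}}$ is the query cross-entropy and $\lambda$ balances the RL signal.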
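The prototype classification step can be illustrated with a small pure-Python sketch. The feature vectors are toy values, the class names are placeholders, and the `temperature` scale on the cosine logits is a hypothetical parameter that the text above does not specify.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def classify(query, support, temperature=10.0):
    """Cosine-softmax over class prototypes.

    support: dict mapping class name -> list of support feature vectors.
    temperature: hypothetical scale applied to cosine similarities.
    Returns a dict mapping class name -> probability.
    """
    prototypes = {c: mean_vector(feats) for c, feats in support.items()}
    logits = {c: temperature * cosine(query, p) for c, p in prototypes.items()}
    m = max(logits.values())
    exps = {c: math.exp(z - m) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy 2-way, 2-shot episode with 2-d features.
support_feats = {
    "class_a": [[1.0, 0.1], [0.9, 0.0]],
    "class_b": [[0.0, 1.0], [0.2, 0.9]],
}
probs = classify([0.8, 0.2], support_feats)
predicted = max(probs, key=probs.get)
# Query cross-entropy, assuming the true label of this query is "class_a".
query_ce = -math.log(probs["class_a"])
```

The `query_ce` value corresponds to the cross-entropy term of the total loss for a single query; the RL term is not modeled here.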
4. Layer-wise Integration Strategy
The attribute and description token streams from DSC are injected into different network depths:
- Shallow (early) layers: Receive the attribute tokens to promote fine-grained, local alignment between image regions and visual attributes.
- Deep (late) layers: Receive the global description tokens to guide the network toward holistic, class-level understanding.
The learned gating parameter is empirically observed to increase with network depth, indicating increased emphasis on description-guided semantics in deeper layers. This design ensures that visual-textual fusion matches the semantic granularity appropriate to each layer’s feature abstraction (Li et al., 31 Jan 2026).
5. Experimental Evaluation and Quantitative Results
DVLA-RL was evaluated across three few-shot learning scenarios:
- General few-shot: miniImageNet, tieredImageNet, CIFAR-FS.
- Fine-grained few-shot: CUB-200-2011, Stanford Dogs, Stanford Cars.
- Cross-domain: miniImageNet-trained, tested on CUB, Places, ChestX.
Each benchmark used 2000 episodes of 5-way (1-shot and 5-shot) classification, with 15 queries per class.
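The episodic protocol can be made concrete with a small sampler. This is a generic sketch of $N$-way, $K$-shot episode construction, not the authors' evaluation code; the function name and the toy dataset layout are assumptions.

```python
import random

def sample_episode(data, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one few-shot episode.

    data: dict mapping class name -> list of items (e.g. image paths).
    Returns (support, query), each a list of (item, episode_label) pairs,
    with k_shot support items and n_query query items per class.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw without replacement so support and query never overlap.
        items = rng.sample(data[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Toy dataset: 8 classes with 20 placeholder items each.
toy_data = {f"class_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(8)}
support_set, query_set = sample_episode(toy_data, n_way=5, k_shot=1,
                                        n_query=15, rng=random.Random(0))
```

With 5 ways and 15 queries per class, each episode scores 75 query predictions; reported accuracy is then aggregated over the 2000 episodes.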
Main Performance Results (accuracy, %)
| Dataset | DVLA-RL 1-shot | DVLA-RL 5-shot | Prior best (SemFew/SUITED/MEFP) 1-shot | Prior best (SemFew/SUITED/MEFP) 5-shot |
|---|---|---|---|---|
| miniImageNet | 81.69 ± 0.36 | 88.25 ± 0.28 | 78.94 | 86.49 |
| tieredImageNet | 83.02 ± 0.43 | 91.71 ± 0.29 | 82.37 | 89.89 |
| CIFAR-FS | 87.18 ± 0.40 | 90.59 ± 0.31 | 84.34 | 89.11 |
| CUB | 91.93 | 95.06 | 86.02 | 94.13 |
| Dogs | 89.64 | 91.42 | 76.55 | 88.86 |
| Cars | 92.95 | 96.59 | 89.97 | 96.53 |
| CUB (cross-domain) | 67.46 | 78.99 | 51.55 | 73.61 |
| Places | 69.26 | 80.70 | 52.06 | 73.78 |
| ChestX | 23.47 | 26.94 | 23.11 | 26.70 |
DVLA-RL outperforms previous approaches on all benchmarks.
Component Ablation (miniImageNet 1-shot)
| Configuration | Accuracy (%) |
|---|---|
| Attributes only | 75.73 |
| + Descriptions | 76.56 |
| + Progressive Top-k | 78.36 |
| + RLA (full DVLA-RL) | 81.69 |
Ablation confirms that each module incrementally contributes to overall performance.
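The per-module gains implied by the ablation table can be computed directly from the reported accuracies. The snippet below only does arithmetic on the numbers in the table above.

```python
# Reported miniImageNet 1-shot accuracies (%), in cumulative order.
ablation = [
    ("Attributes only", 75.73),
    ("+ Descriptions", 76.56),
    ("+ Progressive Top-k", 78.36),
    ("+ RLA (full DVLA-RL)", 81.69),
]

# Gain of each added component over the previous configuration.
gains = {
    name: round(acc - prev_acc, 2)
    for (_, prev_acc), (name, acc) in zip(ablation, ablation[1:])
}
total_gain = round(ablation[-1][1] - ablation[0][1], 2)
```

The gains are 0.83, 1.80, and 3.33 percentage points respectively, for a total of 5.96, so the RL-gated attention accounts for the largest single step.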
6. Implementation Details
- Visual backbone: Visformer-Tiny (ViT variant).
- Text encoder: CLIP ViT-B/16.
- LLM for semantics: Qwen2.5-VL-32B, with specific prompts controlling attribute extraction and description synthesis.
- Pretraining and meta-tuning: 300–800 epochs of pretraining (batch size 512), followed by 100 epochs of episodic meta-tuning.
- Optimization: AdamW, initial learning rate , cosine scheduler.
- RL hyperparameters: .
7. Summary and Significance
DVLA-RL demonstrates that hierarchical, progressive alignment between vision and language—achieved through LLM-grounded attribute extraction, selection, summarization, and reinforcement learning–guided fusion—provides substantial improvements for few-shot learning tasks. By matching the level of semantic abstraction injected to the depth of the visual network and dynamically adjusting fusion via RL-gated attention, this architecture attains state-of-the-art results across a wide range of benchmarks. The staged ablation validates the complementary roles of attribute selection, description synthesis, and RL gating, underscoring the necessity of both dual-level semantics and adaptive layer-wise fusion (Li et al., 31 Jan 2026).