DVLA-RL: Dual-Level Vision-Language Alignment

Updated 7 February 2026

The paper introduces a dual-level fusion framework that leverages both local attribute cues and global class descriptions to improve few-shot image classification.
A reinforcement learning–driven gating mechanism dynamically fuses textual signals with multi-level visual features, enabling progressive alignment across Transformer layers.
Ablation studies attribute a measurable gain to each module, and the full model reports state-of-the-art accuracy on standard, fine-grained, and cross-domain few-shot benchmarks.

Dual-level Vision-Language Alignment with Reinforcement Learning Gating (DVLA-RL) is a plug-in framework for few-shot image classification that explicitly addresses the challenge of aligning visual and linguistic representations at both local (attribute) and global (description) semantic levels. Its architecture comprises two principal components: Dual-level Semantic Construction (DSC), which conditions LLMs on both class names and support examples to generate and select discriminative attributes as well as synthesize holistic class descriptions; and RL-gated Attention (RLA), which dynamically fuses these textual signals with multi-level visual features via a policy trained by episodic REINFORCE. By introducing a layer-wise, reinforcement learning–driven gating mechanism between self-attention and cross-modal attention, DVLA-RL achieves progressive and adaptive alignment between visual and language streams, resulting in improved class discrimination in few-shot regimes across standard, fine-grained, and cross-domain settings (Li et al., 31 Jan 2026).

1. Architectural Overview and Objectives

DVLA-RL is designed as a modular add-on for few-shot image classification pipelines, where the goal is rapid adaptation to novel categories given only a handful of support samples. The framework systematically fuses vision and language representations at two semantic levels:

Attribute level: Fine-grained textual attributes are generated for each class by prompting an LLM with class names and support images. These attributes provide local, easily visualized cues (e.g., “corded white coat”).
Description level: Selected attributes are synthesized by the LLM into coherent, global class descriptions that capture higher-level, holistic semantics.

To integrate these dual-level semantics, DVLA-RL introduces a layer-wise stochastic gating strategy, where each Transformer layer’s vision-language fusion is modulated by a lightweight policy network trained via the REINFORCE algorithm. This enables shallow layers to attend more to attributes and deep layers to leverage descriptions, yielding cross-modal alignments that adapt to both spatial granularity and semantic abstraction (Li et al., 31 Jan 2026).

2. Dual-level Semantic Construction (DSC)

The DSC module generates the textual cues that drive hierarchical fusion. Given an $N$-way, $K$-shot support set $S=\{(x_i, y_i)\}$ and a class name $C$, DSC operates in three phases:

2.1 Low-level Attribute Discovery

An LLM $\mathcal{L}_e$ (e.g., Qwen2.5-VL-32B) is prompted with both the class name and the support images to elicit key distinguishing attributes:

$A_{C_{\text{sup}}} = \mathcal{L}_e \left( P_{\text{dis}}(C_{\text{sup}}) \right),$

where $P_{\text{dis}}$ instructs the model to enumerate concise, distinguishing attributes of class $C$ that are visible in the provided images.

2.2 Progressive Top-$k$ Attribute Selection

Candidate attributes $a_j$ are iteratively filtered using a CLIP-based text embedding space. The process refines a template $T^{(i)}$ by selecting, at each step, the attribute maximizing cosine similarity with the current template:

$s_j^{(i)} = \cos\left(E_{\text{text}}(T^{(i)}), E_{\text{text}}(a_j)\right),$

starting from $T^{(0)} = \text{"A photo of a \{CLASS\}"}$. The top $k$ attributes ($\widehat{A}_{C_{\text{sup}}}$) are used for further alignment, with each converted to a canonical textual format for encoding.
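The greedy selection loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_embed` is a hypothetical stand-in for the CLIP text encoder $E_{\text{text}}$, and the rule for refining the template (appending the chosen attribute) is an assumption, since the text above does not spell out how $T^{(i)}$ is updated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_embed(text, dim=64):
    """Hypothetical stand-in for E_text: a deterministic hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec

def progressive_top_k(class_name, candidates, k, embed=toy_embed):
    """Greedily pick up to k attributes, re-scoring against the growing template."""
    template = f"A photo of a {class_name}"
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        t_vec = embed(template)
        best = max(remaining, key=lambda a: cosine(t_vec, embed(a)))
        selected.append(best)
        remaining.remove(best)
        # Assumed update rule: append the chosen attribute to the template.
        template = f"{template}, {best}"
    return selected
```

In practice `embed` would wrap a real text encoder; the loop itself only needs a function from strings to vectors.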

2.3 High-level Description Summarization

To capture inter-attribute context, the selected $k$ attributes are synthesized into a fluent summary by prompting the LLM again:

$D_{C_{\text{sup}}} = \mathcal{L}_e \left( P_{\text{sum}}(\widehat{A}_{C_{\text{sup}}}) \right),$

where $P_{\text{sum}}$ requests a concise scientific class description built from the top attributes.

Outputs from DSC—the attribute sentences and global description—form two token streams for later fusion.

3. RL-gated Attention (RLA)

RLA is responsible for the dynamic, layer-wise fusion of text and vision embeddings inside the Transformer backbone.

3.1 Attention Pathways

For normalized visual tokens $H_{\text{img}}$ and textual tokens $H_{\text{text}}$, the framework computes:

Cross-attention from text queries to image keys/values:

$H_{\text{cross-img}} = \operatorname{Attn}(W_q^{\text{text}} H_{\text{text}}, W_k^{\text{img}} H_{\text{img}}, W_v^{\text{img}} H_{\text{img}})$

Self-attention in the text modality:

$H_{\text{cross-text}} = \operatorname{Attn}(W_q^{\text{text}} H_{\text{text}}, W_k^{\text{text}} H_{\text{text}}, W_v^{\text{text}} H_{\text{text}})$

Standard scaled-dot-product attention:

$\operatorname{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$
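The two pathways can be sketched in plain Python to make the shapes explicit. This is an illustrative sketch under assumptions: tokens are stored as rows, so projections are applied as $HW$ rather than $WH$, and the weight dictionary keys (`q_text`, `k_img`, and so on) are hypothetical names.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attn(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def attention_pathways(h_text, h_img, w):
    """Return (H_cross_img, H_cross_text); both use the same text queries."""
    q = matmul(h_text, w["q_text"])
    h_cross_img = attn(q, matmul(h_img, w["k_img"]), matmul(h_img, w["v_img"]))
    h_cross_text = attn(q, matmul(h_text, w["k_text"]), matmul(h_text, w["v_text"]))
    return h_cross_img, h_cross_text
```

Because both pathways share the text queries, their outputs have one row per text token, which is what allows them to be mixed element-wise afterwards.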

3.2 RL-driven Stochastic Gating

The fused token stream is produced as:

$H_{\text{fuse}} = a \cdot H_{\text{cross-img}} + (1-a) \cdot H_{\text{cross-text}}, \qquad a \sim \mathrm{Beta}\left(\alpha = k\,p_e(s),\ \beta = k\,(1-p_e(s))\right)$

The gating parameter $a$ is drawn from a Beta distribution whose mean $\alpha/(\alpha+\beta) = p_e(s)$ is predicted by a policy network $\varphi$ and whose concentration is set by $k$. The policy is conditioned on the state:

$s = [\text{GAP}(H_{\text{img}}) ; \text{GAP}(H_{\text{text}}) ; \cos(\text{GAP}(H_{\text{img}}), \text{GAP}(H_{\text{text}}))]$

where $\text{GAP}$ denotes global average pooling.
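A minimal sketch of the state construction, gate sampling, and fusion follows. It assumes the Beta distribution is parameterised so that its mean equals the policy output $p$, that is $\alpha = kp$ and $\beta = k(1-p)$, and it takes $p$ as an input rather than modelling the policy network. The clamping constant `eps` is an assumption added to keep both Beta parameters positive.

```python
import math
import random

def gap(tokens):
    """Global average pooling over the token axis."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_state(h_img, h_text):
    """s = [GAP(H_img); GAP(H_text); cos(GAP(H_img), GAP(H_text))]."""
    g_img, g_text = gap(h_img), gap(h_text)
    return g_img + g_text + [cosine(g_img, g_text)]

def sample_gate(p, k=10.0, rng=random):
    """Draw a ~ Beta(k*p, k*(1-p)), which has mean p."""
    eps = 1e-4  # assumed clamp so both Beta parameters stay positive
    p = min(max(p, eps), 1.0 - eps)
    return rng.betavariate(k * p, k * (1.0 - p))

def fuse(h_cross_img, h_cross_text, a):
    """H_fuse = a * H_cross_img + (1 - a) * H_cross_text, element-wise."""
    return [[a * x + (1.0 - a) * y for x, y in zip(r_img, r_text)]
            for r_img, r_text in zip(h_cross_img, h_cross_text)]
```

A larger `k` concentrates samples around `p`, so the gate behaves almost deterministically; a smaller `k` yields more exploration.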

3.3 Layer-wise Sequential Decision and Rewards

The process is formulated as a sequential decision problem over $L$ layers. At each layer $l$:

Policy $\pi_\theta$ outputs $a_l$ given state $s_l$.
Reward $R_l$ combines the alignment between the fused features and the CLIP text embedding with the gain in query accuracy:

$R_l = A_{\text{sim}} \cdot \cos(U\,\text{GAP}(H_{\text{fuse}}),\,t^*) + A_{\text{imp}} \cdot (Acc_l - Acc_{l-1}),$

where $U$ is a linear map, $t^*$ is the CLIP text embedding for the ground-truth class, and $Acc_l$ is the query accuracy measured at layer $l$, so the second term is positive only when layer $l$ improves on layer $l-1$.

Optimization employs REINFORCE to maximize expected cumulative reward, including an entropy bonus.
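The reward and a REINFORCE-style surrogate loss can be sketched as below. Several choices here are assumptions rather than details confirmed by the source: the return at each layer is taken as the undiscounted sum of rewards from that layer onward, the Beta policy uses $\alpha = kp$ and $\beta = k(1-p)$, and the entropy bonus is omitted for brevity. The cosine term is passed in as a precomputed number.

```python
import math

def beta_log_pdf(a, alpha, beta):
    """Log-density of Beta(alpha, beta) at a, for 0 < a < 1."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1.0) * math.log(a) + (beta - 1.0) * math.log(1.0 - a) - log_norm

def layer_reward(cos_sim, acc_now, acc_prev, w_sim=0.5, w_imp=1.0):
    """R_l = A_sim * cos(...) + A_imp * (Acc_l - Acc_{l-1})."""
    return w_sim * cos_sim + w_imp * (acc_now - acc_prev)

def returns_to_go(rewards):
    """Assumed return: undiscounted sum of rewards from layer l to L."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def reinforce_surrogate(gates, means, rewards, k=10.0):
    """Scalar whose gradient w.r.t. the policy gives the REINFORCE update."""
    total = 0.0
    for a, p, g in zip(gates, means, returns_to_go(rewards)):
        total -= g * beta_log_pdf(a, k * p, k * (1.0 - p))
    return total
```

In a real training loop the returns would be treated as constants and the log-density differentiated with respect to the policy parameters by an autodiff framework.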

3.4 Auxiliary Losses

After fusion, $H_{\text{fuse}}$ is reinjected into the visual stream via residual addition and concatenation. Following Prototypical Networks, class prototypes $c_i$ are formed by averaging support features, and queries are classified via cosine-softmax. The total loss is:

$L_{\text{total}} = L_{\text{sup}} + \gamma \sum_{l=1}^L L_{\text{RL}}^l$

where $L_{\text{sup}}$ is query cross-entropy and $\gamma$ balances the RL signal.
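Prototype construction, cosine-softmax classification, and the combined loss can be sketched as follows. The `temperature` scaling of the cosine logits is an assumption (the source does not state one), and the per-layer RL losses are passed in as plain numbers.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prototypes(support_feats, support_labels):
    """Average the support features of each class to get its prototype c_i."""
    by_class = {}
    for feat, label in zip(support_feats, support_labels):
        by_class.setdefault(label, []).append(feat)
    return {label: [sum(col) / len(feats) for col in zip(*feats)]
            for label, feats in by_class.items()}

def cosine_softmax(query, protos, temperature=10.0):
    """Class probabilities from temperature-scaled cosine similarities."""
    labels = sorted(protos)
    logits = [temperature * cosine(query, protos[y]) for y in labels]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {y: e / total for y, e in zip(labels, exps)}

def total_loss(true_class_probs, rl_losses, gamma=0.1):
    """L_total = L_sup + gamma * sum_l L_RL^l, with L_sup the mean query cross-entropy."""
    l_sup = -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)
    return l_sup + gamma * sum(rl_losses)
```

Here `true_class_probs` holds, for each query, the probability that `cosine_softmax` assigns to its ground-truth class.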

4. Layer-wise Integration Strategy

The attribute and description token streams from DSC are injected into different network depths:

Shallow (early) layers: Receive the attribute tokens ($H_{\text{attr}}$) to promote fine-grained, local alignment between image regions and visual attributes.
Deep (late) layers: Receive the global description tokens ($H_{\text{desc}}$) to guide the network toward holistic, class-level understanding.

The learned gating parameter $a_l$ is empirically observed to increase with network depth, indicating increased emphasis on description-guided semantics in deeper layers. This design ensures that visual-textual fusion matches the semantic granularity appropriate to each layer’s feature abstraction (Li et al., 31 Jan 2026).
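The routing of text streams by depth can be illustrated with a small helper. The boundary between "shallow" and "deep" is an assumption here (the midpoint of the stack by default); the source does not state where the switch occurs.

```python
def text_tokens_for_layer(layer, num_layers, h_attr, h_desc, split=None):
    """Return attribute tokens for shallow layers and description tokens for deep ones."""
    if not 0 <= layer < num_layers:
        raise ValueError("layer index out of range")
    # Assumed boundary: the midpoint of the stack unless `split` is given.
    boundary = num_layers // 2 if split is None else split
    return h_attr if layer < boundary else h_desc

# Which stream each layer of a 4-layer stack would receive.
schedule = [text_tokens_for_layer(l, 4, "attr", "desc") for l in range(4)]
```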

5. Experimental Evaluation and Quantitative Results

DVLA-RL was evaluated across three few-shot learning scenarios:

General few-shot: miniImageNet, tieredImageNet, CIFAR-FS.
Fine-grained few-shot: CUB-200-2011, Stanford Dogs, Stanford Cars.
Cross-domain: miniImageNet-trained, tested on CUB, Places, ChestX.

Each benchmark used 2000 episodes of 5-way (1-shot and 5-shot) classification, with 15 queries per class.
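The episodic protocol can be sketched as a sampler that draws one $N$-way, $K$-shot episode with 15 queries per class. The data layout (a dictionary mapping class names to lists of items) and the function name are assumptions made for illustration.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15, rng=random):
    """Draw disjoint support and query sets for one few-shot episode."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never overlap.
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query
```

With the default arguments each episode contains 5 support items and 75 query items; evaluation would average accuracy over many such episodes.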

Main Performance Results

Dataset	DVLA-RL 1-shot (%)	DVLA-RL 5-shot (%)	Prior best 1-shot / 5-shot (%; SemFew/SUITED/MEFP)
miniImageNet	81.69 ± 0.36	88.25 ± 0.28	78.94 / 86.49
tieredImageNet	83.02 ± 0.43	91.71 ± 0.29	82.37 / 89.89
CIFAR-FS	87.18 ± 0.40	90.59 ± 0.31	84.34 / 89.11
CUB	91.93	95.06	86.02 / 94.13
Dogs	89.64	91.42	76.55 / 88.86
Cars	92.95	96.59	89.97 / 96.53
CUB (cross-domain)	67.46	78.99	51.55 / 73.61
Places (cross-domain)	69.26	80.70	52.06 / 73.78
ChestX (cross-domain)	23.47	26.94	23.11 / 26.70

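DVLA-RL exceeds the listed prior best on every benchmark. Margins vary widely, from 17.20 points on cross-domain Places (1-shot) down to 0.06 points on Cars (5-shot) and under 0.4 points on ChestX.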

Component Ablation (miniImageNet 1-shot)

Configuration	Accuracy (%)
Attributes only	75.73
+ Descriptions	76.56
+ Progressive Top-k	78.36
+ RLA (full DVLA-RL)	81.69

The ablation shows that each module adds to overall performance: descriptions contribute 0.83 points, progressive top-k selection a further 1.80, and RLA a further 3.33.

6. Implementation Details

Visual backbone: Visformer-Tiny (ViT variant).
Text encoder: CLIP ViT-B/16.
LLM for semantics: Qwen2.5-VL-32B, with specific prompts controlling attribute extraction and description synthesis.
Pretraining and meta-tuning: 300–800 epochs of pretraining (batch size 512), followed by 100 epochs of episodic meta-tuning.
Optimization: AdamW, initial learning rate $5\times 10^{-4}$, cosine scheduler.
RL hyperparameters: $k=10,\,A_{\text{sim}}=0.5,\,A_{\text{imp}}=1.0,\,\gamma=0.1,\,\lambda=0.2$.
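The cosine learning-rate schedule mentioned above can be sketched as a simple function of the training step. The decay floor (`min_lr = 0`) and the absence of warmup are assumptions; the source only states the initial rate and the scheduler type.

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4, min_lr=0.0):
    """Cosine-annealed learning rate from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```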

7. Summary and Significance

DVLA-RL demonstrates that hierarchical, progressive alignment between vision and language—achieved through LLM-grounded attribute extraction, selection, summarization, and reinforcement learning–guided fusion—provides substantial improvements for few-shot learning tasks. By matching the level of semantic abstraction injected to the depth of the visual network and dynamically adjusting fusion via RL-gated attention, this architecture attains state-of-the-art results across a wide range of benchmarks. The staged ablation validates the complementary roles of attribute selection, description synthesis, and RL gating, underscoring the necessity of both dual-level semantics and adaptive layer-wise fusion (Li et al., 31 Jan 2026).

References (1)

DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning (2026)
