
DVLA-RL: Dual-Level Vision-Language Alignment

Updated 7 February 2026
  • The paper introduces a dual-level fusion framework that leverages both local attribute cues and global class descriptions to improve few-shot image classification.
  • A reinforcement learning–driven gating mechanism dynamically fuses textual signals with multi-level visual features, enabling progressive alignment across Transformer layers.
  • The modular design and ablation studies demonstrate DVLA-RL's state-of-the-art performance across standard, fine-grained, and cross-domain few-shot benchmarks.

Dual-level Vision-Language Alignment with Reinforcement Learning Gating (DVLA-RL) is a plug-in framework for few-shot image classification that explicitly addresses the challenge of aligning visual and linguistic representations at both local (attribute) and global (description) semantic levels. Its architecture comprises two principal components: Dual-level Semantic Construction (DSC), which conditions LLMs on both class names and support examples to generate and select discriminative attributes as well as synthesize holistic class descriptions; and RL-gated Attention (RLA), which dynamically fuses these textual signals with multi-level visual features via a policy trained by episodic REINFORCE. By introducing a layer-wise, reinforcement learning–driven gating mechanism between self-attention and cross-modal attention, DVLA-RL achieves progressive and adaptive alignment between visual and language streams, resulting in improved class discrimination in few-shot regimes across standard, fine-grained, and cross-domain settings (Li et al., 31 Jan 2026).

1. Architectural Overview and Objectives

DVLA-RL is designed as a modular add-on for few-shot image classification pipelines, where the goal is rapid adaptation to novel categories given only a handful of support samples. The framework systematically fuses vision and language representations at two semantic levels:

  • Attribute level: Fine-grained textual attributes are generated for each class by prompting an LLM with class names and support images. These attributes provide local, easily visualized cues (e.g., “corded white coat”).
  • Description level: Selected attributes are synthesized by the LLM into coherent, global class descriptions that capture higher-level, holistic semantics.

To integrate these dual-level semantics, DVLA-RL introduces a layer-wise stochastic gating strategy, where each Transformer layer’s vision-language fusion is modulated by a lightweight policy network trained via the REINFORCE algorithm. This enables shallow layers to attend more to attributes and deep layers to leverage descriptions, yielding cross-modal alignments that adapt to both spatial granularity and semantic abstraction (Li et al., 31 Jan 2026).

2. Dual-level Semantic Construction (DSC)

The DSC module generates the textual cues that drive hierarchical fusion. Given an $N$-way, $K$-shot support set $S=\{(x_i, y_i)\}$ and a class name $C$, DSC operates in three phases:

2.1 Low-level Attribute Discovery

An LLM $\mathcal{L}_e$ (e.g., Qwen2.5-VL-32B) is prompted with both the class name and support images to elicit key distinguishing attributes:

$$A_{C_{\text{sup}}} = \mathcal{L}_e \left( P_{\text{dis}}(C_{\text{sup}}) \right),$$

where $P_{\text{dis}}$ instructs the model to enumerate concise, distinguishing attributes relevant to class $C$ in the provided images.

2.2 Progressive Top-$k$ Attribute Selection

Candidate attributes $a_j$ are iteratively filtered using a CLIP-based text embedding space. The process refines a template $T^{(i)}$ by selecting, at each step, the attribute maximizing cosine similarity with the current template:

$$s_j^{(i)} = \cos\left(E_{\text{text}}(T^{(i)}), E_{\text{text}}(a_j)\right),$$

starting from $T^{(0)} = \text{"A photo of a \{CLASS\}"}$. The top $k$ attributes ($\widehat{A}_{C_{\text{sup}}}$) are used for further alignment, with each converted to a canonical textual format for encoding.
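The selection loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for the CLIP text encoder $E_{\text{text}}$, and the template update rule (appending the chosen attribute to the template) is assumed, since the text only says the template is refined at each step.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a hypothetical stand-in for E_text (CLIP)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def progressive_top_k(class_name, candidates, k):
    """Greedily pick up to k attributes, re-scoring against the growing template."""
    template = f"A photo of a {class_name}"        # T^(0)
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        t_emb = embed(template)
        # s_j^(i): similarity between the current template and each candidate
        best = max(remaining, key=lambda a: cosine(t_emb, embed(a)))
        selected.append(best)
        remaining.remove(best)
        template = f"{template}, {best}"           # assumed refinement rule
    return selected
```

Because every step re-embeds the updated template, an attribute that scores poorly against $T^{(0)}$ can still be picked later once related attributes have been added.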

2.3 High-level Description Summarization

To capture inter-attribute context, the selected $k$ attributes are synthesized via a fluent LLM-prompted summary:

$$D_{C_{\text{sup}}} = \mathcal{L}_e \left( P_{\text{sum}}(\widehat{A}_{C_{\text{sup}}}) \right),$$

where $P_{\text{sum}}$ requests a concise scientific class description built from the top attributes.

Outputs from DSC—the attribute sentences and global description—form two token streams for later fusion.

3. RL-gated Attention (RLA)

RLA is responsible for the dynamic, layer-wise fusion of text and vision embeddings inside the Transformer backbone.

3.1 Attention Pathways

For normalized visual tokens $H_{\text{img}}$ and textual tokens $H_{\text{text}}$, the framework computes:

  • Cross-attention from text queries to image keys/values:

$$H_{\text{cross-img}} = \operatorname{Attn}(W_q^{\text{text}} H_{\text{text}}, W_k^{\text{img}} H_{\text{img}}, W_v^{\text{img}} H_{\text{img}})$$

  • Self-attention in the text modality:

$$H_{\text{cross-text}} = \operatorname{Attn}(W_q^{\text{text}} H_{\text{text}}, W_k^{\text{text}} H_{\text{text}}, W_v^{\text{text}} H_{\text{text}})$$

  • Standard scaled-dot-product attention:

$$\operatorname{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$
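The two pathways differ only in where the keys and values come from, which a small pure-Python sketch makes concrete. As a simplifying assumption, the learned projections $W_q$, $W_k$, $W_v$ are omitted (treated as identity), and the token matrices are tiny made-up examples stored as lists of rows.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax(row):
    m = max(row)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attn(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

# Toy tokens (illustrative values): 3 image tokens and 2 text tokens, d = 2.
h_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_text = [[0.5, 0.5], [1.0, 0.0]]

h_cross_img = attn(h_text, h_img, h_img)      # text queries, image keys/values
h_cross_text = attn(h_text, h_text, h_text)   # text self-attention
```

Both outputs have one row per text token, so they can be mixed element-wise by a scalar gate.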

3.2 RL-driven Stochastic Gating

The fused token stream is produced as:

$$H_{\text{fuse}} = a \cdot H_{\text{cross-img}} + (1-a) \cdot H_{\text{cross-text}}, \qquad a \sim \mathrm{Beta}\left(\alpha = k\,p_e(s),\ \beta = k(1-p_e(s)) \right)$$

The gating parameter $a$ is drawn from a Beta distribution with concentration $k$, whose mean $\alpha/(\alpha+\beta) = p_e(s)$ is predicted by a policy network $\varphi$, conditioned on the state:

$$s = [\text{GAP}(H_{\text{img}}) ; \text{GAP}(H_{\text{text}}) ; \cos(\text{GAP}(H_{\text{img}}), \text{GAP}(H_{\text{text}}))]$$

where $\text{GAP}$ denotes global average pooling.
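A minimal sketch of the state construction, gate sampling, and fusion follows. Several pieces are assumptions made for illustration: `policy_mean` is a single linear layer with a sigmoid standing in for the policy network (whose architecture is not described here), the Beta shape parameters use the mean-concentration form ($\alpha = k\,p$, $\beta = k(1-p)$) under which the gate's mean equals the predicted $p$, and the concentration value 10 is borrowed from the $k$ listed among the RL hyperparameters.

```python
import math
import random

def gap(tokens):
    """Global average pooling over the token axis."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def build_state(h_img, h_text):
    """s = [GAP(H_img); GAP(H_text); cos(GAP(H_img), GAP(H_text))]."""
    g_img, g_text = gap(h_img), gap(h_text)
    return g_img + g_text + [cosine(g_img, g_text)]

def policy_mean(state, weights, bias):
    """Hypothetical policy head: linear layer + sigmoid, clamped away from 0 and 1."""
    z = sum(w * x for w, x in zip(weights, state)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return min(max(p, 1e-4), 1.0 - 1e-4)

def sample_gate(p, concentration, rng):
    """Draw a ~ Beta(k*p, k*(1-p)); the mean of this distribution is p."""
    return rng.betavariate(concentration * p, concentration * (1.0 - p))

def fuse(a, h_cross_img, h_cross_text):
    """H_fuse = a * H_cross-img + (1 - a) * H_cross-text, element-wise."""
    return [[a * x + (1.0 - a) * y for x, y in zip(ri, rt)]
            for ri, rt in zip(h_cross_img, h_cross_text)]

rng = random.Random(0)
h_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_text = [[0.5, 0.5], [1.0, 0.0]]
state = build_state(h_img, h_text)                       # length 2*d + 1
p = policy_mean(state, weights=[0.1] * len(state), bias=0.0)
a = sample_gate(p, concentration=10.0, rng=rng)
```

A larger concentration makes the sampled gate cluster more tightly around the predicted mean, which trades exploration for stability.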

3.3 Layer-wise Sequential Decision and Rewards

The process is formulated as a sequential decision problem over $L$ layers. At each layer $l$:

  • Policy $\pi_\theta$ outputs $a_l$ given state $s_l$.
  • Reward $R_l$ comprises alignment between fused features and CLIP text embedding plus query accuracy gain:

$$R_l = A_{\text{sim}} \cdot \cos(U\,\text{GAP}(H_{\text{fuse}}),\,t^*) + A_{\text{imp}} \cdot (Acc_l - Acc_{l-1}),$$

where $U$ is a linear map, $t^*$ is the CLIP text embedding for the ground-truth class, and $Acc_l$ is current query accuracy.

The policy is optimized with REINFORCE to maximize the expected cumulative reward, with an entropy bonus added to the objective.
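The reward and the REINFORCE objective can be sketched as scalar computations. This is an illustrative sketch only: the use of undiscounted reward-to-go as the return, the absence of a baseline, and the `ent_coef` weight are assumptions not stated in the text, and per-layer entropies are passed in rather than computed. In practice the loss would be built inside an autodiff framework so that gradients flow through the log-probabilities.

```python
import math

def beta_log_prob(a, alpha, beta):
    """Log-density of Beta(alpha, beta) at a, for 0 < a < 1."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1.0) * math.log(a) + (beta - 1.0) * math.log(1.0 - a) - log_norm

def layer_reward(cos_sim, acc, prev_acc, a_sim=0.5, a_imp=1.0):
    """R_l = A_sim * cos(U GAP(H_fuse), t*) + A_imp * (Acc_l - Acc_{l-1})."""
    return a_sim * cos_sim + a_imp * (acc - prev_acc)

def rewards_to_go(rewards):
    """Return for layer l = sum of rewards from layer l onward (assumed, undiscounted)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def reinforce_loss(log_probs, returns, entropies, ent_coef):
    """Negative REINFORCE surrogate with an entropy bonus (value only, no gradients)."""
    return -sum(lp * g + ent_coef * h
                for lp, g, h in zip(log_probs, returns, entropies))
```

Minimizing this loss raises the log-probability of gate values that were followed by high returns, while the entropy term discourages the policy from collapsing to a single gate value too early.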

3.4 Auxiliary Losses

After fusion, $H_{\text{fuse}}$ is reinjected into the visual stream via residual addition and concatenation. Prototypical Network prototypes $c_i$ are formed by averaging support features, and queries are classified via cosine-softmax. The total loss is:

$$L_{\text{total}} = L_{\text{sup}} + \gamma \sum_{l=1}^L L_{\text{RL}}^l$$

where $L_{\text{sup}}$ is the query cross-entropy and $\gamma$ balances the RL signal.
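The prototype classifier and the combined loss can be sketched as follows. The softmax `temperature` is a hypothetical parameter (the text does not give a scale for the cosine logits), the feature vectors are toy values, and `gamma=0.1` follows the value listed in the implementation details.

```python
import math

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def prototypes(support_feats, support_labels, n_way):
    """c_i = mean of the support features belonging to class i."""
    return [mean_vec([f for f, y in zip(support_feats, support_labels) if y == c])
            for c in range(n_way)]

def cosine_softmax(query, protos, temperature=0.1):
    """Class probabilities from temperature-scaled cosine similarities (assumed scale)."""
    logits = [cosine(query, c) / temperature for c in protos]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def total_loss(query_probs, query_labels, rl_losses, gamma=0.1):
    """L_total = L_sup + gamma * sum_l L_RL^l, with L_sup the mean query cross-entropy."""
    l_sup = -sum(math.log(p[y]) for p, y in zip(query_probs, query_labels)) / len(query_labels)
    return l_sup + gamma * sum(rl_losses)

support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
protos = prototypes(support, labels, n_way=2)
probs = cosine_softmax([1.0, 0.2], protos)
```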

4. Layer-wise Integration Strategy

The attribute and description token streams from DSC are injected into different network depths:

  • Shallow (early) layers: Receive the attribute tokens ($H_{\text{attr}}$) to promote fine-grained, local alignment between image regions and visual attributes.
  • Deep (late) layers: Receive the global description tokens ($H_{\text{desc}}$) to guide the network toward holistic, class-level understanding.

The learned gating parameter $a_l$ is empirically observed to increase with network depth, indicating increased emphasis on description-guided semantics in deeper layers. This design ensures that visual-textual fusion matches the semantic granularity appropriate to each layer’s feature abstraction (Li et al., 31 Jan 2026).

5. Experimental Evaluation and Quantitative Results

DVLA-RL was evaluated across three few-shot learning scenarios:

  • General few-shot: miniImageNet, tieredImageNet, CIFAR-FS.
  • Fine-grained few-shot: CUB-200-2011, Stanford Dogs, Stanford Cars.
  • Cross-domain: miniImageNet-trained, tested on CUB, Places, ChestX.

Each benchmark used 2000 episodes of 5-way (1-shot and 5-shot) classification, with 15 queries per class.
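For readers unfamiliar with the episodic protocol, one evaluation episode can be sampled as in the sketch below. The data layout (a dict from class name to a list of items) and the function name are illustrative assumptions; the 5-way, 1- or 5-shot, 15-queries-per-class settings follow the text.

```python
import random

def sample_episode(labels_to_items, n_way=5, k_shot=1, n_query=15, rng=random):
    """Sample one N-way K-shot episode with disjoint support and query sets."""
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw K + Q distinct items per class, then split them.
        items = rng.sample(labels_to_items[cls], k_shot + n_query)
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query

# Toy dataset: 8 classes with 30 items each (placeholder item names).
data = {c: [f"{c}_{i}" for i in range(30)] for c in "abcdefgh"}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=15, rng=random.Random(0))
```

Accuracy would then be averaged over all sampled episodes, each contributing 5 × 15 = 75 query predictions.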

Main Performance Results

| Dataset | DVLA-RL 1-shot | DVLA-RL 5-shot | Prior Best (SemFew/SUITED/MEFP) 1-shot | Prior Best (SemFew/SUITED/MEFP) 5-shot |
| --- | --- | --- | --- | --- |
| miniImageNet | 81.69 ± 0.36 | 88.25 ± 0.28 | 78.94 | 86.49 |
| tieredImageNet | 83.02 ± 0.43 | 91.71 ± 0.29 | 82.37 | 89.89 |
| CIFAR-FS | 87.18 ± 0.40 | 90.59 ± 0.31 | 84.34 | 89.11 |
| CUB | 91.93 | 95.06 | 86.02 | 94.13 |
| Dogs | 89.64 | 91.42 | 76.55 | 88.86 |
| Cars | 92.95 | 96.59 | 89.97 | 96.53 |
| CUB (cross-dom) | 67.46 | 78.99 | 51.55 | 73.61 |
| Places | 69.26 | 80.70 | 52.06 | 73.78 |
| ChestX | 23.47 | 26.94 | 23.11 | 26.70 |

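DVLA-RL exceeds the listed prior best on every benchmark and in both shot settings, with margins ranging from 0.06 points (Cars, 5-shot) to 17.20 points (Places, 1-shot).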

Component Ablation (miniImageNet 1-shot)

| Configuration | Accuracy (%) |
| --- | --- |
| Attributes only | 75.73 |
| + Descriptions | 76.56 |
| + Progressive Top-k | 78.36 |
| + RLA (full DVLA-RL) | 81.69 |

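The ablation shows that each module adds to overall accuracy: descriptions contribute 0.83 points, progressive top-k selection a further 1.80 points, and RLA the largest gain at 3.33 points.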

6. Implementation Details

  • Visual backbone: Visformer-Tiny (ViT variant).
  • Text encoder: CLIP ViT-B/16.
  • LLM for semantics: Qwen2.5-VL-32B, with specific prompts controlling attribute extraction and description synthesis.
  • Pretraining and meta-tuning: 300–800 epochs for pretraining (batch 512), 100 epochs episodic for meta-tuning.
  • Optimization: AdamW, initial learning rate $5\times 10^{-4}$, cosine scheduler.
  • RL hyperparameters: $k=10$, $A_{\text{sim}}=0.5$, $A_{\text{imp}}=1.0$, $\gamma=0.1$, $\lambda=0.2$.
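As a reference for the learning-rate setting, a cosine schedule starting at $5\times 10^{-4}$ can be written as a small function. The minimum learning rate of zero, the absence of warmup, and the per-step granularity are assumptions; the text only specifies the initial rate and the cosine shape.

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(50, 100)` gives half the initial rate, since the cosine term is zero at the midpoint.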

7. Summary and Significance

DVLA-RL demonstrates that hierarchical, progressive alignment between vision and language—achieved through LLM-grounded attribute extraction, selection, summarization, and reinforcement learning–guided fusion—provides substantial improvements for few-shot learning tasks. By matching the level of semantic abstraction injected to the depth of the visual network and dynamically adjusting fusion via RL-gated attention, this architecture attains state-of-the-art results across a wide range of benchmarks. The staged ablation validates the complementary roles of attribute selection, description synthesis, and RL gating, underscoring the necessity of both dual-level semantics and adaptive layer-wise fusion (Li et al., 31 Jan 2026).

References (1)
