Latent Q-Former: Efficient Vision-Language Bridge
- The paper introduces the LQ-Former, a novel architecture that grounds a QFormer in a frozen encoder–decoder LLM's latent space to cut computational costs and improve sample efficiency.
- It employs three frozen unimodal modules with a trainable bridge, achieving higher image captioning BLEU-4 and VQA accuracy while reducing memory usage.
- Empirical results show accelerated convergence and robust performance in single-task, multi-task, and zero-shot settings, validating its practical efficiency.
The Latent Q-Former (LQ-Former), also referred to as the Semantically Grounded QFormer, is an efficient bridge architecture for vision–language models built from frozen unimodal backbones. LQ-Former distinguishes itself from standard QFormer frameworks such as BLIP-2 and InstructBLIP by grounding the QFormer in the latent space of a frozen pre-trained encoder–decoder LLM, thereby substantially reducing computational overhead and improving sample efficiency for vision–language pretraining and alignment. Specifically, LQ-Former uses the LLM's intermediate latent text representations both as QFormer conditioning and as LLM decoder input, dispensing with expensive multimodal pretraining and avoiding the full end-to-end LLM encoder computation otherwise required for every sample.
1. Architectural Overview
The LQ-Former pipeline interconnects three frozen unimodal modules and a single trainable bridge. The architecture encompasses:
- Visual Encoder: A frozen image encoder (EVA-CLIP-g/14) that encodes an image to patch embeddings $V \in \mathbb{R}^{N_v \times d_v}$, where $N_v$ is the number of image patches and $d_v$ the hidden dimension.
- LLM: A frozen encoder–decoder LLM (FlanT5-base) with encoder $f_{\mathrm{enc}}$ and decoder $f_{\mathrm{dec}}$ components, both kept frozen throughout training.
- QFormer Module: A trainable transformer-based module with learnable query vectors $Q \in \mathbb{R}^{N_q \times d}$, comprising stacks of alternating self-attention and cross-attention layers (a minimal setup sketch follows this list).
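As a concrete reference point, the sketch below shows how the frozen backbones and the trainable bridge could be assembled. It assumes Hugging Face `transformers` for FlanT5-base; `load_eva_clip_g14` is a hypothetical loader, `LatentQFormer` is the illustrative bridge class defined in the sketch in Section 2, and the query count and vision width are assumed values, not figures from the paper.

```python
# Minimal setup sketch (not the authors' code): freeze the unimodal backbones,
# keep only the bridge trainable. Checkpoint names, the vision loader, and the
# bridge hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
for p in llm.parameters():              # encoder and decoder both stay frozen
    p.requires_grad_(False)

vision_encoder = load_eva_clip_g14()    # hypothetical loader for EVA-CLIP-g/14 patch features
for p in vision_encoder.parameters():   # frozen image encoder
    p.requires_grad_(False)

qformer = LatentQFormer(                # trainable bridge (~30M params in the paper; sizes here are placeholders)
    num_queries=32,                     # assumed query count
    hidden_dim=llm.config.d_model,      # match the LLM's latent width (768 for FlanT5-base)
    vision_dim=1408,                    # assumed EVA-CLIP-g/14 patch-embedding width
)
trainable_params = list(qformer.parameters())   # only the bridge is optimized
```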
The distinguishing design in LQ-Former is twofold:
- The LLM encoder's output for the prompt, $E_p \in \mathbb{R}^{L_p \times d}$ (where $L_p$ is the prompt length), is included directly alongside the QFormer's learnable queries at the input, i.e., the QFormer processes $[\,Q;\, E_p\,]$.
- Instead of funneling the QFormer's output through the LLM encoder for fusion, the final representation is formed by concatenating the QFormer's output $Z_Q$ with the prompt embedding $E_p$ and passing the result $[\,Z_Q;\, E_p\,]$ directly to the LLM decoder $f_{\mathrm{dec}}$ as its "prefix".
This paradigm eliminates the need for a full LLM encoder pass at generation time. Standard QFormer pipelines employ the QFormer outputs as soft visual prompts and pass them, together with the prompt tokens, through the LLM encoder $f_{\mathrm{enc}}$, which requires both encoder and decoder inference. LQ-Former restricts itself to decoder-side computation during inference, aside from the one-time prompt encoding.
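A data-flow sketch of this wiring follows, reusing the placeholder modules from the setup sketch above; the helper names `encode_prompt` and `lq_former_prefix` are ours, not the paper's, and shapes follow the notation in Section 2.

```python
# Data-flow sketch: one frozen encoder pass for the prompt, frozen vision
# features, a trainable bridge, and a decoder-ready prefix [Z_Q; E_p].
import torch

@torch.no_grad()
def encode_prompt(prompt_ids):
    # Single pass through the frozen LLM encoder; yields E_p of shape (B, L_p, d).
    return llm.encoder(input_ids=prompt_ids).last_hidden_state

def lq_former_prefix(image, prompt_ids):
    with torch.no_grad():
        patches = vision_encoder(image)                      # V: frozen patch embeddings (B, N_v, d_v)
    e_p = encode_prompt(prompt_ids)                          # E_p: frozen prompt latents
    z_q = qformer(queries_and_prompt=e_p, vision=patches)    # Z_Q: (B, N_q, d)
    return torch.cat([z_q, e_p], dim=1)                      # decoder "prefix" [Z_Q; E_p]
```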
2. Mathematical Formulation and Attention Operations
The QFormer module operates over the concatenated queries and prompt embeddings. Let $Q \in \mathbb{R}^{N_q \times d}$, $V \in \mathbb{R}^{N_v \times d_v}$, and $E_p \in \mathbb{R}^{L_p \times d}$ denote the learnable queries, image patch embeddings, and LLM encoder prompt embeddings, respectively. The input to the first of $L$ transformer layers is $H^{(0)} = [\,Q;\, E_p\,]$.
Each transformer block alternates:
- Self-attention over the concatenated queries and prompt embeddings:
$$H^{(l)\prime} = \mathrm{softmax}\!\left(\frac{\big(H^{(l)} W_Q^{(l)}\big)\big(H^{(l)} W_K^{(l)}\big)^{\top}}{\sqrt{d}}\right) H^{(l)} W_V^{(l)}$$
- Cross-attention (queries to vision tokens):
$$H^{(l+1)} = \mathrm{softmax}\!\left(\frac{\big(H^{(l)\prime} \tilde{W}_Q^{(l)}\big)\big(V \tilde{W}_K^{(l)}\big)^{\top}}{\sqrt{d}}\right) V \tilde{W}_V^{(l)}$$
where $W^{(l)}_{\{Q,K,V\}}$ and $\tilde{W}^{(l)}_{\{Q,K,V\}}$ represent the learned projections at layer $l$ (residual connections and feed-forward sub-layers are omitted for brevity). Upon completion of $L$ layers, the final query embeddings are extracted as $Z_Q = H^{(L)}_{1:N_q}$, i.e., the first $N_q$ positions. The full prefix for the decoder is $[\,Z_Q;\, E_p\,]$.
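The sketch below shows one possible implementation of this block structure and of the `LatentQFormer` class referenced in the setup sketch. It assumes pre-norm sub-layers with residual connections, lets prompt positions also pass through cross-attention, and uses a depth and width chosen for readability rather than to match the reported ~30M-parameter bridge.

```python
# Illustrative LQ-Former bridge (not the authors' code); layer layout is assumed.
import torch
import torch.nn as nn

class LQFormerBlock(nn.Module):
    def __init__(self, d_model=768, d_vision=1408, n_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, kdim=d_vision,
                                                vdim=d_vision, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, h, vision):
        x = self.norm1(h)                                        # self-attention over [Q; E_p]
        h = h + self.self_attn(x, x, x, need_weights=False)[0]
        x = self.norm2(h)                                        # cross-attention to vision tokens V
        h = h + self.cross_attn(x, vision, vision, need_weights=False)[0]
        return h + self.ffn(self.norm3(h))

class LatentQFormer(nn.Module):
    def __init__(self, num_queries=32, hidden_dim=768, vision_dim=1408, n_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.blocks = nn.ModuleList(LQFormerBlock(hidden_dim, vision_dim)
                                    for _ in range(n_layers))

    def forward(self, queries_and_prompt, vision):
        b = queries_and_prompt.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)          # learnable queries Q
        h = torch.cat([q, queries_and_prompt], dim=1)            # H^(0) = [Q; E_p]
        for blk in self.blocks:
            h = blk(h, vision)
        return h[:, : self.queries.size(0)]                      # Z_Q: final query embeddings
```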
3. Training Regimen and Objectives
LQ-Former dispenses with multimodal pretraining objectives such as image–text matching (ITM) or image–text contrastive (ITC) losses. The sole objective is the standard next-token cross-entropy loss:
$$\mathcal{L} = -\sum_{t=1}^{|y|} \log p_{\theta}\!\left(y_t \mid y_{<t},\, [\,Z_Q;\, E_p\,]\right)$$
where $y$ is the target output sequence (caption or answer) and $[\,Z_Q;\, E_p\,]$ is the decoder prefix. In multi-task settings (e.g., captioning plus VQA), a joint loss is used by concatenating tasks within the same batch.
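Under the assumptions of the earlier sketches, this objective reduces to the decoder's built-in token-level cross-entropy. A minimal sketch, where the Hugging Face `labels` mechanism stands in for the loss written above and `lq_former_prefix` is the hypothetical helper defined earlier:

```python
# Loss sketch: feed the prefix to the frozen decoder in place of encoder
# output and let the T5 head compute the next-token cross-entropy.
from transformers.modeling_outputs import BaseModelOutput

def lqformer_loss(image, prompt_ids, target_ids):
    prefix = lq_former_prefix(image, prompt_ids)                  # [Z_Q; E_p]
    out = llm(
        encoder_outputs=BaseModelOutput(last_hidden_state=prefix),
        labels=target_ids,                                        # caption or answer tokens
    )
    return out.loss   # only the QFormer's parameters receive gradient updates
```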
The standard pretraining and finetuning schedule involves:
- Single-task: 50 epochs over COCO Captions (117K images × 5 captions) or VQAv2 (82K images × 3 answers), prompt sampled randomly from a set.
- Multi-task: 20 epochs on image captioning, followed by 15 epochs of joint finetuning on captioning and VQA, using batch size 64 and a learning rate schedule with linear warmup for 1K steps and cosine decay thereafter (see the scheduler sketch after this list).
- Zero-shot: Models pretrained and finetuned only on COCO and VQAv2 are evaluated on OKVQA without additional tuning.
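The scheduler sketch below illustrates the optimization recipe above. The optimizer choice (AdamW) and the peak learning rate are assumptions, since the exact value is not restated here, and the step count is a rough COCO-based estimate.

```python
# Schedule sketch: linear warmup for 1K steps, cosine decay afterwards.
import torch
from transformers import get_cosine_schedule_with_warmup

steps_per_epoch = 117_000 * 5 // 64              # rough estimate for COCO Captions at batch size 64
num_training_steps = 20 * steps_per_epoch        # 20-epoch captioning stage
optimizer = torch.optim.AdamW(qformer.parameters(), lr=1e-4)   # lr is an assumed placeholder
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=num_training_steps
)

for step in range(num_training_steps):
    # loss = lqformer_loss(...); loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```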
4. Computational Efficiency and Scalability
A primary advantage of LQ-Former is the decoupling of the LLM encoder during both training and inference. Only the QFormer (approximately 30M parameters) and the LLM decoder (approximately 225M parameters) require gradient computation and activation storage, yielding roughly a 40% memory reduction versus a fully active LLM. Inference is streamlined: a single forward pass through the image encoder and the QFormer suffices, and the prompt encoding $E_p$ can be cached for repeated prompts, reducing per-token FLOPs by roughly 50% relative to architectures requiring both LLM encoder and decoder passes.
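A caching sketch for repeated prompts at inference is given below; module and helper names follow the earlier sketches (single-image, single-prompt case for simplicity), and the exact `generate` keyword handling may vary across `transformers` versions.

```python
# Inference sketch: cache E_p per distinct prompt so repeated prompts cost
# only the image encoder, the QFormer, and the LLM decoder.
import torch
from transformers.modeling_outputs import BaseModelOutput

_prompt_cache = {}

def cached_prompt_latents(prompt: str):
    if prompt not in _prompt_cache:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            _prompt_cache[prompt] = llm.encoder(input_ids=ids).last_hidden_state
    return _prompt_cache[prompt]

@torch.no_grad()
def answer(image, prompt: str, max_new_tokens=30):
    e_p = cached_prompt_latents(prompt)
    patches = vision_encoder(image)
    z_q = qformer(queries_and_prompt=e_p, vision=patches)
    prefix = torch.cat([z_q, e_p], dim=1)                       # [Z_Q; E_p]
    out = llm.generate(encoder_outputs=BaseModelOutput(last_hidden_state=prefix),
                       max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```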
Empirical convergence is accelerated: on captioning, LQ-Former reaches a BLEU-4 of 0.30 in approximately 10 epochs, compared to 25 epochs for the baseline QFormer, and its peak performance is also higher.
5. Empirical Performance and Ablation Results
Quantitative results reported for single-task, multi-task, and zero-shot transfer are summarized below.
| Setting | Metric | Baseline QFormer | Grounded LQ-Former |
|---|---|---|---|
| Captioning | BLEU-4 (val, single-task) | 0.238 | 0.364 |
| VQA | Accuracy (val, single-task) | 57.72% | 63.25% |
| Captioning | BLEU-4 (final, multi-task) | 0.209 | 0.362 |
| VQA | Accuracy (final, multi-task) | 55.4% | 66.8% |
| Zero-shot (OKVQA) | Accuracy | 28.8% | 38.96% |
Additional comparisons on OKVQA place InstructBLIP (OPT-6.7B) at 36.4% and InstructBLIP (FlanT5-XL, 3B) at 40.7%.
Ablation studies controlling for captioning BLEU-4 during pretraining reveal that language grounding (feeding the frozen LLM encoder's prompt latents $E_p$ to the QFormer) accelerates the emergence of VQA performance by approximately 5–6% accuracy in early finetuning, indicating that visual reasoning emerges more quickly with the grounded approach.
6. Contextual Significance and Design Implications
The central insight of LQ-Former is that aligning the QFormer's output space with the LLM's latent semantic space, rather than projecting image representations directly onto text tokens, provides superior sample efficiency and model alignment. Grounding the QFormer in the frozen LLM encoder's latent prompt representations $E_p$ enables direct semantic conditioning during both training and inference, and substantially reduces memory and computational demands by obviating repeated LLM encoder invocation and parameter updates.
LQ-Former’s lightweight design bypasses large-scale multi-modal pretraining and sidesteps the need for updatable unimodal backbones. This trait enables practical vision–language alignment with only modest cross-modal data. Performance advantages are most pronounced in rapid convergence and higher final accuracy on both captioning and VQA tasks, in both single- and multi-task regimes.
7. Limitations and Open Challenges
While LQ-Former offers compelling efficiency and empirical gains, several limitations remain:
- A performance gap persists relative to very large LLMs (billions of parameters) pretrained on orders-of-magnitude larger image–text corpora.
- The method is tailored to encoder–decoder LLMs; adapting this latent-grounding schema to decoder-only LLMs is non-trivial.
- The LLM encoder's inductive biases may restrict the range of semantic compositions available to the downstream model, limiting expressiveness when the encoder's representations carry systematic bias.
A plausible implication is that future lines of research may explore generalizing the latent-grounded bridging strategy to broader classes of LLMs and further minimizing reliance on frozen LLMs’ encoder representations in pursuit of both efficiency and representation flexibility.