- The paper introduces a meta-adaptive prompt distillation (MAPD) strategy that leverages meta-learning to optimize task-specific image features for few-shot visual question answering.
- It employs an attention-mapper module within the LLaVA v1.5 architecture, trained with first-order MAML, and consistently outperforms traditional in-context learning on the VL-ICL benchmark.
- MAPD delivers robust performance gains with only about 24 million trainable parameters, offering a computationally efficient and scalable option for resource-constrained multimodal models.
The paper by Gupta et al., titled "Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering," presents an approach for making Large Multimodal Models (LMMs) more adaptable in few-shot settings. This matters most in visual question answering (VQA), where a model must perform well from only a handful of labelled examples.
Key Contributions and Methodology
The authors identify a key limitation of in-context learning (ICL) in LMMs: smaller models (under 7 billion parameters) often degrade rather than improve as more in-context examples are added. They attribute this to the long multimodal sequences involved, in which the sheer number of image embeddings overwhelms the model. To address this, the paper proposes a meta-learning strategy that distills soft prompts from task-relevant image features.
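To make the idea concrete, here is a minimal sketch (not the authors' code) of how a long sequence of image-patch features might be distilled into a handful of soft prompt vectors via learned cross-attention queries. The class name, feature dimensions, and number of prompt tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionMapper(nn.Module):
    """Hypothetical sketch: compress many image-patch features into a few soft
    prompt tokens, so the language model attends to a handful of vectors
    instead of hundreds of patch embeddings per in-context example."""

    def __init__(self, feat_dim=1024, lm_dim=4096, num_prompts=8, num_heads=8):
        super().__init__()
        # A small set of learned queries cross-attends over the patch features.
        self.queries = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, lm_dim)  # map into the LLM's embedding space

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, feat_dim) from a frozen vision encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        prompts, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(prompts)  # (batch, num_prompts, lm_dim) soft prompt tokens


# Example: 576 CLIP-style patch features per image reduced to 8 prompt vectors.
mapper = AttentionMapper()
soft_prompts = mapper(torch.randn(2, 576, 1024))  # -> shape (2, 8, 4096)
```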
The key innovation is the meta-adaptive prompt distillation (MAPD) technique, built around an attention-mapper module integrated into the LLaVA v1.5 architecture. The authors train this module with the first-order version of Model-Agnostic Meta-Learning (MAML), so that the model can adapt quickly from a few examples at test time.
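A rough sketch of a first-order MAML meta-update over a mapper like the one above, again under assumed names: `lmm_loss` stands in for the frozen LMM's loss when conditioned on the mapper's soft prompts, and the support/query episode structure follows standard meta-learning practice rather than the paper's exact recipe. First-order MAML reuses the gradients computed at the adapted (inner-loop) parameters as the meta-gradient, avoiding second-order derivatives.

```python
import copy
import torch

def fomaml_step(mapper, episodes, lmm_loss, meta_opt, inner_lr=1e-3, inner_steps=3):
    """One first-order MAML meta-update for the prompt mapper.

    episodes: list of (support_batch, query_batch) pairs for sampled tasks.
    lmm_loss: hypothetical callable(module, batch) -> scalar loss from the frozen
              LMM conditioned on the soft prompts that `module` produces.
    """
    meta_opt.zero_grad()
    for support, query in episodes:
        # Inner loop: adapt a task-specific copy of the mapper on the support set.
        adapted = copy.deepcopy(mapper)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            lmm_loss(adapted, support).backward()
            inner_opt.step()

        # Outer loop, first-order approximation: take the query-set gradients at
        # the adapted parameters and accumulate them onto the shared mapper,
        # ignoring second-order terms through the inner updates.
        grads = torch.autograd.grad(lmm_loss(adapted, query), list(adapted.parameters()))
        for p, g in zip(mapper.parameters(), grads):
            p.grad = g.detach() if p.grad is None else p.grad + g.detach()
    meta_opt.step()
```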
Evaluation and Results
Gupta et al. evaluate MAPD on the VL-ICL benchmark, which includes tasks such as Fast Open-Ended MiniImageNet and CLEVR Count Induction. Their approach consistently outperforms traditional ICL and other prompt-tuning baselines. Most notably, whereas ICL performance in this regime often fluctuates or degrades as examples are added, MAPD improves monotonically with the number of shots, underscoring its robustness and scalability.
The paper also highlights the resilience of the proposed method against image perturbations. This is particularly evident in complex VQA scenarios where the model must navigate noisy or ambiguous visual data while retaining accuracy.
Theoretical and Practical Implications
Theoretically, this research underscores the potential of meta-learning frameworks to improve the cross-task generalization of LMMs in low-data regimes. By optimizing task-specific embeddings rather than relying solely on larger models or more data, the study points to a different way of approaching few-shot learning challenges.
Practically, the MAPD framework offers a scalable and computationally efficient alternative to traditional fine-tuning, achieving strong results with approximately 24 million trainable parameters. This efficiency makes it a viable option for deploying adaptable LMMs in real-world applications where resources are constrained.
Future Directions
While the study effectively demonstrates the power of MAPD in single-image tasks, extending this approach to multi-image scenarios presents an intriguing avenue for future research. Additionally, exploring the integration of more complex reasoning capabilities into the meta-learning setup could further enhance the adaptability of LMMs, pushing the boundaries of what is feasible in few-shot VQA tasks.
In conclusion, Gupta et al. make a substantial contribution to few-shot learning in LMMs with their meta-adaptive prompt distillation approach. The work both deepens our understanding of how multimodal information can be distilled effectively and lays a foundation for further innovation in this rapidly evolving field.