Papers

Topics

Authors

Recent

View all

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 85 tok/s

Gemini 2.5 Pro 46 tok/s Pro

GPT-5 Medium 16 tok/s Pro

GPT-5 High 10 tok/s Pro

GPT-4o 108 tok/s Pro

Kimi K2 192 tok/s Pro

GPT OSS 120B 455 tok/s Pro

Claude Sonnet 4 31 tok/s Pro

2000 character limit reached

A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models (2502.13942v1)

Published 19 Feb 2025 in cs.CV

Abstract: A large-scale vision and LLM that has been pretrained on massive data encodes visual and linguistic prior, which makes it easier to generate images and language that are more natural and realistic. Despite this, there is still a significant domain gap between the modalities of vision and language, especially when training data is scarce in few-shot settings, where only very limited data are available for training. In order to mitigate this issue, a multi-modal meta-learning framework has been proposed to bridge the gap between two frozen pretrained large vision and LLMs by introducing a tunable prompt connecting these two large models. For few-shot image captioning, the existing multi-model meta-learning framework utilizes a one-step prompting scheme to accumulate the visual features of input images to guide the LLM, which struggles to generate accurate image descriptions with only a few training samples. Instead, we propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images. In addition, we further propose to learn different meta-parameters of the model corresponding to each CoT step in distinct subspaces to avoid interference. We evaluated our method on three commonly used image captioning datasets, i.e., MSCOCO, Flickr8k, and Flickr30k, under few-shot settings. The results of our experiments indicate that our chain-of-thought subspace meta-learning strategy is superior to the baselines in terms of performance across different datasets measured by different metrics.