SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
Abstract: Reasoning is increasingly crucial for a wide range of tasks. While chain-of-thought prompting enables LLMs to leverage reasoning effectively, harnessing the reasoning capabilities of vision-language models (VLMs) remains challenging. To address this, we propose a novel self-distillation framework that enhances the reasoning capabilities of VLMs. The framework introduces several key innovations. We first employ a prompt library tailored to visual reasoning tasks to generate diverse in-context questions, and we use a two-step reasoning procedure to derive reasoning-guided responses. These responses are then used for self-distillation, enabling the model to internalize the reasoning process. We further improve the model architecture with several components: an intervention adapter for efficient parameter updates, a cross-modal skip connection that facilitates information exchange between modalities, and an ensemble learning algorithm that integrates the diverse reasoning from multiple in-context questions. Extensive experiments show that our method significantly improves baseline performance across five VQA datasets.
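To make the data-generation pipeline concrete, the following is a minimal Python sketch of the two-step reasoning procedure and the ensembling over diverse in-context questions. The prompt wording, the `vlm` callable interface, and all function names here are assumptions for illustration; the abstract does not publish the authors' actual prompts or code.

```python
# Sketch of the two-step, reasoning-guided response generation and a naive
# ensemble over diverse in-context questions. All prompt templates and the
# `vlm` interface are hypothetical stand-ins, not the paper's implementation.
from collections import Counter
from typing import Any, Callable, List

# A VLM is modeled as a callable: (prompt, image) -> generated text.
Vlm = Callable[[str, Any], str]

def two_step_response(vlm: Vlm, image: Any, question: str) -> str:
    """Step 1: elicit an explicit reasoning trace.
    Step 2: condition the final answer on that trace."""
    rationale = vlm(
        f"Question: {question}\n"
        "Think step by step and describe the visual evidence before answering.",
        image,
    )
    answer = vlm(
        f"Question: {question}\nReasoning: {rationale}\n"
        "Given this reasoning, state the final answer.",
        image,
    )
    return f"{rationale}\nAnswer: {answer}"

def diverse_traces(vlm: Vlm, image: Any, question: str,
                   prompt_library: List[str]) -> List[str]:
    """One reasoning-guided response per in-context prompt template;
    these traces serve as self-distillation targets."""
    return [
        two_step_response(vlm, image, template.format(question=question))
        for template in prompt_library
    ]

def ensemble_answer(traces: List[str]) -> str:
    """Majority vote over the final answers of the diverse traces
    (a simple stand-in for the paper's ensemble learning algorithm)."""
    finals = [t.rsplit("Answer:", 1)[-1].strip() for t in traces]
    return Counter(finals).most_common(1)[0][0]
```

In this reading, self-distillation then fine-tunes the same model on its own reasoning-guided responses, so the reasoning behavior is internalized rather than re-prompted at inference time.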
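The architectural additions can be pictured as a small trainable module attached to a frozen backbone. The PyTorch sketch below is illustrative only: the class name, dimensions, bottleneck rank, and the pooling-based skip path are assumptions, since the abstract does not specify the design of the intervention adapter or the cross-modal skip connection.

```python
import torch
import torch.nn as nn

class InterventionAdapter(nn.Module):
    """Bottleneck adapter with a cross-modal skip connection.
    A sketch under assumed dimensions; the paper's exact design may differ."""

    def __init__(self, d_text: int = 768, d_vision: int = 1024, rank: int = 64):
        super().__init__()
        # Low-rank bottleneck: only these small matrices are trained,
        # keeping parameter updates efficient.
        self.down = nn.Linear(d_text, rank)
        self.up = nn.Linear(rank, d_text)
        # Cross-modal skip: projects visual features into the text stream.
        self.cross = nn.Linear(d_vision, d_text)

    def forward(self, h_text: torch.Tensor, h_vision: torch.Tensor) -> torch.Tensor:
        adapted = self.up(torch.relu(self.down(h_text)))
        # Pool visual patch features, then inject them via the skip path.
        skip = self.cross(h_vision.mean(dim=1, keepdim=True))
        return h_text + adapted + skip

# Shape check with dummy hidden states from a (hypothetical) frozen backbone.
h_text = torch.randn(2, 16, 768)     # (batch, text tokens, d_text)
h_vision = torch.randn(2, 49, 1024)  # (batch, image patches, d_vision)
out = InterventionAdapter()(h_text, h_vision)
assert out.shape == h_text.shape
```

The design intent, as far as the abstract states it, is that the backbone stays frozen while the adapter carries the parameter updates and the skip connection carries cross-modal information.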