MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (2505.10526v2)

Published 15 May 2025 in cs.LG, cs.CL, and cs.CV

Abstract: Speculative decoding significantly accelerates LLM inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM's vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.

Summary

  • The paper presents MASSV, a novel framework improving Vision-Language Model efficiency through speculative decoding adapted for multimodal inputs.
  • MASSV employs a two-phase methodology: multimodal adaptation connects the VLM's vision encoder to a smaller drafter, and self-data distillation aligns the drafter's predictions with the target VLM using visual instruction tuning.
  • Empirical evaluations show MASSV increases token acceptance length by up to 30% and achieves inference speedups of up to 1.46x on tasks like COCO captioning for Qwen2.5-VL and Gemma3 models.

Overview of MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

The paper "MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-LLMs" introduces a novel approach for enhancing the efficiency of vision-LLMs (VLMs) through speculative decoding (SD). While speculative decoding has been successfully implemented in LLMs to significantly reduce computational cost without affecting output quality, its application to vision-LLMs has posed unique challenges. Specifically, the absence of architectural components for visual inputs in small language drafters and the divergence in token predictions between unimodal drafters and VLMs due to visual context makes the adaptation nontrivial. MASSV seeks to address these challenges using a two-phase approach.

Methodology

The MASSV framework turns smaller LLMs into effective multimodal drafters in two phases:

  1. Multimodal Adaptation: This phase connects the target VLM's vision encoder to the smaller draft model through a lightweight trainable projector, so the drafter can process and use visual information when proposing tokens while leaving both the vision encoder and the drafter's backbone architecture intact (see the adaptation sketch after this list).
  2. Self-Data Distillation (SDD): The adapted drafter then undergoes self-distilled visual instruction tuning on responses generated by the target VLM, aligning its token prediction distribution with that of the multimodal target. This raises token acceptance rates on visually grounded tasks by bringing the drafter's predictions closer to the target model's output distribution (see the distillation sketch after this list).
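Phase 1 can be pictured as wiring the target VLM's frozen vision encoder into the small drafter through a lightweight trainable projector. The sketch below is illustrative only: the module names, the 2-layer MLP projector, and the Hugging Face-style `get_input_embeddings`/`inputs_embeds` interface are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MultimodalDrafter(nn.Module):
    """Sketch of phase 1 (multimodal adaptation): the target VLM's vision
    encoder is reused and kept frozen, and only a lightweight projector is
    trained to map image features into the draft LM's embedding space."""

    def __init__(self, vision_encoder, draft_lm, vision_dim, draft_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # shared with the target VLM, frozen below
        self.draft_lm = draft_lm               # small language model backbone
        # Lightweight trainable projector (a 2-layer MLP is assumed here).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, draft_dim),
            nn.GELU(),
            nn.Linear(draft_dim, draft_dim),
        )
        for p in self.vision_encoder.parameters():
            p.requires_grad = False            # keep the encoder frozen

    def forward(self, pixel_values, input_ids):
        # Encode the image with the target VLM's vision tower.
        with torch.no_grad():
            image_feats = self.vision_encoder(pixel_values)      # (B, N, vision_dim)
        # Project image features into the drafter's token-embedding space.
        image_embeds = self.projector(image_feats)               # (B, N, draft_dim)
        # Prepend the projected image tokens to the embedded text tokens
        # (assumes a Hugging Face-style LM that accepts `inputs_embeds`).
        text_embeds = self.draft_lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.draft_lm(inputs_embeds=inputs_embeds)
```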
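Phase 2, self-data distillation, can be sketched as regenerating the training responses with the target VLM itself and then fine-tuning the adapted drafter on them with the usual next-token loss. The `processor`/`generate` calls below assume a Hugging Face-style API and are meant only to convey the data flow.

```python
def build_self_distilled_dataset(target_vlm, processor, image_prompt_pairs,
                                 max_new_tokens=256):
    """Replace ground-truth answers with responses generated by the target VLM,
    so the drafter is later fine-tuned toward the distribution it must imitate
    during speculative decoding (a hedged sketch, not the paper's pipeline)."""
    distilled = []
    for image, prompt in image_prompt_pairs:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
        output_ids = target_vlm.generate(**inputs, max_new_tokens=max_new_tokens)
        response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
        distilled.append({"image": image, "prompt": prompt, "response": response})
    return distilled

# The adapted drafter (projector, and optionally the LM) is then fine-tuned on
# these self-distilled responses with the standard next-token cross-entropy loss.
```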

Empirical Evaluation

The MASSV framework exhibits notable improvements in speculative decoding across two model families, Qwen2.5-VL and Gemma3. The experimental results indicate that MASSV increases the mean accepted token length by up to 30% and achieves end-to-end inference speedups of up to 1.46x on visually grounded tasks such as COCO captioning. These improvements underline the effectiveness of multimodal adaptation and self-data distillation in aligning the drafter's prediction distribution with that of the target VLM.
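To relate the two headline numbers, a standard back-of-the-envelope cost model for speculative decoding (generic, not the paper's measurement methodology) shows why a longer accepted length translates into wall-clock speedup:

```python
def expected_speedup(accepted_len, gamma, cost_ratio):
    """Rough per-step speedup estimate for speculative decoding.

    accepted_len : average tokens emitted per verification step
    gamma        : tokens drafted per step
    cost_ratio   : cost of one drafter pass relative to one target pass
    """
    # One speculative step costs gamma drafter passes plus one target pass
    # (in units of a target forward pass); plain autoregressive decoding
    # would need `accepted_len` target passes for the same output.
    return accepted_len / (gamma * cost_ratio + 1.0)

# Illustrative numbers only: 4 drafted tokens, a drafter at ~10% of the
# target's cost, and ~2.3 tokens accepted per step -> ~1.6x estimated speedup.
print(expected_speedup(accepted_len=2.3, gamma=4, cost_ratio=0.1))
```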

Implications and Future Directions

MASSV offers practical efficiency gains for deploying VLMs, which are particularly beneficial in real-time AI applications where computational bandwidth is a significant constraint. Because the drafter reuses the target VLM's vision encoder, the vision encoding stage is decoupled from subsequent language processing, minimizing redundant computational overhead, a salient issue in large VLM architectures.

Moreover, the paper points to avenues for extending speculative decoding to more complex multimodal AI systems. Future research could apply the MASSV approach to larger and more diverse datasets, further refine the multimodal adaptation step, explore interoperability between different architectures, and extend speculative decoding to low-resource languages and domains.

Conclusion

MASSV provides a structured and actionable framework for leveraging existing LLM architectures to support vision-language interaction effectively. By combining architectural adaptation with self-data distillation, it achieves appreciable improvements in speculative decoding performance without compromising the quality of model outputs. As computational demands grow alongside model capabilities, MASSV marks a meaningful step toward better resource utilization while meeting high-performance requirements in AI systems.