
FREE: Fast and Robust Vision Language Models with Early Exits (2506.06884v1)

Published 7 Jun 2025 in cs.LG and cs.CV

Abstract: In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance. The source code is available at https://github.com/Div290/FREE.

Summary

  • The paper introduces FREE, which integrates early exit strategies and adversarial training to reduce inference latency and counteract overthinking and mid-crisis issues in VLMs.
  • The methodology employs transformer layers with GAN-based adversarial alignment to ensure intermediate features mirror final layer outputs without heavy reliance on labeled data.
  • Experimental evaluations on image captioning, VQA, and visual dialogue tasks show FREE enhances speed by over 1.51x, paving the way for real-time applications in constrained environments.

FREE: Fast and Robust Vision Language Models with Early Exits

Vision-Language Models (VLMs) are at the forefront of integrating visual and textual data for advanced cognitive tasks. Such models have shown significant performance gains across various applications. However, their large size can result in substantial inference latency, limiting their utility in real-time settings. This paper presents a novel approach, dubbed FREE (Fast and Robust Vision-Language Models with Early Exits), which seeks to mitigate inference latency while maintaining robust performance through a strategically designed early exit architecture within VLMs.

The researchers identify two critical issues that arise during VLM inference: overthinking and mid-crisis. Overthinking refers to excessive computation expended on "easy" samples that do not require the model's full depth, while mid-crisis describes a loss of information at intermediate layers caused by attention to irrelevant feature interactions. These phenomena are particularly pronounced because the LLM components are kept frozen.

FREE addresses these issues by incorporating Early Exit (EE) strategies. At its core, FREE introduces an adversarial training framework using a Generative Adversarial Network (GAN). Each exit consists of a transformer layer and a classifier, where the transformer layer is adversarially trained to align its feature representation with the deeper, final layer output. The exit classifier acts as a discriminator, challenging the model to produce consistent predictions from intermediate layers.
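The adversarial objective described above can be illustrated with a minimal numeric sketch. This is pure Python with a toy logistic discriminator, not the paper's implementation; the function and variable names (`disc_score`, `exit_layer_loss`, the weight vector `w`) are illustrative assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def disc_score(w, feat):
    # Toy discriminator: logistic score over a linear projection of a feature.
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feat)))

def discriminator_loss(w, final_feat, exit_feat):
    # Standard GAN objective: final-layer features are "real" (label 1),
    # exit-layer features are "fake" (label 0).
    return -(math.log(disc_score(w, final_feat))
             + math.log(1.0 - disc_score(w, exit_feat)))

def exit_layer_loss(w, exit_feat):
    # The exit transformer is trained to fool the discriminator,
    # pushing its features toward the final layer's feature distribution.
    return -math.log(disc_score(w, exit_feat))
```

At the discriminator's equilibrium, where exit features become indistinguishable from final-layer features, the discriminator outputs 0.5 and the exit layer's loss settles near log 2.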

The methodological innovation of FREE lies in using the transformer layer at each exit to emulate the representational capacity of the final layers at a fraction of the computational cost. Unlike traditional EE models, FREE does not require large labeled datasets for exit classifier tuning, significantly lowering the barrier to implementation and scalability. The adversarial training setup ensures that exit classifiers are robustly aligned with the knowledge embedded in deeper network states, facilitating accurate predictions even when exiting at early layers.
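The input-adaptive inference this enables can be sketched as a confidence-thresholded loop: each exit head produces a prediction, and computation stops as soon as one is confident enough. This is an illustrative pure-Python sketch under the assumption of a max-probability confidence measure; `layers` and `exit_heads` are stand-ins for the actual VLM components:

```python
def early_exit_inference(layers, exit_heads, x, threshold=0.9):
    """Run layers sequentially; stop as soon as an exit head is confident.

    `layers` transform the hidden state h; `exit_heads` map h to a
    probability distribution over outputs. Easy inputs exit early
    (mitigating overthinking); hard inputs use the full depth.
    """
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        h = layer(h)
        probs = head(h)
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), depth  # early exit taken
    return probs.index(confidence), depth  # fell through to the last layer
```

The threshold is the knob that trades speed against accuracy: lowering it makes more samples exit early at the risk of less reliable predictions.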

Moreover, the proposed setup accommodates both supervised and unsupervised training paradigms. In unsupervised scenarios, knowledge distillation is employed to maintain model performance by providing soft labels, while synthetic labels generated through CapFilt are utilized to handle scenarios devoid of labeled data. This flexibility underlines the potential of FREE to seamlessly integrate into diverse data environments.
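The soft-label distillation step can be sketched as a temperature-softened KL divergence between the final layer's (teacher) logits and an exit's (student) logits. This is a standard distillation formulation offered as a pure-Python illustration, not the paper's code; the temperature `T` and function names are assumptions:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax; higher T yields softer targets.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) over temperature-softened distributions:
    # zero when the exit matches the teacher, positive otherwise.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because the loss needs only the teacher's output distribution, no ground-truth labels are required, which is what makes the unsupervised regime workable.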

Experimentally, FREE shows promising results, enhancing inference speed by over 1.51 times without compromising accuracy compared to traditional VLM setups. The comprehensive evaluation across image captioning, visual question answering, and visual dialogue tasks illustrates that the approach effectively mitigates mid-crisis and overthinking, driving speed and quality improvements.

The implications of FREE are twofold. Practically, it offers a path toward real-time applications of VLMs in resource-constrained environments, reducing computational demands without sacrificing performance. Theoretically, it expands the understanding of feature interaction dynamics within multi-modal models and proposes a scalable way to bypass structural bottlenecks.

Looking forward, the adoption of FREE invites further exploration of adaptive inference mechanisms in AI. As models continue to grow in complexity, ensuring efficiency and robustness will remain paramount. FREE provides a foundation for subsequent research aiming to combine the benefits of large-scale learning with efficient inference in dynamic, data-diverse contexts.
