
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension (2412.03704v3)

Published 4 Dec 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step towards self-improving models in recent LLM studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated at the current search step, but also anticipates the quality of subsequent sentences that may result from it, thus providing a long-term value signal. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods based on other visual reward signals. Furthermore, we find that self-training the model with VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.


Summary

  • The paper's main contribution is VisVM, a stepwise value model trained with temporal difference learning to predict long-term visual-text alignment and so support coherent caption generation.
  • It uses CLIP's text-image similarity as the reward signal, reducing the CHAIRs hallucination score from 32.4 to 26.2 while improving overall caption quality.
  • Captions from VisVM-guided search are preferred 74% of the time over greedy decoding, and self-training on them yields an average improvement of 10.8% across eight multimodal benchmarks.

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

This paper introduces the Vision Value Model (VisVM), which is designed to enhance inference-time search in vision-language models (VLMs) so that they generate image descriptions with greater accuracy and detail while reducing hallucinations. VisVM takes a novel approach by evaluating not only the immediate quality of a generated sentence but also anticipating the quality of the sentences that follow it. This anticipatory capability allows VisVM to steer VLMs toward more coherent and detailed image descriptions.

Key Contributions and Methodology

The key contribution of this paper is a stepwise value model, VisVM, that provides long-term vision value signals. Unlike conventional process reward models, which score only the current step, VisVM is trained with Temporal Difference (TD) learning to predict the long-term consequences of each generated sentence, which helps maintain coherence across the caption. It uses CLIP's text-image similarity as the reward signal for evaluating visual-text alignment.
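To make the combination of TD learning and a CLIP-similarity reward concrete, the sketch below shows one plausible training step. It is a minimal illustration under assumptions, not the released implementation: `visvm` is any module that scores an (image, sentence) pair, `clip_model` is assumed to follow the OpenAI CLIP `encode_image`/`encode_text` interface with pre-tokenized sentences, and the discount factor `gamma` is a placeholder value.

```python
import torch
import torch.nn.functional as F

def visvm_td_step(visvm, clip_model, image, sent_tokens_t, sent_tokens_t1,
                  optimizer, gamma=0.9):
    """One TD(0) update: pull V(s_t, I) toward r(s_t, I) + gamma * V(s_{t+1}, I)."""
    with torch.no_grad():
        # Reward: CLIP image-text cosine similarity for the *current* sentence.
        img_emb = F.normalize(clip_model.encode_image(image), dim=-1)
        txt_emb = F.normalize(clip_model.encode_text(sent_tokens_t), dim=-1)
        reward = (img_emb * txt_emb).sum(dim=-1)

        # Bootstrapped value of the *next* sentence in the same caption
        # (semi-gradient TD: the target is not backpropagated through).
        next_value = visvm(image, sent_tokens_t1).squeeze(-1)
        td_target = reward + gamma * next_value

    # Regress the current value estimate toward the TD target.
    value = visvm(image, sent_tokens_t).squeeze(-1)
    loss = F.mse_loss(value, td_target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target bootstraps from the value of the next sentence, the learned score reflects not just how well the current sentence matches the image but how well the caption is expected to stay grounded if generation continues from it.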

To validate VisVM, the authors conducted several experiments showing that VisVM-guided search markedly enhances a VLM's ability to generate descriptive captions compared with greedy decoding and other search methods driven by visual reward signals. Specifically, captions produced with VisVM-guided search were preferred 74% of the time over those from greedy search in both GPT-based and human evaluations. Furthermore, using VisVM-guided captions to self-train the VLM (LLaVA-Next-7B) yielded an average performance improvement of 10.8% across eight multimodal benchmarks.
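The search procedure itself can be pictured with the short sketch below: at each step, sample several candidate next sentences from the VLM, score each with VisVM, and keep the highest-value continuation. The method names `vlm.generate_sentence_candidates` and `visvm.score`, the candidate count, and the end-of-sequence markers are illustrative placeholders, not the paper's released API; only the loop structure follows the description above.

```python
def visvm_guided_search(vlm, visvm, image, prompt, num_candidates=4, max_sentences=8):
    """Value-guided, sentence-level decoding (hedged sketch with placeholder APIs)."""
    caption = []
    context = prompt
    for _ in range(max_sentences):
        # Sample several candidate next sentences from the VLM.
        candidates = vlm.generate_sentence_candidates(
            image, context, n=num_candidates, temperature=1.0
        )
        if not candidates:
            break
        # VisVM scores each candidate by its long-term value, i.e. how well the
        # caption is expected to stay grounded if we continue from that sentence.
        scores = [visvm.score(image, context, c) for c in candidates]
        best = candidates[scores.index(max(scores))]
        caption.append(best)
        context = context + " " + best
        # Assumed stopping markers; the real decoder's EOS handling may differ.
        if best.strip().endswith(("</s>", "<eos>")):
            break
    return " ".join(caption)
```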

Experimental Results

The empirical results demonstrated a substantial reduction in hallucinations and an increase in the richness of visual details when VisVM was applied. The drop in CHAIRs (a metric of object hallucination in captions) from 32.4 to 26.2 illustrates the effectiveness of VisVM in producing more accurate image descriptions. Moreover, as inference compute was scaled up, VisVM-guided search showed better performance and computational efficiency than CLIP-PRM-guided search.
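For readers unfamiliar with the metric, the following is a simplified sketch of how a CHAIRs-style score can be computed: the percentage of captions that mention at least one object not present in the image. The naive whole-word matching over a flat object vocabulary is an assumption made here for brevity; the standard CHAIR implementation maps caption words to MSCOCO object categories via synonym lists.

```python
def chair_s(captions, gt_objects_per_image, object_vocab):
    """Percentage of captions mentioning at least one object absent from the image.

    captions: list[str]
    gt_objects_per_image: list[set[str]]  # ground-truth objects per image
    object_vocab: set[str]                # object names to scan for
    """
    hallucinated = 0
    for caption, gt_objects in zip(captions, gt_objects_per_image):
        words = set(caption.lower().replace(".", " ").replace(",", " ").split())
        mentioned = {obj for obj in object_vocab if obj in words}
        if mentioned - gt_objects:  # any mentioned object not actually in the image
            hallucinated += 1
    return 100.0 * hallucinated / max(len(captions), 1)
```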

Implications and Future Directions

The implications of this research are twofold. Practically, VisVM shows the potential to improve real-world applications of VLMs by enhancing the descriptiveness and accuracy of image captions and thus improving user interaction experiences with AI systems. Theoretically, it opens avenues for future research in self-improving models where VisVM-guided outputs can be progressively integrated into model training to further augment the model's capabilities without additional external data. This approach contributes to reducing the cost and time associated with extensive data labeling efforts.

The paper foresees future work in expanding the VisVM framework to other types of models and tasks, potentially paving the way for a robust self-training pipeline for VLMs, enabling ongoing performance improvements through inference-time search. Such developments would significantly advance the fields of AI and multimodal learning by offering scalable solutions to common challenges associated with large-scale vision-LLM training and deployment.

In conclusion, this paper presents a compelling methodology and substantiates the viability of inference-time interventions such as VisVM in scaling VLM performance, promising impactful enhancements in the generation of visually and contextually aware AI outputs.
