Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning (2506.15649v1)

Published 18 Jun 2025 in cs.CV and cs.LG

Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequent reward evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4× speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.

Summary

  • The paper introduces ViMaR, a novel dual-stage inference framework that improves VLM captioning speed and fidelity using value guidance and margin-based rewards.
  • ViMaR significantly cuts inference time over fourfold, improves caption quality, reduces hallucinations, and generalizes effectively across VLM architectures.
  • ViMaR's outputs can effectively self-train VLMs, providing a scalable, transferable method for model enhancement without costly additional data.

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

The paper "Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning" introduces ViMaR, a novel inference framework designed to enhance the efficiency and fidelity of vision-LLMs (VLMs) in generating descriptive captions. The research aims to tackle the prominent challenges of hallucinations and computational inefficiencies that are characteristic of current VLMs, particularly at inference time.

The primary innovation of this work is ViMaR's two-stage approach, which integrates a temporal-difference value model with a margin-based reward mechanism. In the first stage, a single pass identifies the highest-value caption among diverse candidates, bypassing the need to exhaustively score each possibility. The second stage selectively refines only those segments that were overlooked or lack sufficient visual grounding, the segments most prone to hallucination. The margin-based reward adjustment penalizes low-confidence continuations, favoring sentences that are both reliable and descriptively rich.
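To make the two-stage procedure concrete, below is a minimal Python sketch of how value-guided candidate selection with a margin-based penalty and selective refinement could be wired together. It is an illustrative approximation rather than the authors' implementation: generate_candidates, value_model, confidence_model, and refine_segment are placeholder callables, and the linear margin penalty and refinement threshold are assumed forms.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredCaption:
    text: str
    value: float


def margin_adjusted_value(value: float, confidence: float,
                          threshold: float = 0.5, penalty: float = 1.0) -> float:
    # Subtract a penalty proportional to how far confidence falls below the threshold.
    margin = threshold - confidence
    return value - penalty * max(0.0, margin)


def vimar_style_decode(image,
                       generate_candidates: Callable[[object], List[str]],
                       value_model: Callable[[object, str], float],
                       confidence_model: Callable[[object, str], float],
                       refine_segment: Callable[[object, str], str],
                       refine_threshold: float = 0.0) -> str:
    # Stage 1: a single scoring pass over diverse candidates; keep the
    # highest-value caption instead of exhaustively re-scoring every option.
    candidates = generate_candidates(image)
    scored = [ScoredCaption(c, value_model(image, c)) for c in candidates]
    best = max(scored, key=lambda s: s.value).text

    # Stage 2: selectively refine only weakly grounded sentences, so well-scored
    # segments incur no further reward-model calls.
    refined: List[str] = []
    for sentence in best.split(". "):
        base = margin_adjusted_value(value_model(image, sentence),
                                     confidence_model(image, sentence))
        if base >= refine_threshold:
            refined.append(sentence)  # already well grounded: keep as-is
            continue
        rewrite = refine_segment(image, sentence)
        alt = margin_adjusted_value(value_model(image, rewrite),
                                    confidence_model(image, rewrite))
        refined.append(rewrite if alt > base else sentence)
    return ". ".join(refined)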

Key findings demonstrate that ViMaR delivers significant improvements in caption quality while drastically cutting inference time by over fourfold compared to contemporary value-guided methods. The framework proves effective across multiple VLM architectures, showing strong cross-model generalization—a salient feature underscored by its capability to guide a stronger unseen model despite being trained only on LLaVA Mistral-7B.

Empirical validation, combining human judgments with automated GPT-4o comparisons, shows a 64% preference for ViMaR-generated captions over those of the existing VisVM method. The two-stage mechanism also substantially mitigates hallucination, as evidenced by lower CHAIR and MMHal scores.
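For context on those hallucination metrics, the snippet below illustrates how CHAIR-style rates are commonly computed from object mentions: a per-mention rate (CHAIRi) and a per-caption rate (CHAIRs). It is a simplified sketch; the official CHAIR evaluation additionally relies on MSCOCO synonym lists and tokenized matching, which are omitted here.

from typing import Dict, List, Set, Tuple


def chair_rates(mentioned_objects: Dict[str, List[str]],
                ground_truth_objects: Dict[str, Set[str]]) -> Tuple[float, float]:
    """mentioned_objects: image_id -> objects named in the generated caption.
    ground_truth_objects: image_id -> objects actually present in the image."""
    hallucinated_mentions = 0
    total_mentions = 0
    captions_with_hallucination = 0
    for image_id, mentioned in mentioned_objects.items():
        truth = ground_truth_objects.get(image_id, set())
        bad = [obj for obj in mentioned if obj not in truth]
        hallucinated_mentions += len(bad)
        total_mentions += len(mentioned)
        captions_with_hallucination += int(bool(bad))
    chair_i = hallucinated_mentions / max(total_mentions, 1)                # per-object-mention rate
    chair_s = captions_with_hallucination / max(len(mentioned_objects), 1)  # per-caption rate
    return chair_i, chair_s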

Moreover, the authors leveraged ViMaR-generated outputs for self-training, which yielded substantial performance improvements across an extensive suite of visual comprehension benchmarks. These findings not only affirm ViMaR's ability to produce high-quality captions but also make a compelling case for using it as a tool for ongoing model enhancement without costly additional data annotation.
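Procedurally, that self-training loop reduces to decoding captions with the value-guided procedure and fine-tuning on them. The sketch below is a hedged outline under that assumption; decode_fn stands in for the dual-stage decoder sketched earlier and fine_tune_fn for any standard supervised fine-tuning routine, neither of which is specified by this summary.

def self_train(vlm, images, decode_fn, fine_tune_fn, rounds: int = 1):
    """Iteratively fine-tune a VLM on its own value-guided captions."""
    for _ in range(rounds):
        # 1. Decode high-value captions for the training images.
        synthetic_data = [(img, decode_fn(vlm, img)) for img in images]
        # 2. Fine-tune the VLM on its own value-guided outputs.
        vlm = fine_tune_fn(vlm, synthetic_data)
    return vlm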

Looking ahead, ViMaR offers a scalable and transferable inference-time method that can integrate into diverse VLM systems, pointing toward more computationally sustainable and factually accurate image captioning. Its flexibility and ability to generalize across architectures signal promising trajectories for deployment in broader multimodal applications.