The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering (2502.03628v2)

Published 5 Feb 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining the token logit rankings throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss - visually grounded tokens gradually become less favored throughout generation; (2) early excitation - semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information - visually grounded tokens, though not ultimately decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free, inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early-layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA reduces hallucination by about 40% on average on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at https://github.com/LzVv123456/VISTA.

Summary

  • The paper analyzes internal token dynamics in LVLMs, identifying key factors contributing to hallucination, and introduces VISTA, a training-free framework to mitigate it.
  • VISTA operates at inference time via a Visual Steering Vector (VSV) that retains visual information and Self-Logits Augmentation (SLA) that promotes semantically grounded tokens.
  • VISTA reduces hallucination by roughly 40% on average on benchmarks such as CHAIR and POPE, and it remains effective across LVLM architectures and decoding strategies.

The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering

This paper presents an approach to mitigating hallucination in Large Vision-Language Models (LVLMs), addressing the tendency of these models to generate syntactically coherent yet visually ungrounded outputs. The researchers examine the internal token dynamics of LVLMs to understand how hallucination emerges and propagates during text generation. Their analysis of token-logit rankings yields three key observations: 1) genuine visual information is gradually lost during text generation, 2) semantically meaningful tokens reach peak activation in layers preceding the final layer (early excitation), and 3) LVLMs retain hidden genuine information, i.e., visually grounded tokens that keep relatively high rankings even when they are never decoded.
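
To make the analysis concrete, here is a minimal sketch of the kind of logit-lens probe such observations rest on: project each decoder layer's hidden state through the final norm and unembedding head, then record where a visually grounded token ranks at every layer. This is not the authors' code; the helper name is mine, and it assumes a HuggingFace-style decoder exposing `model.model.layers`, `model.model.norm`, and `model.lm_head`.

```python
import torch

@torch.no_grad()
def token_rank_per_layer(model, inputs, target_token_id):
    """Rank (0 = most favored) of `target_token_id` at the last sequence
    position, measured after each decoder layer via the logit lens."""
    per_layer_hidden = []

    def hook(_module, _inputs, output):
        # HF decoder layers return a tuple whose first element is the
        # hidden state, shape (batch, seq, dim); keep the last position.
        per_layer_hidden.append(output[0][:, -1, :])

    handles = [layer.register_forward_hook(hook) for layer in model.model.layers]
    try:
        model(**inputs)  # `inputs` holds input_ids (and pixel_values, if any)
    finally:
        for h in handles:
            h.remove()

    ranks = []
    for hidden in per_layer_hidden:
        logits = model.lm_head(model.model.norm(hidden))    # (1, vocab_size)
        order = logits.argsort(dim=-1, descending=True)[0]  # best-to-worst ids
        ranks.append((order == target_token_id).nonzero().item())
    return ranks
```

Tracking these per-layer ranks across generation steps is enough to surface all three patterns: grounded tokens drift down the ranking as generation proceeds, peak before the final layer, yet rarely fall out of the top ranks entirely.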

Based on these insights, the paper introduces VISTA (Visual Information Steering with Token-logit Augmentation), a framework that operates at inference time without requiring additional training. VISTA is composed of two key modules: the Visual Steering Vector (VSV) and Self-Logits Augmentation (SLA). The VSV counteracts the gradual loss of visual information by adjusting activations in the residual stream with a vector derived from contrasting visual and non-visual contexts. SLA, in turn, exploits the early-excitation pattern to raise the prominence of semantically meaningful tokens during decoding, specifically by integrating early-layer activations into the token-selection process.
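
Both modules can be illustrated with short sketches under the same HuggingFace-style layout assumed above. These are not the released VISTA implementation; the layer index, scaling factor `alpha`, blending weight `gamma`, and the use of mean last-position hidden states are illustrative assumptions.

```python
import torch

@torch.no_grad()
def compute_vsv(model, inputs_with_image, inputs_text_only, layer_idx):
    """VSV sketch: difference of last-position hidden states at one decoder
    layer, with vs. without the visual context. `inputs_*` are kwargs
    dicts for the LVLM (input_ids, plus pixel_values when present)."""
    def last_hidden(inputs):
        out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer_idx][:, -1, :].mean(dim=0)
    return last_hidden(inputs_with_image) - last_hidden(inputs_text_only)

def install_vsv_hook(model, layer_idx, vsv, alpha=0.1):
    """Steer generation back toward the visual context by adding the
    vector to the residual stream on every forward pass."""
    def hook(_module, _inputs, output):
        hidden = output[0]
        return (hidden + alpha * vsv.to(hidden.dtype),) + tuple(output[1:])
    return model.model.layers[layer_idx].register_forward_hook(hook)

@torch.no_grad()
def sla_next_token_logits(model, inputs, early_layer=-8, gamma=0.3):
    """SLA sketch: blend final-layer logits with logits read off an
    earlier layer, exploiting the early-excitation pattern."""
    out = model(**inputs, output_hidden_states=True)
    final_logits = out.logits[:, -1, :]
    early_hidden = out.hidden_states[early_layer][:, -1, :]
    early_logits = model.lm_head(model.model.norm(early_hidden))
    return (1 - gamma) * final_logits + gamma * early_logits
```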

VISTA's training-free nature and its applicability across architectures and decoding strategies make it an attractive way to improve LVLM reliability without retraining or architectural modifications. Empirical results show a substantial reduction in hallucination, about 40% on average, across multiple benchmarks including CHAIR and POPE, settings where hallucination typically undermines model utility in open-ended generation. VISTA also consistently outperforms existing baselines on hallucination metrics, which the authors attribute to keeping genuine information foregrounded throughout the generation process.

Theoretically, this research improves understanding of the dynamics within LVLMs, particularly the interaction between visual and textual information across layers. Practically, VISTA can be integrated into existing LVLM inference pipelines, offering a way to improve output quality and fidelity in real-world applications ranging from interactive assistance to autonomous systems.
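
As a rough indication of what such integration could look like, the hypothetical helpers sketched above can be dropped into an ordinary decoding loop. The greedy variant below recomputes the full forward pass each step for simplicity; a real pipeline would use the KV cache.

```python
import torch

def generate_with_vista(model, inputs, inputs_text_only,
                        max_new_tokens=64, steer_layer=20, alpha=0.1):
    """Greedy decoding with both sketched interventions installed.
    `inputs` / `inputs_text_only` are kwargs dicts; helper names are mine."""
    vsv = compute_vsv(model, inputs, inputs_text_only, layer_idx=steer_layer)
    handle = install_vsv_hook(model, steer_layer, vsv, alpha=alpha)
    ids = inputs["input_ids"]
    try:
        for _ in range(max_new_tokens):
            step_inputs = {**inputs, "input_ids": ids}
            logits = sla_next_token_logits(model, step_inputs)
            next_id = logits.argmax(dim=-1, keepdim=True)  # greedy pick
            ids = torch.cat([ids, next_id], dim=-1)
    finally:
        handle.remove()  # always restore the unmodified model
    return ids
```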

Future research could build on these token-dynamics insights to explore more sophisticated mechanisms for visual-textual alignment, possibly incorporating individualized visual prompts or feedback loops that further integrate visual cues into token selection. The applicability of VISTA's modules across diverse architectures also suggests potential for adapting them to other multimodal models or to domains where hallucination undermines operational reliability.
