Emergent Mind


By combining the natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented reasoning capabilities in the real world. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucinating nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes and relationships between objects. To address these issues, we introduce ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that uses fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently using much cheaper human evaluations instead of full supervision, as well as automated methods. We show the effectiveness of our approach through numerous metrics on several benchmarks. Additionally, we construct a comprehensive and challenging dataset specifically designed to validate the visual grounding capabilities of LVLMs. Finally, we plan to release our human annotations, comprising approximately 16,000 image and generated-text pairs with fine-grained evaluations, to contribute to related research in the community.


  • ViGoR introduces a fine-grained reward modeling framework to improve the visual grounding of Large Vision Language Models (LVLMs) by reducing inaccuracies.

  • It uses detailed human evaluations and automated methods to refine LVLM outputs, enhancing accuracy and contextual relevance.

  • The framework trains a reward model based on human annotator feedback and automated object detection, guiding the fine-tuning of LVLMs.

  • ViGoR not only presents a cost-effective approach to improve LVLMs but also opens avenues for future research in integrating visual perception with language comprehension.

Introduction to Visual Grounding Challenges

Recent breakthroughs in Large Vision Language Models (LVLMs) have intertwined language understanding with image perception, allowing models to engage in real-world reasoning tasks. Despite these advancements, LVLMs often struggle with accurately grounding their generated texts in visual inputs. This misalignment can lead to inaccuracies such as hallucinating nonexistent elements, overlooking significant parts of the scene, or misinterpreting object attributes and relations. Addressing these grounding issues is crucial for improving the reliability and effectiveness of LVLMs in practical applications.

ViGoR: A Solution to Visual Grounding

Enter ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a framework designed to enhance the visual grounding capabilities of LVLMs by employing fine-grained reward modeling. This approach leverages detailed human evaluations and automated methods to refine LVLM outputs, making them more accurate and contextually relevant. Notably, ViGoR demonstrates its effectiveness by presenting significant improvements over existing models on various benchmarks, with substantial advancements in generating accurate and detailed descriptions while preserving the logical reasoning and creativity inherent to LLMs.

The Mechanics of ViGoR

The ViGoR framework operates by first sampling text outputs from a pre-trained LVLM given an input image. These outputs are then assessed by human annotators at the sentence level for accuracy and creativity. Using this feedback as ground truth, a reward model is trained to capture the fine-grained human evaluations. This reward model subsequently guides the fine-tuning of the LVLM, considerably improving its visual grounding with a limited dataset.
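The loop above (sample outputs, collect sentence-level labels, fit a reward model) can be illustrated with a deliberately tiny stand-in. The paper trains a learned reward model on LVLM outputs; the minimal sketch below instead fits a bag-of-words logistic regression on hypothetical sentence-level accuracy labels, purely to show the supervised shape of the step. All data, vocabulary, and function names here are illustrative assumptions, not the paper's implementation.

```python
import math

def featurize(sentence, vocab):
    # Bag-of-words counts over a small fixed vocabulary.
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, vocab, epochs=200, lr=0.5):
    """Fit a logistic-regression reward model on (sentence, label) pairs,
    where label 1 means the annotator judged the sentence accurately
    grounded and 0 means it contained a hallucination."""
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for sent, y in pairs:
            x = featurize(sent, vocab)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

    def reward(sentence):
        # Higher scores mean the model predicts the sentence is grounded.
        x = featurize(sentence, vocab)
        return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

    return reward

# Toy sentence-level annotations (hypothetical data).
vocab = ["dog", "unicorn", "purple", "park", "tree"]
annotations = [
    ("a dog runs in the park", 1),
    ("the dog sits by a tree", 1),
    ("a purple unicorn flies overhead", 0),
    ("a unicorn grazes in the park", 0),
]
reward = train_reward_model(annotations, vocab)
```

Once trained, the reward function scores any new sentence; in ViGoR this score is what guides fine-tuning of the LVLM itself.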

Additionally, ViGoR incorporates an automated method to construct the reward model without further human intervention. This approach employs state-of-the-art object detection models to verify the presence of described entities in the images, contributing to the model's grounding capabilities.
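The automated signal described above comes from running an object detector and checking whether the entities a caption describes actually appear in the image. A minimal sketch of that verification step is below; the detector output is stubbed as a set of labels and the entity vocabulary is hypothetical, since the paper uses a full state-of-the-art detection model rather than this simplification.

```python
def automated_grounding_reward(caption, detected_labels, entity_vocab):
    """Score a caption by the fraction of mentioned entities that an
    object detector actually found in the image.

    detected_labels: labels a detector returned for this image (a real
        pipeline would run a detection model; here a set of strings
        stands in for that output).
    entity_vocab: entity names the system knows how to verify.
    """
    words = {w.strip(".,;:!?") for w in caption.lower().split()}
    mentioned = words & set(entity_vocab)
    if not mentioned:
        return 0.0  # nothing verifiable was mentioned
    verified = mentioned & set(detected_labels)
    return len(verified) / len(mentioned)

# Hypothetical example: the caption hallucinates a frisbee.
score = automated_grounding_reward(
    "A dog and a frisbee on the beach.",
    detected_labels={"dog", "beach"},
    entity_vocab={"dog", "frisbee", "beach", "cat"},
)
```

Here two of the three verifiable entities ("dog", "beach") are confirmed by the detector, so the caption receives a partial score; this kind of per-entity check is what lets the reward signal be fine-grained without extra human labels.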

Practical and Theoretical Implications

Practically, ViGoR represents a cost-effective solution to enhancing the visual grounding of LVLMs without the need for extensive annotated datasets. It significantly reduces the incidence of hallucinations and misconceptions in model outputs, leading to more accurate and meaningful machine-generated interpretations of visual data. Theoretically, ViGoR's approach provides insights into the efficiency of fine-grained feedback and the integration of complementary reward signals from human evaluations and automated methods. This blend of feedback sources showcases how different types of information can cohesively refine model performance.

Looking Ahead

This paper's findings underscore the importance of addressing the visual grounding issues in LVLMs and propose a promising direction for future research. The development of ViGoR paves the way for more sophisticated models capable of even closer integration of visual perception and language comprehension. Future explorations may include extending the ViGoR framework with reinforcement learning from human feedback (RLHF) and integrating explicit visual predictions for improved alignment and grounding accuracy.


The ViGoR framework marks an important step forward in the quest to improve the visual grounding of LVLMs. Through its novel use of fine-grained reward modeling, combined with human evaluations and automated methods, ViGoR significantly enhances the accuracy and contextual relevance of LVLM-generated texts. By addressing the challenges of hallucination and inaccurate grounding, ViGoR contributes valuable insights and tools to the field of AI, with broad implications for the future development of more perceptively aligned machine learning models.
