VisionReasoner: Unified Framework for Visual Perception and Reasoning
The paper presents VisionReasoner, a framework that unifies visual perception tasks and reasoning through reinforcement learning. By handling multiple perception tasks within a single model, VisionReasoner moves beyond the task-specific pipelines that dominate current methodologies in the area.
Methodology
The authors build VisionReasoner around novel multi-object cognitive learning strategies and a systematic reformulation of diverse perception tasks into a shared format. Given a query, the framework produces a structured reasoning trace before emitting its final results. This design diverges from traditional task-specific modules and is evaluated across diverse visual perception tasks spanning detection, segmentation, and counting.
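The paper's exact output schema is not reproduced here, but a reasoning-then-answer interface of this kind can be sketched as follows. The `<think>`/`<answer>` tag names and the JSON fields are illustrative assumptions, not the paper's specification:

```python
import json
import re

def parse_structured_output(text):
    """Split a model response into its reasoning trace and final answer.

    Assumes a hypothetical <think>...</think><answer>...</answer> layout,
    where the answer is a JSON list of detected objects.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if think is None or answer is None:
        return None, None  # malformed output: no parseable answer
    return think.group(1).strip(), json.loads(answer.group(1))

response = (
    "<think>The query asks for all dogs; two regions match.</think>"
    '<answer>[{"bbox": [10, 20, 110, 220]}, {"bbox": [300, 40, 420, 260]}]</answer>'
)
reasoning, objects = parse_structured_output(response)
print(len(objects))  # 2
```

Because every task (detection, segmentation, counting) is reformulated to emit the same kind of structured answer, one parser and one reward pipeline can serve all of them.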
A pivotal aspect of VisionReasoner is its use of reinforcement learning (RL). Whereas preceding models applied RL in a task-specific manner with distinct reward functions, VisionReasoner adopts a unified reward mechanism shared across tasks: format rewards that encourage structured reasoning, and accuracy rewards, such as multi-object Intersection-over-Union (IoU) and L1 rewards, for localization.
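The paper's precise reward definitions are not given here; as a minimal sketch, a multi-object IoU reward can be computed by matching predicted boxes to ground-truth boxes and averaging the matched overlaps. The greedy matching below is a simplifying assumption (the actual method may use a different matching scheme):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def multi_object_iou_reward(preds, gts):
    """Greedily match each ground-truth box to its best unmatched
    prediction and average the matched IoUs over all ground truths.
    Unmatched ground truths contribute zero, penalizing missed objects."""
    if not gts:
        return 1.0 if not preds else 0.0
    used, total = set(), 0.0
    for gt in gts:
        best, best_i = 0.0, None
        for i, p in enumerate(preds):
            if i in used:
                continue
            v = iou(p, gt)
            if v > best:
                best, best_i = v, i
        if best_i is not None:
            used.add(best_i)
            total += best
    return total / len(gts)

reward = multi_object_iou_reward(
    [[0, 0, 10, 10], [20, 20, 30, 30]],
    [[0, 0, 10, 10], [20, 20, 30, 30]],
)
print(reward)  # 1.0
```

In a full RL setup, a scalar reward of this form would be combined with a format reward (e.g., a check that the response parses into reasoning plus answer) and an L1 term on box coordinates, so that one signal trains reasoning and localization jointly.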
Experimental Results
VisionReasoner was evaluated on ten tasks spanning three domains. It outperformed established models such as Qwen2.5VL by substantial relative margins: 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting). These results substantiate the efficacy of a unified framework for diverse visual perception challenges.
Furthermore, VisionReasoner was tested on visual question answering, where it performs comparably to state-of-the-art models, underscoring its versatility beyond pure perception tasks.
Implications and Future Directions
VisionReasoner opens pathways for integrating visual perception tasks into a single, scalable, and more generalized model architecture. The implication for AI research is significant, suggesting that unified models can efficiently streamline various tasks traditionally managed by distinct frameworks.
Looking ahead, there is potential for extending VisionReasoner's capabilities with larger datasets and more complex visual perception challenges. Refining its multi-object cognition and reasoning strategies will also be crucial to advancing its application across emerging tasks in vision-language models. This paper lays the groundwork for future work on automated reasoning and perception in multimodal AI systems, promising advances in AI's ability to understand and interact with the world as humans do.