
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning (2505.12081v3)

Published 17 May 2025 in cs.CV

Abstract: Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs in response to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).

Summary

VisionReasoner: Unified Framework for Visual Perception and Reasoning

The paper presents VisionReasoner, a framework that unifies visual perception tasks and reasoning through reinforcement learning. VisionReasoner handles multiple visual perception tasks within a single shared model, advancing beyond the task-specific methodologies common in the area.

Methodology

The authors build VisionReasoner on novel multi-object cognitive learning strategies combined with systematic task reformulation, which together enhance the model's reasoning over visual inputs. The framework emits a structured reasoning process before producing the desired results. This approach diverges from traditional task-specific modules and is evaluated on diverse visual perception tasks spanning detection, segmentation, and counting.
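The task-reformulation idea can be illustrated with a minimal sketch: heterogeneous queries are cast as a single "locate the queried objects" problem, and the task type only changes how the localized objects are post-processed (e.g., counting becomes the number of located boxes). The prompt wording and field names below are illustrative assumptions, not the paper's exact schema.

```python
def reformulate(task: str, query: str) -> dict:
    # Cast detection, segmentation, and counting queries into one
    # localization-style prompt with a shared structured-output schema.
    # (Prompt templates and schema field names are assumptions.)
    prompt = {
        "detection":    f"Locate every instance of: {query}",
        "segmentation": f"Locate the region described by: {query}",
        "counting":     f"Locate every instance of: {query}",
    }[task]
    return {
        "prompt": prompt,
        "output_schema": {"think": "str", "boxes": "list[[x1,y1,x2,y2]]"},
    }

def postprocess(task: str, boxes: list):
    # Counting reuses the localization output: the answer is len(boxes);
    # detection and segmentation consume the boxes/regions directly.
    return len(boxes) if task == "counting" else boxes
```

Under this framing, one model and one output format serve all three domains, which is what lets a single reward mechanism supervise them jointly.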

A pivotal aspect of VisionReasoner is its use of reinforcement learning (RL). Unlike preceding models that applied RL in a task-specific manner with distinct reward functions, VisionReasoner adopts a unified reward mechanism to enhance reasoning across tasks. The framework combines format rewards for structured reasoning with accuracy rewards such as multi-object Intersection-over-Union (IoU) and L1 rewards for localization.
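A hedged sketch of how such rewards might be computed: a format reward checks that the response follows the structured reasoning-then-answer template, and a multi-object IoU reward matches predicted boxes to ground truth so that spurious or missing detections lower the score. The tag names, matching strategy, and normalization below are assumptions, not the paper's exact formulation.

```python
import re

def format_reward(response: str) -> float:
    # 1.0 when the reply contains reasoning in <think> tags followed by
    # an <answer> block; the tag names are illustrative assumptions.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def iou(a, b):
    # Intersection-over-Union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def multi_object_iou_reward(preds, gts):
    # Greedily match each ground-truth box to its best unmatched
    # prediction, then average matched IoUs over max(|preds|, |gts|)
    # so extra or missing boxes are penalized.
    if not preds and not gts:
        return 1.0
    unmatched = list(preds)
    total = 0.0
    for gt in gts:
        if not unmatched:
            break
        best = max(unmatched, key=lambda p: iou(p, gt))
        total += iou(best, gt)
        unmatched.remove(best)
    return total / max(len(preds), len(gts))
```

In an RL loop these terms would be summed (possibly weighted, with an L1 term on box coordinates) into a scalar reward per sampled response.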

Experimental Results

VisionReasoner was evaluated on ten tasks across three domains. It notably outperformed established models such as Qwen2.5VL by relative margins of 29.1% in detection on COCO, 22.1% in segmentation on ReasonSeg, and 15.3% in counting on CountBench. These results substantiate the efficacy of the unified framework in addressing diverse visual perception challenges.

Furthermore, VisionReasoner was tested on visual question answering, where it performs comparably to state-of-the-art models, demonstrating versatility beyond the three core perception domains.

Implications and Future Directions

VisionReasoner opens pathways for integrating visual perception tasks into a single, scalable, and more generalized model architecture. The implication for AI research is significant, suggesting that unified models can efficiently streamline various tasks traditionally managed by distinct frameworks.

Looking ahead, there is potential for extending VisionReasoner's capabilities with larger datasets and more complex visual perception challenges. Additionally, refining the multi-object cognition and reasoning strategies will be crucial to advancing its application across emerging tasks in vision-language models. This paper lays the groundwork for future exploration of automated reasoning and perception within multimodal AI systems, promising further advances in how such systems perceive and reason about the visual world.
