One RL to See Them All: Visual Triple Unified Reinforcement Learning (2505.18129v2)

Published 23 May 2025 in cs.CV and cs.CL

Abstract: Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.

Summary

V-Triune: A Unified Reinforcement Learning Framework for Vision-Language Models

The paper introduces V-Triune, a novel reinforcement learning (RL) system designed to integrate visual reasoning and perception tasks into a cohesive training paradigm for vision-language models (VLMs). Unlike previous approaches, which typically focus on either reasoning or perception, V-Triune encompasses both domains within a single pipeline, facilitating joint optimization across diverse multimodal tasks.

Overview of V-Triune

V-Triune is a Visual Triple Unified Reinforcement Learning system, characterized by three complementary components: Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring. This structured framework supports a unified RL approach that is applicable to a wide array of visual tasks, from object detection and counting to more conceptually driven challenges like solving math problems and engaging in scientific reasoning.

Key Components and Innovations

  1. Sample-Level Data Formatting: This component enables the integration of diverse task inputs by specifying task-specific rewards and their relative weights directly at the sample level. This approach provides the flexibility to dynamically adjust reward mechanisms and data processing strategies according to the unique requirements of each task sample.
  2. Verifier-Level Reward Computation: By employing a dedicated asynchronous reward server, this component separates reward calculation from the main training loop. It delegates the computation of task-specific rewards to specialized verifiers, thereby enhancing modularity and ensuring scalability across heterogeneous vision-language tasks.
  3. Source-Level Metric Monitoring: This feature offers a granular view of the training process by logging metrics at the data-source level. Such detailed tracking facilitates the identification of data-specific issues and enables targeted improvements in training stability and performance. (A minimal sketch of how these three components fit together appears after this list.)
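
To make the three components concrete, here is a minimal Python sketch of how a sample-level reward specification, a verifier registry, and per-source metric logging could fit together. All names, fields, and values here are illustrative assumptions, not V-Triune's actual API; the real implementation, including its asynchronous reward server, is in the MiniMax-AI repository linked above.

```python
from collections import defaultdict
from typing import Callable, Dict

# Verifier-level reward computation: each task name maps to a reward
# function. In V-Triune these run on a dedicated asynchronous reward
# server; here they are plain callables for illustration.
VERIFIERS: Dict[str, Callable[[str, dict], float]] = {
    "math": lambda pred, sample: float(pred.strip() == sample["answer"]),
    # A detection verifier would score predicted boxes with IoU;
    # see the Dynamic IoU sketch below.
}

# Sample-level data formatting: the sample itself names its verifier and
# carries its own reward weights, so heterogeneous tasks can share one
# training batch. All fields are hypothetical.
sample = {
    "source": "open_math",          # data-source tag, used for monitoring
    "task": "math",                 # selects the verifier
    "prompt": "What is 7 * 6?",
    "answer": "42",
    "reward_weights": {"accuracy": 0.9, "format": 0.1},
}

# Source-level metric monitoring: rewards accumulate per data source, so
# dataset-specific problems remain visible throughout training.
source_metrics = defaultdict(list)

def compute_reward(prediction: str, sample: dict) -> float:
    reward = VERIFIERS[sample["task"]](prediction, sample)
    source_metrics[sample["source"]].append(reward)
    return reward

print(compute_reward("42", sample))  # -> 1.0
```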

A noteworthy innovation within V-Triune is the Dynamic IoU reward, which addresses the limitations of static IoU thresholds in perception tasks. By progressively tightening the threshold over the course of training, so that early feedback is attainable while later feedback demands precise localization, this mechanism provides adaptive, progressive feedback and significantly improves training stability and model accuracy.
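
The sketch below illustrates the idea of a step-gated IoU reward. The step boundaries and threshold values are assumptions chosen for demonstration, not the paper's exact schedule.

```python
def iou(box_a, box_b):
    """Intersection-over-union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Illustrative schedule: (training step at which a threshold activates,
# IoU threshold). These numbers are assumptions, not the paper's values.
SCHEDULE = ((0, 0.50), (1_000, 0.75), (2_000, 0.95))

def dynamic_iou_reward(pred_box, gt_box, step):
    """Binary reward gated by a progressively stricter IoU threshold."""
    threshold = max(t for s, t in SCHEDULE if step >= s)
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

# Early in training a rough box earns reward; later it does not.
box_gt, box_pred = (0, 0, 10, 10), (1, 1, 10, 10)
print(dynamic_iou_reward(box_pred, box_gt, step=100))    # 1.0 (IoU 0.81 >= 0.50)
print(dynamic_iou_reward(box_pred, box_gt, step=2_500))  # 0.0 (IoU 0.81 < 0.95)
```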

Experimental Results

The implementation of V-Triune has led to the development of the Orsta model series, featuring 7B and 32B backbone variants. Orsta demonstrates considerable performance improvements on MEGA-Bench Core, gaining between +2.1 and +14.1 points depending on the variant. These improvements extend to multiple downstream benchmarks, including COCO and CountBench. Notably, the results indicate that reinforcement learning enhances model capabilities in both reasoning and perception tasks, though predominantly by refining alignment and decision-making rather than by instilling new skills.

Implications and Future Directions

The V-Triune framework represents a significant step toward the integration of reasoning and perception in VLMs, aligning model training with complex real-world tasks. This unified approach offers a scalable solution for enhancing VLM performance across a broad spectrum of applications. Moreover, the insights gained from implementing Dynamic IoU and sample-level reward flexibility could be pivotal in extending the applicability of VLMs to new and varied contexts.

Future research could explore scaling V-Triune to larger models and more extensive datasets, possibly including RL-zero paradigms in which RL is applied without any prior supervised fine-tuning. Additionally, the distinct reflection and response-length trends observed in the training metrics point to an open question: how RL techniques that have succeeded in LLMs, such as chain-of-thought (CoT) scaling, might be adapted or extended to perception tasks.

In summary, V-Triune provides a versatile and effective framework for reinforcing multimodal capabilities in vision-language models, offering a harmonious integration of reasoning and perception tasks under a unified RL methodology.
