- The paper introduces Vision-EKIPL, a reinforcement learning method that leverages external auxiliary models to broaden the exploration space in visual reasoning.
- Its framework achieves up to a 5% performance improvement over state-of-the-art methods on the Reason-RFT-CoT Benchmark.
- The methodology sets the stage for scalable, hybrid training paradigms that can enhance visual reasoning across multimodal AI systems.
Vision-EKIPL: A Novel Framework for Enhancing Visual Reasoning in Multimodal LLMs
The paper "Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning" presents a reinforcement learning methodology for strengthening the visual reasoning capabilities of Multimodal Large Language Models (MLLMs). The authors propose Vision-EKIPL, a reinforcement learning framework that injects high-quality actions from auxiliary models during training, optimizing the policy model beyond the boundary of its own sampled outputs.
Overview and Contributions
Visual reasoning remains a complex and pivotal aspect of Artificial General Intelligence, as it requires understanding and logically interpreting multimodal data. Traditional reinforcement learning methods such as Group Relative Policy Optimization (GRPO) sample actions only from the policy model itself, which restricts exploration and caps the model's reasoning capacity. Vision-EKIPL removes this limitation by incorporating actions from external auxiliary models, broadening the exploration space and extending the model's reasoning boundary. The authors present empirical evidence for the efficacy of Vision-EKIPL, reporting up to a 5% performance improvement over state-of-the-art methods on the Reason-RFT-CoT Benchmark.
Methodological Framework
The Vision-EKIPL framework introduces external auxiliary models during RL training, allowing for a diverse set of high-quality action candidates to be sampled. This approach contrasts sharply with conventional RL paradigms where actions are restricted to those generated by the policy model. In Vision-EKIPL, both the policy model's actions and those from auxiliary models are evaluated together, leveraging a reward mechanism that prioritizes effective reasoning paths. By integrating these external knowledge sources, the model can explore novel strategies, thereby extending its reasoning efficacy and accelerating convergence.
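The core idea described above can be sketched in a few lines. The snippet below is an illustrative reading of the mechanism, not the paper's code: a GRPO-style group of sampled responses is built partly from the policy model and partly from external auxiliary models, and every response in the group receives a group-relative advantage from the shared reward. The function names, the group size, and the auxiliary fraction are assumptions for illustration.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# a GRPO-style response group that mixes policy-model rollouts with
# rollouts from external auxiliary models, as Vision-EKIPL proposes.
import statistics
from typing import Callable, List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO baseline: standardize each reward within its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all rewards equal: no learning signal in this group
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def build_mixed_group(prompt: str,
                      policy: Callable[[str], str],
                      auxiliaries: List[Callable[[str], str]],
                      group_size: int = 8,
                      aux_count: int = 2) -> List[str]:
    """Sample most responses from the policy model and the remainder
    from auxiliary models, so the policy is also optimized against
    reasoning paths it could not have generated itself."""
    group = [policy(prompt) for _ in range(group_size - aux_count)]
    group += [aux(prompt) for aux in auxiliaries[:aux_count]]
    return group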
Experimental Results
Experiments on several visual reasoning tasks, including Visual Counting, Structure Perception, and Spatial Transformation, demonstrate Vision-EKIPL's advantage. Its ability to incorporate external knowledge improves training efficiency and substantially enhances reasoning capability. The gains hold on both in-domain and out-of-domain tasks, where Vision-EKIPL surpasses traditional RL methods, open-source models, and proprietary systems such as GPT-4o and Gemini-1.5-Pro.
Implications and Future Directions
Vision-EKIPL not only provides a robust pathway for enhancing visual reasoning but also points toward a hybrid paradigm that couples supervised fine-tuning with reinforcement learning. Such a framework could extend to broader domains, both linguistic and multimodal. By effectively infusing external knowledge, Vision-EKIPL moves MLLMs toward more general intelligence. Future research may explore its adaptability to other reasoning tasks and its scalability across diverse AI applications.
In conclusion, Vision-EKIPL introduces a promising reinforcement learning framework that holds the potential to significantly elevate the capabilities of MLLMs. Its methodological innovations and experimental successes mark a step forward in the pursuit of enhancing visual reasoning, suggesting future avenues for the convergence of internal model learning and external knowledge integration.