Efficient Decoding-Time Guidance in Vision-LLMs
The paper "ProxyThinker: Test-Time Guidance through Small Visual Reasoners" presents an inference-time technique for enhancing the visual reasoning capabilities of large vision-language models (LVLMs). The authors tackle the computational cost of reinforcement fine-tuning (RFT), offering a way to improve model performance without additional training. ProxyThinker transfers the reasoning expertise of small RFT-trained models to larger LVLMs at test time, modifying their decoding dynamics so that the large models exhibit sophisticated reasoning behaviors such as self-verification and self-correction.
Methodology and Results Overview
ProxyThinker builds on the idea of contrasting the token-level logits of a small reasoning expert that has undergone RFT with those of a same-sized amateur model that has not. At each decoding step, this logit difference is added to the large base LVLM's logits, steering the base model toward the expert's reasoning behavior. The approach sidesteps the computational expense of running RFT directly on large models while still improving their reasoning capabilities.
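The logit arithmetic described above can be sketched in a few lines. The snippet below is a minimal toy illustration, not the paper's implementation: the vocabulary, logit values, and the scaling knob `alpha` are all illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def guided_logits(base, expert, amateur, alpha=1.0):
    """Shift the large base model's logits by the (expert - amateur) delta,
    in the spirit of the paper's contrastive guidance. `alpha` is a
    hypothetical scaling knob, not a parameter named in the paper."""
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, amateur)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy 5-token vocabulary; the numbers are illustrative only.
base    = [2.0, 1.0, 0.5, 0.0, -1.0]   # large base LVLM
expert  = [0.5, 2.5, 0.5, 0.0, -1.0]   # small RFT-trained expert
amateur = [0.5, 0.5, 0.5, 0.0, -1.0]   # same-sized amateur (no RFT)

probs = softmax(guided_logits(base, expert, amateur))
print(argmax(softmax(base)))  # base alone greedily picks token 0
print(argmax(probs))          # the expert's delta flips the pick to token 1
```

The point of the toy example: the expert and amateur agree everywhere except where RFT changed the expert's behavior, so their difference isolates the learned reasoning signal, which then redirects the larger model's greedy choice.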
The authors provide compelling evidence for ProxyThinker's efficacy. Quantitative results show gains on complex visual reasoning benchmarks across domains such as mathematics and science. For instance, ProxyThinker raised accuracy on the MathVision test split from 38.4% to 40.8% by injecting the reasoning behaviors of a small 7B RFT-trained expert into a larger 32B base model. Notably, these results surpassed the performance of full-scale LVLMs trained directly with RFT.
The implementation is also noteworthy for efficiently coordinating multiple LLMs with parallelism techniques, yielding up to a 38x inference speedup over previous decoding-time methods.
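One reason parallelism pays off here is that the three forward passes (base, expert, amateur) at each decoding step are independent until their logits are combined. The sketch below illustrates that idea with stand-in functions and artificial delays; it is a toy model of overlapping the calls, not the paper's serving system.

```python
import concurrent.futures
import time

def fake_forward(delay, logits):
    # Stand-in for one model's forward pass; `delay` mimics its latency.
    time.sleep(delay)
    return logits

# Hypothetical per-step logits for a 3-token vocabulary.
models = {
    "base":    (0.05, [2.0, 1.0, 0.5]),
    "expert":  (0.05, [0.5, 2.5, 0.5]),
    "amateur": (0.05, [0.5, 0.5, 0.5]),
}

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(fake_forward, d, l)
               for name, (d, l) in models.items()}
    out = {name: f.result() for name, f in futures.items()}
elapsed = time.perf_counter() - start

# Combine per the expert-minus-amateur guidance once all three arrive.
combined = [b + e - a for b, e, a in
            zip(out["base"], out["expert"], out["amateur"])]
print(combined)  # [2.0, 3.0, 0.5]
# Overlapping the three calls takes roughly one delay (~0.05 s),
# versus ~0.15 s if the passes ran one after another.
```

In a real deployment the models would run on separate workers or GPU streams rather than threads sleeping, but the structural point is the same: the combination step is the only synchronization barrier per token.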
Implications and Future Directions
The findings in this paper offer promising directions for the practical deployment of LVLMs in scenarios demanding intricate reasoning capabilities without incurring substantial computational costs. The methodology allows larger LVLMs to inherit reasoning skills from smaller models effectively and efficiently, suggesting potential applications in various domains requiring multimodal understanding.
While ProxyThinker demonstrates substantial improvements in reasoning benchmarks, challenges remain in domains requiring extensive knowledge validation, as highlighted by the limited gains observed in the MMMU validation set. Future research may explore adaptive mechanisms or hybrid training-inference paradigms to address these limitations effectively.
Additionally, while the scalability and applicability to large models are promising, further exploration into the underlying mechanisms of ProxyThinker’s logit delta approach could provide deeper insights into its impact on model reasoning trajectories and decision-making efficacy.
Overall, ProxyThinker represents a significant advance in test-time model adaptation, paving the way for enhanced reasoning capabilities in LVLMs without the need for exhaustive training procedures. As AI continues to integrate into tasks requiring sophisticated multimodal reasoning, techniques like ProxyThinker could play a crucial role in optimizing performance while maintaining computational feasibility.