Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search (2509.07969v1)

Published 9 Sep 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.

Summary

The paper presents Mini-o3, a novel VLM that uses iterative agentic image tool use to achieve deep multi-turn reasoning and state-of-the-art visual search performance.
It introduces a cold-start supervised fine-tuning phase and a reinforcement learning strategy with over-turn masking to scale reasoning depth efficiently.
Empirical results show that increasing interaction turns leads to higher accuracy, highlighting the model's practical benefits for complex visual search tasks.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Introduction and Motivation

Mini-o3 addresses a critical limitation in current open-source Vision-Language Models (VLMs): the inability to perform deep, multi-turn reasoning in visual search tasks that require trial-and-error exploration. Existing models typically exhibit shallow reasoning patterns and are constrained to a small number of interaction turns, resulting in poor performance on complex visual search benchmarks. Mini-o3 is designed to scale both the depth and diversity of reasoning, enabling agentic tool use over tens of steps and achieving state-of-the-art results on challenging datasets.

Framework Overview

Mini-o3 implements a multi-turn agentic pipeline for image tool use. At each turn, the model generates a "thought" and an "action" based on the current observation and interaction history. The action either grounds a region in the image (via a normalized bounding box) or emits a final answer. The observation is the image patch resulting from the action, which is appended to the trajectory and used for subsequent reasoning.

Figure 1: The Mini-o3 framework for multi-turn agentic image tool use, iteratively generating thoughts and actions conditioned on previous observations.

This iterative loop continues until the model produces a final answer or reaches predefined limits on context length or interaction turns. The design supports reasoning strategies such as depth-first search, hypothesis revision, and backtracking, which are essential for solving difficult visual search problems.
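The thought/action/observation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Step`, `crop_normalized`, `run_episode`, and `toy_policy` are hypothetical, the image is a toy grid of integers, every crop is taken from the original image (an assumption), and only the turn limit is modeled (the context-length limit is omitted).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1), each in [0, 1]
Image = List[List[int]]                   # toy stand-in for an image: rows of pixel values
History = List[Tuple[str, object]]        # alternating ("image", patch) and ("step", Step)


@dataclass
class Step:
    thought: str
    bbox: Optional[BBox] = None    # set when the action grounds a region
    answer: Optional[str] = None   # set when the action is the final answer


def crop_normalized(image: Image, bbox: BBox) -> Image:
    """Crop a toy image using a normalized bounding box (at least 1x1 pixels)."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = bbox
    left, top = int(x0 * w), int(y0 * h)
    right, bottom = max(left + 1, int(x1 * w)), max(top + 1, int(y1 * h))
    return [row[left:right] for row in image[top:bottom]]


def run_episode(image: Image, question: str,
                policy: Callable[[str, History], Step],
                max_turns: int) -> Tuple[Optional[str], History]:
    """Iterate until the policy answers or the turn limit is hit."""
    history: History = [("image", image)]
    for _ in range(max_turns):
        step = policy(question, history)
        history.append(("step", step))
        if step.answer is not None:
            return step.answer, history
        # Assumption: the crop is always taken from the original image.
        history.append(("image", crop_normalized(image, step.bbox)))
    return None, history  # over-turn: no final answer was produced


def toy_policy(question: str, history: History) -> Step:
    """Hypothetical policy: zoom once, then read the patch's top-left pixel."""
    n_images = sum(1 for kind, _ in history if kind == "image")
    if n_images == 1:
        return Step("zoom into the bottom-right quadrant", bbox=(0.5, 0.5, 1.0, 1.0))
    patch = history[-1][1]
    return Step("read the top-left pixel of the patch", answer=str(patch[0][0]))


image = [[r * 4 + c for c in range(4)] for r in range(4)]
answer, history = run_episode(image, "What is at the patch corner?", toy_policy, max_turns=6)
print(answer, len(history))
```

Returning `None` when the turn limit is reached makes the over-turn case explicit, which is the case the reinforcement learning stage later treats specially.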

Training Methodology

VisualProbe Dataset

Mini-o3 is trained on the VisualProbe dataset, which contains thousands of high-resolution images with small targets, numerous distractors, and questions that require iterative exploration. The dataset is explicitly constructed to elicit diverse reasoning patterns and long-horizon trajectories.

Figure 2: VisualProbe dataset features small targets, distractor objects, and high-resolution images, demanding trial-and-error exploration.

Cold-Start Data Collection

To overcome the base model's lack of exposure to multi-turn agentic trajectories, Mini-o3 employs a cold-start supervised fine-tuning (SFT) phase. Diverse multi-turn trajectories are synthesized by prompting an existing VLM with a small set of exemplars, iteratively generating thoughts and actions until a correct answer is produced. Only successful trajectories are retained, ensuring high-quality supervision.

Figure 3: Pipeline for cold-start data collection, leveraging in-context learning to synthesize diverse multi-turn trajectories.
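The "keep only successful trajectories" step amounts to a simple rejection filter. The sketch below assumes a hypothetical `generate` callable that stands in for the few-shot-prompted VLM and returns a trajectory as a dictionary with an `"answer"` field; exact string matching against the gold answer is also an assumption, since the summary does not say how correctness is checked.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Dict[str, object]


def collect_cold_start(items: List[Tuple[str, str]],
                       generate: Callable[[str], Trajectory],
                       attempts: int = 3) -> List[Trajectory]:
    """Sample several trajectories per question and keep only the correct ones."""
    kept: List[Trajectory] = []
    for question, gold in items:
        for _ in range(attempts):
            trajectory = generate(question)
            # Assumption: correctness is an exact match on the final answer.
            if trajectory["answer"] == gold:
                kept.append(trajectory)
    return kept


# Deterministic stand-in for the prompted VLM (hypothetical).
_answers = iter(["A", "X", "A", "X", "X", "B"])


def fake_generate(question: str) -> Trajectory:
    return {"question": question, "steps": [], "answer": next(_answers)}


kept = collect_cold_start([("q1", "A"), ("q2", "B")], fake_generate, attempts=3)
print(len(kept))
```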

Reinforcement Learning with Over-Turn Masking

Mini-o3 applies GRPO-based reinforcement learning with verifiable, semantics-aware rewards. A key innovation is the over-turn masking technique: responses that hit the maximum number of turns or the context-length limit are masked out during policy updates, so incomplete trajectories do not contribute negative learning signals. This enables efficient training with a modest turn budget (e.g., 6 turns) while allowing test-time trajectories to scale to tens of turns.

Figure 4: Over-turn masking technique prevents penalization of incomplete responses, supporting test-time scaling of interaction turns.
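One way to picture over-turn masking is as a per-response mask applied to group-normalized advantages. The sketch below is an illustration under stated assumptions rather than the paper's exact objective: the function names are hypothetical, group statistics are computed over all rollouts in the group (including over-turn ones), the population standard deviation is used, and the loss is averaged over completed responses only.

```python
from statistics import mean, pstdev
from typing import List


def masked_group_advantages(rewards: List[float],
                            completed: List[bool],
                            eps: float = 1e-6) -> List[float]:
    """Group-normalize rewards, then zero the advantage of over-turn responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [
        (r - mu) / (sigma + eps) if done else 0.0
        for r, done in zip(rewards, completed)
    ]


def masked_mean(values: List[float], completed: List[bool]) -> float:
    """Average per-response values over completed responses only."""
    n = sum(completed)
    if n == 0:
        return 0.0
    return sum(v for v, done in zip(values, completed) if done) / n


# Four rollouts for one question; the last two hit the turn limit without answering.
rewards = [1.0, 0.0, 0.0, 0.0]
completed = [True, True, False, False]
advantages = masked_group_advantages(rewards, completed)
print(advantages)
```

Without the mask, the two over-turn rollouts would receive the same negative advantage as the completed but wrong rollout, which pushes the policy toward answering early. With the mask they contribute nothing to the update.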

Empirical Results

Mini-o3 shows strong test-time scaling with the number of turns: accuracy continues to improve as the upper limit on turns increases from 4 to 32, even though training used a limit of only 6 turns. By contrast, baselines such as DeepEyes plateau early and fail to benefit from additional interaction depth.

Figure 5: Left: Mini-o3 accuracy increases with more allowed turns during testing. Right: Distribution of correct trajectories shows deeper thinking paths for Mini-o3.

Ablation studies confirm the necessity of each component: hard RL data, cold-start SFT, and over-turn masking all contribute significantly to performance. The choice of maximum pixel budget is also critical; too large a budget induces premature stopping, while too small a budget increases hallucinations. Optimal performance is achieved by balancing perceptual accuracy and interaction depth.

Qualitative Analysis

Mini-o3 produces complex, multi-turn reasoning trajectories in diverse real-world scenarios. Examples include progressive zoom-in and hypothesis revision in urban intersections, targeted zoom-ins and cross-checking in container yards, and coarse-to-fine zooming with verification in cluttered village scenes.

Figure 6: Multi-turn reasoning in a busy urban intersection, identifying the direction of an arrow via progressive zoom-in and backtracking.

Figure 7: Multi-turn reasoning in a container yard, locating and reading text through targeted zoom-ins and step-by-step verification.

Figure 8: Multi-turn reasoning in a lakeside village, localizing and recognizing digits on a road sign after 18 reasoning turns.

These examples illustrate Mini-o3's ability to sustain deep chains of thought, adaptively revise hypotheses, and perform robust verification across observations.

Implementation Considerations

Base Model: Qwen2.5-VL-7B-Instruct is used, but the approach is generalizable to other VLMs with sufficient context length and image tool integration.
Context Length and Pixel Budget: The context length (32K tokens) and pixel budget (2M per image) are tuned to maximize the number of feasible interaction turns without sacrificing perceptual fidelity.
Training Efficiency: Over-turn masking allows training with a small turn budget, reducing resource requirements (e.g., 3 days for 6 turns vs. 10 days for 16 turns) with negligible impact on test accuracy.
Inference: Temperature is set to 1.0 to mitigate repetition in long trajectories.
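The trade-off between pixel budget and interaction depth can be made concrete with rough arithmetic. The sketch below is a back-of-the-envelope estimate only. It assumes that each visual token covers a 28×28 pixel area (patch size 14 with 2×2 merging, as commonly described for Qwen2.5-VL), that "32K" means 32,768 tokens, that every observation uses the full pixel budget, and that each turn adds a fixed, hypothetical 200 text tokens. Real trajectories will differ because crops are often smaller than the budget.

```python
def visual_tokens(pixels: int, pixels_per_token: int = 28 * 28) -> int:
    """Ceiling division: tokens needed to cover an image of `pixels` pixels."""
    return -(-pixels // pixels_per_token)


def max_image_turns(context_tokens: int,
                    pixels_per_image: int,
                    text_tokens_per_turn: int,
                    pixels_per_token: int = 28 * 28) -> int:
    """Upper bound on turns if every turn carries one full-budget image."""
    per_turn = visual_tokens(pixels_per_image, pixels_per_token) + text_tokens_per_turn
    return context_tokens // per_turn


context = 32 * 1024  # assumption: "32K" means 32,768 tokens
for budget in (2_000_000, 500_000):
    turns = max_image_turns(context, budget, text_tokens_per_turn=200)
    print(budget, visual_tokens(budget), turns)
```

Under these assumptions a 2M-pixel budget leaves room for roughly a dozen full-size observations, while a quarter of that budget leaves room for several times more, which is the tension between perceptual fidelity and interaction depth noted above.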

Implications and Future Directions

Mini-o3 establishes a practical recipe for scaling agentic reasoning in VLMs, enabling robust performance on tasks that require deep, trial-and-error exploration. The over-turn masking strategy is broadly applicable to other RL-based agentic systems, facilitating efficient training and test-time scaling. The VisualProbe dataset sets a new standard for evaluating multi-turn visual reasoning.

Future work may explore:

Extending the approach to larger models and more diverse toolkits (e.g., web browsing, code execution).
Integrating more sophisticated reward models for semantic evaluation.
Investigating scaling laws for interaction turns and context length in multimodal RL.

Conclusion

Mini-o3 advances the state of the art in multi-turn visual search by combining a challenging dataset, a cold-start data collection pipeline, and an over-turn masking strategy for reinforcement learning. The model demonstrates scalable reasoning depth, diverse agentic behaviors, and strong empirical performance across benchmarks. The methodology provides actionable guidance for developing multimodal agents capable of deep, iterative exploration in complex environments.