Mini-o3: Scalable Multimodal Visual Search Agent

Updated 11 September 2025
  • Mini-o3 is a multimodal agent framework that uses an iterative thought–action–observation loop to enable deep multi-turn reasoning in complex visual search tasks.
  • It employs a tailored training pipeline with a Visual Probe Dataset, cold-start data generation, and over-turn masking in RL to extend reasoning depth beyond training limits.
  • Empirical results demonstrate state-of-the-art performance with improved accuracy on hard visual tasks, validating its scalable and robust trial-and-error search strategy.

Mini-o3 is a multimodal agent framework optimized for deep, multi-turn reasoning in challenging visual search tasks. It achieves scalable OpenAI o3-style tool-augmented behaviors using a thought–action–observation loop, extensive trial-and-error strategies, and adaptive training aligned to high interaction depths. The Mini-o3 recipe incorporates a task-specific Visual Probe Dataset, cold-start data generation emphasizing diverse reasoning patterns, and a novel over-turn masking strategy in reinforcement learning to unlock inference-time reasoning that far exceeds the training budget for interaction turns. Its architecture and training regimen position Mini-o3 as a state-of-the-art model for tasks requiring stepwise attention, zoom-based exploration, and robust decision-making in complex visual domains (Lai et al., 9 Sep 2025).

1. System Architecture and Reasoning Loop

Mini-o3 operates as an iterative, multimodal agent structured around a thought–action–observation loop (a minimal sketch follows the list below). At every interaction turn, the policy:

  • Produces a “thought” reflecting the current task state and prior observations.
  • Generates an “action,” typically a zoom or region selection in the normalized image space $[0,1]^2$, or a final answer.
  • Receives an “observation” produced by applying the selected action, i.e., cropping/processing image regions or collecting tool feedback.
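
A minimal sketch of this loop is shown below, assuming a hypothetical `policy` callable that returns a thought plus either a `zoom` action (a normalized bounding box) or a final `answer`; the function and field names are illustrative rather than the paper's actual API.

```python
from PIL import Image

def run_agent(policy, image: Image.Image, question: str, max_turns: int = 32):
    """Minimal thought-action-observation loop (illustrative sketch)."""
    history = []   # accumulated (thought, action, observation) records
    view = image   # the first observation is the full image
    for _ in range(max_turns):
        # The policy conditions on the question, the current view, and the history.
        thought, action = policy(question, view, history)
        if action["type"] == "answer":
            return action["text"], history
        # Zoom: crop the region given by a normalized box in [0, 1]^2.
        x0, y0, x1, y1 = action["box"]
        w, h = image.size
        view = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
        history.append({"thought": thought, "action": action, "observation": view})
    return None, history  # turn budget exhausted without a final answer
```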

Unlike conventional visual question answering (VQA) models constrained to single or few steps, Mini-o3 is engineered for deep, multi-turn trajectories. Its architecture encourages exploration, self-reflection, and goal maintenance, and supports complex search strategies including depth-first trial-and-error.

2. Visual Probe Dataset and Task Design

The Visual Probe Dataset is central to Mini-o3’s capabilities:

  • Comprises 4,000 training and 500 test visual question–answer instances spanning three difficulty levels (easy, medium, hard).
  • Images contain small, spatially sparse targets amid dense distractors and require high-resolution, region-based focus.
  • Many tasks necessitate iterative zoom-in, back-and-forth attention switching, and multi-step elimination—patterns beyond the capacity of short-turn VQA models.

The dataset is structurally adversarial, emphasizing tasks where shallow or static reasoning is insufficient, and thus benchmarks an agent’s ability to sustain and modulate its search trajectory over tens of steps.
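
A plausible schema for one instance is sketched below; the field names and difficulty encoding are assumptions for illustration, not the released data format.

```python
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

@dataclass
class VisualProbeInstance:
    image_path: str         # high-resolution image with a small, sparse target
    question: str           # e.g., "What is written on the sign beside the red kiosk?"
    answer: str             # ground-truth short answer used for exact-match scoring
    difficulty: Difficulty  # one of the three tiers (easy, medium, hard)
```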

3. Iterative Data Collection Pipeline

Mini-o3 employs a data generation pipeline tailored for high-diversity, cold-start learning:

  • Manual seeding with high-quality exemplar trajectories, each documenting sequential thoughts, bounding box actions, and intermediate observations.
  • Expansion using vision–language foundation models prompted in-context to mimic the exemplars, generating exploration-rich multi-turn trajectories.
  • Data curation: only trajectories that yield a correct answer within the allowed turn limit are retained (a filtering sketch follows this list).
  • This approach ensures the cold-start data exhibits trial-and-error behavior, local hypothesis refinement, and non-deterministic exploration, which are key to scaling up reasoning diversity and avoiding monotonic behavioral collapse.
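
A minimal sketch of the curation step is given below, assuming each generated trajectory records its predicted answer, ground truth, and turn count; the record fields and the `exact_match` scorer are hypothetical stand-ins.

```python
def exact_match(pred: str, gold: str) -> bool:
    # Simple normalized string comparison (a stand-in for the actual scorer).
    return pred.strip().lower() == gold.strip().lower()

def curate_trajectories(trajectories, turn_limit: int = 6):
    """Retain only trajectories that answer correctly within the turn budget."""
    kept = []
    for traj in trajectories:
        within_budget = traj["num_turns"] <= turn_limit
        correct = exact_match(traj["predicted_answer"], traj["ground_truth"])
        if within_budget and correct:
            kept.append(traj)
    return kept
```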

4. Reinforcement Learning and Over-Turn Masking

During policy optimization, Mini-o3 must bridge the gap between practical training constraints (e.g., short rollouts for efficiency) and test-time requirements (potentially unlimited search horizon). The over-turn masking mechanism is central:

  • Training employs a fixed turn cap (e.g., six steps); vanilla RL would assign zero reward to trajectories exceeding this limit, inducing negative gradients.
  • Instead, the completion mask $M_i$ is defined as:

$$M_i = \mathbb{1}\{|o_i| \leq C_{\text{context}}\} \cdot \mathbb{1}\{\mathrm{turn}(o_i) \leq C_{\text{turn}}\}$$

where $o_i$ is a sampled output, and $C_{\text{context}}$ and $C_{\text{turn}}$ are the respective context and turn budgets. The masked GRPO objective is then:

$$J_{\text{over-turn\_GRPO}}(\theta) = \mathbb{E}_{q, \{o_i\}} \left[ \frac{1}{\sum_{i=1}^G M_i} \sum_{i=1}^G \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} \, (A_i \cdot M_i), \; \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) (A_i \cdot M_i) \right) \right]$$

  • Responses exceeding the limits ($M_i = 0$) are masked out of the gradient, so over-turn exploration is neither penalized nor artificially encouraged during training, while at inference the policy can scale up to an arbitrary number of reasoning turns (see the sketch below).
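
A minimal PyTorch sketch of this masked objective follows, assuming per-trajectory log-probabilities and group-relative advantages have already been computed; tensor names and shapes are illustrative.

```python
import torch

def over_turn_grpo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Masked GRPO objective: over-budget rollouts (mask == 0) contribute
    no gradient, so they are neither penalized nor rewarded.

    logp_new:   log pi_theta(o_i | q), shape (G,), requires grad
    logp_old:   log pi_theta_old(o_i | q), shape (G,), detached
    advantages: group-relative advantages A_i, shape (G,)
    mask:       M_i in {0, 1}, shape (G,); 1 iff within context and turn budgets
    """
    ratio = torch.exp(logp_new - logp_old)                   # importance ratio
    masked_adv = advantages * mask                           # A_i * M_i
    unclipped = ratio * masked_adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * masked_adv
    objective = torch.minimum(unclipped, clipped)
    # Normalize by the number of in-budget rollouts, not the full group size G.
    denom = mask.sum().clamp(min=1.0)
    return -(objective.sum() / denom)  # negate: maximize J by minimizing the loss
```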

5. Reasoning Pattern Diversity

Mini-o3 explicitly targets and captures a diverse array of search and reasoning behaviors:

  • Depth-first exploration: recursively zooming into local candidate regions until a plausible hypothesis is confirmed or rejected.
  • Trial-and-error: sequentially testing hypotheses, employing self-reflective thoughts (e.g., “previous region is not the target, try the next”).
  • Goal maintenance: explicit notes in the “thought” field tracking intermediate subgoals, supporting both local and global search context recall.
  • This mixture contrasts with previous open-source systems, which often devolve into superficial, single-path exploration.

Such diversity is critical for real-world visual search, where correct answers may only be reachable through dynamic, adaptive, and reflective behaviors.
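
A hypothetical trajectory excerpt, written as Python data for concreteness, shows how trial-and-error and goal maintenance surface in the thought field; the wording and record structure are invented for illustration.

```python
# Hypothetical excerpt of a multi-turn trajectory (illustrative only).
trajectory = [
    {"thought": "The question asks about text on a storefront; scan the left block first.",
     "action": {"type": "zoom", "box": (0.05, 0.40, 0.30, 0.70)}},
    {"thought": "This region shows a cafe, not the storefront. Try the next block.",
     "action": {"type": "zoom", "box": (0.35, 0.40, 0.60, 0.70)}},
    {"thought": "Subgoal met: found the storefront. Zoom in further to read the sign.",
     "action": {"type": "zoom", "box": (0.42, 0.48, 0.55, 0.58)}},
    {"thought": "The sign is now legible; this answers the question.",
     "action": {"type": "answer", "text": "NORTH MARKET"}},
]
```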

6. Experimental Results and Scaling Behavior

Empirical evaluation demonstrates Mini-o3’s state-of-the-art performance:

  • On the VisualProbe (hard) set, Mini-o3 achieves approximately 48% accuracy at a 32-turn test-time budget (Avg@32), markedly surpassing baselines such as DeepEyes (35%).
  • On HR-Bench, V* Bench, and MME-Realworld, the model’s superiority persists, especially as task complexity and required turn depth increase.
  • A key observed trend: although training is strictly limited (e.g., six-turn rollouts), test-time accuracy increases monotonically as the allowed number of agent turns rises, up to at least 32 (an evaluation sketch follows this list). This scaling is a direct result of over-turn masking and diverse trajectory induction.
  • Ablation studies confirm that both the inclusion of hard RL examples and the SFT (supervised fine-tuning) “cold-start” phase are essential; ablating either reduces attainable reasoning depth and accuracy.
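
One way to reproduce this kind of scaling curve is to sweep the inference-time turn cap and measure accuracy at each setting; the sketch below reuses the hypothetical `run_agent` loop and `exact_match` scorer from the earlier sections.

```python
from PIL import Image

def accuracy_vs_turn_budget(policy, dataset, budgets=(4, 8, 16, 32)):
    """Measure answer accuracy as a function of the inference-time turn cap."""
    results = {}
    for max_turns in budgets:
        correct = 0
        for ex in dataset:  # e.g., a list of VisualProbeInstance records
            image = Image.open(ex.image_path)
            answer, _ = run_agent(policy, image, ex.question, max_turns=max_turns)
            if answer is not None and exact_match(answer, ex.answer):
                correct += 1
        results[max_turns] = correct / len(dataset)
    return results
```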

7. Implications, Limitations, and Outlook

Mini-o3 establishes a new regime for visual search and multimodal reasoning agents by:

  • Demonstrating that reinforcement learning—if paired with turn-masking and thoughtful data strategy—can yield policies generalizing to interaction regimes far outside the training envelope.
  • Introducing design practices for constructing datasets and feedback loops that promote not just accuracy, but also rich behavioral variety and scalability.
  • Enabling test-time adaptation in reasoning depth, which allows the agent to allocate more steps for harder problems without retraining.

However, challenges remain. Mini-o3’s architecture is tailored to visual search in a tool-augmented setting; its transferability to non-visual or more general task domains is not addressed by the underlying data or experiments. Furthermore, over-turn masking as implemented assumes that extended trajectories are computationally feasible at inference time, which may limit deployment in latency-critical or resource-constrained settings.

In summary, Mini-o3 provides a blueprint for achieving scalable, deep, multi-turn reasoning and exploration in complex visual domains, with a principled reinforcement learning objective ensuring efficient training and unbounded test-time behavior (Lai et al., 9 Sep 2025).
