Mini-o3: Scalable Multimodal Visual Search Agent

Updated 11 September 2025
  • Mini-o3 is a multimodal agent framework that uses an iterative thought–action–observation loop to enable deep multi-turn reasoning in complex visual search tasks.
  • It employs a tailored training pipeline with a Visual Probe Dataset, cold-start data generation, and over-turn masking in RL to extend reasoning depth beyond training limits.
  • Empirical results demonstrate state-of-the-art performance with improved accuracy on hard visual tasks, validating its scalable and robust trial-and-error search strategy.

Mini-o3 is a multimodal agent framework optimized for deep, multi-turn reasoning in challenging visual search tasks. It achieves scalable OpenAI o3-style tool-augmented behaviors using a thought–action–observation loop, extensive trial-and-error strategies, and adaptive training aligned to high interaction depths. The Mini-o3 recipe incorporates a task-specific Visual Probe Dataset, cold-start data generation emphasizing diverse reasoning patterns, and a novel over-turn masking strategy in reinforcement learning to unlock inference-time reasoning that far exceeds the training budget for interaction turns. Its architecture and training regimen position Mini-o3 as a state-of-the-art model for tasks requiring stepwise attention, zoom-based exploration, and robust decision-making in complex visual domains (Lai et al., 9 Sep 2025).

1. System Architecture and Reasoning Loop

Mini-o3 operates as an iterative, multimodal agent structured around a thought–action–observation loop (a minimal sketch follows the list below). At every interaction turn, the policy:

  • Produces a “thought” reflecting the current task state and prior observations.
  • Generates an “action,” typically a zoom or region selection in the normalized image space $[0,1]^2$, or a final answer.
  • Receives an “observation” produced by applying the selected action, i.e., cropping/processing image regions or collecting tool feedback.
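
A minimal sketch of this loop is shown below, assuming a hypothetical `policy` callable that returns a thought plus either a `zoom` action (a normalized bounding box) or a final `answer`; the function and field names are illustrative rather than the paper's actual API.

```python
from PIL import Image

def run_agent(policy, image: Image.Image, question: str, max_turns: int = 32):
    """Minimal thought-action-observation loop (illustrative sketch)."""
    history = []   # accumulated (thought, action, observation) records
    view = image   # the first observation is the full image
    for _ in range(max_turns):
        # The policy conditions on the question, the current view, and the history.
        thought, action = policy(question, view, history)
        if action["type"] == "answer":
            return action["text"], history
        # Zoom: crop the region given by a normalized box in [0, 1]^2.
        x0, y0, x1, y1 = action["box"]
        w, h = image.size
        view = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
        history.append({"thought": thought, "action": action, "observation": view})
    return None, history  # turn budget exhausted without a final answer
```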

Unlike conventional visual question answering (VQA) models constrained to single or few steps, Mini-o3 is engineered for deep, multi-turn trajectories. Its architecture encourages exploration, self-reflection, and goal maintenance, and supports complex search strategies including depth-first trial-and-error.

2. Visual Probe Dataset and Task Design

The Visual Probe Dataset is central to Mini-o3’s capabilities:

  • Comprises 4,000 training and 500 test visual question–answer instances spanning three difficulty levels (easy, medium, hard).
  • Images contain small, spatially sparse targets amid dense distractors and require high-resolution, region-based focus.
  • Many tasks necessitate iterative zoom-in, back-and-forth attention switching, and multi-step elimination—patterns beyond the capacity of short-turn VQA models.

The dataset is structurally adversarial, emphasizing tasks where shallow or static reasoning is insufficient, and thus benchmarks an agent’s ability to sustain and modulate its search trajectory over tens of steps.
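
A plausible schema for one instance is sketched below; the field names and difficulty encoding are assumptions for illustration, not the released data format.

```python
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

@dataclass
class VisualProbeInstance:
    image_path: str         # high-resolution image with a small, sparse target
    question: str           # e.g., "What is written on the sign beside the red kiosk?"
    answer: str             # ground-truth short answer used for exact-match scoring
    difficulty: Difficulty  # one of the three tiers (easy, medium, hard)
```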

3. Iterative Data Collection Pipeline

Mini-o3 employs a data generation pipeline tailored for high-diversity, cold-start learning:

  • Manual seeding with high-quality exemplar trajectories, each documenting sequential thoughts, bounding box actions, and intermediate observations.
  • Expansion using vision–language foundation models prompted in-context to mimic the exemplars, generating exploration-rich multi-turn trajectories.
  • Data curation: only trajectories that yield a correct answer within the allowed turn limit are retained (a filtering sketch follows this list).
  • This approach ensures the cold-start data exhibits trial-and-error behavior, local hypothesis refinement, and non-deterministic exploration, which are key to scaling up reasoning diversity and avoiding monotonic behavioral collapse.
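
A minimal sketch of the curation step is given below, assuming each generated trajectory records its predicted answer, ground truth, and turn count; the record fields and the `exact_match` scorer are hypothetical stand-ins.

```python
def exact_match(pred: str, gold: str) -> bool:
    # Simple normalized string comparison (a stand-in for the actual scorer).
    return pred.strip().lower() == gold.strip().lower()

def curate_trajectories(trajectories, turn_limit: int = 6):
    """Retain only trajectories that answer correctly within the turn budget."""
    kept = []
    for traj in trajectories:
        within_budget = traj["num_turns"] <= turn_limit
        correct = exact_match(traj["predicted_answer"], traj["ground_truth"])
        if within_budget and correct:
            kept.append(traj)
    return kept
```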

4. Reinforcement Learning and Over-Turn Masking

During policy optimization, Mini-o3 must bridge the gap between practical training constraints (e.g., short rollouts for efficiency) and test-time requirements (potentially unlimited search horizon). The over-turn masking mechanism is central:

  • Training employs a fixed turn cap (e.g., six steps); vanilla RL would assign zero reward to trajectories exceeding this limit, inducing negative gradients.
  • Instead, the completion mask $M_i$ is defined as:

$$M_i = \mathbb{1}\{|o_i| \leq C_{\text{context}}\} \cdot \mathbb{1}\{\mathrm{turn}(o_i) \leq C_{\text{turn}}\}$$

where $o_i$ is a sampled output, and $C_{\text{context}}$ and $C_{\text{turn}}$ are the respective context and turn budgets. The masked GRPO objective is then:

$$J_{\text{over-turn\_GRPO}}(\theta) = \mathbb{E}_{q, \{o_i\}} \left[ \frac{1}{\sum_{i=1}^G M_i} \sum_{i=1}^G \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} \, (A_i \cdot M_i), \; \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) (A_i \cdot M_i) \right) \right]$$

  • Responses exceeding the limits ($M_i = 0$) are masked out of the gradient, so over-turn exploration is neither penalized nor artificially encouraged during training, while at inference the policy can scale up to an arbitrary number of reasoning turns (see the sketch below).
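
A minimal PyTorch sketch of this masked objective follows, assuming per-trajectory log-probabilities and group-relative advantages have already been computed; tensor names and shapes are illustrative.

```python
import torch

def over_turn_grpo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Masked GRPO objective: over-budget rollouts (mask == 0) contribute
    no gradient, so they are neither penalized nor rewarded.

    logp_new:   log pi_theta(o_i | q), shape (G,), requires grad
    logp_old:   log pi_theta_old(o_i | q), shape (G,), detached
    advantages: group-relative advantages A_i, shape (G,)
    mask:       M_i in {0, 1}, shape (G,); 1 iff within context and turn budgets
    """
    ratio = torch.exp(logp_new - logp_old)                   # importance ratio
    masked_adv = advantages * mask                           # A_i * M_i
    unclipped = ratio * masked_adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * masked_adv
    objective = torch.minimum(unclipped, clipped)
    # Normalize by the number of in-budget rollouts, not the full group size G.
    denom = mask.sum().clamp(min=1.0)
    return -(objective.sum() / denom)  # negate: maximize J by minimizing the loss
```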

5. Reasoning Pattern Diversity

Mini-o3 explicitly targets and captures a diverse array of search and reasoning behaviors:

  • Depth-first exploration: recursively zooming into local candidate regions until a plausible hypothesis is confirmed or rejected.
  • Trial-and-error: sequentially testing hypotheses, employing self-reflective thoughts (e.g., “previous region is not the target, try the next”).
  • Goal maintenance: explicit notes in the “thought” field tracking intermediate subgoals, supporting both local and global search context recall.
  • This mixture contrasts with previous open-source systems, which often devolve into superficial, single-path exploration.

Such diversity is critical for real-world visual search, where correct answers may only be reachable through dynamic, adaptive, and reflective behaviors.
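
A hypothetical trajectory excerpt, written as Python data for concreteness, shows how trial-and-error and goal maintenance surface in the thought field; the wording and record structure are invented for illustration.

```python
# Hypothetical excerpt of a multi-turn trajectory (illustrative only).
trajectory = [
    {"thought": "The question asks about text on a storefront; scan the left block first.",
     "action": {"type": "zoom", "box": (0.05, 0.40, 0.30, 0.70)}},
    {"thought": "This region shows a cafe, not the storefront. Try the next block.",
     "action": {"type": "zoom", "box": (0.35, 0.40, 0.60, 0.70)}},
    {"thought": "Subgoal met: found the storefront. Zoom in further to read the sign.",
     "action": {"type": "zoom", "box": (0.42, 0.48, 0.55, 0.58)}},
    {"thought": "The sign is now legible; this answers the question.",
     "action": {"type": "answer", "text": "NORTH MARKET"}},
]
```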

6. Experimental Results and Scaling Behavior

Empirical evaluation demonstrates Mini-o3’s state-of-the-art performance:

  • On the VisualProbe (hard) set, Mini-o3 achieves approximately 48% accuracy at a 32-turn test-time budget (Avg@32), markedly surpassing baselines such as DeepEyes (35%).
  • On HR-Bench, V* Bench, and MME-Realworld, the model’s superiority persists, especially as task complexity and required turn depth increase.
  • A key observed trend: although training is strictly limited (e.g., six-turn rollouts), test-time accuracy increases monotonically as the allowed number of agent turns rises, up to at least 32 (an evaluation sketch follows this list). This scaling is a direct result of over-turn masking and diverse trajectory induction.
  • Ablation studies confirm that both the inclusion of hard RL examples and the SFT (supervised fine-tuning) “cold-start” phase are essential; ablating either reduces attainable reasoning depth and accuracy.
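
One way to reproduce this kind of scaling curve is to sweep the inference-time turn cap and measure accuracy at each setting; the sketch below reuses the hypothetical `run_agent` loop and `exact_match` scorer from the earlier sections.

```python
from PIL import Image

def accuracy_vs_turn_budget(policy, dataset, budgets=(4, 8, 16, 32)):
    """Measure answer accuracy as a function of the inference-time turn cap."""
    results = {}
    for max_turns in budgets:
        correct = 0
        for ex in dataset:  # e.g., a list of VisualProbeInstance records
            image = Image.open(ex.image_path)
            answer, _ = run_agent(policy, image, ex.question, max_turns=max_turns)
            if answer is not None and exact_match(answer, ex.answer):
                correct += 1
        results[max_turns] = correct / len(dataset)
    return results
```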

7. Implications, Limitations, and Outlook

Mini-o3 establishes a new regime for visual search and multimodal reasoning agents by:

  • Demonstrating that reinforcement learning—if paired with turn-masking and thoughtful data strategy—can yield policies generalizing to interaction regimes far outside the training envelope.
  • Introducing design practices for constructing datasets and feedback loops that promote not just accuracy, but also rich behavioral variety and scalability.
  • Enabling test-time adaptation in reasoning depth, which allows the agent to allocate more steps for harder problems without retraining.

However, challenges remain. Mini-o3’s architecture is tailored to visual search in a tool-augmented setting; its transferability to non-visual or more general task domains is not addressed by the underlying data or experiments. Furthermore, over-turn masking as implemented assumes that extended trajectories are computationally feasible at inference time, which may limit deployment in latency-critical or resource-constrained settings.

In summary, Mini-o3 provides a blueprint for achieving scalable, deep, multi-turn reasoning and exploration in complex visual domains, with a principled reinforcement learning objective ensuring efficient training and unbounded test-time behavior (Lai et al., 9 Sep 2025).
