
OpenVLThinker: Modular Vision-Language Framework

Updated 9 December 2025
  • OpenVLThinker is a modular framework combining vision and language models to achieve advanced multimodal reasoning, object discovery, and tool use.
  • It integrates supervised fine-tuning with reinforcement learning and chain-of-thought distillation, leading to improved accuracy in chart reasoning and open-world detection.
  • The framework employs standardized tool interfaces and an iterative training pipeline, enabling autonomous, adaptive reasoning that outperforms traditional LVLM approaches.

OpenVLThinker is a framework and methodology for equipping large vision-language models (LVLMs) with advanced, adaptive multimodal reasoning, object discovery, and tool-using behavior by integrating supervised fine-tuning, reinforcement learning, standardized tool interfaces, and iterative self-improvement. Its development leverages modular pipelines for chart and diagrammatic reasoning, open-world object detection, and state-of-the-art chain-of-thought distillation from text-only advanced reasoning models. Notable instantiations include the OpenVLThinker-7B model (Deng et al., 21 Mar 2025), the OpenThinkIMG system with V-ToolRL (Su et al., 13 May 2025), and OWOD-style detectors using a vision-language “brain” (Ma et al., 2023).

1. System Architecture and Core Principles

OpenVLThinker adopts a modular, extensible architecture to support multiple lines of research in visual reasoning and open-world perception. Its canonical LVLM core combines a transformer-style LLM backbone, typically at the 7B or 2B parameter scale, with a ViT-style (Vision Transformer) visual encoder. Visual token embeddings are either prepended to or interleaved with the textual query tokens, enabling the model to attend jointly over both modalities at every generation step (Deng et al., 21 Mar 2025, Su et al., 13 May 2025).
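The sketch below illustrates this design at a toy scale: patch features from a ViT-style encoder are projected into the LLM embedding space and prepended to the text embeddings. Module and parameter names (ToyLVLM, vit, llm, proj) are generic placeholders for illustration, not the actual Qwen2-VL implementation.

```python
import torch
import torch.nn as nn

class ToyLVLM(nn.Module):
    """Minimal LVLM skeleton: ViT-style encoder + projection + LLM backbone."""
    def __init__(self, vit: nn.Module, llm: nn.Module, d_vision: int, d_model: int):
        super().__init__()
        self.vit = vit                            # ViT-style visual encoder -> patch features
        self.proj = nn.Linear(d_vision, d_model)  # map patch features into the LLM embedding space
        self.llm = llm                            # transformer LLM backbone

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vit(image)                 # [B, P, d_vision] patch features
        visual_tokens = self.proj(patches)        # [B, P, d_model]
        # Prepend (or interleave) visual tokens with the text embeddings so the LLM
        # attends jointly over both modalities at every generation step.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```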

Key components:

  • Standardized Vision Tool Interfaces: OpenThinkIMG provides a registry of vision tools (e.g., OCR, GroundingDINO, SAM, Point, Crop, DrawHorizontalLineByY/DrawVerticalLineByX, ZoomInSubplot, SegmentRegionAroundPoint) with a unified input signature (image tensor plus arguments) and JSON-structured outputs (Su et al., 13 May 2025).
  • Central Tool Controller: Orchestrates multi-turn tool invocation by parsing LVLM-planned actions, parallelizing calls to containerized tool services, aggregating outputs $\omega_t$, and feeding context back into the model for subsequent decisions.
  • Assistant "Brain": In open-world detection, a pretrained VL model (e.g., GLIP) acts as an external proposer of unknown object regions, supplying soft-labeled pseudo-annotations for incremental discovery (Ma et al., 2023).

This integrated infrastructure enables agents to autonomously plan, invoke, and use tools or external knowledge sources in a looped, adaptive fashion.
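A minimal sketch of such an infrastructure follows, assuming hypothetical names (TOOL_REGISTRY, register_tool, lvlm.plan, lvlm.answer); the actual OpenThinkIMG registry and controller APIs may differ in detail (Su et al., 13 May 2025).

```python
import json
from typing import Any, Callable, Dict

# Registry of vision tools with a unified signature:
# tool(image, **kwargs) -> JSON-serializable dict.
TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {}

def register_tool(name: str):
    def decorator(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register_tool("OCR")
def ocr(image, **kwargs) -> Dict[str, Any]:
    # Placeholder body: a real deployment would call a containerized OCR service.
    return {"tool": "OCR", "text": "..."}

def run_episode(lvlm, image, question: str, max_turns: int = 6) -> str:
    """Controller loop: the LVLM plans an action, the controller executes the
    named tool, and the JSON output is fed back as context for the next turn."""
    context = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        action = lvlm.plan(image, context)          # e.g. {"tool": "Crop", "args": {...}} or {"answer": "..."}
        if "answer" in action:
            return action["answer"]
        observation = TOOL_REGISTRY[action["tool"]](image, **action.get("args", {}))
        context.append({"role": "tool", "content": json.dumps(observation)})
    return lvlm.answer(image, context)              # force a final answer if turns run out
```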

2. Iterative Training Pipeline: Supervised Fine-Tuning and Reinforcement Learning

The OpenVLThinker methodology is defined by alternating phases of supervised fine-tuning (SFT) and reinforcement learning (RL), sometimes coupled with self-distillation and knowledge from pure-text reasoners (Deng et al., 21 Mar 2025).

Supervised Fine-Tuning (SFT)

  • Initialization: Training begins with SFT on demonstration trajectories constructed by distilling high-quality chain-of-thought (CoT) reasoning from advanced text-only models (e.g., DeepSeek-R1-Distill-14B). For visual domains, this involves captioning each image, prompting the reasoner with (caption, question) pairs, and verifying sampled reasoning chains against ground truth, keeping the shortest correct chain. The resulting (image, question, distilled CoT + answer) triplets seed the initial SFT dataset (≈25K examples from diverse vision-math datasets) (Deng et al., 21 Mar 2025); a sketch of this loop follows the list.
  • Tool-Use Demos: For tool-augmented LVLMs, trajectories are generated by prompting large models (e.g., GPT-4o) to plan tool actions, automatically executing and filtering for correctness via JSON-schema checks, rule-based tests, and spot-checking (Su et al., 13 May 2025).
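Below is a minimal sketch of the distillation-and-filtering loop from the first bullet, assuming hypothetical captioner, reasoner, and is_equivalent helpers; it keeps only the shortest verified chain per example, as described above.

```python
def build_sft_example(image, question, gold_answer,
                      captioner, reasoner, is_equivalent, n_samples: int = 8):
    """Distill one (image, question, CoT + answer) triplet for the SFT dataset."""
    caption = captioner(image)                       # text surrogate for the image
    verified = []
    for _ in range(n_samples):
        cot, answer = reasoner(caption, question)    # sample a chain-of-thought + answer
        if is_equivalent(answer, gold_answer):       # verify against ground truth
            verified.append((cot, answer))
    if not verified:
        return None                                  # drop examples with no correct chain
    cot, answer = min(verified, key=lambda pair: len(pair[0]))  # keep the shortest correct chain
    return {"image": image, "question": question, "cot": cot, "answer": answer}
```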

V-ToolRL and GRPO-based Reinforcement Learning

  • Policy Learning: Post-SFT, the agent trains via Group Relative Policy Optimization (GRPO), sampling groups of trajectories, computing group-normalized advantages, and optimizing a clipped policy-gradient objective (see the sketch after this list). The state input includes the full reasoning context plus all tool outputs to date, $s_t = (Q, I, \omega_{<t})$, and the policy $\pi_\theta(a_t \mid s_t)$ produces the next tool action (Deng et al., 21 Mar 2025, Su et al., 13 May 2025).
  • Reward Function: For chart reasoning or multimodal VQA, the primary reward is binary: $R^{(i)} = +1$ if the LVLM's answer matches the ground truth (is_equivalent), otherwise $-1$. Optional intermediate rewards can assess tool-output quality but are not generally required for convergence (Su et al., 13 May 2025).
  • Iteration: Each RL-improved checkpoint can generate more refined SFT traces via self-correction for the next SFT phase, yielding steadily improving policies and facilitating data distillation beyond static demonstration datasets (Deng et al., 21 Mar 2025).
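The following is a condensed sketch of the GRPO objective with the binary reward described above; trajectory sampling and log-probability computation are abstracted away, and the function names are illustrative rather than taken from the released code.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # [G] log-probs of sampled answers under the current policy
              logp_old: torch.Tensor,   # [G] log-probs under the behavior (sampling) policy
              correct: torch.Tensor,    # [G] 1.0 if the answer matches ground truth, else 0.0
              clip_eps: float = 0.2) -> torch.Tensor:
    rewards = 2.0 * correct - 1.0                                  # R = +1 if correct, else -1
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # group-normalized advantage
    ratio = torch.exp(logp_new - logp_old)                         # importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()                   # negative clipped surrogate
```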

3. Trajectory Generation and Pseudo-Labeling for Open-World Generalization

Open-World Detection

  • Cascade-Decoupled Detection: Detectors employ a backbone encoder and Transformer-based heads that decouple object localization (foreground/background/box score) from class identification (including "unknown"). This cascade structure ensures that the objectness measure is not distorted by class confusion (Ma et al., 2023).
  • Down-Weighted Losses: For pseudo-labeled unknowns generated by the VL "brain" (GLIP), losses are scaled by the external model's confidence $\hat{S}$ to avoid overfitting to noisy labels (a code sketch follows this list):

$$\mathcal{L}_r^z = \frac{1}{|\ell_z|}\sum_{i \in \ell_z} \hat{S}_{\omega(i)}\Big[\|b_i - \hat{b}_{\omega(i)}\|_1 + 1 - \mathrm{GIoU}\big(b_i,\hat{b}_{\omega(i)}\big)\Big]$$

$$\mathcal{L}_{cls}^z = \frac{1}{\sum_{i} cls_i}\sum_i \mathrm{Focal}\big(cls_i,\hat{S}_{\omega(i)}\big)$$

  • Pseudo-Discovery: The model self-labels additional pseudo-unknowns by selecting the highest box-score predictions among unmatched queries, promoting evolution of its objectness concept without direct supervision (Ma et al., 2023).
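A rough PyTorch-style rendering of the confidence-weighted losses above is sketched below. The box format, the use of torchvision's GIoU and focal-loss helpers, and the scaling of classification targets by $\hat{S}$ are assumptions made for illustration, not the reference implementation of Ma et al. (2023).

```python
import torch
from torchvision.ops import generalized_box_iou_loss, sigmoid_focal_loss

def pseudo_label_losses(pred_boxes: torch.Tensor,    # [N, 4] predicted boxes (x1, y1, x2, y2)
                        pseudo_boxes: torch.Tensor,  # [N, 4] GLIP pseudo-boxes matched to predictions
                        confidences: torch.Tensor,   # [N] external confidences S_hat
                        cls_logits: torch.Tensor,    # [N, C] logits of the class/objectness head
                        cls_targets: torch.Tensor):  # [N, C] one-hot pseudo-label targets
    # Regression: L1 + (1 - GIoU), down-weighted by the external confidence.
    l1 = (pred_boxes - pseudo_boxes).abs().sum(dim=-1)
    giou = generalized_box_iou_loss(pred_boxes, pseudo_boxes, reduction="none")  # = 1 - GIoU
    loss_reg = (confidences * (l1 + giou)).mean()

    # Classification: focal loss against confidence-scaled soft targets,
    # normalized by the number of positive pseudo-labels (an interpretation of the formula above).
    focal = sigmoid_focal_loss(cls_logits, cls_targets * confidences.unsqueeze(-1), reduction="sum")
    loss_cls = focal / cls_targets.sum().clamp(min=1.0)
    return loss_reg, loss_cls
```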

Tool-Use Reasoning

The trajectory-generation pipeline for tool use involves three steps: action planning (few-shot LLM generation), tool-call completion (batch execution), and filtering/validation (schema/rule checks and human review). This enables SFT bootstrapping and performance beyond static human-generated examples (Su et al., 13 May 2025).
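As an illustration of the filtering/validation step, the sketch below validates each planned action against a JSON schema and then applies rule-based checks; the schema fields and helper names are assumptions rather than the exact checks used in OpenThinkIMG.

```python
import json
from jsonschema import ValidationError, validate

# Assumed schema for a single planned tool action.
ACTION_SCHEMA = {
    "type": "object",
    "required": ["tool", "args"],
    "properties": {"tool": {"type": "string"}, "args": {"type": "object"}},
}

def keep_trajectory(trajectory: dict, rule_checks: list) -> bool:
    """Return True only if every action parses, matches the schema,
    and all rule-based checks on the full trajectory pass."""
    for step in trajectory.get("actions", []):
        try:
            action = json.loads(step) if isinstance(step, str) else step
            validate(instance=action, schema=ACTION_SCHEMA)
        except (json.JSONDecodeError, ValidationError):
            return False
    return all(check(trajectory) for check in rule_checks)
```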

4. Experimental Results and Empirical Findings

Empirical evaluations show substantial improvements in both tool-augmented visual reasoning and open-world object detection.

Tool-Augmented Chart Reasoning

On the ChartGemma dataset:

  • Backbone: Qwen2-VL-2B-Instruct
  • Tools: OCR, point, region segmentation, drawing, zooming
  • Results:
    • Qwen-Base: 29.56%
    • Qwen-SFT: 45.67%
    • Text-based RL: 51.63%
    • V-ToolRL (OpenThinkIMG): 59.39%
    • Taco-8B: 30.50%
    • CogCom-13B: 15.07%
    • GPT-4.1 (zero-shot): 50.71%
    • Gemini-2.0: 68.20%

V-ToolRL outperforms the non-tool RL and SFT baselines by up to +29.83 percentage points and surpasses GPT-4.1 by +8.68 points (Su et al., 13 May 2025). Tool-call efficiency emerges during RL (average tool calls per sample drop from 0.63 to 0.10), and response length and complexity increase, supporting richer chain-of-thought.

Iterative Vision-Language Reasoning

On multimodal math reasoning benchmarks:

  • OpenVLThinker-7B:
    • MathVista: 70.2% (vs. 68.5% base, 63.8% GPT-4o)
    • MathVerse: 47.9% (vs. 46.8% base, 50.2% GPT-4o)
    • MathVision: 29.6% on testmini, 25.3% full set
  • Ablations show that SFT data filtering strongly affects performance (unfiltered: 48.4%; filtered to <500 words: 55.0%; de-reflected: 62.5%) (Deng et al., 21 Mar 2025).

Open-World Detection

On OWOD-COCO/MS-COCO splits (Ma et al., 2023):

  • Task 1 U-Recall: 12.1 → 39.0 (OWOD-COCO), 5.7 → 60.9 (MS-COCO)
  • Inference: Up to 115× faster with 5× fewer FLOPs than running the VL “brain” directly.

5. Analysis, Best Practices, and Open Challenges

Methodological Best Practices

  • Standardize all tool and model interfaces, and deploy tools as containerized services orchestrated by a controller for modularity (Su et al., 13 May 2025).
  • For SFT, use diverse, filtered, high-quality distillations of reasoning chains or tool-use trajectories, verified by alignment with ground truth and chain brevity.
  • Alternate SFT with RL phases for stability and greater policy adaptability, using simple, robust reward functions (binary correctness) and group-relative normalization (GRPO).
  • In detection, scale pseudo-label losses by confidence to minimize noisy supervision, and regularly update the class set via human interaction for true open-world evolution.

Persistent Challenges and Future Directions

  • Reward Shaping: Exploring richer reward schemes, e.g., partial correctness or stepwise verification, beyond end-task accuracy (Su et al., 13 May 2025, Deng et al., 21 Mar 2025).
  • Scaling Tool Suites: Expansion to calculators, simulators, knowledge bases, and cross-domain transfer (charts→maps, diagrams, etc.).
  • Generalization: Cross-modal and cross-task transfer, especially addressing open-ended VQA or real-world diagrammatic reasoning.
  • Interpretability: Generating transparent, human-readable explanations of tool-use and reasoning steps.
  • Safe Exploration: Limiting harmful or nonsensical tool invocations during RL exploration.
  • Incremental Discovery: Beyond a single “unknown” label toward fine-grained categorization, online prompt expansion, and clustering for open-world detection (Ma et al., 2023).
  • Code, Data, and Reproducibility: Open-source releases of code, model weights, trajectory generators, and tool registries (see github.com/yihedeng9/OpenVLThinker; github.com/zhaochen0110/OpenThinkIMG).

6. Significance and Impact

OpenVLThinker demonstrates that advanced multimodal reasoning, robust tool-use, and open-world perception are achievable via unified architectures that combine iterative SFT, RL, and external knowledge sources. By bridging chain-of-thought distillation, tool-augmented action, and pseudo-label-driven object discovery, the approach enables LVLMs to surpass both prior open-source and prominent closed models (e.g., GPT-4o) on challenging visual and reasoning benchmarks (Deng et al., 21 Mar 2025, Su et al., 13 May 2025, Ma et al., 2023). The iterative refinement loop, modular tool registry, and pseudo-labeling mechanisms define a generalizable recipe for rapidly evolving and evaluating multi-modal agents capable of compositional, adaptive, and transparent reasoning.
