ViThinker: Active Vision-Language Reasoning

Updated 10 March 2026

ViThinker is a framework that employs active, agentic vision-language reasoning by dynamically querying perceptual features to overcome static perception.
It integrates observation tokens aligned with specialist models like segmentation and depth estimation, enabling on-demand perceptual simulation.
A two-stage curriculum and optimized query strategy yield superior accuracy and efficiency, reducing computational overhead compared to static models.

ViThinker refers to a family of contemporary models and frameworks for active, agentic vision-language reasoning that emphasize dynamic visual querying, hybrid tool use, and biologically inspired inference policies. The paradigm spans both algorithmic frameworks (e.g., VisuoThink/ViThinker (Wang et al., 12 Apr 2025) and ViThinker (You et al., 2 Feb 2026)) and model implementations designed to overcome the “static perception” and early visual-to-text bottlenecks characteristic of previous vision-language models (VLMs). This entry focuses primarily on the 2026 framework "ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying" (You et al., 2 Feb 2026), placing it in the context of related multimodal agentic reasoning strategies.

1. Motivation and Conceptual Background

Canonical vision-language models implementing Chain-of-Thought (CoT) reasoning operate by encoding images as static feature embeddings. When subjected to textual CoT prompting, these models irreversibly collapse continuous visual information (e.g., geometric cues, spatial topology) into fixed textual or vector representations, fundamentally limiting their capacity to interactively “think with” perceptual content. Existing extensions, such as masking and enumeration of all patch-wise features or static attention-based selection, remain passive: they indiscriminately process entire precomputed feature sets at every reasoning step, regardless of their relevance.

Drawing inspiration from theories of human active perception, ViThinker (You et al., 2 Feb 2026) reframes vision-language reasoning as an iterative, on-demand querying process, where the model autonomously decides both what and when to “look” (i.e., acquire task-relevant perceptual features), actively synthesizing just-in-time expert-aligned visual observations within its internal generative loop.

2. Architectural Principles and Dynamic Perceptual Querying

The core innovation of ViThinker is the incorporation of a parametric querying mechanism enabling dynamic “mental simulation” of perceptual information at arbitrary inference steps. At each autoregressive decoding iteration, the Transformer can output a decision token (e.g., <query_depth>, <query_seg>, <query_edge>, <query_patch>), which triggers the generation of a fixed number of specialist observation tokens. Each specialist token is trained to align (through projection heads) with the latent spaces of a suite of frozen expert models: segmentation (SAM), depth estimation (DepthAnything), edge detection (PIDINet), and visual patch features (DINOv2).

These observation tokens, once synthesized, may be leveraged in subsequent “Think” blocks (text reasoning), closing a tight loop between textual inference and perceptual hypothesis generation. At test time, none of the expert modules is invoked: the model “hallucinates” the aligned features from its own parameters, having learned to reproduce the expert representations during training.
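As a rough illustration of the querying vocabulary described above, the following Python sketch maps each decision token to its expert and expands a decision token into a fixed number of observation-token slots. The decision tokens and expert names come from the text; the registry name `EXPERT_FOR_QUERY`, the function `expand_query`, and the `<obs_...>` naming scheme are assumptions made for illustration and are not taken from the paper.

```python
# Decision tokens and expert names follow the text above.
# The observation-token naming scheme (<obs_MODALITY_i>) is hypothetical.
EXPERT_FOR_QUERY = {
    "<query_seg>": "SAM",
    "<query_depth>": "DepthAnything",
    "<query_edge>": "PIDINet",
    "<query_patch>": "DINOv2",
}


def expand_query(decision_token: str, n_obs: int) -> list[str]:
    """Return the n_obs observation-token slots triggered by a decision token."""
    if decision_token not in EXPERT_FOR_QUERY:
        raise KeyError(f"unknown decision token: {decision_token}")
    # "<query_depth>" -> "depth"
    modality = decision_token.strip("<>").removeprefix("query_")
    return [f"<obs_{modality}_{i}>" for i in range(n_obs)]


slots = expand_query("<query_depth>", 3)
```

In the actual model the slots would be filled with hidden states aligned to the expert's latent space rather than with strings; the sketch only shows the bookkeeping.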

3. Training Protocol and Two-Stage Curriculum

ViThinker employs a two-phase curriculum for the acquisition of active querying and perceptual internalization:

Stage 1: Expert Distillation.

The model observes each decision token paired with the true expert feature output for the target image (e.g., $\Phi_{\text{seg}}(I)$, $\Phi_{\text{depth}}(I)$, etc.), and is trained to reconstruct these latent features from dedicated observation tokens via projection and a set of loss functions (e.g., Dice+Focal, $\ell_1$, MSE, as appropriate). The optimization objective is the sum of alignment losses across all expert modalities:

$\mathcal{L}_{\mathrm{distill}} = \sum_{m} \mathcal{L}_{\mathrm{align}}^{(m)}$
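The sum over modalities can be made concrete with a small Python sketch. It assumes toy feature vectors and an arbitrary assignment of loss functions to modalities; the text lists Dice+Focal, $\ell_1$, and MSE as candidate losses but does not say which modality uses which, so the pairing below is purely illustrative, as are all function and variable names.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equal-length feature vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mse_loss(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def distill_loss(predicted, expert, align_fns):
    """L_distill = sum over modalities m of L_align^(m)."""
    return sum(align_fns[m](predicted[m], expert[m]) for m in expert)


# Toy projected observation features vs. frozen expert features.
predicted = {"depth": [0.5, 1.0], "patch": [0.0, 2.0]}
expert = {"depth": [1.0, 1.0], "patch": [1.0, 0.0]}
# Assumed pairing of loss to modality (not specified in the text).
align_fns = {"depth": l1_loss, "patch": mse_loss}

total = distill_loss(predicted, expert, align_fns)  # 0.25 + 2.5
```

Here the depth term contributes 0.25 and the patch term 2.5, so the toy total is 2.75.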

Stage 2: Task-Driven Query Policy Learning (“Strategic Perception”).

Expert outputs are removed from the prompt. For each problem, a set of valid reasoning chains $\mathcal{S}_{\mathrm{valid}}$ (from minimal to exhaustive querying) is compiled. The model is incentivized to produce sparse, yet sufficient, queries via a penalty term:

$\mathcal{L}_{p} = \sum_{t\in \mathcal{T}_{q}} N$

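where $\mathcal{T}_{q}$ indexes query steps and $N$ appears to denote the fixed number of observation tokens generated per query, so that the penalty grows linearly with the number of queries, $\mathcal{L}_{p} = N\,|\mathcal{T}_{q}|$. The overall per-sample loss is a best-path selection among valid chains: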

$\mathcal{L}_{\mathrm{sample}} = \min_{s\in \mathcal{S}_{\mathrm{valid}}} \left[ \mathcal{L}_{\mathrm{CE}}(s) + \gamma \mathcal{L}_{\mathrm{vis}}(s) + \eta \mathcal{L}_p(s) \right]$

The $\eta$ coefficient is tuned to optimize the trade-off between query economy and necessary perceptual grounding.
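The effect of $\eta$ on best-path selection can be illustrated with a small worked example in Python. The loss values are invented toy numbers, the class and function names are hypothetical, and the penalty is computed as the number of query steps times the tokens per query, which assumes that $N$ in $\mathcal{L}_p$ is the per-query observation token count.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    name: str
    ce: float          # L_CE(s): cross-entropy term of the chain
    vis: float         # L_vis(s): visual term (not further defined in the text)
    num_queries: int   # |T_q|: number of query steps in the chain


def chain_loss(chain, gamma, eta, n_obs):
    penalty = chain.num_queries * n_obs  # L_p = sum over query steps of N
    return chain.ce + gamma * chain.vis + eta * penalty


def best_chain(chains, gamma, eta, n_obs):
    """Select the valid chain with the lowest combined loss (the min over S_valid)."""
    return min(chains, key=lambda c: chain_loss(c, gamma, eta, n_obs))


# Toy values: the exhaustive chain fits slightly better but queries four times.
valid = [
    Chain("minimal", ce=1.2, vis=0.1, num_queries=1),
    Chain("exhaustive", ce=0.8, vis=0.4, num_queries=4),
]
with_penalty = best_chain(valid, gamma=1.0, eta=0.05, n_obs=4)
without_penalty = best_chain(valid, gamma=1.0, eta=0.0, n_obs=4)
```

With these numbers the exhaustive chain wins when $\eta = 0$ (1.2 versus 1.3), while the sparsity penalty at $\eta = 0.05$ flips the selection to the minimal chain (1.5 versus 2.0), which is the trade-off the coefficient is meant to control.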

4. Inference Mechanism: Generative Mental Simulation

The inference process in ViThinker alternates between “Think” (textual autoregressive steps) and “Query” (dynamic perception) phases within a single generative loop.

By generating specialist hidden states “on demand” and integrating them seamlessly into the ongoing reasoning trajectory, ViThinker emulates a form of mental imagery and iterative observation reminiscent of human slow thinking.
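The Think/Query alternation described in this section can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' listing: the model is replaced by a hypothetical `policy` callable that returns the next token, observation tokens are represented as placeholder strings rather than hidden states, and the names `generate`, `scripted_policy`, and `<eos>` are invented for the example.

```python
def generate(policy, max_steps, n_obs):
    """Alternate Think and Query phases until <eos> or max_steps decoding steps."""
    context = []
    for _ in range(max_steps):
        token = policy(context)          # Think: ordinary autoregressive step
        context.append(token)
        if token == "<eos>":
            break
        if token.startswith("<query_"):  # Query: a decision token was emitted
            modality = token[len("<query_"):-1]
            # The model synthesizes n_obs observation tokens itself;
            # no external expert is called at inference time.
            context.extend(f"<obs_{modality}_{i}>" for i in range(n_obs))
    return context


def scripted_policy(script):
    """Stand-in for the model: replays a fixed token script, then emits <eos>."""
    it = iter(script)
    return lambda context: next(it, "<eos>")


script = ["The", "cup", "<query_depth>", "is", "closer", "<eos>"]
trace = generate(scripted_policy(script), max_steps=10, n_obs=2)
```

The resulting trace interleaves text tokens with two depth observation slots inserted immediately after the `<query_depth>` decision token, so later reasoning steps can condition on them.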

5. Empirical Evaluation and Comparative Performance

ViThinker was constructed on top of Qwen2.5-VL-7B via LoRA adaptation (rank 16 on self/cross-attention layers and full-rank for embeddings/projections), leveraging frozen vision experts for perceptual alignment during training.

Benchmarks: CV-Bench, BLINK, RealWorldQA, MMVP, MMStar-P, HR$_{4K}$, HR$_{8K}$.

Main results (average accuracies):

Model	Avg. Acc. (%)
Qwen2.5-VL-7B (no CoT)	65.1
+ Textual CoT	61.9
Visual CoT	65.5
ICoT	67.2
Aurora	67.3
CoVT	68.3
MINT-CoT	68.9
ViThinker	70.9

ViThinker outperforms MINT-CoT by 2.0 points and achieves its strongest gains on fine-grained and high-resolution tasks (+2.3% on an HR benchmark).
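The margins implied by the table can be checked with a few lines of Python. The accuracies are copied from the table above; the variable names are arbitrary and the script is only an arithmetic aid, not part of the original evaluation.

```python
# Average accuracies (%) as reported in the results table.
avg_acc = {
    "Qwen2.5-VL-7B (no CoT)": 65.1,
    "+ Textual CoT": 61.9,
    "Visual CoT": 65.5,
    "ICoT": 67.2,
    "Aurora": 67.3,
    "CoVT": 68.3,
    "MINT-CoT": 68.9,
    "ViThinker": 70.9,
}

# Margin of ViThinker over every other row, in percentage points.
margins = {
    model: round(avg_acc["ViThinker"] - acc, 1)
    for model, acc in avg_acc.items()
    if model != "ViThinker"
}

# Change from adding textual CoT to the base model.
textual_cot_delta = round(avg_acc["+ Textual CoT"] - avg_acc["Qwen2.5-VL-7B (no CoT)"], 1)
```

The computed margins are 2.0 points over MINT-CoT and 5.8 points over the base model without CoT, while textual CoT alone lowers the base model's average by 3.2 points.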

Key ablation findings:

The dual-stage curriculum is essential: Stage 1 + 2 achieves 70.9%, versus 66.9% for Stage 1 alone and 64.7% for Stage 2 alone.
Observation token count: an intermediate number of observation tokens per query ($N$) provides a good balance; a larger $N$ marginally improves alignment but is less compute-efficient.
Sparsity penalty: a properly calibrated coefficient $\eta$ yields maximal accuracy; under- or over-regularization degrades query selectivity.

Qualitative chain analyses confirm that ViThinker selects visual experts dynamically matching the requirements of each reasoning step (e.g., querying both depth and segmentation for spatial inference while using only segmentation for counting).

6. Limitations, Trade-offs, and Extensions

ViThinker’s major strengths include:

Metacognitive adaptivity: It internalizes not only “what to see” but “when to see,” leading to minimal necessary querying per problem.
Tool-free inference: Inference requires no external expert calls—only the model’s own generative simulation.
Compute efficiency: Relative to static enumeration, ViThinker achieves comparable or better accuracy with roughly 46% fewer query tokens.

Principal limitations and open research questions:

Curriculum and supervision complexity: Training requires careful construction of multi-path chain sets and correct assignment of “valid” reasoning chains.
Hyperparameter sensitivity: Performance depends on careful tuning of the sparsity coefficient ($\eta$).
Scalability/extendability: Incorporation of additional expert modalities or extension to temporal/video domains will necessitate more sophisticated policy optimization, potentially shifting from supervised to RL-based expert selection.

Future work includes adaptation to video or 3D scene understanding, RL-based policy improvement for more flexible expert selection, and online continual learning of novel expert modalities (You et al., 2 Feb 2026).

7. Context: Relationship to Other Agentic and Active Vision-Language Reasoners

The ViThinker paradigm aligns conceptually with several contemporary advances:

VisuoThink (Wang et al., 12 Apr 2025): Realizes agentic multimodal reasoning as an interleaved sequence of “thought”, “action” (tool invocation), and “observation”, using inference-time tree search for trajectory selection—a related but tool-dependent method distinct from ViThinker’s parametric querying.
V-Thinker (Qiao et al., 6 Nov 2025): Focuses on image-interactive reasoning, combining progressive data evolution and RL-curriculum learning, emphasizing direct manipulation and editing of visual state.
Video-Thinker (Wang et al., 27 Oct 2025), VideoThinker (Li et al., 22 Jan 2026): Extend the “thinking-with-perception” paradigm to videos, but generally require tool calls or explicit chain-of-thought formatting rather than parametric mental simulation.
ViTCN (Song et al., 2024): Demonstrates the value of transformer-based contrastive reasoning on abstract patterns, conceptually related in spirit but lacking active querying or dynamic perception.

A plausible implication is the emergence of a unifying paradigm in which vision-language reasoning architectures increasingly blur the line between passive consumption of static features and task-driven, context-aware, agentic perception, as exemplified by the ViThinker class of models.

References:

ViThinker (You et al., 2 Feb 2026)
VisuoThink (Wang et al., 12 Apr 2025)
V-Thinker (Qiao et al., 6 Nov 2025)
Video-Thinker (Wang et al., 27 Oct 2025)
VideoThinker (Li et al., 22 Jan 2026)
ViTCN (Song et al., 2024)