ThinkLite-VL: Efficient Visual Reasoning
- ThinkLite-VL models are advanced visual reasoning systems that combine MCTS-guided sample selection with reinforcement fine-tuning to learn effectively from limited data.
- The approach dispenses with knowledge distillation, instead curating difficulty-aware training samples, and achieves state-of-the-art results with far fewer training examples.
- By selecting challenging yet solvable problems, ThinkLite-VL improves generalization and outperforms comparable VLMs across multiple reasoning benchmarks.
ThinkLite-VL denotes a family of visual reasoning models optimized for data-efficient self-improvement within the vision-language modeling (VLM) paradigm. Utilizing a Monte Carlo Tree Search (MCTS)-guided sample selection framework, ThinkLite-VL achieves state-of-the-art (SoTA) results while relying exclusively on reinforcement fine-tuning (RFT) with an order of magnitude fewer training examples compared to previous methods. Notably, this approach dispenses with knowledge distillation, leveraging internal difficulty-aware sample curation to maximize learning efficacy from limited data (Wang et al., 10 Apr 2025).
1. Background and Motivation
Vision-language models (VLMs) are typically built on language backbones pretrained on large-scale text corpora, resulting in a misalignment between their learned representations and the requirements of downstream multimodal reasoning tasks. Prevailing adaptation workflows therefore involve extensive supervised fine-tuning (SFT), frequently via knowledge distillation from closed-source models, which carries significant compute and data costs while impeding genuine self-improvement via reinforcement learning paradigms.
In contrast, advances in text-based LLMs have demonstrated that reinforcement fine-tuning (RFT)—with carefully defined reward functions—can drive substantial gains in systematic reasoning. However, naïve RFT applied to VLMs underperforms unless the fine-tuning dataset is both sufficiently large and well-curated, particularly concerning sample difficulty. In scenarios with limited data budgets, indiscriminate mixing of easy and hard problems leads to superficially high training rewards but yields minimal improvement in actual reasoning capability. Appropriately challenging samples—neither trivial nor unsolvable—are empirically shown to drive the highest learning signal. This becomes critical when the data budget is limited to a few tens of thousands of examples (Wang et al., 10 Apr 2025).
2. MCTS-Based Difficulty Quantification
ThinkLite-VL introduces a principled method to quantify instance difficulty by leveraging Monte Carlo Tree Search (MCTS) over the model’s own reasoning process. The core construct is as follows:
- State ($s_t$): The chain-of-thought (CoT) prefix up to reasoning step $t$.
- Action ($a_t$): The next reasoning token, sampled from the VLM policy $\pi_\theta$.
- MCTS Root ($s_0$): The empty prefix (start of reasoning).
Each MCTS iteration comprises:
- Selection: Descend from the root $s_0$ via PUCT, at each step choosing the child that maximizes $Q(s,a) + c_{\text{puct}}\, P(a \mid s)\, \frac{\sqrt{N(s)}}{1 + N(s,a)}$.
- Expansion: At the selected leaf, sample $k$ candidate continuations from $\pi_\theta$ (at sampling temperature $\tau$) to generate new child nodes.
- Simulation: From the expanded node, sample a complete CoT with $\pi_\theta$ until a boxed answer or the maximum length is reached; use Qwen2.5-7B-Instruct as a critic to assess correctness.
- Backpropagation: If the answer is correct, record the current iteration count $I$; otherwise update the tree statistics, increment the iteration counter, and continue, up to a cap of 50 iterations.
The difficulty score for sample $i$ is the number of MCTS iterations $I_i$ required to first reach a correct answer; samples left unsolved at the 50-iteration cap are treated as maximally difficult. Higher values indicate greater difficulty. Hyperparameters comprise the expansion width $k$ and sampling temperature $\tau$, chosen for broad tree exploration, and the 50-iteration cap, which balances fidelity against computational effort (Wang et al., 10 Apr 2025).
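A minimal sketch of this difficulty assay is given below, under stated assumptions: `sample_continuations` (policy proposals with priors), `rollout_to_answer` (CoT completion to a boxed answer), and `critic_is_correct` (the critic's judgment) are placeholder callables standing in for the VLM and Qwen2.5-7B-Instruct, and the defaults `c_puct = 1.0` and `k = 4` are illustrative values, not figures reported in the paper.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """A search-tree node: a CoT prefix (state s_t) plus PUCT statistics."""
    prefix: str                       # chain-of-thought text generated so far
    prior: float = 1.0                # P(a | s) assigned by the policy
    visits: int = 0                   # N(s, a)
    value_sum: float = 0.0            # running sum used to form Q(s, a)
    children: list["Node"] = field(default_factory=list)

    @property
    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(parent: Node, c_puct: float = 1.0) -> Node:
    """Child maximizing Q(s,a) + c_puct * P(a|s) * sqrt(N(s)) / (1 + N(s,a))."""
    n_parent = sum(child.visits for child in parent.children)
    return max(
        parent.children,
        key=lambda c: c.q + c_puct * c.prior * math.sqrt(n_parent) / (1 + c.visits),
    )


def mcts_difficulty(question, answer, sample_continuations, rollout_to_answer,
                    critic_is_correct, k: int = 4, max_iters: int = 50):
    """Return the iteration at which MCTS first reaches a correct answer,
    or None if the sample remains unsolved within the iteration cap."""
    root = Node(prefix="")
    for iteration in range(1, max_iters + 1):
        # Selection: descend from the root via PUCT until reaching a leaf.
        node, path = root, [root]
        while node.children:
            node = puct_select(node)
            path.append(node)
        # Expansion: sample k candidate continuations from the policy.
        for text, prior in sample_continuations(question, node.prefix, k):
            node.children.append(Node(prefix=node.prefix + text, prior=prior))
        leaf = node.children[0] if node.children else node
        if leaf is not node:
            path.append(leaf)
        # Simulation: complete the CoT to a boxed answer and have the critic judge it.
        predicted = rollout_to_answer(question, leaf.prefix)
        value = 1.0 if critic_is_correct(predicted, answer) else 0.0
        # Backpropagation: update visit counts and values along the selected path.
        for visited in path:
            visited.visits += 1
            visited.value_sum += value
        if value == 1.0:
            return iteration          # difficulty score I_i for this sample
    return None                       # unsolved: treated as maximally difficult
```

Samples for which this procedure returns None, or a high iteration count, are exactly the ones retained for reinforcement fine-tuning in the next section.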
3. Data Collection and Difficulty-Aware Selection
The initial training pool comprised 70,000 open-source examples from eight datasets, spanning three reasoning domains:
- Math reasoning: Geometry3K, GeoQA, Geos (~8K samples)
- Natural image question answering: FigureQA, ScienceQA, OK-VQA (~29K samples)
- Chart comprehension: IconQA, TabMWP (~33K samples)
All multiple-choice tasks were reformulated into open-ended CoT prompts to eliminate answer shortcutting. The full MCTS difficulty assay was executed over the 70K samples, recording each sample's iteration count $I_i$. The selection retained:
- All samples with $I_i \geq 5$ (moderate/hard) and those unsolved within the 50-iteration cap (very hard).
- This yielded 11,000 “appropriately challenging yet solvable” examples for 7B model RFT; the analogous process produced 7,500 samples for the 72B model configuration.
Stratification preserves the original domain proportions, ensuring coverage across math, natural image, and chart-based reasoning (Wang et al., 10 Apr 2025).
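As a concrete illustration of the selection rule and the domain-preserving stratification, the sketch below assumes each sample is a dictionary with a `domain` tag and an `iters` field holding its MCTS iteration count (None if unsolved); the field names and the optional quota logic are assumptions for illustration, not the released pipeline.

```python
from collections import defaultdict

def select_training_pool(samples, iter_threshold=5, target_size=None):
    """Keep challenging-yet-solvable samples (iters >= threshold) plus unsolved
    ones, optionally trimming per domain to preserve the original domain mix."""
    def is_hard(sample):
        # Unsolved within the 50-iteration cap, or needing >= threshold iterations.
        return sample["iters"] is None or sample["iters"] >= iter_threshold

    by_domain = defaultdict(list)
    for sample in samples:
        by_domain[sample["domain"]].append(sample)

    selected = []
    for domain, pool in by_domain.items():
        hard = [s for s in pool if is_hard(s)]
        if target_size is not None:
            # Quota proportional to this domain's share of the full pool.
            quota = round(len(pool) / len(samples) * target_size)
            # Hardest first: unsolved samples, then descending iteration count.
            hard.sort(key=lambda s: (s["iters"] is not None, -(s["iters"] or 0)))
            hard = hard[:quota]
        selected.extend(hard)
    return selected

# Toy example (the real pool mixes math, natural-image, and chart samples).
pool = [
    {"id": 0, "domain": "math",  "iters": 3},     # too easy, dropped
    {"id": 1, "domain": "math",  "iters": 12},    # kept
    {"id": 2, "domain": "chart", "iters": None},  # unsolved, kept
]
print([s["id"] for s in select_training_pool(pool)])  # -> [1, 2]
```

With `iter_threshold=5`, this reproduces the "$I_i \geq 5$ + Unsolved" criterion above; raising or lowering the threshold corresponds to the sweep discussed in Section 6.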
4. Reinforcement Fine-Tuning Configuration
The RFT process targets the Qwen2.5-VL-7B-Instruct base model, with Qwen2.5-VL-72B-Instruct serving as an evaluation reference. Key characteristics include:
- Reward Function: Binary; the reward is $1$ if the model's final boxed answer exactly matches the ground truth, and $0$ otherwise. This is a self-rewarding process, requiring no external reward model.
- Optimization: Group Relative Policy Optimization (GRPO), with group size $G$. The optimization objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\big)-\beta\,D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\Big)\right]$$

where $r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the token-level policy ratio, $\hat{A}_{i,t}$ is the group-relative advantage estimate, $\epsilon$ is a PPO-style clipping parameter ($0.2$), and $\beta$ is the KL penalty coefficient.
- Policy settings: A fixed rollout number per prompt, policy temperature $0.7$, the Easy-R1 codebase, and inclusion of an internal monologue delimited by `<think> … </think>` tags, with the final answer boxed (Wang et al., 10 Apr 2025).
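To make the self-rewarding signal and the group-relative part of GRPO concrete, here is a simplified sketch; the boxed-answer parsing and the group-normalization form are assumptions based on the description above, and the clipping/KL terms of the objective, along with all Easy-R1 training plumbing, are omitted.

```python
import re
from statistics import mean, pstdev

def binary_reward(completion: str, ground_truth: str) -> float:
    """Self-rewarding signal: 1.0 iff the final boxed answer matches exactly."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    predicted = matches[-1].strip() if matches else None
    return 1.0 if predicted == ground_truth.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward by the mean and
    standard deviation of its group (the rollouts sampled for the same prompt)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of sampled rollouts, ground truth "42".
group = [
    "<think> ... internal monologue ... </think> \\boxed{42}",
    "<think> ... internal monologue ... </think> \\boxed{41}",
    "<think> ... internal monologue ... </think> \\boxed{42}",
    "<think> ... internal monologue ... </think> \\boxed{7}",
]
rewards = [binary_reward(c, "42") for c in group]       # [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)         # positive for correct rollouts
```

Because the reward is a simple exact-match check, no learned reward model is involved; the only external model in the pipeline is the critic used earlier for difficulty scoring.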
5. Performance and Comparative Analysis
ThinkLite-VL demonstrates significant improvements over baseline and competitive benchmarks. Table 1 summarizes key outcomes:
| Model | Data Size (RFT or SFT) | MathVista (%) | Avg. (8 Benchmarks) (%) |
|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | - | 67.8 | 59.69 |
| LLaVA-CoT-11B | 100K (SFT) | - | 52.85 |
| Mulberry-7B | 260K (SFT) | ~63.1 | - |
| Vision-R1-7B | 210K (RFT) | 73.5 | - |
| OpenVLThinker-7B | 59K (RFT) | 70.2 | - |
| MM-EUREKA-Qwen-7B | 54K (RFT, no KD) | 73.0 | - |
| Random11k-RFT | 11K (RFT) | - | 60.89 |
| ThinkLite-VL-7B | 11K (RFT, no KD) | 75.1 | 63.89 |
Key findings:
- ThinkLite-VL-7B, trained with only 11K MCTS-selected examples, yields a +7.3 point gain on MathVista (67.8% → 75.1%) over its base and surpasses all other 7B-level models as well as larger models including GPT-4o, O1, and Qwen2.5-VL-72B.
- ThinkLite-VL-72B achieves 79.7% on MathVista.
- Average score over eight benchmarks for ThinkLite-VL-7B: 63.89% (random 11K: 60.89%; full 70K: 63.13%), demonstrating superior data efficiency.
Alternative selection strategies confirm that mixing moderately difficult samples ($I_i \geq 5$) with the hardest ones (unsolved within 50 iterations) yields the best generalization. Overemphasizing ultra-hard or overly easy examples degrades performance (Wang et al., 10 Apr 2025).
6. Analysis and Ablation Studies
Multiple ablations dissect the impact of sample selection:
- Subset comparisons: "Unsolved only" (5.6K), "$I_i \geq 5$ only" (5.4K), "Random11k," "Self-Consistency filtered" (23K), and "Full 70K" all underperform "$I_i \geq 5$ + Unsolved" (11K), which attains the highest benchmark average.
- Difficulty threshold sweeps: Performance peaks at an $I_i$ threshold of 5 ($I_i \geq 5$ + Unsolved: 11K, 63.89%). Both a lower threshold (18K samples, 63.29%) and a higher threshold (8K samples, 62.65%) yield diminished results, indicating an optimal curriculum that avoids both extreme difficulty and triviality.
A critical observation is that training set reward curves do not correlate with downstream accuracy; easy training samples inflate mean rewards without improving generalization. Difficult, but solvable, samples produce slower training reward gains but achieve superior test performance in visual reasoning (Wang et al., 10 Apr 2025).
7. Limitations and Prospects
While the MCTS-guided data curation strategy yields superior data efficiency, it incurs nontrivial computational costs, motivating investigation of cheaper surrogates or early-exit heuristics. Prospective extensions include scaling the approach to larger parameter models (e.g., 72B) and to alternative tasks such as cross-modal generation. The exploration of curriculum learning schedules or learned reward models represents another potential direction for increasing RFT data efficiency.
In summary, ThinkLite-VL demonstrates that model-driven, MCTS-guided difficulty filtering from a broad dataset effectively enables SoTA multimodal reasoning performance using a fraction of the typical fine-tuning data, validating the central premise that optimizing for sample difficulty is key to efficient visual-language self-improvement (Wang et al., 10 Apr 2025).