VibeThinker-1.5B: Efficient Reasoning Model

Updated 12 November 2025
  • VibeThinker-1.5B is a 1.5-billion-parameter Transformer that challenges the size-based reasoning paradigm by leveraging the Spectrum-to-Signal Principle to decouple diversity from signal amplification.
  • It employs a two-stage diversity-exploring distillation and MaxEnt-Guided Policy Optimization to boost performance on specialized mathematical and coding benchmarks.
  • The model achieves high-level reasoning with training costs significantly lower than larger models, making advanced capabilities accessible on commodity hardware.

VibeThinker-1.5B is a 1.5-billion-parameter dense Transformer LLM deliberately engineered to challenge the dominant paradigm that model size is the critical determinant of advanced reasoning ability. Built around the Spectrum-to-Signal Principle (SSP), VibeThinker-1.5B explicitly decouples diversity maximization from signal amplification across its supervised fine-tuning and reinforcement learning stages. This methodology enables the model to match or exceed the performance of models up to 400 times larger on specialized mathematical and code reasoning tasks, at training costs nearly two orders of magnitude lower than those of contemporary large-scale models.

1. Spectrum-to-Signal Principle

The Spectrum-to-Signal Principle (SSP) is the conceptual core of VibeThinker-1.5B’s training pipeline. SSP reworks the traditional succession of supervised fine-tuning (SFT) and reinforcement learning (RL): solution diversity (the "spectrum") becomes the explicit target objective of SFT, while strict signal amplification is deferred to the subsequent RL stage.

  • Spectrum Phase (SFT): The objective shifts from Pass@1 to Pass@K. The model is encouraged to produce a wide variety of plausible reasoning paths across subdomains, furnishing RL with a rich set of candidate traces.
  • Signal Phase (RL): Apply MaxEnt-Guided Policy Optimization (MGPO) to selectively amplify the most promising solution signals, using maximum-entropy weighting to prioritize ambiguous or frontier problems.

Decoupling these two phases ensures that small models retain a breadth of candidate solutions, allowing targeted RL to reliably strengthen correct reasoning chains that would otherwise be lost to mode collapse in small-parameter regimes.
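
Since Pass@K is the governing metric of the spectrum phase, it helps to see how it is typically estimated. The sketch below is not taken from the paper; it uses the standard unbiased estimator computed from n sampled completions per problem, of which c are correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n sampled completions, c of which are correct.

    1 - C(n - c, k) / C(n, k): the probability that a random size-k subset of the
    n samples contains at least one correct completion.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples on a probe problem, 3 correct, evaluated at K = 8.
print(pass_at_k(n=16, c=3, k=8))  # 0.9
```

Per-subdomain Pass@K scores computed this way are what drive the checkpoint selection described in the next section.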

2. Two-Stage Diversity-Exploring Distillation

The spectrum phase is realized through a two-stage diversity-exploring distillation process:

  1. Domain-Aware Diversity Probing: The training domain (e.g., mathematics) is partitioned into $N$ subdomains $S_1, \ldots, S_N$. For each subdomain, a set $D_i$ of probe problems is assembled, and, during SFT, each checkpoint $M_t$ is evaluated on $D_i$ by computing $P_i(t) = \mathrm{Pass@K}(M_t, D_i)$. The checkpoint maximizing Pass@K is retained for each subdomain.
  2. Expert Model Fusion: The $N$ specialist models $M_i^*$ are merged by weighted averaging:

$$M_{\mathrm{Merge}}^{\mathrm{SFT}} = \sum_{i=1}^N w_i\, M_i^*, \quad \sum_i w_i = 1,$$

with uniform weights in practice. Despite being optimized for diversity, the fused model also exhibits competitive Pass@1, running counter to the expected spectrum–signal trade-off.

This approach yields an SFT checkpoint that demonstrably provides a broader exploration of the solution space, facilitating downstream RL.
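
A minimal Python sketch of the two stages is given below, under assumptions the paper does not spell out: checkpoints are represented as state_dicts mapping parameter names to PyTorch tensors, and eval_pass_at_k is a hypothetical helper that scores a checkpoint on a probe set. The authors' actual selection and merging code is not public.

```python
def select_specialists(checkpoints, probe_sets, eval_pass_at_k):
    """Stage 1: for each subdomain i, keep the SFT checkpoint with the best Pass@K on D_i.

    checkpoints    : list of checkpoint state_dicts saved during SFT
    probe_sets     : list of N probe problem sets D_1 ... D_N
    eval_pass_at_k : hypothetical helper (state_dict, probe_set) -> Pass@K score
    """
    return [max(checkpoints, key=lambda sd: eval_pass_at_k(sd, probes))
            for probes in probe_sets]


def fuse_specialists(specialists, weights=None):
    """Stage 2: merge the N specialist state_dicts by weighted parameter averaging
    (uniform weights in practice), yielding M_Merge^SFT."""
    n = len(specialists)
    weights = weights if weights is not None else [1.0 / n] * n
    return {name: sum(w * sd[name].float() for w, sd in zip(weights, specialists))
            for name in specialists[0]}
```

Uniform averaging of parameters is the simplest instantiation of the fusion equation above; non-uniform weights would slot into the same helper.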

3. MaxEnt-Guided Policy Optimization

Post-SFT, the RL signal phase employs a variant of Group Relative Policy Optimization (GRPO) enhanced with maximum-entropy weighting. Each query $q$ is assigned a weight reflecting how close its success rate $p_c(q)$ is to 50%, computed as follows:

  1. Empirical success rate,

$$p_c(q) = \frac{1}{G}\sum_{i=1}^G \mathbf{1}\{r_i = 1\},$$

where $r_i \in \{0, 1\}$ is the binary reward for rollout $i$.

  2. Binary KL divergence to the ideal-entropy point,

$$D_{\rm ME}(p_c(q) \,\|\, 0.5) = p_c(q)\ln\frac{p_c(q)}{0.5} + (1 - p_c(q))\ln\frac{1 - p_c(q)}{0.5},$$

  3. Entropy-deviation weight,

$$w_{\rm ME}(p_c(q)) = \exp\!\bigl(-\lambda\, D_{\rm ME}(p_c(q) \,\|\, 0.5)\bigr),$$

where $\lambda$ controls weight sharpness (default $\lambda \approx 1.0$).

  4. Weighted advantage for the PPO surrogate loss,

$$\mathcal{A}_{i,t}'(q) = w_{\rm ME}(p_c(q)) \cdot \mathcal{A}_{i,t}(q),$$

entering the clipped-PPO loss.

This focuses optimization on high-uncertainty queries ("frontiers"), inducing a curriculum effect and ensuring that RL gains are allocated to the most informative regions of the solution space. Ablations confirm significant performance drops if either the diversity probing or the MaxEnt weighting is removed.
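
The weighting scheme is compact enough to state directly in code. The following is a minimal sketch that follows the formulas above (binary rewards, $G$ rollouts per query, $\lambda$ as the sharpness parameter); the surrounding GRPO/PPO machinery is omitted.

```python
import math

def maxent_weight(rewards, lam=1.0, eps=1e-8):
    """MaxEnt-guided weight w_ME(p_c(q)) for one query, from its G binary rollout rewards."""
    g = len(rewards)
    p = sum(rewards) / g                     # empirical success rate p_c(q)
    p = min(max(p, eps), 1.0 - eps)          # guard the logarithms at p_c in {0, 1}
    # Binary KL divergence of p_c(q) from the maximum-entropy point 0.5.
    d_me = p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)
    return math.exp(-lam * d_me)             # in (0, 1], largest when p_c is near 0.5

def weight_advantages(rewards, advantages, lam=1.0):
    """Scale every per-token advantage of a query's rollouts by w_ME."""
    w = maxent_weight(rewards, lam)
    return [[w * a for a in rollout_adv] for rollout_adv in advantages]

# G = 8 rollouts, 3 correct: p_c = 0.375, so the weight stays close to 1 (~0.97);
# a fully solved query (p_c -> 1) is damped toward exp(-lam * ln 2) ~ 0.5.
print(maxent_weight([1, 1, 1, 0, 0, 0, 0, 0]))
```

The weighted advantages then enter the standard clipped-PPO surrogate unchanged.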

4. Model Architecture and Computational Efficiency

VibeThinker-1.5B is built on the Qwen2.5-Math-1.5B base model and is structured as a standard dense Transformer using multi-head self-attention. While exact depth and width are not publicly detailed, its 1.5B parameter count places it within current dense-Transformer conventions.

Training and inference regimes:

  • SFT (“Spectrum”): AdamW optimizer, learning rate $\sim 10^{-5}$, batch size $O(10^2\text{–}10^3)$ tokens/GPU, with regular checkpointing for diversity selection.
  • RL (“Signal”): PPO-style clipping with $\varepsilon = 0.2$, $G = 8$ rollouts per query, $\lambda \approx 1.0$. Training cost is approximately \$7,800 (3,900 NVIDIA H800 GPU-hours), orders of magnitude less than large-model post-training (\$294K–\$535K). Inference uses vLLM with top-$p$ = 0.95 and temperature 0.6 (1.0 for math).

Deployment feasibility extends to single-GPU and edge devices, reflecting exceptionally low computational and financial barriers compared to state-of-the-art baselines.
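
As an illustration of this low deployment barrier, the model can be served with vLLM on a single GPU using the sampling settings listed above. The Hugging Face model identifier below is an assumption for illustration and should be replaced with the released checkpoint name.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face model id for illustration; substitute the released checkpoint name.
MODEL_ID = "WeiboAI/VibeThinker-1.5B"

# Sampling settings from the text above: top-p 0.95, temperature 0.6 (use 1.0 for math).
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

llm = LLM(model=MODEL_ID)  # a 1.5B dense model fits comfortably on a single commodity GPU
outputs = llm.generate(
    ["Write a Python function that returns the n-th Fibonacci number."],
    params,
)
print(outputs[0].outputs[0].text)
```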

5. Empirical Benchmark Performance

VibeThinker-1.5B exhibits unprecedented reasoning scores for its size on mathematical and code benchmarks. Representative Pass@1 outcomes:

| Model | Params | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|---|
| Qwen2.5-Math (Base) | 1.5B | 6.7 | 4.3 | 0.6 |
| FastCURL-v3 | 1.5B | 49.6 | 34.4 | 21.5 |
| ProRL | 1.5B | 48.1 | 33.3 | 20.5 |
| VibeThinker-1.5B | 1.5B | 80.3 | 74.4 | 50.4 |
| DeepSeek-R1-0120 | 671B | 79.8 | 70.0 | 41.7 |

On LiveCodeBench V6:

  • Base model: 0.0%
  • Magistral-Medium: 50.3%
  • VibeThinker-1.5B: 51.1%

GPQA-Diamond (professional knowledge): +30.3 percentage points improvement (from 16.4% to 46.7%).

Removing either the diversity-probing fusion or the MaxEnt weighting diminishes AIME25 performance by approximately 5–10%, substantiating the necessity of both components.

6. Analysis, Limitations, and Impact

Reasoning competence: High Pass@K in SFT expands the potential for RL-driven improvement. MGPO’s targeting of “frontier” queries creates a curriculum effect, making effective use of limited model capacity and yielding chain-of-thought reasoning on par with models exceeding 100B parameters.

Scalability and democratization: The entire post-training regime costs under $8K, nearly two orders of magnitude lower than large models. Inference on 1.5B-parameter architectures is accessible to commodity hardware, supporting decentralization and democratization of advanced LLM capabilities.

Limitations: The approach remains constrained on broad general-knowledge benchmarks (e.g., GPQA), where parameter count exerts a more noticeable effect. Base-model pretraining is math-biased, so expansion to code-centric or multimodal pretraining is an identified future direction; the extensibility of SSP to cross-lingual or multimodal tasks likewise remains open for exploration.

A plausible implication is a fundamental rethinking of the training objectives and resource allocation for small-scale models in research and applied environments, advocating for post-training procedures that are explicitly tailored for reasoning rather than raw scale.

7. Conclusion

VibeThinker-1.5B empirically invalidates the size-equals-reasoning axiom in mathematical and coding domains by decoupling and optimizing diversity (Pass@K) and signal (Pass@1) within distinct training phases, operationalized through explicit mathematical objectives and efficient resource usage. The model achieves high-level reasoning ability at a fraction of the computational cost, serving as a new reference point for both the methodology and economics of high-performance, small-parameter language modeling (Xu et al., 9 Nov 2025).
