VibeThinker-1.5B: Efficient Reasoning Model
- VibeThinker-1.5B is a 1.5-billion-parameter Transformer that challenges the size-based reasoning paradigm by leveraging the Spectrum-to-Signal Principle to decouple diversity from signal amplification.
- It employs a two-stage diversity-exploring distillation and MaxEnt-Guided Policy Optimization to boost performance on specialized mathematical and coding benchmarks.
- The model achieves high-level reasoning with training costs significantly lower than larger models, making advanced capabilities accessible on commodity hardware.
VibeThinker-1.5B is a 1.5-billion-parameter dense Transformer LLM deliberately engineered to challenge the dominant paradigm that model size is the critical determinant of advanced reasoning ability. Developed by leveraging the Spectrum-to-Signal Principle (SSP), VibeThinker-1.5B utilizes an explicit decoupling of diversity maximization and signal amplification across its fine-tuning and reinforcement learning regimes. This architecture and methodology enable the model to achieve or exceed the performance of models up to 400 times larger on specialized mathematical and code reasoning tasks, with training costs nearly two orders of magnitude lower than those incurred by contemporary large-scale models.
1. Spectrum-to-Signal Principle
The Spectrum-to-Signal Principle (SSP) is the conceptual core of VibeThinker-1.5B’s training pipeline. SSP reconceptualizes the traditional supervised fine-tuning (SFT) and reinforcement learning (RL) succession by explicitly prioritizing solution diversity (the "spectrum") as the target objective of SFT, followed only subsequently by strict signal amplification during RL.
- Spectrum Phase (SFT): The SFT objective shifts from Pass@1 to Pass@K (see the estimator sketch below). The model is encouraged to produce a wide variety of plausible reasoning paths across subdomains, furnishing RL with a rich set of candidate traces.
- Signal Phase (RL): Apply MaxEnt-Guided Policy Optimization (MGPO) to selectively amplify the most promising solution signals, using maximum-entropy weighting to prioritize ambiguous or frontier problems.
Decoupling these two phases ensures that small models retain a breadth of candidate solutions, allowing targeted RL to reliably strengthen correct reasoning chains that would otherwise be lost to mode collapse in small-parameter regimes.
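For reference, Pass@K is typically estimated with the standard unbiased combinatorial estimator over $n$ sampled generations of which $c$ are correct. The short sketch below is illustrative only and is not taken from the VibeThinker codebase.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of k samples
    drawn without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 sampled solutions per problem, 3 of them correct, evaluated at K = 8.
print(pass_at_k(n=16, c=3, k=8))  # 0.9
```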
2. Two-Stage Diversity-Exploring Distillation
The spectrum phase is realized through a two-stage diversity-exploring distillation process:
- Domain-Aware Diversity Probing: The training domain (e.g., mathematics) is partitioned into $N$ subdomains $D_1, \dots, D_N$. For each subdomain $D_i$, a set of probe problems $P_i$ is assembled, and, during SFT, each checkpoint $M_t$ is evaluated on $P_i$ by computing $\mathrm{Pass@}K(M_t, P_i)$. The checkpoint $M_i^{*} = \arg\max_t \mathrm{Pass@}K(M_t, P_i)$ is retained for each subdomain.
- Expert Model Fusion: The $N$ specialist models are merged by weighted parameter averaging,
$$\theta_{\text{fused}} = \sum_{i=1}^{N} w_i\, \theta_i^{*}, \qquad \sum_{i=1}^{N} w_i = 1,$$
with uniform weights $w_i = 1/N$ in practice. Despite being optimized for diversity, the fused model also exhibits competitive Pass@1, contradicting the expected spectrum–signal trade-off.
This approach yields an SFT checkpoint that demonstrably provides a broader exploration of the solution space, facilitating downstream RL.
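To make the two stages concrete, the sketch below keeps, for each subdomain, the checkpoint with the highest Pass@K on that subdomain's probe set and then fuses the selected experts by uniform parameter averaging. All checkpoint names, scores, and tensors are fabricated for illustration; this is a minimal sketch, not the authors' released tooling.

```python
import torch

def select_experts(checkpoint_scores):
    """Domain-aware diversity probing: per subdomain, keep the checkpoint
    with the highest Pass@K on that subdomain's probe set.
    checkpoint_scores: {subdomain: {checkpoint_id: pass_at_k}}"""
    return {sub: max(scores, key=scores.get) for sub, scores in checkpoint_scores.items()}

def fuse_state_dicts(state_dicts):
    """Expert fusion: uniform weighted averaging of parameters (w_i = 1/N)."""
    n = len(state_dicts)
    return {name: sum(sd[name] for sd in state_dicts) / n for name in state_dicts[0]}

# Toy demonstration with fabricated Pass@K scores and tiny parameter tensors.
scores = {
    "algebra":  {"ckpt_1000": 0.42, "ckpt_2000": 0.55, "ckpt_3000": 0.51},
    "geometry": {"ckpt_1000": 0.38, "ckpt_2000": 0.36, "ckpt_3000": 0.47},
}
experts = select_experts(scores)  # {'algebra': 'ckpt_2000', 'geometry': 'ckpt_3000'}

toy_weights = {cid: {"w": torch.full((2, 2), float(i))}
               for i, cid in enumerate(["ckpt_1000", "ckpt_2000", "ckpt_3000"], start=1)}
fused = fuse_state_dicts([toy_weights[c] for c in experts.values()])
print(experts, fused["w"])  # averaged parameter tensor for the fused SFT model
```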
3. MaxEnt-Guided Policy Optimization
Post-SFT, the RL signal phase employs a variant of Group Relative Policy Optimization (GRPO) enhanced with maximum-entropy weighting. Each query $q_i$ is assigned a weight reflecting how close its empirical success rate is to 50%, computed as follows (a numerical sketch follows this list):
- Empirical success rate, $\hat{p}_i = \frac{1}{G}\sum_{j=1}^{G} r_{i,j}$, where $r_{i,j} \in \{0,1\}$ is the binary reward for rollout $j$ among the $G$ rollouts of query $q_i$.
- Binary KL divergence to the ideal maximum-entropy point $\hat{p}=\tfrac{1}{2}$, $D_{\mathrm{KL}}\!\left(\hat{p}_i \,\middle\|\, \tfrac{1}{2}\right) = \hat{p}_i \log\frac{\hat{p}_i}{1/2} + (1-\hat{p}_i)\log\frac{1-\hat{p}_i}{1/2}$.
- Entropy-deviation weight, $w_i = \exp\!\left(-\lambda\, D_{\mathrm{KL}}\!\left(\hat{p}_i \,\middle\|\, \tfrac{1}{2}\right)\right)$, where $\lambda > 0$ controls weight sharpness.
- Weighted advantage for the PPO surrogate loss, $\tilde{A}_{i,j} = w_i\, \hat{A}_{i,j}$, with $\hat{A}_{i,j}$ the group-relative advantage, entering the clipped-PPO loss.
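A minimal numerical sketch of this weighting is given below: it computes the empirical success rate from binary rollout rewards, the binary KL divergence to the maximum-entropy point $\hat{p}=0.5$, and the resulting weight applied to group-relative advantages. The exponential weight form mirrors the formula above, and the default $\lambda = 1.0$ is an illustrative assumption, not a reported hyperparameter.

```python
import math

def maxent_weight(rewards, lam=1.0, eps=1e-8):
    """MaxEnt-guided query weight (sketch).
    rewards: binary rewards r_{i,j} in {0, 1} for the G rollouts of one query."""
    g = len(rewards)
    p = sum(rewards) / g                              # empirical success rate
    p = min(max(p, eps), 1.0 - eps)                   # avoid log(0)
    kl = p * math.log(p / 0.5) + (1 - p) * math.log((1 - p) / 0.5)
    return math.exp(-lam * kl)                        # 1.0 at p = 0.5, smaller near 0 or 1

def weighted_advantages(rewards, lam=1.0):
    """Group-relative advantages scaled by the query's MaxEnt weight."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5 or 1.0
    w = maxent_weight(rewards, lam)
    return [w * (r - mean) / std for r in rewards]

# A query solved 4/8 times (a "frontier" case) keeps full weight; 7/8 is down-weighted.
print(maxent_weight([1, 0, 1, 0, 1, 1, 0, 0]))   # ~1.0
print(maxent_weight([1, 1, 1, 1, 1, 1, 1, 0]))   # ~0.73
```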
This focuses optimization on high-uncertainty queries ("frontiers"), inducing a curriculum effect and ensuring that RL gains are allocated to the most informative regions of the solution space. Ablation confirms significant performance drops if either the diversity probing or the MaxEnt weighting is removed.
4. Model Architecture and Computational Efficiency
VibeThinker-1.5B is based on the Qwen2.5-Math-1.5B base, structured as a standard dense Transformer utilizing multi-head self-attention. While specifics of depth and width are not publicly detailed, its 1.5B parameter count is consistent with current dense Transformer conventions at this scale.
Training and inference regimes:
- SFT (“Spectrum”): AdamW optimizer, with regular checkpointing throughout training to supply candidates for subdomain diversity selection.
- RL (“Signal”): PPO-style clipped objective over multiple rollouts per query. Total post-training cost is roughly \$7,800 (about 3,900 NVIDIA H800 GPU-hours), orders of magnitude less than large-model post-training (\$294K–\$535K). Inference uses vLLM with top-$p$ sampling and temperature 0.6 (1.0 for math).
Deployment feasibility extends to single-GPU and edge devices, reflecting exceptionally low computational and financial barriers compared to state-of-the-art baselines.
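As a deployment illustration, the following minimal sketch runs single-GPU inference with vLLM using the sampling settings noted above; the Hugging Face identifier `WeiboAI/VibeThinker-1.5B` and the top-p value of 0.95 are assumptions rather than details confirmed by the source.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face identifier; substitute the actual released checkpoint path.
MODEL_ID = "WeiboAI/VibeThinker-1.5B"

# A 1.5B-parameter dense model fits comfortably on a single commodity GPU.
llm = LLM(model=MODEL_ID, dtype="bfloat16")

# Sampling settings from the section above: temperature 0.6 (1.0 for math);
# the top_p value here is an illustrative assumption.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```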
5. Empirical Benchmark Performance
VibeThinker-1.5B exhibits unprecedented reasoning scores for its size on mathematical and code benchmarks. Representative Pass@1 outcomes:
| Model | Params | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|---|
| Qwen2.5-Math (Base) | 1.5B | 6.7 | 4.3 | 0.6 |
| FastCURL-v3 | 1.5B | 49.6 | 34.4 | 21.5 |
| ProRL | 1.5B | 48.1 | 33.3 | 20.5 |
| VibeThinker-1.5B | 1.5B | 80.3 | 74.4 | 50.4 |
| DeepSeek-R1-0120 | 671B | 79.8 | 70.0 | 41.7 |
On LiveCodeBench V6:
- Base model: 0.0%
- Magistral-Medium: 50.3%
- VibeThinker-1.5B: 51.1%
GPQA-Diamond (professional knowledge): a 30.3-percentage-point improvement (from 16.4% to 46.7%).
Removing either the diversity-probing fusion or the MaxEnt weighting diminishes performance by approximately 5–10% on AIME25, substantiating the necessity of both components.
6. Analysis, Limitations, and Impact
Reasoning competence: High Pass@K in SFT expands the potential for RL-driven improvement. MGPO’s targeting of “frontier” queries creates a curriculum effect, making effective use of limited model capacity and yielding chain-of-thought reasoning on par with models exceeding 100B parameters.
Scalability and democratization: The entire post-training regime costs under $8K, nearly two orders of magnitude lower than large models. Inference on 1.5B-parameter architectures is accessible to commodity hardware, supporting decentralization and democratization of advanced LLM capabilities.
Limitations: The approach remains constrained on broad general knowledge benchmarks (e.g., GPQA) where parameter count exerts a more noticeable effect. Base model pretraining is math-biased; thus, expansion to code-centric or multimodal pretraining is an identified future direction. The extensibility of SSP to cross-lingual or multi-modal tasks is suggested as open for exploration.
A plausible implication is a fundamental rethinking of the training objectives and resource allocation for small-scale models in research and applied environments, advocating for post-training procedures that are explicitly tailored for reasoning rather than raw scale.
7. Conclusion
VibeThinker-1.5B empirically invalidates the size-equals-reasoning axiom in mathematical and coding domains by decoupling and optimizing diversity (Pass@K) and signal (Pass@1) within distinct training phases, operationalized through explicit mathematical objectives and efficient resource usage. The model achieves high-level reasoning ability at a fraction of the computational cost, serving as a new reference point for both the methodology and economics of high-performance, small-parameter language modeling (Xu et al., 9 Nov 2025).