Spectrum-to-Signal Principle is a training paradigm that separates diverse output generation from the extraction of correct reasoning chains, enhancing model robustness.
It employs a two-stage process combining domain-aware diversity probing with expert model fusion and MaxEnt-guided policy optimization to maximize reasoning capability.
Empirical results show that models like VibeThinker-1.5B achieve competitive results on mathematics and coding benchmarks with significantly lower training and inference costs.
The Spectrum-to-Signal Principle (SSP) is a training paradigm for LLMs that systematically decouples the generation of diverse solution paths from the extraction of correct, high-quality reasoning chains. The SSP approach is designed to address the limitations of conventional LLM fine-tuning pipelines, which typically prioritize single-shot accuracy metrics (Pass@1) throughout all training stages. By first expanding the diversity of plausible outputs and then algorithmically extracting and amplifying the correct “signal” via uncertain-case prioritization, SSP demonstrates that reasoning capacity comparable to that of much larger models can be elicited from small-parameter LLMs. This methodology underpins VibeThinker-1.5B, a 1.5-billion-parameter model that matches or outperforms models several orders of magnitude larger on key mathematics and coding benchmarks, while maintaining low total training and inference costs (Xu et al., 9 Nov 2025).
1. Motivation and Principle
The predominant paradigm for LLM post-training is a sequential application of supervised fine-tuning (SFT) that maximizes single-shot (Pass@1) accuracy, followed by reinforcement learning (RL), typically PPO-based, targeting the same high-probability metric. This approach restricts the solution search space that RL can refine. The Spectrum-to-Signal Principle responds to this limitation by explicitly dividing the post-pretraining pipeline into two orthogonal phases:
Spectrum Phase: Supervised fine-tuning is conducted using objectives and domain partitioning that maximize output diversity—quantified via the Pass@K metric—to ensure the resulting policy encodes as wide a spectrum of plausible solution chains as possible.
Signal Phase: Once a rich solution spectrum exists, a maximum-entropy-guided RL algorithm amplifies the correct reasoning paths by targeting regions of greatest epistemic uncertainty. This process exploits the model’s capacity to discriminate and update on high-value (“signal”) instances, maximizing both robustness and accuracy.
This explicit spectrum-signal decoupling ensures that the RL optimization acts on a diverse, information-rich set of candidate hypotheses, elevating the ceiling of model capability compared with SFT pipelines optimized solely for direct accuracy (Xu et al., 9 Nov 2025).
2. Spectrum Phase: Diversity-Probing SFT and Expert Fusion
Domain-Aware Diversity Probing: The problem domain is partitioned into $N$ subdomains $S_1, \ldots, S_N$ (e.g., algebra, geometry, and calculus for mathematics). During SFT, the model is periodically checkpointed and its Pass@K score is evaluated on subdomain-specific probing sets $D_i$. The optimal checkpoint for each subdomain is the one maximizing the measured diversity score:
$$P_i(t) = \mathrm{Pass@}K(M_t; D_i), \qquad M_i^* = \arg\max_t P_i(t)$$
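As a minimal illustration of this selection step, the sketch below scores each SFT checkpoint on each subdomain probing set with the standard unbiased Pass@K estimator and keeps the best checkpoint per subdomain; the `evaluate_rollouts` helper, the checkpoint objects, and the probing-set format are hypothetical stand-ins rather than the paper's actual tooling.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@K estimator: the probability that at least one
    of k samples, drawn from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def select_subdomain_experts(checkpoints, probing_sets, evaluate_rollouts, n=32, k=8):
    """For each subdomain D_i, return the checkpoint M_i* = argmax_t P_i(t),
    where P_i(t) is the mean Pass@K of checkpoint M_t on probing set D_i."""
    experts = {}
    for domain, problems in probing_sets.items():
        scores = []
        for ckpt in checkpoints:
            # evaluate_rollouts (hypothetical) samples n completions per problem
            # and returns the number of correct completions for each problem.
            correct_counts = evaluate_rollouts(ckpt, problems, num_samples=n)
            p_i = sum(pass_at_k(n, c, k) for c in correct_counts) / len(problems)
            scores.append((p_i, ckpt))
        experts[domain] = max(scores, key=lambda s: s[0])[1]
    return experts
```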
Expert Model Fusion: The subdomain-specialist checkpoints $\{M_i^*\}$ are fused into a single SFT model via weighted parameter averaging:
$$M^{\mathrm{Merge}}_{\mathrm{SFT}} = \sum_{i=1}^{N} w_i\, M_i^*, \qquad \sum_i w_i = 1$$
In practice, $w_i = 1/N$ is used for uniform fusion. This merged SFT model encodes a maximal “spectrum” of valid problem-solving strategies while remaining amenable to subsequent training.
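The fusion step itself reduces to a weighted average of parameter tensors. The sketch below, using PyTorch state dicts, assumes all subdomain experts share an identical architecture and floating-point parameters; the file names in the usage comment are illustrative.

```python
import torch

def fuse_experts(state_dicts, weights=None):
    """Weighted parameter averaging over subdomain experts:
    M_SFT^Merge = sum_i w_i * M_i*, with sum_i w_i = 1 (uniform by default)."""
    n = len(state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    assert abs(sum(weights) - 1.0) < 1e-6, "fusion weights must sum to 1"

    fused = {}
    for name, ref in state_dicts[0].items():
        # Average each parameter tensor across experts, then restore the dtype.
        avg = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        fused[name] = avg.to(ref.dtype)
    return fused

# Illustrative usage: load each expert checkpoint M_i*, fuse, save the merged SFT model.
# experts = [torch.load(f"expert_{d}.pt") for d in ("algebra", "geometry", "calculus")]
# torch.save(fuse_experts(experts), "merged_sft.pt")
```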
Throughout, the standard cross-entropy objective is used:
$$\mathcal{L}_{\mathrm{CE}}(\theta) = \mathbb{E}_{(x,y)\sim D}\big[-\log \pi_\theta(y \mid x)\big]$$
However, the integration of diversity probing and model fusion ensures high Pass@K coverage without reducing top-1 accuracy (Xu et al., 9 Nov 2025).
3. MaxEnt-Guided Policy Optimization
Following spectrum-phase SFT, RL is applied with a customized policy optimization method: MaxEnt-Guided Policy Optimization (MGPO), a variant of Group Relative Policy Optimization (GRPO). The innovation centers on problem-level entropy estimation to guide learning:
Uncertainty Estimation: For each problem $q$, the empirical correctness probability is estimated from $G$ rollouts under the old policy:
$$p_c(q) = \frac{1}{G} \sum_{i=1}^{G} \mathbb{1}\{r_i = 1\}$$
Entropy-Deviation Weighting: The deviation from maximum entropy ($p_0 = 0.5$ for binary outcomes) is computed using the KL divergence:
$$D_{\mathrm{ME}}(p_c \,\|\, p_0) = p_c \log\frac{p_c}{p_0} + (1 - p_c)\log\frac{1 - p_c}{1 - p_0}$$
The problem weight is then
$$w_{\mathrm{ME}}(p_c) = \exp\!\big(-\lambda\, D_{\mathrm{ME}}(p_c \,\|\, 0.5)\big)$$
High weights are assigned to uncertain cases ($p_c \approx 0.5$).
MGPO Surrogate Objective: The original token-level GRPO advantage $A_{i,t}$ is scaled by the problem-level weight, $\hat{A}_{i,t} = w_{\mathrm{ME}}(p_c(q))\, A_{i,t}$, within the surrogate objective.
This promotes targeted policy updates concentrated on epistemically valuable examples (Xu et al., 9 Nov 2025).
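Putting the pieces together, the following NumPy sketch computes the empirical correctness probability, the MaxEnt weight, and the rescaled group-relative advantages for one problem's rollout group; the standard-deviation normalization of the advantage follows the common GRPO convention, and the default $\lambda$ is an illustrative assumption rather than the paper's setting.

```python
import numpy as np

def maxent_weight(p_c: float, lam: float = 1.0, p0: float = 0.5, eps: float = 1e-8) -> float:
    """w_ME(p_c) = exp(-lambda * D_ME(p_c || p0)); maximal at p_c = p0 = 0.5."""
    p = np.clip(p_c, eps, 1.0 - eps)
    d_me = p * np.log(p / p0) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p0))
    return float(np.exp(-lam * d_me))

def mgpo_advantages(rewards: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Given binary rewards r_i for the G rollouts of one problem, compute
    group-normalized (GRPO-style) advantages and rescale them by w_ME."""
    p_c = rewards.mean()                                       # empirical correctness probability
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative advantage
    return maxent_weight(p_c, lam) * adv                       # emphasize uncertain problems

# A problem solved in 8 of 16 rollouts (p_c = 0.5) keeps full weight (1.0),
# while one solved in 15 of 16 rollouts is down-weighted to roughly 0.63 at lambda = 1.
print(maxent_weight(0.5), maxent_weight(15 / 16))
```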
4. Empirical Results and Comparative Performance
SSP’s effectiveness is validated on competitive mathematics and coding benchmarks. VibeThinker-1.5B, trained for roughly $7.8K (3,900 NVIDIA H800 GPU hours), matched or surpassed the capabilities of vastly larger models at a fraction of the cost and compute.

Core Mathematics Results

| Model | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|
| Base Qwen2.5-Math-1.5B | 6.7 | 4.3 | 0.6 |
| DeepSeek-R1 (671B) | 79.8 | 70.0 | 41.7 |
| VibeThinker-1.5B | 80.3 | 74.4 | 50.4 |

VibeThinker-1.5B demonstrates a 73.6 percentage-point improvement over the base Qwen2.5-Math-1.5B on AIME24 and outperforms DeepSeek-R1 on all three tasks. Notably, ablation experiments confirm that removing the diversity probing and fusion stage drops AIME25 performance from 74.4% to roughly 40%, and that disabling entropy-guided weighting in MGPO reduces RL gains by roughly 30%.

Coding and Science Benchmarks

LiveCodeBench V6:
Base 1.5B: 0.0
Magistral-Medium (~24B): 50.3%
VibeThinker-1.5B: 51.1%

GPQA-Diamond (graduate science QA):
VibeThinker-1.5B: 46.7% (base: 16.4%)
VibeThinker-1.5B thus matches or exceeds models with 20–400× its parameter count on these benchmarks (Xu et al., 9 Nov 2025).
5. Architecture, Training Setup, and Resource Cost
The model retains the Qwen2.5-Math-1.5B base architecture, with a 1,024-token positional context extended to 16–32K tokens during RL.
Training employs:
SFT: learning rate $\sim 2 \times 10^{-5}$, batch size 128, up to 50K steps
RL (MGPO): learning rate $10^{-5}$, batch size 64 × $G = 16$ rollouts, 20K policy updates
Total compute: 3,900 H800 hours ($\sim 3 \times 10^{20}$ FLOPs)
Cost: approximately $7,800 for the full pipeline

The training resource envelope is 30×–60× lower than that of state-of-the-art large-model RL post-training (DeepSeek-R1: 147,000 H800 hours / $294,000; MiniMax-M1: 258,000 hours / $535,000), rendering advanced reasoning research feasible for non-centralized labs (Xu et al., 9 Nov 2025).

6. Analysis, Implications, and Limitations

SSP’s core implication is that the exploration-exploitation balance, rather than raw parameter count, is the critical determinant of robust reasoning. By maximizing output diversity via spectrum-phase SFT, models are exposed to a broader manifold of problem-solving trajectories, allowing entropy-guided RL to select and amplify the fittest solutions.

Practical advantages include vastly reduced inference latency and cost: small models operate roughly 20× faster and at under 5% of the serving cost of models exceeding 100B parameters. This enables real-time reasoning on commodity devices and broadens access to research experimentation.

Noted limitations include continued knowledge-generalization gaps relative to 200–600B-parameter models, particularly on science QA. The base model’s math-centric pretraining constrains code generation, and the transferability of SSP to multimodal or retrieval-augmented settings, while plausible, remains to be empirically validated. Future experiments are proposed on the granularity of domain partitioning and on dynamic $\lambda$ schedules in MGPO (Xu et al., 9 Nov 2025).
7. Broader Impact and Future Directions
The Spectrum-to-Signal Principle demonstrates that algorithmic advances—especially explicit diversity/signal decoupling and uncertainty-guided exploration—reduce the structural advantage conferred by brute-force scaling. This opens competitive reasoning and scientific modeling to smaller research entities. A plausible implication is the democratization of sophisticated model training and the expansion of scientific AI research beyond a handful of central labs.
Potential extensions include adaptation for retrieval-augmented generation, tool-integrated agents, code-balanced pretraining, and more nuanced domain partitioning regimes. The principle’s efficacy in multimodal, interactive, and real-world environments is a prospective area of investigation, as is empirical analysis of the optimal entropy-guided RL hyperparameters for various domains (Xu et al., 9 Nov 2025).