Spectrum-to-Signal Principle is a training paradigm that separates diverse output generation from the extraction of correct reasoning chains, enhancing model robustness.
It employs a two-stage process combining domain-aware diversity probing with expert model fusion and MaxEnt-guided policy optimization to maximize reasoning capability.
Empirical results show that models like VibeThinker-1.5B achieve competitive results on mathematics and coding benchmarks with significantly lower training and inference costs.
The Spectrum-to-Signal Principle (SSP) is a training paradigm for LLMs that systematically decouples the generation of diverse solution paths from the extraction of correct, high-quality reasoning chains. The SSP approach is designed to address the limitations of conventional LLM fine-tuning pipelines, which typically prioritize single-shot accuracy metrics (Pass@1) throughout all training stages. By first expanding the diversity of plausible outputs and then algorithmically extracting and amplifying the correct “signal” via uncertain-case prioritization, SSP demonstrates that reasoning capacity comparable to that of much larger models can be elicited from small-parameter LLMs. This methodology underpins VibeThinker-1.5B, a 1.5-billion-parameter model that matches or outperforms models several orders of magnitude larger on key mathematics and coding benchmarks, while maintaining low total training and inference costs (Xu et al., 9 Nov 2025).
1. Motivation and Principle
The predominant paradigm for LLM post-training is a sequential application of supervised fine-tuning (SFT) that maximizes single-shot (Pass@1) accuracy, followed by reinforcement learning (RL), typically PPO-based, targeting the same high-probability metric. This approach restricts the solution search space that RL can refine. The Spectrum-to-Signal Principle responds to this limitation by explicitly dividing the post-pretraining pipeline into two orthogonal phases:
Spectrum Phase: Supervised fine-tuning is conducted using objectives and domain partitioning that maximize output diversity—quantified via the Pass@K metric—to ensure the resulting policy encodes as wide a spectrum of plausible solution chains as possible.
Signal Phase: Once a rich solution spectrum exists, a maximum-entropy-guided RL algorithm amplifies the correct reasoning paths by targeting regions of greatest epistemic uncertainty. This process exploits the model’s capacity to discriminate and update on high-value (“signal”) instances, maximizing both robustness and accuracy.
This explicit spectrum-signal decoupling ensures that the RL optimization acts on a diverse, information-rich set of candidate hypotheses, elevating the ceiling of model capability compared with SFT pipelines optimized solely for direct accuracy (Xu et al., 9 Nov 2025).
2. Spectrum Phase: Diversity-Probing SFT and Expert Fusion
Domain-Aware Diversity Probing: The problem domain is partitioned into $N$ subdomains $S_1, \ldots, S_N$ (e.g., algebra, geometry, and calculus for mathematics). During SFT, the model is periodically checkpointed and its Pass@K score is evaluated on subdomain-specific probing sets $D_i$. The optimal checkpoint for each subdomain is the one maximizing the measured diversity score:
$$P_i(t) = \mathrm{Pass@}K(M_t; D_i), \qquad M_i^* = \arg\max_t P_i(t)$$
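As a minimal illustration of this selection step, the sketch below scores each SFT checkpoint on each subdomain probing set with the standard unbiased Pass@K estimator and keeps the best checkpoint per subdomain; the `evaluate_rollouts` helper, the checkpoint objects, and the probing-set format are hypothetical stand-ins rather than the paper's actual tooling.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@K estimator: the probability that at least one
    of k samples, drawn from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def select_subdomain_experts(checkpoints, probing_sets, evaluate_rollouts, n=32, k=8):
    """For each subdomain D_i, return the checkpoint M_i* = argmax_t P_i(t),
    where P_i(t) is the mean Pass@K of checkpoint M_t on probing set D_i."""
    experts = {}
    for domain, problems in probing_sets.items():
        scores = []
        for ckpt in checkpoints:
            # evaluate_rollouts (hypothetical) samples n completions per problem
            # and returns the number of correct completions for each problem.
            correct_counts = evaluate_rollouts(ckpt, problems, num_samples=n)
            p_i = sum(pass_at_k(n, c, k) for c in correct_counts) / len(problems)
            scores.append((p_i, ckpt))
        experts[domain] = max(scores, key=lambda s: s[0])[1]
    return experts
```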
Expert Model Fusion: The subdomain-specialist checkpoints $\{M_i^*\}$ are fused into a single SFT model via weighted parameter averaging:
$$M^{\mathrm{Merge}}_{\mathrm{SFT}} = \sum_{i=1}^{N} w_i\, M_i^*, \qquad \sum_i w_i = 1$$
In practice, $w_i = 1/N$ is used for uniform fusion. This merged SFT model encodes a maximal “spectrum” of valid problem-solving strategies while remaining amenable to subsequent training.
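The fusion step itself reduces to a weighted average of parameter tensors. The sketch below, using PyTorch state dicts, assumes all subdomain experts share an identical architecture and floating-point parameters; the file names in the usage comment are illustrative.

```python
import torch

def fuse_experts(state_dicts, weights=None):
    """Weighted parameter averaging over subdomain experts:
    M_SFT^Merge = sum_i w_i * M_i*, with sum_i w_i = 1 (uniform by default)."""
    n = len(state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    assert abs(sum(weights) - 1.0) < 1e-6, "fusion weights must sum to 1"

    fused = {}
    for name, ref in state_dicts[0].items():
        # Average each parameter tensor across experts, then restore the dtype.
        avg = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        fused[name] = avg.to(ref.dtype)
    return fused

# Illustrative usage: load each expert checkpoint M_i*, fuse, save the merged SFT model.
# experts = [torch.load(f"expert_{d}.pt") for d in ("algebra", "geometry", "calculus")]
# torch.save(fuse_experts(experts), "merged_sft.pt")
```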
Throughout, the standard cross-entropy objective is used:
$$\mathcal{L}_{\mathrm{CE}}(\theta) = \mathbb{E}_{(x,y)\sim D}\big[-\log \pi_\theta(y \mid x)\big]$$
However, the integration of diversity probing and model fusion ensures high Pass@K coverage without reducing top-1 accuracy (Xu et al., 9 Nov 2025).
3. MaxEnt-Guided Policy Optimization
Following spectrum-phase SFT, RL is applied with a customized policy optimization method: MaxEnt-Guided Policy Optimization (MGPO), a variant of Group Relative Policy Optimization (GRPO). The innovation centers on problem-level entropy estimation to guide learning:
Uncertainty Estimation: For each problem $q$, the empirical correctness probability is estimated from $G$ rollouts under the old policy:
$$p_c(q) = \frac{1}{G} \sum_{i=1}^{G} \mathbb{1}\{r_i = 1\}$$
Entropy-Deviation Weighting: The deviation from maximum entropy ($p_0 = 0.5$ for binary outcomes) is computed using the KL divergence:
$$D_{\mathrm{ME}}(p_c \,\|\, p_0) = p_c \log\frac{p_c}{p_0} + (1 - p_c)\log\frac{1 - p_c}{1 - p_0}$$
The problem weight is then
$$w_{\mathrm{ME}}(p_c) = \exp\!\big(-\lambda\, D_{\mathrm{ME}}(p_c \,\|\, 0.5)\big)$$
High weights are assigned to uncertain cases ($p_c \approx 0.5$).
MGPO Surrogate Objective: The original token-level GRPO advantage $A_{i,t}$ is scaled by the problem-level weight, $\hat{A}_{i,t} = w_{\mathrm{ME}}(p_c(q))\, A_{i,t}$, within the surrogate objective.
This promotes targeted policy updates concentrated on epistemically valuable examples (Xu et al., 9 Nov 2025).
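Putting the pieces together, the following NumPy sketch computes the empirical correctness probability, the MaxEnt weight, and the rescaled group-relative advantages for one problem's rollout group; the standard-deviation normalization of the advantage follows the common GRPO convention, and the default $\lambda$ is an illustrative assumption rather than the paper's setting.

```python
import numpy as np

def maxent_weight(p_c: float, lam: float = 1.0, p0: float = 0.5, eps: float = 1e-8) -> float:
    """w_ME(p_c) = exp(-lambda * D_ME(p_c || p0)); maximal at p_c = p0 = 0.5."""
    p = np.clip(p_c, eps, 1.0 - eps)
    d_me = p * np.log(p / p0) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p0))
    return float(np.exp(-lam * d_me))

def mgpo_advantages(rewards: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Given binary rewards r_i for the G rollouts of one problem, compute
    group-normalized (GRPO-style) advantages and rescale them by w_ME."""
    p_c = rewards.mean()                                       # empirical correctness probability
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative advantage
    return maxent_weight(p_c, lam) * adv                       # emphasize uncertain problems

# A problem solved in 8 of 16 rollouts (p_c = 0.5) keeps full weight (1.0),
# while one solved in 15 of 16 rollouts is down-weighted to roughly 0.63 at lambda = 1.
print(maxent_weight(0.5), maxent_weight(15 / 16))
```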
4. Empirical Results and Comparative Performance
SSP’s effectiveness is validated on competitive mathematics and coding benchmarks. VibeThinker-1.5B, trained for roughly $7.8K (3,900 NVIDIA H800 GPU hours), matched or surpassed the capabilities of vastly larger models at a fraction of the cost and compute.

Core Mathematics Results

| Model | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|
| Base Qwen2.5-Math-1.5B | 6.7 | 4.3 | 0.6 |
| DeepSeek-R1 (671B) | 79.8 | 70.0 | 41.7 |
| VibeThinker-1.5B | 80.3 | 74.4 | 50.4 |

VibeThinker-1.5B demonstrates a 73.6 percentage-point improvement over the base Qwen2.5-Math-1.5B on AIME24 and outperforms DeepSeek-R1 on all three tasks. Notably, ablation experiments confirm that removing the diversity probing and fusion stage drops AIME25 performance from 74.4% to roughly 40%, and that disabling entropy-guided weighting in MGPO reduces RL gains by roughly 30%.

Coding and Science Benchmarks

LiveCodeBench V6:
Base 1.5B: 0.0
Magistral-Medium (~24B): 50.3%
VibeThinker-1.5B: 51.1%

GPQA-Diamond (graduate science QA):
VibeThinker-1.5B: 46.7% (base: 16.4%)
VibeThinker-1.5B thus matches or exceeds models with 20–400× its parameter count on these benchmarks (Xu et al., 9 Nov 2025).
5. Architecture, Training Setup, and Resource Cost
The model retains the Qwen2.5-Math-1.5B base architecture, with a 1,024-token positional context extended to 16–32K tokens during RL.
Training employs:
SFT: learning rate $\sim 2 \times 10^{-5}$, batch size 128, up to 50K steps
RL (MGPO): learning rate $10^{-5}$, batch size 64 × $G = 16$ rollouts, 20K policy updates
Total compute: 3,900 H800 hours ($\sim 3 \times 10^{20}$ FLOPs)
Cost: approximately $7,800 for the full pipeline

The training resource envelope is 30×–60× lower than that of state-of-the-art large-model RL post-training (DeepSeek-R1: 147,000 H800 hours / $294,000; MiniMax-M1: 258,000 hours / $535,000), rendering advanced reasoning research feasible for non-centralized labs (Xu et al., 9 Nov 2025).

6. Analysis, Implications, and Limitations

SSP’s core implication is that the exploration-exploitation balance, rather than raw parameter count, is the critical determinant of robust reasoning. By maximizing output diversity via spectrum-phase SFT, models are exposed to a broader manifold of problem-solving trajectories, allowing entropy-guided RL to select and amplify the fittest solutions.

Practical advantages include vastly reduced inference latency and cost: small models operate roughly 20× faster and at under 5% of the serving cost of models exceeding 100B parameters. This enables real-time reasoning on commodity devices and broadens access to research experimentation.

Noted limitations include continued knowledge-generalization gaps relative to 200–600B-parameter models, particularly on science QA. The base model’s math-centric pretraining constrains code generation, and the transferability of SSP to multimodal or retrieval-augmented settings, while plausible, remains to be empirically validated. Future experiments are proposed on the granularity of domain partitioning and on dynamic $\lambda$ schedules in MGPO (Xu et al., 9 Nov 2025).
7. Broader Impact and Future Directions
The Spectrum-to-Signal Principle demonstrates that algorithmic advances—especially explicit diversity/signal decoupling and uncertainty-guided exploration—reduce the structural advantage conferred by brute-force scaling. This opens competitive reasoning and scientific modeling to smaller research entities. A plausible implication is the democratization of sophisticated model training and the expansion of scientific AI research beyond a handful of central labs.
Potential extensions include adaptation for retrieval-augmented generation, tool-integrated agents, code-balanced pretraining, and more nuanced domain partitioning regimes. The principle’s efficacy in multimodal, interactive, and real-world environments is a prospective area of investigation, as is empirical analysis of the optimal entropy-guided RL hyperparameters for various domains (Xu et al., 9 Nov 2025).