MetaStone-S1: Reflective Generative Model

Updated 4 July 2025
  • MetaStone-S1 is a reflective generative model architecture that unifies next-token prediction and process evaluation using a shared transformer backbone.
  • It employs a self-supervised process reward loss that optimizes reasoning steps based on final answer accuracy, reducing PRM parameters by over 99%.
  • Its test-time scaling method samples multiple candidate trajectories and selects the best reasoning path, trading extra inference compute for higher accuracy on benchmark tasks.

MetaStone-S1 is a reflective generative LLM architecture designed to integrate process-level self-evaluation with scalable reasoning capabilities, enabling efficient test-time scaling (TTS) and achieving benchmark performance comparable to OpenAI o3-mini models with a substantially reduced parameter footprint (2507.01951). MetaStone-S1 is built around a self-supervised process reward mechanism and provides a unified, open-source interface for both policy generation and reward evaluation.

1. Reflective Generative Model Architecture

MetaStone-S1 introduces the Self-supervised Process Reward Model (SPRM), which unifies the roles of policy generation (next-token prediction) and process evaluation (reward modeling) within a shared transformer backbone. This design integrates two task-specific heads atop the backbone:

  • Policy Head: Responsible for the standard next-token prediction, typical of LLMs.
  • SPRM Head: A lightweight classifier comprising two linear layers with a dropout layer, instantiating the process reward model (PRM). This head extracts the feature representation from the penultimate transformer layer at each “step-token” (defined by \n\n), providing a scalar score per reasoning step.

The architectural efficiency allows MetaStone-S1 to reduce reward model parameters by over 99% compared to independently trained, full-sized PRMs. In previous implementations, PRMs commonly duplicated the LLM architecture with parameter sizes in the range of 7B–72B, resulting in significant computational redundancy.
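Concretely, the SPRM head amounts to a small classifier over backbone features. The following is a minimal PyTorch-style sketch; the layer widths, dropout rate, and sigmoid output are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class SPRMHead(nn.Module):
    """Lightweight process-reward head: two linear layers with dropout,
    applied to penultimate-layer features at each step-token position."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, step_features: torch.Tensor) -> torch.Tensor:
        # step_features: (num_steps, hidden_size), gathered from the shared
        # backbone's penultimate layer at the step-token positions.
        logits = self.net(step_features).squeeze(-1)   # (num_steps,)
        return torch.sigmoid(logits)                   # per-step scores in (0, 1)
```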

The final process reward for a generated trajectory is computed as the geometric mean of all per-step scores:

$$S_{\text{final}} = \left( \prod_{i=1}^{n} \text{Score}_i \right)^{1/n}$$

where $n$ is the number of reasoning steps and $\text{Score}_i$ is the output of the SPRM head at the $i$-th step.
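In code, the geometric mean is conveniently computed in log space to avoid underflow on long trajectories; a small sketch under the same assumptions:

```python
import torch

def trajectory_reward(step_scores: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Geometric mean of per-step SPRM scores, computed as exp(mean(log(score)))."""
    return torch.exp(torch.log(step_scores.clamp_min(eps)).mean())
```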

2. Self-supervised Process Reward Learning

MetaStone-S1 employs a self-supervised paradigm for reward model optimization, termed Self-supervised Process Reward Loss (SPR Loss). Unlike conventional PRMs that require expensive human annotations at the process (step) level, SPRM is trained end-to-end using supervision available only at the level of the final answer.

The loss function is:

$$\mathcal{L}_{\text{SPR}} = \frac{1}{N}\sum_{i=1}^{N} w_i \cdot \mathrm{BCELoss}(\text{Score}_i, y_i)$$

with $w_i$ set to $1$ if the local step-score prediction aligns with the correctness of the final outcome $y_i$, and $0$ otherwise. Only steps whose current predictions are consistent with the final-answer label contribute to gradient updates, which filters out the noisiest supervision from the coarse outcome signal. This strategy eliminates the need for process-level annotation, making step-level reward learning feasible at scale.
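A hedged sketch of this loss, assuming one binary outcome label per trajectory and a 0.5 agreement threshold for the weights (the threshold is an assumption, not stated above):

```python
import torch
import torch.nn.functional as F

def spr_loss(step_scores: torch.Tensor, final_correct: bool) -> torch.Tensor:
    """Self-supervised process reward loss for one trajectory.

    step_scores:   (N,) per-step SPRM scores in (0, 1).
    final_correct: whether the trajectory's final answer matched the ground truth.
    """
    # Every step inherits the outcome label y_i = y (final answer correct / incorrect).
    y = torch.full_like(step_scores, 1.0 if final_correct else 0.0)
    # w_i = 1 only where the step's current verdict agrees with the outcome label.
    w = ((step_scores > 0.5).float() == y).float()
    bce = F.binary_cross_entropy(step_scores, y, reduction="none")
    return (w * bce).mean()   # (1/N) * sum_i w_i * BCE(Score_i, y_i)
```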

3. Test-Time Scaling (TTS) Methodology

MetaStone-S1 natively supports Test-Time Scaling (TTS), enabling the system to trade additional computational effort at inference for improved performance. The policy model samples $k$ distinct candidate reasoning trajectories for each input prompt. Each trajectory is evaluated by the SPRM head, and the highest-scoring trajectory determines the final output:

$$\text{Answer} = \mathrm{LLM}(\text{think}_{i^*}), \quad i^* = \arg\max(S_1, S_2, \dots, S_k)$$

Users may control the "reasoning effort" by configuring $k$, with pre-defined modes:

  • Low effort: $k = 2$ (MetaStone-S1-low)
  • Medium effort: $k = 8$ (MetaStone-S1-medium)
  • High effort: $k = 32$ (MetaStone-S1-high)

This scalable approach enables flexible deployment, allowing a balance between computational cost, wall-clock latency, and task performance.
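A sketch of the best-of-$k$ loop under these modes; `generate_trajectory` and `score_trajectory` are hypothetical stand-ins for the shared policy model and the SPRM scoring path:

```python
from typing import Callable, List, Tuple

# Pre-defined reasoning-effort modes and their candidate counts.
EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}

def test_time_scale(
    prompt: str,
    generate_trajectory: Callable[[str], str],      # hypothetical: samples one reasoning trace
    score_trajectory: Callable[[str, str], float],  # hypothetical: SPRM score S_i for a trace
    effort: str = "medium",
) -> Tuple[str, float]:
    """Best-of-k selection: sample k candidate traces, keep the highest-scoring one.
    The final answer is then decoded by the policy model conditioned on that trace."""
    k = EFFORT_TO_K[effort]
    candidates: List[Tuple[str, float]] = [
        (trace, score_trajectory(prompt, trace))
        for trace in (generate_trajectory(prompt) for _ in range(k))
    ]
    return max(candidates, key=lambda c: c[1])
```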

4. Empirical Scaling Law and Performance

MetaStone-S1 empirically establishes a scaling law that links total "reasoning compute" to final performance. The measure of reasoning compute is $C = \text{Param}_{\text{policy}} \times \text{Token}_{\text{infer}}$, encapsulating both model size and trajectory generation length. Results demonstrate that task performance increases logarithmically with this compute budget, implying that gains may be achieved either by increasing model scale or by sampling more candidate trajectories at test time.
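As a rough worked illustration of this budget (all numbers below are hypothetical, and treating the inference-token count as covering every sampled trajectory is an assumption, not a figure from the report):

```python
def reasoning_compute(policy_params: float, tokens_per_trace: float, k: int) -> float:
    # C = Param_policy x Token_infer, with Token_infer here covering all k sampled
    # traces (the multiplication by k is an assumption for this illustration).
    return policy_params * tokens_per_trace * k

# Hypothetical example: a 32B-parameter policy generating ~4,000 tokens per trace.
low_budget  = reasoning_compute(32e9, 4_000, k=2)    # ~2.6e14
high_budget = reasoning_compute(32e9, 4_000, k=32)   # ~4.1e15, a 16x larger budget
```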

Performance evaluations on established benchmarks reflect this relationship:

| Model | AIME24 | AIME25 | LiveCodeBench | C-EVAL |
|---|---|---|---|---|
| OpenAI o3-mini (medium) | 79.6 | 74.8 | 67.4 | 75.9 |
| MetaStone-S1-32B-high | 85.2 | 73.6 | 64.2 | 89.7 |

MetaStone-S1-32B-high, with 32B parameters and maximum TTS ($k = 32$), outperforms OpenAI o3-mini (medium) on AIME24 and C-EVAL, and approaches parity on AIME25 and LiveCodeBench.

5. Parameter Efficiency and Practical Implications

By consolidating the policy and process reward functions into a single shared model, MetaStone-S1 achieves over 99% PRM parameter reduction relative to traditional separate PRMs. For example, the additional SPRM head for a 7B backbone model comprises only about 26M parameters. This efficient model composition permits high candidate sampling (large $k$) without incurring prohibitive computational or memory overhead.

This efficiency is especially advantageous in large-scale deployment settings, making reasoning-intensive inference affordable, and enabling real-world applications previously constrained by compute requirements.

6. Open Source Release and Community Impact

MetaStone-S1 is openly available under a permissive license, with code and model checkpoints at https://github.com/MetaStone-AI/MetaStone-S1. The open-source release facilitates community-driven research on test-time scaling, reflective generative modeling, and process reward approaches. The unified, modular interface supports reproducibility and adaptation to diverse application domains and TTS configurations.

7. Significance and Prospects

MetaStone-S1 represents a practically oriented advance in reflective generative modeling by combining end-to-end reward evaluation, parameter sharing, and scalable inference. The methodology circumvents the annotation bottleneck of prior PRMs and empirically supports the principle that increased test-time reasoning can systematically yield higher task accuracy. Open science practices surrounding MetaStone-S1 are expected to foster further research into efficient self-supervised evaluation and scalable reasoning architectures.
