Test-Time Scaling with Reflective Generative Model (2507.01951v1)

Published 2 Jul 2025 in cs.LG and cs.CL

Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra process annotation, reducing over 99% PRM parameters for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suitable for test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high), based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI-o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.

Summary

  • The paper proposes a unified large language model architecture that integrates policy and SPRM heads to optimize test-time scaling.
  • It introduces a self-supervised process reward loss that dynamically pseudo-labels reasoning steps, reducing extra compute and alignment overhead.
  • Empirical results demonstrate improved benchmark performance and significant parameter reductions, supporting scalable and efficient inference.

Test-Time Scaling with Reflective Generative Models: A Technical Assessment

The paper “Test-Time Scaling with Reflective Generative Model” (MetaStone-S1) (2507.01951) introduces a unified architecture and training paradigm for LLMs, targeting efficient and effective test-time scaling (TTS) through the integration of a policy model and a step-level process reward model (PRM) into a single network, namely the Self-supervised Process Reward Model (SPRM). The work fundamentally addresses the computational inefficiencies and alignment issues prevalent in prior externally guided TTS approaches, while demonstrating new scaling behavior and robust empirical performance.

Motivation and Background

Recent progress in TTS has fueled advances in reasoning and coding capabilities of LLMs, with OpenAI’s o3 model serving as a prominent instance. TTS approaches are usually classified into internal (e.g., long chain-of-thought) and external (e.g., candidate re-ranking or step-wise search via PRMs) methods. External TTS relying on separate, large PRMs introduces substantial parameter and compute overhead, complexity in the pipeline, and often off-policy misalignments due to data and generation distribution divergence.

The key premise of this work is that a shared-parameter architecture—where the same network underpins both the generative (policy) and evaluative (stepwise process reward) heads—can enable efficient, unified, and on-policy optimization. This, in turn, enables end-to-end fine-tuning without expensive step-level annotation or separate PRM pre-training.

Model Architecture and Training Methodology

MetaStone-S1 attaches two task-specific heads to a shared transformer backbone (a minimal sketch follows the list):

  • Next-Token Prediction Head: Standard autoregressive LM head for generation.
  • SPRM Head: A lightweight binary classifier (two linear layers with dropout) attached to the hidden states at each reasoning step (defined by occurrence of ".\n\n" tokens), predicting the correctness of each intermediate reasoning step.
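
To make the layout concrete, here is a minimal PyTorch sketch of this dual-head design, assuming a generic transformer backbone; the class names (SPRMHead, ReflectiveModel), the dropout rate, and the forward signatures are illustrative assumptions, not the released MetaStone-S1 implementation.

```python
import torch
import torch.nn as nn

class SPRMHead(nn.Module):
    """Lightweight binary classifier over per-step hidden states."""
    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, step_hidden_states: torch.Tensor) -> torch.Tensor:
        # step_hidden_states: (num_steps, hidden_size), gathered at the
        # ".\n\n" positions that delimit reasoning steps; returns one
        # correctness probability per step.
        return torch.sigmoid(self.net(step_hidden_states)).squeeze(-1)

class ReflectiveModel(nn.Module):
    """Shared transformer backbone with a next-token head and an SPRM head."""
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                       # shared weights
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.sprm_head = SPRMHead(hidden_size)

    def forward(self, input_ids: torch.Tensor, step_positions: torch.Tensor):
        hidden = self.backbone(input_ids)              # (1, seq_len, hidden)
        logits = self.lm_head(hidden)                  # next-token prediction
        step_scores = self.sprm_head(hidden[0, step_positions])  # batch size 1 for clarity
        return logits, step_scores
```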

Self-Supervised Process Reward Loss

A central contribution is the self-supervised optimization scheme for the SPRM head (SPR Loss). Rather than requiring process-level human labels or auxiliary LLM judges, the model employs dynamic pseudo-labeling. At each step, if the SPRM prediction for a token aligns with the final answer correctness (i.e., step score >0.5 iff the answer is correct), it contributes to the loss; otherwise, it is masked out. This design filters noisy supervision inherent in using final answer correctness as a proxy for step correctness and enables the model to focus on steps most representative of valid or invalid reasoning.
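
A hedged sketch of this loss follows, under the assumption that every step inherits the final-answer correctness as its pseudo-label and that disagreeing steps are simply masked out of a binary cross-entropy objective; the exact thresholding and reduction used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def spr_loss(step_scores: torch.Tensor, answer_correct: bool) -> torch.Tensor:
    """Self-supervised process reward loss for one trajectory.

    step_scores: (num_steps,) SPRM probabilities for each reasoning step.
    answer_correct: whether the trajectory's final answer was correct.
    """
    # Outcome-level pseudo-label shared by every step of the trajectory.
    label = torch.full_like(step_scores, 1.0 if answer_correct else 0.0)
    # Keep only steps whose current prediction agrees with the pseudo-label:
    # score > 0.5 when the answer is correct, score <= 0.5 when it is wrong.
    agree = (step_scores > 0.5) == answer_correct
    mask = agree.float()
    if mask.sum() == 0:
        return step_scores.new_zeros(())   # no consistent step: contribute no gradient
    per_step = F.binary_cross_entropy(step_scores, label, reduction="none")
    return (per_step * mask).sum() / mask.sum()
```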

Unified Inference Procedure

During inference, the process proceeds as follows:

  1. For each question, sample k candidate reasoning trajectories using the policy model.
  2. For each, score the stepwise process via the SPRM head; aggregate step scores (geometric mean).
  3. Select and output the trajectory with the highest aggregate score.

By varying k (e.g., low=2, medium=8, high=32), MetaStone-S1 provides tunable reasoning effort, aligning inference cost and performance to task demands.
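
The selection loop can be sketched as below; generate_trajectory and score_steps are assumed stand-ins for the policy model's sampler and the SPRM head's per-step scoring, not the released API.

```python
import math
from typing import Callable, List, Tuple

def select_best_of_n(
    question: str,
    k: int,                                        # low=2, medium=8, high=32
    generate_trajectory: Callable[[str], str],
    score_steps: Callable[[str], List[float]],
) -> Tuple[str, float]:
    """Sample k trajectories, score each with the SPRM, return the best one."""
    best_traj, best_score = "", float("-inf")
    for _ in range(k):
        traj = generate_trajectory(question)        # one sampled reasoning chain
        step_scores = score_steps(traj)             # SPRM probability per step
        # Aggregate step scores with a geometric mean, computed in log space.
        agg = math.exp(
            sum(math.log(max(s, 1e-8)) for s in step_scores)
            / max(len(step_scores), 1)
        )
        if agg > best_score:
            best_traj, best_score = traj, agg
    return best_traj, best_score
```

Computing the geometric mean in log space guards against numerical underflow for long trajectories; the low/medium/high reasoning-effort modes then map directly to k = 2, 8, and 32.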

Empirical Analysis

The paper offers a comprehensive empirical evaluation across mathematical (AIME24/25), programming (LiveCodeBench), and general (C-EVAL) benchmarks. Key results include:

  • Consistent Outperformance Over Baselines and External PRMs:
    • On AIME24, MetaStone-S1-1.5B-high achieves 57.9% Pass@1 versus 55.5% for R1-Distill-Qwen-7B and 50.4% for R1-Distill-Llama-8B, with only 1.5B parameters.
    • The 7B and 32B variants also noticeably outperform similarly sized or larger open-source models and are competitive with OpenAI-o3-mini in several tasks.
  • Substantial Reduction in PRM Parameter and Compute Overhead:
    • The unified SPRM head introduces only 5M/26M extra parameters for 1.5B/7B models—a reduction exceeding 99% relative to standard 72B PRMs, while delivering superior trajectory discrimination.
  • Scaling Law Discovery:
    • The empirical relationship between total test-time compute (params × inference tokens) and accuracy is near-logarithmic, with diminishing returns beyond a certain compute threshold (e.g., Best-of-64 sampling); a schematic form of this relationship is sketched after this list.
  • Generalization and Robustness:
    • SPRM trained on math data generalizes well without in-domain tuning to programming (LiveCodeBench) tasks, indicating capture of domain-agnostic reasoning evaluation patterns.
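
The near-logarithmic trend noted above can be written schematically as follows; the functional form and the coefficients a, b are illustrative assumptions, not fitted values from the paper.

```latex
% Schematic form of the reported near-logarithmic TTS scaling behavior.
% C: total thinking computation; a, b: fit coefficients (illustrative only).
\[
  \mathrm{Acc}(C) \;\approx\; a \log C + b,
  \qquad
  C = N_{\text{params}} \times N_{\text{inference tokens}}.
\]
```

Under a logarithmic fit, each doubling of C adds a roughly constant accuracy increment, consistent with the diminishing returns observed beyond Best-of-64 sampling.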

Ablation and Analysis

Ablation studies confirm:

  • Efficacy of SPR Loss over Standard BCE Loss:
    • Larger score gap between correct and incorrect trajectories, reduced noise from proxy step labels, and elevated accuracy.
  • Qualitative "Aha Moments":
    • Training curves show abrupt phase transitions in the ability to discriminate correct/incorrect reasoning after certain data exposure thresholds, a phenomenon highlighted in example visualizations.
  • Applicability to Step-Level Search:
    • When integrated into MCTS for exploration, the SPRM provides efficient value estimation at nodes, raising accuracy from 39.3% to 52.8% on AIME24, though with acknowledged compute trade-offs compared to Best-of-N; a sketch of SPRM-based value estimation follows below.
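
For concreteness, the sketch below shows one plausible way SPRM scores could serve as node value estimates inside a UCT-style selection rule; the Node structure, the exploration constant, and the scoring of partial trajectories are assumptions and may differ from the paper's actual search procedure.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    partial_steps: List[str]                       # reasoning steps generated so far
    visits: int = 0
    value_sum: float = 0.0
    children: List["Node"] = field(default_factory=list)

def sprm_value(node: Node, score_steps: Callable[[List[str]], List[float]]) -> float:
    """Value estimate: geometric mean of SPRM scores over the partial trajectory."""
    scores = score_steps(node.partial_steps)
    return math.exp(
        sum(math.log(max(s, 1e-8)) for s in scores) / max(len(scores), 1)
    )

def uct_select(parent: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing mean value plus an exploration bonus."""
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")                    # expand unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(parent.visits + 1) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)
```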

Implications and Outlook

This work demonstrates that reward-guided test-time scaling, when the reward model is integrated with the policy model both architecturally and algorithmically, avoids most of the computational overhead and misalignment risks previously associated with large external PRMs. The simplicity of step segmentation (based purely on ".\n\n" tokens), the shared backbone, and the self-supervised loss together yield an efficient, practical TTS pipeline suitable for both research and production deployment.

From a theoretical perspective, the empirical scaling laws reinforce the importance of judicious compute allocation between model size and sampling budget, suggesting that careful TTS (Best-of-N or search) may be more compute-optimal than model scaling alone in some regimes. The observed "aha moments" during SPRM training suggest fertile ground for further study of representation alignment and curriculum learning in RLHF-like setups.

Practically, this approach invites future work in:

  • Extending reflective models to broader domains, including vision and multi-modal reasoning,
  • Investigating dynamic adjustment of TTS parameters at deployment time,
  • Developing more nuanced reward heads for tasks requiring graded or multi-label evaluation,
  • Integrating step-level search policies with adaptive compute allocation, balancing latency and accuracy.

MetaStone-S1 and its open-source release are poised to facilitate research on more efficient and interpretable scaling strategies, both for academic exploration of reasoning and for real-world AI deployments requiring controllable reasoning quality and cost.
