SCOPE: Confidence-Weighted Pseudo-Labeling
- The paper introduces a framework that uses step-wise confidence and dynamic subgrouping to replace majority voting, yielding relative gains of up to 13.1% on math reasoning benchmarks.
- It employs multi-granularity confidence computation and bootstrap-based local consensus to deduce more reliable pseudo-labels, mitigating confirmation bias.
- Empirical results on math reasoning benchmarks (AIME, AMC, MATH-500) show that SCOPE significantly improves reward density and exploration in test-time reinforcement learning.
Subgroup-Specific Step-Wise Confidence-Weighted Pseudo-Label Estimation (SCOPE) is a reward inference framework for test-time reinforcement learning (TTRL) that leverages model confidence across reasoning steps and dynamic subgrouping to generate more reliable and informative pseudo-labels. SCOPE addresses critical issues of majority voting strategies in TTRL—specifically confirmation bias and sparse reward signals—by combining step-wise confidence assessment, subgroup-specific pseudo-labeling, and bootstrap-based local consensus, leading to enhanced reward density and improved exploration in LLM reasoning tasks (Wang et al., 17 Dec 2025).
1. Step-Wise Confidence Formalism
SCOPE introduces a multi-granularity confidence computation to quantify model uncertainty at the token, reasoning-step, and response levels. For a predicted token at decoding step $t$, token-level confidence $c_t$ is defined as the negative average log-probability of the top-$k$ decoded token probabilities:

$$c_t = -\frac{1}{k}\sum_{i=1}^{k}\log p_t^{(i)},$$

where $p_t^{(i)}$ is the model probability of the $i$-th highest-probability token at step $t$. Higher $c_t$ indicates greater certainty. For each reasoning step $s$ (comprising tokens $t \in s$) in answer $y_j$, the step confidence is:

$$C_s = \frac{1}{|s|}\sum_{t \in s} c_t.$$

Averaging over all steps $S_j$ within $y_j$ yields the average-step (response) confidence:

$$\bar{C}(y_j) = \frac{1}{|S_j|}\sum_{s \in S_j} C_s.$$

This layered approach enables detailed assessment of the model's confidence trajectory during multi-step reasoning processes.
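A minimal sketch of this three-level computation in Python, assuming access to per-position top-$k$ log-probabilities (as exposed by most inference APIs) and precomputed step boundaries; the function names and the `step_boundaries` representation are illustrative, not from the paper:

```python
def token_confidence(topk_logprobs: list[float]) -> float:
    """Negative mean log-probability of the top-k candidates at one
    decoding position; a peaked distribution yields a higher value."""
    return -sum(topk_logprobs) / len(topk_logprobs)

def response_confidence(per_token_topk_logprobs: list[list[float]],
                        step_boundaries: list[int]) -> float:
    """Average-step confidence for one sampled response.

    per_token_topk_logprobs[t]: top-k log-probs at position t.
    step_boundaries: end index (exclusive) of each reasoning step.
    """
    token_confs = [token_confidence(lp) for lp in per_token_topk_logprobs]
    step_confs, start = [], 0
    for end in step_boundaries:  # mean token confidence per step
        step_confs.append(sum(token_confs[start:end]) / (end - start))
        start = end
    return sum(step_confs) / len(step_confs)  # mean over steps
```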
2. Confidence-Weighted Pseudo-Label Deduction
Utilizing the computed confidence scores, SCOPE replaces the frequency-based majority-voting paradigm common in unsupervised RL reward inference. Instead, the pseudo-label for a group of $N$ outputs $\{y_j\}_{j=1}^{N}$ is determined by a weighted aggregation:

$$\hat{y} = \arg\max_{a}\sum_{j=1}^{N}\bar{C}(y_j)\,\mathbb{1}\!\left[\operatorname{ans}(y_j) = a\right],$$

where $\operatorname{ans}(y_j)$ denotes the extracted final answer of $y_j$. This procedure derives pseudo-labels from both model consensus and the inherent confidence of individual reasoning traces, giving higher supervisory weight to higher-certainty rationales. The approach mitigates the confirmation bias of pure frequency voting and facilitates the inclusion of minority-but-confident correct answers (Wang et al., 17 Dec 2025).
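A minimal sketch of the weighted vote, assuming each response has already been reduced to an extracted answer string and a scalar response confidence $\bar{C}(y_j)$; the function name and example values are illustrative:

```python
from collections import defaultdict

def confidence_weighted_vote(answers: list[str],
                             confidences: list[float]) -> str:
    """Return the answer whose supporters carry the largest total
    response confidence, rather than the largest raw count."""
    score = defaultdict(float)
    for ans, conf in zip(answers, confidences):
        score[ans] += conf
    return max(score, key=score.get)

# A confident minority can outvote an uncertain majority:
# total weight for "42" is 3.9 vs. 3.0 for "41".
print(confidence_weighted_vote(["41", "41", "41", "42", "42"],
                               [0.9, 1.0, 1.1, 2.0, 1.9]))  # -> 42
```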
3. Dynamic Subgroup Partitioning and Quality-Exploration Tradeoff
A central innovation in SCOPE is the dynamic formation of disjoint subgroups $\{G_i\}$ (each of size $m$) from the $N$ sampled outputs, rather than sharing a single pseudo-label across all outputs. For each subgroup $G_i$, a local consensus label $\hat{y}_i$ is derived. The optimal subgroup size $m^*$ is selected by evaluating candidate sizes $m$ based on:
- Quality rate $Q(m)$: the proportion of outputs across all subgroups that match their subgroup's consensus label.
- Exploration rate $E(m)$: the fraction of distinct consensus labels across subgroups.
After normalization, the selection objective for $m^*$ minimizes a weighted combination of the two criteria,

$$m^* = \arg\min_{m}\left[\lambda\bigl(1 - \tilde{Q}(m)\bigr) + (1 - \lambda)\bigl(1 - \tilde{E}(m)\bigr)\right],$$

with a fixed weight $\lambda$ in practice, mediating between high local agreement and label diversity. This Pareto-front tradeoff ensures a balance between reward density and exploratory supervision, which addresses the challenge of reward sparsity intrinsic to global consensus (Wang et al., 17 Dec 2025); a sketch of the size search follows below.
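A simplified sketch of the size search, reusing `confidence_weighted_vote` from the previous block; for brevity it skips the normalization of $Q$ and $E$ across candidates, drops any trailing partial subgroup, and uses an assumed weight `lam` rather than the paper's setting:

```python
def subgroup_labels(answers, confidences, m):
    """Consensus label of each disjoint, size-m subgroup."""
    return [confidence_weighted_vote(answers[i:i + m],
                                     confidences[i:i + m])
            for i in range(0, len(answers) - m + 1, m)]

def select_subgroup_size(answers, confidences, candidates, lam=0.5):
    """Pick m* minimizing lam*(1 - Q(m)) + (1 - lam)*(1 - E(m))."""
    best_m, best_score = None, float("inf")
    for m in candidates:
        labels = subgroup_labels(answers, confidences, m)
        covered = len(labels) * m
        agree = sum(answers[i] == labels[i // m] for i in range(covered))
        q = agree / covered                 # quality rate Q(m)
        e = len(set(labels)) / len(labels)  # exploration rate E(m)
        score = lam * (1 - q) + (1 - lam) * (1 - e)
        if score < best_score:
            best_m, best_score = m, score
    return best_m
```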
4. Bootstrap-Based Local Consensus
Within each selected subgroup $G_i$, SCOPE implements a bootstrap-resampling procedure to reliably aggregate subgroup-specific pseudo-labels:
- Resampling: $B$ bootstrap samples of size $m^*$ are drawn (with replacement) from the global pool of $N$ responses.
- Vote: The step-wise confidence-weighted voting mechanism is applied to each bootstrap sample to infer a candidate label.
- Aggregation: The $B$ candidate labels are aggregated via another confidence-weighted vote to produce the final subgroup pseudo-label $\hat{y}_i$.
- Reward Assignment: Each output $y_j \in G_i$ is assigned a binary reward $r_j = \mathbb{1}\!\left[\operatorname{ans}(y_j) = \hat{y}_i\right]$.
This yields up to $\lfloor N/m^* \rfloor$ distinct supervisory targets per update, achieving denser supervision than a single global label. The paper's accompanying pseudocode formalizes the iterative policy update procedure incorporating these elements; a simplified sketch of the bootstrap step is given below.
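A simplified sketch of the bootstrap vote and reward assignment, again reusing `confidence_weighted_vote`; `num_bootstrap` and the way candidate labels are weighted in the final vote are assumptions for illustration, not specifics from the paper:

```python
import random
from collections import defaultdict

def bootstrap_subgroup_label(pool_answers, pool_confidences,
                             m, num_bootstrap=100):
    """Vote within num_bootstrap resamples of size m drawn (with
    replacement) from the global pool, then aggregate the candidate
    labels weighted by the confidence mass that backed each one."""
    candidate_weight = defaultdict(float)
    indices = range(len(pool_answers))
    for _ in range(num_bootstrap):
        sample = random.choices(indices, k=m)  # with replacement
        ans = [pool_answers[i] for i in sample]
        conf = [pool_confidences[i] for i in sample]
        label = confidence_weighted_vote(ans, conf)
        candidate_weight[label] += sum(
            c for a, c in zip(ans, conf) if a == label)
    return max(candidate_weight, key=candidate_weight.get)

def subgroup_rewards(group_answers, pseudo_label):
    """Binary reward per output in the subgroup."""
    return [1.0 if a == pseudo_label else 0.0 for a in group_answers]
```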
5. Comparative Empirical Performance
SCOPE demonstrates robust empirical gains across multiple math reasoning benchmarks and LLM scales. On challenging datasets such as AIME 2024, AIME 2025, AMC, and MATH-500, SCOPE consistently outperforms the baseline TTRL (majority-voting) approach. The strongest results are reported with Qwen3-8B:
| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Avg |
|---|---|---|---|---|---|
| TTRL (baseline) | 47.13 | 27.40 | 68.55 | 89.74 | 58.21 |
| w/ SCOPE | 52.70 | 31.00 | 74.09 | 91.01 | 62.20 |
| Δ (relative) | +11.8% | +13.1% | +8.1% | +1.4% | +6.9% |
Notable improvements include a 13.1% relative increase on AIME 2025 and 8.1% on AMC; the Δ row is computed against the TTRL baseline, e.g. $(31.00 - 27.40)/27.40 \approx +13.1\%$ on AIME 2025. Smaller-scale models such as Qwen2.5-Math-1.5B show similar double-digit average relative improvements (+11.9%). These results substantiate the utility of step-wise confidence and dynamic subgroup supervision in recovering correct minority outputs and delivering more reliable reward signals (Wang et al., 17 Dec 2025).
6. Context and Implications in Test-Time Reinforcement Learning
SCOPE advances TTRL by addressing the two key deficiencies of traditional majority-voting pseudo-labeling: confirmation bias and reward sparsity. By leveraging step-wise model confidence, SCOPE promotes high-quality, minority-consistent reasoning paths that are typically underrepresented in pure frequency-based voting. The dynamic subgroup partitioning and local consensus approach increases reward density and drives exploratory diversity. A plausible implication is that SCOPE's methodological innovations—especially its confidence-weighted aggregation and Pareto-guided subgrouping—could generalize to other domains where unsupervised RL reward modeling must operate without reliable, verifiable reward signals (Wang et al., 17 Dec 2025).