
SCOPE: Confidence-Weighted Pseudo-Labeling

Updated 24 December 2025
  • The paper introduces a framework that uses step-wise confidence and dynamic subgrouping to replace majority voting, yielding up to a 13.1% relative improvement on math reasoning benchmarks.
  • It employs multi-granularity confidence computation and bootstrap-based local consensus to deduce more reliable pseudo-labels, mitigating confirmation bias.
  • Empirical results on math reasoning benchmarks (AIME, AMC, MATH-500) show that SCOPE significantly improves reward density and exploration in test-time reinforcement learning.

Subgroup-Specific Step-Wise Confidence-Weighted Pseudo-Label Estimation (SCOPE) is a reward inference framework for test-time reinforcement learning (TTRL) that leverages model confidence across reasoning steps and dynamic subgrouping to generate more reliable and informative pseudo-labels. SCOPE addresses critical issues of majority voting strategies in TTRL—specifically confirmation bias and sparse reward signals—by combining step-wise confidence assessment, subgroup-specific pseudo-labeling, and bootstrap-based local consensus, leading to enhanced reward density and improved exploration in LLM reasoning tasks (Wang et al., 17 Dec 2025).

1. Step-Wise Confidence Formalism

SCOPE introduces a multi-granularity confidence computation to quantify model uncertainty at the token, reasoning-step, and response levels. For a predicted token at decoding step $t$, the token-level confidence $\mathcal{C}_t$ is defined as the negative average log-probability of the top-$k$ decoded token probabilities:

$\mathcal{C}_t = -\frac{1}{k}\sum_{j=1}^{k} \log P_t(j)$

where $P_t(j)$ is the model probability of the $j$-th highest-probability token. Higher $\mathcal{C}_t$ indicates greater certainty. For each reasoning step $s_k$ (comprising $N_k$ tokens) in answer $o_i$, the step confidence is

$\mathcal{C}_{s_k}^{(i)} = \frac{1}{N_k}\sum_{t \in s_k} \mathcal{C}_t$

Averaging over all $|\mathcal{L}|$ steps within $o_i$ yields the average-step (response-level) confidence

$\mathcal{C}_{\mathrm{AvgStep}}^{(i)} = \frac{1}{|\mathcal{L}|}\sum_{k=1}^{|\mathcal{L}|} \mathcal{C}_{s_k}^{(i)}$

This layered approach enables detailed assessment of the model's confidence trajectory during multi-step reasoning.
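To make the formulas concrete, here is a minimal Python sketch. It assumes per-token top-$k$ log-probabilities have already been collected from the decoder and the response has been segmented into reasoning steps; the function names are hypothetical, not taken from the paper's code.

```python
def token_confidence(topk_logprobs):
    """Token-level confidence C_t: negative mean log-probability
    over the top-k candidate tokens at one decoding step."""
    return -sum(topk_logprobs) / len(topk_logprobs)

def step_confidence(step_topk_logprobs):
    """Step-level confidence C_{s_k}: mean token confidence over
    the N_k tokens that make up one reasoning step."""
    token_scores = [token_confidence(tok) for tok in step_topk_logprobs]
    return sum(token_scores) / len(token_scores)

def avg_step_confidence(steps):
    """Response-level confidence C_AvgStep: mean of the step
    confidences over all |L| reasoning steps of one answer o_i.

    `steps` is a list of reasoning steps, each a list of per-token
    top-k log-probability lists."""
    step_scores = [step_confidence(step) for step in steps]
    return sum(step_scores) / len(step_scores)
```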

2. Confidence-Weighted Pseudo-Label Deduction

Utilizing the computed confidence scores, SCOPE replaces the frequency-based majority-voting paradigm common in unsupervised RL reward inference. Instead, the pseudo-label $o^*$ for a group of outputs $\mathcal{G}$ is determined by a weighted aggregation:

$o^* = \arg\max_{y}\; \sum_{i=1}^{|\mathcal{G}|} \mathcal{C}_{\mathrm{AvgStep}}^{(i)}\,\mathbf{1}\bigl[\mathrm{Ans}(o_i)=y\bigr]$

where $\mathrm{Ans}(o_i)$ denotes the extracted final answer of output $o_i$. This procedure selects pseudo-labels based on both model consensus and the inherent confidence of individual reasoning traces, giving higher supervisory weight to higher-certainty rationales. The approach mitigates confirmation bias and allows minority-but-confident correct answers to be included (Wang et al., 17 Dec 2025).
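A minimal sketch of the weighted vote, assuming final answers have already been extracted and response-level confidences computed as in Section 1; `confidence_weighted_vote` is a hypothetical helper name.

```python
from collections import defaultdict

def confidence_weighted_vote(outputs):
    """Confidence-weighted pseudo-label deduction.

    `outputs` is a list of (answer, confidence) pairs, where
    `answer` is the extracted final answer Ans(o_i) and `confidence`
    is C_AvgStep for that trace. Returns the pseudo-label o*: the
    answer whose supporters carry the largest total confidence mass,
    rather than simply the most frequent answer."""
    weight = defaultdict(float)
    for answer, conf in outputs:
        weight[answer] += conf
    return max(weight, key=weight.get)
```

For intuition: three traces answering "12" with confidence 0.4 each (total mass 1.2) lose to a single trace answering "15" with confidence 1.5, whereas frequency-based majority voting would pick "12" regardless of certainty.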

3. Dynamic Subgroup Partitioning and Quality-Exploration Tradeoff

A central innovation in SCOPE is the dynamic formation of $n$ disjoint subgroups $S_j$ (each of size $m = |\mathcal{G}|/n$) from $\mathcal{G}$, rather than sharing a single pseudo-label across all outputs. For each subgroup $S_j$, a local consensus label $o_j^*$ is derived. The optimal subgroup size $m^*$ is selected by evaluating candidate sizes $m_k$ based on:

  • Quality rate $q_k$: the proportion of outputs, across all subgroups, that match their subgroup's consensus label.
  • Exploration rate $e_k$: the fraction of distinct consensus labels across subgroups.

After $\ell_2$ normalization, the selection objective for $m^*$ minimizes

$d_k = \sqrt{\lambda\,(1-\hat{q}_k)^2 + (1-\lambda)\,(1-\hat{e}_k)^2}$

with $\lambda = 0.7$ in practice, mediating between high local agreement and label diversity. This Pareto-front tradeoff ensures a balance between reward density and exploratory supervision, which addresses the challenge of reward sparsity intrinsic to global consensus (Wang et al., 17 Dec 2025).
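The selection procedure can be sketched in a few lines of Python. This is illustrative only: it assumes each candidate size divides $|\mathcal{G}|$ evenly, uses a plain majority vote as a stand-in for the full confidence-weighted subgroup consensus, and reads the paper's $\ell_2$ normalization as normalizing the vectors of $q_k$ and $e_k$ across candidates; all names are hypothetical.

```python
import math
import random
from collections import Counter

def select_subgroup_size(answers, candidate_sizes, lam=0.7, rng=random):
    """Choose the subgroup size m* trading off quality rate q_k
    (local agreement) against exploration rate e_k (diversity of
    subgroup consensus labels)."""
    stats = []
    for m in candidate_sizes:
        shuffled = answers[:]
        rng.shuffle(shuffled)
        groups = [shuffled[i:i + m] for i in range(0, len(shuffled), m)]
        labels = [Counter(g).most_common(1)[0][0] for g in groups]
        matches = sum(g.count(lab) for g, lab in zip(groups, labels))
        q = matches / len(answers)            # quality rate q_k
        e = len(set(labels)) / len(groups)    # exploration rate e_k
        stats.append((m, q, e))

    # l2-normalize q and e across candidates (assumed reading of the
    # paper's normalization), then minimize the distance d_k.
    q_norm = math.sqrt(sum(q * q for _, q, _ in stats)) or 1.0
    e_norm = math.sqrt(sum(e * e for _, _, e in stats)) or 1.0
    best_m, best_d = None, float("inf")
    for m, q, e in stats:
        q_hat, e_hat = q / q_norm, e / e_norm
        d = math.sqrt(lam * (1 - q_hat) ** 2 + (1 - lam) * (1 - e_hat) ** 2)
        if d < best_d:
            best_m, best_d = m, d
    return best_m
```

With $\lambda = 0.7$, the objective leans toward local agreement (pseudo-label quality) while still penalizing candidate sizes that collapse all subgroups onto a single label.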

4. Bootstrap-Based Local Consensus

Within each selected subgroup $S_j$, SCOPE implements a bootstrap-resampling procedure to reliably aggregate subgroup-specific pseudo-labels:

  1. Resampling: $B$ bootstrap samples of size $|S_j|$ are drawn (with replacement) from the global pool $\mathcal{G}$.
  2. Voting: the step-wise confidence-weighted voting mechanism is applied to each bootstrap sample to infer a candidate label.
  3. Aggregation: the $B$ candidates are aggregated via another confidence-weighted vote to produce the final subgroup pseudo-label $o_j^*$.
  4. Reward assignment: each $o \in S_j$ is assigned the binary reward

$r(o) = \mathbf{1}\bigl[\mathrm{Ans}(o) = \mathrm{Ans}(o_j^*)\bigr]$

This yields $n$ distinct supervisory targets per update, achieving denser supervision. The accompanying pseudocode in the paper formalizes the iterative policy update procedure incorporating these elements.
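A minimal sketch of steps 1–4, reusing the hypothetical `confidence_weighted_vote` helper from Section 2. The number of bootstrap samples `B` and the choice to weight each bootstrap candidate by its supporters' total confidence mass are assumptions of this sketch, not details taken from the paper.

```python
import random
from collections import defaultdict

def confidence_weighted_vote(pairs):
    """Weighted vote from Section 2 over (answer, confidence) pairs."""
    weight = defaultdict(float)
    for answer, conf in pairs:
        weight[answer] += conf
    return max(weight, key=weight.get)

def bootstrap_subgroup_label(subgroup_size, pool, B=32, rng=random):
    """Bootstrap-based local consensus for one subgroup S_j.

    `pool` is the global pool G as (answer, confidence) pairs.
    Draws B bootstrap samples of size |S_j| with replacement, runs
    the confidence-weighted vote on each, then aggregates the B
    candidate labels with a second weighted vote. `B=32` is a
    placeholder value, assumed for this sketch."""
    candidates = []
    for _ in range(B):
        sample = [rng.choice(pool) for _ in range(subgroup_size)]
        label = confidence_weighted_vote(sample)
        # Assumed weighting: confidence mass supporting the winner.
        mass = sum(c for a, c in sample if a == label)
        candidates.append((label, mass))
    return confidence_weighted_vote(candidates)

def subgroup_rewards(subgroup, consensus_label):
    """Binary reward r(o) = 1[Ans(o) = Ans(o_j*)] for each o in S_j."""
    return [1.0 if answer == consensus_label else 0.0
            for answer, _ in subgroup]
```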

5. Comparative Empirical Performance

SCOPE demonstrates robust empirical gains across multiple math reasoning benchmarks and LLM scales. On challenging datasets such as AIME 2024, AIME 2025, AMC, and MATH-500, SCOPE consistently outperforms the baseline TTRL (majority-voting) approach. The strongest results are reported with Qwen3-8B:

Model           | AIME 2024 | AIME 2025 | AMC   | MATH-500 | Avg
TTRL (baseline) | 47.13     | 27.40     | 68.55 | 89.74    | 58.21
w/ SCOPE        | 52.70     | 31.00     | 74.09 | 91.01    | 62.20
Δ (relative)    | +11.8%    | +13.1%    | +8.1% | +1.4%    | +6.9%

Notable improvements include a 13.1% relative increase on AIME 2025 and 8.1% on AMC. Smaller-scale models such as Qwen2.5-Math-1.5B show similar double-digit average improvements (≈ +11.9%). These results substantiate the utility of step-wise confidence and dynamic subgroup supervision in recovering correct minority outputs and delivering more reliable reward signals (Wang et al., 17 Dec 2025).

6. Context and Implications in Test-Time Reinforcement Learning

SCOPE advances TTRL by addressing the two key deficiencies of traditional majority-voting pseudo-labeling: confirmation bias and reward sparsity. By leveraging step-wise model confidence, SCOPE promotes high-quality, minority-consistent reasoning paths that are typically underrepresented in pure frequency-based voting. The dynamic subgroup partitioning and local consensus approach increases reward density and drives exploratory diversity. A plausible implication is that SCOPE's methodological innovations—especially its confidence-weighted aggregation and Pareto-guided subgrouping—could generalize to other domains where unsupervised RL reward modeling must operate without reliable, verifiable reward signals (Wang et al., 17 Dec 2025).

References (1)

  1. Wang et al., 17 Dec 2025.
