SCOPE: Confidence-Weighted Pseudo-Labeling
- The paper introduces a framework that uses step-wise confidence and dynamic subgrouping to replace majority voting, yielding relative gains of up to 13.1% on math reasoning benchmarks.
- It employs multi-granularity confidence computation and bootstrap-based local consensus to deduce more reliable pseudo-labels, mitigating confirmation bias.
- Empirical results on math reasoning benchmarks (AIME, AMC, MATH-500) show that SCOPE significantly improves reward density and exploration in test-time reinforcement learning.
Subgroup-Specific Step-Wise Confidence-Weighted Pseudo-Label Estimation (SCOPE) is a reward inference framework for test-time reinforcement learning (TTRL) that leverages model confidence across reasoning steps and dynamic subgrouping to generate more reliable and informative pseudo-labels. SCOPE addresses critical issues of majority voting strategies in TTRL—specifically confirmation bias and sparse reward signals—by combining step-wise confidence assessment, subgroup-specific pseudo-labeling, and bootstrap-based local consensus, leading to enhanced reward density and improved exploration in LLM reasoning tasks (Wang et al., 17 Dec 2025).
1. Step-Wise Confidence Formalism
SCOPE introduces a multi-granularity confidence computation to quantify model uncertainty at the token, reasoning-step, and response levels. For a predicted token at decoding step $t$, token-level confidence $c_t$ is defined as the negative average log-probability of the top-$k$ decoded token probabilities:

$$c_t = -\frac{1}{k}\sum_{i=1}^{k}\log p_t^{(i)},$$

where $p_t^{(i)}$ is the model probability of the $i$-th highest-probability token at step $t$. Higher $c_t$ indicates greater certainty. For each reasoning step $s$ (comprising tokens $t \in s$) in answer $y_j$, the step confidence is:

$$C_s = \frac{1}{|s|}\sum_{t \in s} c_t.$$

Averaging over all steps $S_j$ within $y_j$ yields the average-step (response) confidence:

$$\bar{C}(y_j) = \frac{1}{|S_j|}\sum_{s \in S_j} C_s.$$

This layered approach enables detailed assessment of the model's confidence trajectory during multi-step reasoning processes.
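A minimal sketch of this three-level computation in Python, assuming access to per-position top-$k$ log-probabilities (as exposed by most inference APIs) and precomputed step boundaries; the function names and the `step_boundaries` representation are illustrative, not from the paper:

```python
def token_confidence(topk_logprobs: list[float]) -> float:
    """Negative mean log-probability of the top-k candidates at one
    decoding position; a peaked distribution yields a higher value."""
    return -sum(topk_logprobs) / len(topk_logprobs)

def response_confidence(per_token_topk_logprobs: list[list[float]],
                        step_boundaries: list[int]) -> float:
    """Average-step confidence for one sampled response.

    per_token_topk_logprobs[t]: top-k log-probs at position t.
    step_boundaries: end index (exclusive) of each reasoning step.
    """
    token_confs = [token_confidence(lp) for lp in per_token_topk_logprobs]
    step_confs, start = [], 0
    for end in step_boundaries:  # mean token confidence per step
        step_confs.append(sum(token_confs[start:end]) / (end - start))
        start = end
    return sum(step_confs) / len(step_confs)  # mean over steps
```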
2. Confidence-Weighted Pseudo-Label Deduction
Utilizing the computed confidence scores, SCOPE replaces the frequency-based majority-voting paradigm common in unsupervised RL reward inference. Instead, the pseudo-label for a group of $N$ outputs $\{y_j\}_{j=1}^{N}$ is determined by a weighted aggregation:

$$\hat{y} = \arg\max_{a}\sum_{j=1}^{N}\bar{C}(y_j)\,\mathbb{1}\!\left[\operatorname{ans}(y_j) = a\right],$$

where $\operatorname{ans}(y_j)$ denotes the extracted final answer of $y_j$. This procedure derives pseudo-labels from both model consensus and the inherent confidence of individual reasoning traces, giving higher supervisory weight to higher-certainty rationales. The approach mitigates the confirmation bias of pure frequency voting and facilitates the inclusion of minority-but-confident correct answers (Wang et al., 17 Dec 2025).
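A minimal sketch of the weighted vote, assuming each response has already been reduced to an extracted answer string and a scalar response confidence $\bar{C}(y_j)$; the function name and example values are illustrative:

```python
from collections import defaultdict

def confidence_weighted_vote(answers: list[str],
                             confidences: list[float]) -> str:
    """Return the answer whose supporters carry the largest total
    response confidence, rather than the largest raw count."""
    score = defaultdict(float)
    for ans, conf in zip(answers, confidences):
        score[ans] += conf
    return max(score, key=score.get)

# A confident minority can outvote an uncertain majority:
# total weight for "42" is 3.9 vs. 3.0 for "41".
print(confidence_weighted_vote(["41", "41", "41", "42", "42"],
                               [0.9, 1.0, 1.1, 2.0, 1.9]))  # -> 42
```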
3. Dynamic Subgroup Partitioning and Quality-Exploration Tradeoff
A central innovation in SCOPE is the dynamic formation of disjoint subgroups $\{G_i\}$ (each of size $m$) from the $N$ sampled outputs, rather than sharing a single pseudo-label across all outputs. For each subgroup $G_i$, a local consensus label $\hat{y}_i$ is derived. The optimal subgroup size $m^*$ is selected by evaluating candidate sizes $m$ based on:
- Quality rate $Q(m)$: the proportion of outputs across all subgroups that match their subgroup's consensus label.
- Exploration rate $E(m)$: the fraction of distinct consensus labels across subgroups.
After normalization, the selection objective for $m^*$ minimizes a weighted combination of the two criteria,

$$m^* = \arg\min_{m}\left[\lambda\bigl(1 - \tilde{Q}(m)\bigr) + (1 - \lambda)\bigl(1 - \tilde{E}(m)\bigr)\right],$$

with a fixed weight $\lambda$ in practice, mediating between high local agreement and label diversity. This Pareto-front tradeoff ensures a balance between reward density and exploratory supervision, which addresses the challenge of reward sparsity intrinsic to global consensus (Wang et al., 17 Dec 2025); a sketch of the size search follows below.
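A simplified sketch of the size search, reusing `confidence_weighted_vote` from the previous block; for brevity it skips the normalization of $Q$ and $E$ across candidates, drops any trailing partial subgroup, and uses an assumed weight `lam` rather than the paper's setting:

```python
def subgroup_labels(answers, confidences, m):
    """Consensus label of each disjoint, size-m subgroup."""
    return [confidence_weighted_vote(answers[i:i + m],
                                     confidences[i:i + m])
            for i in range(0, len(answers) - m + 1, m)]

def select_subgroup_size(answers, confidences, candidates, lam=0.5):
    """Pick m* minimizing lam*(1 - Q(m)) + (1 - lam)*(1 - E(m))."""
    best_m, best_score = None, float("inf")
    for m in candidates:
        labels = subgroup_labels(answers, confidences, m)
        covered = len(labels) * m
        agree = sum(answers[i] == labels[i // m] for i in range(covered))
        q = agree / covered                 # quality rate Q(m)
        e = len(set(labels)) / len(labels)  # exploration rate E(m)
        score = lam * (1 - q) + (1 - lam) * (1 - e)
        if score < best_score:
            best_m, best_score = m, score
    return best_m
```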
4. Bootstrap-Based Local Consensus
Within each selected subgroup $G_i$, SCOPE implements a bootstrap-resampling procedure to reliably aggregate subgroup-specific pseudo-labels:
- Resampling: $B$ bootstrap samples of size $m^*$ are drawn (with replacement) from the global pool of $N$ responses.
- Vote: The step-wise confidence-weighted voting mechanism is applied to each bootstrap sample to infer a candidate label.
- Aggregation: The $B$ candidate labels are aggregated via another confidence-weighted vote to produce the final subgroup pseudo-label $\hat{y}_i$.
- Reward Assignment: Each output $y_j \in G_i$ is assigned a binary reward $r_j = \mathbb{1}\!\left[\operatorname{ans}(y_j) = \hat{y}_i\right]$.
This yields up to $\lfloor N/m^* \rfloor$ distinct supervisory targets per update, achieving denser supervision than a single global label. The paper's accompanying pseudocode formalizes the iterative policy update procedure incorporating these elements; a simplified sketch of the bootstrap step is given below.
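A simplified sketch of the bootstrap vote and reward assignment, again reusing `confidence_weighted_vote`; `num_bootstrap` and the way candidate labels are weighted in the final vote are assumptions for illustration, not specifics from the paper:

```python
import random
from collections import defaultdict

def bootstrap_subgroup_label(pool_answers, pool_confidences,
                             m, num_bootstrap=100):
    """Vote within num_bootstrap resamples of size m drawn (with
    replacement) from the global pool, then aggregate the candidate
    labels weighted by the confidence mass that backed each one."""
    candidate_weight = defaultdict(float)
    indices = range(len(pool_answers))
    for _ in range(num_bootstrap):
        sample = random.choices(indices, k=m)  # with replacement
        ans = [pool_answers[i] for i in sample]
        conf = [pool_confidences[i] for i in sample]
        label = confidence_weighted_vote(ans, conf)
        candidate_weight[label] += sum(
            c for a, c in zip(ans, conf) if a == label)
    return max(candidate_weight, key=candidate_weight.get)

def subgroup_rewards(group_answers, pseudo_label):
    """Binary reward per output in the subgroup."""
    return [1.0 if a == pseudo_label else 0.0 for a in group_answers]
```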
5. Comparative Empirical Performance
SCOPE demonstrates robust empirical gains across multiple math reasoning benchmarks and LLM scales. On challenging datasets such as AIME 2024, AIME 2025, AMC, and MATH-500, SCOPE consistently outperforms the baseline TTRL (majority-voting) approach. The strongest results are reported with Qwen3-8B:
| Model | AIME 2024 | AIME 2025 | AMC | MATH-500 | Avg |
|---|---|---|---|---|---|
| TTRL (baseline) | 47.13 | 27.40 | 68.55 | 89.74 | 58.21 |
| w/ SCOPE | 52.70 | 31.00 | 74.09 | 91.01 | 62.20 |
| Δ (relative) | +11.8% | +13.1% | +8.1% | +1.4% | +6.9% |
Notable improvements include a 13.1% relative increase on AIME 2025 and 8.1% on AMC; the Δ row is computed against the TTRL baseline, e.g. $(31.00 - 27.40)/27.40 \approx +13.1\%$ on AIME 2025. Smaller-scale models such as Qwen2.5-Math-1.5B show similar double-digit average relative improvements (+11.9%). These results substantiate the utility of step-wise confidence and dynamic subgroup supervision in recovering correct minority outputs and delivering more reliable reward signals (Wang et al., 17 Dec 2025).
6. Context and Implications in Test-Time Reinforcement Learning
SCOPE advances TTRL by addressing the two key deficiencies of traditional majority-voting pseudo-labeling: confirmation bias and reward sparsity. By leveraging step-wise model confidence, SCOPE promotes high-quality, minority-consistent reasoning paths that are typically underrepresented in pure frequency-based voting. The dynamic subgroup partitioning and local consensus approach increases reward density and drives exploratory diversity. A plausible implication is that SCOPE's methodological innovations—especially its confidence-weighted aggregation and Pareto-guided subgrouping—could generalize to other domains where unsupervised RL reward modeling must operate without reliable, verifiable reward signals (Wang et al., 17 Dec 2025).