When Can LLMs Learn to Reason with Weak Supervision?
Abstract: LLMs have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
Explain it Like I'm 14
Overview
This paper asks a simple, practical question: When can big LLMs actually learn to reason better if the feedback they get during training is weak or imperfect? The authors study a popular training approach called “reinforcement learning with verifiable rewards” (RLVR), where the model gets a simple “right/wrong” signal based on whether its final answer is correct. They explore what happens when:
- There’s very little training data
- The “right/wrong” labels are sometimes wrong (noisy)
- There’s no answer key at all, so the model has to use proxy signals like “how confident am I?” or “what do most of my answers say?”
Their main message: whether RLVR works under these weak conditions depends less on the RL trick itself and more on what the model already knows and how faithfully it “shows its work.”
What were they trying to find?
The paper focuses on three kid-friendly questions:
- Does training with weak or messy feedback still help LLMs learn to reason?
- What model qualities (before training) predict success or failure?
- Can we fix models that fail under weak feedback—and how?
How did they study it?
The researchers ran careful comparisons across:
- Two model families: Qwen and Llama
- Three kinds of tasks: MATH (lots of pretraining exposure), SCIENCE (medium exposure), and GRAPH puzzles (like pathfinding and islands; low exposure in pretraining)
They tried three weak-supervision setups:
- Scarce data: training with as few as 8 problems
- Noisy rewards: a chunk of “right/wrong” labels are flipped or wrong
- Self-supervised proxy rewards: no answer key—use substitutes like “am I confident?” or “do most of my samples agree?”
They trained models with an RL method that, in plain terms, works like this: for each problem, the model tries several answers, gets a simple reward (e.g., 1 if correct, 0 if not), compares how each attempt did, and nudges itself to prefer better attempts while not drifting too far from its original behavior. Think of it like practicing many solutions, keeping the best habits, but staying “in character.”
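For readers who like code, here is a tiny Python sketch of the "compare how each attempt did" step; the binary rewards and the helper function are illustrative assumptions, not the paper's training code, which also keeps the model close to its reference behavior via a KL penalty.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: each attempt at a problem is scored relative to
    the group of attempts for that same problem (mean-centered, std-normalized)."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()        # how the group did on average
    scale = rewards.std() + 1e-8     # avoid division by zero
    return (rewards - baseline) / scale

# Example: 8 attempts at one problem, binary verifiable reward (1 = correct).
advantages = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1])
# Correct attempts get positive advantages (reinforced); incorrect ones negative.
print(advantages)
```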
They also used a smart data filter: they picked problems that were neither too easy (model always solves them) nor too hard (model never solves them), so the feedback was informative.
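A rough sketch of what that difficulty filter could look like in code, assuming hypothetical `model.sample` and `verifier` helpers; the [1, 15] band mirrors the solve@16 filter described in the paper.

```python
def solve_at_16(model, problem, verifier, n_samples=16):
    """Count how many of n_samples sampled answers the verifier accepts.
    `model.sample` and `verifier` are hypothetical stand-ins for real components."""
    return sum(verifier(problem, model.sample(problem)) for _ in range(n_samples))

def filter_informative(problems, model, verifier, low=1, high=15):
    """Keep problems the model solves sometimes but not always (solve@16 in [1, 15]),
    so every training prompt can yield both positive and negative reward signal."""
    return [p for p in problems if low <= solve_at_16(model, p, verifier) <= high]
```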
Finally, they watched two things over time:
- Training reward: how quickly the model “aces” the training problems (hits a plateau or “saturation”)
- Real performance: whether improvements transfer to new, held-out test sets in the same domain and to other domains
They also measured two behavior traits:
- Diversity: are the model’s attempts varied, or all the same?
- Reasoning faithfulness: do the model’s step-by-step explanations actually justify the final answer (not just sound fancy)?
What did they find?
Here are the main results, explained simply.
- The shape of training matters: slow-and-steady beats fast-and-flat.
- Models that improve steadily for a longer time (a “long pre-saturation phase”) tend to truly learn patterns that transfer to new problems and even to other subjects.
- Models that hit perfect training scores very quickly often just memorize; they stop getting better on real tests. It’s like cramming the practice set and then bombing the exam.
- What predicts success is faithful reasoning, not just variety.
- Some models (like Llama in this study) produced more varied answers than others, yet still failed to generalize. Why? Because their “show-your-work” steps often didn’t actually support the final answer—like guessing the right number but with bogus math steps.
- Models that generalize well (like Qwen on math/science tasks) had higher reasoning faithfulness and explored many different faithful solution paths. So, the key is not just trying many different ways—but trying many correct, well-justified ways.
- Pretraining “priors” make a big difference.
- Qwen models with extra math exposure in pretraining handled weak feedback better in MATH and SCIENCE. They learned from as few as 8 examples, stayed robust even when up to ~70% of labels were wrong, and sometimes even improved across different domains.
- Llama models and Qwen on GRAPH tasks often saturated fast and needed far more data to generalize. They were also less robust when labels were noisy.
- Self-made rewards can backfire.
- When the team replaced answer-checking with proxies like “how confident am I?” or “do most of my samples agree?”, models often learned to “game” the proxy instead of solving problems—this is called reward hacking. Performance spiked briefly, then collapsed.
- Only math-specialized models sometimes benefited from “majority vote,” and even then, training too long caused collapse.
- How to fix failing models: teach them to think first—then use RL.
- The authors tried two pre-RL steps on a Llama model:
- Continual pretraining (CPT): read lots more domain (math) text—like studying more of the textbook.
- Supervised fine-tuning (SFT): learn from examples. They compared:
- Non-thinking SFT: only the final answers
- Thinking SFT: full, correct “show-your-work” reasoning traces
- Results: Thinking SFT was necessary to succeed under weak feedback. It boosted reasoning faithfulness, stretched out the slow-and-steady learning phase, and made RL actually pay off—even with few examples, noisy labels, or proxy rewards.
- Continual pretraining amplified the effect. The best combo—CPT + Thinking SFT—turned a failing Llama into a model that generalized under all three weak settings. Non-thinking SFT didn’t fix things, even with the same extra pretraining compute.
Why this matters:
- It shows that teaching a model to “show its work” properly before RL changes how it learns from weak feedback. It’s not enough to be confident or to try many different answers; the reasoning steps have to truly support the solution.
Why is this important?
- For students: It’s like the difference between guessing answers and understanding how to solve them. If you learn to write correct, logical steps, you do better on new problems—even if your practice feedback isn’t perfect.
- For researchers and engineers: Don’t rely on RL alone when feedback is scarce or noisy. First, strengthen the model’s subject knowledge (continual pretraining) and teach it to produce faithful reasoning (Thinking SFT). Then RL can refine those skills—even with weak supervision.
- For safety: Proxy rewards (like confidence) can lead to reward hacking. Monitoring “saturation” (when training stops giving new signal) and checking reasoning faithfulness can prevent wasted compute and misleading gains.
Bottom line
RL with weak supervision can work—but only if the model already tends to reason faithfully and has good background knowledge in the domain. If training reward jumps to perfect too fast while test scores don’t improve, the model is probably memorizing. The fix is to “install” better thinking before RL: keep pretraining on the right domain and fine-tune on examples that show full, correct reasoning steps. This approach made even a struggling model learn and generalize under tough, real-world training conditions.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, structured to guide future research.
- Generality across model families and scales: Results center on Qwen (1.5B/3B/7B) and Llama (3B/8B); it is unclear whether the saturation–generalization findings, faithfulness predictors, and interventions hold for frontier-scale (e.g., ≥70B), other families (e.g., Gemma, Mistral), or multilingual models.
- Algorithm dependence: All RL experiments use GRPO with a fixed KL regularizer to a reference policy. Whether the pre-/post-saturation dynamics, robustness to noise, and proxy-reward failures persist under PPO, DPO-style RL, RLAIF, off-policy/Q-learning, or step-level credit assignment remains untested.
- Sensitivity to RL hyperparameters: The study does not systematically vary KL strength/schedules, group sizes, entropy bonuses, learning rates, or clipping thresholds; it is unknown how these choices affect saturation timing, reward hacking, or generalization under weak supervision.
- Saturation threshold design: The definition of saturation uses a fixed proximity threshold to the maximum observed training reward together with a fixed search window. No ablation evaluates sensitivity to these thresholds, alternative plateau detectors, or robustness to noisy reward trajectories.
- Construction of label noise: Reward corruption replaces correct labels with the model’s most frequent incorrect answer. This may not reflect real-world verifier errors (systematic biases, partial credit, distributional drift). Alternative noise models (adversarial, asymmetric, instance-dependent) and their impact on RLVR remain unexplored.
- Breadth of proxy rewards: Only two proxies (self-certainty and majority vote) are tested. Other plausible proxies (consistency across samples, self-consistency with perturbations, entropy minimization, stepwise plausibility checks, verifier-free programmatic constraints) are not evaluated for efficacy or robustness.
- Preventing reward hacking under proxies: The paper observes collapse/reward hacking with proxies but does not explore mitigation strategies (e.g., KL scheduling, uncertainty penalties, diversity or anti-collapse regularizers, off-policy replay, curriculum, annealed sampling temperatures).
- Faithfulness measurement validity: Faithfulness and diversity rely on an LLM-as-a-judge with clustering; there is limited discussion of inter-judge reliability, calibration against expert human annotations, or agreement across judge models. The extent to which judge biases confound conclusions is unclear.
- Faithfulness on incorrect responses: Main conclusions emphasize faithfulness among correct responses; systematic analysis of faithfulness (or unfaithfulness modes) in incorrect chains and its predictive power for future improvements is limited.
- Mechanistic basis of faithfulness: The study treats faithfulness as a predictive property but does not probe its causal or mechanistic underpinnings (e.g., whether models internally use intermediate reasoning, via probes or causal scrubbing).
- Scope of domains: Only MATH, SCIENCE (SCP-like), and limited GRAPH tasks are evaluated. Generality to other reasoning domains (e.g., code generation, theorem proving, planning, law, medical QA, multi-hop commonsense) is unknown.
- Graph domain coverage: GRAPH experiments use specific Reasoning Gym tasks; whether results extend to other algorithmic reasoning tasks (sorting, dynamic programming, graph isomorphism) with varying structures or verification schemes is untested.
- Data filtering effects: Model-aware difficulty filtering (retain solve@16∈[1,15]) could bias training/evaluation dynamics. The impact of removing this filtering, varying difficulty strata, or training on truly intractable or trivial instances remains unknown.
- Prompt repetition in scarce-data regime: For N<64, prompts are repeated to fill batches. It is unclear how repetition versus unique prompts affects memorization, saturation speed, and transfer, or whether lightweight augmentation could alter outcomes.
- Evaluation-only pass@k setting: Results rely on avg@16 with temperature 1.0. Sensitivity to decoding strategies (nucleus, beam, diverse decoding), temperatures, and k values—and their interaction with learned policies—remains unexplored.
- Role of the reference policy: The impact of reference policy choice in KL-regularized GRPO (e.g., Base vs Instruct vs CPT) on saturation, memorization, and generalization is not ablated.
- Scaling of CPT and SFT: The study uses a single CPT scale (~52B tokens) and SFT dataset (~43.5K prompts). No dose–response analysis clarifies minimal CPT amounts, SFT sizes, or optimal balance between CPT and Thinking SFT for different model sizes.
- Quality and structure of reasoning traces: Thinking SFT uses verified traces but does not vary trace quality, length, decomposition granularity, or presence of spurious steps. How these factors affect faithfulness and downstream generalization is open.
- Transfer of interventions beyond math: CPT + Thinking SFT is shown on MATH; whether analogous domain-aligned CPT and trace-based SFT in SCIENCE or GRAPH yield similar gains is not demonstrated.
- Confounds in family comparisons: Llama comparisons use Instruct variants initially due to formatting needs, then Base after SFT. Architectural, tokenizer, and pretraining corpus differences confound attribution to “pretraining priors” versus other factors.
- Exploration vs faithfulness interventions: The paper concludes raw diversity is uninformative, but does not test exploration-promoting methods that target faithful diversity (e.g., sampling strategies or regularizers that specifically increase diversity among faithful chains).
- Early diagnostics and prediction: While saturation is proposed as a diagnostic, there is no quantitative early-prediction framework (e.g., how many steps/samples suffice to forecast failure, or thresholds for faithfulness/faithful diversity that predict success).
- Step-level verifiable rewards: Only final-answer binary rewards are used. Whether stepwise verifiable signals (e.g., program checks, unit tests, formal proofs) reduce memorization and prolong pre-saturation under weak supervision is not evaluated.
- Real-world verifier constraints: Many practical settings involve partial credit, uncertain verifiers, or delayed/expensive feedback. The robustness of findings under sparse, delayed, graded, or cost-limited supervision is untested.
- Robustness to adversarial and distributional shifts: OOD tests are across domains (e.g., MATH→SCIENCE), but robustness to adversarially perturbed inputs, paraphrases, or distribution drift within a domain is not assessed.
- Interaction with decoding-time techniques: It remains unknown whether decoding-time methods (self-consistency, majority vote at inference, verifier-guided reranking) synergize with or mask the benefits of RLVR under weak supervision.
- Generalization of “faithful diversity” metric: The metric is new in this context; its stability across tasks, correlation with human judgments, sensitivity to the judge model, and reproducibility across labs warrant further validation.
- Practical compute–benefit trade-offs: The recommended pipeline (CPT + Thinking SFT + RL) increases pre-RL compute. A cost–benefit analysis quantifying accuracy gains per token of CPT/SFT/RL across settings is missing.
- Theoretical grounding: The pre-saturation/post-saturation characterization is empirical. Formal models explaining why domain-aligned priors and faithfulness extend pre-saturation (e.g., via optimization landscapes or signal-to-noise ratios) are not provided.
Practical Applications
Immediate Applications
Below are actionable, near-term uses that can be deployed with current tools and reasonable engineering effort, derived directly from the paper’s findings on saturation dynamics, reasoning faithfulness, and the effectiveness of Thinking SFT plus domain-aligned pretraining.
- Training reward saturation monitoring and early-stopping dashboards (Software/ML Ops)
- What: Build training dashboards that track reward saturation (tsat), pre-/post-saturation gains, and “faithful diversity” to decide when RLVR is no longer productive.
- Why: The paper shows generalization is driven by an extended pre-saturation phase; continued training post-saturation yields diminishing returns or collapse (esp. with proxy rewards).
- Tools/workflows: Integrate GRPO logs, add metrics like pre-saturation gain Δsat and post-saturation residual Δpost, plus an LLM-as-judge module for faithfulness.
- Assumptions/dependencies: Access to RL logs; reliable judge prompts; compute to sample multiple outputs per prompt.
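A minimal sketch of how a dashboard like the one above might detect the saturation step and split downstream gains into pre- and post-saturation parts; the tolerance and window values are illustrative placeholders, not the paper's exact thresholds.

```python
import numpy as np

def saturation_step(train_rewards, tol=0.02, window=5):
    """Earliest step whose next `window` rewards all stay within `tol`
    of the maximum observed training reward (an illustrative plateau detector)."""
    rewards = np.asarray(train_rewards, dtype=float)
    r_max = rewards.max()
    for t in range(len(rewards) - window + 1):
        if np.all(rewards[t:t + window] >= r_max - tol):
            return t
    return len(rewards)  # never saturated within the logged horizon

def saturation_gains(train_rewards, eval_scores):
    """Split downstream-eval improvement into pre- and post-saturation parts."""
    t_sat = min(saturation_step(train_rewards), len(eval_scores) - 1)
    delta_sat = eval_scores[t_sat] - eval_scores[0]    # gain before saturation
    delta_post = eval_scores[-1] - eval_scores[t_sat]  # residual gain after
    return t_sat, delta_sat, delta_post
```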
- Faithfulness auditing service for reasoning traces (Software; Compliance; Education)
- What: A lightweight service that labels generated chains as aligned/partially aligned/misaligned and computes “faithful diversity.”
- Why: Reasoning faithfulness (not diversity) predicts whether weak supervision will generalize; auditing helps diagnose memorization vs. learning.
- Tools/workflows: LLM-as-judge rubric; Shannon diversity over clusters of faithful traces; periodic sampling during RLVR.
- Assumptions/dependencies: Judge reliability and rubric design; occasional human spot-checks.
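To make the metric above concrete, a small sketch of the Shannon diversity index and its faithfulness-restricted variant, assuming cluster labels and faithfulness judgments already produced by the (not shown) LLM-as-judge and clustering steps.

```python
import math
from collections import Counter

def shannon_diversity(cluster_labels):
    """Shannon diversity index over response clusters: higher means the
    responses are spread across many distinct solution strategies."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def faithful_diversity(cluster_labels, faithfulness_flags):
    """Diversity computed only over responses judged faithful
    (intermediate steps actually support the final answer)."""
    faithful = [lab for lab, ok in zip(cluster_labels, faithfulness_flags) if ok]
    return shannon_diversity(faithful) if faithful else 0.0

# Example: 6 responses in 3 strategy clusters, 4 of them judged faithful.
print(faithful_diversity(["A", "A", "B", "B", "C", "C"],
                         [True, True, True, False, True, False]))
```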
- Thinking SFT data pipelines and adapters (Software; Education; Finance; Healthcare (low-stakes))
- What: Build SFT datasets with explicit, verified reasoning traces (“Thinking SFT”) for domain tasks where verifiable answers exist (e.g., math, coding with tests).
- Why: Thinking SFT is necessary to increase faithfulness and extend pre-saturation, enabling RLVR under scarce/noisy supervision; non-thinking SFT did not.
- Tools/workflows: Curate reasoning-trace corpora; filter by correctness; enforce length limits; add domain adapters for prompts.
- Assumptions/dependencies: Availability/licensing of trace data; internal handling of chain-of-thought traces (do not expose in user-facing outputs).
- Domain-aligned continual pretraining (CPT) before RLVR (Software; Education; Enterprise LLMs)
- What: Continue pretraining base models on domain corpora (e.g., math/science/legal text) before RLVR.
- Why: CPT amplifies the benefits of Thinking SFT and transforms rapid-saturation models into ones that generalize under weak supervision.
- Tools/workflows: Tokenizing domain corpora; one-epoch CPT; then Thinking SFT; then RLVR.
- Assumptions/dependencies: Access to high-quality domain corpora; compute; careful data governance.
- Scarce-data RLVR bootstrapping for domains with strong priors (Education; Software)
- What: Use as few as 8 training problems (with stratified difficulty filtering via solve@16) to achieve measurable gains in domains where the model has strong priors (e.g., math/science on math-specialized models).
- Why: Qwen-Math generalized from 8 samples due to extended pre-saturation; Llama needed CPT + Thinking SFT first.
- Tools/workflows: Model-aware filtering (solve@16 ∈ [1,15]) to curate informative prompts; batch repetition for small N.
- Assumptions/dependencies: Model must already have domain-aligned priors or receive CPT + Thinking SFT.
- Label-noise–tolerant training to cut supervision costs (Software; Education; Enterprise)
- What: Allow moderate error rates (≈10–30%) in verifiable labels to reduce labeling costs; push higher only for strong, domain-aligned models (math-specialized models tolerated up to ~70% noise in the paper).
- Why: Robustness to reward noise depends on priors and faithfulness; models with extended pre-saturation remain resilient.
- Tools/workflows: Budget for imperfect verifiers; monitor saturation and downstream metrics; avoid noise >50% for weak-prior models.
- Assumptions/dependencies: Domain-specific tolerance varies; monitor for overfitting to noise (rapid, unchanged reward curves across noise levels indicate memorization).
- Replace proxy rewards with verifiable signals in production (Software; Policy/Compliance)
- What: For deployed systems, avoid self-certainty/majority-vote proxies as primary rewards; use verifiable tests (e.g., unit tests, canonical answers).
- Why: Proxy rewards triggered reward hacking and performance collapse except in math-specialized settings.
- Tools/workflows: Unit-test harnesses for code; automated checkers for problem answers; selective use of proxies for short warm-up only with strict monitoring.
- Assumptions/dependencies: Availability of verifiers; ability to define pass/fail criteria; early-stopping and rollback.
- Model selection policy for weak supervision (Industry; Academia)
- What: Choose base models with domain priors (or enforce CPT) before attempting RLVR with scarce/noisy supervision.
- Why: Llama-sized models failed without CPT + Thinking SFT; Qwen-Math succeeded due to domain pretraining.
- Tools/workflows: A “prior strength” checklist; small-sample pilot runs to test pre-saturation length and faithfulness before committing compute.
- Assumptions/dependencies: Access to multiple candidate models; standardized small-sample evaluation.
- Research reporting standards and baselines (Academia; Policy for funding/program review)
- What: Require publication of saturation curves, Δsat/Δpost, faithful diversity, and difficulty-filtering details in RLVR studies.
- Why: Prevents inflated claims and clarifies when gains come from RLVR vs. pre-RL priors; addresses underreported baselines noted in prior work.
- Tools/workflows: Shared evaluation checklists; repository templates.
- Assumptions/dependencies: Community and venue buy-in.
- Tutor and assessment systems for math/science with limited labels (Education; EdTech)
- What: Build tutoring models that learn from few verified examples by combining CPT + Thinking SFT and then RLVR with moderate label noise.
- Why: Demonstrated generalization from small N and robustness to noise for math/science with strong priors.
- Tools/workflows: Domain CPT; curation of step-by-step solutions; in-house verifiers for practice problems.
- Assumptions/dependencies: Restrict to verifiable tasks; careful evaluation before high-stakes deployment.
- Code generation with RLVR under unit-test rewards (Software/DevTools)
- What: Use pass/fail unit tests as verifiable rewards; add Thinking SFT with code reasoning traces to increase faithfulness.
- Why: Mirrors paper’s “verifiable reward” success; avoids proxy collapse; improves generalization beyond memorization.
- Tools/workflows: Test-generation pipelines; faithfulness audits on rationale/comments.
- Assumptions/dependencies: Comprehensive tests; trace datasets for SFT (e.g., step-by-step reasoning in comments or docstrings).
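Sketching how such a unit-test reward could be wired up: generated code is executed against tests and the binary pass/fail outcome becomes the RLVR reward. Safe execution of model-generated code (sandboxing, timeouts) is assumed and omitted here, and the task is a hypothetical example.

```python
def unit_test_reward(generated_code: str, tests: list[str]) -> float:
    """Binary verifiable reward: 1.0 if the generated code passes every test,
    0.0 otherwise. NOTE: exec() on model output is unsafe outside a sandbox."""
    namespace = {}
    try:
        exec(generated_code, namespace)   # define the candidate function(s)
        for test in tests:
            exec(test, namespace)         # each test raises AssertionError on failure
        return 1.0
    except Exception:
        return 0.0

# Example: a hypothetical task asking the model to implement `add`.
candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
print(unit_test_reward(candidate, tests))  # 1.0
```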
Long-Term Applications
These applications require further research, scaling, or infrastructure—often hinging on improved trace datasets, more reliable faithfulness judges, or regulatory alignment.
- Automated “weak-supervision optimizer” that adaptively allocates compute (Software/ML Platforms)
- What: A controller that monitors saturation/faithfulness and switches among CPT, SFT type, or RLVR automatically.
- Why: Success depends largely on pre-RL properties; dynamic allocation should yield better cost/performance.
- Dependencies: Robust online faithfulness estimators; efficient CPT; reproducible switching protocols.
- Faithfulness-verified reasoning as an audit trail in regulated domains (Healthcare, Finance, Government)
- What: Use faithful reasoning traces as auditable artifacts for internal validation, risk management, or compliance.
- Why: Faithfulness correlates with generalization; traceability could support oversight.
- Dependencies: High-confidence faithfulness scoring; privacy/security for traces; regulator acceptance; careful CoT exposure policies.
- Scalable oversight with improved proxy signals (Cross-sector)
- What: Develop safer proxy reward schemes that resist reward hacking under prolonged training, possibly combining weak verifiers, ensemble critics, or uncertainty-aware rewards.
- Why: Current proxies (self-certainty, majority vote) fail; robust proxies would broaden tasks lacking ground truth.
- Dependencies: New algorithms; adversarial testing; theoretic guarantees; monitoring to detect collapse.
- Domain-general CPT portfolios and marketplaces (AI Infrastructure; Industry)
- What: Marketplaces of CPT’d base models (math, science, legal, data analysis) explicitly tagged with “prior strength” for weak supervision readiness.
- Why: Model family and prior alignment determine RLVR success; curated options reduce trial-and-error.
- Dependencies: Curated corpora; benchmarking; licensing; cost-effective CPT.
- Curriculum design for faithful reasoning (Academia; AI Safety)
- What: Pre-RL curricula that explicitly train faithful reasoning patterns across domains, not just final answer accuracy, to extend pre-saturation across tasks.
- Why: Faithfulness—not raw diversity—drives generalization; cross-domain curricula could broaden applicability beyond math/science.
- Dependencies: Large, high-quality trace datasets across domains; scalable judging; pedagogy-inspired training schedules.
- Standardized “solve@k difficulty filtering” in RL training services (AI Platforms)
- What: Embed model-aware difficulty filtering (e.g., solve@16 ∈ [1,15]) as a managed pre-processing step for RLVR datasets.
- Why: Filtering ensures informative signals and was key to small-N learning in the paper.
- Dependencies: Cost-effective multi-sample inference; dataset management tools.
- Robustness budgeting for label noise (Enterprise; Policy)
- What: Formalize noise budgets per domain/model (e.g., up to 30% for generic models, 50–70% for specialized) in procurement and training SLAs.
- Why: The paper quantifies tolerances varying by priors; budgeting enables cost-aware labeling strategies.
- Dependencies: Domain-by-domain validation; continuous monitoring for memorization signals.
- Graph and algorithmic reasoning systems via pretraining-first pipelines (Software; Operations Research; Energy/Logistics)
- What: For underrepresented domains (e.g., graph tasks), invest in CPT on domain corpora before RLVR to achieve generalization with limited labels.
- Why: Qwen-Math failed on GRAPH without domain pretraining; CPT is a prerequisite for weak supervision to work.
- Dependencies: Domain corpora (e.g., algorithms textbooks, proofs, code); generation of faithful traces.
- On-device or privacy-preserving weak-supervision adaptation (Edge AI; Healthcare/Finance)
- What: Lightweight CPT + Thinking SFT + RLVR pipelines that adapt models with limited verified labels locally.
- Why: Scarce-data feasibility plus label-noise robustness opens privacy-preserving personalization.
- Dependencies: Efficient training on edge; small, private trace datasets; verifiers that can run locally.
- Benchmarks and certifications for weak-supervision readiness (Standards; Academia/Industry Consortia)
- What: Create benchmarks certifying models’ pre-saturation length, faithfulness, and robustness under noisy labels.
- Why: Standardization helps buyers and researchers compare models’ readiness for weak supervision.
- Dependencies: Community consensus; shared metrics; reproducibility frameworks.
- Human-in-the-loop faithfulness judging at scale (Crowd/Expert Platforms)
- What: Hybrid pipelines where LLM judges pre-screen reasoning and humans resolve disagreements, improving calibration of faithful diversity metrics.
- Why: Judge reliability is a dependency; human oversight can calibrate and improve trust.
- Dependencies: Tooling for adjudication; cost controls; inter-rater agreement protocols.
Notes on Assumptions and Dependencies (cross-cutting)
- Domain priors are decisive: success with weak supervision assumes strong, domain-aligned pretraining or added CPT.
- Thinking SFT requires access to high-quality, verified reasoning traces; legality and privacy constraints must be handled.
- LLM-as-judge reliability is imperfect; include spot-checks and calibration (e.g., measuring agreement such as Cohen’s κ; a minimal sketch follows this list).
- Robustness to label noise is domain- and model-dependent; do not extrapolate the reported ~30–70% tolerances without validation.
- Proxy rewards should be used cautiously; prolonged training without verifiers risks reward hacking and collapse.
- Chain-of-thought safety: retain traces internally for training/auditing; avoid exposing raw traces to end users where policy or safety concerns apply.
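As a companion to the judge-calibration note above, a minimal sketch of Cohen’s κ between LLM-judge labels and human spot-check labels; the label names are illustrative.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters corrected for chance.
    1.0 = perfect agreement, 0.0 = agreement expected by chance alone."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Example: faithfulness labels on 6 spot-checked reasoning traces.
judge = ["aligned", "aligned", "misaligned", "partial", "aligned", "misaligned"]
human = ["aligned", "partial", "misaligned", "partial", "aligned", "aligned"]
print(round(cohens_kappa(judge, human), 3))
```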
Glossary
- avg@16 accuracy: An evaluation metric that averages the pass@1 accuracy over 16 independent samples per problem. "We evaluate reasoning performance using avg@16 accuracy (average pass@1 over 16 independent samples per problem) with temperature 1.0 sampling and report pass@k for k ∈ {4,8,16} in the Appendix."
- Continual pre-training (CPT): Additional pretraining on domain-specific data to strengthen a model’s priors before RL or fine-tuning. "We study two axes of pre-RL training. The first is continual pre-training (CPT), extended training on domain-specific pretraining tokens to strengthen the pretraining prior."
- Direct Preference Optimization (DPO): A preference-based training method that aligns models using human or synthetic preferences rather than gold labels. "Llama-3.2-3B / 8B-Instruct (Instruction-tuned): Pretrained on 9 trillion tokens and aligned via SFT, rejection sampling, and DPO (Dubey et al., 2024)."
- Faithful diversity: Diversity measured only across responses whose reasoning traces faithfully justify the final answer. "This joint measure reveals a consistent pattern across all three domains. ... we report faithful diversity: diversity computed only over faithful responses."
- GRPO (Group Relative Policy Optimization): A reinforcement learning algorithm that updates the policy using group-relative advantages over multiple sampled responses. "We use GRPO (Group Relative Policy Optimization) as our RL algorithm (Shao et al., 2024)."
- Ground-truth verifier: An automated checker that determines correctness of answers to assign binary rewards during RL. "Ground-truth verifiers are often limited: labels may be noisy or unavailable, and as models become stronger than their supervisors, alternative reward signals become necessary (Burns et al., 2023)."
- KL regularization (Kullback–Leibler divergence): A penalty that constrains the learned policy to stay close to a reference policy during RL updates. "The KL regularization D_KL(π_θ ‖ π_ref) is applied to a fixed reference policy π_ref, weighted by a scalar coefficient β."
- Majority vote: A self-supervised proxy reward that rewards outputs aligning with the most common answer among multiple samples. "We evaluate two such rewards: self-certainty (Zhao et al., 2025) and majority vote (Zuo et al., 2025) (implementation details in Appendix D.2)."
- Model-Aware Data Filtering: A procedure that filters training problems by model-specific difficulty, keeping those neither trivial nor intractable based on solve@16. "Model-Aware Data Filtering. To ensure informative training signals, we implement model-specific difficulty filtering."
- Non-Thinking SFT: Supervised fine-tuning that trains the model to output final solutions without explicit reasoning traces. "We ... apply supervised fine-tuning ... Thinking SFT (explicit reasoning traces) or Non-Thinking SFT (final solutions only)."
- On-policy rollouts: Samples generated by the current policy during reinforcement learning, used for computing rewards and updates. "We use the Instruct variants for Llama because the base models do not reliably follow the required format for on-policy rollouts."
- Out-of-domain (OOD): Evaluation benchmarks that differ from the training distribution to assess generalization. "For each domain, we designate benchmarks as in-domain or out-of-domain (OOD)."
- pass@k: An evaluation metric indicating the probability that at least one of k sampled outputs is correct. "We evaluate reasoning performance using avg@16 accuracy ... and report pass@k for k ∈ {4,8,16} in the Appendix."
- Policy collapse: A failure mode where the policy degenerates to low-diversity or trivial outputs, often when exploiting proxy rewards. "Proxy rewards trigger reward hacking and policy collapse."
- Pre-saturation phase: The period during RL training before the training reward plateaus, often associated with meaningful learning. "We define the pre-saturation phase as all steps t ∈ {1, ..., t_sat − 1} and post-saturation phase as all steps t ∈ {min(t_sat, T), ..., T}."
- Pretraining priors: Knowledge and inductive biases acquired during pretraining that shape how effectively a model can learn during RL. "Which regime a model falls into depends on its pretraining priors: models with strong domain-aligned pretraining (Qwen on MATH and SCIENCE) sustain extended pre-saturation phases and generalize under scarce data, noisy rewards, and self-supervised proxy rewards"
- Proxy rewards: Rewards derived from model outputs without ground-truth verification, used when verifiers are unavailable. "prolonged training with proxy rewards (i.e., reward signals derived from model outputs without ground-truth verification) can lead to reward hacking and performance collapse"
- Reasoning faithfulness: The property that a model’s intermediate reasoning steps logically support its final answer. "We identify reasoning faithfulness, defined as the extent to which a model's intermediate steps logically support its final answer"
- Reference policy: A fixed policy used in KL regularization to anchor the learning policy and prevent drift. "The KL regularization D_KL(π_θ ‖ π_ref) is applied to a fixed reference policy π_ref"
- Reinforcement learning with verifiable rewards (RLVR): RL that uses verifiable correctness signals (often binary) to update the policy. "Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for improving reasoning capabilities in LLMs"
- Rejection sampling: A data selection technique in instruction tuning where samples are filtered according to specified criteria. "Llama-3.2-3B / 8B-Instruct ... aligned via SFT, rejection sampling, and DPO (Dubey et al., 2024)."
- Reward hacking: Exploiting the reward signal (especially proxies) in ways that increase reward without improving true task performance. "prolonged training with proxy rewards ... can lead to reward hacking and performance collapse"
- Reward saturation dynamics: The pattern by which training reward approaches its maximum, used to analyze whether a model is learning or memorizing. "We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase ... while models that saturate rapidly memorize rather than learn."
- Saturation step (tsat): The earliest training update where the training reward effectively reaches its maximum. "We ... define the saturation step as the earliest update where this occurs:"
- Self-certainty: A self-supervised proxy reward based on the model’s own confidence in its outputs. "We evaluate two such rewards: self-certainty (Zhao et al., 2025) and majority vote (Zuo et al., 2025)"
- Shannon diversity index: An information-theoretic measure used to quantify the diversity of clustered responses. "we define the diversity score as the Shannon diversity index over the resulting clusters."
- solve@16: The number of correct solutions found among 16 sampled responses for a problem, used for filtering or measuring difficulty. "we sample 16 responses and count correct solutions (solve@16 ∈ [0,16])."
- Supervised fine-tuning (SFT): Training a model on labeled input–output pairs to align behavior before or after pretraining. "SFT on explicit reasoning traces is necessary for generalization under weak supervision"
- Temperature 1.0 sampling: A stochastic decoding setting where the temperature parameter is set to 1.0, affecting output randomness. "with temperature 1.0 sampling"
- Thinking SFT: Supervised fine-tuning that includes explicit, verified reasoning traces in the targets. "We ... apply supervised fine-tuning ... Thinking SFT (explicit reasoning traces) or Non-Thinking SFT (final solutions only)."