The Unlearnability Phenomenon in RLVR for Language Models
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of large language models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.
Explain it Like I'm 14
What this paper is about (big picture)
The paper looks at how we teach LLMs to reason better using a method called "Reinforcement Learning with Verifiable Reward" (RLVR). Think of RLVR like training with a coach who checks whether each answer is correct and gives a reward for correct ones. The surprising discovery: even when the coach finds correct attempts to reward, some hard problems still don't get learned. The authors call these "unlearnable" examples and try to figure out why that happens.
What questions the researchers asked
They set out to answer, in simple terms:
- Why do some hard problems stay unlearned even when the model sometimes gets them right and is rewarded for it?
- Is the issue caused by not seeing enough correct tries, by training rules that dampen learning, or by something deeper inside the model?
- Can we fix this by giving the model more practice problems, smaller subproblems, or extra preparation before RL?
How they studied it (methods, explained simply)
To explore this, they trained several LLMs on math problems and watched how learning progressed.
Key ideas, with analogies:
- RLVR: Like a quiz game where the model gives multiple answers ("rollouts") to each question. A checker automatically marks correct answers and rewards them.
- Rollouts: Multiple tries for the same question in one training round.
- PPO, clipping, and KL penalty: Training "safety rules." Clipping is like a speed limit on how fast the model can change; KL is a rubber band that pulls the model back if it changes too much from its earlier behavior.
- Gradients: Tiny nudges that tell the model how to change to get better. "Gradient similarity" measures whether two questions teach similar lessons, like asking, "Does the way we learn from problem A help with problem B?"
- Representation: The model's internal way of understanding and organizing ideas, like its mental map of math strategies.
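The "speed limit" and "rubber band" in the list above can be sketched in a few lines. This is a minimal illustration of a PPO-style clipped term for a single token, not the paper's training code: the function names, `eps=0.2`, `beta=0.01`, and the simple single-sample form of the KL penalty are all assumptions chosen for readability.

```python
import math

def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for one token; larger is better."""
    ratio = math.exp(logp_new - logp_old)             # how far the policy moved
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))   # the "speed limit"
    # Taking the minimum removes any incentive to move past the limit.
    return min(ratio * advantage, clipped * advantage)

def kl_penalty(logp_new, logp_ref, beta=0.01):
    """Single-sample penalty for drifting from the reference model (assumed form)."""
    return beta * (logp_new - logp_ref)

# A rewarded token whose probability already rose by 50%: the value is capped at 1 + eps.
capped = clipped_objective(math.log(1.5), 0.0, advantage=1.0)
# A small move stays inside the limit and is not capped.
uncapped = clipped_objective(math.log(1.1), 0.0, advantage=1.0)
```

Once the ratio leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favors, the term goes flat, so that token contributes no further gradient. That is the mechanism the authors relaxed when testing whether the safety rules were blocking learning.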
What they tested:
- They ensured each hard question had at least one correct attempt in training (oversampling and replay) to see if more "positive examples" would help.
- They relaxed the training "safety rules" (raised the speed limit and loosened the rubber band) to see if rules were blocking learning.
- They checked the quality of the model's step-by-step reasoning, not just the final answer, to see if correct answers sometimes came from shaky reasoning.
- They measured gradient similarity to see whether learning from other problems transfers to a given problem.
- They tried data augmentation: creating similar problems and breaking problems into subproblems, to see if that would help the model learn the original hard ones.
- They compared models that had extra "mid-training" (extra general practice before RL) to see if better preparation improves learning later.
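The gradient-similarity measurement in the list above can be pictured with a toy example. The sketch below assumes each problem has already been reduced to one gradient vector (the paper reports computing gradients with respect to a fixed LoRA adapter); the three-dimensional vectors and the example names are invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def mean_similarity_to_rest(grads):
    """For each example, average its cosine similarity to every other example."""
    scores = {}
    for name, g in grads.items():
        others = [cosine(g, h) for other, h in grads.items() if other != name]
        scores[name] = sum(others) / len(others)
    return scores

# Hypothetical per-example gradients: three that agree and one that points elsewhere.
toy_grads = {
    "easy_1": [1.0, 0.9, 0.1],
    "easy_2": [0.9, 1.0, 0.0],
    "easy_3": [1.0, 1.0, 0.2],
    "hard_x": [0.1, -0.1, 1.0],
}
scores = mean_similarity_to_rest(toy_grads)
lowest = min(scores, key=scores.get)  # the candidate "gradient outlier"
```

In this toy setup `hard_x` has the lowest mean similarity, which mirrors the pattern the paper reports for unlearnable examples: updates driven by the other problems barely move the model in the direction that `hard_x` needs.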
What they found and why it matters
Main findings:
- A real "unlearnability" group exists: Among hard problems, a large chunk (often close to half in their settings) did not improve, even though correct attempts were present and rewarded during training.
- Not just a reward shortage: Giving every hard question at least one correct attempt per training step didn't fix it. Even training only on these hard questions, using many more attempts, or distilling correct solutions didn't solve it.
- Not blocked by safety rules: Removing or relaxing the training rules (clipping and KL) didn't help the unlearnable problems.
- Gradient outliers: Unlearnable problems had very low gradient similarity to the rest of the training set. In other words, the "lessons" learned from other problems didn't transfer to them. Easy problems had highly similar gradients, which is what makes them learn smoothly.
- "Fake" or fragile reasoning: For unlearnable problems, even the correct answers often came with low-quality or inconsistent step-by-step reasoning. That suggests the model sometimes uses shortcuts or brittle tricks rather than solid, generalizable reasoning.
- Data augmentation didn't transfer back: The model could learn the new similar problems or subproblems themselves, but this did not translate into learning the original unlearnable problems. Semantically similar problems were not always similar in the model's "optimization space," so practicing on them didn't fix the core issue.
- Mid-training helps representations: Models that got extra general practice before RL showed higher gradient similarity on hard problems, meaning their internal representations were better aligned to learn from RL later.
Why this matters:
- It shows a fundamental limit: Just giving positive rewards (correct answers) isn't enough if the model's internal representation of a problem is off. RL alone often can't "repair" these flaws.
- It highlights the importance of preparation: Extra mid-training can reshape the model's mental map so RL works better afterward.
What this could mean going forward
- For training pipelines: Don't rely on RL alone to build reasoning. Invest in mid-training (stronger base skills, broader practice) to align the model's internal representations before RL.
- For reward design: Checking only final answers can let the model "hack" the reward with shortcut reasoning. Incorporating signals about the quality of intermediate steps may produce more reliable learning.
- For data strategy: Generating "similar" problems doesn't guarantee transfer. We need ways to create training data that's similar not only in meaning but also in how it shapes the model's learning (its gradients).
- For research: Understanding which examples produce transferable gradients, and how to raise gradient similarity for tough cases, could lead to more robust reasoning models.
In short: Some tough problems stay unlearned in today's RL setups, not because of missing rewards or strict rules, but because the model's internal understanding isn't aligned. Strengthening those representations before RL (through mid-training) seems key to making reasoning training truly stick.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains uncertain or unexplored in the paper, phrased to guide concrete follow-up research:
- Domain generality: Does the unlearnability phenomenon persist beyond mathematical reasoning (e.g., coding, multi-hop QA, planning/agent tasks, scientific reasoning)?
- Scale dependence: How does unlearnability change with larger-capacity models (e.g., ≥7B to 70B), different architectures (MoE, long-context variants), or instruction- vs base-initialized policies?
- Sensitivity to the operational definition: How robust are findings to different pass@k thresholds, sample sizes (N), and convergence criteria, especially for near-boundary examples?
- Alternative RL objectives: Do algorithms beyond GRPO/PPO (e.g., off-policy actor-critic, value-based, model-based RL, implicit KL, trust-region variants) reduce unlearnability?
- Reward design beyond binary outcomes: Can process-based or step-verified rewards (PRMs, partial credit, intermediate constraints) convert unlearnable examples into learnable ones?
- Exploration interventions: Do targeted exploration schemes (entropy scheduling, guided search, self-ask, beam search mixing, diversity-promoting rollouts) alter learnability trajectories?
- Correct rollout diversity: Does training with multiple distinct correct solution paths (rather than repeated near-duplicates) improve representation alignment and learnability?
- Predictive diagnostics: Can we predict unlearnable examples a priori using cheaply computable proxies (e.g., prompt features, token entropy, early-loss curvature, low-cost gradient sketches)?
- Online detection and routing: Can real-time indicators (e.g., streaming gradient similarity, advantage statistics) trigger specialized handling (alternative objectives, tools, or curricula) for prospective unlearnable examples?
- Taxonomy and causal factors: Which problem attributes (e.g., algebraic structure, symbolic depth, length, compositionality, error modes) correlate causally with gradient outlier status?
- Representation-level mechanism: What specific internal features or layers fail on unlearnable examples? Can layer-wise or subspace analyses identify where representation misalignment arises?
- Validity of gradient-similarity proxy: How do conclusions change when using full-parameter gradients vs LoRA proxies, different layers, token-level granularity, or Jacobian/Hessian-based similarity?
- Training-time evolution: How do gradient similarities and reasoning quality evolve across many more training steps or restarts? Are there late-phase transitions that rescue some unlearnable cases?
- Gradient interference mitigation: Do methods like orthogonal gradient surgery (e.g., PCGrad), conflict-aware batching, or per-group optimizers reduce cross-example interference for outliers?
- Reference policy and KL schedules: Would alternative reference models, adaptive KL targets, or trust-region schedules change learnability, especially for examples with initially low policy mass?
- Sampling/grouping strategies: How do different k, grouping strategies, prioritized sampling by predicted similarity, or active selection impact the emergence of unlearnability?
- Data augmentation that targets optimization space: Can we design augmentation that is gradient-aligned (e.g., process-preserving transformations, formal symmetries, programmatic perturbations) rather than only semantically similar?
- Process supervision and anti-reward hacking: Does enforcing correctness of intermediate steps (or penalizing incoherent chains) repair "fake reasoning" and increase gradient alignment for unlearnable examples?
- Tool-use and hybrid pipelines: Can tool-augmented rollouts (symbolic solvers, calculators, verifiers) or self-consistency checks help overcome representation gaps on unlearnable problems?
- Mid-training design space: Which mid-training datasets, mixtures, and objectives (e.g., math-specific corpora, program-of-thought, contrastive consistency, synthetic curricula) most improve gradient similarity on hard examples?
- Interaction with SFT: Beyond brief SFT checks, how do different SFT regimes (process SFT, weak-to-strong distillation, rationale distillation) interplay with RL to address unlearnability?
- Dynamic sampling filter effects: The pipeline drops zero-variance prompts; does this filtering bias optimization, masking examples that could become learnable later with different schedules?
- Robustness of reasoning-quality scoring: How sensitive are conclusions to the use of GPT-5-mini as the judge? Do human evaluations or alternative rubrics confirm the reasoning-quality gaps?
- Generalization to non-verifiable or learned rewards: Does unlearnability appear (or worsen) when rewards are model-based (RMs/AIF) or noisy, where "correctness" is not strictly verifiable?
- Architectural interventions: Would architectures with explicit planning/scratchpads, memory modules, or MoE routing reduce the incidence of gradient outliers?
- Theoretical characterization: What conditions make an example a persistent gradient outlier under outcome-based RL? Can we formalize links between representation geometry, reward sparsity, and learnability?
- Operational scalability: How can large-scale training monitor and act on gradient similarity or outlier status efficiently without prohibitive compute?
- Reproducibility and variance: How variable is the unlearnability set across seeds, data orders, and trainers? Is the intersection-over-runs approach conservative enough for pipeline decisions?
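The dynamic sampling filter raised in the list above is easier to reason about with the advantage computation in view. The sketch below assumes binary rewards and a group-standardized advantage (the paper describes the advantage as the standardized reward); the use of population standard deviation, the `eps` constant, and the function names are assumptions rather than details confirmed by the source.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

def has_learning_signal(rewards):
    """A prompt is kept only if its rollouts did not all receive the same reward."""
    return len(set(rewards)) > 1

mixed = [1, 0, 0, 0]      # one correct rollout among four
all_wrong = [0, 0, 0, 0]  # zero-variance group

mixed_adv = group_advantages(mixed)          # correct rollout gets a positive advantage
all_wrong_adv = group_advantages(all_wrong)  # every advantage is exactly zero
kept = [r for r in (mixed, all_wrong) if has_learning_signal(r)]
```

A zero-variance group produces all-zero advantages and therefore no gradient, which is the usual motivation for dropping it. The open question in the list is whether dropping such prompts hides examples that might have become learnable under a different schedule.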
Practical Applications
Immediate Applications
The paper's findings enable several concrete changes to current LLM training and deployment practices that can be implemented now, especially for tasks with verifiable rewards (e.g., math, coding, tool-using agents).
- RLVR pipeline triage via gradient-similarity diagnostics (Industry, Academia; Software, Agents)
- What to do: Add a lightweight LoRA-based per-example gradient similarity monitor in the RLVR loop to flag "gradient outlier" prompts that are unlikely to learn despite correct rollouts.
- Tools/workflows:
- "Unlearnability detector" service that computes cosine similarity of example-level gradients (using a fixed LoRA adapter) and tags outliers.
- Dashboards tracking learnable vs. unlearnable cohorts over training steps.
- Impact: Reallocates compute away from unlearnable examples, shortens RL runs, and clarifies what data require upstream fixes.
- Assumptions/dependencies: Training-time access to gradients; using verifiable tasks; LoRA approximation suffices for similarity ranking.
- Mid-training-first pipeline adjustments (Industry, Academia; Foundation models)
- What to do: Prioritize mid-training to reshape representations before RL (as evidenced by improved gradient similarity in mid-trained OctoThinker models).
- Tools/workflows:
- "Mid-training gate" in the pipeline that requires a minimum gradient-similarity band on difficult examples before allowing RLVR.
- Mid-training data curation workflows aimed at reasoning-heavy corpora.
- Impact: Higher RLVR yield on hard tasks with the same compute.
- Assumptions/dependencies: Access to large-scale mid-training data/compute; transferability from math to target domain.
- Rollout quality control beyond outcome-only rewards (Industry, Academia; Education, Software)
- What to do: Score intermediate reasoning quality (e.g., with an LLM grader) and down-weight or filter โfake reasoningโ even when the final answer is correct.
- Tools/workflows:
- Reasoning-quality scorer integrated into sampling and credit assignment (process-aware filtering).
- In coding, complement unit tests with style/complexity checks to reduce exploitative shortcuts.
- Impact: Reduces noisy signals that reinforce ungeneralizable shortcuts; better generalization.
- Assumptions/dependencies: Availability of chain-of-thought or structured traces; access to a reliable judge model; careful handling to avoid unsafe CoT exposure.
- Compute-aware data scheduling and early stopping (Industry; MLOps)
- What to do: Monitor subgroup reward trajectories and clip/route examples that plateau (unlearnable) to alternative training stages instead of spending more RL budget.
- Tools/workflows:
- RL scheduler that dynamically de-prioritizes stagnant examples.
- "Stop-loss" criteria for example-level pass@k improvements.
- Impact: Prevents wasteful sampling and training on "stuck" items.
- Assumptions/dependencies: Robust convergence/plateau criteria; accurate pass@k estimation.
- Data curation guidance for RLVR on verifiable tasks (Industry, Academia; Software, Education)
- What to do: Avoid assuming that "more correct rollouts" or semantically similar problems will help; instead, identify and route unlearnable items to representation-shaping stages (mid-training/SFT with rationales).
- Tools/workflows:
- Triage labels (easy/learnable/unlearnable) maintained alongside datasets.
- Audit reports comparing semantic vs. gradient-space similarity.
- Impact: Better dataset ROI; fewer ineffective augmentations.
- Assumptions/dependencies: Ability to run diagnostic sampling; availability of alternative stages for those items.
- Enhanced evaluation and procurement protocols (Industry, Academia, Policy; Benchmarks, Safety)
- What to do: Report gradient-similarity distributions and reasoning-quality metrics alongside pass@k; add cohort-level learning curves to evaluation docs.
- Tools/workflows:
- Benchmark add-ons that quantify fraction of gradient outliers and their behavior over training.
- Procurement checklists requiring process-level metrics, not just outcomes.
- Impact: More informative model comparisons and safer deployment decisions.
- Assumptions/dependencies: Access to training-time diagnostics or validated post-hoc proxies.
- Deployment-time fallback strategies for "hard" queries (Industry; Software, Education)
- What to do: For prompts known (from training logs) to be unlearnable, route to slower but robust inference strategies (e.g., self-consistency, tool use, retrieval, or human-in-the-loop).
- Tools/workflows:
- Prompt router referencing a registry of historically unlearnable patterns.
- Impact: Improves reliability for end-users without retraining.
- Assumptions/dependencies: Mapping from training-time unlearnable cohorts to similar production prompts; acceptable latency/cost for fallbacks.
- Responsible-use notices in learning products (Daily life, Education; EdTech)
- What to do: Display "show your work" validators and expose process-quality warnings when rationales appear incoherent despite correct answers.
- Tools/workflows:
- In-product reasoning-quality badges and optional trace viewers for students/teachers.
- Impact: Reduces overreliance on brittle reasoning; improves learning outcomes.
- Assumptions/dependencies: Access to reasoning traces; UX for communicating uncertainty/process flaws.
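The "stop-loss" scheduling idea in the list above could look roughly like the following. This is a hypothetical sketch: the function name, the label strings, and every threshold (`window`, `min_gain`, `solved_at`) are illustrative choices, not values taken from the paper.

```python
def triage_example(history, window=3, min_gain=0.05, solved_at=0.9):
    """Label one training example from its pass rate at successive checkpoints.

    history: list of pass rates in [0, 1], oldest first.
    """
    latest = history[-1]
    if latest >= solved_at:
        return "learned"
    if len(history) <= window:
        return "undecided"  # not enough checkpoints to judge a trend
    gain = latest - history[-1 - window]
    saw_correct = any(p > 0 for p in history)
    if gain >= min_gain:
        return "improving"
    # Stagnant despite correct rollouts: the pattern the paper calls unlearnable.
    return "plateaued" if saw_correct else "no_signal"

labels = {
    "steady_climb": triage_example([0.1, 0.2, 0.3, 0.4, 0.5]),
    "stuck": triage_example([0.10, 0.10, 0.12, 0.10, 0.11]),
    "never_right": triage_example([0.0, 0.0, 0.0, 0.0, 0.0]),
}
```

Examples labeled `plateaued` are the ones a scheduler might de-prioritize or route to an earlier training stage. Separating them from `no_signal` keeps "never produced a correct rollout" distinct from "produced correct rollouts but did not improve", which is the distinction the paper emphasizes.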
Long-Term Applications
The paper suggests several research and development directions that require new algorithms, tooling, or broader validation beyond math.
- Representation-aware RL objectives (Industry, Academia; Foundation models)
- Idea: Modify RLVR to incorporate representation/gradient-space regularizers (e.g., encourage alignment with reliable gradients, penalize gradient outliers, or add process-level rewards for correct intermediate steps).
- Potential products: "RepAlign-PPO/GRPO" libraries; process-supervision toolkits.
- Dependencies: New credit assignment strategies; scalable proxies for gradient similarity; validation beyond math and coding.
- Gradient-space curriculum and active learning (Industry, Academia; Training systems)
- Idea: Build curricula using gradient similarity rather than semantic difficulty: move from high-alignment tasks to outliers, and actively select mid-training data that increase similarity for hard examples.
- Potential products: Curriculum designers that target improvement of similarity scores; active mid-training selectors.
- Dependencies: Reliable, low-cost similarity estimates; theory linking similarity gains to generalization.
- General-purpose unlearnability benchmarks and standards (Academia, Policy; Evaluation)
- Idea: Standardize datasets and protocols that quantify unlearnability rates and gradient outlier behavior across domains (math, code, tool-use, planning).
- Potential products: "Unlearnability Scorecards" included in model cards; regulatory guidance for high-stakes domains.
- Dependencies: Community agreement on definitions/thresholds; secure access to training diagnostics or accepted proxies.
- Proxy metrics for representation flaws without gradients (Industry; MLOps, Deployment)
- Idea: Develop inference-time proxies (e.g., activation similarity, influence functions, feature probes) that correlate with gradient outlier status to enable monitoring when gradients are unavailable.
- Potential products: Black-box "representation-health monitors" for hosted models.
- Dependencies: New research validating proxy-gradient correlations; access to activations or distilled telemetry.
- Process-based supervision at scale (Industry, Academia; Healthcare, Finance, Education)
- Idea: Replace outcome-only RLVR with verifiers for intermediate steps (when feasible), reducing reinforcement of shortcut heuristics.
- Potential products: Domain-specific step verifiers (e.g., math proof checkers, code trace analyzers, clinical reasoning validators).
- Dependencies: Formal or semi-formal intermediate-checking infrastructure; domain-labeled rationale datasets; safety/privacy constraints.
- Automated mid-training data design (Industry, Academia; Foundation models)
- Idea: Optimize mid-training mixtures to maximize downstream gradient similarity on target hard examples (e.g., via bilevel optimization).
- Potential products: Data-mixture optimizers that tune corpora to reshape representation spaces for reasoning.
- Dependencies: Expensive training loops; feedback signals linking mixture changes to similarity gains.
- Cross-domain unlearnability routing in agents and robotics (Industry; Agents, Robotics)
- Idea: Detect "unlearnable" tasks for RLVR-trained agents and route to alternative planners, symbolic solvers, or specialized modules.
- Potential products: Agent orchestrators that adaptively switch policies based on representation-health signals.
- Dependencies: Verifiable sub-task structure; interfaces between neural and symbolic/planning components.
- Safety and compliance frameworks emphasizing process validity (Policy; Healthcare, Finance, Public sector)
- Idea: Require evidence of process-level soundness (not just outcomes) and disclosure of unlearnability rates for regulated deployments.
- Potential tools: Audit templates capturing gradient/activation diagnostics; certification schemes for process-aware reasoning models.
- Dependencies: Consensus on acceptable process metrics; legal frameworks and auditing capacity.
- Better data augmentation rooted in optimization space (Academia, Industry)
- Idea: Generate augmentations that are similar in gradient space, not merely in meaning, so they actually transfer skills to target items.
- Potential products: "Optimization-aware" augmentation generators using influence functions or gradient-matched synthesis.
- Dependencies: Efficient estimators of gradient-space similarity; generative systems controllable in optimization space.
- Mixture-of-experts training routes for hard examples (Industry; Foundation models)
- Idea: Train or select specialized experts for gradient outliers and route those items during training and serving.
- Potential products: MoE routers keyed on representation-health; per-expert mid-training curricula.
- Dependencies: Stable routing signals; cost/latency budgets; evidence that specialization overcomes representation flaws.
Cross-cutting assumptions and dependencies
- Verifiable reward signals are currently essential (math/coding/agentic tasks). Extending to less-verifiable domains (e.g., open-ended dialogue, medical advice) requires process verifiers or alternative supervision.
- Access to training-time signals (gradients/activations) is needed for the most direct diagnostics; black-box deployments will require validated proxies.
- Findings are demonstrated on small/mid-scale models and math; broader validation is needed to claim universality across domains and scales.
- Chain-of-thought availability and safe handling policies affect feasibility of reasoning-quality scoring and process supervision.
Glossary
- Advantage: In policy gradient methods, the estimated relative value of an action compared to a baseline, often used to weight updates. "the advantage is calculated as the standardized reward"
- Cosine similarity: A measure of alignment between two vectors, here used to compare per-example gradient directions. "Then we obtain the cosine similarity between gradients of each pair of examples."
- Credit assignment: The problem of attributing observed rewards to specific actions or tokens during training. "Other works adjust credit assignment by altering the granularity of gradient clipping and optimization (Liu et al., 2025b; Zheng et al., 2025a) to stabilize RL training and improve final performance."
- Curriculum learning: A training strategy that orders or schedules data from easier to harder to improve efficiency or stability. "Meanwhile, curriculum learning, as a more systematic dynamic sampling method, is also shown to improve training efficiency as well (Shi et al., 2025; Gao et al., 2025)."
- Data augmentation: The practice of synthesizing additional training examples (e.g., similar problems or subproblems) to improve learning. "Data Augmentation. We then explore whether data with high gradient similarity can be synthesized."
- Dynamic sampling: A data scheduling technique that selectively includes examples (e.g., based on reward variance) to improve efficiency. "we use GRPO with dynamic sampling (Yu et al., 2025) as our baseline RL algorithm"
- Entropy: An exploration-related quantity measuring uncertainty in the model's action distribution, sometimes used to reweight loss. "existing works often use entropy as an indicator for model exploration and apply entropy-based loss weight adjustment to improve model performance (Cui et al., 2025; Cheng et al., 2025; Jin et al., 2025b)."
- Experience replay: Reusing previously sampled trajectories or outputs to balance batches or stabilize learning. "we apply oversampling with experience replay (Sun et al., 2025b; Zhang et al., 2025d;c) to ensure the ratio of positive samples to negative ones is always the same for each training example."
- Gradient clipping: A stabilization technique that limits gradient magnitude to prevent large, destabilizing updates. "Other works adjust credit assignment by altering the granularity of gradient clipping and optimization (Liu et al., 2025b; Zheng et al., 2025a)"
- Gradient interference: Conflicting gradient signals from different samples or objectives that can hinder learning progress. "More analysis results on gradient interference (Nguyen et al., 2025) can be found in Appendix A.3."
- Gradient outliers: Examples whose gradients differ markedly from the bulk of the training distribution, reducing transfer. "Easy examples have highly concentrated gradients while unlearnable examples are distinct gradient outliers."
- Gradient similarity: The degree to which gradients from different examples point in similar directions, indicating shared learnable structure. "unlearnable examples exhibit substantially lower gradient similarity to the rest of the training data than both easy and learnable examples (Figure 1c)."
- Group Relative Policy Optimization (GRPO): An RL algorithm variant for LLMs that leverages relative performance within grouped rollouts. "with Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as a standard algorithm."
- KL loss term: A regularization that penalizes divergence from a reference policy, often the Kullback-Leibler divergence added to the loss. "Clipping mechanisms (Schulman et al., 2017) suppress gradients for low-probability tokens, while KL loss term (Schulman et al., 2017) penalizes deviation from a reference model."
- LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that adds low-rank adapters to a frozen model. "we attach a fixed, randomly initialized LoRA adapter and compute gradients with respect to LoRA parameters only."
- Mid-training: An intermediate training stage on curated data to reshape representations before reinforcement learning. "Mid-training has shown to be effective to improve base model to make it more suitable for RL stage (Wang et al., 2025)."
- Outcome reward variance: Variability in final rewards across rollouts for the same prompt, which GRPO relies on for learning signal. "the success of GRPO relies on the outcome reward variance (Xu et al., 2025) within grouped rollouts"
- Oversampling: Increasing the presence of certain samples (e.g., positives) in training batches to balance signals. "we apply oversampling with experience replay (Sun et al., 2025b; Zhang et al., 2025d;c) to ensure the ratio of positive samples to negative ones is always the same for each training example."
- Pass@k: The probability that at least one of k sampled outputs is correct, used as a performance metric. "Starting with the first ever work that shows pass@k degrades after RL (Yue et al., 2025)"
- Proximal Policy Optimization (PPO): A popular on-policy RL algorithm using clipped objectives to stabilize policy updates. "The policy model is optimized to maximize the PPO (Schulman et al., 2017) loss:"
- Reference log-likelihood: The log-probability assigned by a fixed reference model to a sequence, used to analyze rollout probabilities. "Distribution of reference log-likelihood for different data examples' correct rollouts."
- Reference model: A fixed policy used to regularize the current model via KL penalties during RL fine-tuning. "penalizes deviation from a reference model."
- Reinforcement Learning with Verifiable Reward (RLVR): An RL setup for LLMs where rewards are based on automatically verifiable outcomes (e.g., correct answers). "Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving LLM's (LLM) reasoning ability."
- Reasoning traces: The step-by-step intermediate reasoning produced by the model, analyzed for coherence and quality. "Qualitative inspection of reasoning traces further indicates that although the final answers may be correct, the model frequently produces incoherent or even erroneous intermediate reasoning steps on unlearnable examples (Figure 1d)."
- Rollouts: Sampled trajectories or model outputs for a given prompt used to compute rewards and gradients. "even when correct rollouts are present."
- Similar problems: Augmented problems designed to share solution strategies with originals, used to test transfer. "generate 5 similar problems that can be solved with the same strategy."
- Subproblems: Decomposed tasks whose solutions help solve the original problem, used for augmentation and compositional training. "generate subproblems Dsub."
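The Pass@k entry above can be made concrete with a small calculation. The sketch uses the unbiased estimator that is common in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; whether the paper uses exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given n total samples of which c were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

rare_success = pass_at_k(n=10, c=1, k=1)  # one correct rollout in ten, single try
never_solved = pass_at_k(n=10, c=0, k=5)  # no correct rollouts at all
two_of_four = pass_at_k(n=4, c=2, k=2)    # 1 - C(2,2)/C(4,2) = 5/6
```

An example with a low but nonzero pass rate, such as `rare_success`, is the kind of hard problem the paper studies: correct rollouts exist and get rewarded, yet the rate may not rise during training.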