The Unlearnability Phenomenon in RLVR for Language Models

Published 16 May 2026 in cs.LG and cs.CL | (2605.16787v1)

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of large language models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.

Authors (3)

Summary

  • The paper reveals that a significant subset of training instances remains unlearnable (pass@1 < 0.1) despite positive reward signals in RLVR.
  • It systematically challenges hypotheses such as rollout scarcity and gradient interference, showing these factors do not explain the phenomenon.
  • The study highlights mid-training as crucial for restructuring representations, exposing the limitations of standard RL post-training techniques.

The Unlearnability Phenomenon in RLVR for LLMs

Motivation and Scope

Reinforcement Learning with Verifiable Reward (RLVR) has been extensively adopted to enhance LLM reasoning capabilities in domains such as mathematics, code synthesis, and agentic tasks. Despite empirical improvements, the paper "The Unlearnability Phenomenon in RLVR for LLMs" (2605.16787) presents a systematic analysis of the learning dynamics within RLVR and reveals the existence of a significant subset of "unlearnable" examples that cannot be mastered by LLMs, even in the presence of correct rollouts and positive reward signals. This work delineates unlearnability as a fundamental limitation of RL post-training for reasoning LLMs.

Definition and Prevalence of Unlearnability

The paper operationalizes "unlearnable" examples as hard training instances with pass@1 < 0.1 post-convergence, despite the availability of correct rollouts during policy optimization. Experiments using Qwen2.5-0.5B, Qwen2.5-3B, and Llama3.2-3B-Instruct on varied mathematical datasets indicate that, after excluding data with no positive rewards, roughly half of the hard examples remain persistently unlearnable. The prevalence of unlearnability is robust across models and data regimes, making it a ubiquitous phenomenon in RLVR pipelines.
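
The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and using the same 0.1 cutoff to decide which examples count as "hard" before training is an assumption, since the summary only states the post-convergence threshold.

```python
from typing import List


def pass_at_1(rewards: List[int]) -> float:
    """Fraction of sampled rollouts that the verifier marked correct (1) vs. incorrect (0)."""
    if not rewards:
        raise ValueError("need at least one rollout")
    return sum(rewards) / len(rewards)


def label_example(pre: float, post: float, saw_correct_rollout: bool,
                  threshold: float = 0.1) -> str:
    """Label one training example from its pass@1 before and after RL.

    ASSUMPTION: an example is 'hard' when its initial pass@1 is below the
    same threshold used for the post-convergence check.
    """
    if pre >= threshold:
        return "not_hard"
    if not saw_correct_rollout:
        # No positive reward was ever observed, so the example is excluded.
        return "excluded"
    return "unlearnable" if post < threshold else "learnable"
```

For example, `label_example(0.05, 0.02, True)` yields `"unlearnable"`, while the same example with a post-training pass@1 of 0.6 would be labeled `"learnable"`.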

Analysis of Common Hypotheses

Positive Rollout Scarcity

Contrary to intuitive expectations, interventions designed to increase the frequency of correct rollouts (experience replay, oversampling, larger rollout groups, or supervised fine-tuning on distilled correct responses) do not ameliorate unlearnability. The gap between learnable and unlearnable examples persists regardless of rollout density, refuting the hypothesis that unlearnability is a consequence of signal scarcity.
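
To make the oversampling-with-replay intervention concrete, here is a rough sketch of how a rollout group might be topped up with stored correct responses so that every prompt sees a fixed number of positives. The class and method names are hypothetical, and the replacement policy (swap out negatives, reuse stored positives cyclically) is an assumption rather than the paper's procedure. The summary's point is that interventions of this kind did not resolve unlearnability.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Rollout = Tuple[str, int]  # (response text, binary reward)


class PositiveReplay:
    """Hypothetical per-prompt buffer of previously seen correct rollouts."""

    def __init__(self, capacity: int = 8) -> None:
        self._buf: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, prompt_id: str, rollouts: List[Rollout]) -> None:
        """Store the text of every correct rollout for later reuse."""
        for text, reward in rollouts:
            if reward == 1:
                self._buf[prompt_id].append(text)

    def balance(self, prompt_id: str, rollouts: List[Rollout], n_pos: int) -> List[Rollout]:
        """Swap negatives for stored positives until n_pos positives are present.

        Group size stays constant. If the buffer is empty, the group is
        returned unchanged.
        """
        pos = [r for r in rollouts if r[1] == 1]
        neg = [r for r in rollouts if r[1] == 0]
        stored = list(self._buf[prompt_id])
        i = 0
        while len(pos) < n_pos and stored and neg:
            pos.append((stored[i % len(stored)], 1))
            neg.pop()
            i += 1
        return pos + neg
```

Replayed responses were sampled from an earlier policy, so a real implementation would also need to account for the resulting off-policy mismatch; that is omitted here.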

Gradient Regularization

The paper investigates whether unlearnability is attributable to harsh gradient clipping or KL divergence penalties suppressing useful gradients from correct rollouts. Empirical results show that clipping ratios and reference log-likelihoods of correct rollouts are not systematically lower for unlearnable examples compared to learnable or easy examples. Ablations removing clipping or KL penalties fail to impact learning dynamics on unlearnable examples, demonstrating that optimization constraints are not primary explanatory factors.

Gradient Interference

Analysis of gradient similarity between correct and incorrect rollouts both within-prompt and across the training set reveals negligible interference effects. Gradients of correct and incorrect rollouts in unlearnable examples remain aligned; cancellation is not observed. These results further solidify that gradient interference is not a plausible primary driver of unlearnability.

Representation Analysis

Gradient Similarity and Outlier Detection

Cross-example gradient analysis establishes that unlearnable examples are pronounced outliers in the gradient space. Their per-example gradients exhibit a significantly lower cosine similarity with gradients of other examples, indicating that the skill transfer and representational generalization in the optimization space are severely impaired for this subset. Easy examples demonstrate highly concentrated gradients, while learnable examples are moderately well aligned.
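
As an illustration of the outlier analysis, the sketch below scores each example by its mean cosine similarity to every other example's gradient. It assumes each per-example gradient has already been flattened into a list of floats (the summary mentions LoRA-based gradients elsewhere; how they are obtained is out of scope here), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def mean_cross_similarity(grads: List[Sequence[float]]) -> List[float]:
    """For each example, average its cosine similarity with all other examples."""
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two examples")
    scores = []
    for i in range(n):
        others = [cosine(grads[i], grads[j]) for j in range(n) if j != i]
        scores.append(sum(others) / (n - 1))
    return scores
```

Examples whose score sits far below the rest would be the candidates for "gradient outlier" status. Where to draw that line is not specified in this summary.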

Reasoning Trace Quality

Even when unlearnable examples yield correct answers post-training, the reasoning traces exhibit suboptimal logical coherence, consistency, and quality scores as annotated by automated evaluators. The model often resorts to "shortcut solutions" or non-generalizable heuristics, underscoring that RLVR is incentivizing outcome correctness rather than genuine reasoning, especially for unlearnable samples. The quality gap increases during training, with learnable examples benefiting from improved intermediate reasoning while unlearnable examples stagnate.

Ineffectiveness of Data Augmentation and Curriculum Learning

Attempts at mitigating unlearnability via data augmentation (generating similar problems or decomposed subproblems) and curriculum-based data schedules fail to improve gradient similarity or reasoning quality on the unlearnable examples. The augmented samples, even when structurally or semantically analogous, do not exhibit improved optimization space overlap. This underscores the non-triviality and limits of data synthesis for post-training LLMs.

Role of Mid-Training

Mid-training, as demonstrated with OctoThinker variants, substantially enhances gradient alignment for difficult examples, unlike RL alone. Representation space restructuring at the mid-training stage appears essential for smoothing the subsequent RL optimization landscape, promoting greater skill transferability and reducing gradient outlier incidence for hard examples. The results emphasize that mid-training is critical in the LLM reasoning pipeline, as RLVR cannot reliably correct flawed initial representations.

Implications and Future Directions

The implications for LLM training pipelines are profound: RL post-training efficacy is fundamentally constrained by the quality and alignment of preexisting representations, and positive reward signals alone are insufficient for learning certain classes of reasoning tasks. Practical pipelines should prioritize robust mid-training regimes and seek new methods for representation repair prior to RL fine-tuning. Future research may focus on identifying structural properties of unlearnable examples, designing algorithmic interventions that act on intermediate reasoning, and probing latent representational signals for improved transferability.

Conclusion

The paper rigorously demonstrates the existence and persistence of unlearnable examples in RLVR training for LLMs and attributes this phenomenon to flawed representations that cannot be remedied by standard RL post-training techniques, including data augmentation and curriculum learning. Cross-example gradient analysis and reasoning quality evaluations provide direct evidence for representational outlier status and ungeneralizable reasoning in the unlearnable subset. Mid-training significantly alleviates these issues, suggesting the criticality of representation alignment before RL. These findings constitute a fundamental limitation in contemporary RLVR approaches and direct future research towards exploring alternate paradigms for skill composition and representation repair in reasoning LLMs.

Explain it Like I'm 14

What this paper is about (big picture)

The paper looks at how we teach LLMs to reason better using a method called "Reinforcement Learning with Verifiable Reward" (RLVR). Think of RLVR like training with a coach who checks whether each answer is correct and gives a reward for correct ones. The surprising discovery: even when the coach finds correct attempts to reward, some hard problems still don't get learned. The authors call these "unlearnable" examples and try to figure out why that happens.

What questions the researchers asked

They set out to answer, in simple terms:

  • Why do some hard problems stay unlearned even when the model sometimes gets them right and is rewarded for it?
  • Is the issue caused by not seeing enough correct tries, by training rules that dampen learning, or by something deeper inside the model?
  • Can we fix this by giving the model more practice problems, smaller subproblems, or extra preparation before RL?

How they studied it (methods, explained simply)

To explore this, they trained several LLMs on math problems and watched how learning progressed.

Key ideas, with analogies:

  • RLVR: Like a quiz game where the model gives multiple answers ("rollouts") to each question. A checker automatically marks correct answers and rewards them.
  • Rollouts: Multiple tries for the same question in one training round.
  • PPO, clipping, and KL penalty: Training "safety rules." Clipping is like a speed limit on how fast the model can change; KL is a rubber band that pulls the model back if it changes too much from its earlier behavior.
  • Gradients: Tiny nudges that tell the model how to change to get better. "Gradient similarity" measures whether two questions teach similar lessons, like asking, "Does the way we learn from problem A help with problem B?"
  • Representation: The model's internal way of understanding and organizing ideas, like its mental map of math strategies.
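
For readers who want to see the "reward" and "speed limit" ideas as arithmetic, here is a small sketch. It follows the glossary's description of the advantage as the standardized reward within a group of rollouts, plus the standard PPO clipped objective. The function names, the use of the population standard deviation, and the default clip range of 0.2 are assumptions for illustration, and the KL "rubber band" term is left out.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's rollout group.

    ASSUMPTION: population standard deviation with a small eps for stability.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one token or one rollout.

    ratio is new_policy_prob / old_policy_prob. Clipping caps how much a
    single update can profit from moving the policy away from the old one.
    """
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With rewards `[1, 0, 0, 0]`, the single correct rollout gets a positive advantage and the three incorrect ones get negative advantages. If all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.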

What they tested:

  • They ensured each hard question had at least one correct attempt in training (oversampling and replay) to see if more "positive examples" would help.
  • They relaxed the training "safety rules" (raised the speed limit and loosened the rubber band) to see if rules were blocking learning.
  • They checked the quality of the model's step-by-step reasoning, not just the final answer, to see if correct answers sometimes came from shaky reasoning.
  • They measured gradient similarity to see whether learning from other problems transfers to a given problem.
  • They tried data augmentation: creating similar problems and breaking problems into subproblems, to see if that would help the model learn the original hard ones.
  • They compared models that had extra "mid-training" (extra general practice before RL) to see if better preparation improves learning later.

What they found and why it matters

Main findings:

  • A real "unlearnability" group exists: Among hard problems, a large chunk (often close to half in their settings) did not improve, even though correct attempts were present and rewarded during training.
  • Not just a reward shortage: Giving every hard question at least one correct attempt per training step didn't fix it. Even training only on these hard questions, using many more attempts, or distilling correct solutions didn't solve it.
  • Not blocked by safety rules: Removing or relaxing the training rules (clipping and KL) didn't help the unlearnable problems.
  • Gradient outliers: Unlearnable problems had very low gradient similarity to the rest of the training set. In other words, the "lessons" learned from other problems didn't transfer to them. Easy problems had highly similar gradients, which is what makes them learn smoothly.
  • "Fake" or fragile reasoning: For unlearnable problems, even the correct answers often came with low-quality or inconsistent step-by-step reasoning. That suggests the model sometimes uses shortcuts or brittle tricks rather than solid, generalizable reasoning.
  • Data augmentation didn't transfer back: The model could learn the new similar problems or subproblems themselves, but this did not translate into learning the original unlearnable problems. Semantically similar problems were not always similar in the model's "optimization space," so practicing on them didn't fix the core issue.
  • Mid-training helps representations: Models that got extra general practice before RL showed higher gradient similarity on hard problems, meaning their internal representations were better aligned to learn from RL later.

Why this matters:

  • It shows a fundamental limit: Just giving positive rewards (correct answers) isn't enough if the model's internal representation of a problem is off. RL alone often can't "repair" these flaws.
  • It highlights the importance of preparation: Extra mid-training can reshape the model's mental map so RL works better afterward.

What this could mean going forward

  • For training pipelines: Don't rely on RL alone to build reasoning. Invest in mid-training (stronger base skills, broader practice) to align the model's internal representations before RL.
  • For reward design: Checking only final answers can let the model "hack" the reward with shortcut reasoning. Incorporating signals about the quality of intermediate steps may produce more reliable learning.
  • For data strategy: Generating "similar" problems doesn't guarantee transfer. We need ways to create training data that's similar not only in meaning but also in how it shapes the model's learning (its gradients).
  • For research: Understanding which examples produce transferable gradients, and how to raise gradient similarity for tough cases, could lead to more robust reasoning models.

In short: Some tough problems stay unlearned in today's RL setups, not because of missing rewards or strict rules, but because the model's internal understanding isn't aligned. Strengthening those representations before RL (through mid-training) seems key to making reasoning training truly stick.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains uncertain or unexplored in the paper, phrased to guide concrete follow-up research:

  • Domain generality: Does the unlearnability phenomenon persist beyond mathematical reasoning (e.g., coding, multi-hop QA, planning/agent tasks, scientific reasoning)?
  • Scale dependence: How does unlearnability change with larger-capacity models (e.g., ≥7B–70B), different architectures (MoE, long-context variants), or instruction- vs base-initialized policies?
  • Sensitivity to the operational definition: How robust are findings to different pass@k thresholds, sample sizes (N), and convergence criteria, especially for near-boundary examples?
  • Alternative RL objectives: Do algorithms beyond GRPO/PPO (e.g., off-policy actor-critic, value-based, model-based RL, implicit KL, trust-region variants) reduce unlearnability?
  • Reward design beyond binary outcomes: Can process-based or step-verified rewards (PRMs, partial credit, intermediate constraints) convert unlearnable examples into learnable ones?
  • Exploration interventions: Do targeted exploration schemes (entropy scheduling, guided search, self-ask, beam search mixing, diversity-promoting rollouts) alter learnability trajectories?
  • Correct rollout diversity: Does training with multiple distinct correct solution paths (rather than repeated near-duplicates) improve representation alignment and learnability?
  • Predictive diagnostics: Can we predict unlearnable examples a priori using cheaply computable proxies (e.g., prompt features, token entropy, early-loss curvature, low-cost gradient sketches)?
  • Online detection and routing: Can real-time indicators (e.g., streaming gradient similarity, advantage statistics) trigger specialized handling (alternative objectives, tools, or curricula) for prospective unlearnable examples?
  • Taxonomy and causal factors: Which problem attributes (e.g., algebraic structure, symbolic depth, length, compositionality, error modes) correlate causally with gradient outlier status?
  • Representation-level mechanism: What specific internal features or layers fail on unlearnable examples? Can layer-wise or subspace analyses identify where representation misalignment arises?
  • Validity of gradient-similarity proxy: How do conclusions change when using full-parameter gradients vs LoRA proxies, different layers, token-level granularity, or Jacobian/Hessian-based similarity?
  • Training-time evolution: How do gradient similarities and reasoning quality evolve across many more training steps or restarts? Are there late-phase transitions that rescue some unlearnable cases?
  • Gradient interference mitigation: Do methods like orthogonal gradient surgery (e.g., PCGrad), conflict-aware batching, or per-group optimizers reduce cross-example interference for outliers?
  • Reference policy and KL schedules: Would alternative reference models, adaptive KL targets, or trust-region schedules change learnability, especially for examples with initially low policy mass?
  • Sampling/grouping strategies: How do different k, grouping strategies, prioritized sampling by predicted similarity, or active selection impact the emergence of unlearnability?
  • Data augmentation that targets optimization space: Can we design augmentation that is gradient-aligned (e.g., process-preserving transformations, formal symmetries, programmatic perturbations) rather than only semantically similar?
  • Process supervision and anti-reward hacking: Does enforcing correctness of intermediate steps (or penalizing incoherent chains) repair "fake reasoning" and increase gradient alignment for unlearnable examples?
  • Tool-use and hybrid pipelines: Can tool-augmented rollouts (symbolic solvers, calculators, verifiers) or self-consistency checks help overcome representation gaps on unlearnable problems?
  • Mid-training design space: Which mid-training datasets, mixtures, and objectives (e.g., math-specific corpora, program-of-thought, contrastive consistency, synthetic curricula) most improve gradient similarity on hard examples?
  • Interaction with SFT: Beyond brief SFT checks, how do different SFT regimes (process SFT, weak-to-strong distillation, rationale distillation) interplay with RL to address unlearnability?
  • Dynamic sampling filter effects: The pipeline drops zero-variance prompts; does this filtering bias optimization, masking examples that could become learnable later with different schedules?
  • Robustness of reasoning-quality scoring: How sensitive are conclusions to the use of GPT-5-mini as the judge? Do human evaluations or alternative rubrics confirm the reasoning-quality gaps?
  • Generalization to non-verifiable or learned rewards: Does unlearnability appear (or worsen) when rewards are model-based (RMs/AIF) or noisy, where "correctness" is not strictly verifiable?
  • Architectural interventions: Would architectures with explicit planning/scratchpads, memory modules, or MoE routing reduce the incidence of gradient outliers?
  • Theoretical characterization: What conditions make an example a persistent gradient outlier under outcome-based RL? Can we formalize links between representation geometry, reward sparsity, and learnability?
  • Operational scalability: How can large-scale training monitor and act on gradient similarity or outlier status efficiently without prohibitive compute?
  • Reproducibility and variance: How variable is the unlearnability set across seeds, data orders, and trainers? Is the intersection-over-runs approach conservative enough for pipeline decisions?
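
The dynamic sampling filter raised in the list above (dropping zero-variance prompts) can be illustrated with a short sketch. It assumes binary rewards and a mapping from hypothetical prompt IDs to the rewards of their rollouts; it is not taken from the paper's code.

```python
from typing import Dict, List


def filter_zero_variance(groups: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Keep only prompts whose rollouts received more than one distinct reward.

    When every rollout in a group has the same reward, the standardized
    advantage is zero for all of them, so the group carries no gradient
    signal and is dropped from the batch.
    """
    return {pid: rewards for pid, rewards in groups.items() if len(set(rewards)) > 1}
```

Under this rule, a hard prompt whose rollouts are all incorrect at a given step is silently skipped for that step. This is the potential bias the open question points at.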

Practical Applications

Immediate Applications

The paper's findings enable several concrete changes to current LLM training and deployment practices that can be implemented now, especially for tasks with verifiable rewards (e.g., math, coding, tool-using agents).

  • RLVR pipeline triage via gradient-similarity diagnostics (Industry, Academia; Software, Agents)
    • What to do: Add a lightweight LoRA-based per-example gradient similarity monitor in the RLVR loop to flag "gradient outlier" prompts that are unlikely to learn despite correct rollouts.
    • Tools/workflows:
    • "Unlearnability detector" service that computes cosine similarity of example-level gradients (using a fixed LoRA adapter) and tags outliers.
    • Dashboards tracking learnable vs. unlearnable cohorts over training steps.
    • Impact: Reallocates compute away from unlearnable examples, shortens RL runs, and clarifies what data require upstream fixes.
    • Assumptions/dependencies: Training-time access to gradients; using verifiable tasks; LoRA approximation suffices for similarity ranking.
  • Mid-training-first pipeline adjustments (Industry, Academia; Foundation models)
    • What to do: Prioritize mid-training to reshape representations before RL (as evidenced by improved gradient similarity in mid-trained OctoThinker models).
    • Tools/workflows:
    • "Mid-training gate" in the pipeline that requires a minimum gradient-similarity band on difficult examples before allowing RLVR.
    • Mid-training data curation workflows aimed at reasoning-heavy corpora.
    • Impact: Higher RLVR yield on hard tasks with the same compute.
    • Assumptions/dependencies: Access to large-scale mid-training data/compute; transferability from math to target domain.
  • Rollout quality control beyond outcome-only rewards (Industry, Academia; Education, Software)
    • What to do: Score intermediate reasoning quality (e.g., with an LLM grader) and down-weight or filter "fake reasoning" even when the final answer is correct.
    • Tools/workflows:
    • Reasoning-quality scorer integrated into sampling and credit assignment (process-aware filtering).
    • In coding, complement unit tests with style/complexity checks to reduce exploitative shortcuts.
    • Impact: Reduces noisy signals that reinforce ungeneralizable shortcuts; better generalization.
    • Assumptions/dependencies: Availability of chain-of-thought or structured traces; access to a reliable judge model; careful handling to avoid unsafe CoT exposure.
  • Compute-aware data scheduling and early stopping (Industry; MLOps)
    • What to do: Monitor subgroup reward trajectories and clip/route examples that plateau (unlearnable) to alternative training stages instead of spending more RL budget.
    • Tools/workflows:
    • RL scheduler that dynamically de-prioritizes stagnant examples.
    • "Stop-loss" criteria for example-level pass@k improvements.
    • Impact: Prevents wasteful sampling and training on "stuck" items.
    • Assumptions/dependencies: Robust convergence/plateau criteria; accurate pass@k estimation.
  • Data curation guidance for RLVR on verifiable tasks (Industry, Academia; Software, Education)
    • What to do: Avoid assuming that "more correct rollouts" or semantically similar problems will help; instead, identify and route unlearnable items to representation-shaping stages (mid-training/SFT with rationales).
    • Tools/workflows:
    • Triage labels (easy/learnable/unlearnable) maintained alongside datasets.
    • Audit reports comparing semantic vs. gradient-space similarity.
    • Impact: Better dataset ROI; fewer ineffective augmentations.
    • Assumptions/dependencies: Ability to run diagnostic sampling; availability of alternative stages for those items.
  • Enhanced evaluation and procurement protocols (Industry, Academia, Policy; Benchmarks, Safety)
    • What to do: Report gradient-similarity distributions and reasoning-quality metrics alongside pass@k; add cohort-level learning curves to evaluation docs.
    • Tools/workflows:
    • Benchmark add-ons that quantify fraction of gradient outliers and their behavior over training.
    • Procurement checklists requiring process-level metrics, not just outcomes.
    • Impact: More informative model comparisons and safer deployment decisions.
    • Assumptions/dependencies: Access to training-time diagnostics or validated post-hoc proxies.
  • Deployment-time fallback strategies for "hard" queries (Industry; Software, Education)
    • What to do: For prompts known (from training logs) to be unlearnable, route to slower but robust inference strategies (e.g., self-consistency, tool use, retrieval, or human-in-the-loop).
    • Tools/workflows:
    • Prompt router referencing a registry of historically unlearnable patterns.
    • Impact: Improves reliability for end-users without retraining.
    • Assumptions/dependencies: Mapping from training-time unlearnable cohorts to similar production prompts; acceptable latency/cost for fallbacks.
  • Responsible-use notices in learning products (Daily life, Education; EdTech)
    • What to do: Display "show your work" validators and expose process-quality warnings when rationales appear incoherent despite correct answers.
    • Tools/workflows:
    • In-product reasoning-quality badges and optional trace viewers for students/teachers.
    • Impact: Reduces overreliance on brittle reasoning; improves learning outcomes.
    • Assumptions/dependencies: Access to reasoning traces; UX for communicating uncertainty/process flaws.
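
As one possible reading of the "stop-loss" idea in the compute-aware scheduling item above, the sketch below flags an example as plateaued when its recent pass@k estimates stay under a ceiling and barely improve. The function name and all three thresholds are hypothetical; the summary does not specify a criterion.

```python
from typing import Sequence


def is_plateaued(history: Sequence[float], window: int = 3,
                 min_gain: float = 0.02, ceiling: float = 0.1) -> bool:
    """Decide whether an example looks stuck, given pass@k at successive checkpoints.

    ASSUMPTIONS (all illustrative): an example is stuck when its best value
    over the last `window` checkpoints is below `ceiling` and exceeds the
    checkpoint just before the window by less than `min_gain`.
    """
    if len(history) < window + 1:
        return False  # not enough evidence yet
    recent = history[-window:]
    baseline = history[-window - 1]
    return max(recent) < ceiling and max(recent) - baseline < min_gain
```

A scheduler could use this to de-prioritize an example or hand it to another training stage. As the dependencies note above, such a rule is only as reliable as the pass@k estimates that feed it.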

Long-Term Applications

The paper suggests several research and development directions that require new algorithms, tooling, or broader validation beyond math.

  • Representation-aware RL objectives (Industry, Academia; Foundation models)
    • Idea: Modify RLVR to incorporate representation/gradient-space regularizers (e.g., encourage alignment with reliable gradients, penalize gradient outliers, or add process-level rewards for correct intermediate steps).
    • Potential products: "RepAlign-PPO/GRPO" libraries; process-supervision toolkits.
    • Dependencies: New credit assignment strategies; scalable proxies for gradient similarity; validation beyond math and coding.
  • Gradient-space curriculum and active learning (Industry, Academia; Training systems)
    • Idea: Build curricula using gradient similarity rather than semantic difficulty: move from high-alignment tasks to outliers, and actively select mid-training data that increase similarity for hard examples.
    • Potential products: Curriculum designers that target improvement of similarity scores; active mid-training selectors.
    • Dependencies: Reliable, low-cost similarity estimates; theory linking similarity gains to generalization.
  • General-purpose unlearnability benchmarks and standards (Academia, Policy; Evaluation)
    • Idea: Standardize datasets and protocols that quantify unlearnability rates and gradient outlier behavior across domains (math, code, tool-use, planning).
    • Potential products: "Unlearnability Scorecards" included in model cards; regulatory guidance for high-stakes domains.
    • Dependencies: Community agreement on definitions/thresholds; secure access to training diagnostics or accepted proxies.
  • Proxy metrics for representation flaws without gradients (Industry; MLOps, Deployment)
    • Idea: Develop inference-time proxies (e.g., activation similarity, influence functions, feature probes) that correlate with gradient outlier status to enable monitoring when gradients are unavailable.
    • Potential products: Black-box "representation-health monitors" for hosted models.
    • Dependencies: New research validating proxy-gradient correlations; access to activations or distilled telemetry.
  • Process-based supervision at scale (Industry, Academia; Healthcare, Finance, Education)
    • Idea: Replace outcome-only RLVR with verifiers for intermediate steps (when feasible), reducing reinforcement of shortcut heuristics.
    • Potential products: Domain-specific step verifiers (e.g., math proof checkers, code trace analyzers, clinical reasoning validators).
    • Dependencies: Formal or semi-formal intermediate-checking infrastructure; domain-labeled rationale datasets; safety/privacy constraints.
  • Automated mid-training data design (Industry, Academia; Foundation models)
    • Idea: Optimize mid-training mixtures to maximize downstream gradient similarity on target hard examples (e.g., via bilevel optimization).
    • Potential products: Data-mixture optimizers that tune corpora to reshape representation spaces for reasoning.
    • Dependencies: Expensive training loops; feedback signals linking mixture changes to similarity gains.
  • Cross-domain unlearnability routing in agents and robotics (Industry; Agents, Robotics)
    • Idea: Detect "unlearnable" tasks for RLVR-trained agents and route to alternative planners, symbolic solvers, or specialized modules.
    • Potential products: Agent orchestrators that adaptively switch policies based on representation-health signals.
    • Dependencies: Verifiable sub-task structure; interfaces between neural and symbolic/planning components.
  • Safety and compliance frameworks emphasizing process validity (Policy; Healthcare, Finance, Public sector)
    • Idea: Require evidence of process-level soundness (not just outcomes) and disclosure of unlearnability rates for regulated deployments.
    • Potential tools: Audit templates capturing gradient/activation diagnostics; certification schemes for process-aware reasoning models.
    • Dependencies: Consensus on acceptable process metrics; legal frameworks and auditing capacity.
  • Better data augmentation rooted in optimization space (Academia, Industry)
    • Idea: Generate augmentations that are similar in gradient space, not merely semantically similar, so they actually transfer skills to target items.
    • Potential products: "Optimization-aware" augmentation generators using influence functions or gradient-matched synthesis.
    • Dependencies: Efficient estimators of gradient-space similarity; generative systems controllable in optimization space.
  • Mixture-of-experts training routes for hard examples (Industry; Foundation models)
    • Idea: Train or select specialized experts for gradient outliers and route those items during training and serving.
    • Potential products: MoE routers keyed on representation-health; per-expert mid-training curricula.
    • Dependencies: Stable routing signals; cost/latency budgets; evidence that specialization overcomes representation flaws.

Cross-cutting assumptions and dependencies

  • Verifiable reward signals are currently essential (math/coding/agentic tasks). Extending to less-verifiable domains (e.g., open-ended dialogue, medical advice) requires process verifiers or alternative supervision.
  • Access to training-time signals (gradients/activations) is needed for the most direct diagnostics; black-box deployments will require validated proxies.
  • Findings are demonstrated on small/mid-scale models and math; broader validation is needed to claim universality across domains and scales.
  • Chain-of-thought availability and safe handling policies affect feasibility of reasoning-quality scoring and process supervision.

Glossary

  • Advantage: In policy gradient methods, the estimated relative value of an action compared to a baseline, often used to weight updates. "the advantage is calculated as the standardized reward"
  • Cosine similarity: A measure of alignment between two vectors, here used to compare per-example gradient directions. "Then we obtain the cosine similarity between gradients of each pair of examples."
  • Credit assignment: The problem of attributing observed rewards to specific actions or tokens during training. "Other works adjust credit assignment by altering the granularity of gradient clipping and optimization (Liu et al., 2025b; Zheng et al., 2025a) to stabilize RL training and improve final performance."
  • Curriculum learning: A training strategy that orders or schedules data from easier to harder to improve efficiency or stability. "Meanwhile, curriculum learning, as a more systematic dynamic sampling method, is also shown to improve training efficiency as well (Shi et al., 2025; Gao et al., 2025)."
  • Data augmentation: The practice of synthesizing additional training examples (e.g., similar problems or subproblems) to improve learning. "Data Augmentation. We then explore whether data with high gradient similarity can be synthesized."
  • Dynamic sampling: A data scheduling technique that selectively includes examples (e.g., based on reward variance) to improve efficiency. "we use GRPO with dynamic sampling (Yu et al., 2025) as our baseline RL algorithm"
  • Entropy: An exploration-related quantity measuring uncertainty in the model's action distribution, sometimes used to reweight loss. "existing works often use entropy as an indicator for model exploration and apply entropy-based loss weight adjustment to improve model performance (Cui et al., 2025; Cheng et al., 2025; Jin et al., 2025b)."
  • Experience replay: Reusing previously sampled trajectories or outputs to balance batches or stabilize learning. "we apply oversampling with experience replay (Sun et al., 2025b; Zhang et al., 2025d;c) to ensure the ratio of positive samples to negative ones is always the same for each training example."
  • Gradient clipping: A stabilization technique that limits gradient magnitude to prevent large, destabilizing updates. "Other works adjust credit assignment by altering the granularity of gradient clipping and optimization (Liu et al., 2025b; Zheng et al., 2025a)"
  • Gradient interference: Conflicting gradient signals from different samples or objectives that can hinder learning progress. "More analysis results on gradient interference (Nguyen et al., 2025) can be found in Appendix A.3."
  • Gradient outliers: Examples whose gradients differ markedly from the bulk of the training distribution, reducing transfer. "Easy examples have highly concentrated gradients while unlearnable examples are distinct gradient outliers."
  • Gradient similarity: The degree to which gradients from different examples point in similar directions, indicating shared learnable structure. "unlearnable examples exhibit substantially lower gradient similarity to the rest of the training data than both easy and learnable examples (Figure 1c)."
  • Group Relative Policy Optimization (GRPO): An RL algorithm variant for LLMs that leverages relative performance within grouped rollouts. "with Group Relative Policy Optimization (GRPO) (Shao et al., 2024) as a standard algorithm."
  • KL loss term: A regularization term that penalizes divergence from a reference policy, measured by the Kullback-Leibler divergence and added to the loss. "Clipping mechanisms (Schulman et al., 2017) suppress gradients for low-probability tokens, while KL loss term (Schulman et al., 2017) penalizes deviation from a reference model."
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that adds low-rank adapters to a frozen model. "we attach a fixed, randomly initialized LoRA adapter and compute gradients with respect to LoRA parameters only."
  • Mid-training: An intermediate training stage on curated data to reshape representations before reinforcement learning. "Mid-training has shown to be effective to improve base model to make it more suitable for RL stage (Wang et al., 2025)."
  • Outcome reward variance: Variability in final rewards across rollouts for the same prompt, which GRPO relies on for learning signal. "the success of GRPO relies on the outcome reward variance (Xu et al., 2025) within grouped rollouts"
  • Oversampling: Increasing the presence of certain samples (e.g., positives) in training batches to balance signals. "we apply oversampling with experience replay (Sun et al., 2025b; Zhang et al., 2025d;c) to ensure the ratio of positive samples to negative ones is always the same for each training example."
  • Pass@k: The probability that at least one of k sampled outputs is correct, used as a performance metric. "Starting with the first ever work that shows pass@k degrades after RL (Yue et al., 2025)"
  • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm using clipped objectives to stabilize policy updates. "The policy model is optimized to maximize the PPO (Schulman et al., 2017) loss:"
  • Reference log-likelihood: The log-probability assigned by a fixed reference model to a sequence, used to analyze rollout probabilities. "Distribution of reference log-likelihood for different data examples' correct rollouts."
  • Reference model: A fixed policy used to regularize the current model via KL penalties during RL fine-tuning. "penalizes deviation from a reference model."
  • Reinforcement Learning with Verifiable Reward (RLVR): An RL setup for LLMs where rewards are based on automatically verifiable outcomes (e.g., correct answers). "Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving LLM's (LLM) reasoning ability."
  • Reasoning traces: The step-by-step intermediate reasoning produced by the model, analyzed for coherence and quality. "Qualitative inspection of reasoning traces further indicates that although the final answers may be correct, the model frequently produces incoherent or even erroneous intermediate reasoning steps on unlearnable examples (Figure 1d)."
  • Rollouts: Sampled trajectories or model outputs for a given prompt used to compute rewards and gradients. "even when correct rollouts are present."
  • Similar problems: Augmented problems designed to share solution strategies with originals, used to test transfer. "generate 5 similar problems that can be solved with the same strategy."
  • Subproblems: Decomposed tasks whose solutions help solve the original problem, used for augmentation and compositional training. "generate subproblems Dsub."
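
Three of the glossary terms (advantage, outcome reward variance, and pass@k) are easy to make concrete. The sketch below is illustrative only and makes several assumptions that are not taken from the paper:

- It uses the population standard deviation and a small `eps` for numerical stability.
- The variance check is only a simplified stand-in for the filter used in dynamic sampling.
- The pass@k function is the commonly used combinatorial estimator, computed from `n` samples of which `c` are correct.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts.

    Each rollout's advantage is (reward - group mean) / (group std + eps).
    Population std and eps are assumptions made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def has_reward_variance(rewards):
    """True if the group mixes different rewards.

    A group with identical rewards (all correct or all wrong) yields zero
    advantages, so it carries no learning signal under this scheme.
    """
    return len(set(rewards)) > 1

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given c correct outputs among n samples: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, a group with rewards `[1, 0, 0, 0]` gives the single correct rollout a positive advantage and the other three a negative one. A group of `[0, 0, 0, 0]` gives all zeros, which is why such groups are typically filtered out.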
