Unlearnability in RLVR: Limits and Impacts
- The paper identifies that unlearnability in RLVR results from exponential decay in atomic skills and persistent capability erosion.
- It outlines formal mechanisms—such as the gradient gap and support preservation—that create barriers to optimal reward-based learning.
- Empirical findings show that despite verifiable rewards, models struggle to acquire or retain complex, compositional reasoning abilities.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central paradigm for post-training LLMs on complex reasoning and compositional tasks. Despite empirical advances, a wide range of theoretical, algorithmic, and empirical studies have consistently identified “unlearnability” phenomena—the persistent failure to acquire or retain certain capabilities even in the presence of correct reward signals, algorithmic refinements, or abundant data. This article synthesizes the technical origins, manifestations, and consequences of unlearnability in RLVR, referencing key contributions, formal definitions, and experimental findings from the most current arXiv literature.
1. Foundational Definitions and Formalism
Unlearnability in RLVR refers to the systematic inability of an RLVR process to acquire, retain, or recover specific solution patterns or reasoning capabilities, regardless of positive reward signals or extensive training. Formally, RLVR is typically modeled either as an autoregressive LLM policy or as a learner from positive/verifiable data under an equivalence relation.
Consider a model parameterized by $\theta$, acting via a policy $\pi_\theta$ that generates a solution $y$ for a prompt $x$ and receives a reward $r(x, y)$. For compositional tasks, “instance-level solvability” is the probability of generating a correct solution for $x$. When $x$ decomposes into atomic steps $s_1, \dots, s_n$, and $p_i$ denotes the success probability of step $s_i$, the composite success is approximated as $\prod_{i=1}^{n} p_i$.
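To make the multiplicative structure concrete, the following minimal sketch computes composite success as a product of stepwise success probabilities. It assumes the atomic steps succeed independently; the function name `composite_success` and the numeric values are illustrative and are not taken from the cited work.

```python
import math


def composite_success(step_probs):
    """Approximate composite success as the product of per-step success
    probabilities (assumes the steps succeed independently)."""
    return math.prod(step_probs)


# Illustrative values only: a uniform per-step success rate of 0.95.
for n_steps in (1, 10, 50):
    p = composite_success([0.95] * n_steps)
    print(f"{n_steps:>2} steps -> composite success {p:.3f}")
```

With these illustrative numbers, a per-step success rate of 0.95 gives roughly 0.60 composite success at 10 steps and under 0.08 at 50 steps, which is why small per-step error rates matter for long chains.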
Unlearnability arises when any of the following holds:
- The instance-level solvability cannot be elevated above a minimal threshold for some prompts even with RLVR.
- The support of the trained policy fails to include correct solutions not already present in the base model.
- Certain atomic primitives or sub-tasks suffer performance regressions during global reward optimization (“capability erosion”).
- Empirical RLVR training plateaus or collapses, preventing asymptotic improvement in accuracy (as characterized through the “Gradient Gap” and step size thresholds).
In positive-equivalence RLVR (learning in the limit from positive data), unlearnability is formalized through the failure of explanatory, behavioral, vacillatory, or confident learning on infinite families or ascending union sets (Belanger et al., 2020).
2. Theoretical Mechanisms and Sharp Barriers
Unlearnability in RLVR arises from multiple, precisely characterized mechanisms:
- Multiplicative Decay and Atomic Sharpening: For long-chain reasoning, each atomic step's error probability compounds multiplicatively, so composite success decays exponentially with chain length, rendering multi-step tasks unsolvable unless atomic skills are “sharpened” collectively (Wang et al., 9 Feb 2026).
- Task-Advantage Ratio and Structural Advantage: For a composite reasoning step and a candidate move, learning proceeds only if the task-advantage ratio is large enough to supply a usable signal. If this signal is absent or exponentially small, RLVR cannot enhance the correct path and may reinforce suboptimal solutions (Barzilai et al., 8 Feb 2026).
- Gradient Gap and Step-Size Thresholds: The optimization dynamics depend critically on the “Gradient Gap.” If the learning rate is too large relative to the Gradient Gap and the response length, training stalls (plateau) or collapses (accuracy degrades) rather than converging to optimality (Suk et al., 9 Oct 2025).
- Support Preservation (Invisible Leash): Standard RLVR with on-policy updates cannot place any probability mass on solutions with zero base-model support. As a result, the set of discoverable solutions is restricted—the empirical support shrinkage usually outweighs expansion (Wu et al., 20 Jul 2025).
- Verifier Noise and Phase Transition: The net discriminative power of the verifier, captured by Youden’s index $J = \mathrm{TPR} - \mathrm{FPR}$, induces a phase transition: if $J > 0$ learning succeeds, if $J = 0$ learning stalls (“neutral drift”), and if $J < 0$ anti-learning occurs (collapse to incorrect solutions) (Rad et al., 7 Jan 2026).
- Adversarial and Ascending-Union Family Barriers: Algorithmic learning in positive-data settings inherits impossibility results from Gold’s theorem: no behaviorally correct learner exists for ascending union families, and other sharp class-separation results persist for any positive equivalence (Belanger et al., 2020).
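The verifier-noise regimes listed above can be illustrated with a small sketch. It uses the standard definition of Youden's index, J = sensitivity + specificity - 1, which equals the true-positive rate minus the false-positive rate. Here the true-positive rate is taken to be the probability that the verifier accepts a correct solution and the false-positive rate the probability that it accepts an incorrect one; the function names, the tolerance `tol`, and the example rates are assumptions made for illustration.

```python
def youden_index(tpr, fpr):
    """Youden's J = sensitivity + specificity - 1 = TPR - FPR."""
    return tpr - fpr


def verifier_regime(tpr, fpr, tol=1e-9):
    """Map a verifier's error rates to one of the three regimes.

    `tol` is an illustrative numerical tolerance around J = 0.
    """
    j = youden_index(tpr, fpr)
    if j > tol:
        return "learning"
    if j < -tol:
        return "anti-learning"
    return "neutral drift"


# Hypothetical verifiers, from informative to adversarially wrong.
print(verifier_regime(tpr=0.9, fpr=0.2))  # J is about 0.7
print(verifier_regime(tpr=0.5, fpr=0.5))  # J is 0
print(verifier_regime(tpr=0.3, fpr=0.6))  # J is about -0.3
```

A verifier that accepts correct and incorrect answers at the same rate carries no net signal, regardless of how often it happens to be right on any single example.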
3. Empirical Manifestations and Case Studies
Unlearnability is not merely theoretical; it manifests in diverse practical RLVR deployments:
- Capability Erosion: Empirical studies using the Algebrarium framework reveal that, when optimizing global expected reward, RLVR can “sacrifice” certain atomic skills (success rates falling from high to low) to maximize aggregate performance, especially for under-sampled or minority patterns. This is quantifiable via a negative shift in atomic-step success and a high Pearson correlation between composite and atomic-step probabilities (Wang et al., 9 Feb 2026).
- Persistent Unsolvability: For “hard” examples with low initial pass@1, a substantial subset remains unlearnable, i.e., pass@1 fails to improve across all validation rollouts, even after observing correct training trajectories and despite positive reward signals. Gradient similarity analysis reveals that unlearnable cases are representation outliers; no amount of increased rollouts, relaxed PPO constraints, or standard data augmentation remedies this deficit (Chen et al., 16 May 2026).
- Limits in Combinatorial Reasoning: On tasks such as Activity Scheduling and Longest Increasing Subsequence, RLVR amplifies superficial heuristics already present in the base model (e.g., format mimicry, greedy patterns) rather than acquiring new algorithmic reasoning—genuine combinatorial reasoning remains unlearnable under conventional RLVR schemes (Alam et al., 30 Oct 2025).
- Generalization Collapse Under Noise: Even moderate label noise (10–50%) in the verifier or output labels reduces reasoning accuracy and contracts the set of solvable problems. Algorithmic variants (Dr.GRPO, TIS, DAPO, SAPO, PGFC) do not materially mitigate the destructiveness of noise (Zhu et al., 17 Mar 2026).
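Since the case studies above are stated in terms of pass@1, a minimal sketch of how such a metric is commonly estimated from rollouts may help. It uses the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), for n rollouts of which c are correct. The cited studies' exact evaluation protocols are not given here, so the prompt names and counts below are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n rollouts of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever k > n - c, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) rollout counts per prompt.
rollout_counts = {"prompt_a": (16, 0), "prompt_b": (16, 4)}
for name, (n, c) in rollout_counts.items():
    print(name, pass_at_k(n, c, k=1))
```

For k = 1 the estimator reduces to the fraction of correct rollouts, c / n, so a prompt with no correct rollout scores exactly zero.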
4. Diagnosis via Representation and Optimization Analysis
Cross-example gradient similarity and concept-network analyses have provided systematic tools for diagnosing unlearnability:
- Gradient Similarity Metrics: For each training example, the mean cosine similarity of its correct-rollout gradient to those of other examples serves as an indicator of representation alignment. Unlearnable examples exhibit distinctly low similarity, marking them as isolated from the broader training geometry. Attempts at data augmentation or subtask decomposition do not raise this similarity or the performance on these cases (Chen et al., 16 May 2026).
- Complex Network Frustration: RLVR-trained LLMs induce a sparse semantic complex network. The “frustration index,” measuring the network’s fragmentation into disconnected skill islands, peaks at the regime of maximal unlearnability. The emergence and resolution of this plateau can be tracked via the number of connected components and cluster entropy (Hu et al., 28 Sep 2025).
- Empirical Support Transitions:
| Benchmark | Shrinkage | Expansion | Preservation |
|------------------------|-------------|------------|--------------|
| OlympiadBench (k=8192) | 26 | 3 | 600 |
| MATH500 + Minerva | Dozens lost | Few gained | Majority |
Shrinkage in support (correct completions with reduced mass) exceeds expansion by wide margins (Wu et al., 20 Jul 2025).
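The gradient similarity diagnostic described in this section can be sketched in a few lines. This is a toy version under stated assumptions: each example is represented by a single flattened gradient vector, the vectors below are invented, and the helper names `cosine` and `mean_cross_similarity` are hypothetical. Which parameters and rollouts the cited analysis actually uses is not specified here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mean_cross_similarity(grads):
    """For each gradient, the mean cosine similarity to all other gradients."""
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two examples")
    return [
        sum(cosine(grads[i], grads[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


# Toy gradients: three roughly aligned examples and one isolated direction.
toy = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0],
]
scores = mean_cross_similarity(toy)
most_isolated = min(range(len(scores)), key=scores.__getitem__)
print([round(s, 3) for s in scores])
print("most isolated example index:", most_isolated)
```

In this toy setting the fourth vector has the lowest score, mirroring the description of unlearnable examples as outliers in the training geometry.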
5. Broader Implications and Mitigation Strategies
Unlearnability reveals fundamental limits of RLVR and prescribes requirements for more robust capability acquisition.
- Necessity of High-Quality, Verified Data: Empirically, RLVR cannot compensate for noisy or incorrect labels. High label quality is indispensable; algorithmic tweaks or bias corrections are insufficient (Zhu et al., 17 Mar 2026).
- Instance-wise Constraints and Diversity-Aware Sampling: Mitigation may require enforcing minimal per-instance solvability thresholds (e.g., via Lagrangian penalties), or batch stratification to maintain representation of rare atomic primitives (Wang et al., 9 Feb 2026).
- Reward Redesign and Intermediate Verification: Enhancing the inductive structure of the reward (e.g., process-level or partial reward on intermediate steps) and incorporating auxiliary objectives that validate reasoning traces can create the structural advantage needed for new solution discovery (Barzilai et al., 8 Feb 2026, Alam et al., 30 Oct 2025).
- Exploration and Off-Policy Mass Injection: Overcoming the “invisible leash” necessitates explicit exploration mechanisms—exploration mixtures, diversity-promoting regularization, or off-policy data augmentation—that inject probability mass into underrepresented or previously zero-probability solution regions (Wu et al., 20 Jul 2025).
- Representation-Aligned Mid-Training: Addressing deep representational flaws may require curriculum mid-training or pre-training on reasoning-heavy corpora before RLVR, as post-hoc RLVR updates do not repair misaligned features (Chen et al., 16 May 2026).
- Annealed Training Procedures: Temporarily increasing exploration via supervised fine-tuning (e.g., “Annealed-RLVR”) at frustration peaks can break skill-competition bottlenecks and reduce catastrophic forgetting (Hu et al., 28 Sep 2025).
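The role of explicit mass injection can be illustrated on a toy discrete policy. The sketch below uses an exponential reward tilt as a stand-in for a reward-driven update; this is an idealized assumption chosen because it makes support preservation easy to see, not the update rule of any cited method. The outcome names, `beta`, and `eps` are likewise illustrative.

```python
import math


def tilt(policy, reward, beta=1.0):
    """Reweight a policy by exp(beta * reward) and renormalize.

    Outcomes with zero prior mass stay at exactly zero.
    """
    weights = {y: p * math.exp(beta * reward[y]) for y, p in policy.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def mix(policy, explorer, eps):
    """Blend the policy with an exploration distribution (mass injection)."""
    return {y: (1 - eps) * policy[y] + eps * explorer[y] for y in policy}


# Toy base policy with no mass on the only correct outcome.
base = {"wrong_a": 0.7, "wrong_b": 0.3, "correct": 0.0}
reward = {"wrong_a": 0.0, "wrong_b": 0.0, "correct": 1.0}
uniform = {y: 1 / len(base) for y in base}

leashed = tilt(base, reward, beta=5.0)
injected = tilt(mix(base, uniform, eps=0.05), reward, beta=5.0)
print("without injection:", leashed["correct"])
print("with injection:   ", round(injected["correct"], 3))
```

Without injection the correct outcome stays at zero probability no matter how strongly the reward is weighted, whereas a small exploration mixture gives the reward something to amplify.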
6. Connections to Classical and Modern Learning Theory
Unlearnability phenomena in RLVR are consistent with classical impossibility results in algorithmic learning theory. For any positive (r.e.) equivalence relation on data, there exist infinite ascending chain families, as well as r.e. families that are not behaviorally correctly learnable, not vacillatorily learnable, and not confidently learnable (Belanger et al., 2020). Modern RLVR inherits these limitations under both deterministic and stochastic outcome-level feedback, linking theoretical barriers in computability to practical inductive failures in LLMs.
7. Outlook and Future Directions
The persistent and multifaceted nature of unlearnability in RLVR underscores the need for a rigorous, geometry-aware understanding of representation, reward structure, verifier quality, and optimization stability. Breakthroughs will require theoretical innovations in credit assignment, process-level verification, and algorithmic exploration, as well as scalable systems for precise and diverse data curation. Further progress in RLVR as a route to robust reasoning capability will depend on systematically addressing the multiple sharp failure modes catalogued in current research (Zhu et al., 17 Mar 2026, Chen et al., 16 May 2026, Barzilai et al., 8 Feb 2026, Wu et al., 20 Jul 2025, Hu et al., 28 Sep 2025, Suk et al., 9 Oct 2025, Alam et al., 30 Oct 2025, Belanger et al., 2020, Wang et al., 9 Feb 2026).