Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Published 21 May 2026 in cs.LG | (2605.22703v1)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.

Summary

  • The paper identifies the clipping mask as the primary instability source in RLVR and shows that hard clipping discards valuable near-boundary gradients.
  • It introduces Near-boundary Stochastic Rescue (NSR), which stochastically re-admits marginally out-of-bound token updates to recover informative gradients.
  • Extensive experiments confirm NSR improves training stability, sample efficiency, and reasoning performance across various LLM architectures.

Clipping Bottleneck and NSR: Advancing Stability in RLVR

Motivation and Diagnosis

Reinforcement Learning with Verifiable Rewards (RLVR) is a foundational approach for scaling LLM reasoning, leveraging deterministic verifiers for robust supervision in domains such as mathematics and coding. Widely used policy optimization objectives in RLVR, specifically GRPO-style clipped objectives such as DAPO and GSPO, impose token- and sequence-level trust-region constraints via hard clipping, which zeroes out gradient flow for policy ratios outside an admissible interval. These objectives are theoretically justified, but empirically they are highly sensitive to hyperparameters and markedly unstable, which manifests as entropy collapse, reproducibility issues, and oscillatory optimization trajectories.
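To make "zeroes out gradient flow" concrete, here is a minimal sketch of a per-token clipped surrogate in plain Python. The function names and the clip widths (0.2 and 0.28) are illustrative assumptions, not values taken from the paper, and the gradient is estimated by finite differences rather than autograd.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO/GRPO-style clipped surrogate for one token (illustrative clip widths)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def gradient_weight(ratio, advantage, h=1e-6):
    """Central finite-difference estimate of d(surrogate)/d(ratio)."""
    up = clipped_surrogate(ratio + h, advantage)
    down = clipped_surrogate(ratio - h, advantage)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    # Positive advantage: ratios just past the upper bound and far past it
    # receive the same treatment, which is the rigidity discussed above.
    for r in (1.10, 1.27, 1.29, 2.00):
        print(f"ratio={r:.2f}  weight={gradient_weight(r, advantage=1.0):.3f}")
```

In this sketch a ratio of 1.29 and a ratio of 2.00 both get zero weight, so the objective cannot tell a marginal violation from a severe one.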

A systematic analysis presented in this work reveals that the primary bottleneck is the binary clipping decision. In particular, informative learning signals may reside just outside the clipping thresholds, but hard clipping discards them wholesale, failing to discriminate marginal from severe out-of-bound violations. Empirical interventions demonstrate optimization is robust to gradient magnitude, but hypersensitive to the clipping mask; coupled perturbations that affect the mask lead to collapse, but decoupled noise (which preserves the clipping decision) stabilizes training and improves sample efficiency.
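The difference between coupled and decoupled perturbations can be sketched as follows. This is one reading of the intervention described above, not the paper's code: the names `in_trust_region` and `noisy_weight`, the multiplicative form of the noise, and the clip widths are all assumptions made for illustration.

```python
def in_trust_region(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Advantage-dependent check: the upper bound binds for A >= 0, the lower for A < 0."""
    if advantage >= 0:
        return ratio <= 1.0 + eps_high
    return ratio >= 1.0 - eps_low


def noisy_weight(ratio, advantage, noise, coupled):
    """Gradient weight on the ratio when a multiplicative noise factor is applied.

    coupled=True : the noisy ratio also drives the keep/drop decision.
    coupled=False: the clean ratio decides; noise only rescales the kept update.
    """
    r_exec = ratio * noise                   # ratio used to scale the update
    r_dec = r_exec if coupled else ratio     # ratio used for the clipping decision
    if in_trust_region(r_dec, advantage):
        return noise * advantage
    return 0.0


if __name__ == "__main__":
    # Same token, same noise draw: only the coupled variant flips the mask.
    print(noisy_weight(1.25, 1.0, noise=1.05, coupled=False))
    print(noisy_weight(1.25, 1.0, noise=1.05, coupled=True))
```

With the decision decoupled, noise changes only the magnitude of updates that were going to be kept anyway, which is the regime the analysis reports as benign.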

Mechanism: Near-boundary Stochastic Rescue (NSR)

The authors propose Near-boundary Stochastic Rescue (NSR), a plug-and-play mechanism that targets the clipping boundary. NSR operates by stochastically re-admitting out-of-bound token updates based on their proximity to the trust region. In-bound tokens are handled as before, with the clipping decision made on the original importance ratio. For tokens slightly out of bound, a stochastic executor samples a multiplicative perturbation and rescues the gradient if the perturbed ratio falls back within the trust region. NSR thus retains informative near-boundary learning signals, while conservative policy constraints remain intact for deep violations.

Mechanistically, NSR transforms the rigid binary gate into a probabilistic filter, where the admission probability decays with the degree of boundary violation. Ablation studies conclusively show that gains stem from recovering gradients in the rescue zone, not from exaggerating in-bound updates (push-out zone), highlighting that standard hard clipping is empirically over-aggressive.
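A minimal per-token sketch of this probabilistic gate, in plain Python, might look like the following. It assumes a uniform multiplicative perturbation z ~ U(1-δ, 1+δ), an advantage-dependent trust region, and that a rescued token contributes with weight z times the advantage; the function name, the default δ, and the clip widths are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def nsr_weight(ratio, advantage, delta=0.1, eps_low=0.2, eps_high=0.28, rng=random):
    """Gradient weight on the ratio for one token under an NSR-style gate (sketch)."""
    lower, upper = 1.0 - eps_low, 1.0 + eps_high

    def in_bound(r):
        # Upper bound binds for non-negative advantages, lower bound otherwise.
        return r <= upper if advantage >= 0 else r >= lower

    if in_bound(ratio):
        return advantage                  # in-bound tokens are left untouched
    z = rng.uniform(1.0 - delta, 1.0 + delta)
    if in_bound(z * ratio):
        return z * advantage              # rescued: perturbed ratio re-enters the region
    return 0.0                            # still out of bound: dropped as in hard clipping


if __name__ == "__main__":
    rng = random.Random(0)
    for r in (1.20, 1.30, 1.40, 2.00):
        kept = sum(nsr_weight(r, 1.0, rng=rng) != 0.0 for _ in range(10_000))
        print(f"ratio={r:.2f}  admitted fraction={kept / 10_000:.3f}")
```

Under these assumptions the admitted fraction falls as the ratio moves away from the bound and reaches zero once the ratio exceeds upper / (1 - δ), so deep violations are never rescued.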

Theoretical Analysis

NSR admits a rigorous mechanistic interpretation and theoretical characterization. In expectation, NSR induces implicit soft clipping: the stochastic process modulates the effective gradient for out-of-bound tokens by an inverse-square decay O(1/r^2), yielding smooth attenuation rather than binary censorship. The mathematical derivations confirm this attenuation profile symmetrically applies to both upper and lower trust region bounds. Layered ablation distinguishes stochastic filtering from deterministic gradient decay: while deterministic attenuation (e.g., explicit decay w(r) = (u/r)^k) provides stability improvements, NSR outperforms in aggregate performance and robustness, due to its boundary-local probabilistic rejection of unreliable directions.
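One way to see where an inverse-square term can arise is sketched below. It is a reconstruction under stated assumptions, not the paper's derivation: the perturbation is uniform, z ~ U(1-δ, 1+δ); the perturbed ratio is zr; a rescued token contributes with weight z; and u and ℓ denote the upper and lower clipping bounds.

```latex
\begin{aligned}
w_{\text{up}}(r) &= \mathbb{E}\big[z\,\mathbf{1}\{z r \le u\}\big]
  = \frac{1}{2\delta}\int_{1-\delta}^{u/r} z\,dz
  = \frac{(u/r)^2 - (1-\delta)^2}{4\delta},
  && u < r \le \frac{u}{1-\delta}, \\
w_{\text{low}}(r) &= \mathbb{E}\big[z\,\mathbf{1}\{z r \ge \ell\}\big]
  = \frac{1}{2\delta}\int_{\ell/r}^{1+\delta} z\,dz
  = \frac{(1+\delta)^2 - (\ell/r)^2}{4\delta},
  && \frac{\ell}{1+\delta} \le r < \ell .
\end{aligned}
```

Under these assumptions both expected weights depend on r only through a 1/r^2 term, and both reach zero at the outer edge of the rescue window, beyond which the token is always dropped.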

Empirical Results

Extensive experiments spanning hundreds of thousands of GPU hours and diverse LLM architectures (dense and MoE, 7B–30B scale) validate NSR's efficacy. NSR consistently improves both peak and aggregate performance across math benchmarks (AIME24/25, AMC) and general reasoning (GPQA, MMLU-Pro), substantially boosting Pass@1/Pass@16 metrics (e.g., >10% improvement in 7B models; statistically significant gains in larger MoEs). NSR maintains markedly lower clipped gradient fractions, converges faster, and exhibits more stable policy entropy compared to strong baselines (DAPO, GSPO). Qualitative analysis further indicates rescued gradients promote longer output sequences, facilitating deeper reasoning chains.

Crucially, NSR generalizes robustly across optimization granularities, adapting from token-level to sequence-level objectives. It introduces minimal computational overhead and only one additional hyperparameter (rescue window), offering a practical, boundary-local fix suitable for large-scale RLVR training regimes.

Implications and Future Directions

The identification of the clipping decision bottleneck and the deployment of NSR carry both pragmatic and theoretical implications. On the practical side, NSR provides immediate improvements in sample efficiency, training stability, and reasoning quality for RLVR-based LLMs, with minimal changes to established algorithms. Theoretically, NSR establishes that gradient-based RL fine-tuning benefits from boundary-local stochastic filtering and soft constraint mechanisms, rather than rigid binary gates—this principle may inform the design of future policy optimization objectives, especially in verifiable reward settings.

Unresolved questions include optimal tuning of the rescue window across algorithms and architectures, the interplay between NSR and MoE routing, and broader applicability to RL setups with learned value functions or dense rewards. Additionally, dynamic adaptation of NSR parameters, integration with entropy regularization, and formalization in model selection contexts represent promising directions.

Conclusion

This paper presents a rigorous diagnosis of instability in clipping-based RLVR, isolates the clipping mask as the principal bottleneck, and introduces NSR as an effective plug-and-play solution. NSR probabilistically rescues near-boundary tokens, recovers censored informative gradients, and induces a theoretically sound soft constraint mechanism. Empirical evaluation confirms its stability and superiority over deterministic attenuation schemes. NSR thus advances both the practice and understanding of verifiable reward RL for LLM reasoning, suggesting further development of boundary-local stochastic filtering paradigms is warranted.

Explain it Like I'm 14

What is this paper about?

This paper studies how to train LLMs to reason better using a method called Reinforcement Learning with Verifiable Rewards (RLVR). In math or coding, a computer can check whether an answer is correct. RLVR uses that check as a “reward” to teach the model. The paper finds a hidden problem in a common training rule called “clipping,” and introduces a simple fix—Near-boundary Stochastic Rescue (NSR)—that makes training more stable and improves results.

The big questions the paper asks

  • Why do popular RLVR training methods sometimes feel shaky—hard to tune, jumpy, and slow to settle?
  • Is the problem caused by how big the training signals are, or by how the system decides which signals to keep?
  • Can we design a tiny, easy-to-add change that keeps helpful signals instead of throwing them away?

How the researchers approached the problem (with simple analogies)

First, some quick translations:

  • “Rewards” are like grades the model gets for its answers (correct or incorrect).
  • “Gradient” is the “direction and strength” the model uses to adjust itself after each answer—a learning signal.
  • “Clipping” is a safety rule that says: “Don’t change too much at once.” Think of it like a speed limit for learning.

Here’s the issue. In many RLVR methods, clipping is “hard”: if a learning signal is slightly outside the allowed range, it’s completely dropped (treated as zero). Imagine a referee who marks any runner 1 millimeter beyond the track boundary as “out,” even if they were very close to staying in. That rigid rule can throw away useful hints.

What the team did:

  • They ran controlled experiments to separate two things: the size of the learning signal versus the yes/no decision to keep or drop it at the clipping boundary.
  • They discovered training is pretty tolerant to signal size changes (like slightly louder or softer hints), but extremely sensitive to the keep/drop decision at the boundary.
  • Based on this, they designed NSR. Instead of always dropping signals that are just barely outside the boundary, NSR flips a weighted coin, with odds based on how far out the signal is. If the coin says “rescue,” that signal is nudged back inside and used. If it’s clearly too far out, it still gets dropped.

Think of NSR like a teacher who sometimes accepts almost-late homework if it’s only a minute past the deadline, but still rejects work that’s hours late. This small bit of flexibility keeps helpful learning moments.

What they found and why it matters

Here are the main takeaways:

  • The real bottleneck is the rigid keep/drop decision at the clipping boundary. Slightly out-of-bounds signals can be valuable, but hard clipping throws them away.
  • Simply changing the size of signals (making them a bit larger or smaller) doesn’t fix instability. Changing the boundary decision does.
  • NSR is a tiny, “plug-and-play” tweak that rescues near-boundary signals probabilistically. It doesn’t relax safety rules in general; it only helps borderline cases.
  • In tests across different model sizes and types (about 7B, 8B, and 30B parameters, including dense and MoE models), NSR consistently improved performance and stability compared to strong baselines like DAPO and GSPO. On math tasks, models gained roughly 4–6 percentage points in accuracy (Pass@1), and training avoided nasty spikes in randomness (“entropy”) that can make models behave unpredictably.
  • Models trained with NSR tended to produce slightly longer, more thoughtful reasoning chains, which often helps on complex problems.
  • Even though NSR acts like a “soft” version of clipping on average (it gently shrinks the effect of out-of-bounds signals), the paper shows the stochastic rescue near the boundary works better than simple, always-on soft shrinking. Why? Because NSR filters out unreliable directions entirely unless they’re close enough to be useful.

What this could mean going forward

  • More stable training for reasoning-focused LLMs: NSR helps models learn smoothly and avoid chaotic swings, making results more reliable.
  • Better sample efficiency: By recovering signals that would otherwise be thrown away, NSR squeezes more learning out of the same data and compute.
  • Easy adoption: NSR is a small change that can be added to existing RLVR systems without redoing everything.
  • Potential for stronger math and coding assistants: If training is steadier and more accurate, users get better step-by-step solutions.

In short, this paper spots a simple but important problem—throwing away near-good learning signals—and fixes it with a small, smart rule. That makes training more reliable and helps models reason better without heavy changes to the core algorithms.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and open questions left unresolved by the paper; each item is framed to guide actionable future research.

  • Scope transferability: Does NSR retain its benefits under standard RL settings with learned value functions (e.g., GAE), dense/continuous rewards, or unclipped/KL-only objectives (PPO/TRPO), beyond the GRPO/DAPO-style RLVR regime studied?
  • Convergence guarantees: Can we establish theory for monotonic improvement, stability, or trust-region-style KL bounds under NSR, beyond the expectation-level 1/r^2 decay argument?
  • Bias–variance characterization: How does NSR affect gradient bias and variance relative to hard clipping and deterministic soft clipping, and what are the implications for sample complexity?
  • KL behavior: What is NSR’s effect on empirical KL dynamics and trust-region adherence during training, and can we derive analytic links or bounds relating NSR to KL-constrained updates?
  • Hyperparameter sensitivity: How sensitive are outcomes to the rescue window δ and clip widths (ε_low, ε_high), and what principled tuning or scheduling strategies (e.g., adaptive δ tied to clip-fraction targets) work across tasks and models?
  • Alternative rescue distributions: Beyond uniform z ~ U(1−δ, 1+δ), do other noise distributions (e.g., truncated normal, Beta, asymmetric or distance-aware noise) or advantage-dependent noise improve stability or performance?
  • Optimal stochasticity design: Can we derive or learn an “optimal” rescue probability as a function of distance-to-boundary, advantage magnitude, or token entropy that provably minimizes variance while retaining informative signals?
  • Deterministic soft clipping vs. stochastic rescue: Is there a deterministic attenuation function (beyond (u/r)^k) that matches NSR’s expected gradient and stability in practice, or is stochastic filtering fundamentally necessary?
  • Admission criteria: Should rescue decisions integrate additional signals (e.g., advantage magnitude, token entropy, token position, correctness indicators) instead of solely using the decision ratio r_dec?
  • Sequence-level rescue formalization: How exactly should NSR be instantiated for sequence-level clipping (e.g., GSPO) to balance per-token credit assignment and sequence-level accept/reject decisions?
  • Exploration and entropy: What mechanisms explain NSR’s entropy behavior (e.g., avoiding spikes), and can we predict or control its exploration impact analytically?
  • Safety and robustness: Does stochastic rescue increase the risk of admitting harmful or highly divergent updates (e.g., outliers), and what guardrails (e.g., heavy-tail detection, robust clipping) mitigate such risks?
  • Verifier noise and graded rewards: How does NSR perform with noisy/approximate verifiers or graded/partial-credit rewards instead of binary verifiable outcomes?
  • Generalization across domains: Do NSR’s gains transfer to coding, tool-use, program synthesis, multimodal reasoning, and long-context tasks beyond math/STEM benchmarks?
  • Model/architecture diversity: Are results consistent for non-Qwen architectures (e.g., Llama, Mixtral, DeepSeek) and larger/smaller scales, including different MoE sparsity patterns?
  • Interaction with MoE routing: How does NSR alter expert load, routing stability, and token distribution across experts, and can it reduce MoE-specific training pathologies?
  • Compatibility with other stabilizers: What are the synergies or redundancies when combining NSR with entropy regularization, explicit KL penalties, difficulty-aware sampling, low-probability/high-entropy token strategies, or negative reinforcement?
  • Off-policy and replay: How does NSR behave with off-policy data, experience replay, and importance sampling corrections, and does it introduce additional bias or instability in those regimes?
  • Clip-fraction control: Can we design closed-loop controllers that adapt δ to maintain a target clip fraction, and does such control improve stability or efficiency across training phases?
  • Failure-mode mapping: In which regimes (datasets, reward regimes, clip widths) do near-boundary tokens cease to be informative, and how can we detect and disable NSR when it is harmful?
  • Distance-aware rescue beyond a window: Would a continuous, distance-conditioned rescue probability (rather than a fixed δ window) yield better stability or performance?
  • Calibration effects: What is NSR’s impact on probability calibration, uncertainty estimates, and confidence–accuracy trade-offs in reasoning outputs?
  • Long-horizon credit assignment: How does NSR reshape token-level credit along reasoning chains (e.g., early vs. late steps), and is the observed increase in response length causally tied to improved correctness?
  • Compute and efficiency accounting: What are the true wall-clock and GPU-hour gains (or costs) from lower clip fractions under NSR, including RNG overhead and distributed training effects?
  • Reproducibility depth: Do results hold over more seeds and statistical tests (e.g., confidence intervals), and are runs reproducible across hardware/framework variations and RNG stream management?
  • Group-size sensitivity: How does NSR’s efficacy vary with GRPO/DAPO group sizes and the enforced mix of positive/negative samples?
  • Monitoring and diagnostics: Which training-time metrics (e.g., rescue rate, distance-to-boundary histogram, clip fraction) best predict NSR success/failure, and can we standardize their reporting?
  • Implementation portability: Are there framework- or accelerator-specific pitfalls (e.g., PyTorch vs. JAX, CUDA RNG determinism) that affect NSR correctness or stability at scale?
  • Learnable rescue policies: Can rescue be parameterized and learned (e.g., a small controller predicting admission probabilities), and how do we prevent gaming or destabilization?
  • Symmetry in negative advantages: Do empirical dynamics under the lower bound (Â < 0) mirror the theoretical symmetry, or are there practical asymmetries that require specialized handling?
  • Alignment and safety outcomes: Does NSR affect hallucination, toxicity, or other alignment metrics, and how should one audit and constrain potential adverse shifts?
  • Dataset breadth and OOD robustness: How does NSR perform on broader and out-of-distribution benchmarks (e.g., GSM8K, MATH, HumanEval, BBH, long-form QA), including ablations on contamination and domain shift?
  • Partial-credit and multi-objective settings: Can NSR be adapted for multi-objective or constraint-augmented rewards (e.g., correctness plus brevity), and how should rescue interact with competing objectives?

Practical Applications

Immediate Applications

The following applications can be deployed now with modest engineering effort, assuming access to RLVR pipelines (e.g., DAPO/GSPO-style training), deterministic verifiers, and standard PyTorch-based stacks.

  • Stable RLVR training plug-in for LLM reasoning
    • Sector: software/AI infrastructure
    • What to deploy: a PyTorch module wrapping the clipping step with Near-boundary Stochastic Rescue (NSR) and a config flag (e.g., rescue_window_delta = 0.1) for DAPO/GSPO/GRPO-like trainers
    • Workflow: keep the decision (Judge) ratio r_dec unchanged; sample z ~ U(1-δ, 1+δ) only for out-of-bound tokens; admit the token's update when the perturbed execution ratio r_exec falls back within bounds; log clip fraction, rescue rate, and entropy
    • Dependencies/assumptions: deterministic reward/verifier; clipping-based objectives; δ requires light tuning; currently validated on Qwen-family models (7B–30B dense/MoE)
  • More reliable math and coding model fine-tuning
    • Sector: education, software engineering
    • What to deploy: RLVR fine-tunes of math tutors and code assistants using NSR to reduce entropy spikes and improve Pass@k
    • Workflow: use problem verifiers (math solvers, unit tests, static analyzers) as rewards; integrate NSR into token- or sequence-level clipping; monitor response length and entropy for stability
    • Dependencies/assumptions: high-quality verifiers and task suites (AIME/AMC-like for math, curated unit-test sets for code); existing PPO/GRPO-style trainers
  • Cost and time reduction via faster, steadier convergence
    • Sector: AI operations (training efficiency)
    • What to deploy: training schedules that exploit NSR’s faster convergence to reach target Pass@1 earlier; early-stopping criteria that watch clip fraction and rescue rate
    • Workflow: use the same hyperparameters as the baseline, changing only the NSR setting; use saved GPU hours to extend curricula or increase sample count
    • Dependencies/assumptions: training telemetry in place; comparable hardware; routine hyperparameter sweeps primarily around δ and clip widths
  • Sequence-level rescue for group sequence optimization
    • Sector: AI research labs training MoE or long-context models
    • What to deploy: NSR inside GSPO-style sequence-level clipping to probabilistically retain near-threshold sequences otherwise dropped
    • Workflow: apply rescue at sequence-level ratio; track “rescued sequence” share and its impact on entropy and Pass@k
    • Dependencies/assumptions: group-based sampling; sequence-level importance ratios available
  • Reproducibility and monitoring upgrades
    • Sector: academia, evaluation
    • What to deploy: standardized logs for decision-vs-execution ratios, rescue-zone proportions, clip fraction, entropy dynamics, and near-boundary token distributions
    • Workflow: add dashboards and regression tests over these signals; publish per-run mean±SD for stability
    • Dependencies/assumptions: minimal—pure instrumentation; highly recommended for papers and shared checkpoints
  • Safer entropy dynamics in RLVR
    • Sector: training safety/robustness in AI labs
    • What to deploy: run-time guardrails that auto-adjust δ within a small band (e.g., 0.05–0.15) to damp entropy spikes without widening trust regions globally
    • Workflow: detect rising entropy or oscillations; briefly lower δ; revert when stable
    • Dependencies/assumptions: entropy/length monitoring; policy constraints remain clipping-based
  • Data engineering and analytics assistants trained with verifiable SQL/DSL tasks
    • Sector: data platforms, BI/analytics
    • What to deploy: RLVR fine-tunes for SQL/DSL generation using query validators and schema checkers as verifiers, with NSR to recover near-boundary learning signals
    • Workflow: compile unit tests and constraint checkers (e.g., EXPLAIN validation, type/constraint satisfiability); integrate NSR into training loop
    • Dependencies/assumptions: robust validators for target DSLs; clipping-based RL objective
  • MLOps packaging and integration
    • Sector: MLOps/tools
    • What to deploy: a reusable library component exposing NSR as a drop-in “stochastic soft clipping” layer; a Hydra/YAML config block; CI tests covering rescue logic
    • Workflow: add config toggle (enable_nsr), δ, and logging; support both token- and sequence-level use
    • Dependencies/assumptions: PyTorch clamp-based clipping; minimal changes to trainer code
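The "Safer entropy dynamics" guardrail in the list above (lower δ briefly when entropy rises, revert when stable) could be prototyped along these lines. The class name, the spike test against an exponential moving average, and every default value are hypothetical choices for illustration; the paper does not specify such a controller.

```python
class DeltaGuardrail:
    """Lowers the rescue window on an entropy spike and restores it afterwards (sketch)."""

    def __init__(self, base_delta=0.10, min_delta=0.05, spike_ratio=1.2,
                 step=0.025, decay=0.9):
        self.base_delta = base_delta    # value used while training is stable
        self.min_delta = min_delta      # floor of the allowed band
        self.spike_ratio = spike_ratio  # entropy / running mean that counts as a spike
        self.step = step                # per-update change in delta
        self.decay = decay              # smoothing factor for the running mean
        self.delta = base_delta
        self.entropy_ema = None

    def update(self, entropy):
        """Feed the latest policy entropy and return the delta to use next."""
        if self.entropy_ema is None:
            self.entropy_ema = entropy
            return self.delta
        if entropy > self.spike_ratio * self.entropy_ema:
            self.delta = max(self.min_delta, self.delta - self.step)
        else:
            self.delta = min(self.base_delta, self.delta + self.step)
        self.entropy_ema = self.decay * self.entropy_ema + (1.0 - self.decay) * entropy
        return self.delta


if __name__ == "__main__":
    guard = DeltaGuardrail()
    for step, entropy in enumerate([1.0, 1.0, 2.0, 1.0, 1.0]):
        print(step, round(guard.update(entropy), 3))
```

The controller only ever moves δ between the floor and the base value, so it never widens the rescue window beyond the setting the run started with.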

Long-Term Applications

The following opportunities require further research, scaling, or validation beyond the paper’s scope (e.g., non-verifiable rewards, safety-critical domains, or different RL objectives).

  • Extending stochastic rescue beyond verifiable rewards (RLHF/RLAIF and KL-only PPO)
    • Sector: general AI alignment and instruction tuning
    • What to build: NSR-inspired boundary-local stochastic filtering for PPO with learned value functions or KL-only constraints
    • Dependencies/assumptions: theory and ablations in non-verifiable regimes; careful safety analysis; alternative trust-region definitions
  • Robotics and planning with simulator/constraint verifiers
    • Sector: robotics, logistics, operations research
    • What to build: RLVR on plan- or trajectory-generation using deterministic constraint/safety verifiers; NSR to stabilize learning near constraints without expanding trust regions
    • Dependencies/assumptions: reliable, fast simulators and safety checkers; mapping plans to verifiable specs; sim2real concerns
  • Finance and energy scheduling under rule/constraint checkers
    • Sector: finance, energy/grid, supply chain
    • What to build: planning LLMs trained with RLVR using formal compliance or feasibility checks; NSR to retain near-feasible proposals for learning
    • Dependencies/assumptions: high-fidelity validators; risk controls; regulatory acceptance; robust evaluation pipelines
  • Adaptive and learned rescue policies
    • Sector: AI research/infrastructure
    • What to build: meta-learned or adaptive δ schedules; learned admission probabilities conditioned on token context, advantage, or uncertainty
    • Dependencies/assumptions: added complexity must outperform simple uniform z sampling; research on overfitting/instability trade-offs
  • Standardization and policy guidance for stable, energy-aware RLVR
    • Sector: policy, funding agencies, industry consortia
    • What to build: best-practice guidelines encouraging boundary-local softening (e.g., NSR) in clipping-based RL; reporting standards for clip fraction, rescue rate, entropy, and energy per achieved Pass@k
    • Dependencies/assumptions: community agreement; benchmark suites with standardized telemetry
  • On-device or federated fine-tuning of compact reasoning models
    • Sector: edge AI, education, enterprise knowledge assistants
    • What to build: small (≤8B) models fine-tuned with RLVR+NSR on-device/edge with local verifiers (math/code), leveraging NSR’s stability to reduce failed runs
    • Dependencies/assumptions: efficient verifiers on-device; memory-constrained training stacks; privacy-preserving telemetry
  • Multi-modal RLVR with verifiers (vision-language, programmatic perception)
    • Sector: vision, scientific imaging, document understanding
    • What to build: verifiable tasks (e.g., LaTeX derivations from diagrams, executable scene graphs) with NSR stabilizing near-boundary updates
    • Dependencies/assumptions: deterministic verifiers for multi-modal outputs; latency constraints for verifiers
  • Formal-methods-integrated code synthesis and compiler optimization
    • Sector: software toolchains, compilers
    • What to build: LLM-based program synthesis trained against SMT/LLVM verification checks; NSR to harvest near-correct generations for faster learning
    • Dependencies/assumptions: scalable formal checkers; caching and parallelization to control verifier overhead
  • Automated theorem proving and scientific discovery agents
    • Sector: academia, R&D
    • What to build: RLVR with proof verifiers or symbolic math engines; NSR to maintain stable exploratory reasoning without entropy collapse
    • Dependencies/assumptions: comprehensive proof/verifier coverage; careful curriculum design
  • Safety and robustness research: probabilistic admission as a filter against spurious directions
    • Sector: AI safety
    • What to build: analyses of whether stochastic boundary-local admission reduces reward hacking or propagates fewer brittle gradients than deterministic soft clipping
    • Dependencies/assumptions: adversarial evaluations and stress tests; auditing for unintended behaviors

Cross-cutting assumptions and dependencies

  • Verifier availability and quality: Applications are strongest where rewards are deterministically verifiable (math, code, SQL, formal constraints); weak or noisy verifiers reduce gains.
  • Objective compatibility: NSR targets clipping-based GRPO/PPO-style objectives; transfer to KL-only or value-based RL needs research.
  • Hyperparameters: δ (rescue window) and clip bounds remain tunable; default δ≈0.1 worked in reported settings.
  • Compute and logging: Benefits include stability and faster convergence, but require run-time telemetry (clip fraction, entropy, rescue rate) and standard MLOps practices.
  • Model and domain generalization: Evidence is from Qwen-family models; expect retuning when porting to other architectures or domains.

Glossary

  • Ablation study: An experimental analysis technique that systematically removes or modifies components to assess their impact. "our ablation studies reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay."
  • Advantage (Â_t): A reinforcement-learning signal estimating how much better an action is compared to a baseline. "Â_t is the advantage estimated via a learned value network"
  • Advantage normalization: A stabilization technique that rescales advantages to control update magnitudes. "Removing Advantage Normalization (w/o Norm; A):"
  • Advantage-dependent trust region: A clipping interval for ratios that depends on whether the advantage is positive or negative. "we define the advantage-dependent trust region I(Â_t)"
  • AIME24: A math-reasoning benchmark (American Invitational Mathematics Examination 2024) used for evaluation. "AIME24"
  • AIME25: A math-reasoning benchmark (American Invitational Mathematics Examination 2025) used for evaluation. "AIME25"
  • AMC: A math-reasoning benchmark (American Mathematics Competitions) used for evaluation. "AMC"
  • Asymmetric clipping: Clipping with different lower and upper bounds to bias learning toward helpful directions. "employs asymmetric clipping (1 - ε_low, 1 + ε_high)"
  • Boundary-local: An intervention focused only near the clipping boundary rather than globally across all ratios. "boundary-local, plug-and-play modification"
  • Clamping (torch.clamp): A hard-bounding operation that truncates values to a fixed interval, zeroing gradients outside. "clamping functions (torch.clamp in PyTorch)"
  • Clip fraction: The proportion of updates that are affected by clipping during optimization. "effectively lower the clip fraction"
  • Clipping mask: The binary keep/drop decision indicating whether a token’s gradient is allowed to flow after clipping. "This conflates the clipping mask with the gradient value"
  • Computation graph: The differentiable graph connecting operations whose gradients are tracked during training. "detached from the computation graph"
  • Counterfactual intervention: A diagnostic change applied to test causal hypotheses about training dynamics. "we design two counterfactual interventions on the standard DAPO baseline"
  • Decoupled ratio noise: Perturbations applied only to the execution ratio while keeping the decision ratio clean to study effects. "Decoupled Ratio Noise (Clean Decision; D):"
  • Decision ratio: The ratio used to decide whether an update is in-bound (kept) or out-of-bound (clipped). "the decision ratio r_dec and the execution ratio r_exec"
  • Deterministic verifier: A rule-based checker that deterministically verifies an output (e.g., correctness in math or code). "By leveraging deterministic verifiers in domains such as mathematics and coding"
  • Dynamic sampling constraint: A training rule ensuring each sampled group contains both positive and negative examples for stability. "it enforces a dynamic sampling constraint"
  • Entropy collapse: A failure mode where the policy’s output distribution becomes overly peaked, reducing exploration. "prone to entropy collapse or spikes"
  • Entropy spike: A sudden increase in policy entropy indicating instability or excessive randomness. "avoiding the entropy spike observed in the Qwen3-8B baseline."
  • Execution ratio: The ratio actually used to scale gradients once a token passes the clipping decision. "the decision ratio r_dec and the execution ratio r_exec"
  • GPQA: A general STEM reasoning benchmark used to evaluate scientific question answering. "GPQA"
  • GRPO (Group Relative Policy Optimization): A value-network-free policy optimization method using group-based relative advantages. "Group Relative Policy Optimization (GRPO) (Shao et al., 2024)."
  • GSPO (Group Sequence Policy Optimization): A policy optimization method operating at the sequence level, used as a baseline. "GSPO (Zheng et al., 2025)"
  • Importance ratio: The ratio of current to old policy probabilities used in clipping-based objectives. "where rt(θ) is the importance ratio"
  • Inverse-square law (O(1/r²)): A decay pattern where gradient weight diminishes proportionally to 1/r² beyond the boundary. "inverse-square law (~ O(1/r²))."
  • KL penalty: A regularization term based on Kullback–Leibler divergence that constrains the new policy from deviating too far from the old policy. "removing the explicit KL penalty"
  • Mixture of Experts (MoE): A neural architecture where multiple expert networks are selectively routed to handle different inputs. "both dense and MoE architectures"
  • MMLU-Pro: A robust, challenging multi-task language understanding benchmark. "MMLU-Pro"
  • Near-boundary Stochastic Rescue (NSR): A stochastic mechanism that probabilistically admits slightly out-of-bound tokens to recover informative gradients. "we propose Near-boundary Stochastic Rescue (NSR)"
  • Pass@1: The fraction of questions correctly solved by the model on its first attempt. "Standard DAPO achieves a peak Pass@1 of ~ 37% (step 140)"
  • Policy entropy: A measure of randomness in the policy’s token distribution, often used to monitor exploration and collapse. "Policy entropy under NSR remains more stable"
  • Probabilistic admission process: A randomized keep/drop rule at the boundary that retains near-boundary signals with some probability. "transforming the rigid binary decision into a probabilistic admission process"
  • Proximal Policy Optimization (PPO): A policy gradient algorithm that stabilizes updates via a clipped surrogate objective. "Proximal Policy Optimization (PPO) (Schulman et al., 2017)."
  • Reinforcement Learning with Verifiable Rewards (RLVR): An RL framework for LLMs where binary verifiers provide outcome-based rewards. "Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as the critical engine for scaling the reasoning capabilities of LLMs"
  • Rescue Zone: The region where out-of-bound tokens can be stochastically pulled back into the trust region. "Rescue Zone (Gradients Recovered)"
  • Sequence-level advantage: An advantage computed per full output sequence (e.g., correctness) rather than per token. "The sequence-level advantage for the i-th sample is computed as:"
  • Sequence-level clipping: Applying clipping decisions at the granularity of entire sequences instead of individual tokens. "generalizes beyond token-level clipping (DAPO) to sequence-level clipping (GSPO)"
  • Soft clipping: A smoothing alternative to hard clipping that attenuates rather than zeroes gradients beyond the boundary. "implicit soft-clipping mechanism"
  • Stochastic rescue: A randomized boundary-local mechanism to admit slightly out-of-bound updates and recover learning signals. "NSR employs a stochastic rescue mechanism to retain informative near-boundary signals."
  • Surrogate objective: An alternative objective optimized during training that approximates the true objective while being easier to compute. "clipped surrogate objective"
  • Token-level probability ratio: The importance ratio computed per token comparing new and old policy probabilities. "denotes the token-level probability ratio at step t"
  • Trust region: A bounded region around the old policy within which updates are constrained to prevent instability. "within a trust region around the old policy"
  • Value-network-free: An RL setup that avoids learning a separate value function by using alternative advantage estimators. "PPO and its value-network-free variants, GRPO and DAPO."
  • Zero-shot: Evaluation without task-specific fine-tuning or examples, often using fixed decoding settings. "All evaluations are zero-shot (k = 32, T = 1.0)."
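
Several of the terms above (clipping mask, clip fraction, asymmetric clipping, Rescue Zone, probabilistic admission) can be tied together in a small sketch. The code below is only an illustration under stated assumptions, not the paper's implementation: the function names, the bound values `eps_low` and `eps_high`, the rescue-zone width `zone`, and the linearly decaying admission probability capped at `p_max` are all hypothetical choices made for this example. It also ignores the sign of the advantage, which a real clipped surrogate objective takes into account when deciding whether a token is clipped.

```python
import random


def hard_clip_mask(ratios, eps_low=0.2, eps_high=0.28):
    """Binary keep/drop decision: 1 if the ratio lies inside
    [1 - eps_low, 1 + eps_high], else 0. Bounds are illustrative."""
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    return [1 if lo <= r <= hi else 0 for r in ratios]


def clip_fraction(mask):
    """Proportion of tokens whose gradient is dropped by the mask."""
    return 1.0 - sum(mask) / len(mask)


def stochastic_rescue_mask(ratios, eps_low=0.2, eps_high=0.28,
                           zone=0.1, p_max=0.5, rng=None):
    """Hypothetical probabilistic admission rule.

    In-bound tokens are always kept. Tokens that overshoot a bound by
    less than `zone` are kept with a probability that decays linearly
    from `p_max` at the boundary to 0 at the edge of the zone. Tokens
    further out are always dropped. The decay shape is an assumption
    made for this sketch, not the rule used in the paper.
    """
    if rng is None:
        rng = random.Random(0)
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    mask = []
    for r in ratios:
        if lo <= r <= hi:
            mask.append(1)  # inside the trust region: always kept
            continue
        dist = (lo - r) if r < lo else (r - hi)  # overshoot past the bound
        if dist < zone:
            p_keep = p_max * (1.0 - dist / zone)
            mask.append(1 if rng.random() < p_keep else 0)
        else:
            mask.append(0)  # far out of bounds: always dropped
    return mask


if __name__ == "__main__":
    ratios = [1.0, 1.1, 1.3, 1.5, 0.75, 0.5]
    hard = hard_clip_mask(ratios)
    soft = stochastic_rescue_mask(ratios, rng=random.Random(42))
    print("hard mask:", hard, "clip fraction:", round(clip_fraction(hard), 3))
    print("rescue mask:", soft, "clip fraction:", round(clip_fraction(soft), 3))
```

In this sketch the rescue mask can only keep more tokens than the hard mask, so its clip fraction is never higher. Setting `p_max=0.0` recovers hard clipping exactly.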
