- The paper introduces two novel techniques—Adaptive NSR (A-NSR) and Confidence-Weighted NSR (CW-NSR)—to dynamically adjust reinforcement in LLM training.
- It employs time-dependent scheduling and per-sample confidence scaling to improve sample efficiency while preserving output diversity.
- Empirical evaluations on datasets like AIME 2025 and AMC23 demonstrate that these adaptive methods outperform static NSR baselines in various regimes.
Adaptive Negative Reinforcement for LLM Reasoning: Dynamic Correction and Diversity in RLVR
Problem Statement and Motivation
LLMs performing complex reasoning, such as mathematical problem solving, exhibit substantial gains under reinforcement learning with verifiable rewards (RLVR) using deterministic, binary feedback. Existing approaches, in particular negative sample reinforcement (NSR), penalize incorrect responses and shift sampling distributions towards correct answers. However, prevailing NSR variants apply static penalties throughout training and treat all incorrect samples equivalently, regardless of the confidence of those outputs or the training phase. This ignores the evolving error landscape as learning progresses and the increased significance of high-confidence errors, potentially limiting both sample efficiency and the diversity of generated reasoning paths.
Proposed Methodology
This work introduces two complementary mechanisms: Adaptive NSR (A-NSR) and Confidence-Weighted NSR (CW-NSR), each independently addressing shortcomings in the NSR paradigm within RLVR frameworks.
Adaptive Negative Sample Reinforcement (A-NSR)
A-NSR incorporates time-dependent scheduling for the positive and negative reinforcement weights, denoted X(t) and B(t). Early training is dominated by strong negative updates that rapidly suppress common errors. As accuracy increases, the schedule anneals the negative weight and shifts toward preserving diversity (i.e., entropy) among correct outputs. Three scheduling strategies are formalized:
- Exponential decay for NSR with linear PSR ramp: Ensures aggressive error correction early, with gradual stabilization for output diversity.
- Cosine annealing for NSR: Provides smooth, non-abrupt adaptation of penalty magnitude.
- Performance-driven adaptation: Dynamically scales negative weighting based on empirical accuracy.
Gradient-level analysis confirms that this approach reweights—without changing the direction—the contributions of positive and negative sample gradients across training, enabling a natural curriculum on the feedback signal itself.
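The three scheduling strategies listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the start and end weights, the decay rate, and the linear mapping from accuracy to weight are all assumptions chosen for readability.

```python
import math


def _progress(step, total_steps):
    """Fraction of training completed, clamped to [0, 1]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def exp_decay_nsr_linear_psr(step, total_steps, nsr_start=1.0, nsr_end=0.1,
                             psr_start=0.0, psr_end=1.0, decay_rate=5.0):
    """Exponentially decay the NSR weight while linearly ramping the PSR weight.

    All default values are hypothetical. Returns (psr_weight, nsr_weight).
    """
    frac = _progress(step, total_steps)
    nsr = nsr_end + (nsr_start - nsr_end) * math.exp(-decay_rate * frac)
    psr = psr_start + (psr_end - psr_start) * frac
    return psr, nsr


def cosine_annealed_nsr(step, total_steps, nsr_start=1.0, nsr_end=0.1):
    """Smoothly anneal the NSR weight from nsr_start to nsr_end (half cosine)."""
    frac = _progress(step, total_steps)
    return nsr_end + 0.5 * (nsr_start - nsr_end) * (1.0 + math.cos(math.pi * frac))


def performance_driven_nsr(accuracy, nsr_max=1.0, nsr_min=0.1):
    """Scale the NSR weight down as measured accuracy rises.

    The linear mapping is an assumption; any decreasing function would fit
    the description of performance-driven adaptation.
    """
    acc = min(max(accuracy, 0.0), 1.0)
    return nsr_min + (nsr_max - nsr_min) * (1.0 - acc)
```

In each case the schedule only produces scalar weights, so it changes the magnitude of the positive and negative gradient contributions and leaves their directions alone, which is the property the gradient-level analysis relies on.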
Confidence-Weighted Negative Sample Reinforcement (CW-NSR)
CW-NSR assigns each incorrect sample a hardness score derived from its normalized sequence likelihood (the geometric mean of its autoregressive token probabilities). Confidently incorrect generations, which indicate systematic errors, incur larger penalties, while uncertain guesses are penalized lightly so that exploration is not discouraged. Formally, the NSR gradient magnitude is scaled by a function w(y) of the sample's confidence, so that confident errors dominate the update. The approach preserves the prior-guided redistribution benefits of NSR while adding targeted hard example mining, reminiscent of focal loss in computer vision and online hard example mining (OHEM), but without an external scorer.
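A minimal sketch of the confidence computation follows, assuming per-token log-probabilities are already available for each sampled response. The power-law form of w(y), the exponent `gamma`, and the helper names are hypothetical; the summary only states that w(y) is a function of confidence under which confident errors receive larger penalties.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric mean of the token probabilities, i.e. exp(mean log-prob)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weight(confidence, gamma=1.0):
    """Hypothetical w(y): grows monotonically with confidence in [0, 1]."""
    return confidence ** gamma


def cw_nsr_terms(samples, gamma=1.0):
    """Return (weight, weighted_loss) for every incorrect sample.

    `samples` is a list of (token_logprobs, is_correct) pairs. In an autodiff
    framework the weight would be treated as a constant (no gradient through
    it), so that it rescales the NSR gradient without changing its direction.
    """
    terms = []
    for token_logprobs, is_correct in samples:
        if is_correct:
            continue  # NSR only penalizes incorrect samples
        w = confidence_weight(sequence_confidence(token_logprobs), gamma)
        # Minimizing w * log pi(y|x) pushes probability mass away from y.
        terms.append((w, w * sum(token_logprobs)))
    return terms
```

For example, an incorrect response whose tokens each have probability 0.9 receives a weight of 0.81 with `gamma=2.0`, while one whose tokens each have probability 0.2 receives 0.04, so the confident error contributes far more to the update.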
Theoretical Analysis
Formal decompositions and proofs (detailed in the supplementary appendices) demonstrate:
- Convergence: Under the specified scheduling, the NSR/PSR gradient ratio converges to a controllable limit, ensuring late-stage training behavior mirrors desirable fixed-weight variants.
- Entropy regulation: Annealed NSR modulates entropy decrease, directly controlling diversity among outputs.
- Gradient scaling properties: CW-NSR provably scales gradients as a function of sample-level confidence without altering the essential redistribution mechanics of NSR, in contrast to unlikelihood training or entropy bonuses.
- Variance bounds: CW-NSR reduces wasted updates on low-confidence errors, yielding improved sample efficiency.
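To make the quantities in these results concrete, one plausible way to write the combined objective is sketched below. It assumes a W-REINFORCE-style loss with binary rewards r(x, y) in {0, 1}, samples treated as fixed when differentiating, and w(y) treated as a constant; the exact formulation in the paper may differ.

```latex
\mathcal{L}(\theta; t)
  = -\,X(t)\,\mathbb{E}_{x,\,y}\!\left[\mathbb{1}[r(x,y)=1]\,\log \pi_\theta(y \mid x)\right]
    + B(t)\,\mathbb{E}_{x,\,y}\!\left[\mathbb{1}[r(x,y)=0]\;w(y)\,\log \pi_\theta(y \mid x)\right]

\nabla_\theta \mathcal{L}(\theta; t) = -\,X(t)\,g_P(\theta) + B(t)\,g_N(\theta),
\qquad
g_P = \mathbb{E}\!\left[\mathbb{1}[r=1]\,\nabla_\theta \log \pi_\theta(y \mid x)\right],
\quad
g_N = \mathbb{E}\!\left[\mathbb{1}[r=0]\;w(y)\,\nabla_\theta \log \pi_\theta(y \mid x)\right]
```

Under this reading, setting w(y) = 1 recovers A-NSR alone, constant X and B recover a fixed-weight variant, and the convergence claim concerns the limit of the ratio B(t)/X(t) as training proceeds.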
Experimental Evaluation
Experiments use Qwen2.5-Math-1.5B with the MATH, AIME 2025, and AMC23 datasets, and results are reported with the unbiased Pass@k estimator. The baseline is Weighted-REINFORCE (W-REINFORCE), a fixed-weight variant of NSR previously shown to be competitive with PPO and GRPO. No entropy bonus is used, and hyperparameters are exhaustively reported.
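The summary does not spell out the estimator, so the sketch below assumes the commonly used combinatorial form: with n samples per problem of which c are correct, Pass@k is estimated as 1 - C(n-c, k) / C(n, k). The function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem.

    n: number of sampled responses, c: number of correct ones,
    k: sample budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of this value over all problems. Low k emphasizes accuracy of individual samples, while high k rewards a model that keeps enough diversity to reach a correct answer within the budget.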
Key empirical findings:
- A-NSR: Achieves superior Pass@k scores over W-REINFORCE on AIME 2025 (for k≤32) and outperforms across all k for AMC23. Improvements are most pronounced in low-sample regimes, indicating efficient correction of frequent early-stage errors.
- CW-NSR: Delivers consistent gains on high-variance reasoning sets (AIME 2025, AMC23), notably dominating W-REINFORCE for mid-to-high k ranges, confirming that confidence weighting is more effective in ambiguous or complex domains. On the highly-structured MATH, fixed-weight methods retain a marginal advantage at larger k.
Overall, the adaptive variants improve on W-REINFORCE on the harder benchmarks (AIME 2025 and AMC23), while the gains do not carry over uniformly to MATH at larger k. These conclusions are limited to the single model and the test sets evaluated.
Practical and Theoretical Implications
By explicitly disentangling and adapting both temporal (training stage) and sample-wise (confidence) penalty signals in RLVR, this work offers a framework that tightly controls the trade-off between correction (sample efficiency, accuracy at low k) and diversity (solution variety at high k).
Practical consequences:
- Enables more robust LLM training for domains with wide variance in reasoning depth, e.g., competitive mathematics, program synthesis, or scientific inference.
- Provides generic strategies for fine-grained control of RL-based LLM training objectives, immediately applicable to workflows beyond mathematical benchmarks.
Theoretical consequences:
- Clarifies the distinct roles of error correction and solution diversity in NSR; establishes that adaptivity in these axes is more effective than uniform or entropy-bonus-based regulation.
- Proposes integration paths for these mechanisms with advanced policy optimization algorithms, given their compatibility with token-level clipped objectives.
Limitations and Future Directions
The approach is currently tailored for sparse, verifiable reward settings. Training stability under long horizons requires further study, particularly as adaptivity in penalty schedules may interact with latent optimization instabilities. Extension to dense/continuous reward settings or token-level confidence estimation promises more granular corrective feedback and better localization of systematic errors. Broader applicability to other domains—such as multi-step program repair, scientific discovery, or high-variance RL—is a logical expansion trajectory.
Conclusion
Adaptive and confidence-weighted negative reinforcement mechanisms systematically improve LLM reasoning under RLVR by dynamically targeting both the nature of the response and the training trajectory. These methods offer effective correction-diversity balancing, consistent empirical advantages over static baselines, and broaden the RLVR methodological toolkit by treating the reward feedback as a learnable curriculum, not a static heuristic. Their integration into the RL training stack has strong implications for advanced LLM alignment, mathematical reasoning, and beyond.
Reference: "Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR" (2605.07137)