Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Published 8 May 2026 in cs.LG and cs.AI | (2605.07137v1)

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of LLMs. Recent research shows that Negative Sample Reinforcement (NSR) -- which focuses on penalizing incorrect steps rather than simply rewarding correct ones -- can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout the training process and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework. The first is Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions. In the initial training phases, the system focuses heavily on correcting errors to stabilize the model. As training continues, it shifts toward more subtle and controlled updates. The second is Confidence-Weighted Negative Sample Reinforcement (CW-NSR), which operates on the principle that different mistakes carry different levels of importance. CW-NSR assigns specific penalty weights based on the model's normalized sequence likelihood. If the model is highly confident in a wrong path, it receives a larger penalty, whereas uncertain errors -- where the model is effectively exploring -- are penalized less strictly. Our formal analysis shows how these mechanisms govern token-level updates, allowing the model to leverage prior-guided probability redistribution while providing a natural defense against overfitting. We evaluated these methods on difficult reasoning datasets, including MATH, AIME 2025, and AMC23, using the Qwen2.5-Math-1.5B architecture.


Authors (4)

Summary

The paper introduces two novel techniques—Adaptive NSR (A-NSR) and Confidence-Weighted NSR (CW-NSR)—to dynamically adjust reinforcement in LLM training.
It employs time-dependent scheduling and per-sample confidence scaling to improve sample efficiency while preserving output diversity.
Empirical evaluations on datasets like AIME 2025 and AMC23 demonstrate that these adaptive methods outperform static NSR baselines in various regimes.

Adaptive Negative Reinforcement for LLM Reasoning: Dynamic Correction and Diversity in RLVR

Problem Statement and Motivation

LLMs performing complex reasoning, such as mathematical problem solving, exhibit substantial gains under reinforcement learning with verifiable rewards (RLVR) using deterministic, binary feedback. Existing approaches, in particular negative sample reinforcement (NSR), penalize incorrect responses and shift sampling distributions towards correct answers. However, prevailing NSR variants apply static penalties throughout training and treat all incorrect samples equivalently, regardless of the confidence of those outputs or the training phase. This ignores the evolving error landscape as learning progresses and the increased significance of high-confidence errors, potentially limiting both sample efficiency and the diversity of generated reasoning paths.

Proposed Methodology

This work introduces two complementary mechanisms: Adaptive NSR (A-NSR) and Confidence-Weighted NSR (CW-NSR), each independently addressing shortcomings in the NSR paradigm within RLVR frameworks.

Adaptive Negative Sample Reinforcement (A-NSR)

A-NSR incorporates time-dependent scheduling for the positive and negative reinforcement weights, denoted $X(t)$ and $B(t)$. Early training is dominated by strong negative updates that rapidly suppress common errors. As accuracy increases, the schedule anneals the negative weight and shifts towards preserving diversity (i.e., entropy) among correct outputs. Three scheduling strategies are formalized:

Exponential decay for NSR with linear PSR ramp: Ensures aggressive error correction early, with gradual stabilization for output diversity.
Cosine annealing for NSR: Provides smooth, non-abrupt adaptation of penalty magnitude.
Performance-driven adaptation: Dynamically scales negative weighting based on empirical accuracy.
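The three strategies can be sketched as simple scalar schedules. The summary does not give the paper's exact functional forms or hyperparameters, so everything below is an illustrative assumption: the function names, the endpoint weights `w_start` and `w_end`, the decay `rate`, and the choice of a linear interpolation for the performance-driven rule are all hypothetical.

```python
import math


def _progress(step, total_steps):
    """Fraction of training completed, clamped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)


def nsr_exponential_decay(step, total_steps, w_start=1.0, w_end=0.1, rate=5.0):
    """Negative-sample weight: starts at w_start and decays toward w_end.

    Assumed form: w_end + (w_start - w_end) * exp(-rate * progress).
    """
    p = _progress(step, total_steps)
    return w_end + (w_start - w_end) * math.exp(-rate * p)


def psr_linear_ramp(step, total_steps, w_start=0.0, w_end=0.1):
    """Positive-sample weight: ramps linearly from w_start to w_end (assumed)."""
    p = _progress(step, total_steps)
    return w_start + (w_end - w_start) * p


def nsr_cosine_annealing(step, total_steps, w_start=1.0, w_end=0.1):
    """Negative-sample weight: smooth half-cosine from w_start down to w_end."""
    p = _progress(step, total_steps)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * p))


def nsr_performance_driven(accuracy, w_max=1.0, w_min=0.1):
    """Negative-sample weight driven by measured accuracy in [0, 1].

    Assumed rule: low accuracy gives a weight near w_max (strong correction),
    high accuracy gives a weight near w_min (diversity is preserved).
    """
    a = min(max(accuracy, 0.0), 1.0)
    return w_min + (w_max - w_min) * (1.0 - a)
```

In a training loop, the current step (or a running accuracy estimate) would be passed in once per update to obtain the scalar weights applied to the negative and positive loss terms.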

Gradient-level analysis confirms that this approach reweights the contributions of positive and negative sample gradients across training without changing their direction, enabling a natural curriculum on the feedback signal itself.
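As a sketch of what this reweighting means, assume a binary reward $r \in \{0, 1\}$ and a policy $\pi_\theta$ generating response $y$ to prompt $x$. The notation below is reconstructed from the description above and is an assumption, not a formula copied from the paper. A scheduled objective $J_t$ of this kind has a gradient of the form

$$
\nabla_\theta J_t(\theta) = X(t)\,\mathbb{E}\big[\mathbb{1}[r = 1]\,\nabla_\theta \log \pi_\theta(y \mid x)\big] - B(t)\,\mathbb{E}\big[\mathbb{1}[r = 0]\,\nabla_\theta \log \pi_\theta(y \mid x)\big].
$$

Since $X(t)$ and $B(t)$ are nonnegative scalars, each expectation keeps its direction at every step $t$; only the relative magnitude of the two terms changes as training proceeds.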

Confidence-Weighted Negative Sample Reinforcement (CW-NSR)

CW-NSR assigns each sample a hardness score based on its normalized sequence likelihood (the geometric mean of the autoregressive token probabilities). Confidently incorrect generations (systematic errors) incur larger penalties, while uncertain guesses are treated leniently, which softens the penalty when the model is exploring. This is formalized by scaling the NSR gradient magnitude by a function $w(y)$ of the sample's confidence, so that confident errors dominate update priorities. The approach preserves the prior-guided redistribution benefits of NSR while introducing targeted hard example mining, reminiscent of focal loss in computer vision and online hard example mining (OHEM), but without external scoring.
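A minimal sketch of the confidence score and one possible weighting function follows. The geometric-mean definition comes from the description above; the specific form of $w(y)$ is not given in this summary, so the power-law shape, the `gamma` exponent, and the `floor` parameter are hypothetical choices made only for illustration.

```python
import math


def normalized_sequence_likelihood(token_logprobs):
    """Geometric mean of token probabilities, computed as exp(mean log-prob)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weight(token_logprobs, gamma=1.0, floor=0.1):
    """Hypothetical penalty weight w(y) for an incorrect response.

    Maps confidence in (0, 1] to a weight in (floor, 1]: confident errors get
    a weight near 1, uncertain (exploratory) errors get a weight near floor.
    """
    confidence = normalized_sequence_likelihood(token_logprobs)
    return floor + (1.0 - floor) * confidence ** gamma


# Two incorrect responses: one generated confidently, one with high uncertainty.
confident_error = [-0.05, -0.10, -0.02]
uncertain_error = [-2.30, -1.90, -2.70]
w_confident = confidence_weight(confident_error)
w_uncertain = confidence_weight(uncertain_error)
```

Here `w_confident` is larger than `w_uncertain`, so the confident mistake would contribute a proportionally larger negative update. Using the per-token mean rather than the raw sequence probability keeps the score comparable across responses of different lengths.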

Theoretical Analysis

Formal decompositions and proofs (detailed in the supplementary appendices) demonstrate:

Convergence: Under the specified scheduling, the NSR/PSR gradient ratio converges to a controllable limit, ensuring late-stage training behavior mirrors desirable fixed-weight variants.
Entropy regulation: Annealed NSR modulates entropy decrease, directly controlling diversity among outputs.
Gradient scaling properties: CW-NSR provably scales gradients as a function of sample-level confidence without altering the essential redistribution mechanics of NSR, in contrast to unlikelihood training or entropy bonuses.
Variance bounds: CW-NSR reduces wasted updates on low-confidence errors, yielding improved sample efficiency.

Experimental Evaluation

Experiments use Qwen2.5-Math-1.5B and evaluate on the MATH, AIME 2025, and AMC23 datasets using the unbiased Pass@$k$ estimator. Baselines include Weighted-REINFORCE (W-REINFORCE), a fixed-weight variant of NSR previously shown to be competitive with PPO and GRPO. No entropy bonus is used, and hyperparameters are reported in full.
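For reference, the unbiased Pass@$k$ estimator in common use computes, for a problem with $n$ sampled responses of which $c$ are correct, the probability that a random subset of $k$ samples contains at least one correct response: $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below assumes this is the estimator meant here; the function name and argument names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k for one problem.

    n: total responses sampled, c: number of correct responses, k: budget.
    Returns 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect responses: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this quantity over all problems. Sampling $n > k$ responses and applying the formula gives a lower-variance estimate than drawing exactly $k$ samples and checking for a success.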

Strong empirical findings:

A-NSR: Achieves superior Pass@$k$ scores over W-REINFORCE on AIME 2025 (for $k \leq 32$) and outperforms across all $k$ for AMC23. Improvements are most pronounced in low-sample regimes, indicating efficient correction of frequent early-stage errors.
CW-NSR: Delivers consistent gains on high-variance reasoning sets (AIME 2025, AMC23), notably dominating W-REINFORCE for mid-to-high $k$ ranges, confirming that confidence weighting is more effective in ambiguous or complex domains. On the highly-structured MATH, fixed-weight methods retain a marginal advantage at larger $k$.

The results demonstrate a clear, non-trivial performance improvement in reasoning under variable difficulty and confidence regimes, subject to test set and model constraints.

Practical and Theoretical Implications

By explicitly disentangling and adapting both temporal (training stage) and sample-wise (confidence) penalty signals in RLVR, this work offers a framework that tightly controls the trade-off between correction (sample efficiency, accuracy at low $k$) and diversity (solution variety at high $k$).

Practical consequences:

Enables more robust LLM training for domains with wide variance in reasoning depth, e.g., competitive mathematics, program synthesis, or scientific inference.
Provides generic strategies for fine-grained control of RL-based LLM training objectives, immediately applicable to workflows beyond mathematical benchmarks.

Theoretical consequences:

Clarifies the distinct roles of error correction and solution diversity in NSR; establishes that adaptivity in these axes is more effective than uniform or entropy-bonus-based regulation.
Proposes integration paths for these mechanisms with advanced policy optimization algorithms, given their compatibility with token-level clipped objectives.

Limitations and Future Directions

The approach is currently tailored for sparse, verifiable reward settings. Training stability under long horizons requires further study, particularly as adaptivity in penalty schedules may interact with latent optimization instabilities. Extension to dense/continuous reward settings or token-level confidence estimation promises more granular corrective feedback and better localization of systematic errors. Applying the approach to other domains, such as multi-step program repair, scientific discovery, or high-variance RL, is a natural next step.

Conclusion

Adaptive and confidence-weighted negative reinforcement mechanisms systematically improve LLM reasoning under RLVR by dynamically targeting both the nature of the response and the training trajectory. These methods offer effective correction-diversity balancing, consistent empirical advantages over static baselines, and broaden the RLVR methodological toolkit by treating the reward feedback as a learnable curriculum, not a static heuristic. Their integration into the RL training stack has strong implications for advanced LLM alignment, mathematical reasoning, and beyond.

Reference: "Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR" (2605.07137)
