Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (2507.15778v1)
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of LLMs, mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
Summary
- The paper introduces Archer, an entropy-aware RLVR framework that applies dual-token constraints to stabilize knowledge while enhancing reasoning.
- It leverages token-level entropy to assign differentiated KL regularization and clipping thresholds, achieving state-of-the-art results among comparably sized models on math and code benchmarks.
- Empirical results demonstrate efficiency gains, with fewer GPU hours than multi-stage baselines, improved cross-domain generalization, and the advantage of synchronous updates over gradient masking or asynchronous schemes.
Entropy-Aware Dual-Token Constraints for RLVR: A Technical Overview
This paper introduces Archer, an entropy-aware reinforcement learning with verifiable rewards (RLVR) framework that applies dual-token constraints to LLM post-training. The method is motivated by the observation that RLVR primarily enhances reasoning by reorganizing existing model capabilities (e.g., reflection, planning) rather than altering factual knowledge. Archer leverages token-level entropy to distinguish between knowledge-oriented (low-entropy) and reasoning-oriented (high-entropy) tokens, applying differentiated regularization and clipping strategies to each. The approach is evaluated on mathematical reasoning and code generation benchmarks, demonstrating consistent improvements over prior RLVR methods and achieving state-of-the-art (SOTA) results for 1.5B-parameter models.
Motivation and Theoretical Basis
Recent analyses of RLVR training dynamics reveal that high-entropy tokens—typically logical connectors or decision points—are the primary locus of RL-driven improvements, while low-entropy tokens encode factual or domain knowledge and should remain stable. Prior works attempted to exploit this by masking gradients or asynchronously updating token types, but these approaches disrupt the sequential and semantic dependencies inherent in LLM outputs, leading to suboptimal learning and degraded performance on reasoning tasks.
Archer addresses these limitations by synchronously updating all tokens but applying dual constraints:
- Reasoning tokens (high-entropy): Weaker KL regularization and higher clipping thresholds to encourage exploration and flexible adaptation.
- Knowledge tokens (low-entropy): Stronger KL regularization and tighter clipping to preserve factual accuracy and prevent catastrophic forgetting.
Token classification is performed at the response level using entropy quantiles, mitigating the misclassification issues observed with batch-level statistics due to prompt- and response-level entropy variation.
Methodological Details
Token Classification
For each generated response, Archer computes the entropy of each token and determines a threshold (e.g., 80th percentile) to separate high- and low-entropy tokens. This is performed independently per response, ensuring robust token type assignment regardless of prompt difficulty or response diversity.
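For concreteness, a minimal PyTorch sketch of this per-response classification step is shown below; it assumes access to the policy's logits for a sampled response, and the function name and signature are illustrative rather than taken from the ArcherCodeR repository.

```python
import torch

def classify_tokens_by_entropy(logits: torch.Tensor, rho: float = 0.8) -> torch.Tensor:
    """Split one response's tokens into reasoning vs. knowledge by entropy quantile.

    logits: [seq_len, vocab_size] policy logits for a single generated response.
    rho:    entropy quantile (the paper reports rho = 0.8); tokens at or above
            the quantile are treated as high-entropy (reasoning) tokens.
    Returns a boolean mask of shape [seq_len], True for reasoning tokens.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the sampling distribution at each position.
    entropy = -(probs * log_probs).sum(dim=-1)          # [seq_len]
    # The threshold is computed per response (not per batch), so prompt
    # difficulty and response diversity do not skew the split.
    threshold = torch.quantile(entropy, rho)
    return entropy >= threshold
```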
Dual-Token Constraints
The RL objective extends the Group Relative Policy Optimization (GRPO) and DAPO frameworks by introducing token-type-dependent clipping and KL penalties:
- Clipping:
- εr (reasoning): Larger, allowing greater policy deviation for high-entropy tokens.
- εk (knowledge): Smaller, restricting updates for low-entropy tokens.
- KL Penalty:
- βr (reasoning): Lower or zero, reducing regularization on reasoning tokens.
- βk (knowledge): Higher, enforcing proximity to the base model for knowledge tokens.
The loss for each token is thus:
L_t = min( r_t · A_t , clip(r_t, 1 − ε(e_t), 1 + ε(e_t)) · A_t ) − β(e_t) · KL(π_θ ‖ π_ref)

where e_t denotes the entropy class of token t: ε(e_t) = εr and β(e_t) = βr for high-entropy (reasoning) tokens, and ε(e_t) = εk and β(e_t) = βk for low-entropy (knowledge) tokens; r_t is the importance ratio and A_t the advantage.
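A minimal PyTorch sketch of this dual-constraint objective follows; it assumes per-token log-probabilities from the current, rollout, and reference policies plus a precomputed reasoning-token mask, and the function name, the naive per-token KL estimate, and the sign convention are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def archer_token_loss(
    logp_new: torch.Tensor,        # [T] log-probs under the current policy
    logp_old: torch.Tensor,        # [T] log-probs under the rollout (behavior) policy
    logp_ref: torch.Tensor,        # [T] log-probs under the frozen reference model
    advantages: torch.Tensor,      # [T] group-relative (GRPO-style) advantages
    reasoning_mask: torch.Tensor,  # [T] bool, True = high-entropy (reasoning) token
    eps_r: float = 0.5, eps_k: float = 0.2,
    beta_r: float = 0.0, beta_k: float = 0.001,
) -> torch.Tensor:
    """Clipped surrogate with token-type-dependent clipping and KL weights."""
    ratio = (logp_new - logp_old).exp()

    # Wider clip range and weaker KL pull for reasoning tokens,
    # tighter clip range and stronger KL pull for knowledge tokens.
    eps = torch.full_like(ratio, eps_k)
    eps[reasoning_mask] = eps_r
    beta = torch.full_like(ratio, beta_k)
    beta[reasoning_mask] = beta_r

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.minimum(unclipped, clipped)

    # Naive per-token KL estimate toward the reference policy (an assumption;
    # implementations often use other estimators).
    kl = logp_new - logp_ref

    # Negated because the surrogate is maximized; every token contributes.
    return -(surrogate - beta * kl).mean()
```

Because the mask only modulates ε(e_t) and β(e_t), every token still receives a gradient in the same update, which is the synchronous-update property discussed next.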
Synchronous Updates
Unlike gradient masking or asynchronous updates, Archer performs joint optimization over all tokens, preserving the sequential and contextual dependencies critical for effective reasoning.
Implementation
- Base Model: DeepSeek-R1-Distill-Qwen-1.5B, distilled and SFT-finetuned on high-quality reasoning data.
- Training: 16 rollouts per prompt, batch size 64, learning rate 1×10⁻⁶, temperature 1.0, max response length 32,768 tokens.
- Hardware: 2 nodes × 8 NVIDIA H800 80GB GPUs.
- Hyperparameters: ρ=0.8 (entropy quantile), εr=0.5, εk=0.2, βr=0.0, βk=0.001.
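For reference, the reported settings can be gathered into a single illustrative configuration; the dictionary keys below are hypothetical and are not claimed to match the ArcherCodeR repository's actual config schema.

```python
# Illustrative summary of the reported setup; key names are hypothetical.
archer_training_config = {
    "base_model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "rollouts_per_prompt": 16,
    "batch_size": 64,
    "learning_rate": 1e-6,
    "sampling_temperature": 1.0,
    "max_response_length": 32768,
    "entropy_quantile_rho": 0.8,
    "clip_eps_reasoning": 0.5,   # εr
    "clip_eps_knowledge": 0.2,   # εk
    "kl_beta_reasoning": 0.0,    # βr
    "kl_beta_knowledge": 0.001,  # βk
}
```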
Empirical Results
Mathematical Reasoning
On AIME24, AIME25, AMC23, MATH-500, Minerva, and OlympiadBench, Archer achieves the highest average accuracy among all 1.5B models, with notable gains over DAPO and other SOTA baselines. For example, on AIME24, Archer improves avg@64 by +6.6 points over DAPO.
Code Generation
On LiveCodeBench v5 and v6, Archer outperforms DeepCoder-1.5B, Nemotron-1.5B, and DAPO, with avg@8 and avg@16 improvements of 2–3 points.
Efficiency
Archer achieves these results with single-stage training and significantly fewer GPU hours compared to multi-stage SOTA baselines (e.g., 1,900 H800 GPU hours for math RL vs. 16,000 H100 GPU hours for Nemotron-1.5B).
Ablation Studies
- KL Weight: Both insufficient and excessive KL regularization on low-entropy tokens degrade performance; a moderate value (0.001) is optimal for stability and learning.
- Clip Thresholds: Tighter clipping on low-entropy tokens preserves knowledge but slows learning; looser clipping on high-entropy tokens accelerates reasoning improvements but can risk overfitting if excessive.
- Token Participation: Excluding low-entropy tokens from updates (as in gradient masking) impairs learning due to broken dependencies; synchronous, differentiated updates are empirically superior.
Cross-Domain Generalization
RL training on math or code tasks yields improvements on out-of-domain benchmarks, with gains correlating to problem difficulty rather than topical overlap. This supports the hypothesis that RLVR primarily enhances reasoning organization and integration, not factual knowledge.
Implications and Future Directions
Archer demonstrates that fine-grained, entropy-aware token-level optimization can simultaneously preserve factual knowledge and promote reasoning in LLMs. The approach is computationally efficient and robust, outperforming prior methods without complex multi-stage training. The findings suggest several avenues for future research:
- Generalization to Larger Models: Scaling the dual-token constraint framework to larger LLMs and diverse domains.
- Adaptive Constraint Scheduling: Dynamically adjusting KL and clipping parameters during training based on learning progress or token context.
- Integration with Process-Based RL: Combining Archer with stepwise or process-based reward modeling for even finer control over reasoning behaviors.
- Theoretical Analysis: Formalizing the relationship between token entropy, learning dynamics, and generalization in RLVR.
Conclusion
The Archer framework provides a principled, empirically validated method for stabilizing knowledge and promoting reasoning in RLVR post-training of LLMs. By synchronously applying entropy-aware dual-token constraints, it achieves SOTA results on challenging reasoning and code generation tasks, with strong efficiency and generalization properties. This work advances the understanding of token-level optimization in RL for LLMs and sets a foundation for future research on fine-grained, structure-aware RL training strategies.
Follow-up Questions
- How does the dual-token constraint mechanism distinguish between high-entropy and low-entropy tokens?
- What specific improvements in mathematical reasoning and code generation benchmarks does Archer deliver compared to previous methods?
- How do the differential clipping thresholds and KL regularization parameters affect the learning dynamics in Archer?
- What challenges were encountered with asynchronous updating methods that Archer overcomes with synchronous updates?
Related Papers
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example (2025)
- Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs (2025)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (2025)
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (2025)
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (2025)
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy (2025)
- Reasoning with Exploration: An Entropy Perspective (2025)
- First Return, Entropy-Eliciting Explore (2025)
- Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning (2025)