Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (2507.15778v1)
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of LLMs, mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
Summary
- The paper proposes an entropy-aware RLVR framework (Archer) that uses differentiated training constraints to enhance reasoning while preserving factual knowledge.
- It classifies tokens using entropy metrics and applies looser clipping for reasoning tokens and stricter KL regularization for knowledge tokens.
- Empirical results on math and code benchmarks demonstrate significant improvements over prior methods with reduced computational resources.
Entropy-Aware Dual-Token Constraints for RLVR: A Technical Overview
This paper introduces Archer, an entropy-aware reinforcement learning with verifiable rewards (RLVR) framework for LLMs, which applies differentiated training constraints to knowledge- and reasoning-related tokens. The approach is motivated by the observation that RLVR primarily enhances reasoning by reorganizing existing model capabilities, rather than altering factual knowledge. Archer synchronously updates all tokens but applies weaker KL regularization and higher clipping thresholds to high-entropy (reasoning) tokens, while imposing stronger constraints on low-entropy (knowledge) tokens. The method is evaluated on mathematical reasoning and code generation benchmarks, demonstrating significant improvements over prior RLVR methods, including DAPO and other state-of-the-art (SOTA) 1.5B-parameter models.
Motivation and Theoretical Basis
Recent analyses of RLVR training dynamics reveal that high-entropy tokens—those with greater uncertainty in the model's output distribution—are typically associated with logical connectors and reasoning steps, while low-entropy tokens encode factual or domain knowledge. Prior RLVR methods, such as GRPO and DAPO, apply uniform optimization across all tokens, which fails to account for these functional distinctions. Some recent works attempt to separate token types via gradient masking or asynchronous updates, but these approaches disrupt the semantic dependencies inherent in sequential token generation, leading to suboptimal learning.
Archer addresses these limitations by:
- Using response-level entropy statistics to classify tokens as knowledge- or reasoning-related within each generated response, rather than relying on batch-level statistics that can misclassify tokens due to prompt-level entropy variation.
- Applying differentiated constraints during synchronous updates, preserving the structural dependencies among tokens.
Methodology
Token Classification
For each generated response, Archer computes the entropy of each token and determines a quantile-based threshold (e.g., 80th percentile) to distinguish high-entropy (reasoning) from low-entropy (knowledge) tokens. This response-level approach ensures that token classification is robust to prompt-specific entropy distributions.
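The paper describes this step in terms of per-token entropies and a within-response quantile threshold; the exact implementation is not reproduced here. A minimal NumPy sketch under that description (the function name, array shapes, and toy values are illustrative, not taken from the released ArcherCodeR code):

```python
import numpy as np

def classify_tokens(token_entropies: np.ndarray, quantile: float = 0.8) -> np.ndarray:
    """Split one response's tokens into reasoning (True) and knowledge (False) types.

    `token_entropies` holds the per-token entropies H_t = -sum_v p(v) log p(v)
    of a single generated response, computed from the rollout policy's output
    distribution. The threshold is taken within this response rather than
    across the batch, so a uniformly easy or hard prompt cannot skew the split.
    """
    threshold = np.quantile(token_entropies, quantile)  # e.g., the 80th percentile
    return token_entropies >= threshold                 # top ~20% -> reasoning tokens

# Toy usage: the two most uncertain tokens are flagged as reasoning tokens.
entropies = np.array([0.10, 0.05, 2.30, 0.20, 1.90, 0.15, 0.30, 0.08, 2.50, 0.40])
print(classify_tokens(entropies))
```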
Dual-Token Constraints
The core of Archer's method is the application of two distinct sets of constraints during RL optimization:
- Clipping Constraint: The policy update magnitude is controlled via token-type-specific clipping thresholds. High-entropy tokens receive a looser clip range (e.g., εr = 0.5), encouraging exploration and adaptation in reasoning regions. Low-entropy tokens are subject to a stricter clip (e.g., εk = 0.2), preserving the base model's factual knowledge.
- KL Regularization: The KL divergence penalty is also token-type-specific. High-entropy tokens have a lower or zero KL weight (βr = 0.0), while low-entropy tokens have a higher KL weight (βk = 0.001), further stabilizing knowledge retention.
The overall RL objective is a token-level sum of clipped policy-gradient terms and KL penalties, with the clip range and KL weight chosen per token according to its entropy class.
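The paper states this objective verbally rather than in a form reproduced here, so the following PyTorch sketch should be read as one plausible realization: a GRPO/DAPO-style clipped surrogate in which the clip range and KL weight are picked per token from the example values above (εr = 0.5, εk = 0.2, βr = 0.0, βk = 0.001). The function name, tensor layout, and the k3-style KL estimator are assumptions, not details confirmed by the paper.

```python
import torch

def dual_constraint_loss(logp_new, logp_old, logp_ref, advantages, is_reasoning,
                         eps_r=0.5, eps_k=0.2, beta_r=0.0, beta_k=0.001):
    """Token-level clipped surrogate with entropy-class-specific clip range and KL weight.

    All arguments are 1-D tensors over the tokens of the sampled responses;
    `is_reasoning` is the boolean mask from the response-level entropy split.
    """
    eps = torch.where(is_reasoning, torch.full_like(advantages, eps_r),
                      torch.full_like(advantages, eps_k))
    beta = torch.where(is_reasoning, torch.full_like(advantages, beta_r),
                       torch.full_like(advantages, beta_k))

    ratio = torch.exp(logp_new - logp_old)                        # importance ratio
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)

    # k3 estimator of KL(pi_theta || pi_ref); its weight is ~0 on reasoning tokens,
    # so the penalty mainly anchors knowledge tokens to the reference policy.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return -(surrogate - beta * kl).mean()                        # loss to minimize
```

Because every token stays in the same loss, gradients still flow through all positions in the sequence; only the strength of the constraint differs per token, which is what distinguishes this synchronous scheme from gradient masking or asynchronous updates.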
Implementation Details
- Base Model: DeepSeek-R1-Distill-Qwen-1.5B, a Qwen-1.5B model distilled from DeepSeek-R1 via SFT on high-quality reasoning data.
- Training: 16 rollouts per prompt, batch size 64, learning rate 1e-6, maximum response length 32,768 tokens.
- Hardware: 2 nodes × 8 NVIDIA H800 80GB GPUs.
- Benchmarks: AIME24/25, AMC23, MATH-500, Minerva Math, OlympiadBench (math); LiveCodeBench v5/v6 (code).
- Evaluation: Pass@K and avg@K metrics, with K adapted to benchmark size.
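The exact estimators are not reproduced in this summary; a common convention, sketched below under that assumption, is the unbiased combinatorial Pass@K estimator together with avg@K as the mean accuracy over the K sampled generations.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K: chance that at least one of k samples drawn without
    replacement from n generations (c of them correct) solves the problem."""
    if n - c < k:                       # too few incorrect samples to fill k slots
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(num_correct: int, k: int) -> float:
    """avg@K: mean accuracy over the k sampled generations."""
    return num_correct / k

# Example: 16 rollouts on one problem, 6 of them correct.
print(f"pass@4 = {pass_at_k(16, 6, 4):.3f}, avg@16 = {avg_at_k(6, 16):.3f}")
```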
Empirical Results
Archer achieves consistent and substantial improvements over both the base model and DAPO across all benchmarks. Notable results include:
- Mathematical Reasoning: +6.6 Pass@1 over DAPO on AIME24 and +5.2 on AIME25.
- Code Generation: +3.4 over DAPO on LiveCodeBench v5 and +2.6 on v6, outperforming DeepCoder-1.5B and Nemotron-1.5B on both.
- Efficiency: Achieves SOTA results with single-stage training and significantly fewer GPU hours compared to multi-stage SOTA baselines (e.g., 1,900 H800 GPU hours for math RL vs. 16,000 H100 GPU hours for Nemotron-1.5B).
Ablation studies demonstrate that:
- Removing or excessively increasing the KL penalty on low-entropy tokens leads to model collapse or stagnation.
- Adjusting the clip threshold for low-entropy tokens has a strong effect on both training stability and final performance, while the model is less sensitive to the high-entropy token clip threshold.
Analysis and Implications
Token-Level Optimization
The results confirm that RLVR's primary effect is to enhance the integration and organization of existing model capabilities, particularly in reasoning steps, rather than to modify factual knowledge. Archer's entropy-aware, token-level constraints enable more effective exploration in reasoning regions while maintaining knowledge stability.
Synchronous vs. Asynchronous Updates
The paper provides empirical evidence that synchronous updates with differentiated constraints outperform masking or asynchronous updates, as the latter disrupt the sequential dependencies critical for effective learning in autoregressive models.
Cross-Domain Generalization
RL training in one domain (math or code) yields improvements in the other, primarily on problems where the base model already has moderate accuracy. This suggests that RLVR enhances general reasoning organization and attention to detail, rather than introducing new domain knowledge.
Practical Considerations
- Resource Requirements: Archer is efficient, requiring fewer training steps and GPU hours than comparable SOTA models.
- Deployment: The method is compatible with standard RLHF/RLVR pipelines and can be implemented in existing frameworks (e.g., verl).
- Hyperparameter Sensitivity: Careful tuning of KL weights and clip thresholds is necessary to balance exploration and stability.
- Scalability: The approach is applicable to larger models and other domains, provided that entropy-based token classification remains meaningful.
Future Directions
- Fine-Grained Token Typing: Beyond binary entropy-based classification, more nuanced token-type taxonomies could further improve optimization.
- Adaptive Constraints: Dynamic adjustment of KL and clip parameters during training may yield additional gains.
- Process-Level Reward Modeling: Integration with process-based RL or stepwise reward models could further enhance reasoning capabilities.
- Structural Dependency Modeling: Explicit modeling of token and sentence dependencies may improve the coordination of learning signals.
Conclusion
Archer demonstrates that entropy-aware, dual-token constraints in RLVR provide a principled and effective means to balance knowledge preservation and reasoning enhancement in LLMs. The approach achieves SOTA results on challenging reasoning and code generation tasks with improved training efficiency. The findings underscore the importance of token-level optimization strategies that respect the functional heterogeneity and sequential dependencies of LLM outputs, and point toward further research in fine-grained, structure-aware RL for LLMs.
Follow-up Questions
- What are the main advantages of dual-token constraints in the Archer framework?
- How does token entropy influence the clipping and regularization strategies in RLVR?
- In what ways do differentiated constraints improve the model's reasoning capability?
- How do the performance gains of Archer compare to conventional RLVR methods like DAPO?
Related Papers
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example (2025)
- Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs (2025)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (2025)
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (2025)
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (2025)
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy (2025)
- Reasoning with Exploration: An Entropy Perspective (2025)
- First Return, Entropy-Eliciting Explore (2025)
- Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning (2025)