Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (2507.15778v1)

Published 21 Jul 2025 in cs.CL

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of LLMs, mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.

Summary

  • The paper introduces Archer, an entropy-aware RLVR framework that applies dual-token constraints to stabilize knowledge while enhancing reasoning.
  • It leverages token-level entropy to assign differentiated KL regularization and clipping, achieving state-of-the-art results on math and code benchmarks.
  • Empirical results demonstrate efficiency gains with fewer GPU hours and improved cross-domain generalization through synchronous token updates.

Entropy-Aware Dual-Token Constraints for RLVR: A Technical Overview

This paper introduces Archer, an entropy-aware reinforcement learning with verifiable rewards (RLVR) framework that applies dual-token constraints to LLM post-training. The method is motivated by the observation that RLVR primarily enhances reasoning by reorganizing existing model capabilities (e.g., reflection, planning) rather than altering factual knowledge. Archer leverages token-level entropy to distinguish between knowledge-oriented (low-entropy) and reasoning-oriented (high-entropy) tokens, applying differentiated regularization and clipping strategies to each. The approach is evaluated on mathematical reasoning and code generation benchmarks, demonstrating consistent improvements over prior RLVR methods and achieving state-of-the-art (SOTA) results for 1.5B-parameter models.

Motivation and Theoretical Basis

Recent analyses of RLVR training dynamics reveal that high-entropy tokens—typically logical connectors or decision points—are the primary locus of RL-driven improvements, while low-entropy tokens encode factual or domain knowledge and should remain stable. Prior works attempted to exploit this by masking gradients or asynchronously updating token types, but these approaches disrupt the sequential and semantic dependencies inherent in LLM outputs, leading to suboptimal learning and degraded performance on reasoning tasks.

Archer addresses these limitations by synchronously updating all tokens but applying dual constraints:

  • Reasoning tokens (high-entropy): Weaker KL regularization and higher clipping thresholds to encourage exploration and flexible adaptation.
  • Knowledge tokens (low-entropy): Stronger KL regularization and tighter clipping to preserve factual accuracy and prevent catastrophic forgetting.

Token classification is performed at the response level using entropy quantiles, which mitigates the misclassification that batch-level statistics suffer from, since entropy varies substantially across prompts and responses.

Methodological Details

Token Classification

For each generated response, Archer computes the entropy of each token and determines a threshold (e.g., 80th percentile) to separate high- and low-entropy tokens. This is performed independently per response, ensuring robust token type assignment regardless of prompt difficulty or response diversity.
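As a concrete illustration, below is a minimal sketch (not the authors' released code; the function name and toy values are assumptions) that computes a per-response entropy threshold at the ρ = 0.8 quantile reported in the paper and flags the tokens above it as reasoning tokens.

```python
# Minimal sketch (not the released Archer implementation) of response-level
# token classification: each sampled response gets its own entropy threshold,
# and tokens above it are treated as reasoning (high-entropy) tokens.
import torch

def reasoning_token_mask(token_entropy: torch.Tensor, rho: float = 0.8) -> torch.Tensor:
    """token_entropy: 1-D tensor of per-token entropies for ONE response.
    rho: entropy quantile used as the split point (0.8 -> top ~20% of tokens).
    Returns a boolean mask, True for reasoning tokens."""
    threshold = torch.quantile(token_entropy, rho)  # per-response threshold
    return token_entropy > threshold

# Toy example: the two highest-entropy positions are flagged as reasoning tokens.
entropies = torch.tensor([0.10, 0.20, 1.50, 0.05, 0.30, 2.10, 0.15, 0.25, 0.40, 0.20])
print(reasoning_token_mask(entropies))
```

Because the threshold is recomputed per response, an easy prompt with uniformly low entropy still yields a reasoning/knowledge split, rather than having all of its tokens absorbed into one class by a batch-wide cutoff.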

Dual-Token Constraints

The RL objective extends the Group Relative Policy Optimization (GRPO) and DAPO frameworks by introducing token-type-dependent clipping and KL penalties:

  • Clipping:
    • ε^r (reasoning): Larger, allowing greater policy deviation for high-entropy tokens.
    • ε^k (knowledge): Smaller, restricting updates for low-entropy tokens.
  • KL Penalty:
    • β^r (reasoning): Lower or zero, reducing regularization on reasoning tokens.
    • β^k (knowledge): Higher, enforcing proximity to the base model for knowledge tokens.

The loss for each token is thus:

L_t = min(r_t * A_t, clip(r_t, 1 - ε(e_t), 1 + ε(e_t)) * A_t) - β(e_t) * KL(π_θ || π_ref)

where e_t is the entropy of token t, and ε(e_t) and β(e_t) select the clipping threshold and KL weight appropriate to the token's type.
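A minimal PyTorch-style sketch of this per-token objective is given below (illustrative only, not the released Archer code; tensor names, shapes, and the final averaging are assumptions). The clipping range and KL weight are selected per token from the reasoning mask, and every token contributes to a single synchronous update.

```python
# Illustrative sketch (not the released Archer implementation) of the
# dual-token-constraint loss: clipping range and KL weight are chosen per
# token from the reasoning/knowledge mask, and all tokens contribute to a
# single synchronous update.
import torch

def dual_token_loss(log_ratio, advantage, kl_to_ref, reasoning_mask,
                    eps_r=0.5, eps_k=0.2, beta_r=0.0, beta_k=0.001):
    """log_ratio:      per-token log(pi_theta / pi_old)
    advantage:      per-token advantage A_t (group-relative, broadcast over the response)
    kl_to_ref:      per-token estimate of KL(pi_theta || pi_ref)
    reasoning_mask: True where the token is high-entropy (reasoning)."""
    ratio = log_ratio.exp()                        # r_t

    # Token-type-dependent hyperparameters ε(e_t) and β(e_t).
    eps = torch.full_like(ratio, eps_k)
    eps[reasoning_mask] = eps_r
    beta = torch.full_like(ratio, beta_k)
    beta[reasoning_mask] = beta_r

    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    surrogate = torch.minimum(unclipped, clipped)  # PPO-style pessimistic term

    # Negate because we minimize: maximize the clipped surrogate while the
    # KL term anchors (mostly) knowledge tokens to the reference policy.
    return -(surrogate - beta * kl_to_ref).mean()
```

With β^r = 0 the KL term vanishes on reasoning tokens, so only knowledge tokens are pulled toward the reference model, while the wider ε^r clip range leaves more room for the policy to move on reasoning tokens.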

Synchronous Updates

Unlike gradient masking or asynchronous updates, Archer performs joint optimization over all tokens, preserving the sequential and contextual dependencies critical for effective reasoning.

Implementation

  • Base Model: DeepSeek-R1-Distill-Qwen-1.5B, distilled and SFT-finetuned on high-quality reasoning data.
  • Training: 16 rollouts per prompt, batch size 64, learning rate 1 × 10^-6, temperature 1.0, max response length 32,768 tokens.
  • Hardware: 2 nodes × 8 NVIDIA H800 80GB GPUs.
  • Hyperparameters: ρ = 0.8 (entropy quantile), ε^r = 0.5, ε^k = 0.2, β^r = 0.0, β^k = 0.001 (collected in the config sketch below).
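For reference, the settings above can be gathered into a single configuration sketch; the key names below are illustrative only and do not correspond to the ArcherCodeR repository's actual configuration schema.

```python
# Illustrative configuration summary; key names are invented for readability
# and do not match the ArcherCodeR repository's actual config files.
config = {
    "base_model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "rollouts_per_prompt": 16,
    "batch_size": 64,
    "learning_rate": 1e-6,
    "sampling_temperature": 1.0,
    "max_response_length": 32_768,
    "entropy_quantile_rho": 0.8,   # response-level split between token types
    "clip_eps_reasoning": 0.5,     # looser clipping for high-entropy tokens
    "clip_eps_knowledge": 0.2,     # tighter clipping for low-entropy tokens
    "kl_beta_reasoning": 0.0,      # no KL penalty on reasoning tokens
    "kl_beta_knowledge": 0.001,    # KL penalty anchoring knowledge tokens
}
```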

Empirical Results

Mathematical Reasoning

On AIME24, AIME25, AMC23, MATH-500, Minerva, and OlympiadBench, Archer achieves the highest average accuracy among all 1.5B models, with notable gains over DAPO and other SOTA baselines. For example, on AIME24, Archer improves avg@64 by +6.6 points over DAPO.

Code Generation

On LiveCodeBench v5 and v6, Archer outperforms DeepCoder-1.5B, Nemotron-1.5B, and DAPO, with avg@8 and avg@16 improvements of 2–3 points.

Efficiency

Archer achieves these results with single-stage training and significantly fewer GPU hours compared to multi-stage SOTA baselines (e.g., 1,900 H800 GPU hours for math RL vs. 16,000 H100 GPU hours for Nemotron-1.5B).

Ablation Studies

  • KL Weight: Both insufficient and excessive KL regularization on low-entropy tokens degrade performance; a moderate value (0.001) is optimal for stability and learning.
  • Clip Thresholds: Tighter clipping on low-entropy tokens preserves knowledge but slows learning; looser clipping on high-entropy tokens accelerates reasoning improvements but can risk overfitting if excessive.
  • Token Participation: Excluding low-entropy tokens from updates (as in gradient masking) impairs learning due to broken dependencies; synchronous, differentiated updates are empirically superior.

Cross-Domain Generalization

RL training on math or code tasks yields improvements on out-of-domain benchmarks, with gains correlating to problem difficulty rather than topical overlap. This supports the hypothesis that RLVR primarily enhances reasoning organization and integration, not factual knowledge.

Implications and Future Directions

Archer demonstrates that fine-grained, entropy-aware token-level optimization can simultaneously preserve factual knowledge and promote reasoning in LLMs. The approach is computationally efficient and robust, outperforming prior methods without complex multi-stage training. The findings suggest several avenues for future research:

  • Generalization to Larger Models: Scaling the dual-token constraint framework to larger LLMs and diverse domains.
  • Adaptive Constraint Scheduling: Dynamically adjusting KL and clipping parameters during training based on learning progress or token context.
  • Integration with Process-Based RL: Combining Archer with stepwise or process-based reward modeling for even finer control over reasoning behaviors.
  • Theoretical Analysis: Formalizing the relationship between token entropy, learning dynamics, and generalization in RLVR.

Conclusion

The Archer framework provides a principled, empirically validated method for stabilizing knowledge and promoting reasoning in RLVR post-training of LLMs. By synchronously applying entropy-aware dual-token constraints, it achieves SOTA results on challenging reasoning and code generation tasks, with strong efficiency and generalization properties. This work advances the understanding of token-level optimization in RL for LLMs and sets a foundation for future research on fine-grained, structure-aware RL training strategies.
