Entrocraft: Controlled Entropy in LLM RL
- Entrocraft is a reinforcement learning framework for LLMs that regulates entropy through user-specified schedules and rejection sampling, addressing performance saturation.
- It employs a modular, advantage-based filtering mechanism during rollout sampling to dynamically adjust policy entropy without modifying gradient computations.
- Empirical evaluations show up to 50% gains in pass@K metrics and extended productive training periods, underscoring its robustness in long-horizon RL tasks.
Entrocraft is a framework for reinforcement learning (RL) on LLMs that provides precise, user-customized control over the entropy curve during policy-gradient optimization. Developed to address performance saturation—a widespread problem where RL loses effectiveness due to entropy collapse—Entrocraft directly manipulates the advantage distribution through a rejection-sampling procedure, enabling explicit scheduling of entropy without requiring auxiliary loss regularization or changes to gradient computation. Its empirical and theoretical basis establishes entropy as a first-class, scheduleable hyperparameter for robust, long-horizon RL scaling in LLMs (Li et al., 29 Apr 2026).
1. Motivation: Entropy Collapse and Performance Saturation
Performance saturation in LLM RL emerges from the collapse of policy entropy during training. In standard policy-gradient updates, positive-advantage samples become increasingly dominant, systematically reducing both token-level and sequence-level entropy. Formally, Theorem 3.2 shows that every positive-advantage update results in a negative entropy change, so entropy declines steadily as such updates accumulate. As entropy collapses, exploration vanishes, biasing the model toward a narrow, overfit solution space and severely curtailing the effectiveness of continued RL, an effect termed “performance saturation.” Prior methods apply entropy regularization or clipping, but these affect entropy only indirectly and often lead to long-term instability.
Entrocraft counters this by directly enforcing a user-specified entropy trajectory, thereby sustaining exploration and avoiding premature saturation.
2. Algorithmic Structure of Entrocraft
Entrocraft is designed as a modular, advantage-estimator-agnostic filter interposed at rollout sampling time, compatible with any policy-gradient RL method (e.g., PPO, GRPO/GSPO). Its procedure for each batch is as follows:
- Compute the average entropy of all sampled rollouts.
- Set a schedule-based entropy band for allowable entropy at this step.
- Calculate an entropy-control flag with two active states:
  - Entropy exceeds the upper bound: decrease entropy.
  - Entropy falls below the lower bound: increase entropy.
- For each rollout, accept it into the update set with a probability determined by its estimated advantage, the control flag, and a tunable sharpness parameter.
The surrogate RL objective is computed solely on the accepted rollouts.
This adaptive filter re-weights advantages: when entropy is too low, negative-advantage samples (promoting stochasticity) are favored; when it is too high, positive-advantage samples are prioritized. Rollout selection is thus tightly coupled to a prescribed entropy schedule without modifying the policy-loss form.
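The per-batch procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the integer encoding of the flag (+1, -1, 0), and especially the exponential acceptance rule are assumptions chosen for clarity, since the exact acceptance probability is not reproduced here.

```python
import math
import random


def control_flag(mean_entropy, lower, upper):
    """Hypothetical encoding: +1 above the band, -1 below it, 0 inside it."""
    if mean_entropy > upper:
        return 1   # entropy too high: favour positive advantages
    if mean_entropy < lower:
        return -1  # entropy too low: favour negative advantages
    return 0       # inside the band: no filtering


def accept_probability(advantage, flag, beta):
    """Assumed rule (not taken from the source): rollouts whose advantage sign
    matches the flag always pass; the others pass with probability
    exp(-beta * |advantage|). beta plays the role of the sharpness parameter."""
    return math.exp(min(0.0, beta * flag * advantage))


def filter_rollouts(rollouts, entropies, advantages, lower, upper,
                    beta=2.0, rng=None):
    """Return the control flag and the rollouts accepted into the update set."""
    rng = rng or random.Random()
    flag = control_flag(sum(entropies) / len(entropies), lower, upper)
    kept = [
        rollout
        for rollout, adv in zip(rollouts, advantages)
        if rng.random() < accept_probability(adv, flag, beta)
    ]
    return flag, kept
```

The policy-gradient loss would then be computed only on the returned rollouts, leaving the loss form and the advantage estimator untouched. With the flag at 0 every rollout is accepted, so the filter is inactive while entropy stays inside the band.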
3. Theoretical Foundations
Entrocraft’s approach is grounded in a set of minimal-assumption theorems analyzing entropy drift in policy-gradient RL. The core assumptions are on-policy gradients and small learning rates, so that each update is small enough to justify a Taylor expansion.
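- Token-Level Entropy (Theorem 3.1): characterizes the change in token-level entropy produced by a single policy-gradient update. For a token meeting the theorem's conditions, the stated entropy change holds with high probability when a token-dependent quantity exceeds the “output-space baseline.”
- Sequence-Level Entropy (Theorem 3.2): gives the corresponding result for sequence-level entropy, under which every positive-advantage update produces a negative entropy change.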
These results explain why standard RL, dominated by positive-advantage trajectories, inherently reduces entropy and why strategies such as clipping provide only coarse control. Entrocraft, by shaping the distribution of accepted advantages, provides precise, parametric steering of policy entropy.
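A toy numerical check makes the direction of this drift concrete. The sketch below is not the paper's theorem or proof: it assumes a three-token softmax policy parameterized directly by its logits and applies one REINFORCE-style step, with all names and values chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pg_step(logits, token, advantage, lr):
    """One policy-gradient step on softmax logits.
    The gradient of log pi(token) w.r.t. the logits is onehot(token) - pi."""
    probs = softmax(logits)
    return [
        x + lr * advantage * ((1.0 if i == token else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


# Illustrative policy with one dominant token (probabilities 0.7, 0.2, 0.1).
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
h0 = entropy(softmax(logits))

# Reinforce the dominant token (positive advantage), then penalize it.
h_pos = entropy(softmax(pg_step(logits, token=0, advantage=+1.0, lr=0.1)))
h_neg = entropy(softmax(pg_step(logits, token=0, advantage=-1.0, lr=0.1)))

print(f"before: {h0:.4f}  after A>0: {h_pos:.4f}  after A<0: {h_neg:.4f}")
```

In this toy setting, rewarding the already-likely token lowers entropy, while penalizing it raises entropy. This is the asymmetry that a filter over accepted advantages can exploit.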
4. Entropy Schedule Design
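The core benefit of Entrocraft is the ability to enact arbitrary entropy trajectories via user-defined schedules that map the training step to a target entropy. Common choices are:
- Constant: the target entropy is held at a fixed value for all steps.
- Linear annealing: the target entropy moves linearly from an initial value to a final value over training.
- Cosine decay: the target entropy falls from an initial value to a final value along a cosine curve.
Entrocraft implements these by setting upper and lower bounds around the scheduled target and applying the filtered-sampling algorithm per batch.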
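The three schedule families and the band around the target can be written as small functions. This is a sketch under assumptions: the function names, the symmetric band with a fixed half-width, and the numeric values in the usage lines are hypothetical and are not taken from the source.

```python
import math


def _progress(step, total_steps):
    """Fraction of training completed, clamped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)


def constant_schedule(step, total_steps, h_target):
    """Same target entropy at every step."""
    return h_target


def linear_schedule(step, total_steps, h_start, h_end):
    """Linear interpolation from h_start to h_end."""
    return h_start + (h_end - h_start) * _progress(step, total_steps)


def cosine_schedule(step, total_steps, h_start, h_end):
    """Half-cosine curve from h_start (step 0) to h_end (final step)."""
    frac = _progress(step, total_steps)
    return h_end + 0.5 * (h_start - h_end) * (1.0 + math.cos(math.pi * frac))


def entropy_band(target, half_width):
    """Assumed symmetric band: (lower, upper) bounds around the target."""
    return target - half_width, target + half_width


# Hypothetical values for illustration only.
target = linear_schedule(step=500, total_steps=1000, h_start=1.0, h_end=0.4)
lower, upper = entropy_band(target, half_width=0.05)
```

The resulting `lower` and `upper` values are what a per-batch filter would compare against the measured mean entropy.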
Empirical evidence supports the use of a high initial entropy linearly annealed to a moderate target, which favors early-stage exploration and late-stage convergence. Constant or persistently high entropy schedules can destabilize training, underscoring the importance of annealing.
5. Empirical Evaluation and Results
Evaluation was conducted primarily on the Qwen3-4B-Base model, with comparisons to Qwen3-8B, Qwen3-14B, and Llama-3.1-8B, trained on the Numina-Math corpus (440K math reasoning samples) and evaluated on benchmarks such as MATH-500, AMC-23, and AIME-24/25/26. Rollout inference was performed with 32 samples per question.
Key improvements from Entrocraft (especially with linear annealing) over existing baselines (GRPO/GSPO and entropy-preserving techniques):
- Qwen3-4B + Entrocraft exceeded the baseline Qwen3-8B performance.
- Achieved a 50% relative gain in pass@32 on MATH-500/AIME.
- Prolonged productive RL training by up to 4× before plateau onset.
- Output diversity, as measured by pass@K growth, increased more rapidly, signaling sustained exploration.
These results establish Entrocraft as an effective means for overcoming saturation and enhancing output quality and diversity in LLM RL.
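For readers unfamiliar with the metric, pass@K can be estimated from a fixed number of samples per question. The source does not state which estimator was used; the sketch below assumes the commonly used unbiased combinatorial estimator, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples were correct.

    Uses 1 - C(n - c, k) / C(n, k), i.e. one minus the chance that a
    random size-k subset of the n samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 32 samples per question, `pass_at_k(32, c, 32)` is 1.0 whenever at least one sample is correct and 0.0 otherwise, so pass@32 in that setting measures whether any of the 32 rollouts solves the problem.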
6. Discussion, Significance, and Limitations
By directly regulating the distribution of accepted advantages at each RL iteration, Entrocraft neutralizes the self-amplifying drift toward low entropy found in standard RL while preserving upward reward optimization. The filter dynamically tracks any user-specified entropy schedule within a few steps, obviating the need for auxiliary regularization or explicit entropy terms in the loss.
Limitations and open questions include:
- Fixed, excessively high entropy schedules can severely limit the supply of positive-advantage (reward-improving) samples, leading to unstable long-run training. Annealing from high to moderate entropy is necessary for balance.
- Extensions to multi-turn dialogue RL, sparse-MoE policies, or more complex reward structures are unexplored.
- Theoretical developments assume on-policy gradients and modest learning rates; behavior under heavy off-policy trajectories or quickly shifting baselines remains undetermined.
Entrocraft reframes policy entropy as an actively managed hyperparameter rather than a passive diagnostic, providing a foundation for robust, scalable RL in LLM systems (Li et al., 29 Apr 2026).