
Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Published 28 Dec 2025 in cs.LG, cs.AI, cs.IT, and stat.ML | (2512.23075v1)

Abstract: Policy gradient methods for LLMs optimize a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. When $\pi_{\text{roll}} \ne \pi_\theta$, there is approximation error between the surrogate and the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds on the resulting error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. Crucially, both bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ -- the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: it requires examining the entire trajectory to compute, and therefore cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
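
The masking rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the KL direction (rollout policy first), the threshold name `delta`, the helper names, and the choice to average the loss over kept sequences only are all assumptions made for illustration. The abstract states only that a sequence is excluded from gradient computation if any token violates the trust region.

```python
import math

def token_kl(p, q):
    """KL(p || q) for two categorical distributions given as probability lists.

    Assumes q has full support (as a softmax output would); the direction
    of the divergence is an assumption, not taken from the paper.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def max_token_kl(roll_dists, policy_dists):
    """Largest per-position KL over one sequence (a sequence-level quantity)."""
    return max(token_kl(p, q) for p, q in zip(roll_dists, policy_dists))

def trust_region_mask(batch_max_kl, delta):
    """Return 1.0 to keep a sequence, 0.0 to drop the whole sequence.

    A single token above the hypothetical threshold `delta` masks every
    token in that sequence, unlike per-token clipping.
    """
    return [1.0 if kl <= delta else 0.0 for kl in batch_max_kl]

def masked_mean_loss(seq_losses, mask):
    """Average per-sequence losses over kept sequences (assumed normalization)."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # every sequence violated the trust region
    return sum(l * m for l, m in zip(seq_losses, mask)) / kept

# Two toy sequences over a two-token vocabulary.
seq_a_roll = [[0.5, 0.5], [0.9, 0.1]]
seq_a_policy = [[0.5, 0.5], [0.9, 0.1]]    # identical: KL is 0 everywhere
seq_b_roll = [[0.5, 0.5], [0.9, 0.1]]
seq_b_policy = [[0.99, 0.01], [0.9, 0.1]]  # first position drifts far

batch_max_kl = [
    max_token_kl(seq_a_roll, seq_a_policy),
    max_token_kl(seq_b_roll, seq_b_policy),
]
mask = trust_region_mask(batch_max_kl, delta=0.05)
loss = masked_mean_loss([2.0, 10.0], mask)
```

In this toy batch the second sequence is dropped entirely because one position exceeds `delta`, even though its other position matches the rollout policy exactly. That is the behavior that distinguishes sequence-level masking from token-independent clipping.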
