Beyond Pessimism: Offline Learning in KL-regularized Games
Published 8 Apr 2026 in cs.GT and cs.LG | (2604.06738v1)
Abstract: We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized under a KL constraint to a fixed reference policy. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields the first $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for offline learning in KL-regularized zero-sum games, achieved entirely without pessimism. We further propose an efficient self-play policy optimization algorithm and prove that, with a number of iterations linear in the sample size, it achieves the same fast $\widetilde{\mathcal{O}}(1/n)$ statistical rate as the minimax estimator.
The paper demonstrates that explicit pessimism is unnecessary in offline KL-regularized zero-sum games by exploiting the game's strong convex-concavity to achieve an $\widetilde{\mathcal{O}}(1/n)$ statistical rate.
The methodology uses direct minimax estimation with least-squares regression and a self-play mirror descent algorithm, eliminating conservative uncertainty penalties.
The results have practical implications for RLHF and LLM alignment, showing that KL-regularization can stabilize offline learning without complex pessimistic adjustments.
Fast-Rate Offline Learning in KL-Regularized Zero-Sum Games Without Pessimism
Introduction and Problem Formulation
The paper "Beyond Pessimism: Offline Learning in KL-regularized Games" (2604.06738) addresses the problem of policy learning in offline two-player zero-sum games with entropic (KL) regularization, where both players' policies are regularized with respect to a fixed reference policy by a KL divergence. This framework is highly relevant to contemporary LLM alignment strategies, e.g., RLHF under KL constraints, and general preference optimization via pairwise comparison games with safety constraints.
The principal challenge in offline settings is distributional shift: the evaluation and optimization policies may diverge from the behavior policy that generated the static dataset, rendering value estimates unreliable for actions insufficiently covered by the data. Previous work has addressed this through explicit pessimism, down-weighting uncertain or out-of-distribution values via lower-confidence bounds or uncertainty penalization, yielding $\widetilde{\mathcal{O}}(1/\sqrt{n})$ statistical rates under unilateral coverage. However, pessimism introduces algorithmic complexity and requires careful parameter tuning.
This paper poses a crucial question: Can one exploit the geometry of KL-regularized games to bypass pessimism and obtain sharper statistical guarantees? The authors' contributions provide an affirmative answer.
Main Technical Contributions
Pessimism-Free Equilibrium Learning
The paper introduces a direct minimax estimation procedure for KL-regularized games that completely eschews pessimism. Policies are learned by first fitting the game payoff function via least squares regression, then computing the Nash equilibrium for the empirical game induced by the estimated payoff and a KL constraint with respect to the reference policy. Notably, value estimation does not introduce any confidence lower bounds or auxiliary uncertainty penalties.
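As a concrete sketch of the first step, specialized to a tabular, stateless game (the function name and the tabular simplification are ours, not the paper's), the least-squares fit reduces to per-cell empirical means:

```python
import numpy as np

def fit_payoff_least_squares(data, n_a, n_b):
    """Fit the payoff of a tabular two-player game by least squares.

    data: iterable of (a, b, r) samples from the offline behavior policy.
    With a tabular function class, the least-squares minimizer of
    sum_i (f(a_i, b_i) - r_i)^2 is the per-cell empirical mean.
    """
    sums = np.zeros((n_a, n_b))
    counts = np.zeros((n_a, n_b))
    for a, b, r in data:
        sums[a, b] += r
        counts[a, b] += 1
    # Unvisited cells default to 0; crucially, no confidence-based
    # penalty is subtracted from poorly covered cells.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
```

The second step solves the KL-regularized game induced by this estimate; the self-play procedure sketched below is one way to do so.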
Key technical leverage is drawn from two geometric properties of the KL-regularized game objective:
Strong convex-concavity induced by the KL terms: The regularized game admits a unique, smooth Nash equilibrium and exhibits stability to small perturbations in payoffs.
Stability under the game’s operator skew-symmetry: The duality gap, the principal measure of policy suboptimality, can be tightly related to unilateral estimation errors under realized policies, instead of requiring uniform error bounds over the space of all policies.
This structural exploitation allows the authors to circumvent the standard pessimism-based error decomposition and, crucially, to reduce the core statistical error to the unilateral function-class estimation error.
Fast-Rate Generalization Bound
Underlying the analysis is the observation that the duality gap of the learned (non-pessimistic) policies is controlled by a quadratic function of the $\ell_1$ distance between the computed and optimal regularized policies. The strong convexity conferred by the KL terms allows this distance to be bounded by the unilateral estimation error evaluated only at the Nash equilibrium policies and their unilateral deviations, thus requiring only a unilateral concentrability assumption (standard for games).
The main theoretical result is an $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for the Nash duality gap in KL-regularized offline games, leveraging fast-rate concentration bounds for least-squares regression in the finite function class regime. This is a qualitative improvement over the $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates of pessimistic approaches, establishing that explicit pessimism is not statistically necessary in the KL-regularized setting.
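In schematic form (our paraphrase; constants, the dependence on the regularization strength $\beta$, and coverage factors are suppressed), the argument chains three inequalities:

$$\mathrm{Gap}(\hat\pi)\;\lesssim\;\|\hat\pi-\pi^{*}\|_{1}^{2}\;\lesssim\;\|\hat f-f^{*}\|_{2,\mu}^{2}\;\lesssim\;\frac{\log(|\mathcal{F}|/\delta)}{n},$$

where $\pi^{*}$ is the unique regularized Nash equilibrium, $\mu$ is the data distribution (assumed to cover unilateral deviations from $\pi^{*}$), and the last step is the standard fast-rate least-squares bound over a finite function class $\mathcal{F}$ containing the true payoff $f^{*}$, holding with probability at least $1-\delta$.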
Efficient Self-Play Policy Optimization
The authors also propose a computationally efficient self-play algorithm based on KL-regularized mirror descent-ascent. This approach alternates between regularized best-response updates for each player, implemented as regularized mirror descent in policy space against the fixed empirical payoff estimate. The self-play dynamic matches the statistical efficiency of direct minimax estimation, achieving the same O~(1/n) statistical rate when the number of mirror descent iterations is linear in the sample size, thus resolving both computational and statistical aspects in a principled manner.
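A minimal sketch of such a self-play loop on the estimated payoff matrix, assuming a tabular game with simultaneous updates (the step size, iteration budget, and lack of iterate averaging are our simplifications, not the paper's tuned choices):

```python
import numpy as np

def kl_mirror_descent_selfplay(f_hat, ref1, ref2, beta, eta, n_iters):
    """Self-play KL-regularized mirror descent-ascent.

    f_hat: (A, B) estimated payoff; player 1 maximizes, player 2 minimizes.
    ref1, ref2: reference policies (the KL anchors); assumes eta * beta < 1.
    """
    p1, p2 = ref1.copy(), ref2.copy()
    for _ in range(n_iters):
        g1 = f_hat @ p2    # player 1's payoff vector against current p2
        g2 = f_hat.T @ p1  # player 2 descends on this (negated in the exp)
        # Entropic mirror step: geometric interpolation toward the reference
        # (the KL-regularization pull) times a multiplicative-weights update.
        p1 = p1 ** (1 - eta * beta) * ref1 ** (eta * beta) * np.exp(eta * g1)
        p2 = p2 ** (1 - eta * beta) * ref2 ** (eta * beta) * np.exp(-eta * g2)
        p1 /= p1.sum()
        p2 /= p2.sum()
    return p1, p2
```

The $\eta\beta$ exponent on the reference policy is exactly the pull of the KL anchor; under the strong convex-concavity discussed above, one expects the last iterate to converge to the unique regularized equilibrium for suitably small step sizes.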
Contrasting with Prior Work
Prior analyses (e.g., [ye2024online]) have relied on constructing a pessimistic payoff estimator by subtracting confidence-based uncertainty penalties, ensuring that policy optimization is carried out with guaranteed lower bounds on the true value. The decomposition of policy suboptimality in these works necessarily leads to a dominant $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rate, rooted in worst-case uncertainty over function classes and policy coverage.
In contrast, this paper's analysis leverages the geometry and softmax structure of KL-regularized responses: via monotonicity and stability results, policy errors can be tightly controlled by observed unilateral errors, which admit fast-rate statistical control. This departs from classical pessimism and delivers a matching minimax rate, under standard (and necessary) function approximation and unilateral coverage assumptions.
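The softmax structure invoked here is the closed form of the KL-regularized best response: against a fixed payoff vector $g$ (e.g., $g = \hat f\,\pi_2$ for player 1), the best response is

$$\pi_{\beta}[g](a)\;\propto\;\pi_{\mathrm{ref}}(a)\,\exp\!\big(g(a)/\beta\big),$$

and this map is smooth in $g$: perturbing $g$ by $\varepsilon$ in sup norm moves $\pi_{\beta}[g]$ by at most $O(\varepsilon/\beta)$ in $\ell_1$ (a standard Lipschitz property of softmax; constants suppressed). This is the stability that lets payoff-estimation error translate directly into policy error without any pessimistic correction.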
Remarkably, the analysis shows that KL-regularization intrinsically stabilizes offline equilibrium computation against extrapolation error, as long as the data covers unilateral deviations from the Nash equilibrium. This is in marked contrast to the pessimistic paradigm for general MDPs and games.
Implications and Future Directions
Theoretical Implications
The work shows that explicit pessimism is unnecessary for statistical efficiency in zero-sum games, provided the KL regularization is strong enough to ensure uniqueness and stability of the Nash equilibrium.
This sharp statistical characterization (fast rate under unilateral coverage) suggests KL-regularization is not only a practical stabilization device, but also a theoretically sufficient mechanism for safe offline equilibrium learning in adversarial domains.
The non-pessimistic framework opens avenues for refined regret and sample complexity analyses in broader settings such as general-sum games, extensive-form games, and multi-agent RL under regularization.
Practical and Algorithmic Implications
By eliminating explicit pessimism and corresponding hyperparameter tuning, the framework enables simpler, more robust implementation of RLHF, offline game-theoretic optimization, and preference model alignment (including LLM alignment) using static datasets.
The mirror descent self-play algorithm is scalable and aligns naturally with standard multi-agent online learning dynamics, making it attractive for large-scale applications.
The results point to the importance of identifying and exploiting problem geometry (e.g., strong convex-concave structure) in offline RL as a means of achieving strong robustness without conservative bias.
Speculative Directions
Extending the analysis to infinite function approximation classes, non-zero-sum and stochastic games, or more general f-divergence regularizations could yield further insight into the generality of the pessimism-free approach.
Connections to contemporary methods for RLHF and direct preference optimization (e.g., DPO) deserve investigation, especially as LLM alignment spawns new empirical protocols grounded in KL-regularized objectives.
There are open questions around the optimal interplay between regularization strength, data coverage, and achievable minimax rates.
Conclusion
This paper delivers a decisive answer on the necessity of pessimism in KL-regularized offline game learning: fast-rate $\widetilde{\mathcal{O}}(1/n)$ sample efficiency is achievable without explicit pessimism in zero-sum games, given standard function approximation and coverage assumptions. The technical core is the exploitation of the geometry induced by KL regularization and a reduction of the policy-suboptimality analysis to unilateral estimation error. By providing both algorithmic and analytical frameworks, the work positions KL-regularized games as fertile ground for principled, practical, and efficient offline multi-agent learning, with significant implications for emerging domains such as LLM alignment and RLHF (2604.06738).