Bidirectional KL & Entropy Regularization

Updated 8 June 2026

Bidirectional KL and entropy regularization are techniques that combine forward and reverse divergence measures to balance mass-covering and mode-seeking behavior in policy optimization.
These methods use forward KL for sample-efficient policy initialization and reverse KL for robust convergence, with reported gains of up to 30% in reward alongside reduced gradient variance.
The approach also supports adaptive entropy modulation and applies to reinforcement learning, deep control, and optimal planning, with reported empirical gains in stability and performance.

Bidirectional KL divergence and entropy regularization constitute a family of approaches in machine learning and control where optimization objectives are augmented with Shannon entropy, forward and reverse Kullback-Leibler (KL) terms, or more general $f$-divergences to promote stability, regularity, and improved exploration. These methodologies unify and generalize a spectrum of techniques in deep learning, reinforcement learning (RL), and optimal control, ranging from classic maximum-entropy principles to nuanced bidirectional or asymmetric regularizations with explicit impact on sample efficiency, policy expressiveness, and convergence robustness.

1. Foundations: KL Divergences and Entropy in Learning

KL divergence, $D_{\mathrm{KL}}(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}dx$, measures the mismatch between two probability distributions; it is asymmetric and therefore not a metric. In optimization against a target $q$ with a policy $\pi$, the forward KL ($D_{\mathrm{KL}}(q\|\pi)$) and reverse KL ($D_{\mathrm{KL}}(\pi\|q)$) yield distinct behaviors: the forward direction is mass-covering, often leading to broader support, while the reverse direction is mode-seeking, concentrating probability mass on the peaks of the target.
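To make the mass-covering versus mode-seeking contrast concrete, the following minimal sketch fits a single discretized Gaussian to a bimodal target by brute-force search under each KL direction. The grid, the target mixture, the candidate set, and the helper names (`gaussian_on_grid`, `kl`) are illustrative assumptions and are not taken from any of the cited works.

```python
import math

def gaussian_on_grid(xs, mean, std):
    """Discretized, renormalized Gaussian over the grid points xs."""
    w = [math.exp(-0.5 * ((x - mean) / std) ** 2) for x in xs]
    total = sum(w)
    return [v / total for v in w]

def kl(p, q):
    """Discrete D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Grid on [-6, 6] and a bimodal target q with modes at -2 and +2.
xs = [-6.0 + 0.1 * i for i in range(121)]
left = gaussian_on_grid(xs, -2.0, 0.5)
right = gaussian_on_grid(xs, 2.0, 0.5)
target = [0.5 * (a + b) for a, b in zip(left, right)]

# Candidate single-Gaussian policies pi over a coarse (mean, std) grid.
candidates = [(-3.0 + 0.5 * i, 0.5 * j) for i in range(13) for j in range(1, 7)]

# Forward KL D_KL(q || pi): penalizes pi for missing any mass of q.
forward_fit = min(candidates, key=lambda c: kl(target, gaussian_on_grid(xs, *c)))
# Reverse KL D_KL(pi || q): penalizes pi for placing mass where q is small.
reverse_fit = min(candidates, key=lambda c: kl(gaussian_on_grid(xs, *c), target))

print("forward-KL fit (mean, std):", forward_fit)
print("reverse-KL fit (mean, std):", reverse_fit)
```

Under these assumptions the forward-KL fit is a broad Gaussian centered between the two modes, while the reverse-KL fit is a narrow Gaussian sitting on a single mode. Which of the two symmetric modes the reverse fit selects depends only on floating-point tie-breaking.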

Shannon entropy, $\mathcal{H}(\pi)=-\mathbb{E}_{x\sim\pi}[\log\pi(x)]$, incentivizes policy stochasticity and exploration. In entropy-regularized RL or classification, adding $-\alpha\mathcal{H}(\pi)$ or KL terms to a minimized objective (equivalently, adding $+\alpha\mathcal{H}(\pi)$ to a maximized return) shapes the optimization trajectory and the support of the solution. More generally, $\alpha$-divergences interpolate between the two KL directions and yield a spectrum of actor-critic and trust-region methods (Belousov et al., 2019).
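As a small worked illustration of the entropy bonus, the sketch below evaluates the regularized objective $\mathbb{E}_{\pi}[Q]+\alpha\mathcal{H}(\pi)$ for a single state with a handful of discrete actions. It relies on the standard result that this objective is maximized by the Boltzmann policy $\pi(a)\propto\exp(Q(a)/\alpha)$, with optimal value $\alpha\log\sum_a\exp(Q(a)/\alpha)$. The action values and the temperature are hypothetical.

```python
import math

def softmax(values, temperature):
    """Boltzmann distribution proportional to exp(value / temperature)."""
    m = max(values)
    w = [math.exp((v - m) / temperature) for v in values]
    z = sum(w)
    return [x / z for x in w]

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def soft_objective(p, q_values, alpha):
    """Entropy-regularized objective E_p[Q] + alpha * H(p)."""
    return sum(pi * qv for pi, qv in zip(p, q_values)) + alpha * entropy(p)

q_values = [1.0, 0.5, -0.2, 0.9]  # hypothetical action values
alpha = 0.3                       # hypothetical temperature

boltzmann = softmax(q_values, alpha)
greedy = [1.0 if v == max(q_values) else 0.0 for v in q_values]
uniform = [1.0 / len(q_values)] * len(q_values)

# Closed-form optimum of the regularized objective.
soft_max_value = alpha * math.log(sum(math.exp(v / alpha) for v in q_values))

for name, p in [("boltzmann", boltzmann), ("greedy", greedy), ("uniform", uniform)]:
    print(name, round(soft_objective(p, q_values, alpha), 4), round(entropy(p), 4))
```

The Boltzmann policy attains the closed-form value, and both the greedy and the uniform policy score strictly lower. Raising `alpha` moves the optimum toward the uniform policy, and lowering it moves the optimum toward the greedy one.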

2. Bidirectional KL in RL and Control Algorithms

Recent RL algorithms employ both forward and reverse KL divergences in alternating or composite roles. In "Bidirectional Soft Actor-Critic: Leveraging Forward and Reverse KL Divergence for Efficient Reinforcement Learning" (Zhang et al., 2 Jun 2025), the Bidirectional SAC algorithm exploits the tractable closed-form forward-KL projection onto a Boltzmann policy for initialization, followed by reverse KL refinement:

Stage 1 (Forward KL): For Gaussian policies, the forward KL projection reduces to moment matching: the policy mean and covariance are set to those of the target Boltzmann distribution $q(a|s)\propto\exp(Q(s,a)/\alpha)$, so the policy quickly covers the target's mass. This update provides sample-efficient, stable exploration early in training.
Stage 2 (Reverse KL): The policy is then optimized with a reverse KL loss plus a proximity penalty to the forward solution, ensuring monotonic improvement and robust final convergence.
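The two stages above can be sketched on a one-dimensional toy problem. This is not the Bidirectional SAC implementation: the critic `q_value`, the action grid, the quadratic form of the proximity penalty, its weight `lam`, and the derivative-free coordinate search are all assumptions chosen to keep the example small and dependency-free.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_on_grid(actions, mean, std):
    """Discretized, renormalized Gaussian policy over the action grid."""
    return normalize([math.exp(-0.5 * ((a - mean) / std) ** 2) for a in actions])

def reverse_kl(policy, target):
    """D_KL(policy || target) on the shared action grid."""
    return sum(p * math.log(p / t) for p, t in zip(policy, target) if p > 0.0)

def q_value(a):
    """Hypothetical 1-D critic with two unequal peaks."""
    return 1.0 * math.exp(-(a - 1.0) ** 2) + 0.6 * math.exp(-(a + 1.5) ** 2)

actions = [-3.0 + 0.05 * i for i in range(121)]
alpha = 0.2  # temperature of the Boltzmann target
q_max = max(q_value(a) for a in actions)
target = normalize([math.exp((q_value(a) - q_max) / alpha) for a in actions])

# Stage 1 (forward KL): for a Gaussian family, minimizing D_KL(target || pi)
# amounts to matching the target's mean and variance.
mean0 = sum(t * a for t, a in zip(target, actions))
std0 = math.sqrt(sum(t * (a - mean0) ** 2 for t, a in zip(target, actions)))

# Stage 2 (reverse KL): greedy coordinate search on D_KL(pi || target) plus an
# assumed quadratic proximity penalty to the stage-1 solution.
lam = 0.1
def stage2_loss(mean, std):
    penalty = lam * ((mean - mean0) ** 2 + (std - std0) ** 2)
    return reverse_kl(gaussian_on_grid(actions, mean, std), target) + penalty

mean, std, step = mean0, std0, 0.2
best = stage2_loss(mean, std)
while step > 1e-3:
    moves = [(mean + step, std), (mean - step, std), (mean, std + step), (mean, std - step)]
    moves = [(m, s) for m, s in moves if s > 0.05]  # keep the std positive
    cand = min(moves, key=lambda ms: stage2_loss(*ms))
    cand_loss = stage2_loss(*cand)
    if cand_loss < best:
        (mean, std), best = cand, cand_loss  # accept only strict improvements
    else:
        step *= 0.5

rev_init = reverse_kl(gaussian_on_grid(actions, mean0, std0), target)
rev_final = reverse_kl(gaussian_on_grid(actions, mean, std), target)
print(f"stage 1: mean={mean0:.3f}, std={std0:.3f}, reverse KL={rev_init:.4f}")
print(f"stage 2: mean={mean:.3f}, std={std:.3f}, reverse KL={rev_final:.4f}")
```

Because the search starts at the moment-matched solution, where the penalty is zero, and accepts only improving moves, the refined policy's reverse KL cannot exceed that of the initialization. The penalty weight controls how far the refinement may drift from the forward-KL solution.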

This bidirectional approach empirically yields up to 30% improvements in reward and 25% greater sample efficiency on continuous control benchmarks, reducing gradient variance and premature convergence risks inherent in one-sided regularization.

Extending beyond SAC, general actor-critic frameworks can be parameterized by an $\alpha$-divergence to interpolate between the forward- and reverse-KL regimes, recovering popular approaches such as TRPO, REPS (forward KL), and mode-seeking Q-learning (reverse KL), as well as the intermediate Hellinger divergence (Belousov et al., 2019).

3. Entropy Regularization and Its Generalizations

Entropy regularization, classically the maximization of expected entropy or minimization of cross-entropy loss in supervised learning, has emerged as a key tool in RL, especially for stability and exploration. In stochastic optimal control, separate KL penalties can be assigned to both policy and transition dynamics, yielding a general two-parameter design space that encompasses:

Stochastic Optimal Control (SOC): Classical cost-expectation with no regularization.
Maximum-Entropy (“soft policy”) Control: Adds a KL penalty on the policy to a reference.
Risk-Sensitive Control: KL or exponential regularization on transition or outcome distributions.
Joint Policy-and-Transition KLs: Unify and extend previous approaches, allowing separate weighting for compositionality and path-integral solutions (Bhole et al., 5 Dec 2025).

In the synchronized case ($\alpha=\beta$), where the two KL weights coincide, compositionality arises, enabling linear Bellman updates and efficient reuse of subtask policy solutions.
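To illustrate what a linear Bellman update and compositional reuse can look like, the sketch below uses a Todorov-style linearly solvable MDP as a stand-in. That setting is an assumption, and it is not the formulation of Bhole et al., whose details are not reproduced here. The desirability $z(s)=\exp(-V(s))$ satisfies a linear fixed-point equation, so solutions for subtasks that differ only in their terminal costs can be combined linearly. The chain, costs, and mixing weights are hypothetical.

```python
import math

def solve_desirability(costs, passive, terminal_z, sweeps=500):
    """Iterate the linear Bellman equation z(s) = exp(-c(s)) * sum_s' P(s'|s) z(s').

    Terminal states keep their fixed desirability; interior states start at 0.
    """
    n = len(costs)
    z = [terminal_z.get(s, 0.0) for s in range(n)]
    for _ in range(sweeps):
        new_z = list(z)
        for s in range(n):
            if s in terminal_z:
                continue
            new_z[s] = math.exp(-costs[s]) * sum(p * z[t] for t, p in passive[s].items())
        z = new_z
    return z

# Hypothetical 5-state chain: states 0 and 4 are terminal, the passive dynamics
# are an unbiased random walk, and each interior step costs 0.1.
n_states = 5
costs = [0.0, 0.1, 0.1, 0.1, 0.0]
passive = {s: {s - 1: 0.5, s + 1: 0.5} for s in range(1, n_states - 1)}
passive[0], passive[4] = {}, {}

# Two subtasks differing only in terminal desirability z = exp(-terminal cost).
z_left = solve_desirability(costs, passive, {0: 1.0, 4: math.exp(-5.0)})
z_right = solve_desirability(costs, passive, {0: math.exp(-5.0), 4: 1.0})

# Composite task: terminal desirabilities are a weighted sum of the subtasks'.
w_left, w_right = 0.3, 0.7
terminal_mix = {
    0: w_left * 1.0 + w_right * math.exp(-5.0),
    4: w_left * math.exp(-5.0) + w_right * 1.0,
}
z_mix = solve_desirability(costs, passive, terminal_mix)
z_composed = [w_left * a + w_right * b for a, b in zip(z_left, z_right)]

print("solved directly:", [round(v, 5) for v in z_mix])
print("composed       :", [round(v, 5) for v in z_composed])
```

The directly solved and the composed desirabilities agree up to floating-point rounding, which is the sense in which subtask solutions can be reused without solving the composite problem from scratch.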

Minimum entropy regularization, pursued in classification as in "Regularizing cross entropy loss via minimum entropy and K-L divergence" (Ibraheem, 23 Jan 2025), seeks to minimize the entropy of the output distribution, favoring confident predictions while optionally penalizing divergence from targets with KL terms. The proposed MIX-ENT and MIN-ENT losses empirically outperform classical cross-entropy on benchmarks, with the MIX-ENT regularizer introducing a reverse-like KL (target and hypothesis swapped), highlighting the nuanced impact of KL directionality even in supervised regimes.
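A generic minimum-entropy regularizer can be sketched as cross-entropy plus a weighted entropy of the predicted distribution. This is an illustrative loss only: it is not the exact MIN-ENT or MIX-ENT formulation from the cited paper, and the weight `lam` and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    w = [math.exp(v - m) for v in logits]
    z = sum(w)
    return [x / z for x in w]

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def entropy(probs):
    """Shannon entropy of the predicted class distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def min_entropy_loss(logits, label, lam=0.1):
    """Cross-entropy plus lam * H(prediction); lam is a hypothetical weight."""
    probs = softmax(logits)
    return cross_entropy(probs, label) + lam * entropy(probs)

hesitant = [1.0, 0.8, 0.5]   # nearly flat prediction
confident = [4.0, 0.5, 0.2]  # sharply peaked on class 0
label = 0

for name, logits in [("hesitant", hesitant), ("confident", confident)]:
    ce = cross_entropy(softmax(logits), label)
    print(name, "CE:", round(ce, 4), "regularized:", round(min_entropy_loss(logits, label), 4))
```

The entropy term is nonnegative, so it adds an extra penalty that is largest for flat predictions and vanishes as the prediction becomes one-hot. Setting `lam=0` recovers plain cross-entropy.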

4. Adaptive and Bidirectional Entropy Modulation

Uniform entropy maximization can be suboptimal in structured or sparse-reward domains such as RL with verifiable rewards (RLVR) in LLMs. "Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation" (Gu et al., 6 Apr 2026) proposes decomposing total policy entropy into:

Informative Entropy: Entropy on successful (positive-reward) trajectories, crucial for sustaining exploration of promising reasoning paths.
Spurious Entropy: Entropy on failed (negative) trajectories, interpreted as noise.

Bidirectional entropy modulation leverages asymmetric weighting to preserve informative entropy via moderated gradients on positive outcomes while aggressively reducing spurious entropy through stronger penalization on negatives. This is operationalized in Asymmetric Group-Relative Policy Optimization (AsymGRPO), which independently sets reweighting exponents for positive and negative rollouts.
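The idea of treating positive and negative rollouts differently can be sketched as follows. This is not the AsymGRPO update, whose exact form is not given here. In particular, the paper's reweighting exponents are swapped for plain multiplicative weights (`pos_weight`, `neg_weight`), and the entropy split is approximated by averaging per-rollout entropies by outcome. All names and numbers are hypothetical.

```python
import math

def group_relative_advantages(rewards):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def asymmetric_scale(advantages, pos_weight, neg_weight):
    """Scale positive and negative advantages by independent weights."""
    return [a * (pos_weight if a > 0 else neg_weight) for a in advantages]

def entropy_split(rewards, entropies):
    """Mean policy entropy on successful versus failed rollouts."""
    pos = [h for r, h in zip(rewards, entropies) if r > 0]
    neg = [h for r, h in zip(rewards, entropies) if r <= 0]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(pos), mean(neg)

# One group of six rollouts with binary verifiable rewards (hypothetical data).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
entropies = [0.8, 1.4, 1.1, 0.6, 1.6, 1.2]

informative, spurious = entropy_split(rewards, entropies)
advantages = group_relative_advantages(rewards)
# Moderate the push on successes, strengthen the push against failures.
weighted = asymmetric_scale(advantages, pos_weight=0.5, neg_weight=1.5)

print("informative entropy:", round(informative, 3), "spurious entropy:", round(spurious, 3))
print("advantages:", [round(a, 3) for a in advantages])
print("weighted  :", [round(a, 3) for a in weighted])
```

With both weights set to 1 the function returns the usual group-relative advantages, so the symmetric baseline is a special case of this sketch.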

Empirical results on Qwen3-4B demonstrate that AsymGRPO surpasses baseline entropy-regularized and group-relative algorithms, with independent tuning yielding an additional 2.22% average accuracy gain over symmetric ablations. The explicit control of entropy dynamics via bidirectional regularization is shown to be necessary for robust and generalizable exploration in LLM RLVR.

5. Theoretical Frameworks: $f$-Divergences, Proximal Methods, and Monotonicity

Both entropy and KL-based regularization can be formalized within an $f$-divergence framework. For the $\alpha$-divergence family, one recovers forward KL as $\alpha\to 1$ and reverse KL as $\alpha\to 0$. Proximal policy iteration methods solve for

$\pi_{k+1}=\arg\max_{\pi}\ \mathbb{E}_{\pi}\left[A^{\pi_k}(s,a)\right]-\eta\,D_{f}(\pi\|\pi_k)$

where $A^{\pi_k}$ is the advantage under the current policy $\pi_k$ and $\eta>0$ scales the divergence penalty, yielding smooth, closed-form trust region updates, with actor-critic pairs (e.g., LS+AWML) arising in the Pearson $\chi^2$ case ($\alpha=2$) (Belousov et al., 2019). As $\alpha$ varies, the exploration-exploitation, stability, and convergence properties interpolate continuously between paradigms.
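The interpolation across the $\alpha$-divergence family can be checked numerically. The sketch below assumes the common parameterization $f_\alpha(x)=\frac{(x^\alpha-1)-\alpha(x-1)}{\alpha(\alpha-1)}$ with $D_f(p\|q)=\sum_x q(x)\,f_\alpha\!\left(p(x)/q(x)\right)$. Under that convention, $\alpha\to 1$ gives $D_{\mathrm{KL}}(p\|q)$, $\alpha\to 0$ gives $D_{\mathrm{KL}}(q\|p)$, $\alpha=2$ gives half the Pearson $\chi^2$ divergence, and $\alpha=1/2$ is proportional to the squared Hellinger distance. The two distributions are arbitrary examples.

```python
import math

def f_alpha(x, alpha):
    """Generator of the alpha-divergence, with the two KL limits handled explicitly."""
    if abs(alpha - 1.0) < 1e-12:
        return x * math.log(x) - (x - 1.0)
    if abs(alpha) < 1e-12:
        return -math.log(x) + (x - 1.0)
    return ((x ** alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def alpha_divergence(p, q, alpha):
    """D_f(p || q) = sum_x q(x) f_alpha(p(x) / q(x)) for strictly positive p and q."""
    return sum(qi * f_alpha(pi / qi, alpha) for pi, qi in zip(p, q))

def kl(p, q):
    """Discrete D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

half_pearson = 0.5 * sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
hellinger_like = 2.0 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

for a in (0.0, 0.5, 1.0, 2.0):
    print(f"alpha={a}: {alpha_divergence(p, q, a):.6f}")
print(f"KL(p||q)={kl(p, q):.6f}  KL(q||p)={kl(q, p):.6f}")
```

Evaluating at values of `alpha` close to, but not exactly at, 0 or 1 gives results that approach the two KL directions continuously, which is the sense in which a single parameter sweeps between the regimes.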

Regularized objectives incorporating KL, entropy, or general $f$-divergences admit monotonic policy improvement guarantees and underlie a wide spectrum of modern RL algorithms (Lee, 2020). In the context of optimal control, the iterative use of soft KL-regularized subproblems constitutes a majorization-minimization scheme guaranteed to converge to the unregularized classical optima (Bhole et al., 5 Dec 2025).
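Monotonic improvement is easy to observe in the simplest case. The sketch below assumes a single-state bandit with known rewards and uses the closed-form solution of the KL-penalized step, $\pi_{k+1}(a)\propto\pi_k(a)\exp(r(a)/\eta)$, which maximizes $\mathbb{E}_{\pi}[r]-\eta\,D_{\mathrm{KL}}(\pi\|\pi_k)$. The rewards, the step size `eta`, and the iteration count are hypothetical, and the general guarantees cited above cover far broader settings than this toy.

```python
import math

def kl_proximal_step(policy, rewards, eta):
    """Closed-form maximizer of E_pi[r] - eta * D_KL(pi || policy)."""
    m = max(rewards)
    w = [p * math.exp((r - m) / eta) for p, r in zip(policy, rewards)]
    z = sum(w)
    return [x / z for x in w]

def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))

rewards = [0.1, 0.5, 0.9, 0.3]     # hypothetical bandit rewards
policy = [0.25, 0.25, 0.25, 0.25]  # uniform initial policy
eta = 0.5                          # larger eta means smaller, more conservative steps

history = [expected_reward(policy, rewards)]
for _ in range(20):
    policy = kl_proximal_step(policy, rewards, eta)
    history.append(expected_reward(policy, rewards))

print("expected reward per iteration:", [round(v, 4) for v in history])
print("final policy:", [round(p, 4) for p in policy])
```

Each step tilts the previous policy toward higher-reward actions, so the expected reward never decreases, and the policy concentrates on the best action as iterations accumulate.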

6. Practical Implications and Empirical Outcomes

Bidirectional KL and entropy regularization offer practical advantages in terms of sample efficiency, stability, and final performance, especially in high-dimensional, multimodal, or compositional decision-making problems. Table 1 summarizes core mechanisms and their empirical effects:

Method/Principle	Mechanism	Empirical Impact
Bidirectional SAC (Zhang et al., 2 Jun 2025)	Forward-KL init, reverse-KL refinement	Up to +30% reward, +25% sample efficiency
AsymGRPO (Gu et al., 6 Apr 2026)	Bidirectional entropy modulation (pos/neg split)	+2.22% avg. over symmetric
$\alpha$-div trust region (Belousov et al., 2019)	Parameterized forward/reverse KL, closed-form updates	Stability/exploitation trade-off
Soft-policy SOC (Bhole et al., 5 Dec 2025)	Separate KL on policy and transitions	Majorizing classical control

Sophisticated regularization, including bidirectional KL terms and adaptive entropy refinement, is particularly impactful in settings where naïve entropy maximization leads to poor calibration or inefficient exploration, such as LLM-based RLVR, deep continuous control, or compositional planning in optimal control.

7. Synthesis and Outlook

Bidirectional KL and entropy regularization unify principles from statistical learning theory, RL, and stochastic control. These methods allow controlled interpolation between conservative, mass-covering approaches and aggressive, mode-seeking policies, governed by explicit divergence terms. In practice, independently tuning regularization on positive and negative outcomes, as well as combining the analytic tractability of forward-KL with the policy-improvement monotonicity of reverse-KL, yields enhanced exploration-exploitation trade-offs and tractable, compositional solutions. Continued theoretical development and empirical evaluation in large-scale, structured domains will inform best practices regarding divergence selection and regularization schedules, potentially extending these principles to non-Gaussian, hierarchical, or adversarial settings.