
Discounted Trajectory-wise KL Penalty

Updated 8 February 2026
  • Discounted Trajectory-wise KL Penalty is a reinforcement learning method that regularizes policy updates by combining a KL-divergence penalty with a path consistency loss.
  • It leverages off-policy data and a squared loss over trajectories to achieve superior sample efficiency and stability in both continuous control and generative tasks.
  • The approach unifies KL-regularized RL and Relative Trajectory Balance objectives, offering improved performance over traditional on-policy methods.

Trust-PCL is an off-policy reinforcement learning (RL) algorithm that optimizes a KL-regularized objective using a pathwise consistency loss. Originally introduced to address the sample inefficiency of on-policy trust region methods like TRPO, Trust-PCL leverages off-policy data and squared path consistency losses to stabilize training and enable effective optimization of RL policies under value-based and information-theoretic regularization. The framework is relevant not only for continuous control but, as recent work has shown, is theoretically equivalent to objectives that have emerged in sequential generative modeling, such as Relative Trajectory Balance (RTB) in GFlowNets, thus situating Trust-PCL as a central method in the landscape of KL-regularized RL methods (Deleu et al., 1 Sep 2025, Nachum et al., 2017).

1. KL-Regularized RL Objective and Pathwise Consistency

Trust-PCL is formulated within the general KL-regularized RL framework. The agent interacts with a Markov Decision Process (MDP) $M = (S \cup \{\perp\}, A, P, r)$, where trajectories $\tau = (s_0, \dots, s_T, \perp)$ accrue reward $\sum_{t=0}^{T} r(s_t, s_{t+1})$, often designed such that the total trajectory reward equals $-E(s_T)$ for some terminal energy $E(s_T)$. The learning objective augments the standard RL objective with a KL-divergence penalty, balancing cumulative reward against the cost of deviating from a fixed prior policy $\pi_{\text{prior}}$:

$$\pi^* = \underset{\pi}{\arg\max}\; \mathbb{E}_{\tau\sim\pi}\Bigl[\sum_{t=0}^{T} \bigl(r(s_t, s_{t+1}) - \alpha\,\mathrm{KL}\bigl(\pi(\cdot\mid s_t)\,\|\,\pi_{\text{prior}}(\cdot\mid s_t)\bigr)\bigr)\Bigr]$$

Here $\alpha > 0$ scales the relative weight of the KL penalty. The optimal policy has the form:

$$\pi^*(s'\mid s) \propto \pi_{\text{prior}}(s'\mid s)\,\exp\bigl([Q^*_{\text{soft}}(s,s') - V^*_{\text{soft}}(s)]/\alpha\bigr)$$

with soft Q- and V-functions given by:

$$Q^*_{\text{soft}}(s,s') = r(s,s') + V^*_{\text{soft}}(s'), \qquad V^*_{\text{soft}}(s) = \alpha \log \sum_{s'} \pi_{\text{prior}}(s'\mid s)\,\exp\bigl(Q^*_{\text{soft}}(s,s')/\alpha\bigr).$$

Pathwise consistency extends Bellman optimality to multi-step or trajectory-level relations, and is central to Trust-PCL’s off-policy capabilities (Nachum et al., 2017).
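As a concrete illustration, the soft backup and the induced optimal policy can be computed directly in a small discrete setting. The sketch below is illustrative only; the function names and the two-successor example are assumptions, not code from the cited papers:

```python
import math

def soft_backup(q_next, prior_probs, alpha):
    """Soft value V(s) = alpha * log sum_{s'} prior(s'|s) * exp(Q(s,s')/alpha)."""
    return alpha * math.log(sum(p * math.exp(q / alpha)
                                for p, q in zip(prior_probs, q_next)))

def optimal_policy(q_next, prior_probs, alpha):
    """pi*(s'|s) proportional to prior(s'|s) * exp((Q(s,s') - V(s))/alpha)."""
    v = soft_backup(q_next, prior_probs, alpha)
    # Dividing by exp(V/alpha) normalizes the distribution exactly.
    return [p * math.exp((q - v) / alpha)
            for p, q in zip(prior_probs, q_next)]

# Example: two successor states under a uniform prior.
q = [1.0, 0.0]
prior = [0.5, 0.5]
v = soft_backup(q, prior, 1.0)
pi = optimal_policy(q, prior, 1.0)
```

Note that the policy is normalized by construction: summing $\pi^*$ over successors yields $\exp(-V/\alpha)\sum_{s'}\pi_{\text{prior}}\exp(Q/\alpha) = 1$.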

2. Definition and Loss Function of Trust-PCL

Trust-PCL introduces parametric forms $V^\psi(s)$ for the soft value function and $\pi_\phi(s'\mid s)$ for the policy. The core construct is the per-trajectory residual:

$$\Delta_T(\tau; \phi, \psi) = -V^\psi(s_0) + \sum_{t=0}^{T} r(s_t, s_{t+1}) + \alpha \sum_{t=0}^{T} \log\frac{\pi_{\text{prior}}(s_{t+1}\mid s_t)}{\pi_\phi(s_{t+1}\mid s_t)}$$

The optimization target is the squared loss over trajectories sampled from a behavior policy $\pi_b$ (possibly drawn from a replay buffer):

$$L_T(\phi, \psi) = \frac{1}{2}\,\mathbb{E}_{\tau\sim\pi_b}\bigl[\Delta_T(\tau; \phi, \psi)^2\bigr]$$

Minimizing $L_T$ via stochastic gradient descent refines both the policy and the value estimates, encouraging pathwise Bellman consistency while restraining deviation from the prior policy (Deleu et al., 1 Sep 2025).
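The residual and loss translate almost directly into code. The following is a minimal sketch under assumed interfaces; representing trajectories as `(s, s', r)` tuples and log-probabilities as callables is an illustrative choice, not an API from the papers:

```python
def trust_pcl_residual(traj, v0, alpha, log_pi, log_prior):
    """Delta_T = -V(s0) + sum_t r_t + alpha * sum_t log(prior/pi).

    traj: list of (s, s_next, r) transitions for one trajectory.
    log_pi, log_prior: callables mapping (s, s_next) -> log probability.
    """
    total_reward = sum(r for _, _, r in traj)
    kl_term = sum(log_prior(s, sp) - log_pi(s, sp) for s, sp, _ in traj)
    return -v0 + total_reward + alpha * kl_term

def trust_pcl_loss(deltas):
    """L_T = (1/2) * mean of squared residuals over sampled trajectories."""
    return 0.5 * sum(d * d for d in deltas) / len(deltas)

# Toy check: when pi matches the prior, the KL term vanishes.
traj = [("s0", "s1", 1.0), ("s1", "s2", 0.5)]
delta = trust_pcl_residual(traj, v0=0.5, alpha=2.0,
                           log_pi=lambda s, sp: -1.0,
                           log_prior=lambda s, sp: -1.0)
```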

3. Algorithmic Structure and Implementation

Trust-PCL leverages off-policy learning, allowing for improved sample efficiency. The training loop (omitting line-by-line pseudocode) proceeds as follows:

  • Initialize policy $\pi_\phi$, value function $V^\psi$, fixed prior $\pi_{\text{prior}}$, and replay buffer $B$.
  • At each iteration:
    • Collect transitions or trajectories using a behavior policy $\pi_b$ (e.g., an $\varepsilon$-greedy variant or a delayed copy of the current policy).
    • Add them to the buffer, then sample a minibatch.
    • For each sampled trajectory, compute:
      • $R = \sum_t r(s_t, s_{t+1})$
      • $\mathrm{KL} = \sum_t \log\bigl[\pi_{\text{prior}}(s_{t+1}\mid s_t)/\pi_\phi(s_{t+1}\mid s_t)\bigr]$
      • $\Delta = -V^\psi(s_0) + R + \alpha\cdot\mathrm{KL}$
    • Minimize $L = \frac{1}{2M}\sum_{j=1}^{M}\Delta_j^2$ using gradients with respect to both parameter sets.
  • Separate learning rates $\eta_\phi, \eta_\psi$ are typically used. Annealing schedules for $\alpha$ and optional PPO-style clipping can enhance stability. No discount factor $\gamma$ is required in finite horizons (equivalently, $\gamma = 1$). Importance sampling corrections may be used if $\pi_b \neq \pi_\phi$ (Deleu et al., 1 Sep 2025).

Key hyperparameters are summarized below:

| Hyperparameter | Description | Typical Range or Notes |
| --- | --- | --- |
| $\alpha$ | KL-regularization weight | Annealed or fixed |
| $\eta_\phi$ | Policy learning rate | Task-dependent |
| $\eta_\psi$ | Value learning rate | Task-dependent |
| $M$ | Batch size | Implementation-dependent |
| $T$ | Rollout length | Match task horizon |
| $\pi_b$ | Behavior policy | $\varepsilon$-greedy or delayed $\pi_\phi$ |
| Replay buffer | Off-policy storage | Size capped for recency |

4. Theoretical Properties and Path Consistency

Trust-PCL’s objective yields optimality guarantees rooted in the fixed-point contraction properties of the pathwise “soft” Bellman operator. The overall learning dynamics can be linked to the minimization of a contractive loss landscape, implying uniqueness and stability of solutions under standard boundedness and regularity assumptions on rewards and policy spaces.

By directly minimizing squared consistency errors on arbitrary-length or full-trajectory segments, Trust-PCL can exploit off-policy data without importance reweighting. Additionally, adjusting $\lambda$ (or $\alpha$) to enforce trajectory-level KL constraints yields trust-region properties analogous to those of TRPO, but operationalized via purely gradient-based off-policy updates (Nachum et al., 2017).

5. Equivalence to Relative Trajectory Balance (RTB)

Recent theoretical work has established that Trust-PCL is equivalent, up to scaling and reparameterization, to the Relative Trajectory Balance (RTB) objective introduced in the context of GFlowNets (Deleu et al., 1 Sep 2025). RTB evaluates the squared error of a residual defined as

$$\Delta_{\mathrm{RTB}}(\tau;\phi,\psi) = \log\Bigl[\frac{\pi_{\text{prior}}(\tau)}{Z_\psi\,\pi_\phi(\tau)}\Bigr] - \frac{E(s_T)}{\alpha}$$

with the mapping:

  • $\pi_\phi(s'\mid s) \equiv P_\phi(s'\mid s)$
  • $V^\psi(s_0) = \alpha \log Z_\psi$
  • Only the terminal transition carries reward, equal to $-E(s_T)$

Under this identification, $\Delta_T = \alpha\,\Delta_{\mathrm{RTB}}$ and $L_T = \alpha^2 L_{\mathrm{RTB}}$. Thus, RTB and Trust-PCL instantiate the same learning dynamics in different guises. This equivalence positions both within the broader theory of KL-regularized RL, unifying techniques across RL and generative modeling (Deleu et al., 1 Sep 2025).
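The identity $\Delta_T = \alpha\,\Delta_{\mathrm{RTB}}$ is easy to check numerically on a single trajectory; the numbers below are arbitrary illustrative values, not taken from either paper:

```python
import math

alpha = 2.0
E_term = 1.5                       # terminal energy E(s_T)
log_prior_traj = math.log(0.2)     # log pi_prior(tau)
log_pi_traj = math.log(0.1)        # log pi_phi(tau)
log_Z = 0.7                        # log Z_psi

# Trust-PCL residual with V(s0) = alpha*log Z and terminal-only reward -E(s_T):
delta_T = (-alpha * log_Z) + (-E_term) \
          + alpha * (log_prior_traj - log_pi_traj)

# RTB residual: log[prior / (Z * pi)] - E(s_T)/alpha
delta_RTB = (log_prior_traj - log_Z - log_pi_traj) - E_term / alpha
```

Expanding both expressions shows they agree term by term: $\alpha\,\Delta_{\mathrm{RTB}} = \alpha\log\pi_{\text{prior}}(\tau) - \alpha\log Z_\psi - \alpha\log\pi_\phi(\tau) - E(s_T) = \Delta_T$.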

6. Empirical Behavior and Illustrative Examples

In a canonical 2D Gaussian mixture example, the prior is a uniform mixture of 25 Gaussians, while the target distribution applies exponential tilting by an energy function. RTB was reported to recover all 25 modes, outperforming basic KL-regularized REINFORCE. With proper reward design ($r = -E(s_T)$) and off-policy sampling, however, Trust-PCL (and off-policy REINFORCE with a KL penalty and self-normalized importance weights) also recovers all modes as accurately as RTB. This demonstrates that the previously reported performance differences trace to reward specification and the off-policy learning regime, not to a fundamental algorithmic distinction (Deleu et al., 1 Sep 2025).

In continuous control benchmarks, Trust-PCL matches the ultimate performance of TRPO with $10\times$ to $100\times$ fewer environment steps, and the algorithm remains stable even in highly off-policy settings, provided the KL penalty is appropriately tuned. Removing the KL regularizer ($\lambda = 0$) induces instability, confirming its centrality. Trust-PCL's learning curves lead those of TRPO on standard benchmarks, reflecting improved data efficiency (Nachum et al., 2017).

7. Connections, Extensions, and Significance

Trust-PCL occupies a central position among KL-regularized RL algorithms. Its pathwise squared loss is applicable in both finite- and infinite-horizon settings, and its off-policy structure enables significant sample reuse and practical training efficiency. The demonstrated equivalence to RTB unifies trajectory-balance objectives for GFlowNets with the theory of maximum-entropy and KL-regularized RL. Practical deployments frequently leverage the replay buffer structure and adaptive KL constraints of Trust-PCL. Its theoretical and empirical properties continue to inform advances in RL for both control and generative modeling domains (Deleu et al., 1 Sep 2025, Nachum et al., 2017).
