
Outcome-Driven Actor-Critic (ODAC)

Updated 29 January 2026
  • Outcome-Driven Actor-Critic (ODAC) is a reinforcement learning framework that unifies actor and critic updates via a joint decision-aware objective to maximize expected returns.
  • It introduces tailored surrogate losses based on convex mirror maps and Bregman divergences, ensuring coherent updates and improved policy performance.
  • Empirical benchmarks and theoretical guarantees show that ODAC achieves faster convergence and more robust performance compared to standard actor-critic methods.

Outcome-Driven Actor-Critic (ODAC) is a reinforcement learning (RL) framework that introduces a principled joint objective for actor-critic (AC) algorithms, ensuring that policy and critic updates are coherently aligned with the goal of maximizing expected return. Unlike standard AC approaches, where the critic is typically trained via a decoupled temporal-difference (TD) loss, ODAC couples the actor and critic through a decision-aware objective derived from a lower bound on the true return. The formulation is general: it supports arbitrary policy and value-function parameterizations and arbitrary convex mirror maps, and it provides theoretical guarantees of monotonic policy improvement under mild conditions. The approach subsumes commonly used surrogate actor losses and yields critic learning rules superior to the standard mean squared error (MSE), particularly under restricted critic capacity (Vaswani et al., 2023).

1. Joint Decision-Aware Objective Formulation

ODAC's central contribution is a joint functional lower bound on the expected return $J(\pi)$, evaluated at policy iteration $t$:

J(\pi) \;\ge\; J(\pi_t) \;+\; \big\langle \hat g_t,\; \pi - \pi_t \big\rangle \;-\; \Big(\tfrac{1}{\eta} + \tfrac{1}{c}\Big)\, D_\Phi(\pi, \pi_t) \;-\; \tfrac{1}{c}\, D_{\Phi^*}\!\Big(\nabla\Phi(\pi_t) - c\big[\nabla J(\pi_t) - \hat g_t\big],\; \nabla\Phi(\pi_t)\Big)

where

  • $\pi_t$ is the current policy,
  • $\hat g_t$ is a parameterized estimator of the policy gradient $\nabla_\pi J(\pi_t)$,
  • $\Phi$ is a strictly convex mirror map (e.g., negative entropy or log-sum-exp),
  • $D_\Phi$ is the corresponding Bregman divergence (and $D_{\Phi^*}$ that of the convex conjugate $\Phi^*$),
  • $c > 0$ controls the actor–critic trade-off,
  • $\eta > 0$ is a functional step-size.
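For reference, the Bregman divergence induced by a strictly convex mirror map $\Phi$ is

D_\Phi(x, y) = \Phi(x) - \Phi(y) - \big\langle \nabla\Phi(y),\; x - y \big\rangle

For the negative-entropy map $\Phi(\pi) = \sum_a \pi(a)\log\pi(a)$ this recovers the KL divergence $\mathrm{KL}(x\,\|\,y)$, and $D_{\Phi^*}$ denotes the analogous divergence of the convex conjugate $\Phi^*$.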

Actor and critic updates optimize two surrogates derived from this lower bound. The actor surrogate is given by

\ell_t(\theta) = \big\langle \hat g_t,\; \pi(\theta) - \pi_t \big\rangle - \Big(\tfrac{1}{\eta} + \tfrac{1}{c}\Big)\, D_\Phi(\pi(\theta), \pi_t)

Parameter $\theta$ indexes the policy, and maximizing $\ell_t$ subject to the $D_\Phi$ trust region provably tightens the lower bound on $J(\pi)$.

The critic is fitted via the loss

\mathcal L_t(\omega) = \tfrac{1}{c}\, D_{\Phi^*}\!\Big(\nabla\Phi(\pi_t) - c\big[\nabla_\pi J(\pi_t) - \hat g_t(\omega)\big],\; \nabla\Phi(\pi_t)\Big)

using a parametric critic $\omega \mapsto \hat g_t(\omega) \approx \nabla_\pi J(\pi_t)$. Minimizing this loss directly enforces fidelity to the actor's improvement metric.

ODAC supports closed-form instantiations for common mirror maps:

  • Negative entropy: yields weighted KL-regularized surrogates for the actor and critic.
  • Log-sum-exp: yields log-weighted advantage-based surrogate functions.
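As a quick numerical check of the negative-entropy instantiation (function names here are illustrative, not from the paper), the Bregman divergence of the negative-entropy mirror map over the simplex coincides with the KL divergence:

```python
import numpy as np

def neg_entropy(p):
    # Mirror map: Phi(p) = sum_a p_a log p_a
    return np.sum(p * np.log(p))

def grad_neg_entropy(p):
    return np.log(p) + 1.0

def bregman(phi, grad_phi, x, y):
    # D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

def kl(x, y):
    return np.sum(x * np.log(x / y))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(neg_entropy, grad_neg_entropy, x, y), kl(x, y))
```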

2. Algorithmic Structure and Implementation

The ODAC algorithm alternates actor and critic updates, maintaining joint optimization of their respective decision-aware surrogates. Below is the pseudocode as presented:

Input: mirror map Φ, policy π(θ), critic ω, step-sizes η, {α_a}, {α_c}, trade-off c, inner loops m_a, m_c, AC iterations T.
Initialize θ, ω; set π_0 ← π(θ).

for t = 0, …, T−1 do
  # Critic update
  estimate ∇_π J(π_t) via MC or TD → ḡ_t
  form L_t(ω) = (1/c)·D_{Φ*}(∇Φ(π_t) − c[ḡ_t − ĝ(ω)], ∇Φ(π_t))
  repeat m_c times:
    ω ← ω − α_c·∇_ω L_t(ω)
  end
  set ĝ_t ← ĝ(ω)

  # Actor update
  form ℓ_t(θ) = ⟨ĝ_t, π(θ) − π_t⟩ − (1/η + 1/c)·D_Φ(π(θ), π_t)
  repeat m_a times:
    θ ← θ + α_a·∇_θ ℓ_t(θ)
  end
  π_{t+1} ← π(θ)
end

return π_T
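To make the loop concrete, here is a minimal runnable sketch on a two-armed bandit with a softmax policy and the negative-entropy mirror map, so that $D_\Phi = \mathrm{KL}$ and the critic loss takes the closed form $\frac{1}{c}\log\sum_a \pi_t(a)\,e^{-c\delta_a} + \langle \pi_t, \delta\rangle$ with $\delta = \bar g_t - \hat g(\omega)$. All constants and the tabular critic are illustrative, and the exact bandit gradient $\nabla_\pi J = r$ stands in for an MC estimate:

```python
import numpy as np

# Toy two-armed bandit: J(pi) = <pi, r>, so grad_pi J = r exactly.
r = np.array([1.0, 0.5])           # expected per-arm rewards (arm 0 optimal)
c, eta = 0.1, 0.1                  # actor-critic trade-off, functional step-size
alpha_a, alpha_c = 0.1, 2.0        # actor / critic step-sizes (illustrative)
m_a, m_c, T = 20, 20, 50           # inner loops and AC iterations

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)                # softmax policy parameters
omega = np.zeros(2)                # tabular critic: g_hat(omega) = omega

for t in range(T):
    pi_t = softmax(theta)
    # Critic: minimize (1/c)*log sum_a pi_t(a)*exp(-c*delta_a) + <pi_t, delta>,
    # the negative-entropy form of the D_{Phi*} loss; delta = r - g_hat(omega).
    for _ in range(m_c):
        delta = r - omega
        w = pi_t * np.exp(-c * delta)
        grad_L = w / w.sum() - pi_t          # gradient of the loss w.r.t. omega
        omega -= alpha_c * grad_L
    g_hat = omega
    # Actor: maximize <g_hat, pi(theta) - pi_t> - (1/eta + 1/c)*KL(pi(theta)||pi_t).
    tau = 1.0 / eta + 1.0 / c
    for _ in range(m_a):
        pi = softmax(theta)
        dsurr_dpi = g_hat - tau * (np.log(pi / pi_t) + 1.0)
        jac = np.diag(pi) - np.outer(pi, pi)  # d pi / d theta for softmax
        theta += alpha_a * (jac @ dsurr_dpi)

pi_final = softmax(theta)
print(pi_final)    # probability mass shifts toward the optimal arm 0
```

Note that the negative-entropy critic loss is invariant to constant shifts of $\hat g$, so the critic recovers per-arm value differences rather than absolute values; only those differences matter for the actor update.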

Key hyperparameters include:

  • $c$: actor–critic coupling; values in $[10^{-3}, 10^{-1}]$, found via grid search, typically perform best.
  • $\eta$: functional step-size, independent of the parameterization, selected from $\{10^{-3}, \ldots, 1\}$.
  • $\alpha_a$, $\alpha_c$: dynamic step-sizes, set via Armijo line-search.
  • $m_a$, $m_c$: numbers of actor and critic gradient steps, balancing computational budget against bias/variance.
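The Armijo line-search used for $\alpha_a$ and $\alpha_c$ can be sketched as standard backtracking; the constants below are conventional defaults, not values from the paper:

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, beta=0.5, sigma=1e-4):
    """One gradient-descent step with backtracking (Armijo) line-search:
    shrink alpha until f decreases by at least sigma * alpha * ||g||^2."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - sigma * alpha * (g @ g):
        alpha *= beta
        if alpha < 1e-12:          # safeguard against infinite backtracking
            break
    return x - alpha * g

# Usage: minimize f(x) = ||x||^2 from x0 = [3, -4]
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([3.0, -4.0])
for _ in range(25):
    x = armijo_step(f, grad_f, x)
```

For ascent on the actor surrogate the same routine applies to $-\ell_t$.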

3. Theoretical Guarantees

ODAC is equipped with explicit monotonic improvement criteria (Prop. 4). Defining $b_t = \nabla_\theta \pi(\theta_t)^\top \hat g_t$ and $\tilde H_t = \nabla_\theta \pi(\theta_t)^\top \nabla_\pi^2 \Phi(\pi_t)\, \nabla_\theta \pi(\theta_t)$, guaranteed improvement is obtained if:

\big\langle b_t,\; \tilde H_t^{\dagger}\, b_t \big\rangle \;>\; \big\langle \nabla_\pi J(\pi_t) - \hat g_t,\; \big[\nabla_\pi^2 \Phi(\pi_t)\big]^{-1} \big(\nabla_\pi J(\pi_t) - \hat g_t\big) \big\rangle

In tabular settings with Euclidean or KL geometry, this reduces to a “relative error < 1” condition on the critic approximation.
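The condition can be checked numerically; the following sketch (all inputs hypothetical) compares the two quadratic forms:

```python
import numpy as np

def improvement_holds(jac_pi, hess_phi, grad_J, g_hat):
    """Evaluate the Prop-4-style condition <b_t, H~_t^+ b_t> > <e_t, H^{-1} e_t>,
    with b_t = Jac^T g_hat, H~_t = Jac^T (grad^2 Phi) Jac, e_t = grad_J - g_hat.
    Assumes hess_phi is symmetric positive definite."""
    b = jac_pi.T @ g_hat
    H_tilde = jac_pi.T @ hess_phi @ jac_pi
    e = grad_J - g_hat
    lhs = b @ np.linalg.pinv(H_tilde) @ b
    rhs = e @ np.linalg.solve(hess_phi, e)
    return lhs > rhs

# Tabular policy (Jac = I), Euclidean geometry (hess = I): the condition
# reduces to ||g_hat||^2 > ||grad_J - g_hat||^2, i.e. relative error < 1.
I = np.eye(2)
grad_J = np.array([1.0, 1.0])
print(improvement_holds(I, I, grad_J, np.array([0.9, 1.1])))   # True
print(improvement_holds(I, I, grad_J, np.array([0.01, 0.0])))  # False
```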

ODAC further ensures convergence to a neighborhood of a stationary point (Prop. 5):

\min_{0 \le t < T} \big\| \nabla_\pi J(\pi_t) \big\| \;=\; O\Big(\tfrac{1}{\sqrt{T}}\Big) \;+\; O\Big(\tfrac{1}{T} \textstyle\sum_t D_{\Phi^*}\big(\nabla\Phi(\pi_t) - c\,\delta_t,\; \nabla\Phi(\pi_t)\big)\Big) \;+\; O(e_t)

where $\delta_t = \nabla_\pi J(\pi_t) - \hat g_t$ is the critic error, $e_t$ collects critic and projection errors, and the bound holds under convexity of $J + \frac{1}{\eta}\Phi$ and $L$-smoothness assumptions.

4. Comparison with Trust-Region and Proximal Policy Optimization Methods

ODAC’s actor surrogate structurally interpolates between standard actor-critic updates and trust-region-style updates such as TRPO and PPO.

  • TRPO maximizes the surrogate objective

\max_\theta\; \mathbb E_{s,a \sim \pi_{\text{old}}}\Big[\, r \log \tfrac{\pi_\theta}{\pi_{\text{old}}} \,\Big]

subject to KL trust-region constraints.

  • PPO uses a clipped surrogate:

\max_\theta\; \mathbb E\big[ \min\big( r(\theta)\,A,\; r_{\text{clip}}(\theta)\,A \big) \big]

where $r(\theta) = \pi_\theta / \pi_{\text{old}}$ is the probability ratio and $A$ the advantage estimate.

  • ODAC actor surrogate takes the form:

\ell_t(\theta) = \mathbb E\big[\, w(s,a)\,\big(\pi_\theta(a|s) - \pi_t(a|s)\big) \big] - \tau\, D_\Phi(\pi_\theta, \pi_t)

with $\tau = \frac{1}{\eta} + \frac{1}{c}$ and $w(s,a)$ an advantage-derived weight.

For the weighted negative-entropy mirror map, this reduces to $\mathbb E\big[ \hat A^{\pi_t}(s,a) \log \frac{\pi_\theta}{\pi_t} \big] - \big( \frac{1}{\eta} + \frac{1}{c} \big)\, \mathrm{KL}(\pi_\theta \,\|\, \pi_t)$, paralleling TRPO but with a $1/c$ coefficient that adaptively couples the regularization strength to critic error.
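The structural parallel can be seen directly in code; in the sketch below (names and values illustrative), the ODAC and TRPO-style penalized surrogates differ only in how the KL coefficient is chosen:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def kl_regularized_surrogate(adv, pi_theta, pi_t, coef):
    # E_{a ~ pi_t}[ A(a) log(pi_theta(a)/pi_t(a)) ] - coef * KL(pi_theta || pi_t)
    return np.sum(pi_t * adv * np.log(pi_theta / pi_t)) - coef * kl(pi_theta, pi_t)

adv = np.array([0.5, -0.5])         # advantage estimates (illustrative)
pi_t = np.array([0.5, 0.5])
pi_theta = np.array([0.6, 0.4])

eta, c = 0.1, 0.1
# ODAC: coefficient 1/eta + 1/c, with the 1/c term tied to critic trust.
odac_val = kl_regularized_surrogate(adv, pi_theta, pi_t, 1/eta + 1/c)
# TRPO-style: a fixed penalty coefficient chosen independently of the critic.
trpo_val = kl_regularized_surrogate(adv, pi_theta, pi_t, 10.0)
```

With the larger ODAC coefficient, the same candidate policy is penalized more heavily, reflecting lower trust in the critic for small $c$.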

A plausible implication is that ODAC retains the stability of trust-region methods but offers adaptive error-sensitivity through the critic coupling.

5. Empirical Evaluation and Benchmarks

ODAC’s efficacy was benchmarked on:

  • Two-armed bandits (deterministic Bernoulli payoff): demonstrated that standard MSE critics may drive policies to sub-optimal arms, while decision-aware critics reliably select optimal actions.
  • Grid-world RL tasks, including:
    • Cliff World (Sutton & Barto)
    • Frozen Lake (OpenAI Gym)
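The bandit failure mode can be reproduced in a few lines: with a capacity-restricted one-parameter critic (features below are illustrative), a plain least-squares MSE fit can reverse the arm ranking even though the true values are well separated:

```python
import numpy as np

# Capacity-restricted critic: g_hat(w) = w * phi, one parameter, two arms.
phi = np.array([1.0, 2.0])   # illustrative per-arm features
r = np.array([1.0, 0.9])     # true per-arm values: arm 0 is optimal

# Least-squares MSE fit: w* = <phi, r> / <phi, phi>
w = (phi @ r) / (phi @ phi)
g_hat = w * phi
print(g_hat)  # approx [0.56, 1.12]: ranking reversed although r[0] > r[1]
```

A decision-aware critic loss instead weights errors by their effect on the actor's update, which is how ODAC avoids this misranking under restricted capacity.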

Baselines:

  • Standard MSE critic (TD loss)
  • Adv-MSE critic (squared advantage loss)

Metrics:

  • Average return $J(\pi)$ over AC iterations.
  • 95% confidence intervals over 5 random seeds.
  • Critic expressivity varied via feature dimension ($d = 40, 60, 80$).

Key findings:

| Critic Expressivity | MSE Critic | Adv-MSE Critic | Decision-Aware (DA) Critic |
|---|---|---|---|
| High ($d \ge 80$) | Optimal return | Optimal return | Optimal return |
| Moderate ($d = 40, 60$) | Sub-optimal, non-monotonic | Nearly always converges | Typically converges faster, higher return |

Empirical performance held under both tabular and linear policy parameterizations, for both exact and MC-estimated critic targets. This suggests ODAC’s advantages are robust to policy/critic model class and stochasticity in return estimation.

6. Practical Considerations for Implementation

Key practitioner tips:

  • Tuning $c$: moderate values in $[10^{-3}, 10^{-1}]$ balance actor and critic; lower $c$ suppresses unreliable critic contributions.
  • Step-sizes ($\alpha_a$, $\alpha_c$): use Armijo line-search for both actor and critic losses to adapt to local curvature.
  • Functional step-size ($\eta$): decoupled from model parameters; select from $\{10^{-3}, \ldots, 1\}$.
  • Inner iterations ($m_a$, $m_c$): increase $m_c$ when critic evaluation is cheap to reduce bias; reduce $m_a$ when policy evaluation is costly.
  • Critic error: a large critic error $\|\nabla_\pi J - \hat g\|$ invalidates the improvement conditions; remedy by collecting additional data or enriching the critic representation.

A plausible implication is that ODAC's design systematically eliminates the actor–critic mismatch present in standard AC, guarantees monotonic improvement under mild oracle error conditions, and demonstrates superior practical performance, especially for resource-constrained critics (Vaswani et al., 2023).
