
Outcome-Driven Actor-Critic (ODAC)

Updated 29 January 2026
  • Outcome-Driven Actor-Critic (ODAC) is a reinforcement learning framework that unifies actor and critic updates via a joint decision-aware objective to maximize expected returns.
  • It introduces tailored surrogate losses based on convex mirror maps and Bregman divergences, ensuring coherent updates and improved policy performance.
  • Empirical benchmarks and theoretical guarantees show that ODAC achieves faster convergence and more robust performance compared to standard actor-critic methods.

Outcome-Driven Actor-Critic (ODAC) is a reinforcement learning (RL) framework that introduces a principled joint objective for actor-critic (AC) algorithms, ensuring that policy and critic updates are coherently aligned with the goal of maximizing expected return. Unlike standard AC approaches, where the critic is typically trained via a decoupled temporal-difference (TD) loss, ODAC couples the actor and critic through a decision-aware objective derived from a lower bound on the true return. The formulation is general: it supports arbitrary policy and value-function parameterizations and arbitrary convex mirror maps, and it provides theoretical guarantees of monotonic policy improvement under mild conditions. The approach subsumes commonly used surrogate actor losses and yields critic learning rules superior to the standard mean squared error (MSE), particularly under restricted critic capacity (Vaswani et al., 2023).

1. Joint Decision-Aware Objective Formulation

ODAC's central contribution is a joint functional lower bound on the expected return $J(\pi)$, evaluated at policy iteration $t$:

J(\pi) \;\ge\; J(\pi_t) \;+\; \big\langle \hat g_t,\; \pi - \pi_t \big\rangle \;-\; \Big(\tfrac{1}{\eta} + \tfrac{1}{c}\Big)\, D_\Phi(\pi, \pi_t) \;-\; \tfrac{1}{c}\, D_{\Phi^*}\!\Big(\nabla\Phi(\pi_t) - c\big[\nabla J(\pi_t) - \hat g_t\big],\; \nabla\Phi(\pi_t)\Big)

where

  • $\pi_t$ is the current policy,
  • $\hat g_t$ is a parameterized estimator of the policy gradient $\nabla_\pi J(\pi_t)$,
  • $\Phi$ is a strictly convex mirror map (e.g., negative entropy or log-sum-exp),
  • $D_\Phi$ is the corresponding Bregman divergence (and $D_{\Phi^*}$ that of the convex conjugate $\Phi^*$),
  • $c > 0$ controls the actor–critic trade-off,
  • $\eta > 0$ is a functional step-size.
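For reference, the Bregman divergence induced by a strictly convex mirror map $\Phi$ is

D_\Phi(x, y) = \Phi(x) - \Phi(y) - \big\langle \nabla\Phi(y),\; x - y \big\rangle

For the negative-entropy map $\Phi(\pi) = \sum_a \pi(a)\log\pi(a)$ this recovers the KL divergence $\mathrm{KL}(x\,\|\,y)$, and $D_{\Phi^*}$ denotes the analogous divergence of the convex conjugate $\Phi^*$.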

Actor and critic updates optimize two surrogates derived from this lower bound. The actor surrogate is given by

\ell_t(\theta) = \big\langle \hat g_t,\; \pi(\theta) - \pi_t \big\rangle - \Big(\tfrac{1}{\eta} + \tfrac{1}{c}\Big)\, D_\Phi(\pi(\theta), \pi_t)

Parameter $\theta$ indexes the policy, and maximizing $\ell_t$ subject to the $D_\Phi$ trust region provably tightens the lower bound on $J(\pi)$.

The critic is fitted via the loss

\mathcal L_t(\omega) = \tfrac{1}{c}\, D_{\Phi^*}\!\Big(\nabla\Phi(\pi_t) - c\big[\nabla_\pi J(\pi_t) - \hat g_t(\omega)\big],\; \nabla\Phi(\pi_t)\Big)

using a parametric critic $\omega \mapsto \hat g_t(\omega) \approx \nabla_\pi J(\pi_t)$. Minimizing this loss directly enforces fidelity to the actor's improvement metric.

ODAC supports closed-form instantiations for common mirror maps:

  • Negative entropy: yields weighted KL-regularized surrogates for the actor and critic.
  • Log-sum-exp: yields log-weighted advantage-based surrogate functions.
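As a quick numerical check of the negative-entropy instantiation (function names here are illustrative, not from the paper), the Bregman divergence of the negative-entropy mirror map over the simplex coincides with the KL divergence:

```python
import numpy as np

def neg_entropy(p):
    # Mirror map: Phi(p) = sum_a p_a log p_a
    return np.sum(p * np.log(p))

def grad_neg_entropy(p):
    return np.log(p) + 1.0

def bregman(phi, grad_phi, x, y):
    # D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

def kl(x, y):
    return np.sum(x * np.log(x / y))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(neg_entropy, grad_neg_entropy, x, y), kl(x, y))
```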

2. Algorithmic Structure and Implementation

The ODAC algorithm alternates actor and critic updates, maintaining joint optimization of their respective decision-aware surrogates. Below is the pseudocode as presented:

Input: mirror map Φ, policy π(θ), critic ω, step-sizes η, {α_a}, {α_c}, trade-off c, inner loops m_a, m_c, AC iterations T.
Initialize θ, ω; set π_0 ← π(θ).

for t = 0, …, T−1 do
  # Critic update
  estimate ∇_π J(π_t) via MC or TD → ḡ_t
  form L_t(ω) = (1/c)·D_{Φ*}(∇Φ(π_t) − c[ḡ_t − ĝ(ω)], ∇Φ(π_t))
  repeat m_c times:
    ω ← ω − α_c·∇_ω L_t(ω)
  end
  set ĝ_t ← ĝ(ω)

  # Actor update
  form ℓ_t(θ) = ⟨ĝ_t, π(θ) − π_t⟩ − (1/η + 1/c)·D_Φ(π(θ), π_t)
  repeat m_a times:
    θ ← θ + α_a·∇_θ ℓ_t(θ)
  end
  π_{t+1} ← π(θ)
end

return π_T
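To make the loop concrete, here is a minimal runnable sketch on a two-armed bandit with a softmax policy and the negative-entropy mirror map, so that $D_\Phi = \mathrm{KL}$ and the critic loss takes the closed form $\frac{1}{c}\log\sum_a \pi_t(a)\,e^{-c\delta_a} + \langle \pi_t, \delta\rangle$ with $\delta = \bar g_t - \hat g(\omega)$. All constants and the tabular critic are illustrative, and the exact bandit gradient $\nabla_\pi J = r$ stands in for an MC estimate:

```python
import numpy as np

# Toy two-armed bandit: J(pi) = <pi, r>, so grad_pi J = r exactly.
r = np.array([1.0, 0.5])           # expected per-arm rewards (arm 0 optimal)
c, eta = 0.1, 0.1                  # actor-critic trade-off, functional step-size
alpha_a, alpha_c = 0.1, 2.0        # actor / critic step-sizes (illustrative)
m_a, m_c, T = 20, 20, 50           # inner loops and AC iterations

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)                # softmax policy parameters
omega = np.zeros(2)                # tabular critic: g_hat(omega) = omega

for t in range(T):
    pi_t = softmax(theta)
    # Critic: minimize (1/c)*log sum_a pi_t(a)*exp(-c*delta_a) + <pi_t, delta>,
    # the negative-entropy form of the D_{Phi*} loss; delta = r - g_hat(omega).
    for _ in range(m_c):
        delta = r - omega
        w = pi_t * np.exp(-c * delta)
        grad_L = w / w.sum() - pi_t          # gradient of the loss w.r.t. omega
        omega -= alpha_c * grad_L
    g_hat = omega
    # Actor: maximize <g_hat, pi(theta) - pi_t> - (1/eta + 1/c)*KL(pi(theta)||pi_t).
    tau = 1.0 / eta + 1.0 / c
    for _ in range(m_a):
        pi = softmax(theta)
        dsurr_dpi = g_hat - tau * (np.log(pi / pi_t) + 1.0)
        jac = np.diag(pi) - np.outer(pi, pi)  # d pi / d theta for softmax
        theta += alpha_a * (jac @ dsurr_dpi)

pi_final = softmax(theta)
print(pi_final)    # probability mass shifts toward the optimal arm 0
```

Note that the negative-entropy critic loss is invariant to constant shifts of $\hat g$, so the critic recovers per-arm value differences rather than absolute values; only those differences matter for the actor update.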

Key hyperparameters include:

  • $c$: actor–critic coupling; values in $[10^{-3}, 10^{-1}]$, found via grid search, typically perform best.
  • $\eta$: functional step-size, independent of the parameterization, selected from $\{10^{-3}, \ldots, 1\}$.
  • $\alpha_a$, $\alpha_c$: dynamic step-sizes, set via Armijo line-search.
  • $m_a$, $m_c$: numbers of actor and critic gradient steps, balancing computational budget against bias/variance.
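The Armijo line-search used for $\alpha_a$ and $\alpha_c$ can be sketched as standard backtracking; the constants below are conventional defaults, not values from the paper:

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, beta=0.5, sigma=1e-4):
    """One gradient-descent step with backtracking (Armijo) line-search:
    shrink alpha until f decreases by at least sigma * alpha * ||g||^2."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - sigma * alpha * (g @ g):
        alpha *= beta
        if alpha < 1e-12:          # safeguard against infinite backtracking
            break
    return x - alpha * g

# Usage: minimize f(x) = ||x||^2 from x0 = [3, -4]
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([3.0, -4.0])
for _ in range(25):
    x = armijo_step(f, grad_f, x)
```

For ascent on the actor surrogate the same routine applies to $-\ell_t$.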

3. Theoretical Guarantees

ODAC is equipped with explicit monotonic improvement criteria (Prop. 4). Defining $b_t = \nabla_\theta \pi(\theta_t)^\top \hat g_t$ and $\tilde H_t = \nabla_\theta \pi(\theta_t)^\top \nabla_\pi^2 \Phi(\pi_t)\, \nabla_\theta \pi(\theta_t)$, guaranteed improvement is obtained if:

\big\langle b_t,\; \tilde H_t^{\dagger}\, b_t \big\rangle \;>\; \big\langle \nabla_\pi J(\pi_t) - \hat g_t,\; \big[\nabla_\pi^2 \Phi(\pi_t)\big]^{-1} \big(\nabla_\pi J(\pi_t) - \hat g_t\big) \big\rangle

In tabular settings with Euclidean or KL geometry, this reduces to a “relative error < 1” condition on the critic approximation.
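The condition can be checked numerically; the following sketch (all inputs hypothetical) compares the two quadratic forms:

```python
import numpy as np

def improvement_holds(jac_pi, hess_phi, grad_J, g_hat):
    """Evaluate the Prop-4-style condition <b_t, H~_t^+ b_t> > <e_t, H^{-1} e_t>,
    with b_t = Jac^T g_hat, H~_t = Jac^T (grad^2 Phi) Jac, e_t = grad_J - g_hat.
    Assumes hess_phi is symmetric positive definite."""
    b = jac_pi.T @ g_hat
    H_tilde = jac_pi.T @ hess_phi @ jac_pi
    e = grad_J - g_hat
    lhs = b @ np.linalg.pinv(H_tilde) @ b
    rhs = e @ np.linalg.solve(hess_phi, e)
    return lhs > rhs

# Tabular policy (Jac = I), Euclidean geometry (hess = I): the condition
# reduces to ||g_hat||^2 > ||grad_J - g_hat||^2, i.e. relative error < 1.
I = np.eye(2)
grad_J = np.array([1.0, 1.0])
print(improvement_holds(I, I, grad_J, np.array([0.9, 1.1])))   # True
print(improvement_holds(I, I, grad_J, np.array([0.01, 0.0])))  # False
```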

ODAC further ensures convergence to a neighborhood of a stationary point (Prop. 5):

\min_{0 \le t < T} \big\| \nabla_\pi J(\pi_t) \big\| \;=\; O\Big(\tfrac{1}{\sqrt{T}}\Big) \;+\; O\Big(\tfrac{1}{T} \textstyle\sum_t D_{\Phi^*}\big(\nabla\Phi(\pi_t) - c\,\delta_t,\; \nabla\Phi(\pi_t)\big)\Big) \;+\; O(e_t)

where $\delta_t = \nabla_\pi J(\pi_t) - \hat g_t$ is the critic error, $e_t$ collects critic and projection errors, and the bound holds under convexity of $J + \frac{1}{\eta}\Phi$ and $L$-smoothness assumptions.

4. Comparison with Trust-Region and Proximal Policy Optimization Methods

ODAC’s actor surrogate structurally interpolates between standard actor-critic updates and trust-region-style updates such as TRPO and PPO.

  • TRPO maximizes the surrogate objective

\max_\theta\; \mathbb E_{s,a \sim \pi_{\text{old}}}\Big[\, r \log \tfrac{\pi_\theta}{\pi_{\text{old}}} \,\Big]

subject to KL trust-region constraints.

  • PPO uses a clipped surrogate:

\max_\theta\; \mathbb E\big[ \min\big( r(\theta)\,A,\; r_{\text{clip}}(\theta)\,A \big) \big]

where $r(\theta) = \pi_\theta / \pi_{\text{old}}$ is the probability ratio and $A$ the advantage estimate.

  • ODAC actor surrogate takes the form:

\ell_t(\theta) = \mathbb E\big[\, w(s,a)\,\big(\pi_\theta(a|s) - \pi_t(a|s)\big) \big] - \tau\, D_\Phi(\pi_\theta, \pi_t)

with $\tau = \frac{1}{\eta} + \frac{1}{c}$ and $w(s,a)$ an advantage-derived weight.

For the weighted negative-entropy mirror map, this reduces to $\mathbb E\big[ \hat A^{\pi_t}(s,a) \log \frac{\pi_\theta}{\pi_t} \big] - \big( \frac{1}{\eta} + \frac{1}{c} \big)\, \mathrm{KL}(\pi_\theta \,\|\, \pi_t)$, paralleling TRPO but with a $1/c$ coefficient that adaptively couples the regularization strength to critic error.
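The structural parallel can be seen directly in code; in the sketch below (names and values illustrative), the ODAC and TRPO-style penalized surrogates differ only in how the KL coefficient is chosen:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def kl_regularized_surrogate(adv, pi_theta, pi_t, coef):
    # E_{a ~ pi_t}[ A(a) log(pi_theta(a)/pi_t(a)) ] - coef * KL(pi_theta || pi_t)
    return np.sum(pi_t * adv * np.log(pi_theta / pi_t)) - coef * kl(pi_theta, pi_t)

adv = np.array([0.5, -0.5])         # advantage estimates (illustrative)
pi_t = np.array([0.5, 0.5])
pi_theta = np.array([0.6, 0.4])

eta, c = 0.1, 0.1
# ODAC: coefficient 1/eta + 1/c, with the 1/c term tied to critic trust.
odac_val = kl_regularized_surrogate(adv, pi_theta, pi_t, 1/eta + 1/c)
# TRPO-style: a fixed penalty coefficient chosen independently of the critic.
trpo_val = kl_regularized_surrogate(adv, pi_theta, pi_t, 10.0)
```

With the larger ODAC coefficient, the same candidate policy is penalized more heavily, reflecting lower trust in the critic for small $c$.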

A plausible implication is that ODAC retains the stability of trust-region methods but offers adaptive error-sensitivity through the critic coupling.

5. Empirical Evaluation and Benchmarks

ODAC’s efficacy was benchmarked on:

  • Two-armed bandits (deterministic Bernoulli payoff): demonstrated that standard MSE critics may drive policies to sub-optimal arms, while decision-aware critics reliably select optimal actions.
  • Grid-world RL tasks, including:
    • Cliff World (Sutton & Barto)
    • Frozen Lake (OpenAI Gym)
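The bandit failure mode can be reproduced in a few lines: with a capacity-restricted one-parameter critic (features below are illustrative), a plain least-squares MSE fit can reverse the arm ranking even though the true values are well separated:

```python
import numpy as np

# Capacity-restricted critic: g_hat(w) = w * phi, one parameter, two arms.
phi = np.array([1.0, 2.0])   # illustrative per-arm features
r = np.array([1.0, 0.9])     # true per-arm values: arm 0 is optimal

# Least-squares MSE fit: w* = <phi, r> / <phi, phi>
w = (phi @ r) / (phi @ phi)
g_hat = w * phi
print(g_hat)  # approx [0.56, 1.12]: ranking reversed although r[0] > r[1]
```

A decision-aware critic loss instead weights errors by their effect on the actor's update, which is how ODAC avoids this misranking under restricted capacity.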

Baselines:

  • Standard MSE critic (TD loss)
  • Adv-MSE critic (squared advantage loss)

Metrics:

  • Average return $J(\pi)$ over AC iterations.
  • 95% confidence intervals over 5 random seeds.
  • Critic expressivity varied via feature dimension ($d = 40, 60, 80$).

Key findings:

| Critic Expressivity | MSE Critic | Adv-MSE Critic | Decision-Aware (DA) Critic |
|---|---|---|---|
| High ($d \ge 80$) | Optimal return | Optimal return | Optimal return |
| Moderate ($d = 40, 60$) | Sub-optimal, non-monotonic | Nearly always converges | Typically converges faster, higher return |

Empirical performance held under both tabular and linear policy parameterizations, for both exact and MC-estimated critic targets. This suggests ODAC’s advantages are robust to policy/critic model class and stochasticity in return estimation.

6. Practical Considerations for Implementation

Key practitioner tips:

  • Tuning $c$: moderate values in $[10^{-3}, 10^{-1}]$ balance actor and critic; lower $c$ suppresses unreliable critic contributions.
  • Step-sizes ($\alpha_a$, $\alpha_c$): use Armijo line-search for both actor and critic losses to adapt to local curvature.
  • Functional step-size ($\eta$): decoupled from model parameters; select from $\{10^{-3}, \ldots, 1\}$.
  • Inner iterations ($m_a$, $m_c$): increase $m_c$ when critic evaluation is cheap to reduce bias; reduce $m_a$ when policy evaluation is costly.
  • Critic error: a large critic error $\|\nabla_\pi J - \hat g\|$ invalidates the improvement conditions; remedy by collecting additional data or enriching the critic representation.

A plausible implication is that ODAC's design systematically eliminates the actor–critic mismatch present in standard AC, guarantees monotonic improvement under mild oracle error conditions, and demonstrates superior practical performance, especially for resource-constrained critics (Vaswani et al., 2023).
