Outcome-Driven Actor-Critic (ODAC)
- Outcome-Driven Actor-Critic (ODAC) is a reinforcement learning framework that unifies actor and critic updates via a joint decision-aware objective to maximize expected returns.
- It introduces tailored surrogate losses based on convex mirror maps and Bregman divergences, ensuring coherent updates and improved policy performance.
- Empirical benchmarks and theoretical analysis show that ODAC achieves faster convergence and more robust performance than standard actor-critic methods.
Outcome-Driven Actor-Critic (ODAC) is a reinforcement learning (RL) framework that introduces a principled joint objective for actor-critic (AC) algorithms, ensuring that policy and critic updates are coherently aligned with the goal of maximizing expected return. Unlike standard AC approaches, where the critic is typically trained via a temporal-difference (TD) loss decoupled from the actor's objective, ODAC couples the actor and critic through a decision-aware objective derived from a lower bound on the true return. The formulation is general: it supports arbitrary policy and value function parameterizations and convex mirror maps, and it provides theoretical guarantees of monotonic policy improvement under mild conditions. The approach subsumes commonly used surrogate actor losses and yields critic learning rules superior to the standard mean squared error (MSE), particularly under restricted critic capacity (Vaswani et al., 2023).
1. Joint Decision-Aware Objective Formulation
ODAC's central contribution is a joint functional lower bound on the expected return $J(\pi)$, anchored at the policy $\pi_t$ of iteration $t$:

$$J(\pi) \;\ge\; J(\pi_t) + \langle \hat g(\omega),\, \pi - \pi_t \rangle - \left(\tfrac{1}{\eta} + \tfrac{1}{c}\right) D_\Phi(\pi, \pi_t) - \tfrac{1}{c}\, D_{\Phi^*}\!\left( \nabla\Phi(\pi_t) - c\,[\nabla_\pi J(\pi_t) - \hat g(\omega)],\; \nabla\Phi(\pi_t) \right),$$

where
- $\pi_t$ is the current policy,
- $\hat g(\omega)$ is a parameterized estimator of the policy gradient $\nabla_\pi J(\pi_t)$,
- $\Phi$ is a strictly convex mirror map (e.g., negative entropy or log-sum-exp),
- $D_\Phi$ is the corresponding Bregman divergence ($D_{\Phi^*}$ that of the convex conjugate $\Phi^*$),
- $c > 0$ controls the actor-critic trade-off,
- $\eta > 0$ is a functional step-size.
Actor and critic updates optimize two surrogates derived from this lower bound. The actor surrogate is given by

$$\ell_t(\theta) = \langle \hat g_t,\, \pi(\theta) - \pi_t \rangle - \left(\tfrac{1}{\eta} + \tfrac{1}{c}\right) D_\Phi(\pi(\theta), \pi_t),$$

where the parameter $\theta$ indexes the policy; maximizing $\ell_t(\theta)$ over a trust region provably tightens the lower bound on $J(\pi)$.
The critic is fitted via the loss

$$L_t(\omega) = \tfrac{1}{c}\, D_{\Phi^*}\!\left( \nabla\Phi(\pi_t) - c\,[\bar g_t - \hat g(\omega)],\; \nabla\Phi(\pi_t) \right),$$

using a parametric critic $\hat g(\omega)$, where $\bar g_t$ estimates $\nabla_\pi J(\pi_t)$. This loss directly enforces fidelity to the actor's improvement metric.
ODAC supports closed-form instantiations for common mirror maps:
- Negative entropy: yields weighted KL-regularized surrogates for the actor and critic.
- Log-sum-exp: yields log-weighted advantage-based surrogate functions.
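For intuition, the negative-entropy case can be written out directly: over a simplex policy, $D_\Phi$ is the KL divergence, and the dual divergence $D_{\Phi^*}$ has a log-sum-exp closed form. The sketch below (illustrative helper names; a two-action policy is assumed) evaluates the critic loss $L_t(\omega)$ for this mirror map:

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))

def bregman_phi(p, q):
    """D_Φ(p, q) for the negative-entropy mirror map = KL(p ‖ q)."""
    return np.sum(p * np.log(p / q))

def bregman_phi_star(u, v):
    """Dual divergence D_{Φ*}(u, v), with Φ*(z) = log Σ exp(z) the convex
    conjugate of negative entropy on the simplex; ∇Φ*(v) = softmax(v)."""
    grad_v = np.exp(v - logsumexp(v))
    return logsumexp(u) - logsumexp(v) - grad_v @ (u - v)

def critic_loss(pi_t, g_bar, g_hat, c):
    """L_t(ω) = (1/c)·D_{Φ*}(∇Φ(π_t) − c[ḡ_t − ĝ(ω)], ∇Φ(π_t)).
    ∇Φ(π) = log π up to an additive constant absorbed by the softmax."""
    grad_phi = np.log(pi_t)
    return bregman_phi_star(grad_phi - c * (g_bar - g_hat), grad_phi) / c

pi_t = np.array([0.7, 0.3])
g_bar = np.array([1.0, -1.0])
print(critic_loss(pi_t, g_bar, g_bar, c=0.5))                  # → 0.0 (perfect critic)
print(critic_loss(pi_t, g_bar, np.array([0.5, -0.5]), c=0.5))  # > 0 for a biased critic
```

By convexity of $\Phi^*$ the loss is non-negative and vanishes exactly when the critic matches the gradient estimate.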
2. Algorithmic Structure and Implementation
The ODAC algorithm alternates actor and critic updates, maintaining joint optimization of their respective decision-aware surrogates. Below is the pseudocode as presented:
```
Input: mirror map Φ, policy π(θ), critic ω, step-sizes η, {α_a}, {α_c},
       trade-off c, inner loops m_a, m_c, AC iterations T
initialize θ₀, ω₀, π₀ ← π(θ₀)
for t = 0 … T−1 do
    # Critic update
    estimate ∇_π J(π_t) via MC or TD  →  ḡ_t
    form L_t(ω) = (1/c)·D_{Φ*}(∇Φ(π_t) − c[ḡ_t − ĝ(ω)], ∇Φ(π_t))
    ω ← ω₀
    repeat m_c times: ω ← ω − α_c·∇_ω L_t(ω)
    set ĝ_t ← ĝ(ω)
    # Actor update
    form ℓ_t(θ) = ⟨ĝ_t, π(θ) − π_t⟩ − (1/η + 1/c)·D_Φ(π(θ), π_t)
    θ ← θ₀
    repeat m_a times: θ ← θ + α_a·∇_θ ℓ_t(θ)
    π_{t+1} ← π(θ)
end
return π_T
```
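To make the loop concrete, here is a minimal runnable sketch on a two-armed bandit, assuming a softmax policy, a Euclidean mirror map (so both Bregman terms become squared distances), an exact gradient oracle for $\bar g_t$, and warm-started logits; all names and constants are illustrative, not the paper's:

```python
import numpy as np

# Hypothetical two-armed bandit: J(π) = ⟨π, r⟩, so ∇_π J = r exactly.
r = np.array([1.0, 0.2])
eta, c = 1.0, 0.5            # functional step-size and coupling (illustrative)
alpha_a, alpha_c = 0.5, 0.5  # fixed inner step-sizes (the paper uses Armijo)
m_a, m_c, T = 20, 20, 30

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)          # policy logits, warm-started across iterations
for t in range(T):
    pi_t = softmax(theta)
    g_bar = r                # exact gradient oracle for ḡ_t
    # Critic update: Euclidean mirror map ⇒ L_t(ω) = (c/2)·‖ḡ_t − ω‖²
    omega = np.zeros(2)
    for _ in range(m_c):
        omega -= alpha_c * c * (omega - g_bar)
    # Actor update: ℓ_t(θ) = ⟨ω, π(θ) − π_t⟩ − (1/η + 1/c)·½‖π(θ) − π_t‖²
    for _ in range(m_a):
        pi = softmax(theta)
        grad_pi = omega - (1 / eta + 1 / c) * (pi - pi_t)  # ∂ℓ/∂π
        jac = np.diag(pi) - np.outer(pi, pi)               # softmax Jacobian
        theta += alpha_a * jac @ grad_pi                   # chain rule: ∂ℓ/∂θ
print(softmax(theta))        # policy mass concentrates on the better arm
```

The Bregman proximal term anchors each actor step at $\pi_t$, so the policy moves toward the better arm in bounded increments rather than jumping straight to a greedy solution.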
Key hyperparameters include:
- $c$: actor-critic coupling strength, tuned via grid search; moderate values balance the two updates.
- $\eta$: functional step-size, independent of the parameterization; likewise selected by grid search.
- $\alpha_a$, $\alpha_c$: per-update step-sizes, set via Armijo line-search.
- $m_a$, $m_c$: number of actor and critic gradient steps, balancing computational budget against bias/variance.
3. Theoretical Guarantees
ODAC is equipped with explicit monotonic improvement criteria (Prop 4). Defining , , guaranteed improvement is obtained if:
In tabular plus Euclidean-KL settings, this reduces to the “relative error < 1” condition of the critic approximation.
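The "relative error < 1" condition has a simple geometric reading: any gradient estimate whose error norm stays below the true gradient's norm still forms an acute angle with it, and hence remains an ascent direction. A quick illustrative check (not the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(0)
g_true = rng.normal(size=5)          # stand-in for the true policy gradient
norm_g = np.linalg.norm(g_true)

for _ in range(1000):
    err = rng.normal(size=5)
    # scale the error so that ‖ĝ − g‖ / ‖g‖ = 0.99 < 1
    err *= 0.99 * norm_g / np.linalg.norm(err)
    g_hat = g_true + err
    # ⟨ĝ, g⟩ = ‖g‖² + ⟨err, g⟩ ≥ ‖g‖²·(1 − 0.99) > 0: an ascent direction
    assert g_hat @ g_true > 0
print("relative error < 1 implies an ascent direction in every trial")
```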
ODAC further ensures convergence to a neighborhood of a stationary point (Prop 5), with the neighborhood radius controlled by the critic and projection errors, under convexity of the mirror map and standard smoothness assumptions.
4. Comparison with Trust-Region and Proximal Policy Optimization Methods
ODAC's actor surrogate structurally interpolates between standard actor-critic updates and trust-region-style updates such as TRPO and PPO.
- TRPO maximizes the importance-weighted advantage objective
  $$\max_\theta \; \mathbb{E}_{s \sim d^{\pi_t},\, a \sim \pi_t}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_t(a \mid s)}\, A^{\pi_t}(s, a) \right]$$
  subject to a KL trust-region constraint $\mathbb{E}_s\!\left[ \mathrm{KL}\big(\pi_t(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right] \le \delta$.
- PPO uses a clipped surrogate:
  $$\mathbb{E}_{s,\, a \sim \pi_t}\!\left[ \min\!\big( r_\theta(s,a)\, A^{\pi_t}(s,a),\; \mathrm{clip}(r_\theta(s,a),\, 1-\epsilon,\, 1+\epsilon)\, A^{\pi_t}(s,a) \big) \right], \quad r_\theta(s,a) = \frac{\pi_\theta(a \mid s)}{\pi_t(a \mid s)}.$$
- The ODAC actor surrogate takes the form:
  $$\ell_t(\theta) = \langle \hat g_t,\, \pi(\theta) - \pi_t \rangle - \left(\tfrac{1}{\eta} + \tfrac{1}{c}\right) D_\Phi(\pi(\theta), \pi_t).$$
For the weighted negative-entropy mirror map, this reduces to $\langle \hat g_t,\, \pi(\theta) - \pi_t \rangle - \left(\tfrac{1}{\eta} + \tfrac{1}{c}\right) \mathrm{KL}\big(\pi(\theta) \,\|\, \pi_t\big)$, paralleling TRPO's KL penalty but with a data-adaptive $1/c$ coefficient coupled to the critic error.
A plausible implication is that ODAC retains the stability of trust-region methods but offers adaptive error-sensitivity through the critic coupling.
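The structural difference can be seen side by side. The sketch below (illustrative function names and constants, two-action policies) contrasts PPO's per-sample clipping with an ODAC-style KL-regularized surrogate:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """PPO's per-sample clipped surrogate."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def odac_style_surrogate(pi_new, pi_old, g_hat, eta=1.0, c=0.5):
    """⟨ĝ, π − π_t⟩ − (1/η + 1/c)·KL(π ‖ π_t): the KL penalty's 1/c term
    grows as the critic coupling tightens, shrinking the trust region."""
    kl = np.sum(pi_new * np.log(pi_new / pi_old))
    return g_hat @ (pi_new - pi_old) - (1 / eta + 1 / c) * kl

pi_old = np.array([0.5, 0.5])
pi_new = np.array([0.6, 0.4])
print(odac_style_surrogate(pi_new, pi_old, np.array([1.0, 0.2])))
print(ppo_clip_objective(pi_new / pi_old, adv=np.array([0.4, -0.4])))
```

PPO's penalty is a hard clip on per-sample ratios, whereas the ODAC-style term is a smooth divergence whose strength adapts through $c$.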
5. Empirical Evaluation and Benchmarks
ODAC’s efficacy was benchmarked on:
- Two-armed bandits (deterministic Bernoulli payoff): demonstrated that standard MSE critics may drive policies to sub-optimal arms, while decision-aware critics reliably select optimal actions.
- Grid-world RL tasks, including:
- Cliff World (Sutton & Barto)
- Frozen Lake (OpenAI Gym)
Baselines:
- Standard MSE critic (TD loss)
- Adv-MSE critic (squared advantage loss)
Metrics:
- Average return over AC iterations.
- 95% confidence intervals over 5 random seeds.
- Critic expressivity varied by feature dimension $d$.
Key findings:
| Critic Expressivity | MSE-Critic | Adv-MSE | Decision-Aware (DA) Critic |
|---|---|---|---|
| High (d ≥ 80) | Optimal return | Optimal return | Optimal return |
| Moderate (d=40,60) | Sub-optimal, non-monotonic | Nearly always converges | Typically converges faster, higher return |
Empirical performance held under both tabular and linear policy parameterizations, for both exact and MC-estimated critic targets. This suggests ODAC’s advantages are robust to policy/critic model class and stochasticity in return estimation.
6. Practical Considerations for Implementation
Key practitioner tips:
- Tuning $c$: moderate values balance the actor and critic terms; lower $c$ suppresses unreliable critic contributions.
- Step-sizes ($\alpha_a$, $\alpha_c$): use Armijo line-search for both actor and critic losses to adapt to local curvature.
- Functional step-size ($\eta$): decoupled from the model parameters; select by grid search.
- Inner iterations ($m_a$, $m_c$): increase $m_c$ when critic evaluation is cheap to reduce bias; reduce $m_a$ when policy evaluation is costly.
- Critic error: large critic error (relative error $\ge 1$) invalidates the improvement conditions; remedy by collecting additional data or enriching the critic representation.
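The Armijo selection of the actor and critic step-sizes amounts to standard backtracking line-search; a generic sketch (not the paper's exact variant) on a toy quadratic standing in for a critic loss:

```python
import numpy as np

def armijo_step(loss, grad, x, alpha0=1.0, beta=0.5, sigma=1e-4, max_backtracks=30):
    """Backtracking line-search: shrink α until the sufficient-decrease
    condition loss(x − α·g) ≤ loss(x) − σ·α·‖g‖² holds."""
    g = grad(x)
    f0 = loss(x)
    alpha = alpha0
    for _ in range(max_backtracks):
        if loss(x - alpha * g) <= f0 - sigma * alpha * (g @ g):
            return alpha
        alpha *= beta
    return alpha

# toy quadratic standing in for a critic loss L_t(ω)
loss = lambda w: 0.5 * np.sum((w - 1.0) ** 2)
grad = lambda w: w - 1.0
w = np.zeros(3)
alpha = armijo_step(loss, grad, w)
w = w - alpha * grad(w)
```

Because the accepted step is re-derived at every update, the same routine serves both losses without per-problem tuning.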
A plausible implication is that ODAC's design eliminates the actor-critic objective mismatch present in standard AC, guarantees monotonic improvement under mild oracle-error conditions, and demonstrates superior practical performance, especially for resource-constrained critics (Vaswani et al., 2023).