AlignIQL: Constrained Offline Policy Extraction

Updated 7 January 2026

AlignIQL is a family of offline reinforcement learning algorithms that recasts policy extraction as a constrained optimization problem, unifying IQL and IDQL approaches.
It provides two methods—AlignIQL and AlignIQL-hard—that offer closed-form solutions for implicit policy recovery while maintaining actor–critic decoupling.
The algorithms achieve competitive or superior empirical performance in sparse-reward and high-dimensional tasks, with strong theoretical guarantees.

AlignIQL is a family of offline reinforcement learning (RL) algorithms that address the policy extraction problem in Implicit Q-Learning (IQL) by recasting the policy recovery as a constrained optimization (policy alignment) problem. It provides closed-form solutions—up to low-dimensional statewise multipliers—that strictly characterize the “implicit policy” encoded by a learned Q-value and value pair, giving rise to two practical algorithms: AlignIQL and AlignIQL-hard. Both methods retain IQL’s advantages of actor–critic decoupling and are theoretically grounded, yielding competitive or better empirical performance on challenging, sparse-reward tasks compared to prior state-of-the-art (He et al., 2024).

1. Foundations: IQL and the Implicit Policy Problem

Implicit Q-Learning (IQL) is an offline RL algorithm that enjoys stable training by decoupling actor and critic and never evaluating out-of-distribution (OOD) actions during Bellman updates. Specifically, IQL parameterizes:

The value network $V_\psi$ via expectile ( $\tau$ -quantile) regression:

$L_V(\psi) = \mathbb{E}_{(s, a) \sim D} \Big[ |\tau - 1_{Q(s,a) - V_\psi(s) < 0}| \cdot (\overline{Q}(s,a) - V_\psi(s))^2 \Big]$

The Q-network $Q_\theta$ by MSE regression to $r + \gamma V_\psi(s')$
The actor $\pi_\phi$ via advantage-weighted regression (AWR):

$L_\pi(\phi) = -\mathbb{E}_{(s, a) \sim D}\left[ \exp \left( \alpha \cdot (\overline{Q}(s,a) - V_\psi(s)) \right) \log \pi_\phi(a|s) \right]$

Despite practical success, two fundamental questions arise: What policy is strictly “implied” by a generic learned pair $(Q, V)$ ? And why does the AWR step recover sensible policies even when the critic is non-optimal? Prior work (IDQL) interprets IQL as an actor–critic method, positing that the implicit policy weights $w(s, a)$ only hold when the critic achieves global optimality for convex loss $U$ . AlignIQL circumvents these limitations by directly characterizing the implicit policy via a principled optimization problem that holds for any $(Q, V)$ (He et al., 2024).

2. The Implicit Policy-Finding (IPF) Optimization

AlignIQL formulates “policy alignment” as the solution to the implicit policy-finding (IPF) constrained program:

$\begin{aligned} &\min_{\pi} \; \mathbb{E}_{s \sim d^\pi,\, a \sim \pi(\cdot|s)} \Big[ f\left( \frac{\pi(a|s)}{\mu(a|s)} \right) \Big] \ &\text{s.t.} \quad \pi(a|s) \geq 0, \;\; \int_a \pi(a|s) \, da = 1 \; \,\forall s, \ &\qquad\quad \mathbb{E}_{a \sim \pi(\cdot|s)}[ Q(s, a) ] = V(s) \;\; \forall s \end{aligned}$

Here, $\mu(a|s)$ is the behavioral policy, $f(x)$ is a convex regularizer (e.g., $f(x) = \log x$ induces KL divergence penalties). Slater’s condition applies, implying convexity and feasibility. This framework generalizes IQL and IDQL: IQL’s AWR appears as an approximate solution for simple $\mu$ and constant multipliers, while IDQL’s weight formula emerges from a general $U$ -loss under critic global optimality assumptions. AlignIQL, however, precisely characterizes the solution for any arbitrary critic.

3. Solution Characterization and Algorithms

3.1. AlignIQL-hard (Exact Constraint)

By formulating the Lagrangian with dual multipliers $\alpha(s)$ , $\beta(s)$ for normalization and alignment constraints, and employing KKT conditions, the optimal solution is

$\pi^*(a|s) = \mu(a|s)\,\max\left\{ g_f(-\alpha(s) - \beta(s) Q(s,a)), 0 \right\}$

where $g_f = (h_f')^{-1}$ for $h_f(x) = x f(x)$ . For $f(x) = \log x$ , this reduces to the exponential-family solution

$\pi \propto \mu \cdot \exp(-\beta(s) Q(s,a))$

Typically $\beta(s)$ settles negative, recovering the standard AWR functional form, i.e., $\pi \propto \mu \cdot \exp(|\beta| Q)$ . Determining $\alpha(s), \beta(s)$ requires solving two scalar nonlinear equations per state, for which small neural networks are used as dual function approximators, trained by dual ascent on

$\mathcal{L}_M(\omega, \chi) = -\mathbb{E}_{s, a \sim \mu} \left[ \exp\left(-\alpha_\omega(s) - \beta_\chi(s) Q(s,a) - 1\right) + \alpha_\omega(s) + \beta_\chi(s) V(s) \right]$

This yields the AlignIQL-hard algorithm, enforcing policy alignment exactly (up to dual approximation).

3.2. AlignIQL (Soft Constraint)

To mitigate dual ascent instability, AlignIQL proposes a soft-penalty relaxation:

$\min_{\pi, V} \mathbb{E}_{s, a \sim \pi}\left[ f\left( \frac{\pi}{\mu} \right) + \eta \left( Q(s,a) - V(s) \right)^2 \right]$

with $\eta \in \mathbb{R}\setminus \{0\}$ . The corresponding stationary solution is

$\pi^*(a|s) \propto \mu(a|s)\exp\left(-\eta\left(Q(s,a) - V(s)\right)^2\right)$

For $\eta > 0$ , the weights focus on actions where $Q \approx V$ (strong alignment); $\eta < 0$ yields policies that weight high- $|Q-V|$ actions, heuristically resembling AWR. Proposition 3.10 shows that for $\eta (h^* - k^*) < 0$ (for alignment constants $h^*, k^*$ ), the IPF solution is a strict local minimum for the soft objective. Empirically, $\eta = \pm 1$ suffices across environments.

4. Implementation and Core Algorithmic Aspects

Both variants maintain full actor–critic decoupling: the critic $(Q, V)$ is trained entirely from dataset $D$ using regression losses, without gradient flow from the actor. Policy extraction is performed at test time by evaluating the closed-form weights, with action sampling or maximization based on these weights.

Key implementation details include:

Value network use of expectile-regression loss, $\tau=0.7$ (MuJoCo) or $\tau=0.9$ (AntMaze), matching IQL.
Q-function learning by Bellman-MSE backup.
Actor extraction weights:
- AlignIQL-hard: $w_i = \max\{\exp(-\alpha_\omega(s) - \beta_\chi(s)Q(s,a_i)-1), 0\}$
- AlignIQL: $w_i = \exp(-\eta (Q(s,a_i)-V(s))^2)$ (or $|Q-V|$ variant for $\eta < 0$ )
- Only computed at extraction, not during training.
Learning rates: $3 \times 10^{-4}$ for $(Q, V, \mu)$ , $3 \times 10^{-5}$ for duals $(\alpha, \beta)$ .
Samples per extraction state: 16, 64, or 256.
Diffusion models used for behavioral policy estimation.

A summary pseudocode for training and extraction loops is given directly in the paper (He et al., 2024). All standard RL infrastructure is assumed, including target network updates and experience replay.

5. Theoretical Properties

Both versions of AlignIQL inherit the decoupling properties of IQL:

Critic $(Q, V)$ is trained independently of the actor, ensuring stable and robust convergence.
Actor $\pi$ is always recovered post hoc, in closed form, without gradient entanglement between policy and value.
AlignIQL-hard solves a globally convex IPF under Slater’s condition, attaining the globally optimal implicit policy under dual optimization.
AlignIQL’s soft-penalty version, although nonconvex, ensures (under suitable $\eta$ ) that the IPF policy is a strict local minimum of the relaxed objective, and is empirically more stable (avoiding dual-exponential instability).

This structure guarantees robustness to hyperparameter tuning and extends theoretical guarantees of policy alignment to arbitrary learned critics, regardless of optimality.

6. Empirical Results and Comparative Performance

Comprehensive benchmarks on D4RL locomotion, AntMaze, and Adroit suites establish AlignIQL’s empirical advantages. Key results are summarized below:

Benchmark	IQL	IDQL	CQL	Diff-QL	SfBC	SQL	AlignIQL
Locomotion	76.9	78.0	63.9	87.9	75.6	83.3	75.7
AntMaze	63.0	74.4	—	69.8	74.2	—	79.1

In highly saturated locomotion benchmarks, AlignIQL is on par with SOTA. In sparse AntMaze tasks, it outperforms all baselines by 5–15 points (on 0–100 scale). Adroit task results show similar trends, with AlignIQL matching or exceeding IDQL—despite sharing $\tau$ and model capacity.

Further, as the action-sampling budget $N$ per state increases, IQL and IDQL can degrade due to OOD action overemphasis, while AlignIQL’s robust alignment (especially with $\eta>0$ in sparse domains) ensures sustained or improved convergence speed and final performance. All results are means over ten seeds with standard error reporting; improvements above three points are statistically significant ( $p<0.05$ ).

Strengths of AlignIQL include minimal increase in hyperparameter burden (just $\eta$ for soft variant, typically $\pm1$ suffices), no need for extra networks or large dual updates in the soft variant, and exact alignment in the hard variant (at the expense of sensitivity and stability in very sparse settings).

7. Significance and Research Impact

AlignIQL reframes the core question of policy extraction from learned Q-value and value networks as a convex, constrained optimization that generalizes and subsumes established heuristics (AWR, IDQL). This formalization clarifies when and why the standard exponential-weight approach is justified, provides theoretically guaranteed alignment, and retains IQL’s simplicity and scalability.

Its demonstration of significant improvements in sparse-reward and high-dimensional domains establishes a new paradigm for principled policy recovery in offline RL—particularly in regimes where nuanced alignment of actor and critic is essential (He et al., 2024).

Markdown Report Issue Upgrade to Chat

References (1)

AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization (2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to AlignIQL.