Trust Region Q-Adjoint Matching (TRQAM)

Updated 4 July 2026

The paper introduces TRQAM, an off-policy RL algorithm that stabilizes fine-tuning of pretrained flow policies through an adaptive trust-region parameter.
It reformulates fine-tuning as a memoryless stochastic optimal control problem and uses adjoint matching to avoid expensive full-chain backpropagation.
Empirical results on 50 OGBench tasks show an offline RL success rate of 68%, compared with 46% for the strongest prior baseline.

Trust Region Q-Adjoint Matching (TRQAM) is an off-policy reinforcement-learning algorithm for fine-tuning pretrained flow policies under an explicit trust-region constraint. It is motivated by the observation that pretrained flow policies can represent diverse, high-capacity action distributions, but direct RL improvement is destabilized by the multi-step sampling process and by critic error during off-policy optimization. TRQAM extends Q-Adjoint Matching (QAM) by introducing a trust-region parameter $\lambda>0$ into a memoryless stochastic optimal control formulation and adapting that parameter through projected dual descent, so that the path-space Kullback–Leibler divergence from the pretrained policy can be controlled exactly. In the reported evaluation on 50 OGBench tasks, it achieves an overall offline RL success rate of $68\%$, compared with $46\%$ for the strongest prior baseline, and is presented as outperforming prior methods in both offline RL and offline-to-online RL (Dong et al., 26 May 2026).

1. Problem formulation and motivation

TRQAM is designed for reinforcement-learning fine-tuning of a pretrained flow policy $\pi_0(a\mid s)$ . The motivating setting is one in which $\pi_0$ has already learned a rich action distribution without RL, but gradient-based policy improvement would ordinarily require backpropagation through a multi-step denoising or flow chain. The source material characterizes this procedure as computationally expensive and unstable. QAM was introduced to avoid that cost by casting fine-tuning as a memoryless stochastic optimal control (SOC) problem with a learned critic $Q(s,a)$ .

In that SOC formulation, the pretrained sampling dynamics are perturbed by an additive control $u(x,\tau)$ , and one solves

$\min_{u}\ \mathbb{E}_{X\sim P^u}\left[\frac{1}{2}\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau - Q(s,X_1)\right]$

subject to the controlled stochastic differential equation

$dX_\tau = b(X_\tau,\tau)\, d\tau + \sigma(\tau) u(X_\tau,\tau)\, d\tau + \sigma(\tau)\, dB_\tau.$

Here, $b$ and $\sigma$ encode the pretrained flow model, while the terminal critic term $Q(s,X_1)$ biases sampling toward high-value actions. QAM further uses an adjoint ordinary differential equation to obtain a lean gradient update for the velocity field, thereby avoiding backpropagation through the entire sampling chain.
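The controlled SDE above can be simulated with a plain Euler-Maruyama scheme. The following is a minimal sketch: the functions `toy_drift`, `toy_control`, and `toy_sigma` are hypothetical placeholders chosen only so that the example runs, and they do not represent the pretrained flow model, the learned control, or the noise schedule used by QAM or TRQAM.

```python
import numpy as np

def simulate_controlled_sde(b, u, sigma, x0, n_steps, rng):
    """Euler-Maruyama for dX = b dtau + sigma*u dtau + sigma dB on [0, 1].

    Returns the trajectory, shape (n_steps + 1, dim), and the discretized
    control cost 0.5 * sum_k ||u_k||^2 * h accumulated along the path.
    """
    h = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    control_cost = 0.0
    for k in range(n_steps):
        tau = k * h
        u_k = u(x, tau)
        s = sigma(tau)
        noise = rng.standard_normal(x.shape)
        # drift step (base drift plus controlled drift) and diffusion step
        x = x + (b(x, tau) + s * u_k) * h + s * np.sqrt(h) * noise
        control_cost += 0.5 * float(np.dot(u_k, u_k)) * h
        traj.append(x.copy())
    return np.stack(traj), control_cost

# Hypothetical toy ingredients (assumptions for illustration only).
def toy_drift(x, tau):
    return -x

def toy_control(x, tau):
    return np.ones_like(x)

def toy_sigma(tau):
    return 0.5

rng = np.random.default_rng(0)
traj, control_cost = simulate_controlled_sde(
    toy_drift, toy_control, toy_sigma, np.zeros(2), 100, rng
)
```

The accumulated `control_cost` is the discrete counterpart of the quadratic term in the SOC objective; the terminal state `traj[-1]` plays the role of $X_1$, the action that would be scored by the critic.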

The central difficulty is that off-policy critics are biased and noisy. According to the formulation summarized for TRQAM, QAM inherits a fragility of critic-guided improvement: when the critic is ill-conditioned, small critic errors can be amplified, and a fixed temperature setting (in effect, a fixed value of $\lambda$) can lead to model collapse. This instability is the immediate problem that TRQAM addresses.

2. Stochastic optimal control formulation

The derivation of TRQAM begins by replacing the deterministic flow ODE with an equivalent SDE whose marginal laws match at each $\tau$. Under the memoryless optimal-transport noise schedule, two processes are defined: a base SDE that encodes the pretrained flow model, and a controlled SDE of the form given in Section 1, in which the control $u$ is parameterized by the fine-tuned velocity field. TRQAM introduces a $\lambda$-dependent factor into these dynamics, whereas QAM corresponds to holding that factor fixed.

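The associated SOC problem remains that of Section 1:

$\min_{u}\ \mathbb{E}_{X\sim P^u}\left[\frac{1}{2}\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau - Q(s,X_1)\right]$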

Within this construction, the adjoint state solves a backward ODE and induces an adjoint-matching loss for the fine-tuned velocity field that parameterizes the control. The role of the SOC reformulation is therefore twofold: it converts the original fine-tuning problem into a memoryless control problem, and it provides a way to update the policy through adjoint matching rather than through full backpropagation across the entire flow trajectory.

The notation used in the derivation is standard and explicit. The state variable is $X_\tau$, the control is $u(X_\tau,\tau)$ for $\tau\in[0,1]$, the learned critic is $Q(s,a)$, and $\lambda$ is the trust-region or dual parameter. This parameterization is essential because, in TRQAM, trust-region control is built into the dynamics rather than added only as an external penalty.

3. Trust-region mechanism and adaptive dual descent

The defining feature of TRQAM is the introduction of $\lambda$ as a trust-region parameter inside the SOC dynamics. By Girsanov’s theorem, the path-space KL divergence between the controlled law $P^u$ and the base law equals the expected control energy $\mathbb{E}_{X\sim P^u}\left[\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau\right]$ multiplied by a prefactor that depends inversely on $\lambda$.

The derivation summarized for the method states that this identity follows by writing the Radon–Nikodym derivative via Girsanov, expanding the exponential martingale term, and using the zero-mean property of the Itô integral under $P^u$ (Dong et al., 26 May 2026).
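The Girsanov argument can be written out for a generic case. The sketch below assumes the controlled SDE of Section 1 with its diffusion term multiplied by a constant $c>0$, a hypothetical stand-in for a $\lambda$-dependent scaling; the base law, written $P^{\mathrm{base}}$ here, is the same SDE with $u\equiv 0$, and $B^u$ denotes Brownian motion under $P^u$. Under these assumptions,

$\log\frac{dP^u}{dP^{\mathrm{base}}}(X) = \frac{1}{c}\int_0^1 u(X_\tau,\tau)\cdot dB^u_\tau + \frac{1}{2c^2}\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau,$

and taking the expectation under $P^u$, where the Itô integral has zero mean, gives

$\mathrm{KL}\left(P^u\,\|\,P^{\mathrm{base}}\right) = \frac{1}{2c^2}\,\mathbb{E}_{X\sim P^u}\left[\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau\right].$

With $c=1$ this is exactly the quadratic control cost in the SOC objective. A choice such as $c^2=\lambda$ would give a prefactor of $1/(2\lambda)$, which is one way to obtain an inverse dependence on $\lambda$; that specific choice is an assumption of this sketch.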

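A further result establishes that the terminal KL between the fine-tuned policy (written $\pi_\theta$ here) and the pretrained policy $\pi_0$ is upper-bounded by the path-space KL between the controlled law $P^u$ and the base law (written $P^{\mathrm{base}}$ here):

$\mathrm{KL}\left(\pi_\theta(\cdot\mid s)\,\|\,\pi_0(\cdot\mid s)\right) \le \mathrm{KL}\left(P^u\,\|\,P^{\mathrm{base}}\right)$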

This gives the trust-region interpretation its operational meaning. If the path-space divergence is controlled, then the deviation of the terminal action distribution from the pretrained flow policy is controlled as well.
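This bound is an instance of the data-processing inequality. One way to see it is the chain rule for KL divergence, written here with $P^u_1$ and $P^{\mathrm{base}}_1$ for the terminal marginals of the controlled and base path laws (notation introduced only for this illustration):

$\mathrm{KL}\left(P^u\,\|\,P^{\mathrm{base}}\right) = \mathrm{KL}\left(P^u_1\,\|\,P^{\mathrm{base}}_1\right) + \mathbb{E}_{x_1\sim P^u_1}\left[\mathrm{KL}\left(P^u(\cdot\mid X_1=x_1)\,\|\,P^{\mathrm{base}}(\cdot\mid X_1=x_1)\right)\right] \ge \mathrm{KL}\left(P^u_1\,\|\,P^{\mathrm{base}}_1\right).$

The second term is an expectation of KL divergences and is therefore nonnegative, so discarding the path and keeping only the terminal action can never increase the divergence.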

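TRQAM enforces this constraint by imposing a path-space KL budget and adapting $\lambda$ as a dual variable. In discrete time, with Euler steps of a given size, the per-step transitions of the base and controlled processes are Gaussian with a shared covariance, so the per-step KL has the standard closed form for equal-covariance Gaussians:

$\mathrm{KL}\left(\mathcal{N}(\mu_1,\Sigma)\,\|\,\mathcal{N}(\mu_0,\Sigma)\right) = \frac{1}{2}(\mu_1-\mu_0)^\top \Sigma^{-1} (\mu_1-\mu_0),$

where $\mu_1$ and $\mu_0$ stand for the controlled and base one-step means and $\Sigma$ for the shared covariance.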

Summing these per-step terms over the Euler steps and averaging over the sampled trajectories in the batch yields a surrogate estimate of the path-space KL.
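The per-step closed form and the batch surrogate described above can be sketched as follows. This is an illustration under stated assumptions: the shared covariance is taken to be isotropic (`var * I`), and the array layout and the function names `step_kl_shared_cov` and `surrogate_path_kl` are hypothetical, not taken from the paper.

```python
import numpy as np

def step_kl_shared_cov(mean_ctrl, mean_base, var):
    """KL( N(mean_ctrl, var*I) || N(mean_base, var*I) ) = ||diff||^2 / (2*var)."""
    diff = np.asarray(mean_ctrl, dtype=float) - np.asarray(mean_base, dtype=float)
    return 0.5 * np.sum(diff * diff, axis=-1) / var

def surrogate_path_kl(means_ctrl, means_base, step_vars):
    """Sum per-step KLs over Euler steps, then average over the batch.

    means_ctrl, means_base: arrays of shape (batch, steps, dim) holding the
    one-step Gaussian means of the controlled and base transitions.
    step_vars: array of shape (steps,) with the per-step isotropic variance.
    """
    per_step = step_kl_shared_cov(means_ctrl, means_base, step_vars)  # (batch, steps)
    return float(per_step.sum(axis=1).mean())

# Toy check: every controlled mean is offset from the base mean by (1, 1).
means_base = np.zeros((4, 3, 2))
means_ctrl = np.ones((4, 3, 2))
step_vars = np.full(3, 0.5)
kl_hat = surrogate_path_kl(means_ctrl, means_base, step_vars)
```

In the toy check each step contributes $\|(1,1)\|^2 / (2\cdot 0.5) = 2$, so three steps give a surrogate of 6 for every trajectory in the batch.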

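TRQAM then maintains an EMA-smoothed estimate of this surrogate and performs projected dual descent on $\lambda$: the parameter is moved according to the gap between the smoothed estimate and the budget, and is then projected back onto its admissible range so that $\lambda>0$ is preserved.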

Because the path-space KL appears with an inverse dependence on $\lambda$, increasing $\lambda$ when the budget is exceeded tightens the trust region in the SDE itself. This internalization of $\lambda$ distinguishes TRQAM from a merely external regularization scheme.
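A minimal sketch of the smoothing and dual update follows. The additive form of the step, the EMA convention, and the clipping bounds `lam_min` and `lam_max` are assumptions made for illustration; only the direction of the update (raise $\lambda$ when the smoothed KL exceeds the budget) is taken from the description above.

```python
def ema_update(kl_ema, kl_batch, rate):
    """Exponential moving average of the batch KL estimate (assumed convention)."""
    return (1.0 - rate) * kl_ema + rate * kl_batch

def projected_dual_step(lam, kl_ema, budget, step_size, lam_min=1e-3, lam_max=1e3):
    """Move lambda along the constraint violation, then project onto [lam_min, lam_max].

    lam_min and lam_max are hypothetical bounds that keep lambda positive and finite.
    """
    lam_new = lam + step_size * (kl_ema - budget)
    return min(max(lam_new, lam_min), lam_max)

# Toy usage: the smoothed KL (1.0) exceeds the budget (0.5), so lambda increases.
kl_ema = ema_update(kl_ema=0.0, kl_batch=2.0, rate=0.5)
lam = projected_dual_step(lam=1.0, kl_ema=kl_ema, budget=0.5, step_size=0.1)

# A large negative step is clipped at lam_min instead of turning lambda negative.
lam_clipped = projected_dual_step(lam=1.0, kl_ema=0.0, budget=100.0, step_size=1.0)
```

Smoothing the KL estimate before the dual step keeps a single noisy batch from causing a large swing in $\lambda$.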

4. Stability analysis and theoretical guarantees

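The theory reported for TRQAM is organized around three results. The first is Lemma 1, which quantifies the amplification of critic error under exponential tilting. If two $Q$-functions differ uniformly by at most a small error, the exponentially tilted policies they induce, obtained by reweighting the pretrained policy $\pi_0$ with the exponentiated critic, can still differ substantially: the critic error enters through an exponential, so its effect on the policy grows as the temperature shrinks.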

In the presentation of TRQAM, this lemma explains why fixed-$\lambda$ adjoint matching is fragile: small critic errors are transformed into potentially large distributional deviations when the implicit temperature is not controlled.
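The amplification effect can be checked numerically on a finite action set. The sketch below assumes the tilted policy has the form $\pi(a)\propto\pi_0(a)\exp(Q(a)/\lambda)$, which is the usual meaning of exponential tilting but is stated here as an assumption rather than as the exact statement of Lemma 1. For that form, a standard calculation shows that two critics differing uniformly by at most $\varepsilon$ produce log-probabilities that differ by at most $2\varepsilon/\lambda$. All names and values below are illustrative.

```python
import numpy as np

def tilted_log_policy(log_prior, q, lam):
    """Log-probabilities of pi(a) proportional to prior(a) * exp(q(a) / lam)."""
    logits = log_prior + q / lam
    m = logits.max()
    log_z = m + np.log(np.sum(np.exp(logits - m)))  # stable log-sum-exp
    return logits - log_z

rng = np.random.default_rng(0)
n_actions = 50
log_prior = np.full(n_actions, -np.log(n_actions))  # uniform stand-in for pi_0
q_true = rng.normal(size=n_actions)
eps = 0.1
q_noisy = q_true + rng.uniform(-eps, eps, size=n_actions)  # uniform error <= eps

def max_log_ratio(lam):
    lp_true = tilted_log_policy(log_prior, q_true, lam)
    lp_noisy = tilted_log_policy(log_prior, q_noisy, lam)
    return float(np.max(np.abs(lp_true - lp_noisy)))

gaps = {lam: max_log_ratio(lam) for lam in (1.0, 0.1, 0.01)}
for lam, gap in gaps.items():
    print(f"lambda={lam}: max |log ratio| = {gap:.3f}, bound 2*eps/lambda = {2 * eps / lam:.1f}")
```

The bound scales as $1/\lambda$, so the same critic error permits a much larger policy deviation at a small temperature. This is the sense in which a fixed, uncontrolled temperature leaves the update exposed to critic noise.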

The second result is the path-space KL identity described above. Its significance is not only formal. By scaling the diffusion with a $\lambda$-dependent factor, the algorithm makes $\lambda$ appear explicitly as an inverse factor in the KL. This gives a direct control variable for the deviation between controlled and base dynamics.

The third result is the comparison between terminal and path-space KL. Because the terminal KL between the fine-tuned and pretrained policies is bounded above by the path-space KL, an adaptive mechanism that keeps the path-space KL within its budget also bounds the terminal divergence from the pretrained policy. In the formulation provided for TRQAM, these three results jointly support the claim that adapting $\lambda$ prevents exponential drift due to critic noise and stabilizes off-policy fine-tuning.

A plausible implication is that TRQAM changes the locus of regularization. Rather than only penalizing the resulting policy after improvement, it constrains the entire sampling path. That interpretation is consistent with the emphasis on path-space rather than solely terminal divergences.

5. Algorithmic procedure

The high-level procedure takes as input a pretrained velocity field, a critic, a replay buffer, a KL budget, a dual step size, an EMA rate, initial values for the adapted quantities (including $\lambda$), and the flow step size. Each iteration samples a batch from the replay buffer, updates the critic using a standard TD or expectile loss, and then performs a policy (velocity) update for each state in the batch.

The policy update has three stated substeps. First, a trajectory is sampled from the controlled SDE using Euler discretization. Second, a backward adjoint ODE is solved, with a terminal condition determined by the critic at the sampled terminal action $X_1$. Third, the parameters of the fine-tuned velocity field are updated by minimizing the adjoint-matching loss.
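The backward solve and the matching loss can be sketched in the generic adjoint-matching form. The sketch assumes the lean adjoint satisfies $d\tilde a/d\tau = -(\nabla_x b)^\top \tilde a$ with terminal value $\tilde a(1) = -\nabla_a Q(s, X_1)$, and that the control is regressed onto $-\sigma\tilde a$. TRQAM's exact loss, including any $\lambda$-dependent factor, may differ. The linear drift and quadratic critic are toy assumptions chosen so that the vector-Jacobian product is analytic.

```python
import numpy as np

def solve_lean_adjoint(traj, vjp_drift, terminal_adjoint):
    """Integrate da/dtau = -(db/dx)^T a backward from tau = 1 with explicit Euler.

    traj: (n_steps + 1, dim) sampled states on a uniform grid over [0, 1].
    vjp_drift(x, tau, a): returns the vector-Jacobian product (db/dx)^T a.
    """
    n_steps = traj.shape[0] - 1
    h = 1.0 / n_steps
    adj = np.empty_like(traj, dtype=float)
    adj[n_steps] = terminal_adjoint
    for k in range(n_steps, 0, -1):
        # a(tau - h) ~= a(tau) - h * a'(tau) = a(tau) + h * (db/dx)^T a(tau)
        adj[k - 1] = adj[k] + h * vjp_drift(traj[k], k * h, adj[k])
    return adj

def adjoint_matching_loss(u_vals, adj, sigmas):
    """Mean over grid points of ||u + sigma * a||^2 (generic adjoint-matching form)."""
    resid = u_vals + sigmas[:, None] * adj
    return float(np.mean(np.sum(resid ** 2, axis=-1)))

# Toy setup (assumptions): drift b(x) = -c*x, critic Q(a) = -0.5*||a - goal||^2.
c = 2.0
def toy_vjp(x, tau, a):
    return -c * a  # (db/dx)^T a for b(x) = -c*x

goal = np.array([1.0, -1.0])
n_steps = 1000
traj = np.zeros((n_steps + 1, 2))        # stand-in for a sampled trajectory
terminal_adjoint = traj[-1] - goal       # equals -grad Q at the terminal state
adj = solve_lean_adjoint(traj, toy_vjp, terminal_adjoint)
sigmas = np.full(n_steps + 1, 0.5)
target_u = -sigmas[:, None] * adj        # control values that zero the loss
```

Each backward step needs only one vector-Jacobian product of the drift, and no gradient flows through the sampling chain itself. For the toy drift the exact solution is $\tilde a(0) = \tilde a(1)\,e^{-c}$, which the Euler solve approximates.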

The trust-region update then estimates the surrogate path-space KL on the batch, folds it into the EMA-smoothed statistic, and applies the projected dual step to $\lambda$.

This workflow preserves the central computational advantage inherited from QAM—the use of adjoint matching instead of full-chain backpropagation—while adding an explicit trust-region controller. The presentation of the algorithm identifies the trust-region steps as the distinctive additions relative to QAM.

6. Empirical performance, interpretation, and limitations

The reported empirical evaluation covers two suites. The first is OGBench, described as 50 offline goal-conditioned tasks reformulated as reward-based single tasks across 10 domains: antmaze-large, antmaze-giant, humanoidmaze-medium, humanoidmaze-large, scene, puzzle-3x3, puzzle-4x4, cube-double, cube-triple, and cube-quadruple. In this setting, training consists of 1 million offline fine-tuning steps followed by 0.5 million steps of online RL. The baselines are FQL, CGQL-Linex, DSRL, IFQL, QAM, and QAM-E. The second suite is Robomimic lift, can, and square, used as a stress test for demonstration manipulation (Dong et al., 26 May 2026).

Across 8 seeds, the principal quantitative result is an offline OGBench success rate of $68\%$ for TRQAM, versus $46\%$ for the strongest prior baseline. The gains are reported to be largest on long-horizon and combinatorial tasks. A further comparison between scratch and pretrained initialization indicates that only TRQAM effectively leverages the pretrained $\pi_0$, whereas QAM and QAM-E show nearly identical learning-from-scratch curves.

The ablation results are central to the interpretation of the method. They indicate, first, that adaptive $\lambda$, whether implemented externally or internally, outperforms fixed $\lambda$. Second, they indicate that internal $\lambda$, the defining choice of TRQAM, tracks the KL budget tightly, whereas external $\lambda$ cannot enforce the budget and leads to worse performance. This directly addresses a likely misconception that trust-region behavior can be obtained equivalently by adding an external penalty without modifying the dynamics.

Sensitivity analysis further reports that success rates vary smoothly with the KL budget, and that the optimal budget depends on task structure: tight budgets are favored for navigation and manipulation tasks, except for puzzle-4x4, which benefits from larger budgets. This suggests that the budget functions as an interpretable control over permissible deviation from the pretrained policy.

The stated advantages of TRQAM are a principled exact trust region, stability against critic-error amplification, and adaptivity of $\lambda$ to track a prescribed KL target, including across offline-to-online transitions or time-varying budgets. The stated limitations are that adjoint matching requires vector–Jacobian products at each backward ODE step, so cost scales with model size, and that the KL budget still requires per-domain selection, albeit with a smooth and interpretable effect. Proposed extensions include reducing vector–Jacobian-product overhead through efficient adjoint solvers or velocity-field architectures, extending the method to diffusion policies with many more denoising steps through variance-reduced discretizations, and combining TRQAM with off-policy value regularizers such as Conservative Q-Learning for greater safety under large distribution shift (Dong et al., 26 May 2026).


References (1)

Trust Region Q Adjoint Matching (2026)
