
Trust Region Q-Adjoint Matching (TRQAM)

Updated 4 July 2026
  • The paper introduces TRQAM, an off-policy RL algorithm that stabilizes fine-tuning of pretrained flow policies through an adaptive trust-region parameter.
  • It reformulates fine-tuning as a memoryless stochastic optimal control problem and uses adjoint matching to avoid expensive full-chain backpropagation.
  • Empirical results on 50 OGBench tasks show a success rate of 68%, significantly outperforming prior methods that achieved 46%.

Trust Region Q-Adjoint Matching (TRQAM) is an off-policy reinforcement-learning algorithm for fine-tuning pretrained flow policies under an explicit trust-region constraint. It is motivated by the observation that pretrained flow policies can represent diverse, high-capacity action distributions, but direct RL improvement is destabilized by the multi-step sampling process and by critic error during off-policy optimization. TRQAM extends Q-Adjoint Matching (QAM) by introducing a trust-region parameter $\lambda>0$ into a memoryless stochastic optimal control formulation and adapting that parameter through projected dual descent, so that the path-space Kullback–Leibler divergence from the pretrained policy can be controlled exactly. In the reported evaluation on 50 OGBench tasks, it achieves an overall offline RL success rate of 68%, compared with 46% for the strongest prior baseline, and is presented as outperforming prior methods in both offline RL and offline-to-online RL (Dong et al., 26 May 2026).

1. Problem formulation and motivation

TRQAM is designed for reinforcement-learning fine-tuning of a pretrained flow policy $\pi_0(a\mid s)$. The motivating setting is one in which $\pi_0$ has already learned a rich action distribution without RL, but gradient-based policy improvement would ordinarily require backpropagation through a multi-step denoising or flow chain. The source material characterizes this procedure as computationally expensive and unstable. QAM was introduced to avoid that cost by casting fine-tuning as a memoryless stochastic optimal control (SOC) problem with a learned critic $Q(s,a)$.

In that SOC formulation, the pretrained sampling dynamics are perturbed by an additive control $u(x,\tau)$, and one solves

$$\min_{u}\ \mathbb{E}_{X\sim P^u}\left[\frac{1}{2}\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau - Q(s,X_1)\right]$$

subject to the controlled stochastic differential equation

$$dX_\tau = b(X_\tau,\tau)\, d\tau + \sigma(\tau)\, u(X_\tau,\tau)\, d\tau + \sigma(\tau)\, dB_\tau.$$

Here, $b$ and $\sigma$ encode the pretrained flow model, while the terminal critic term $Q(s,X_1)$ biases sampling toward high-value actions. QAM further uses an adjoint ordinary differential equation to obtain a lean gradient update for the velocity field, thereby avoiding backpropagation through the entire sampling chain.

The central difficulty is that off-policy critics are biased and noisy. According to the formulation summarized for TRQAM, QAM inherits a fragility of critic-guided improvement: when the critic is ill-conditioned, small critic errors can be amplified, and a fixed temperature setting (described as effectively $\lambda=1$) can lead to model collapse. This instability is the immediate problem that TRQAM addresses.

2. Stochastic optimal control formulation

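The derivation of TRQAM begins by replacing the deterministic flow ODE with an equivalent SDE whose marginal laws match at each $\tau$. Under the memoryless optimal-transport schedule for $\sigma(\tau)$, the base and controlled SDEs are written, in the notation of the SOC formulation above, as

$$dX_\tau = b(X_\tau,\tau)\, d\tau + \sqrt{\lambda}\,\sigma(\tau)\, dB_\tau$$

and

$$dX_\tau = \left[b(X_\tau,\tau) + \sigma(\tau)\, u(X_\tau,\tau)\right] d\tau + \sqrt{\lambda}\,\sigma(\tau)\, dB_\tau,$$

where the factor $\sqrt{\lambda}$ is introduced by TRQAM, whereas QAM uses $\lambda=1$.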
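As an illustration of how a controlled SDE of this kind could be simulated, here is a minimal Euler–Maruyama sketch. It assumes the form $dX_\tau = [b + \sigma u]\,d\tau + \sqrt{\lambda}\,\sigma\,dB_\tau$ on $[0,1]$; the function name `sample_controlled_path`, the toy drift, the constant noise schedule, and the zero control are all hypothetical placeholders, not the pretrained flow model or the schedule used in the paper.

```python
import numpy as np

def sample_controlled_path(b, u, sigma, lam, x0, num_steps, rng):
    """Euler-Maruyama rollout of dX = [b + sigma*u] dtau + sqrt(lam)*sigma dB on [0, 1].

    b, u: callables (x, tau) -> array shaped like x (assumed interfaces).
    sigma: callable tau -> float. lam: trust-region parameter.
    Returns the visited states (num_steps + 1, dim) and controls (num_steps, dim).
    """
    h = 1.0 / num_steps
    x = np.array(x0, dtype=float)
    path, controls = [x.copy()], []
    for k in range(num_steps):
        tau = k * h
        s = sigma(tau)
        u_k = u(x, tau)
        noise = rng.standard_normal(x.shape)
        # The drift uses b + sigma*u; the noise is scaled by sqrt(lam * h) * sigma.
        x = x + h * (b(x, tau) + s * u_k) + np.sqrt(lam * h) * s * noise
        path.append(x.copy())
        controls.append(u_k)
    return np.stack(path), np.stack(controls)

# Toy usage with hypothetical dynamics: linear drift, unit noise, zero control.
rng = np.random.default_rng(0)
toy_path, toy_controls = sample_controlled_path(
    b=lambda x, tau: -x,
    u=lambda x, tau: np.zeros_like(x),
    sigma=lambda tau: 1.0,
    lam=0.5,
    x0=np.ones(2),
    num_steps=20,
    rng=rng,
)
```

The controls are returned alongside the states because a path-space KL estimate only needs the control magnitudes along the sampled trajectory.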

The associated SOC problem remains

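$$\min_{u}\ \mathbb{E}_{X\sim P^u}\left[\frac{1}{2}\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau - Q(s,X_1)\right].$$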

Within this construction, the adjoint state solves a backward ODE and induces an adjoint-matching loss for the fine-tuned velocity field that parameterizes the control. The role of the SOC reformulation is therefore twofold: it converts the original fine-tuning problem into a memoryless control problem, and it provides a way to update the policy through adjoint matching rather than through full backpropagation across the entire flow trajectory.

The notation used in the derivation is standard and explicit. The state variable is $X_\tau$, the control is $u(X_\tau,\tau)$ for $\tau\in[0,1]$, the learned critic is $Q(s,a)$, and $\lambda>0$ is the trust-region or dual parameter. This parameterization is essential because, in TRQAM, trust-region control is built into the dynamics rather than added only as an external penalty.

3. Trust-region mechanism and adaptive dual descent

The defining feature of TRQAM is the introduction of $\lambda$ as a trust-region parameter inside the SOC dynamics. By Girsanov’s theorem, the path-space KL divergence between the controlled law $P^u$ and the base (uncontrolled) law $P^0$ satisfies

$$\mathrm{KL}\left(P^u \,\|\, P^0\right) = \frac{1}{2\lambda}\,\mathbb{E}_{X\sim P^u}\left[\int_0^1 \|u(X_\tau,\tau)\|^2\, d\tau\right].$$

The derivation summarized for the method states that this identity follows by writing the Radon–Nikodym derivative via Girsanov, expanding the exponential martingale term, and using the zero-mean property of the Itô integral under $P^u$ (Dong et al., 26 May 2026).

A further result establishes that the terminal KL between the fine-tuned policy and the pretrained policy is upper-bounded by the path-space KL:

$$\mathrm{KL}\left(\pi^u(\cdot\mid s) \,\|\, \pi_0(\cdot\mid s)\right) \le \mathrm{KL}\left(P^u \,\|\, P^0\right),$$

where $\pi^u(\cdot\mid s)$ denotes the law of the terminal sample $X_1$ under $P^u$.

This gives the trust-region interpretation its operational meaning. If the path-space divergence is controlled, then the deviation of the terminal action distribution from the pretrained flow policy is controlled as well.
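A toy case makes the relation between the two divergences concrete. The sketch below assumes zero base drift, a constant noise scale, and a deterministic open-loop control $u(\tau)$ sampled on a uniform grid; under those assumptions (which are simplifications, not the setting of the paper) both KLs are available in closed form: the path-space KL is $\frac{1}{2\lambda}\sum_k h\,u_k^2$, while the terminal law is Gaussian with shifted mean, giving $\frac{1}{2\lambda}\big(\sum_k h\,u_k\big)^2$. The function name `path_and_terminal_kl` is hypothetical.

```python
import numpy as np

def path_and_terminal_kl(u_values, lam):
    """Closed-form KLs for a scalar toy process with deterministic control.

    Assumes zero base drift and constant noise scale, with u_values holding the
    control at each of len(u_values) uniform Euler steps on [0, 1].
    """
    u_values = np.asarray(u_values, dtype=float)
    h = 1.0 / len(u_values)
    # Sum of per-step Gaussian KLs with shared covariance.
    path_kl = np.sum(u_values ** 2) * h / (2.0 * lam)
    # Terminal Gaussians differ only in their mean, which integrates the control.
    terminal_kl = (np.sum(u_values) * h) ** 2 / (2.0 * lam)
    return float(path_kl), float(terminal_kl)

num_steps = 100
midpoints = (np.arange(num_steps) + 0.5) / num_steps

# An oscillating control moves the path but barely moves the endpoint.
osc_path_kl, osc_terminal_kl = path_and_terminal_kl(np.cos(2 * np.pi * midpoints), lam=0.5)

# A constant control is the equality case.
const_path_kl, const_terminal_kl = path_and_terminal_kl(np.ones(num_steps), lam=0.5)
```

By the Cauchy–Schwarz inequality the terminal value never exceeds the path value in this toy model, so keeping the path-space quantity within a budget also keeps the terminal one within it.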

TRQAM enforces this constraint by imposing a path-space KL budget and adapting $\lambda$ as a dual variable. In discrete time, with a fixed number of Euler steps of uniform size, each per-step transition of the controlled and base processes is Gaussian with a shared covariance, so the per-step KL has a closed form. Summing over steps and averaging over the sampled trajectories in a batch yields a surrogate path-space KL. TRQAM then maintains an EMA-smoothed estimate of this surrogate and performs projected dual descent on $\lambda$.

Because the path-space KL appears with an inverse dependence on $\lambda$, increasing $\lambda$ when the budget is exceeded tightens the trust region in the SDE itself. This internalization of $\lambda$ distinguishes TRQAM from a merely external regularization scheme.
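The adaptation loop described above can be sketched as follows. This is an illustrative reading, not the paper's exact update: it assumes the per-step KL is $h\,\|u_k\|^2/(2\lambda)$, an additive dual step that raises $\lambda$ when the smoothed KL exceeds the budget, and projection by clipping to a positive interval. The names `surrogate_path_kl` and `dual_step`, and the bounds `lam_min` and `lam_max`, are hypothetical.

```python
import numpy as np

def surrogate_path_kl(controls, lam, h):
    """Batch-averaged sum of per-step Gaussian KLs.

    controls: array of shape (batch, steps, dim) holding u at each Euler step.
    Assumes a per-step KL of h * ||u||^2 / (2 * lam).
    """
    per_step = h * np.sum(controls ** 2, axis=-1) / (2.0 * lam)  # (batch, steps)
    return float(per_step.sum(axis=1).mean())

def dual_step(lam, kl_ema, kl_batch, budget, step_size, ema_rate,
              lam_min=1e-3, lam_max=1e3):
    """One EMA update of the KL estimate followed by a projected step on lam."""
    kl_ema = (1.0 - ema_rate) * kl_ema + ema_rate * kl_batch
    # Raise lam when the smoothed KL is over budget, lower it otherwise.
    lam = lam + step_size * (kl_ema - budget)
    # Projection onto [lam_min, lam_max] keeps lam strictly positive.
    return float(np.clip(lam, lam_min, lam_max)), kl_ema

# Toy usage: unit controls for a batch of 2 trajectories, 4 steps, 3 dimensions.
toy_controls = np.ones((2, 4, 3))
toy_kl = surrogate_path_kl(toy_controls, lam=1.0, h=0.25)
new_lam, new_ema = dual_step(lam=1.0, kl_ema=0.0, kl_batch=toy_kl,
                             budget=0.5, step_size=0.1, ema_rate=1.0)
```

Smoothing the estimate before the dual step is meant to keep $\lambda$ from reacting to single noisy batches; the clipping bounds are a design choice of this sketch.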

4. Stability analysis and theoretical guarantees

The theory reported for TRQAM is organized around three results. The first is Lemma 1, which quantifies the amplification of critic error under exponential tilting. If two $Q$-functions differ uniformly by at most a given tolerance, then the exponentially tilted policies they induce, of the form

$$\pi(a\mid s) \propto \pi_0(a\mid s)\,\exp\!\left(Q(s,a)/\lambda\right),$$

satisfy a corresponding bound on their discrepancy, in which the critic error enters only through its ratio to $\lambda$. In the presentation of TRQAM, this lemma explains why fixed-$\lambda$ adjoint matching is fragile: small critic errors are transformed into potentially large distributional deviations when the implicit temperature is not controlled.
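The effect can be seen on a two-action example. The sketch below assumes tilted policies of the form $\pi \propto \pi_0 \exp(Q/\lambda)$ and compares them against the elementary bound $\mathrm{KL} \le 2\varepsilon/\lambda$, which holds for any two critics within $\varepsilon$ of each other in sup norm; this is a standard bound used here for illustration and is not claimed to be the exact statement of Lemma 1. The critic values and the helper names `tilted_policy` and `kl_divergence` are hypothetical.

```python
import numpy as np

def tilted_policy(base_probs, q_values, lam):
    """Return pi(a) proportional to base_probs(a) * exp(q_values(a) / lam)."""
    logits = np.log(base_probs) + np.asarray(q_values, dtype=float) / lam
    logits = logits - logits.max()  # stabilise the normalisation
    weights = np.exp(logits)
    return weights / weights.sum()

def kl_divergence(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

base = np.array([0.5, 0.5])
q_true = np.array([0.00, 0.04])
q_noisy = np.array([0.05, -0.01])  # a small error that flips the preferred action
eps = float(np.max(np.abs(q_true - q_noisy)))

for lam in (1.0, 0.1, 0.01):
    gap = kl_divergence(tilted_policy(base, q_true, lam),
                        tilted_policy(base, q_noisy, lam))
    print(f"lam={lam}: KL={gap:.4f}, bound 2*eps/lam={2 * eps / lam:.2f}")
```

In this example the same critic error produces a negligible policy change at $\lambda=1$ and a near-complete reversal of the action preference at $\lambda=0.01$, which is the kind of amplification an adaptive $\lambda$ is meant to prevent.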

The second result is the path-space KL identity described above. Its significance is not only formal. By scaling the diffusion with $\sqrt{\lambda}$, the algorithm makes $\lambda$ appear explicitly as an inverse factor in the KL. This gives a direct control variable for the deviation between controlled and base dynamics.

The third result is the comparison between terminal and path-space KL. Because the terminal KL between the fine-tuned and pretrained policies is upper-bounded by the path-space KL, an adaptive mechanism that keeps the path-space KL within its budget also bounds the terminal divergence from the pretrained policy. In the formulation provided for TRQAM, these three results jointly support the claim that adapting $\lambda$ prevents exponential drift due to critic noise and stabilizes off-policy fine-tuning.

A plausible implication is that TRQAM changes the locus of regularization. Rather than only penalizing the resulting policy after improvement, it constrains the entire sampling path. That interpretation is consistent with the emphasis on path-space rather than solely terminal divergences.

5. Algorithmic procedure

The high-level procedure takes as input a pretrained velocity field, a critic $Q$, a replay buffer, a KL budget, a dual stepsize, an EMA rate, initial values for the adapted quantities (including $\lambda$), and a flow step size. Each iteration samples a batch from the replay buffer, updates the critic using a standard TD or expectile loss, and then performs a policy or velocity update for each state in the batch.

The policy update has three stated substeps. First, a trajectory is sampled via the controlled SDE using Euler discretization. Second, a backward adjoint ODE is solved with a terminal condition determined by the critic at the trajectory endpoint $X_1$. Third, the velocity-field parameters are updated by minimizing the adjoint-matching loss.

The trust-region update then estimates the surrogate path-space KL on the batch, forms its EMA-smoothed statistic, and applies the projected dual step to $\lambda$.

This workflow preserves the central computational advantage inherited from QAM—the use of adjoint matching instead of full-chain backpropagation—while adding an explicit trust-region controller. The presentation of the algorithm identifies the trust-region steps as the distinctive additions relative to QAM.

6. Empirical performance, interpretation, and limitations

The reported empirical evaluation covers two suites. The first is OGBench, described as 50 offline goal-conditioned tasks reformulated as reward-based single tasks across 10 domains: antmaze-large, antmaze-giant, humanoidmaze-medium, humanoidmaze-large, scene, puzzle-3x3, puzzle-4x4, cube-double, cube-triple, and cube-quadruple. In this setting, training consists of 1 million offline fine-tuning steps followed by 0.5 million steps of online RL. The baselines are FQL, CGQL-Linex, DSRL, IFQL, QAM, and QAM-E. The second suite is Robomimic lift, can, and square, used as a stress test for demonstration manipulation (Dong et al., 26 May 2026).

Across 8 seeds, the principal quantitative result is an offline OGBench success rate of 68% for TRQAM, versus 46% for the strongest prior baseline. The gains are reported to be largest on long-horizon and combinatorial tasks. A further comparison between scratch and pretrained initialization indicates that only TRQAM effectively leverages the pretrained policy $\pi_0$, whereas QAM and QAM-E produce curves nearly identical to those obtained when learning from scratch.

The ablation results are central to the interpretation of the method. They indicate, first, that adaptive $\lambda$, whether implemented externally or internally, outperforms fixed $\lambda$. Second, they indicate that internal $\lambda$ (the defining choice of TRQAM) tracks the KL budget tightly, whereas external $\lambda$ cannot enforce the budget and leads to worse performance. This directly addresses a likely misconception that trust-region behavior can be obtained equivalently by adding an external penalty without modifying the dynamics.

Sensitivity analysis further reports that success rates vary smoothly with the KL budget, and that the optimal budget depends on task structure: tight budgets are favored for navigation and manipulation tasks, except for puzzle-4x4, which benefits from larger budgets. This suggests that the budget functions as an interpretable control over permissible deviation from the pretrained policy.

The stated advantages of TRQAM are a principled exact trust region, stability against critic-error amplification, and adaptivity of $\lambda$ to track a prescribed KL target, including across offline-to-online transitions or time-varying budgets. The stated limitations are that adjoint matching requires vector–Jacobian products at each backward ODE step, so cost scales with model size, and that the KL budget still requires per-domain selection, albeit with a smooth and interpretable effect. Proposed extensions include reducing vector–Jacobian-product overhead through efficient adjoint solvers or velocity-field architectures, extending the method to diffusion policies with many more denoising steps through variance-reduced discretizations, and combining TRQAM with off-policy value regularizers such as Conservative Q-Learning for greater safety under large distribution shift (Dong et al., 26 May 2026).
