
Reward-Aware Consistency Trajectory Distillation

Updated 21 January 2026
  • The paper introduces RACTD, which embeds explicit reward optimization into consistency distillation to accelerate diffusion-based generative models for high-reward outcomes.
  • The methodology combines CTM, DSM, and a reward-aware objective to achieve significant inference speed-ups while preserving generative diversity.
  • Empirical results in offline RL and text-to-video tasks exhibit improved sample quality, faster inference, and robust control over high-reward mode selection.

Reward-Aware Consistency Trajectory Distillation (RACTD) is a framework for accelerating and improving the performance of diffusion-based generative models in reinforcement learning and structured generative modeling domains by embedding explicit reward optimization into consistency distillation. RACTD addresses the inefficiency of standard diffusion models—which require many denoising steps for sampling—by distilling multi-step generative processes into fast, reward-sensitive student models that retain the generative diversity of the teacher but focus probability mass on high-reward outcomes. RACTD has been developed and empirically validated in both offline reinforcement learning (Duan et al., 9 Jun 2025) and text-to-video generation (Ding et al., 2024).

1. Conceptual Foundations and Motivation

Diffusion-based models algorithmically reverse a forward noising process to sample structured outputs, such as future action trajectories or high-dimensional data. These models represent $p(\mathbf{x}_0 \mid \mathrm{cond})$ by iteratively denoising from a noisy initial state, capturing complex, multi-modal behaviors and enabling strong empirical performance. However, the iterative nature typically entails tens to hundreds of neural function evaluations (NFEs) per sample, severely limiting practical inference speed.

Consistency models, including Consistency Trajectory Models (CTM), sidestep repeated denoising by learning student mappings $G_\theta(\mathbf{x}_t, t \to u)$ that transport data between noise levels in a single step. While CTM-based distillation enables single- or few-step generation and substantial acceleration, standard distillation approaches naively clone all modes of the teacher's policy, inheriting undesirable low-reward or suboptimal behaviors—especially problematic with offline or low-quality demonstration data.

RACTD augments the distillation process by guiding the student to select and amplify high-reward modes via a dedicated reward component. This integration achieves improved sample quality, performance metrics, and dramatic speed-ups by enabling reward-aware one-step generation.

2. Mathematical Framework

2.1. Diffusion Processes

Diffusion models are trained by learning to invert a stochastic differential equation (SDE):

$$d\mathbf{x} = f(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w}$$

The reverse dynamics are simulated via a probability flow ODE:

$$d\mathbf{x} = \left[ f(\mathbf{x}, t) - \tfrac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] dt$$
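As a concrete illustration (not taken from the papers), the probability-flow ODE can be integrated with simple Euler steps. The sketch below uses a toy 1-D Gaussian data distribution, for which the marginal score is available in closed form, so the ODE demonstrably transports noisy samples back toward the data:

```python
import numpy as np

def pf_ode_euler(x, score_fn, f_fn, g_fn, t_grid):
    """Integrate dx = [f(x,t) - 0.5 g(t)^2 * score(x,t)] dt with Euler steps
    along a decreasing time grid (from noise level T toward 0)."""
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0  # negative: integrating backward in noise level
        drift = f_fn(x, t0) - 0.5 * g_fn(t0) ** 2 * score_fn(x, t0)
        x = x + drift * dt
    return x

# Toy VE diffusion: f = 0, g(t) = sqrt(2t), so p_t = N(mu, s0^2 + t^2)
# when the data is N(mu, s0^2); the marginal score is known exactly.
mu, s0, T = 3.0, 0.1, 5.0
f_fn = lambda x, t: np.zeros_like(x)
g_fn = lambda t: np.sqrt(2.0 * t)
score_fn = lambda x, t: (mu - x) / (s0 ** 2 + t ** 2)

rng = np.random.default_rng(0)
x_T = rng.normal(mu, np.sqrt(s0 ** 2 + T ** 2), size=1000)
t_grid = np.linspace(T, 1e-3, 200)
x_0 = pf_ode_euler(x_T, score_fn, f_fn, g_fn, t_grid)
# x_0 concentrates near mu, with variance contracted back toward s0^2.
```

All names here (`pf_ode_euler`, the toy score) are illustrative choices, not symbols from the papers.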

Training is performed on an EDM-style loss:

$$\mathcal{L}_{\text{EDM}} = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t \mid \mathbf{x}_0}\left[ d(\mathbf{x}_0, D_\phi(\mathbf{x}_t, t)) \right]$$
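A minimal sketch of this loss with squared-error distance $d$, assuming a toy Gaussian data distribution where the Bayes-optimal denoiser has a closed form (the `oracle` and `identity` denoisers are hypothetical stand-ins for $D_\phi$):

```python
import numpy as np

def edm_loss(denoiser, x0, sigma_t, rng):
    """EDM-style objective: corrupt clean samples x0 at noise level sigma_t,
    then measure the squared reconstruction error of the denoiser."""
    x_t = x0 + sigma_t * rng.normal(size=x0.shape)
    return np.mean((denoiser(x_t, sigma_t) - x0) ** 2)

# Toy check on 1-D data N(mu, s0^2): the posterior-mean denoiser should
# beat the identity map, approaching the posterior variance
# s0^2 * sigma^2 / (s0^2 + sigma^2).
mu, s0, sigma_t = 0.0, 1.0, 2.0
rng = np.random.default_rng(1)
x0 = rng.normal(mu, s0, size=10000)

oracle = lambda x_t, s: mu + (s0 ** 2 / (s0 ** 2 + s ** 2)) * (x_t - mu)
identity = lambda x_t, s: x_t

loss_oracle = edm_loss(oracle, x0, sigma_t, rng)
loss_identity = edm_loss(identity, x0, sigma_t, rng)
```

The gap between the two losses is exactly what teacher training exploits: $D_\phi$ is pushed toward the posterior mean at every noise level.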

2.2. Consistency Distillation

A student $G_\theta$ is trained to perform trajectory denoising in a single or small number of steps, aligning distributions produced by a teacher $D_\phi$ and itself. CTM and Denoising Score Matching (DSM) losses are employed:

  • CTM Loss:

$$\mathcal{L}_{\text{CTM}} = \mathbb{E}\left[ d\left( G_{\mathrm{sg}(\theta)}(\hat{\mathbf{x}}_k^{(t)}, k \to 0),\; G_{\mathrm{sg}(\theta)}(\mathbf{x}_k^{(t,u)}, k \to 0) \right) \right]$$

where $\hat{\mathbf{x}}_k^{(t)} = G_\theta(\mathbf{x}_t, t \to k)$ and $\mathbf{x}_k^{(t,u)} = G_{\mathrm{sg}(\theta)}(\mathrm{Solver}(\mathbf{x}_t, t \to u; \phi), u \to k)$.

  • DSM Loss:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t \mid \mathbf{x}_0}\left[ d(\mathbf{x}_0, G_\theta(\mathbf{x}_t, t \to 0)) \right]$$
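The two objectives can be sketched together with squared-error distance. In this illustrative toy (not the papers' implementation), `G_target` stands in for the stop-gradient copy $G_{\mathrm{sg}(\theta)}$ and `solver` for the teacher ODE solver; for point-mass data the ideal transport map makes both losses vanish:

```python
import numpy as np

def ctm_dsm_losses(G_student, G_target, solver, x_t, t, u, k, x0):
    """Squared-error sketch of the CTM and DSM objectives."""
    x_hat_k = G_student(x_t, t, k)            # student jump t -> k
    x_k = G_target(solver(x_t, t, u), u, k)   # teacher step t -> u, then u -> k
    # Both branches are mapped to the clean endpoint (k -> 0) before comparison.
    l_ctm = np.mean((G_target(x_hat_k, k, 0.0) - G_target(x_k, k, 0.0)) ** 2)
    # DSM anchors the student's full jump t -> 0 to the clean sample.
    l_dsm = np.mean((G_student(x_t, t, 0.0) - x0) ** 2)
    return l_ctm, l_dsm

# Toy check: for point-mass data at 0 with x_t = t * eps, the ideal transport
# is G(x, t, u) = (u / t) * x, and the exact solver coincides with it.
G_ideal = lambda x, t, u: (u / t) * x
rng = np.random.default_rng(2)
x_t = 1.0 * rng.normal(size=8)
x0 = np.zeros(8)
l_ctm, l_dsm = ctm_dsm_losses(G_ideal, G_ideal, G_ideal, x_t, 1.0, 0.5, 0.25, x0)
```

With an imperfect student, `l_ctm` penalizes disagreement between the direct jump and the solver-guided path, while `l_dsm` keeps the student anchored to the data.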

2.3. Reward-Aware Objective

A return-to-go reward model $R_\psi$ provides a dense, differentiable reward for predicted (denoised) configurations:

$$\mathcal{L}_{\text{Reward}} = -R_\psi(s_n, \hat{a}_n)$$

where $(s_n, \hat{a}_n)$ corresponds to the initial state and first generated action. In the context of high-dimensional generative modeling (e.g., video), a latent reward model $R^l_\varphi$ serves as a differentiable surrogate in latent space for a reference reward metric that may itself be non-differentiable with respect to pixel-level outputs (Ding et al., 2024).

2.4. Combined Training Loss

The aggregate RACTD objective is:

$$\mathcal{L}_{\text{RACTD}} = \alpha\, \mathcal{L}_{\text{CTM}} + \beta\, \mathcal{L}_{\text{DSM}} + \sigma\, \mathcal{L}_{\text{Reward}}$$

Hyperparameters $\alpha, \beta, \sigma > 0$ regulate trade-offs between mode coverage, data fidelity, and reward maximization. Typical settings use $\alpha = \beta = \sigma = 1$, though $\sigma$ is ablated for stability and dataset quality.
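As a one-line sketch of how the terms combine (illustrative only; the weights mirror $\alpha, \beta, \sigma$ above):

```python
def ractd_loss(l_ctm, l_dsm, predicted_return, alpha=1.0, beta=1.0, sigma=1.0):
    """Aggregate RACTD objective. L_Reward = -R_psi, so a higher predicted
    return lowers the total loss; sigma trades reward pressure against
    mode coverage (CTM) and data fidelity (DSM)."""
    return alpha * l_ctm + beta * l_dsm - sigma * predicted_return

total = ractd_loss(0.25, 0.25, 1.5)   # 0.25 + 0.25 - 1.5 = -1.0
no_reward = ractd_loss(0.25, 0.25, 1.5, sigma=0.0)  # pure distillation: 0.5
```

Setting `sigma=0` recovers plain consistency distillation, which is exactly the ablation axis the paper explores.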

In DOLLAR (Ding et al., 2024), the loss is a triplet with VSD (Variational Score Distillation), CD (Consistency Distillation), and reward terms:

$$L(\theta) = \lambda_{\text{VSD}} L_{\text{VSD}}(\theta) + \lambda_{\text{CD}} L_{\text{CD}}(\theta) + \lambda_{\text{reward}} L_{\text{reward}}(\theta; \varphi)$$

3. Training Algorithms and Implementation

The RACTD workflow is modular, involving separately trained components:

  1. Teacher Training: An unconditional diffusion model $D_\phi$ is trained on offline trajectories or a large real/generated dataset using the standard EDM loss.
  2. Reward Model: Independently train a reward predictor in the clean data space (state-action pairs in RL, latent encodings for video) to regress discounted return or surrogate rewards.
  3. Student Distillation:
    • Noisy action/state segments (or latents) are corrupted by the forward diffusion process.
    • The student $G_\theta$ is trained via minibatched distillation, using the CTM, DSM, and reward-aware losses as detailed above.
    • The reward network supervises the student only on denoised (clean) outputs, ensuring stability and avoiding sensitivity to noise.
    • For video, reward fine-tuning in latent space is possible even with non-differentiable external rewards due to the compact surrogate model (Ding et al., 2024).

Pseudocode and exact implementation steps, including hyperparameter settings and update sequences, are found in the respective manuscripts’ algorithm boxes.

One-step or few-step student inference enables direct sampling: for RL, a single call $G_\theta(\mathbf{x}_T, T \to 0)$ generates an action sequence; for video, the number of student sampling steps $K$ is chosen to meet the target quality/speed trade-off.
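The single-call inference path can be sketched as follows (a toy, assuming a trained student; `G_toy` below is a hypothetical ideal transport map for point-mass data, not the papers' network):

```python
import numpy as np

def one_step_sample(G_student, shape, T, rng):
    """One-step inference: draw x_T from the prior at noise level T and map it
    to a clean sample with a single student call (1 NFE)."""
    x_T = T * rng.normal(size=shape)
    return G_student(x_T, T, 0.0)

# Toy student: ideal transport for point-mass data at the origin.
G_toy = lambda x, t, u: (u / t) * x
rng = np.random.default_rng(3)
actions = one_step_sample(G_toy, (4, 2), T=80.0, rng=rng)  # e.g. 4-step action plan
```

This is the structural reason for the reported speed-ups: sampling cost collapses from tens of solver steps to one forward pass, regardless of the noise level $T$ used for the prior.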

4. Theoretical Rationale and Interpretation

RACTD’s primary theoretical innovation is the direct steering of consistency-distilled students toward high-reward modes, rather than naively covering all modes present in demonstration data. In standard settings, consistency models match the teacher’s output distribution, inheriting suboptimal and low-return behaviors from imperfect datasets. The reward-aware term amplifies gradients toward configurations with high predicted or surrogate reward, effecting a form of reward-guided mode selection.

Stability is achieved by decoupling the reward model from the noisy input: reward guidance is applied to denoised samples only, reducing the need for noise-aware reward approximators and simplifying optimization. Convergence is supported empirically by ablations: a moderate reward weight ($\sigma$) yields stable training and improved returns, while an overly large $\sigma$ can destabilize the loss landscape (Duan et al., 9 Jun 2025).

In DOLLAR, extensions to non-differentiable reward metrics in video generation are obtained by learning a differentiable latent reward surrogate, retaining the ability to backpropagate reward gradients for fine-tuning and enabling reward-aware distillation for arbitrary metrics (Ding et al., 2024).
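The surrogate idea can be illustrated with a deliberately simple stand-in: fit a differentiable model to offline scores of a reference metric, then use its gradients for fine-tuning. The linear least-squares fit below is a hypothetical sketch, not DOLLAR's actual latent reward network:

```python
import numpy as np

def fit_latent_reward(latents, scores):
    """Fit a linear surrogate R^l(z) = w @ z + b to reference-metric scores
    evaluated offline (the metric itself may be non-differentiable).
    Least squares stands in for gradient training of a small reward network."""
    Z = np.hstack([latents, np.ones((len(latents), 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(Z, scores, rcond=None)
    w, b = coef[:-1], coef[-1]
    reward = lambda z: z @ w + b                          # differentiable in z
    reward_grad = lambda z: np.broadcast_to(w, z.shape)   # gradient for fine-tuning
    return reward, reward_grad

# Toy check: a metric that happens to be linear in the latent is recovered,
# and the surrogate then supplies usable reward gradients.
rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 2))
scores = 2.0 * Z[:, 0] - Z[:, 1] + 3.0
reward, reward_grad = fit_latent_reward(Z, scores)
```

The point is architectural: once the surrogate is differentiable in latent space, reward gradients can flow into the student even when the original metric cannot be backpropagated through.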

5. Empirical Performance, Ablations, and Scope

5.1. Offline Reinforcement Learning

Benchmarks include D4RL Gym-MuJoCo (hopper, walker2d, halfcheetah) and Maze2d (long-horizon planning) tasks (Duan et al., 9 Jun 2025). Performance indicators:

  • Sample Efficiency: RACTD achieves 97.6 average score on MuJoCo, exceeding the prior best Consistency-AC at 89.8 (+8.7% relative).
  • Inference Speed: For hopper-medium-replay, RACTD enables one-step sampling (1 NFE, 0.015s), compared to teacher EDM (80 NFE, 2.13s) and prior Diffuser (20 NFE, 0.64s). This corresponds to up to 142x speed-up over EDM and 43x over Diffuser.
  • Long-Horizon Planning: On Maze2d Large, RACTD outperforms Diffuser by +11.6% in average reward, with a 114x speedup (0.049s vs 5.57s/sample).
  • Ablations: Varying $\sigma$ shows optimal performance at moderate values ($\sigma \sim 0.7$); excessive reward weighting ($\sigma = 1.5$) can cause oscillations.

5.2. Text-to-Video Generation (DOLLAR)

  • Video Quality: Four-step RACTD student exceeds the teacher in VBench score (82.57 vs 80.25) and surpasses leading T2V systems across 9/16 submetrics.
  • Generation Speed: Student 1-step inference achieves 278x acceleration over teacher 50-step DDIM sampling; four-step achieves 7.7x total wallclock speedup.
  • Reward Tuning: Latent reward model (LRM) uses 1.3x less memory than direct policy optimization and achieves higher VBench scores. Reward fine-tuning in latent space is preferred by human evaluators over DDPO-based approaches.
  • Diversity and Mode Coverage: Ablations show VSD-only students suffer mode collapse; the combination of VSD, CD, and reward-aware tuning achieves both high diversity and high quality.
| Domain | Teacher Steps / Time | RACTD Steps / Time | Relative Speedup | Performance Delta |
| --- | --- | --- | --- | --- |
| RL (MuJoCo, hopper) | 80 NFE / 2.13 s | 1 NFE / 0.015 s | 142x | +8.7% |
| Planning (Maze2d Large) | 256 NFE / 5.57 s (Diffuser) | 1 NFE / 0.049 s | 114x | +11.6% |
| Video (DOLLAR, VBench) | 50 NFE (normalized 100%) | 1-step: 7.45% | 13.4x–278x | +2–3 pts VBench |

6. Comparison with Prior Approaches

RACTD differs from prior actor-critic or guided-diffusion approaches in several ways:

  • Training Decoupling: The reward model and diffusion components are trained independently, avoiding the instability and complexity of actor-critic loops or noise-aware reward regressors (Duan et al., 9 Jun 2025).
  • Inference Speed: Consistency-based student generation collapses sampling to a single step or a handful of steps, vastly improving wall-clock throughput.
  • Reward Integration: The addition of a reward-aware loss steers generation toward meaningful outcomes, as opposed to unfiltered behavior cloning.

DOLLAR’s reward-aware consistency extends these benefits to generative modeling domains with complex, often non-differentiable, evaluation metrics. Compared to policy optimization (e.g., DDPO), the latent reward approach yields lower memory usage and superior qualitative performance (Ding et al., 2024).

Limitations include the need for separate training of teacher, student, and reward models and sensitivity to reward loss weighting, with excessively high reward weight potentially destabilizing optimization. Open directions include more robust distillation with alternative reward structures and application to multitask or online contexts.

7. Applications, Limitations, and Future Directions

RACTD has proven effective in:

  • Offline Reinforcement Learning: Empowering fast, high-performing action sequence generation in long-horizon planning and control tasks by distilling reward-aimed policies from generalist, multi-modal demonstration distributions.
  • Structured Generative Modeling: Enabling high-quality, few-step text-to-video synthesis with explicit control over reward metrics, including human preferences or heuristics mapped via surrogate latent reward models (Ding et al., 2024).

Limitations include the complexity of tuning reward scaling, the necessity for three-network training (teacher, student, reward model), and potential instability if $\sigma$ is set too large. For policy learning, a plausible implication is that extending RACTD to online or multi-task reinforcement learning will require advancements in reward guidance strategies and distillation robustness. For generative models, integrating non-differentiable reward metrics at scale and aligning generated samples with diverse, evolving quality criteria remains an area for continued research.

Proposed future directions emphasize stabilizing distillation under strong reward guidance, expanding compatibility with non-differentiable or user-centric reward signals, and scaling RACTD to broader classes of tasks with heterogeneous or weak supervision (Duan et al., 9 Jun 2025, Ding et al., 2024).
