DRaFT: Direct Reward Fine-Tuning
- DRaFT is a fine-tuning paradigm that directly integrates scalar reward signals into training to maximize the expected quality of model outputs.
- It employs differentiable, reward-weighted loss reparameterizations and techniques like clipping and KL regularization to stabilize training and prevent gradient issues.
- DRaFT has been successfully applied across diverse domains—including LLMs, diffusion models, and control systems—demonstrating significant improvements in performance and sample efficiency.
Direct Reward Fine-Tuning (DRaFT) is a general fine-tuning paradigm for generative models that directly incorporates scalar reward signals—potentially learned or hand-crafted—into the training objective, thereby maximizing the expected reward of model outputs. Unlike reinforcement learning algorithms that rely on policy gradients or actor-critic structures, and unlike supervised fine-tuning that ignores reward magnitude, DRaFT leverages pathwise (differentiable) updates whenever possible and explicitly reweights losses according to the reward, allowing both "good" and "bad" samples to drive model optimization via appropriately stabilized losses or gradient flows. This approach is instantiated across LLMs, diffusion models for images and video, scientific models, and control systems.
1. Motivation, Fundamental Challenges, and Unifying Principles
The motivation for DRaFT arises from the inefficiency and wasted information in traditional supervised fine-tuning (SFT) and rejection-sampling fine-tuning (RFT). SFT treats all observed examples equally, requiring expensive data collection and discarding valuable information from negative samples or errors. RFT admits only positive, high-reward samples and discards the rest, which leads to data inefficiency and limited use of the available information (Liu et al., 14 Jan 2026).
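A naïve reward-weighted loss of the form $\mathcal{L}(\theta) = -\mathbb{E}\left[r(x, y)\,\log \pi_\theta(y \mid x)\right]$ unifies positive reinforcement and suppression but introduces two pathologies:
- Gradient explosion/collapse: For negative-reward samples ($r < 0$), pushing $\pi_\theta(y \mid x) \to 0$ causes the gradient to blow up, since $\left|\partial \mathcal{L} / \partial \pi_\theta\right| = |r| / \pi_\theta \to \infty$ as $\pi_\theta(y \mid x) \to 0$.
- Unbounded objective: The loss on such samples diverges to $-\infty$, destabilizing optimization (Liu et al., 14 Jan 2026).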
These issues are resolved by loss reparameterizations (e.g., separating log-probability for positive rewards and linear penalty for negatives), as well as by careful clipping, KL regularization, and truncated backpropagation in continuous domains (Clark et al., 2023, Yuan et al., 24 Sep 2025, Hu et al., 21 Jan 2026).
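The two pathologies can be checked numerically. The sketch below assumes the naive per-sample loss is $-r \log \pi$; `naive_loss` and `naive_grad_wrt_prob` are illustrative names that do not come from the cited papers. It only shows how a negative-reward sample behaves as its probability shrinks.

```python
import math

def naive_loss(reward: float, prob: float) -> float:
    """Assumed naive reward-weighted loss for one sample: -r * log(pi)."""
    return -reward * math.log(prob)

def naive_grad_wrt_prob(reward: float, prob: float) -> float:
    """Derivative of the assumed loss with respect to the probability: -r / pi."""
    return -reward / prob

# A negative-reward sample whose probability is driven towards zero.
for prob in (1e-1, 1e-3, 1e-6):
    loss = naive_loss(-1.0, prob)
    grad = naive_grad_wrt_prob(-1.0, prob)
    print(f"pi={prob:.0e}  loss={loss:.2f}  dloss/dpi={grad:.1e}")
```

Under this assumed form, the loss keeps decreasing as the probability shrinks while the derivative grows like $1/\pi$, so the optimizer is never told to stop suppressing the sample.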
2. DRaFT for LLMs: Reward-Informed Fine-Tuning
In language modeling, DRaFT is operationalized through Reward-Informed Fine-Tuning (RIFT), which stabilizes the naïve reward-weighted loss by partitioning the data into positive samples ($r > 0$) and negative samples ($r < 0$) (Liu et al., 14 Jan 2026). The stabilized loss keeps the reward-weighted log-probability term for positive samples and applies a linear penalty to negative samples. Because $\pi_\theta(y \mid x) \in [0, 1]$, the negative-sample term stays within a bounded range, which keeps the objective bounded and prevents the training pathologies described above (Liu et al., 14 Jan 2026).
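A minimal sketch of a loss with this positive/negative split, assuming the positive term is $-r \log \pi$ and the negative term is $|r|\,\pi$ (linear in the probability). The exact form and constants used in RIFT may differ; `split_loss` and `batch_loss` are hypothetical helpers for illustration only.

```python
import math

def split_loss(reward: float, prob: float) -> float:
    """Per-sample loss under the assumed split.

    Positive reward: -r * log(pi), the usual reward-weighted log-likelihood.
    Otherwise:       |r| * pi, a penalty linear in the probability (assumption).
    """
    if reward > 0:
        return -reward * math.log(prob)
    return abs(reward) * prob

def batch_loss(samples):
    """Mean loss over (reward, probability) pairs."""
    return sum(split_loss(r, p) for r, p in samples) / len(samples)

# The negative-reward term stays inside [0, |r|] for any probability in [0, 1].
print(split_loss(-0.5, 0.0), split_loss(-0.5, 1.0))
print(batch_loss([(1.0, 0.8), (-0.5, 0.1)]))
```

With this form, minimizing the negative term still pushes the probability of a bad sample down, but its derivative with respect to the probability is the constant $|r|$ rather than something that grows as the probability approaches zero.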
Empirical studies across mathematical reasoning datasets (e.g., GSM8K, MATH) show that RIFT consistently attains higher top-$k$ accuracy (Mean@8, Pass@8) than RFT and even DPO, with gains of up to 19.1% absolute improvement in Pass@8 (Liu et al., 14 Jan 2026). Optimal performance is found with moderate negative reward scaling and a roll-out size in the range 4–8 to balance diversity and sample efficiency.
Key best practices include:
- Retaining log-likelihood weighting for positive rewards, linearizing penalties for negatives.
- Using bounded negative reward magnitudes.
- Reward normalization or constant scaling for stability.
- Modularity: RIFT can be appended after SFT or RFT, rapidly improving policies independent of initial conditions.
3. DRaFT in Diffusion and Flow-based Generative Models
DRaFT extends to continuous domains by direct backpropagation of reward gradients through the sampling process (Clark et al., 2023, Yuan et al., 24 Sep 2025, Potaptchik et al., 26 Dec 2025). The canonical objective is to maximize the expected final-state reward,
$$J(\theta) = \mathbb{E}_{c,\, x_T}\left[\, r\big(x_0(\theta, c, x_T),\, c\big) \right],$$
with $x_0(\theta, c, x_T)$ denoting the sampled output from the (possibly conditional) diffusion process given condition $c$ and initial noise $x_T$, and $r$ a differentiable reward.
Core algorithmic variants include:
- Full-chain backpropagation (Clark et al., 2023): Propagate gradients through all denoising steps, which is memory-intensive but yields optimal reward maximization.
- DRaFT-K: Truncate backpropagation to the last $K$ steps to reduce compute and mitigate vanishing/exploding gradients. Typically, $K = 1$ provides rapid convergence.
- DRaFT-LV: For $K = 1$, leverage multiple noisy completions to form a low-variance Monte Carlo estimator of the reward gradient.
Augmentations such as LoRA adapters reduce memory and parameter footprint. Layerwise or time-windowed truncated backpropagation (Yuan et al., 24 Sep 2025) is highly effective for scientific or physics-informed generative tasks, e.g., enforcing PDE residual minimization.
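The truncation idea can be illustrated with a scalar toy chain. In the sketch below, the update rule `x_{t-1} = decay * x_t + theta`, the quadratic reward, and all function names are assumptions chosen so the gradient can be written by hand; it is not a diffusion sampler. For this scalar case the sensitivity is accumulated step by step, which gives the same value as backpropagating through only the last `k` steps.

```python
def sample_chain(theta: float, x_T: float, num_steps: int, decay: float = 0.9) -> list:
    """Toy deterministic 'sampler': x_{t-1} = decay * x_t + theta.

    Returns [x_T, x_{T-1}, ..., x_0]. A stand-in for denoising steps only.
    """
    xs = [x_T]
    for _ in range(num_steps):
        xs.append(decay * xs[-1] + theta)
    return xs

def reward(x0: float, target: float = 1.0) -> float:
    """Toy differentiable reward: r(x0) = -(x0 - target)**2."""
    return -(x0 - target) ** 2

def reward_grad(x0: float, target: float = 1.0) -> float:
    """Derivative of the toy reward with respect to x0."""
    return -2.0 * (x0 - target)

def truncated_grad(theta, x_T, num_steps, k, decay=0.9):
    """d r(x_0) / d theta, differentiating through only the last k steps.

    The state entering the last k steps is treated as a constant
    (stop-gradient), so its sensitivity to theta starts at zero.
    """
    x0 = sample_chain(theta, x_T, num_steps, decay)[-1]
    dx_dtheta = 0.0
    for _ in range(k):
        # Chain rule for x_{t-1} = decay * x_t + theta.
        dx_dtheta = decay * dx_dtheta + 1.0
    return reward_grad(x0) * dx_dtheta

full = truncated_grad(0.0, 2.0, num_steps=50, k=50)   # full-chain gradient
last1 = truncated_grad(0.0, 2.0, num_steps=50, k=1)   # K = 1 truncation
print(full, last1)
```

In this toy the truncated gradient is a biased, smaller-magnitude estimate that keeps the sign of the full gradient, while needing the sensitivity of only one step instead of fifty.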
Empirically, DRaFT improves aesthetic and human-preference metrics (e.g., LAION aesthetic score, HPSv2), physical fidelity (>10× reduction in residuals vs. DPS guidance), and sample efficiency (up to 2× cost reduction with layerwise truncation) (Clark et al., 2023, Yuan et al., 24 Sep 2025).
4. Reward Structure, Surrogate Losses, and Bandit Perspectives
When the reward is deterministic and outcome-level (e.g., binary correctness in LLMs), DRaFT is theoretically underpinned by multi-armed bandit learning in extremely large discrete action spaces (Hu et al., 21 Jan 2026). Each model output is viewed as an arm, and direct reward signals suffice to drive the policy towards the optimal set, with provable regret bounds.
Policy updates can adopt either:
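- Unclipped Monte Carlo policy gradients: $\nabla_\theta J(\theta) = \mathbb{E}\left[ r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$.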
- PPO/GRPO-type clipped surrogates to stabilize updates.
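The two update styles can be sketched on a small softmax bandit. This is a minimal illustration under assumptions: three arms, a deterministic binary reward, one rollout per step, and hypothetical names (`reinforce_step`, `clipped_surrogate`) and hyperparameters (`lr`, `eps`) that are not taken from the cited work.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One unclipped Monte Carlo policy-gradient step with a single rollout.

    Uses grad_logits log pi(a) = onehot(a) - pi, scaled by the raw reward.
    """
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(action)
    return [
        z + lr * r * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    # Deterministic binary reward: only arm 2 is "correct".
    logits = reinforce_step(logits, lambda a: 1.0 if a == 2 else 0.0, rng)
final_probs = softmax(logits)
print(final_probs)
```

With a non-negative binary reward, each raw-reward update can only raise the probability of the rewarded arm, which is one way to see why a baseline is not strictly needed in this simple setting. The clipped surrogate instead caps how much a single sample can move the policy ratio away from 1.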
Empirical ablations reveal:
- Baselines or advantage estimators are unnecessary in moderate-difficulty regimes.
- Negative reward scaling can destabilize learning, with better sample efficiency achieved by keeping rewards non-negative (Hu et al., 21 Jan 2026).
- The simple “one rollout per batch, raw reward” regime works best except on extremely hard data, where advantage-based methods help marginally (Hu et al., 21 Jan 2026).
5. Task-specific and Domain-focused Instantiations
DRaFT is now used in reward-based fine-tuning for complex control, vision, and scientific generative modeling:
- Diffusion-based Control: Reward gradients propagate through the DDPM sampling chain, optionally combined with supervised losses or KL/entropy regularization. Empirical gains on D4RL, MetaWorld, and 1D navigation tasks consistently show 20–86% improvement in return over base or SFT/DPO models (Huh et al., 16 Feb 2025).
- Video Consistency and 3D Pose Reconstruction: In video diffusion, DRaFT with temporal-frequency domain rewards (VCD) fine-tunes models for significantly reduced temporal flicker and higher consistency to conditioning images (e.g., 0.5–1.0% absolute gain on benchmark metrics) (Aoshima et al., 22 Oct 2025). For 3D human pose, the DrPose algorithm optimizes a differentiable PoseScore, yielding superior geometry and appearance scores across standard and in-the-wild datasets (Do et al., 3 Mar 2026).
- Physics-Informed Generative Modeling: Physics-Informed Reward Fine-Tuning (PIRF) implements DRaFT for scientific diffusion models, restricting backpropagation to spatiotemporally local windows and updating only top U-Net layers, thus enhancing physical enforcement without test-time cost (Yuan et al., 24 Sep 2025).
- Tilt Matching for Flow Models: Tilt Matching presents a DRaFT variant for flow matching, realizing reward-tilted velocities using cumulant expansions and conditional covariances, eschewing backprop through entire trajectories or reward gradients (Potaptchik et al., 26 Dec 2025).
6. Comparison with Alternative Methods and Empirical Outcomes
Relative to standard SFT and preference methods (e.g., DPO) (Mukherjee et al., 8 Jun 2025), DRaFT directly optimizes a lower bound on the expected reward, uses a single scalar weight per trajectory, and avoids the variance and instability of importance weights or ratio penalties. Across LLM and RL benchmarks (ARC, OpenBookQA, MMLU, etc.), DRaFT outperforms best-of-$N$ SFT and DPO in optimized reward and accuracy, often with no implementation overhead (Mukherjee et al., 8 Jun 2025).
In controlled ablation studies, truncation schedules and reward normalization further improve stability and convergence. However, careful balancing of the reward weightings in diffusion objectives against entropy or other regularization terms is essential to prevent overspecialization or mode collapse.
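The trade-off between reward and regularization can be made concrete with a discrete toy example. The sketch assumes one common form of regularized objective, expected reward minus `beta` times the KL divergence to a reference distribution; the cited works may weight or define their regularizers differently, and all names and numbers here are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_objective(p, p_ref, rewards, beta):
    """Expected reward minus beta * KL(p || p_ref); beta is an assumed knob."""
    expected_reward = sum(pi * r for pi, r in zip(p, rewards))
    return expected_reward - beta * kl_divergence(p, p_ref)

p_ref = [0.25, 0.25, 0.25, 0.25]      # reference (pre-fine-tuning) distribution
rewards = [0.0, 0.2, 0.9, 1.0]
collapsed = [0.0, 0.0, 0.0, 1.0]      # all mass on the top-reward output
spread = [0.05, 0.15, 0.40, 0.40]     # keeps some diversity

for beta in (0.0, 0.5):
    print(beta,
          regularized_objective(collapsed, p_ref, rewards, beta),
          regularized_objective(spread, p_ref, rewards, beta))
```

Without regularization (`beta = 0.0`) the collapsed distribution scores higher, whereas at `beta = 0.5` the more diverse distribution wins, which is the balance the paragraph above refers to.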
7. Limitations, Best Practices, and Future Directions
Key limitations are:
- For LLMs, negative rewards generally destabilize learning unless stabilized as in RIFT (Liu et al., 14 Jan 2026, Hu et al., 21 Jan 2026).
- For diffusion models, DRaFT currently requires differentiable rewards; black-box objectives are not directly tractable without RL estimators (Clark et al., 2023, Yuan et al., 24 Sep 2025).
- Mode collapse and reward hacking can occur, particularly if reward models are adversarially optimized or diversity is ignored (Clark et al., 2023).
Best practices include delayed reward updates, stop-gradient scheduling in diffusion, modular loss design, parameter-efficient updates (e.g., LoRA), and layerwise/windowed gradients in high-dimensional U-Nets.
Anticipated developments focus on adaptive truncation schedules, robust variance reduction (control variates), extension to black-box reward signals, and compositional reward aggregation (e.g., model soups, multi-objective fine-tuning) (Clark et al., 2023).
Direct Reward Fine-Tuning continues to be a unifying framework for stable, principled, and data-efficient reward-driven adaptation across generative domains, with broad empirical validation and ongoing extension into new scientific, vision, and control applications (Liu et al., 14 Jan 2026, Clark et al., 2023, Huh et al., 16 Feb 2025, Yuan et al., 24 Sep 2025, Potaptchik et al., 26 Dec 2025).