Combining Reward Fine-Tuning and Inference-Time Guidance

Develop a principled methodology that combines reward fine-tuning and inference-time reward alignment (guidance) for flow matching and diffusion generative models, so that the resulting approach inherits the advantages of both families: flexibility across arbitrary rewards without retraining, and accurate, efficient sampling.

Background

The paper distinguishes two dominant approaches to preference conditioning in generative models: (1) reward fine-tuning, which adapts a model to a specific reward via an additional training stage and therefore must be repeated for each new reward, and (2) inference-time reward alignment (guidance), which adjusts the sampling process without retraining but often relies on approximations that can be both inaccurate and computationally costly.
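
To make the contrast concrete, below is a minimal sketch on a toy 2-D flow matching model, assuming PyTorch. The names VelocityField, reward, guided_sample, and finetune_step, and the simple reward-gradient guidance rule, are illustrative assumptions and not the paper's method: finetune_step backpropagates the reward of generated samples into the model weights (accurate for that reward, but redone per reward), while guided_sample leaves the weights fixed and nudges each sampling step with an approximate reward gradient.

```python
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Toy velocity field v_theta(x, t) for a 2-D flow matching model."""

    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # Concatenate the scalar time onto each sample before the MLP.
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))


def reward(x):
    """Stand-in differentiable reward: prefer samples near the point (1, 1)."""
    return -((x - 1.0) ** 2).sum(dim=-1)


def guided_sample(model, n=128, steps=50, guidance_scale=1.0):
    """Inference-time alignment (illustrative): nudge each Euler step with the
    reward gradient. No retraining, works for any differentiable reward, but
    the gradient on intermediate states is only an approximation."""
    x = torch.randn(n, 2)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        with torch.no_grad():
            v = model(x, t)
        x_req = x.detach().requires_grad_(True)
        g = torch.autograd.grad(reward(x_req).sum(), x_req)[0]
        x = x + dt * (v + guidance_scale * g)
    return x


def finetune_step(model, opt, n=128, steps=50):
    """Reward fine-tuning (illustrative): backprop the reward of generated
    samples through the whole sampling chain into the model weights. Accurate
    for this reward, but must be repeated for every new reward."""
    x = torch.randn(n, 2)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        x = x + dt * model(x, t)
    loss = -reward(x).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    model = VelocityField()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    print("fine-tuning loss:", finetune_step(model, opt))
    print("guided sample mean:", guided_sample(model).mean(dim=0))
```

The guidance path in this sketch evaluates the reward gradient on intermediate states, which is precisely the kind of cheap but potentially inaccurate approximation that motivates the question of combining the two approaches.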

Given these complementary strengths and weaknesses, a central unresolved question is how to unify or combine these approaches to obtain both adaptability (no retraining per reward) and accurate, efficient alignment, particularly for flow matching and diffusion models used in high-dimensional generation.

References

It remains an open question how to combine the merits of both approaches.

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps (2602.05993 - Holderrieth et al., 5 Feb 2026) in Introduction (Section 1)