
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Published 5 Feb 2026 in cs.LG and cs.AI | (2602.05993v1)

Abstract: Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

Summary

  • The paper introduces Diamond Maps to achieve fast and accurate reward alignment via stochastic flow maps for generative modeling.
  • It leverages posterior and weighted designs to enable efficient value function estimation and unbiased Monte Carlo sampling.
  • Empirical results on benchmarks like CIFAR-10 and ImageNet demonstrate improved scalability, fidelity, and computational efficiency.

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Motivation and Problem Statement

Reward alignment in generative modeling with diffusion and flow-based models remains a significant challenge, especially when adapting to arbitrary preferences or constraints post-training. Existing approaches fall into two camps: reward fine-tuning, which is costly and inflexible for new rewards, and inference-time guidance, which is computationally expensive and frequently yields biased results due to inaccurate value function estimation. This paper introduces Diamond Maps, a new class of stochastic flow map models designed to deliver fast and accurate reward alignment through efficient value function estimation, making guidance, search, and sequential Monte Carlo (SMC) scalable at inference time. Figure 1

Figure 1: Diamond Maps overview illustrating stochastic flow maps for high-reward endpoints, with improved alignment and sampling efficiency across multiple tasks.

Stochastic Flow Maps and Value Function Estimation

Diamond Maps leverage recent advances in flow map distillation, specifically the ability to amortize the simulation of flow and diffusion models into a single-step neural network evaluation. The central insight is that stochasticity—absent in conventional (deterministic) flow maps—is essential for consistent value function estimation, which underpins reward-aligned sampling. Two designs are presented:

  • Posterior Diamond Maps: Stochastic flow maps distilled from GLASS Flows, directly sampling from the posterior $p_{1|t}(\cdot \mid x_t)$. These enable unbiased, sample-efficient Monte Carlo estimation of the value function $V_t^r(x_t)$ and its gradient for guidance (a minimal estimator sketch follows Figure 2 below). The distillation process leverages analytic transformations from a pre-trained flow-matching model.
  • Weighted Diamond Maps: An approach to induce stochasticity in standard flow maps by applying a renoising procedure and weighting via a recovery reward, correcting the otherwise biased posterior approximation and enabling consistent value function gradient estimation for guidance. Figure 2

    Figure 2: Comparison of error accumulation for iterative denoising/noising vs. Diamond Early Stop DDPM sampling, demonstrating reduced amortization gap and improved fidelity.
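
To make the posterior estimator concrete, here is a minimal sketch of Monte Carlo value estimation with a stochastic one-step sampler. The `diamond_map(x, s, t)` interface, the soft value form $V_t^r(x_t) = \log \mathbb{E}[\exp(r(X_1)) \mid x_t]$, and the sample count are illustrative assumptions; the paper's exact parameterization is not reproduced here.

```python
import torch

def estimate_soft_value(diamond_map, reward, x_t, t, n_samples=16):
    """Monte Carlo estimate of a soft value V_t(x_t) = log E[exp(r(X_1)) | x_t].

    Assumes `diamond_map(x, s, t)` draws a fresh stochastic sample of the state at
    time t given state x at time s; calling it with t = 1.0 yields endpoints from the
    posterior p_{1|t}(. | x_t). Both the interface and the soft-value form are
    illustrative assumptions, not the paper's exact definitions.
    """
    endpoints = torch.stack([diamond_map(x_t, t, 1.0) for _ in range(n_samples)])
    rewards = torch.stack([reward(x1) for x1 in endpoints])  # one scalar per sample
    # log-mean-exp: numerically stable average of exp(reward)
    return torch.logsumexp(rewards, dim=0) - torch.log(
        torch.tensor(float(n_samples), device=rewards.device)
    )
```

Averaging in log space keeps the estimate stable when rewards are large, which matters at the high reward scales explored in the experiments.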

Posterior Diamond Maps and Efficient Guidance

Posterior Diamond Maps enable one-step "look-ahead" sampling for value function estimation. The model is trained via distillation from GLASS Flows; stochastic transitions allow exploration (critical for SMC and search), and exact guidance can be performed using consistent Monte Carlo estimators for the value function and its gradient.
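
Building on the estimator sketched above, a guidance step can be obtained by differentiating the value estimate with respect to $x_t$, assuming the distilled map is differentiable in its input (reparameterized sampling). The update rule and guidance scale below are illustrative choices rather than the paper's exact procedure.

```python
import torch

def guided_update(diamond_map, reward, x_t, t, guidance_scale=1.0, n_samples=8):
    """Nudge x_t along a Monte Carlo estimate of the value-function gradient.

    Reuses `estimate_soft_value` from the previous sketch and backpropagates through
    the stochastic one-step sampler; a sketch under stated assumptions only.
    """
    x_t = x_t.detach().requires_grad_(True)
    value = estimate_soft_value(diamond_map, reward, x_t, t, n_samples)
    (grad_v,) = torch.autograd.grad(value, x_t)  # gradient through the sampler
    return (x_t + guidance_scale * grad_v).detach()
```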

Additionally, Posterior Diamond Maps are demonstrated to encapsulate DDPM time-reversal transitions, allowing efficient iterative sampling that avoids error accumulation seen with conventional iterative denoising and noising. The method is validated quantitatively against standard and distilled flow maps, showing superior performance in Faithful ImageNet, CIFAR-10, and CelebA-64 benchmarks. Figure 3

Figure 3: Guidance effect on blueness reward—Weighted Diamond Maps incorporate regularization, preventing drift from the data manifold.

Figure 4

Figure 4: Posterior Diamond Map sampling produces high-quality stochastic samples from the posterior while avoiding error accumulation.
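
Under the same hypothetical `diamond_map(x, s, t)` interface, the iterative sampling scheme described above might look as follows: each step makes a stochastic transition directly to the next time, so no intermediate sample is pushed to the endpoint and renoised.

```python
def iterative_posterior_sample(diamond_map, x_0, time_grid):
    """Iterative refinement with a stochastic flow map (illustrative sketch).

    `time_grid` is an increasing schedule, e.g. [0.25, 0.5, 0.75, 1.0]. Each call makes
    a stochastic transition to the next time rather than alternating full denoising and
    renoising passes, which is where conventional iterative schemes accumulate error.
    """
    x, s = x_0, 0.0
    for t in time_grid:
        x = diamond_map(x, s, t)  # stochastic transition s -> t
        s = t
    return x
```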

Weighted Diamond Maps and Plug-and-Play Scalability

Weighted Diamond Maps enable reward-aligned sampling using off-the-shelf distilled flow maps (e.g., SANA-Sprint) without retraining. Stochasticity is injected via a renoising map, and unbiased estimation is achieved by incorporating recovery rewards and score-based corrections. Monte Carlo estimators are presented for both the value function and its gradient, with effective sample size (ESS) considerations influencing inference-time compute. This approach is tested at scale in high-resolution text-to-image (T2I) settings.
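
As a rough illustration of this weighting scheme, the sketch below injects noise into the input of a deterministic one-step flow map and corrects the resulting bias with self-normalized importance weights, reporting the effective sample size as a diagnostic. The `renoise`, `recovery_log_weight`, and `flow_map(x, t)` callables are placeholders; the paper's actual renoising procedure and recovery reward are not reproduced here.

```python
import torch

def weighted_soft_value(flow_map, renoise, recovery_log_weight, reward, x_t, t,
                        n_samples=16):
    """Self-normalized importance-sampling estimate of the soft value from renoised
    deterministic flow-map samples. Returns the estimate and the effective sample size.
    All callables are hypothetical stand-ins for the paper's components."""
    endpoints, log_w = [], []
    for _ in range(n_samples):
        x_noisy = renoise(x_t, t)                       # inject stochasticity
        x_1 = flow_map(x_noisy, t)                      # one-step deterministic sampler
        endpoints.append(x_1)
        log_w.append(recovery_log_weight(x_1, x_t, t))  # bias-correcting log weight
    log_w = torch.stack(log_w)
    rewards = torch.stack([reward(x1) for x1 in endpoints])
    # self-normalized estimate: weighted log-mean-exp of the rewards
    value = torch.logsumexp(log_w + rewards, dim=0) - torch.logsumexp(log_w, dim=0)
    w = torch.softmax(log_w, dim=0)
    ess = 1.0 / (w ** 2).sum()                          # low ESS signals degenerate weights
    return value, ess
```

When the ESS collapses, more samples (and hence more compute) are needed for a reliable estimate, which is how the number of guidance steps trades off against inference-time cost.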

Weighted Diamond Maps outperform Best-of-N, prompt optimization, and Reward-based Noise Optimization methods in efficiency and reward alignment (GenEval and ImageReward metrics), demonstrating tight Pareto frontiers for guidance steps versus compute. Trajectories show improved adherence to prompts and reduced artifacts. Figure 5

Figure 5: Illustration of guidance trajectories with Weighted Diamond Maps, highlighting prompt fidelity and artifact reduction after iterative steps.

Figure 6

Figure 6: Pareto frontier for guidance—Weighted Diamond Maps achieve superior scaling and reward alignment compared to Best-of-N selection.

Experiments and Empirical Results

Experiments cover posterior distillation, reward-guided inference in linear inverse problems, text-to-image alignment with human preference rewards, and SMC with CLIP reward adaptation. Posterior Diamond Maps are distilled from flow-matching models on CIFAR-10/CelebA-64 and evaluated, demonstrating competitive FID and robustness to high reward scales due to stochasticity. Weighted Diamond Maps are applied to SANA-Sprint in 1024×1024 T2I, outperforming baselines and achieving state-of-the-art metrics in GenEval benchmarks for alignment and efficiency. Figure 7

Figure 7: Pareto frontier for Gaussian deblurring—Posterior Diamond Maps are more robust to high reward scales, outperforming naive flow map guidance.

Figure 8

Figure 8: CLIP-reward SMC with Posterior Diamond Maps—successful adaptation to text prompts using stochastic search.
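
A minimal sequential Monte Carlo loop of the kind referenced above could be sketched as follows, reusing the hypothetical `diamond_map(x, s, t)` transition and the `estimate_soft_value` helper from earlier. The incremental weights (a tempered change in estimated value) and the ESS-triggered resampling are standard SMC ingredients assumed here for illustration, not necessarily the paper's exact potentials.

```python
import torch

def smc_reward_sampling(diamond_map, reward, x_init, time_grid, n_particles=16, lam=1.0):
    """Particle-based reward alignment sketch: propagate with stochastic transitions,
    reweight by the change in estimated value (tempered by lam), and resample when
    the weights degenerate."""
    particles = [x_init.clone() for _ in range(n_particles)]
    log_w = torch.zeros(n_particles)
    s = 0.0
    prev_v = torch.stack([estimate_soft_value(diamond_map, reward, p, s) for p in particles])
    for t in time_grid:
        particles = [diamond_map(p, s, t) for p in particles]   # stochastic move
        cur_v = torch.stack([estimate_soft_value(diamond_map, reward, p, t) for p in particles])
        log_w = log_w + lam * (cur_v - prev_v)                  # incremental weight
        prev_v, s = cur_v, t
        w = torch.softmax(log_w, dim=0)
        if 1.0 / (w ** 2).sum() < n_particles / 2:              # ESS-triggered resampling
            idx = torch.multinomial(w, n_particles, replacement=True)
            particles = [particles[i] for i in idx.tolist()]
            prev_v = prev_v[idx]
            log_w = torch.zeros(n_particles)
    return particles, torch.softmax(log_w, dim=0)
```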

Practical and Theoretical Implications

Diamond Maps provide a practical inference-time approach for reward alignment in generative modeling, bridging the gap between sample efficiency, flexibility, and accuracy. The stochastic flow map paradigm supports rapid adaptation to arbitrary reward functions and constraints without retraining, enabling scalable search and guidance. The theoretical framing clarifies the role of stochasticity for value function estimation and positions Diamond Maps as a core architectural advance for plug-and-play reward alignment.

The results open pathways for RLHF-style reward-guided sampling, efficient conditional generative modeling, and search-based inference over high-dimensional distributions. Extensions may include multi-modal reward composition, exploration-driven search trees, and integration with large-scale generators or alignment frameworks.

Conclusion

Diamond Maps represent a technically rigorous and empirically validated solution to the reward alignment problem in diffusion and flow-based generative models, enabling efficient value function estimation and scalable guidance at inference time. The stochastic flow map architecture unlocks practical and versatile adaptation to arbitrary preferences and constraints, and is poised for extensive application in scalable, reward-driven generative modeling paradigms.
