Inference-Time Optimization in Visual Generation

Updated 26 October 2025
  • Inference-time optimization in visual generation is a paradigm that refines model outputs post-training using methods like search, reward modeling, and token pruning.
  • Techniques such as token pruning, key-value caching, and beam search yield significant speedups (up to 9.17x) and improved compositional accuracy in image and video generation.
  • The approach leverages reinforcement learning, external reasoning, and memory mechanisms to tailor outputs for safety, structure, and custom user objectives without updating model parameters.

Inference-time optimization in visual generation encompasses a broad suite of algorithmic approaches that adjust, guide, or accelerate the generative process post-training, with the aim of improving sample quality, computational efficiency, alignment to custom objectives, or specific content constraints—all without updating model parameters. This paradigm is crucial for scaling visual generative models in practical deployments, enabling them to deliver higher fidelity outputs, align with downstream task requirements, or obey safety and compositionality constraints, often with training-free or post-hoc techniques. The field leverages innovations in autoregressive transformers, diffusion models, flow-based models, reinforcement learning, reward modeling, token pruning, memory mechanisms, and sophisticated search strategies, each contributing unique perspectives and benefits.

1. Computational Efficiency: Architectural and Algorithmic Acceleration

Inference-time efficiency is a principal motivation. Architectures like FlashVideo employ RetNet, a recurrent, retention-style transformer, for video generation, replacing the quadratic time complexity of classical self-attention with a linear alternative, reducing cost from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for sequence length $L$. This is achieved through a sequentially updated hidden state that obviates repeatedly attending over the entire token history (Lei et al., 2023). Redundancy-free frame interpolation further enhances efficiency by updating only the tokens that change between keyframes, bypassing unnecessary recomputation in static regions. These innovations deliver $9.17\times$ speedups over standard autoregressive transformers and attain performance on par with BERT-style transformer video models while maintaining high output quality.
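As a rough illustration of why a retention-style recurrence is linear in sequence length, the sketch below maintains a fixed-size state that summarizes the token history; the decay factor, dimensions, and projections are illustrative assumptions, not FlashVideo's actual implementation.

```python
import numpy as np

# Minimal sketch of a retention-style recurrence: each step updates a
# fixed-size state instead of attending over all previous tokens, so the
# cost per token is O(d^2) and total cost is O(L) in sequence length L.
d = 64                      # head dimension (assumption)
decay = 0.97                # retention decay factor (assumption)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

state = np.zeros((d, d))    # running summary of the token history
outputs = []
for x in rng.standard_normal((128, d)):      # a stream of 128 tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    state = decay * state + np.outer(k, v)   # O(d^2) update, no history scan
    outputs.append(q @ state)                # read-out for this position
```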

Token pruning offers another axis of acceleration in vision-language models. Methods like TopV cast the identification of informative visual tokens as an optimal transport problem based on a visual-aware cost function integrating feature similarity, spatial arrangement, and centrality (Yang et al., 24 Mar 2025). TopV is compatible with efficient attention implementations (e.g., FlashAttention), and by pruning tokens only during the prefilling stage, it concurrently shrinks the key-value cache and memory footprint, yielding inference speedups of up to $2.1\times$ and substantial reductions in dynamic memory usage, all with negligible or even positive impact on accuracy.
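The sketch below illustrates the general shape of prefill-stage pruning: score the visual tokens, keep a top fraction, and carry only the survivors into the key-value cache. The centrality-based scoring here is a simplified stand-in for TopV's optimal-transport formulation.

```python
import numpy as np

# Sketch of prefill-stage visual-token pruning (a stand-in score, not
# TopV's actual cost function): keep the highest-scoring tokens so that
# all later decoding steps attend over a smaller KV cache.
def prune_visual_tokens(tokens, keep_ratio=0.5):
    # tokens: (N, d) visual features; proxy score = similarity to the mean
    centroid = tokens.mean(axis=0)
    sims = tokens @ centroid / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(centroid) + 1e-8
    )
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.sort(np.argsort(-sims)[:k])   # indices of retained tokens
    return tokens[keep], keep

tokens = np.random.default_rng(1).standard_normal((196, 768))  # e.g. 14x14 patches
pruned, kept_idx = prune_visual_tokens(tokens, keep_ratio=0.3)
print(pruned.shape)  # (58, 768): smaller KV cache from here on
```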

Autoregressive image generators such as SimpleAR also adopt inference accelerators including key-value caching, vLLM-based serving for fast paged attention, and speculative decoding strategies, enabling the generation of $1024 \times 1024$ images in approximately 14 seconds with models as small as 0.5B parameters (Wang et al., 15 Apr 2025).

2. Search and Sampling: Discrete, Sequential, and Stochastic Optimization

Substantial advances arise from integrating search techniques into the inference process. In autoregressive frameworks, beam search capitalizes on the discrete, sequential structure of token generation by expanding and pruning candidate paths efficiently. Because visual tokens are discrete, candidates can be pruned early and cached computations reused across shared prefixes, a property leveraged to great effect in recent work (Riise et al., 19 Oct 2025). For example, a 2B-parameter autoregressive model with beam search outperforms a 12B-parameter diffusion model on compositionality and visual-quality benchmarks, since each decision point permits selective expansion guided by verifier models (e.g., CLIPScore, ImageReward, LLaVA-OneVision).
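A minimal sketch of verifier-guided beam search over discrete visual tokens follows; `model_logits` and `verifier_score` are hypothetical stand-ins for the autoregressive model and an external scorer such as CLIPScore.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB = 1024  # size of the visual token codebook (assumption)

def model_logits(prefix):
    # Stand-in for an autoregressive model's next-token logits; in practice,
    # shared prefixes let KV caches be reused across beams.
    return rng.standard_normal(VOCAB)

def verifier_score(seq):
    # Stand-in for an external verifier such as CLIPScore or ImageReward.
    return float(np.mean(seq)) if seq else 0.0

def beam_search(steps=16, beam=4, expand=8):
    beams = [[]]                                     # start from the empty prefix
    for _ in range(steps):
        candidates = []
        for seq in beams:
            logits = model_logits(seq)
            for tok in np.argsort(-logits)[:expand]:  # top expansions per beam
                candidates.append(seq + [int(tok)])
        candidates.sort(key=verifier_score, reverse=True)  # verifier re-ranking
        beams = candidates[:beam]                    # early pruning of weak paths
    return beams[0]
```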

Diffusion models present greater challenges for search due to their continuous latent space. Algorithms such as Diffusion Latent Beam Search (DLBS) (Oshima et al., 31 Jan 2025) adapt beam search to diffusion by maintaining and propagating multiple latent trajectories, employing a lookahead estimator (short deterministic DDIM rollouts at each denoising step) to better approximate final rewards, and calibrating the alignment metric as a weighted sum of semantics, dynamics, and aesthetics. Compared to greedy or best-of-$N$ sampling, DLBS shows improved alignment to multi-faceted reward functions and better matches human/VLM feedback, while allocating compute to lookahead estimation provides greater gains than simply increasing the candidate pool size or the number of denoising steps.
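The following sketch captures the DLBS pattern of keeping several latent trajectories and scoring each candidate with a short deterministic lookahead before pruning; `denoise_step`, `ddim_rollout`, and `reward` are hypothetical stubs, not the paper's components.

```python
import numpy as np

rng = np.random.default_rng(3)

def denoise_step(z, t):                    # one stochastic denoising step (stub)
    return 0.9 * z + 0.1 * rng.standard_normal(z.shape)

def ddim_rollout(z, t, horizon=3):         # cheap deterministic lookahead (stub)
    for _ in range(horizon):
        z = 0.9 * z
    return z

def reward(x):                             # calibrated alignment metric (stub)
    return -float(np.linalg.norm(x))

def latent_beam_search(steps=10, beam=4, expand=3, dim=16):
    latents = [rng.standard_normal(dim) for _ in range(beam)]
    for t in range(steps):
        cands = [denoise_step(z, t) for z in latents for _ in range(expand)]
        # score each candidate by the reward of its lookahead estimate
        cands.sort(key=lambda z: reward(ddim_rollout(z, t)), reverse=True)
        latents = cands[:beam]             # prune to the best trajectories
    return latents[0]
```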

For flow-based and flow-matching models, deterministic sampling historically precluded the kind of stochastic particle search available to diffusion models. New work overcomes this by converting the ODE sampling process to a stochastic SDE (enabling particle sampling), adopting variance-preserving (VP) interpolants to expand the search space, and introducing adaptive Rollover Budget Forcing (RBF) to optimally allocate computational resources across timesteps, thus improving compositional alignment and fidelity of generated images while preserving the underlying model's sampling efficiency (Kim et al., 25 Mar 2025, Stecklov et al., 20 Oct 2025).
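A toy version of the ODE-to-SDE conversion with particle resampling is sketched below; the velocity field, reward, and noise scale are assumptions standing in for a learned flow model and the papers' schedules.

```python
import numpy as np

rng = np.random.default_rng(4)

def velocity(x, t):                        # learned flow field (stub)
    return -x

def sde_particle_sampling(n_particles=8, steps=20, dim=16, sigma=0.3):
    xs = rng.standard_normal((n_particles, dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # deterministic drift plus injected noise: an ODE step becomes an SDE step
        xs = xs + velocity(xs, t) * dt + sigma * np.sqrt(dt) * rng.standard_normal(xs.shape)
        # resample particles in proportion to a reward (stand-in scoring)
        w = np.exp([-np.linalg.norm(x) for x in xs])
        w /= w.sum()
        xs = xs[rng.choice(n_particles, size=n_particles, p=w)]
    return xs
```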

3. Alignment and Reward-Guided Generation

Alignment of outputs to task-specific, content-specific, or user-specified objectives—especially those difficult to enforce with standard training loss functions—is an increasingly central function of inference-time optimization. Direct Noise Optimization (DNO) for diffusion models defines alignment as direct optimization of the injected noise during sampling: maximizing a reward function $r(M_\theta(z))$ by updating the noise vector $z$ with respect to gradients of $r$ through the pretrained model $M_\theta$ (Tang et al., 29 May 2024). This tuning-free approach permits the alignment of outputs to continuous or even non-differentiable rewards (e.g., aesthetics, color attributes) by gradient-based or zeroth-order methods, and circumvents the need for retraining parameters.

A frequent failure of direct optimization is "reward hacking," where optimization drifts the noise far from the training distribution, resulting in high-reward but unrealistic images. This is addressed with probability regularization—quantifying how likely the optimized noise is under the original Gaussian prior and penalizing rare (low-probability) regions to maintain output realism.
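A compact sketch of this combination, assuming a differentiable generator `G` and reward `r` (both stand-ins), shows the core DNO loop with a penalty that keeps the optimized noise close to the Gaussian prior:

```python
import torch

def G(z):                                  # pretrained generator (stub)
    return torch.tanh(z)

def r(x):                                  # differentiable reward (stub)
    return x.mean()

z = torch.randn(1, 64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
lam = 0.1                                  # regularization weight (assumption)
for _ in range(100):
    opt.zero_grad()
    # penalize noise whose squared norm drifts from its expectation under N(0, I),
    # discouraging reward hacking via out-of-distribution noise
    prior_penalty = (z.pow(2).sum() - z.numel()) ** 2 / z.numel()
    loss = -r(G(z)) + lam * prior_penalty
    loss.backward()
    opt.step()
```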

RewardDance further advances inference-time reward modeling by reformulating the reward signal as the VLM's probability of predicting a "yes" token, thereby folding the reward objective directly into the model's next-token prediction. This generative reward paradigm, scalable up to 26B parameters and enriched with task-aware instructions and reference examples, not only improves prompt adherence and perceptual quality but also robustly mitigates mode collapse and reward hacking by maintaining high reward variance throughout optimization (Wu et al., 10 Sep 2025). This approach is validated on text-to-image, text-to-video, and image-to-video tasks, delivering notable performance increases on standard metrics.
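In code, such a generative reward reduces to reading off one token's probability; the VLM forward pass and the "yes" token id below are hypothetical placeholders:

```python
import torch

def yes_token_reward(logits, yes_token_id):
    # logits: (vocab,) next-token logits from the VLM after the judging prompt;
    # the reward is the probability mass the model places on "yes".
    probs = torch.softmax(logits, dim=-1)
    return probs[yes_token_id]             # scalar reward in [0, 1]

vocab = 32000
logits = torch.randn(vocab)                # stand-in for a real VLM forward pass
reward = yes_token_reward(logits, yes_token_id=9891)  # hypothetical id for "yes"
```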

4. Customization for Safety, Structure, and Compositionality

Safeguarding generative models and promoting compositional generalization extend the scope of inference-time optimization. Prompt-Noise Optimization (PNO) for text-to-image diffusion models jointly optimizes both the continuous prompt embedding and noise trajectory to minimize a toxicity loss, as measured by an independent classifier, while constraining the noise trajectory to remain "Gaussian-like" (Peng et al., 5 Dec 2024). This joint optimization offers enhanced robustness to adversarial prompt attacks and achieves state-of-the-art safety-to-alignment trade-offs without requiring retraining. Visualization allows the tracing of prompt embedding drift and assessment of the alignment-to-safety Pareto front.
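A minimal sketch of the joint optimization, with a stand-in sampler and toxicity classifier, updates both variables under a penalty that keeps the noise "Gaussian-like":

```python
import torch

def generate(prompt_emb, noise):           # diffusion sampler (stub)
    return torch.tanh(prompt_emb.mean() + noise)

def toxicity(x):                           # external classifier (stub)
    return x.abs().mean()

prompt_emb = torch.randn(16, requires_grad=True)
noise = torch.randn(64, requires_grad=True)
opt = torch.optim.Adam([prompt_emb, noise], lr=0.02)
for _ in range(200):
    opt.zero_grad()
    gauss_penalty = (noise.pow(2).mean() - 1.0) ** 2   # keep noise near N(0, I)
    loss = toxicity(generate(prompt_emb, noise)) + 0.5 * gauss_penalty
    loss.backward()
    opt.step()                             # jointly update embedding and noise
```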

For compositionality and spatial precision, methods deploy learned, data-driven loss functions at inference. Learn-to-Steer trains lightweight classifiers to decode spatial relationships directly from cross-attention maps within diffusion models (Yiflach et al., 2 Sep 2025). These classifiers, trained with dual-inversion (positive and negative relation prompts on the same image) to avoid reliance on linguistic features, deliver marked improvements in spatial accuracy (e.g., FLUX.1-dev: 0.20 → 0.61, SD2.1: 0.07 → 0.54), and generalize to complex multi-object, multi-relation scenes.
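The guidance pattern, sketched below with a toy classifier and a stand-in attention extractor (not the paper's architecture), backpropagates the classifier's loss to the latent at a denoising step:

```python
import torch

# Toy relation classifier over a flattened 16x16 cross-attention map (assumption)
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16 * 16, 4))

def steer_latent(latent, attn_maps_fn, target_relation, lr=0.1):
    latent = latent.detach().requires_grad_(True)
    attn = attn_maps_fn(latent)            # (1, 16, 16) cross-attention map
    logits = classifier(attn)
    loss = torch.nn.functional.cross_entropy(
        logits, torch.tensor([target_relation])
    )
    loss.backward()
    # one gradient step on the latent, steering it toward the target relation
    return (latent - lr * latent.grad).detach()

fake_attn = lambda z: torch.sigmoid(z.view(1, 16, 16))  # stand-in extractor
latent = steer_latent(torch.randn(256), fake_attn, target_relation=2)
```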

Masking-Augmented Diffusion with Inference-Time Scaling (MADI) augments diffusion model training via dual corruption (noise + masking) and deploys Pause Tokens at inference within prompts, allowing the model to dynamically scale computational attention to complex, structure-aware editing tasks without retraining (Kadambi et al., 16 Jul 2025). Expressive, subdivision-based prompts further condition models for fine-grained edits, improving CLIP-DIR and DINO-based structure metrics.

Inference-time structure imposition, such as decomposing images into crops and captions into segments for match-aggregated VLM scoring, reveals that compositional alignment can be substantially improved without training, benefiting attribute–object and multi-relation binding in retrieval and, plausibly, generation (Miranda et al., 11 Jun 2025).
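The aggregation step can be stated compactly: embed crops and caption segments separately, then reward the best crop match per segment so that each phrase is grounded somewhere in the image. The embeddings below are random stand-ins for VLM features.

```python
import numpy as np

def aggregate_score(crop_embs, seg_embs):
    # cosine similarity matrix: (num_crops, num_segments)
    crops = crop_embs / np.linalg.norm(crop_embs, axis=1, keepdims=True)
    segs = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    sims = crops @ segs.T
    return sims.max(axis=0).mean()         # best crop per segment, averaged

rng = np.random.default_rng(5)
score = aggregate_score(
    rng.standard_normal((9, 512)),         # e.g. 3x3 grid of image crops
    rng.standard_normal((3, 512)),         # caption split into 3 segments
)
```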

5. Reinforcement Learning and Memory in Inference-Time Adaptation

Reinforcement learning approaches, notably Group Relative Policy Optimization (GRPO), yield robust inference optimization in visual generation (Xue et al., 12 May 2025, Fang et al., 23 May 2025). In frameworks like DanceGRPO, both diffusion and rectified flow models are cast as Markov Decision Processes, with sampling formulated as SDEs and group-level advantage computation stabilizing updates and supporting best-of-$N$ inference scaling across static and video domains. Reward models span aesthetics, alignment, motion quality, and binary signals, allowing adaptivity to both dense and sparse feedback signals.
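The group-relative advantage at the heart of GRPO is simple to state; the sketch below normalizes each candidate's reward against the statistics of its group, so no learned value function is needed (a generic formulation, not any one paper's exact estimator):

```python
import numpy as np

def group_relative_advantages(rewards):
    # rewards: scores for a group of samples generated from the same prompt
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. rewards for 6 samples from one prompt; positive advantages reinforce
# those samples' denoising trajectories during the policy update
adv = group_relative_advantages([0.7, 0.4, 0.9, 0.5, 0.6, 0.8])
```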

InfLVG applies GRPO to context selection for long video generation, maintaining fixed compute by dynamically selecting semantically relevant tokens from the generative history, balancing semantic consistency and prompt alignment for cross-scene narratives (Fang et al., 23 May 2025). Hybrid reward functions integrate content similarity, prompt alignment (CLIP), and artifact penalties. Related memory approaches, in which previous optimization strategies (e.g., context-ranked tokens or TTOM's parametric memory) are stored and retrieved based on prompt similarity, enable continual improvement of compositional consistency and generalization to new prompts (Qu et al., 9 Oct 2025).

6. Optimal Stopping, Compute Budget Allocation, and Search Trade-Offs

Principles from optimal stopping theory establish formal frameworks for deciding when to halt sampling. For example, the Pandora's Box algorithm applies confidence-bound-based stopping to maximize utility per compute cost, adaptively terminating generation when the observed maximum surpasses a learned acceptance threshold (Kalayci et al., 1 Oct 2025). Reward normalization (via Bradley–Terry transformations) further allows cross-prompt comparability, significantly lowering the number of required generations (by 15–35%) compared to fixed best-of-$N$ sampling at matched output quality.
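A toy version of the stopping rule is sketched below: keep generating until the best reward so far exceeds an acceptance threshold or the budget runs out. The threshold and reward distribution are assumptions, not the paper's learned values.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_reward():                       # stand-in: generate and score one sample
    return rng.beta(2, 5)

def adaptive_best_of_n(threshold=0.55, max_samples=32):
    best, n = -np.inf, 0
    while n < max_samples:
        best = max(best, sample_reward())
        n += 1
        if best >= threshold:              # optimal-stopping acceptance rule
            break
    return best, n                         # often stops well before max_samples
```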

In compute scaling for flow matching, strategies inject noise orthogonal to the score while preserving the linear interpolant, permitting exploration and sample diversity without the expense (in both trajectory curvature and step count) of variance-preserving schedules (Stecklov et al., 20 Oct 2025). Two-stage search algorithms (random search over initialization, followed by noise search along the trajectory) further exploit compute budgets to maximize output diversity and quality while retaining efficient, straight-line sampling.

7. Reasoning, Memory, and Visual Chain-of-Thought

Going beyond search and alignment, new approaches incorporate external reasoning signals. VChain leverages large multimodal models (e.g., GPT-4o) to produce a chain of "visual thoughts" (keyframes and textual descriptions), which serve as anchor points for sparse, LoRA-based fine-tuning of pre-trained video generators at inference time (Huang et al., 6 Oct 2025). This method introduces strong, physically and causally grounded supervision at critical junctures without requiring dense annotation or retraining, enhancing multi-step event synthesis, temporal coherence, and causal consistency in video generation.

Techniques like TTOM (Test-Time Optimization and Memorization) augment this approach with a streaming, parametric memory mechanism to store and reuse per-prompt optimizations, further disambiguating compositional world knowledge from the generative process. In compositional scenarios—spanning motion, numeracy, and spatial relation—TTOM improves key benchmarks and demonstrates the feasibility of online adaptation and continual refinement (Qu et al., 9 Oct 2025).
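The storage-and-retrieval pattern can be sketched as a memory keyed by prompt embeddings, with the closest entry returned above a similarity floor; the embedder, payloads, and threshold below are stand-ins, not TTOM's actual mechanism.

```python
import numpy as np

class PromptMemory:
    """Store per-prompt optimization results; retrieve by cosine similarity."""

    def __init__(self):
        self.keys, self.values = [], []

    def store(self, prompt_emb, optimized_params):
        self.keys.append(prompt_emb / np.linalg.norm(prompt_emb))
        self.values.append(optimized_params)

    def retrieve(self, prompt_emb, min_sim=0.8):
        if not self.keys:
            return None
        q = prompt_emb / np.linalg.norm(prompt_emb)
        sims = np.array([k @ q for k in self.keys])
        i = int(sims.argmax())
        # reuse a stored optimization only if the new prompt is close enough
        return self.values[i] if sims[i] >= min_sim else None
```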


Inference-time optimization in visual generation integrates model-centric architectural advances, search-based computational strategies, reward-guided alignment, reinforcement learning, and memory mechanisms to deliver outputs that are higher in quality, better aligned to complex constraints, safer, and more computationally efficient. Its methods are highly modular, training- and model-agnostic, and often compatible with a wide range of visual generation paradigms, including but not limited to autoregressive transformers, diffusion models, and flows. The field underscores a critical shift from static, parameter-centric improvement to dynamic, adaptable, and context-sensitive optimization at generation time, bridging the gap between theoretical capability and practical deployment—across image, video, and multimodal content.
