
Scaling Image and Video Generation via Test-Time Evolutionary Search (2505.17618v1)

Published 23 May 2025 in cs.CV, cs.AI, and cs.LG

Abstract: As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.

Summary

  • The paper introduces EvoSearch, a test-time scaling framework that uses evolutionary search to improve generation quality and diversity in diffusion and flow models.
  • The method employs selection, mutation, and dynamic population initialization to refine denoising trajectories, outperforming existing best-of-N sampling approaches.
  • Experimental results show that EvoSearch enables smaller models to exceed the performance of larger ones, achieving consistent improvements across various metrics.

The paper "Scaling Image and Video Generation via Test-Time Evolutionary Search" (2505.17618) introduces a novel test-time scaling (TTS) framework called Evolutionary Search (EvoSearch) for enhancing image and video generation using diffusion and flow-based generative models. EvoSearch addresses limitations in existing TTS methods by reformulating the scaling problem as an evolutionary search, strategically allocating computation during inference to improve sample quality and diversity.

Problem Definition and Limitations of Existing TTS Methods

The paper addresses the challenge of improving generative model performance by allocating additional computation at inference time, given the increasing marginal cost of scaling computation during model pre-training. The goal is to sample from a target distribution that optimizes a reward function while staying close to the pre-trained distribution. Existing TTS methods, like best-of-N sampling and particle sampling, suffer from inefficiency and limited exploration capabilities, failing to fully capture the potential of test-time computation.
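Best-of-N sampling, the simplest of the baselines mentioned above, is easy to state concretely. The sketch below is a toy illustration, not the paper's implementation: the sample and reward functions are hypothetical stand-ins for a generative model and a reward model.

```python
import random

def best_of_n(sample_fn, reward_fn, n):
    """Draw n independent samples and keep the one with the highest reward."""
    samples = [sample_fn() for _ in range(n)]
    return max(samples, key=reward_fn)

# Toy usage: "samples" are random floats, the "reward" favors values near 0.5.
random.seed(0)
best = best_of_n(lambda: random.random(), lambda x: -abs(x - 0.5), n=16)
```

Because every candidate is drawn independently from the prior, best-of-N spends all of its extra compute on blind resampling; it never refines promising candidates, which is the inefficiency EvoSearch targets.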

Evolutionary Search (EvoSearch) Framework

EvoSearch draws inspiration from biological evolution to enhance the denoising trajectory of diffusion and flow models. The method incorporates selection and mutation mechanisms tailored to the stochastic differential equation (SDE) denoising process. EvoSearch iteratively generates higher-quality offspring while preserving population diversity. The key components of the framework include:

  • Evolution Schedule: Specifies the timesteps at which EvoSearch is conducted, optimizing initial noise and intermediate states. Figure 1 shows an ablation study on the evolution schedule T.
  • Population Initialization: Defines the initial size of sampled Gaussian noises and the children population size for each generation. Figure 2 shows an ablation study on the population size schedule K.
  • Fitness Evaluation: Evaluates the quality of each parent using a reward model at each evolution timestep.
  • Selection: Employs tournament selection to sample parents while maintaining population diversity.
  • Mutation: Introduces specialized mutation strategies for initial noises and intermediate denoising states, leveraging the structure of the latent space.

Figure 3: Overview of the EvoSearch method, illustrating the progressive refinement and exploration of new states along the denoising trajectory.
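The components above compose into a simple evolutionary loop. The sketch below is a minimal toy version under stated assumptions, not the authors' implementation: `denoise` stands in for the model's SDE denoising step and `reward` for the reward model, and the mutation is plain Gaussian perturbation of the candidate noises.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(x, t_from, t_to):
    # Placeholder for the model's SDE denoising step; here a toy contraction.
    return x * 0.9

def reward(x):
    # Placeholder reward model; here it prefers samples near the origin.
    return -np.linalg.norm(x)

def tournament_select(pop, fits, k=3):
    # Tournament selection: pick the fittest of k randomly chosen parents,
    # which keeps selection pressure moderate and preserves diversity.
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[max(idx, key=lambda i: fits[i])]

def evo_search(dim=8, pop_size=16, generations=4, sigma=0.1):
    # Population initialization: sampled Gaussian noises, one per candidate.
    pop = [rng.standard_normal(dim) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness evaluation: partially denoise each candidate to the current
        # evolution timestep, then score the resulting state.
        states = [denoise(x, 1.0, 0.5) for x in pop]
        fits = [reward(s) for s in states]
        children = []
        for _ in range(pop_size):
            parent = tournament_select(pop, fits)
            # Mutation: perturb the parent noise while staying Gaussian-like.
            children.append(parent + sigma * rng.standard_normal(dim))
        pop = children
    # Fully denoise the final generation and return the best sample.
    states = [denoise(x, 1.0, 0.5) for x in pop]
    return max(states, key=reward)

best = evo_search()
```

Unlike best-of-N, each generation reuses information from the previous one: good parents are kept and perturbed rather than discarded, while the stochastic mutation maintains population diversity.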

EvoSearch transforms the deterministic sampling process of flow models (ODE) into a stochastic process (SDE) to enable test-time scaling. This transformation broadens the generation space and allows for a unified framework for inference-time optimization.
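The practical difference between the two samplers can be seen in a toy integration, sketched below under simplifying assumptions: `velocity` is a hypothetical stand-in for the flow model's learned drift, and the noise scale `g` is an illustrative constant (the marginal-preserving ODE-to-SDE conversion also adds a score-correction term to the drift, omitted here for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity(x, t):
    # Toy velocity field standing in for the flow model's learned drift.
    return -x

def sample_ode(x0, steps=100, dt=0.01):
    # Deterministic Euler integration of dx = v(x, t) dt: the same x0 always
    # yields the same sample, so there is nothing for search to explore.
    x = x0
    for i in range(steps):
        x = x + velocity(x, i * dt) * dt
    return x

def sample_sde(x0, steps=100, dt=0.01, g=0.5):
    # Euler-Maruyama integration: the same drift plus injected noise g dW,
    # so repeated runs from one x0 trace out distinct trajectories.
    x = x0
    for i in range(steps):
        noise = rng.standard_normal(x.shape) * np.sqrt(dt)
        x = x + velocity(x, i * dt) * dt + g * noise
    return x

x0 = rng.standard_normal(4)
a = sample_ode(x0)               # deterministic: identical on every run
b1, b2 = sample_sde(x0.copy()), sample_sde(x0.copy())  # stochastic: differ
```

The injected noise is what gives the search something to act on: each intermediate state can branch into many offspring, which EvoSearch then selects among.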

Experimental Results and Analysis

The paper presents extensive experiments on text-conditioned image and video generation tasks using models such as Stable Diffusion 2.1, Flux.1-dev, HunyuanVideo, and Wan. The results demonstrate that EvoSearch consistently outperforms existing approaches in terms of sample quality, diversity, and alignment with human preferences.

Figure 4: A visual overview of EvoSearch, highlighting its ability to enhance sample quality and enable smaller models to outperform larger ones.

Specifically, EvoSearch enables SD2.1 to exceed GPT-4o and allows the Wan 1.3B model to achieve performance competitive with the 10x larger Wan 14B model. Moreover, EvoSearch generalizes well to unseen evaluation metrics, mitigating reward hacking while maintaining population diversity. Figure 5 illustrates this generalization, with ImageReward set as the guidance reward function during the search.

Figure 6: Visualization of a test-time alignment experiment, showing that EvoSearch can effectively capture all modes of a multimodal target distribution.

Conclusion

EvoSearch presents a novel and effective approach to test-time scaling for image and video generation. By reformulating TTS as an evolutionary search problem and incorporating specialized selection and mutation mechanisms, EvoSearch achieves significant improvements in sample quality, diversity, and generalization. The results suggest that EvoSearch can enable smaller-scale models to outperform larger-scale models, offering a promising direction for future research in generative modeling.
