
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning (2504.16080v1)

Published 22 Apr 2025 in cs.CV

Abstract: Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in LLMs, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.

This paper introduces ReflectionFlow, an inference-time framework designed to enhance the quality of images generated by text-to-image (T2I) diffusion models through an iterative self-refinement process. Unlike traditional methods that focus on scaling training data and model parameters, ReflectionFlow aims to leverage additional computation at inference time to improve results, particularly for complex scenes and fine-grained details where models often struggle.

The core idea is inspired by the self-reflection capabilities observed in LLMs. ReflectionFlow enables a diffusion model to iteratively reflect on a generated image, identify its flaws, and refine it based on explicit textual feedback. The framework explores three complementary axes for scaling inference-time computation:

  1. Noise-Level Scaling: This involves exploring multiple initial noise samples (search width N) for a single prompt, similar to existing noise-space optimization techniques.
  2. Prompt-Level Scaling: This refines the input prompt during the iterative process using a Multimodal LLM (MLLM) to provide more precise semantic guidance based on the generated images and their evaluations.
  3. Reflection-Level Scaling: This is the key novelty, where the model generates explicit textual reflections (critiques) of the previous generation and uses these reflections as instructions to guide the refinement process for a specified number of iterations (reflection depth M).
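The first axis, noise-level scaling, reduces to a best-of-N search over initial noises. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for the base diffusion model and the verifier (each seed representing a distinct initial noise sample):

```python
def best_of_n(prompt, generate, score, n=8):
    """Best-of-N noise-level scaling: one generation per seed,
    keep the highest-scoring image according to the verifier."""
    best_img, best_score = None, float("-inf")
    for seed in range(n):  # each seed stands for a distinct initial noise
        img = generate(prompt, seed=seed)
        s = score(img)
        if s > best_score:
            best_img, best_score = img, s
    return best_img
```

Prompt- and reflection-level scaling extend this search into the text space rather than only the noise space.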

To facilitate reflection-level scaling, the authors created GenRef, the first large-scale dataset specifically for image reflection. GenRef contains 1 million triplets, each comprising a flawed image, an enhanced image, and a textual reflection detailing how to transform the flawed image into the enhanced one. The dataset was constructed using a scalable, automated pipeline drawing from four sources: rule-based verification of object attributes, reward-based ranking using ensemble metrics (HPSv2, CLIPScore, PickScore), comparing images from long vs. short prompts, and adapting existing image editing datasets (OmniEdit). A subset, GenRef-CoT (227K samples), uses a Chain-of-Thought (CoT) process with closed-source MLLMs (GPT-4o, Gemini 2.0) to generate detailed pairwise analyses, preferences, and reflections. This data was used to fine-tune an image reflector model (based on Qwen2.5-VL-7B) and an image reward model (the verifier, based on Qwen2.5-VL-3B), the latter trained with a Bradley–Terry (BT) pairwise comparison loss.
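The BT objective models the probability that the enhanced image outscores the flawed one and minimizes its negative log-likelihood. A minimal, framework-free sketch, with scalar scores standing in for the reward model's outputs:

```python
import math

def bt_pairwise_loss(scores_enhanced, scores_flawed):
    """Bradley-Terry pairwise loss (sketch): model
    P(enhanced beats flawed) = sigmoid(s_e - s_f) and average the
    negative log-likelihood over a batch of (enhanced, flawed) pairs."""
    # -log sigmoid(x) computed stably as log(1 + exp(-x))
    nll = lambda s_e, s_f: math.log1p(math.exp(-(s_e - s_f)))
    pairs = list(zip(scores_enhanced, scores_flawed))
    return sum(nll(s_e, s_f) for s_e, s_f in pairs) / len(pairs)
```

The loss shrinks as the verifier scores the enhanced image increasingly above the flawed one, which is exactly the ranking behavior needed at inference time.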

For implementation, ReflectionFlow treats self-refinement as a conditional generation task, analogous to image editing. The authors efficiently fine-tune a state-of-the-art Diffusion Transformer (DiT), FLUX.1-dev [flux2024], to act as a corrector model C_\phi. This is done without adding extra modules such as ControlNet [zhang2023adding] or IP-Adapter [ye2023ip]: instead, the original prompt, reflection prompt, flawed image, and refined image are concatenated into a single sequence, allowing joint multimodal attention within the DiT architecture. The flawed image is downsampled (e.g., from 1024 to 512) and left un-noised to improve efficiency during training and inference. The training objective uses a standard diffusion or flow-matching loss on the refined target image. Two training strategies prevent distributional drift from the pretrained base model: randomly dropping conditioning inputs (prompts, flawed image) and a "task warm-up" that prioritizes editing data early in training. LoRA [hu2022lora] with a rank of 256 is used for efficient fine-tuning.
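The input packing and conditioning dropout described above can be sketched as follows. Token names, structure, and drop probabilities here are illustrative assumptions, not the paper's actual values or code:

```python
import random

def build_condition_sequence(prompt_toks, reflection_toks, flawed_img_toks,
                             p_drop_text=0.1, p_drop_image=0.1, rng=random):
    """Sketch of the corrector's input packing: prompt, reflection, and
    (downsampled, un-noised) flawed-image tokens are concatenated into
    one sequence for joint multimodal attention, with each condition
    randomly dropped during training to curb distributional drift."""
    sequence = []
    if rng.random() >= p_drop_text:       # keep the textual conditions
        sequence += [("prompt", prompt_toks), ("reflection", reflection_toks)]
    if rng.random() >= p_drop_image:      # keep the flawed-image condition
        sequence.append(("flawed_image", flawed_img_toks))
    # The noised refined-image tokens (the denoising target) follow here.
    return sequence
```

Because everything lives in one attention sequence, no auxiliary conditioning modules are needed; dropping conditions at random also gives the model a usable unconditional mode.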

The ReflectionFlow inference process combines these components. Starting with N initial images generated by the base model, in each iteration (up to depth M) an MLLM verifier evaluates the current images, generates textual reflections, and refines the prompts. The fine-tuned corrector model then uses these refined prompts and reflections, conditioned on the current images, to produce refined images for the next iteration. Finally, the verifier selects the best image across all candidates from all iterations and chains. The framework offers flexibility by adjusting the search width (N) and reflection depth (M) to balance performance and computational budget.
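The full loop can be sketched end-to-end. Every callable below is a hypothetical interface standing in for the corresponding component, not the released API:

```python
def reflectionflow(prompt, base_model, corrector, verifier, reflector,
                   width_n=4, depth_m=4):
    """Sketch of ReflectionFlow inference: N parallel chains, M rounds
    of reflect-and-refine, and a final best-of-all verifier selection."""
    chains = [base_model(prompt) for _ in range(width_n)]  # noise-level scaling
    candidates = list(chains)
    for _ in range(depth_m):                               # reflection depth
        next_chains = []
        for img in chains:
            reflection = reflector(prompt, img)            # textual critique
            new_prompt = verifier.refine_prompt(prompt, img)  # prompt-level scaling
            refined = corrector(new_prompt, reflection, img)  # reflection-level
            next_chains.append(refined)
        chains = next_chains
        candidates.extend(chains)
    return max(candidates, key=verifier.score)  # select the best candidate
```

Note that the selection runs over every candidate ever produced, so a late refinement that regresses cannot displace an earlier, better image.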

Experimental results on the GenEval benchmark [ghosh2024geneval] demonstrate the effectiveness of ReflectionFlow (Zhuo et al., 22 Apr 2025). With a total budget equivalent to 32 samples, ReflectionFlow achieves a GenEval score of 0.91, significantly outperforming the baseline FLUX.1-dev (0.67), noise-level scaling alone (0.85), and combined noise+prompt scaling (0.87). Ablation studies show that performance scales with the quality of the verifier used and improves consistently with increased inference budget. Analysis of different refinement strategies (varying N and M for a fixed budget) suggests that greater reflection depth is generally more beneficial than wider search, highlighting the power of iterative correction. Crucially, ReflectionFlow provides larger performance gains on prompts initially classified as "hard" (reaching a score of 0.81 from a 0.10 baseline), mirroring reflection mechanisms in LLMs, which excel on challenging tasks. Qualitative results show the iterative process of correcting errors based on the generated reflections.

In summary, ReflectionFlow provides a practical framework for enhancing T2I model performance at inference time by leveraging iterative self-refinement guided by textual reflections. The accompanying GenRef dataset and the efficient fine-tuning approach for DiTs enable this capability, offering a flexible way to trade computation for quality, especially on complex generation tasks. The code, checkpoints, and datasets are made publicly available.

Authors (9)
  1. Le Zhuo
  2. Liangbing Zhao
  3. Sayak Paul
  4. Yue Liao
  5. Renrui Zhang
  6. Yi Xin
  7. Peng Gao
  8. Mohamed Elhoseiny
  9. Hongsheng Li