Introduction
Recent years have witnessed a significant surge in the capabilities of text-to-image generative models. These models have become adept at producing images that are both high-fidelity and semantically faithful to their text prompts. However, a primary challenge for these systems is to align model outputs with human preferences, as the training distribution often does not reflect the true distribution of user-generated prompts.
ImageReward
To address this alignment gap, ImageReward is introduced as a pioneering general-purpose reward model for text-to-image synthesis. It captures human preferences effectively, being trained on a substantial dataset of 137k pairs of expert comparisons. The training data comes from a carefully designed annotation pipeline that combines rating and ranking: prompts are categorized, problems in generated images are identified, and each image is scored along the dimensions of alignment, fidelity, and harmlessness. Establishing the labeling criteria, training annotators, and verifying the reliability of their responses took months of effort.
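The ranking annotations are typically converted into pairwise training signals for the reward model. Below is a minimal sketch of such a pairwise ranking objective; the `reward_model` callable, its batch layout, and the scalar-score interface are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, prompts, preferred_images, rejected_images):
    """Pairwise ranking loss over expert comparisons.

    For each annotated pair, the model is pushed to score the
    human-preferred image higher than the rejected one.
    `reward_model(prompts, images)` is assumed to return one scalar
    score per (prompt, image) pair; any text-image scorer fits here.
    """
    r_pref = reward_model(prompts, preferred_images)  # shape: (batch,)
    r_rej = reward_model(prompts, rejected_images)    # shape: (batch,)
    # -log sigmoid(r_pref - r_rej): a standard Bradley-Terry style objective
    return -F.logsigmoid(r_pref - r_rej).mean()
```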
Evaluating ImageReward's Efficacy
ImageReward outperforms existing scoring models such as CLIP, Aesthetic, and BLIP at predicting human preference over synthesized images, by margins of 38.6%, 39.6%, and 31.6%, respectively. Its close agreement with human judgments is further validated through extensive analysis and experiments. In addition, ImageReward shows notable potential as an automatic evaluation metric for text-to-image generation tasks.
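Used as an evaluation metric, the reward model can simply score and rank candidate generations for a prompt. The sketch below assumes the open-source `image-reward` package and its `RM.load` / `model.score` interface; the exact call names and checkpoint identifier may differ across versions.

```python
# Assumed usage of the open-source ImageReward package; the API shown
# here (RM.load, model.score) is taken as an assumption and may vary.
import ImageReward as RM

model = RM.load("ImageReward-v1.0")

prompt = "a watercolor painting of a fox in a snowy forest"
candidates = ["sample_0.png", "sample_1.png", "sample_2.png"]

# A higher score indicates closer agreement with human preference.
scores = [model.score(prompt, img) for img in candidates]
best = candidates[scores.index(max(scores))]
print(best, scores)
```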
ReFL: Reward Feedback Learning
ReFL (Reward Feedback Learning) is introduced to directly fine-tune diffusion generative models against the feedback of a reward scorer. The approach builds on the observation that reward scores of images predicted at late denoising steps are already informative about final output quality, so the reward signal can be backpropagated from those steps. Empirical evaluations show that ReFL outperforms alternative approaches such as data augmentation and loss reweighting.
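A simplified sketch of one ReFL-style update is shown below: the sampler denoises to a randomly chosen late step, predicts the clean image from there, scores it with the reward model, and backpropagates a reward-maximizing loss through that final step. The helper signatures (`unet`, `vae.decode`, `reward_model`) and the diffusers-style scheduler attributes (`prev_sample`, `pred_original_sample`) are assumptions for illustration, not the paper's exact code.

```python
import random
import torch

def refl_step(unet, vae, reward_model, scheduler, prompt_embeds, latents,
              lambda_reward=1e-3):
    """One ReFL-style update (sketch, under the assumptions noted above)."""
    timesteps = scheduler.timesteps
    # Pick a late denoising step; reward scores are only informative
    # once the image is mostly formed.
    cut = random.randint(int(0.75 * len(timesteps)), len(timesteps) - 1)

    # Denoise up to the chosen step without tracking gradients.
    with torch.no_grad():
        for t in timesteps[:cut]:
            noise_pred = unet(latents, t, prompt_embeds).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample

    # The final step keeps gradients so the reward signal reaches the UNet.
    t = timesteps[cut]
    noise_pred = unet(latents, t, prompt_embeds).sample
    pred_x0 = scheduler.step(noise_pred, t, latents).pred_original_sample

    images = vae.decode(pred_x0 / vae.config.scaling_factor).sample
    rewards = reward_model(prompt_embeds, images)  # one scalar score per image

    # Maximize reward, scaled so it does not overwhelm the base objective.
    loss = -lambda_reward * rewards.mean()
    loss.backward()
    return loss.detach()
```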
Conclusion and Broader Impact
ImageReward and ReFL collectively represent a significant stride toward aligning generative models with human values and preferences. Acknowledged limitations include the scale and diversity of the annotation data, and the possibility that a single reward model cannot capture the full range of human aesthetic preferences. Nevertheless, the advantages, such as mitigating over-reliance on training data with copyright issues and better conformance to social norms, significantly outweigh the downsides.