The paper "Rich Human Feedback for Text-to-Image Generation" (Liang et al., 2023 ) introduces a novel approach to enhance Text-to-Image (T2I) generation by incorporating rich human feedback. The authors address the limitations of current T2I models, which often produce images with artifacts, text misalignment, and low aesthetic quality, and the shortcomings of existing evaluation metrics that fail to capture nuanced image quality aspects.
To address these issues, the authors make the following contributions:
- RichHF-18K Dataset: They created a dataset of rich human feedback on 18,000 generated images, termed RichHF-18K, annotated with:
  - Point annotations marking implausibility/artifact regions and text-image misalignment regions.
  - Labels on the text prompts identifying misrepresented or missing concepts.
  - Fine-grained scores assessing image plausibility, text-image alignment, aesthetics, and overall quality.
- RAHF Model: The authors designed a multimodal transformer model, Rich Automatic Human Feedback (RAHF), to predict the rich human annotations. The model predicts implausibility and misalignment regions, misaligned keywords, and fine-grained scores, offering detailed insight into image quality.
- Improving Image Generation: The predicted rich human feedback from RAHF is leveraged to enhance image generation through:
  - Inpainting problematic image regions, using the predicted heatmaps as masks.
  - Finetuning image generation models on high-quality training examples selected by the predicted scores (a minimal sketch follows this list). The authors demonstrate improvements on the Muse model (Chang et al., 2023), even though Muse was not used to generate the images in the training set, indicating good generalization.
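The data-selection step in particular reduces to a simple filter over predicted scores. Below is a minimal sketch of that idea, assuming a hypothetical `predict_scores` callable that returns RAHF's fine-grained scores for an image-prompt pair; the score key and threshold are illustrative, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class GeneratedExample:
    image_path: str
    prompt: str

def select_finetuning_examples(examples, predict_scores, plausibility_threshold=0.8):
    """Keep only examples whose predicted plausibility clears the threshold.

    `predict_scores(image_path, prompt)` is a stand-in for a call to the RAHF
    model; it is assumed to return a dict such as {"plausibility": 0.91, ...}.
    """
    return [
        ex for ex in examples
        if predict_scores(ex.image_path, ex.prompt)["plausibility"] >= plausibility_threshold
    ]
```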
The paper details the data collection process for RichHF-18K, where annotators marked implausibility and misalignment regions on images, labeled misaligned keywords in the prompts, and assigned scores for various quality aspects. To ensure reliability, each image-text pair was annotated by three annotators, and the annotations were consolidated through averaging scores, majority voting for keywords, and averaging heatmaps.
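As a rough illustration of that consolidation step, the snippet below averages the scores and heatmaps and majority-votes the keyword labels across the three annotators; the per-annotator data layout is an assumption, not the format of the released dataset.

```python
import numpy as np

def consolidate(annotations):
    """Merge three annotators' labels for one image-text pair.

    `annotations` is assumed to be a list of dicts with keys
    'scores' (dict of floats), 'heatmap' (2-D array), and
    'misaligned_keywords' (set of prompt words).
    """
    # Average each fine-grained score across annotators.
    score_keys = annotations[0]["scores"].keys()
    scores = {k: float(np.mean([a["scores"][k] for a in annotations])) for k in score_keys}

    # Average the per-pixel region heatmaps.
    heatmap = np.mean([a["heatmap"] for a in annotations], axis=0)

    # Keep a keyword as "misaligned" only if a majority of annotators marked it.
    all_keywords = set().union(*[a["misaligned_keywords"] for a in annotations])
    majority = len(annotations) // 2 + 1
    keywords = {w for w in all_keywords
                if sum(w in a["misaligned_keywords"] for a in annotations) >= majority}

    return {"scores": scores, "heatmap": heatmap, "misaligned_keywords": keywords}
```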
The architecture of the RAHF model consists of a vision stream (ViT) and a text stream. The image tokens and embedded text tokens are concatenated and encoded by a Transformer self-attention encoder. The model employs predictors for heatmap prediction (convolution and deconvolution layers), score prediction (convolution and linear layers), and keyword misalignment sequence prediction (Transformer decoder). Two model variants are explored: a multi-head version with separate prediction heads for each output and an augmented prompt version that prepends a task string to the prompt.
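A minimal PyTorch-style skeleton of this architecture, for the multi-head variant, might look as follows. All dimensions, the stand-in vision/text encoders, and the exact head designs are assumptions; only the overall flow described above (concatenate image and text tokens, encode with a self-attention encoder, then branch into heatmap, score, and keyword-sequence heads) is taken from the paper.

```python
import torch
import torch.nn as nn

class RAHFSketch(nn.Module):
    """Illustrative skeleton of the multi-head RAHF variant (dimensions are made up)."""

    def __init__(self, d_model=256, vocab_size=32000, n_scores=4):
        super().__init__()
        # Stand-ins for the paper's pretrained ViT features and text embedder.
        self.image_proj = nn.Linear(768, d_model)          # ViT patch features -> d_model
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # Joint self-attention encoder over concatenated image + text tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=4)

        # Heatmap head: convolution + deconvolution over the image-token feature map.
        self.heatmap_head = nn.Sequential(
            nn.Conv2d(d_model, 64, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Score head: pooled features -> fine-grained scores.
        self.score_head = nn.Linear(d_model, n_scores)
        # Sequence head: a Transformer decoder predicting the misaligned-keyword sequence.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, vit_patch_feats, text_ids, target_ids):
        img_tok = self.image_proj(vit_patch_feats)          # (B, n_patches, d)
        txt_tok = self.text_embed(text_ids)                 # (B, T, d)
        fused = self.fusion(torch.cat([img_tok, txt_tok], dim=1))

        # Reshape the image-token features back into a grid for the conv heatmap head.
        n = img_tok.shape[1]
        side = int(n ** 0.5)
        img_feat = fused[:, :n, :]
        grid = img_feat.transpose(1, 2).reshape(-1, img_feat.shape[-1], side, side)
        heatmap = self.heatmap_head(grid)                   # (B, 1, 2*side, 2*side)

        scores = self.score_head(fused.mean(dim=1))         # (B, n_scores)

        # Teacher-forced decoding of the misaligned-keyword token sequence.
        dec = self.decoder(self.text_embed(target_ids), fused)
        keyword_logits = self.lm_head(dec)                  # (B, T_dec, vocab)
        return heatmap, scores, keyword_logits
```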
The experimental results demonstrate that the RAHF model can predict scores, implausibility heatmaps, misalignment heatmaps, and misalignment keyword sequences with reasonable accuracy. The augmented prompt version generally performs better than the multi-head version, as it allows the model to adapt to each specific task. Qualitative examples illustrate the model's ability to identify artifact regions and objects misaligned with the prompt.
The authors further demonstrate that the predicted rich human feedback can be used to improve image generation. Finetuning the Muse model (Chang et al., 2023) with examples selected based on predicted plausibility scores leads to images with fewer artifacts. Using the RAHF aesthetic score as classifier guidance for Latent Diffusion also improves the generated images. Additionally, the predicted heatmaps are used to perform region inpainting, resulting in more plausible images with fewer artifacts.
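The inpainting step essentially turns a predicted heatmap into a binary mask and hands it to a mask-conditioned inpainting model. A plausible sketch under that reading is shown below; the threshold, dilation amount, and the `inpaint_fn` callable are illustrative rather than the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage

def heatmap_to_inpainting_mask(heatmap, threshold=0.5, dilate_iters=8):
    """Turn a predicted implausibility heatmap (values in [0, 1]) into a binary mask."""
    mask = heatmap >= threshold
    # Dilate so the mask comfortably covers the artifact region.
    mask = ndimage.binary_dilation(mask, iterations=dilate_iters)
    return mask.astype(np.uint8)

def inpaint_artifacts(image, heatmap, prompt, inpaint_fn):
    """`inpaint_fn(image, mask, prompt)` stands in for any mask-conditioned inpainting model."""
    mask = heatmap_to_inpainting_mask(heatmap)
    return inpaint_fn(image, mask, prompt)
```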
The loss function for training the model is a weighted combination of the heatmap Mean Squared Error (MSE) loss, the score MSE loss, and the sequence teacher-forcing cross-entropy loss:

$$\mathcal{L} = \lambda_{\text{heatmap}}\,\mathcal{L}_{\text{heatmap}} + \lambda_{\text{score}}\,\mathcal{L}_{\text{score}} + \lambda_{\text{seq}}\,\mathcal{L}_{\text{seq}}$$

Where:
- $\mathcal{L}$ is the total loss
- $\lambda_{\text{heatmap}}$ is the weight for the heatmap loss
- $\mathcal{L}_{\text{heatmap}}$ is the mean squared error for heatmap prediction
- $\lambda_{\text{score}}$ is the weight for the score loss
- $\mathcal{L}_{\text{score}}$ is the mean squared error for score prediction
- $\lambda_{\text{seq}}$ is the weight for the sequence loss
- $\mathcal{L}_{\text{seq}}$ is the teacher-forcing cross-entropy loss for sequence prediction
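In PyTorch-style code, the objective could be written roughly as below; the weight values and tensor shapes are placeholders, and teacher forcing means the decoder is fed the ground-truth keyword tokens while cross-entropy is computed over its next-token predictions.

```python
import torch.nn.functional as F

def rahf_loss(pred_heatmap, gt_heatmap, pred_scores, gt_scores,
              keyword_logits, gt_keyword_ids,
              w_heatmap=1.0, w_score=1.0, w_seq=1.0):
    """Weighted sum of the three RAHF losses; the weights here are placeholders."""
    loss_heatmap = F.mse_loss(pred_heatmap, gt_heatmap)
    loss_score = F.mse_loss(pred_scores, gt_scores)
    # Teacher-forcing cross-entropy over the misaligned-keyword token sequence.
    loss_seq = F.cross_entropy(
        keyword_logits.reshape(-1, keyword_logits.size(-1)),
        gt_keyword_ids.reshape(-1),
    )
    return w_heatmap * loss_heatmap + w_score * loss_score + w_seq * loss_seq
```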
The authors acknowledge limitations, including the lower performance on misalignment heatmap prediction and the over-annotation issue in artifact region annotation. They suggest future research directions such as improving misalignment label quality, collecting more data on diverse generative models, and exploring other ways to leverage rich human feedback to enhance T2I generation.