Rich Human Feedback for Text-to-Image Generation (2312.10240v2)

Published 15 Dec 2023 in cs.CV

Abstract: Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for LLMs, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.

The paper "Rich Human Feedback for Text-to-Image Generation" (Liang et al., 2023 ) introduces a novel approach to enhance Text-to-Image (T2I) generation by incorporating rich human feedback. The authors address the limitations of current T2I models, which often produce images with artifacts, text misalignment, and low aesthetic quality, and the shortcomings of existing evaluation metrics that fail to capture nuanced image quality aspects.

To address these issues, the authors make the following contributions:

  • RichHF-18K Dataset: They created a dataset of rich human feedback on 18,000 images, termed RichHF-18K, annotated with:
    • Point annotations marking implausibility, artifacts, and text-image misalignment.
    • Labels on text prompts identifying misrepresented or missing concepts.
    • Fine-grained scores assessing image plausibility, text-image alignment, aesthetics, and overall quality.
  • RAHF (Rich Automatic Human Feedback) Model: The authors design a multimodal transformer, RAHF, to predict the rich human annotations automatically. The model predicts implausibility and misalignment regions, misaligned keywords, and fine-grained scores, offering detailed insights into image quality.
  • Improving Image Generation: The predicted rich human feedback from RAHF is leveraged to enhance image generation through:
    • Inpainting problematic image regions using predicted heatmaps as masks.
    • Finetuning image generation models by selecting high-quality training data based on predicted scores. The authors demonstrate improvements on the Muse model (Chang et al., 2023), even though it was not used to generate the images in the training set, indicating good generalization.

The paper details the data collection process for RichHF-18K, where annotators marked implausibility and misalignment regions on images, labeled misaligned keywords in the prompts, and assigned scores for various quality aspects. To ensure reliability, each image-text pair was annotated by three annotators, and the annotations were consolidated through averaging scores, majority voting for keywords, and averaging heatmaps.
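
As a concrete illustration of that consolidation step, the sketch below merges three annotators' labels by averaging the scores, majority-voting the per-token keyword flags, and averaging the heatmaps pixel-wise. The function name and data layout are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def consolidate_annotations(scores, keyword_flags, heatmaps):
    """Merge three annotators' labels for one image-text pair (illustrative).

    scores:        list of 3 dicts, e.g. {"plausibility": 4, "alignment": 5, ...}
    keyword_flags: list of 3 boolean arrays, one flag per prompt token
                   (True = token misrepresented or missing in the image)
    heatmaps:      list of 3 (H, W) arrays of per-pixel region annotations
    """
    # Fine-grained scores: simple average across annotators.
    merged_scores = {
        key: float(np.mean([s[key] for s in scores])) for key in scores[0]
    }

    # Keyword misalignment: majority vote per prompt token.
    votes = np.stack(keyword_flags).astype(int)           # (3, num_tokens)
    merged_keywords = votes.sum(axis=0) >= 2               # at least 2 of 3 agree

    # Region heatmaps: pixel-wise average of the annotators' maps.
    merged_heatmap = np.mean(np.stack(heatmaps), axis=0)   # (H, W)

    return merged_scores, merged_keywords, merged_heatmap
```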

The architecture of the RAHF model consists of a vision stream (ViT) and a text stream. The image tokens and embedded text tokens are concatenated and encoded by a Transformer self-attention encoder. The model employs predictors for heatmap prediction (convolution and deconvolution layers), score prediction (convolution and linear layers), and keyword misalignment sequence prediction (Transformer decoder). Two model variants are explored: a multi-head version with separate prediction heads for each output and an augmented prompt version that prepends a task string to the prompt.
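
The following PyTorch sketch illustrates this kind of architecture: a patch-based vision stream, an embedded text stream, a joint self-attention encoder, and separate heads for heatmaps, scores, and the misaligned-keyword sequence. All dimensions, layer counts, and the vocabulary size are placeholder assumptions; the paper's model uses a ViT vision stream, and everything else here is deliberately simplified.

```python
import torch
import torch.nn as nn

class RAHFSketch(nn.Module):
    """Illustrative skeleton of the RAHF design; not the paper's implementation."""

    def __init__(self, vocab_size=32000, d_model=512, img_tokens=196):
        super().__init__()
        self.img_tokens = img_tokens
        # Vision stream: patchify a 224x224 image into 14x14 = 196 tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Text stream: embed prompt token ids.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Joint encoder over concatenated image + text tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Heatmap head: reshape image tokens to a grid, then conv + deconv.
        self.heatmap_head = nn.Sequential(
            nn.Conv2d(d_model, 128, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 1, kernel_size=16, stride=16),  # back to 224x224
        )
        # Score head: pool image tokens, then a small MLP (4 fine-grained scores).
        self.score_head = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 4)
        )
        # Keyword-sequence head: a small autoregressive Transformer decoder.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, prompt_ids, target_ids):
        # image: (B, 3, 224, 224); prompt_ids, target_ids: (B, T)
        img = self.patch_embed(image).flatten(2).transpose(1, 2)    # (B, 196, D)
        txt = self.text_embed(prompt_ids)                           # (B, T, D)
        fused = self.encoder(torch.cat([img, txt], dim=1))          # (B, 196+T, D)

        img_feat = fused[:, : self.img_tokens]                      # image tokens only
        grid = img_feat.transpose(1, 2).reshape(-1, img_feat.size(-1), 14, 14)
        heatmap = self.heatmap_head(grid)                           # (B, 1, 224, 224)
        scores = self.score_head(img_feat.mean(dim=1))              # (B, 4)

        # Teacher-forced decoding of the misaligned-keyword sequence.
        T = target_ids.size(1)
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=target_ids.device), diagonal=1
        )
        dec_out = self.decoder(self.text_embed(target_ids), fused, tgt_mask=causal)
        keyword_logits = self.lm_head(dec_out)                      # (B, T, vocab)
        return heatmap, scores, keyword_logits
```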

The experimental results demonstrate that the RAHF model can predict scores, implausibility heatmaps, misalignment heatmaps, and misalignment keyword sequences with reasonable accuracy. The augmented prompt version generally performs better than the multi-head version, as it allows the model to adapt to each specific task. Qualitative examples illustrate the model's ability to identify artifact regions and objects misaligned with the prompt.

The authors further demonstrate that the predicted rich human feedback can be used to improve image generation. Finetuning the Muse model (Chang et al., 2023) with examples selected based on predicted plausibility scores leads to images with fewer artifacts. Using the RAHF aesthetic score as classifier guidance for Latent Diffusion also improves the generated images. Additionally, the predicted heatmaps are used to perform region inpainting, resulting in more plausible images with fewer artifacts.
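
To make these two uses concrete, the sketch below thresholds a predicted implausibility heatmap into a binary inpainting mask and filters candidate finetuning examples by predicted plausibility score. The threshold values and function names are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def heatmap_to_inpaint_mask(implausibility_heatmap, threshold=0.5, dilation_iters=5):
    """Threshold a predicted implausibility heatmap (values in [0, 1]) into a
    binary mask for region inpainting. Threshold and dilation are placeholders."""
    mask = implausibility_heatmap >= threshold
    # Dilate so the mask comfortably covers artifact boundaries.
    return binary_dilation(mask, iterations=dilation_iters).astype(np.uint8)

def select_finetuning_examples(examples, predicted_plausibility, min_score=0.8):
    """Keep generated images whose predicted plausibility exceeds a cutoff,
    mimicking the score-based data selection described above."""
    return [ex for ex, s in zip(examples, predicted_plausibility) if s >= min_score]
```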

The loss function for training the model is a weighted combination of the heatmap Mean Squared Error (MSE) loss, score MSE loss, and the sequence teacher-forcing cross-entropy loss.

$$\text{Loss} = \lambda_{\text{heatmap}} \cdot \text{MSE}_{\text{heatmap}} + \lambda_{\text{score}} \cdot \text{MSE}_{\text{score}} + \lambda_{\text{sequence}} \cdot \text{CrossEntropy}_{\text{sequence}}$$

Where:

  • $\text{Loss}$ is the total loss
  • $\lambda_{\text{heatmap}}$ is the weight for the heatmap loss
  • $\text{MSE}_{\text{heatmap}}$ is the mean squared error for heatmap prediction
  • $\lambda_{\text{score}}$ is the weight for the score loss
  • $\text{MSE}_{\text{score}}$ is the mean squared error for score prediction
  • $\lambda_{\text{sequence}}$ is the weight for the sequence loss
  • $\text{CrossEntropy}_{\text{sequence}}$ is the teacher-forcing cross-entropy loss for keyword sequence prediction
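
A minimal sketch of this weighted combination is shown below, assuming the model's three outputs and their targets are passed as dictionaries; the keys and default weights are hypothetical.

```python
import torch.nn.functional as F

def rahf_loss(pred, target, lambda_heatmap=1.0, lambda_score=1.0, lambda_sequence=1.0):
    """Weighted sum of the three training losses (weights are placeholders)."""
    heatmap_loss = F.mse_loss(pred["heatmap"], target["heatmap"])
    score_loss = F.mse_loss(pred["scores"], target["scores"])
    # Teacher-forcing cross-entropy over the keyword sequence:
    # logits (B, T, vocab) vs. token ids (B, T).
    seq_loss = F.cross_entropy(
        pred["keyword_logits"].transpose(1, 2),  # (B, vocab, T), as F.cross_entropy expects
        target["keyword_ids"],
    )
    return lambda_heatmap * heatmap_loss + lambda_score * score_loss + lambda_sequence * seq_loss
```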

The authors acknowledge limitations, including the lower performance on misalignment heatmap prediction and the over-annotation issue in artifact region annotation. They suggest future research directions such as improving misalignment label quality, collecting more data on diverse generative models, and exploring other ways to leverage rich human feedback to enhance T2I generation.

References (62)
  1. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Universal guidance for diffusion models, 2023.
  3. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  4. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  5. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  6. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  7. Pali: A jointly-scaled multilingual language-image model, 2022.
  8. Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. In arXiv:2310.18235, 2023a.
  9. Visual programming for text-to-image generation and evaluation. In NeurIPS, 2023b.
  10. Visual programming for text-to-image generation and evaluation. In NeurIPS, 2023c.
  11. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  12. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  13. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  14. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  15. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. In Advances in Neural Information Processing Systems, 2023.
  16. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  17. Matryoshka diffusion models, 2023.
  18. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  19. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2022.
  20. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  21. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016.
  22. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  23. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  24. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In ICCV, 2023.
  25. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023.
  26. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  27. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
  28. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  29. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  30. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
  31. Holistic evaluation of text-to-image models. In Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  32. Controllable text-to-image generation. Advances in Neural Information Processing Systems, 32, 2019.
  33. Spotlight: Mobile ui understanding using vision-language models with a focus. In International Conference on Learning Representations, 2023.
  34. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  35. Conditional image-to-video generation with latent flow diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18444–18455, 2023.
  36. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  37. Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14277–14286, 2023.
  38. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
  39. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  40. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022.
  41. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  42. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  43. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  44. Emu edit: Precise image editing via recognition and generation tasks, 2023.
  45. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014.
  46. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  47. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16515–16525, 2022.
  48. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  49. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  50. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In CVPR, 2023.
  51. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023a.
  52. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023b.
  53. A survey on video diffusion models. arXiv preprint arXiv:2310.10647, 2023.
  54. Imagereward: Learning and evaluating human preferences for text-to-image generation. In Neural Information Processing Systems, 2023.
  55. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  56. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022.
  57. What you see is what you read? improving text-image alignment evaluation, 2023.
  58. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
  59. Text-to-image diffusion model in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023a.
  60. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
  61. Perceptual artifacts localization for image synthesis tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7579–7590, 2023b.
  62. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5802–5810, 2019.
Authors (18)
  1. Youwei Liang
  2. Junfeng He
  3. Gang Li
  4. Peizhao Li
  5. Arseniy Klimovskiy
  6. Nicholas Carolan
  7. Jiao Sun
  8. Jordi Pont-Tuset
  9. Sarah Young
  10. Feng Yang
  11. Junjie Ke
  12. Krishnamurthy Dj Dvijotham
  13. Katie Collins
  14. Yiwen Luo
  15. Yang Li
  16. Kai J Kohlhoff
  17. Deepak Ramachandran
  18. Vidhya Navalpakkam