FERGI: Automatic Scoring of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction (2312.03187v4)

Published 5 Dec 2023 in cs.CV, cs.AI, cs.HC, and cs.LG

Abstract: Researchers have proposed using human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. We therefore develop and test a method to automatically score user preferences from their spontaneous facial expression reactions to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. We develop an FAU-Net (Facial Action Units Neural Network), which takes input from an AU estimation model, to automatically score user preferences for text-to-image generation from facial expression reactions, complementing pre-trained scoring models that operate on the input text prompts and generated images. Integrating our FAU-Net valence score with these pre-trained scoring models improves their consistency with human preferences. This method of automatic annotation via facial expression analysis can potentially be generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is available at the same link for research purposes.
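
The pipeline the abstract describes has two stages: an AU estimation model summarizes the user's facial reaction as a vector of action-unit activations, and a small network (FAU-Net) maps that vector to a valence score, which is then blended with a pre-trained prompt/image scorer. Below is a minimal PyTorch sketch of that flow, under stated assumptions: the 12-AU input size, the MLP shape, and the linear blend with weight alpha are illustrative choices, not the authors' implementation (the actual code is in the linked repository).

# Minimal sketch of the scoring flow described in the abstract. The 12-AU
# input size, hidden width, MLP shape, and mixing weight `alpha` are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class FAUNet(nn.Module):
    """Maps a facial action unit (AU) activation vector to a scalar valence score."""

    def __init__(self, num_aus: int = 12, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_aus, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, au_activations: torch.Tensor) -> torch.Tensor:
        # au_activations: (batch, num_aus), e.g. the output of an AU estimation model
        return self.mlp(au_activations).squeeze(-1)


def combined_score(au_activations, pretrained_score, fau_net, alpha=0.5):
    """Blend the facial-reaction valence score with a pre-trained
    prompt/image scorer; `alpha` is a hypothetical mixing weight."""
    valence = fau_net(au_activations)
    return alpha * valence + (1.0 - alpha) * pretrained_score


# Usage: score one generated image from a placeholder 12-dim AU activation
# vector plus a placeholder score from a pre-trained model.
fau_net = FAUNet()
aus = torch.rand(1, 12)
pretrained = torch.tensor([0.8])
print(combined_score(aus, pretrained, fau_net))

In practice the FAU-Net would be trained against users' own evaluations of the generated images, and the pre-trained score could come from a prompt/image preference model such as ImageReward; both are placeholders in this sketch.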

