VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation (2312.14867v2)

Published 22 Dec 2023 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: In the rapidly advancing field of conditional image generation, challenges such as limited explainability complicate the effective evaluation of model performance and capabilities. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation task. VIEScore leverages the general knowledge of Multimodal LLMs (MLLMs) as its backbone and requires no training or fine-tuning. We evaluate VIEScore on seven prominent conditional image generation tasks and find: (1) VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45; (2) VIEScore with open-source MLLMs is significantly weaker than GPT-4o and GPT-4v at evaluating synthetic images; (3) VIEScore achieves a correlation on par with human ratings on generation tasks but struggles on editing tasks. Given these results, we believe VIEScore shows great potential to replace human judges in evaluating image synthesis tasks.
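
The abstract describes the method only at a high level; the sketch below illustrates the MLLM-as-judge pattern it implies. This is a minimal illustration, not the paper's implementation: `query_mllm` is a hypothetical stand-in for whatever MLLM backbone is used (GPT-4o, GPT-4v, or an open-source model), and the rubric wording and geometric-mean aggregation in `vie_score` are assumptions rather than the paper's exact prompts and formula. The Spearman correlation against human ratings is computed with `scipy.stats.spearmanr`.

```python
# Minimal sketch of a VIEScore-style, MLLM-as-judge evaluation loop.
# `query_mllm` is a hypothetical stand-in for the MLLM backbone; swap in
# a real API call (GPT-4o, GPT-4v, or an open-source model).
from scipy.stats import spearmanr

RUBRIC = (
    "You are judging a conditional image generation result.\n"
    "Condition: {condition}\n"
    "Rate semantic consistency (does the image follow the condition?) and\n"
    "perceptual quality (is the image natural and artifact-free?), each on\n"
    "a 0-10 scale. Explain your reasoning, then give the two scores."
)

def query_mllm(prompt: str, image_path: str) -> tuple[float, float]:
    """Hypothetical MLLM call: returns (semantic, quality) scores in [0, 10]."""
    raise NotImplementedError("plug in an MLLM backbone here")

def vie_score(condition: str, image_path: str) -> float:
    """Score one generated image against its condition."""
    semantic, quality = query_mllm(RUBRIC.format(condition=condition), image_path)
    # Geometric mean keeps a low score on either axis decisive; this
    # aggregation is an assumption, not necessarily the paper's formula.
    return (semantic * quality) ** 0.5

def metric_human_correlation(samples: list[tuple[str, str]],
                             human_scores: list[float]) -> float:
    """Spearman rank correlation between metric and human ratings, i.e. the
    kind of number the abstract reports (0.4 metric-human vs. 0.45 human-human)."""
    metric_scores = [vie_score(cond, img) for cond, img in samples]
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho
```

Rank correlation, rather than raw score agreement, is the natural comparison here: the MLLM's 0-10 scale and the human raters' scale need not be calibrated to each other, only ordered consistently.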

Authors (5)
  1. Max Ku (11 papers)
  2. Dongfu Jiang (14 papers)
  3. Cong Wei (16 papers)
  4. Xiang Yue (72 papers)
  5. Wenhu Chen (134 papers)
Citations (28)