An Expert Overview of the ERNIE-ViLG 2.0 Text-to-Image Diffusion Model
The paper "ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts" explores advancements in text-to-image synthesis, particularly through enhancing diffusion models with knowledge integration and expert-driven denoising strategies. ERNIE-ViLG 2.0 specifically addresses issues of image fidelity and text relevance that have persisted in previous models.
Model Innovations
ERNIE-ViLG 2.0 integrates two primary innovations: knowledge-enhanced learning and a mixture-of-denoising-experts (MoDE) strategy.
- Knowledge-Enhanced Learning: To improve semantic alignment, the model incorporates fine-grained textual and visual knowledge during training. On the textual side, part-of-speech tagging identifies semantically important words in the prompt (such as nouns and adjectives) so the model can give them greater emphasis. On the visual side, an object detector identifies salient image regions so that they can be aligned with the corresponding words in the prompt.
- Mixture-of-Denoising-Experts: Recognizing that different denoising timesteps have varying requirements, ERNIE-ViLG 2.0 divides the denoising process into specialized stages, each handled by an expert network. This strategy permits the model to adapt better to the distinct challenges of different denoising phases without increasing computational demands during inference.
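The textual-knowledge idea above can be sketched as a simple per-token reweighting scheme. This is an illustrative toy, not the paper's implementation: the tag set, the `boost` value, and the function names are all hypothetical, and in the actual model such weights would scale each token's contribution to the training objective rather than be used directly.

```python
# Toy sketch of part-of-speech-based token weighting (all names and
# values are illustrative, not from the ERNIE-ViLG 2.0 paper).
KEY_POS = {"NOUN", "PROPN", "ADJ"}  # tags treated as semantically important

def token_weights(tagged_tokens, boost=2.0):
    """Return a per-token weight for each (token, pos_tag) pair.

    Tokens whose tag is in KEY_POS receive `boost`; all others keep a
    neutral weight of 1.0. Conceptually, these weights would amplify
    the key tokens' influence on the text-image alignment objective.
    """
    return [boost if pos in KEY_POS else 1.0 for _, pos in tagged_tokens]

# Example prompt, pre-tagged (any off-the-shelf POS tagger could supply tags).
prompt = [("a", "DET"), ("red", "ADJ"), ("panda", "NOUN"),
          ("climbing", "VERB"), ("a", "DET"), ("tree", "NOUN")]

weights = token_weights(prompt)
# "red", "panda", and "tree" are upweighted; function words are not.
```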
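The MoDE routing can be sketched as a simple timestep-to-expert lookup. The constants and function names below are hypothetical (the real experts are large denoising U-Nets), but the sketch shows why inference cost per step stays flat: the timesteps are split into contiguous blocks and only the one expert assigned to the current block runs at each step.

```python
# Minimal sketch of mixture-of-denoising-experts routing.
# T and N_EXPERTS are illustrative, not the paper's exact configuration.
T = 1000        # total diffusion timesteps
N_EXPERTS = 10  # number of denoising experts

def expert_index(t: int) -> int:
    """Map timestep t (0 <= t < T) to the expert owning its stage.

    Timesteps are divided into N_EXPERTS contiguous blocks, so each
    expert specializes in one phase of denoising (early steps shape
    global layout, late steps refine detail).
    """
    return min(t * N_EXPERTS // T, N_EXPERTS - 1)

# Hypothetical stand-ins for the expert networks: each is a callable
# that would denoise a latent at its assigned timesteps.
experts = [lambda latent, t, i=i: f"expert {i} denoises step {t}"
           for i in range(N_EXPERTS)]

def denoise_step(latent, t):
    # Only one expert is invoked per step, so per-step inference cost
    # does not grow as more experts are added.
    return experts[expert_index(t)](latent, t)
```

This block-routing design lets total parameter count scale with the number of experts while the compute for any single sampling step stays the same as a single-network model.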
Experimental Results
ERNIE-ViLG 2.0 was tested on the MS-COCO dataset, achieving impressive results with a zero-shot FID-30k score of 6.75, significantly outperforming comparable models such as Imagen and Parti. The model's performance benefits both from the stage-specialized denoising of MoDE and from the tighter alignment with textual intent provided by knowledge enhancement.
Human evaluations demonstrate ERNIE-ViLG 2.0's proficiency in generating high-quality images with robust semantic alignment, evidenced by its superior results compared to DALL-E 2 and Stable Diffusion across diverse prompts in ViLG-300, a bilingual (Chinese-English) evaluation set.
Practical and Theoretical Implications
The practical implications of this work extend into improved capabilities for applications requiring precise text-to-image translations, enhancing creative image generation and artistic endeavors. From a theoretical standpoint, the hybrid approach of integrating knowledge with diffusion processes suggests beneficial pathways for future exploration into refined denoising techniques.
Future Directions
There is potential to further scale model parameters by adding more denoising experts or incorporating multiple text encoders, preserving inference efficiency (only one expert runs per step) while enhancing model specificity and alignment capabilities. Additionally, expanding external knowledge resources may continue to advance the semantic fidelity of generated images.
In summary, ERNIE-ViLG 2.0 represents a noteworthy step in text-to-image generation, successfully integrating domain-specific knowledge into the diffusion model framework and offering insightful strategies for disentangling complex denoising tasks. The implications of these methodological advancements hold promise for further research and application within the AI community.