- The paper introduces Hunyuan-DiT, a bilingual diffusion transformer integrating CLIP and T5 encoders for fine-grained image generation from Chinese and English prompts.
- It pairs an iteratively optimized data pipeline with architectural choices such as Rotary Positional Encoding, reaching 74.2% text-image consistency in human evaluation.
- The study demonstrates interactive multi-turn dialogue capabilities that refine image outputs, paving the way for practical multimodal applications.
A Deep Dive into Hunyuan-DiT: A Bilingual Text-to-Image Diffusion Transformer
Hunyuan-DiT Overview
Hunyuan-DiT is a text-to-image diffusion transformer designed to handle both English and Chinese prompts with fine-grained understanding. The model stands out for generating high-quality images at multiple resolutions and for engaging in multi-turn multimodal dialogue; its outputs were assessed through a meticulous human evaluation protocol involving more than 50 professional evaluators.
Key Components of Hunyuan-DiT
1. Transformer Architecture and Text Encoders
Hunyuan-DiT features a new transformer-based network that leverages two powerful text encoders: a bilingual CLIP and a multilingual T5 encoder. This setup aims to improve language understanding and extend the context length, providing a robust foundation for image generation from both English and Chinese text prompts.
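To make the dual-encoder idea concrete, here is a minimal sketch of how token embeddings from two text encoders can be projected to a common width and concatenated along the sequence dimension to condition a diffusion transformer. The checkpoint names, projection layers, and dimensions below are illustrative assumptions, not the released Hunyuan-DiT implementation (its bilingual CLIP is a custom model).

```python
# Illustrative dual text-encoder conditioning; checkpoint names and dims are
# placeholders, not Hunyuan-DiT's own.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel, MT5EncoderModel

class DualTextEncoder(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-large-patch14",
                 t5_name="google/mt5-large", cond_dim=1024):
        super().__init__()
        self.clip_tok = AutoTokenizer.from_pretrained(clip_name)
        self.clip = AutoModel.from_pretrained(clip_name).text_model
        self.t5_tok = AutoTokenizer.from_pretrained(t5_name)
        self.t5 = MT5EncoderModel.from_pretrained(t5_name)
        # Project both token streams into the diffusion model's conditioning width.
        self.clip_proj = nn.Linear(self.clip.config.hidden_size, cond_dim)
        self.t5_proj = nn.Linear(self.t5.config.d_model, cond_dim)

    @torch.no_grad()
    def forward(self, prompts):
        c = self.clip_tok(prompts, padding=True, truncation=True, return_tensors="pt")
        t = self.t5_tok(prompts, padding=True, truncation=True, return_tensors="pt")
        clip_emb = self.clip_proj(self.clip(**c).last_hidden_state)  # (B, Lc, D)
        t5_emb = self.t5_proj(self.t5(**t).last_hidden_state)        # (B, Lt, D)
        # Concatenating along the sequence axis yields a longer effective
        # context than either encoder provides alone.
        return torch.cat([clip_emb, t5_emb], dim=1)                  # (B, Lc+Lt, D)
```

The diffusion transformer's cross-attention layers can then attend over the combined token sequence, drawing on CLIP's vision-aligned semantics and T5's longer, more compositional text understanding at once.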
2. Data Pipeline and Iterative Optimization
The construction of Hunyuan-DiT involves an elaborate data processing pipeline that includes:
- Data acquisition from various sources.
- Tagging and layering of data based on quality and purpose.
- Iterative optimization through a mechanism called ‘data convoy.’
The data convoy mechanism evaluates the impact of newly added data on model quality, so that each iteration of the dataset improves the model rather than merely enlarging the training set.
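One plausible reading of the data convoy is a gated admission loop: a candidate data pool joins the training set only if a model fine-tuned with it scores better on a held-out evaluation. The sketch below is an interpretation under that assumption, with `finetune` and `evaluate` as hypothetical stand-ins, not the paper's actual pipeline.

```python
# Hedged sketch of an iterative "data convoy" loop. `finetune(model, data)` and
# `evaluate(model)` (higher is better) are hypothetical stand-in callables.
def data_convoy(model, train_set, candidate_pools, finetune, evaluate):
    best = evaluate(model)
    for pool in candidate_pools:        # e.g. a new crawl or a re-captioned batch
        trial = finetune(model, train_set + pool)
        score = evaluate(trial)
        if score > best:                # pool helped: admit the data, keep checkpoint
            model, train_set, best = trial, train_set + pool, score
        # otherwise the pool is rejected and the convoy moves on
    return model, train_set
```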
3. Multimodal LLM (MLLM) for Caption Refinement
To improve the quality of image-text pairs, a multimodal LLM is trained to rewrite raw captions into structured captions enriched with world knowledge. This refinement step is crucial for producing high-quality images and sharpens the model's understanding of fine-grained detail in both languages.
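As a toy illustration, caption refinement can be framed as prompting a vision-language model to rewrite raw alt-text into a structured description. The template and the `mllm_generate` callable below are assumptions for illustration; the paper trains its own MLLM for this step.

```python
# Hypothetical caption-refinement step; the prompt template and `mllm_generate`
# are stand-ins, not the paper's trained MLLM.
CAPTION_PROMPT = (
    "Rewrite the raw tag '{raw}' as a structured caption for this image, covering "
    "subject, attributes, setting, style, and any relevant world knowledge."
)

def refine_caption(image, raw_caption, mllm_generate):
    """mllm_generate(image, prompt) -> str can be any capable vision-language model."""
    return mllm_generate(image, CAPTION_PROMPT.format(raw=raw_caption))
```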
4. Interactivity with Multi-Turn Dialogue
Hunyuan-DiT isn't just a passive generator; it can modify images interactively through multi-turn dialogues with users. This feature allows users to iteratively refine their prompts and get closer to their desired image outputs.
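A minimal sketch of such a loop, assuming a helper LLM that condenses the dialogue history into a drawing prompt each turn (`llm_to_prompt` and `generate_image` are hypothetical stand-ins for the dialogue model and the diffusion model):

```python
# Toy multi-turn refinement loop; `llm_to_prompt` and `generate_image` are
# hypothetical stand-ins, not Hunyuan-DiT's dialogue interface.
def dialogue_session(llm_to_prompt, generate_image):
    history, image = [], None
    while (user_msg := input("You: ")).strip().lower() != "quit":
        history.append(("user", user_msg))
        prompt = llm_to_prompt(history)                    # condense history into a prompt
        image = generate_image(prompt, init_image=image)   # refine the previous result
        history.append(("assistant", prompt))
    return image
```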
Evaluation and Results
Quantitative Performance
Hunyuan-DiT achieves impressive results, particularly for Chinese-language generation. Key highlights from the evaluations include:
- Text-Image Consistency: Hunyuan-DiT leads open-source models with a score of 74.2% for how accurately generated images match the provided prompts.
- Subject Clarity and Aesthetics: The model also renders clear, aesthetically pleasing images, comparable to top-tier closed-source models such as DALL-E 3 and Midjourney v6.
Evaluation Protocol
An extensive framework spanning multiple metrics, prompt difficulties, and content categories ensures a thorough evaluation. The team uses prompts ranging from simple to complex, and evaluators score each result across several dimensions (a sketch of aggregating these scores follows the list):
- Text-Image Consistency
- AI Artifacts Exclusion
- Subject Clarity
- Overall Aesthetics
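As an illustration of how such multi-dimensional ratings might be combined, the sketch below averages per-dimension scores across evaluators; the equal weighting and the [0, 1] score range are assumptions, not the paper's protocol.

```python
# Hedged sketch of aggregating human evaluation scores across the four
# dimensions above; equal weighting is an assumption.
from statistics import mean

DIMENSIONS = ("text_image_consistency", "ai_artifacts_excluded",
              "subject_clarity", "overall_aesthetics")

def aggregate(ratings):
    """ratings: one dict per evaluator, mapping dimension -> score in [0, 1]."""
    per_dim = {d: mean(r[d] for r in ratings) for d in DIMENSIONS}
    overall = mean(per_dim.values())
    per_dim["overall"] = overall
    return per_dim
```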
Ablation Studies and Technical Improvements
The authors conducted various experiments to optimize the model’s performance:
- Model Structure: Integrating long skip connections and adopting Rotary Positional Encoding (RoPE) significantly improved model performance (a minimal RoPE sketch follows this list).
- Text Encoding: Combining the bilingual CLIP and multilingual T5 encoders led to superior semantic understanding and text-image generation quality.
- Position Encoding: Centralized Interpolative Positional Encoding proved more efficient for multi-resolution image generation than Extended Positional Encoding.
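For reference, here is the standard 1-D rotate-half formulation of RoPE. Hunyuan-DiT applies a 2-D rotary variant to image patch coordinates, and its centralized interpolative scheme maps patch grids of different resolutions into a shared, centered coordinate range; this toy deliberately omits both of those details.

```python
# Standard 1-D RoPE (rotate-half form), for intuition only; Hunyuan-DiT uses a
# 2-D variant over image patch coordinates.
import torch

def rope(x, base=10000.0):
    """x: (..., seq_len, dim) with even dim; rotates feature pairs by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because the rotation encodes positions multiplicatively in attention, relative offsets between tokens are preserved regardless of absolute position, which is one reason rotary schemes adapt well to varying sequence lengths and resolutions.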
Practical and Theoretical Implications
The advancements presented in Hunyuan-DiT have several implications:
- Practical Use: The ability to generate detailed, high-quality images from bilingual prompts opens up new possibilities for creative industries, multilingual user interfaces, and educational tools.
- Theoretical Progress: The integration of RoPE and multi-turn dialogue capabilities into diffusion models represents a step forward in enhancing transformer-based architectures’ flexibility and interactivity.
Future Directions
Hunyuan-DiT sets the stage for future research in several areas:
- Enhanced Data Processing: Exploring better data layering techniques and more sophisticated data evaluation protocols.
- Algorithmic Enhancements: Developing more efficient training and inference algorithms to further optimize performance and reduce computational costs.
- User Interaction: Refining multi-turn dialogue mechanisms to make AI interactions more natural and intuitive.
In summary, Hunyuan-DiT represents a well-rounded advance in text-to-image generation, notably for Chinese and English prompts. Its design and evaluation demonstrate what interactive, multimodal AI systems can achieve, paving the way for future innovations in this fast-moving field.