HunyuanImage 3.0 Technical Report (2509.23951v1)

Published 28 Sep 2025 in cs.CV

Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0

Summary

  • The paper introduces HunyuanImage 3.0, a unified multimodal autoregressive model that advances state-of-the-art image understanding and generation.
  • It employs a rigorous data curation pipeline and a hierarchical bilingual captioning schema to achieve high-quality semantic diversity.
  • The model leverages an innovative Mixture-of-Experts architecture with advanced attention mechanisms and Chain-of-Thought training, outperforming prior versions.

HunyuanImage 3.0: A Native Multimodal Autoregressive Model for Unified Image Understanding and Generation

Model Motivation and Data Curation

HunyuanImage 3.0 addresses the lack of large-scale, open-source, state-of-the-art text-to-image models by introducing a unified multimodal architecture capable of both image understanding and generation. The model is built upon a rigorous data curation pipeline, starting from over 10 billion raw images and filtering down to a high-quality, semantically diverse dataset of nearly 5 billion images. The filtering process combines technical quality checks, learning-based detectors for watermarks, logos, and AI-generated content, and subject-scoring models for clarity and aesthetics. Specialized datasets, including knowledge-augmented, text-related, and stylized images, are incorporated to enhance semantic breadth.
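As a rough sketch, such a cascade can be organized so that cheap deterministic checks run before the learned detectors, which only see surviving candidates. The thresholds and stub detectors below are hypothetical placeholders, not the paper's actual filters.

```python
# Hypothetical staged filtering cascade for raw images; thresholds and
# detector functions are illustrative placeholders, not the paper's models.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class ImageRecord:
    path: str
    width: int
    height: int

def technical_ok(img: ImageRecord) -> bool:
    # Cheap deterministic checks run first (resolution, extreme aspect ratio).
    short, long = min(img.width, img.height), max(img.width, img.height)
    return short >= 256 and long / short <= 4

def build_pipeline(detectors: list[Callable[[ImageRecord], bool]]) -> Callable[[Iterable[ImageRecord]], Iterator[ImageRecord]]:
    # Learned detectors (watermark, logo, AI-generated content, aesthetics)
    # run after the cheap checks so expensive models see fewer candidates.
    def run(images: Iterable[ImageRecord]) -> Iterator[ImageRecord]:
        for img in images:
            if technical_ok(img) and all(keep(img) for keep in detectors):
                yield img
    return run

# Stub detectors stand in for learned classifiers in this sketch.
pipeline = build_pipeline([lambda img: True, lambda img: True])
kept = list(pipeline([ImageRecord("a.jpg", 1024, 768), ImageRecord("b.jpg", 100, 100)]))
print([r.path for r in kept])  # -> ['a.jpg']
```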

A hierarchical, bilingual (English/Chinese) captioning schema is employed, decomposing image content into descriptive levels, stylistic attributes, and factual entities. Compositional caption synthesis augments data diversity, while specialized agents (OCR and named entity recognition) and a bidirectional verification loop ensure factual grounding. For paired and multi-image data, an image difference captioner is used to generate detailed change descriptions, supporting editing and reasoning tasks.
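One way to picture a hierarchical, bilingual caption record and its compositional synthesis is the sketch below; the field names and composition rule are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative (hypothetical) record for a hierarchical, bilingual caption;
# field names are assumptions for exposition, not the paper's schema.
from dataclasses import dataclass, field
import random

@dataclass
class Caption:
    short_en: str                                       # brief description (English)
    short_zh: str                                       # brief description (Chinese)
    detailed_en: str                                    # dense description (English)
    detailed_zh: str                                    # dense description (Chinese)
    style: list[str] = field(default_factory=list)      # stylistic attributes
    entities: list[str] = field(default_factory=list)   # factual entities checked via OCR/NER

def compose_training_caption(c: Caption, lang: str = "en") -> str:
    # Compositional synthesis: randomly mix description levels and attributes
    # to diversify supervision for the same image.
    use_detailed = random.random() < 0.5
    body = (c.detailed_en if lang == "en" else c.detailed_zh) if use_detailed \
        else (c.short_en if lang == "en" else c.short_zh)
    return body + (" Style: " + ", ".join(c.style) if c.style else "")
```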

To enable advanced reasoning, the dataset includes both text-to-text (T2T) and text-to-text-to-image (T2TI) reasoning data, facilitating Chain-of-Thought (CoT) training. This allows the model to autonomously interpret prompts, refine concepts, and synthesize images through intermediate reasoning steps.
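A minimal, hypothetical illustration of how T2T and T2TI samples might be laid out as training records; the tags and field names below are placeholders, not the paper's actual special tokens.

```python
# Hypothetical layout of reasoning samples; <think> and <gen_image> are
# illustrative placeholders rather than the paper's real special tokens.
t2t_sample = {
    "prompt": "Plan a poster for a jazz festival.",
    "reasoning": "<think>Identify key elements: headline act, date, venue, retro palette.</think>",
    "response": "A concise textual poster plan.",
}

t2ti_sample = {
    "prompt": "Draw a cat playing a saxophone on a rooftop at dusk.",
    "reasoning": "<think>Refine: warm backlighting, silhouetted skyline, cat seated on a ledge.</think>",
    "response": "<gen_image>",  # image tokens would follow the intermediate reasoning
}
```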

Architecture and Multimodal Integration

HunyuanImage 3.0 is based on the Hunyuan-A13B Mixture-of-Experts (MoE) LLM, with over 80B total parameters and 13B activated per token. The architecture integrates a VAE for image encoding (32-dimensional latent space, 16x downsampling) and a vision encoder, with dual projectors aligning their features into the transformer's latent space. This dual-encoder strategy enables unified multimodal representation, supporting both understanding and generation within a single sequence, in contrast to prior models that segregate features by task.
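A compact sketch of the dual-projector idea follows; the dimensions, module names, and choice of simple linear projections are placeholders rather than the paper's actual components.

```python
import torch
import torch.nn as nn

class DualProjector(nn.Module):
    """Hypothetical sketch: map VAE latents (used for generation) and vision-encoder
    features (used for understanding) into a shared transformer hidden size so both
    token types can live in one autoregressive sequence."""
    def __init__(self, vae_dim: int = 32, vit_dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.gen_proj = nn.Linear(vae_dim, hidden)   # VAE latent channels -> LLM hidden size
        self.und_proj = nn.Linear(vit_dim, hidden)   # vision-encoder features -> LLM hidden size

    def forward(self, vae_latents: torch.Tensor, vit_feats: torch.Tensor):
        # vae_latents: [B, N_gen, vae_dim]; vit_feats: [B, N_und, vit_dim]
        return self.gen_proj(vae_latents), self.und_proj(vit_feats)

proj = DualProjector()
gen_tok, und_tok = proj(torch.randn(1, 256, 32), torch.randn(1, 196, 1024))
print(gen_tok.shape, und_tok.shape)  # [1, 256, 4096] and [1, 196, 4096]
```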

A key innovation is the Generalized Causal Attention mechanism, which restricts text tokens to attend only to previous multimodal tokens, while image tokens can attend to all previous multimodal tokens and all subsequent image tokens within the same segment. This design preserves autoregressive properties for text while enabling global spatial dependencies for images.
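The masking rule can be sketched directly from this description. The helper below builds a boolean mask from per-token modality flags and image-segment ids; it follows the stated rule, not the released implementation.

```python
import numpy as np

def generalized_causal_mask(is_image: np.ndarray, segment_id: np.ndarray) -> np.ndarray:
    """Build a boolean attention mask (True = may attend) for a mixed sequence.

    Text tokens keep strict causal attention; image tokens additionally attend
    to every image token of their own segment, including later ones, giving
    bidirectional attention within an image while staying causal overall.
    Sketch based on the paper's description, not the released code.
    """
    n = len(is_image)
    causal = np.tril(np.ones((n, n), dtype=bool))                       # token i sees j <= i
    same_segment = segment_id[:, None] == segment_id[None, :]
    intra_image = is_image[:, None] & is_image[None, :] & same_segment  # full attention inside one image
    return causal | intra_image

# Toy sequence: 3 text tokens, a 4-token image (segment 1), then 2 text tokens.
is_image = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)
segment  = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])
mask = generalized_causal_mask(is_image, segment)
print(mask[3, 6])  # True: an image token sees a later token of the same image
print(mask[7, 8])  # False: text after the image remains strictly causal
```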

Position encoding is handled via a Generalized 2D RoPE, maintaining backward compatibility with pretrained LLMs. Image tokens are assigned 2D positional encodings, while text tokens retain 1D RoPE, ensuring seamless integration and minimizing disruption to linguistic capabilities.
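One way to sketch the position assignment is to give text tokens degenerate (p, p) coordinates, which collapse to ordinary 1D RoPE, while image tokens receive genuine (row, column) coordinates over their grid. How the counters interleave across modalities in the sketch below is an assumption, not the paper's exact scheme.

```python
def assign_positions(tokens):
    """Assign (row, col) RoPE coordinates to a mixed token sequence.

    Hedged sketch: text tokens use identical row == col indices, which reduces
    to standard 1D RoPE and preserves compatibility with the pretrained LLM;
    image tokens get 2D coordinates over their patch grid. The offsetting rule
    here is one plausible choice, not the paper's exact formulation.

    `tokens` is a list of ("text", None) or ("image", (height, width)) items.
    """
    positions, cursor = [], 0
    for kind, shape in tokens:
        if kind == "text":
            positions.append((cursor, cursor))  # degenerate 2D index == 1D RoPE
            cursor += 1
        else:
            h, w = shape
            for r in range(h):
                for c in range(w):
                    positions.append((cursor + r, cursor + c))
            cursor += max(h, w)  # advance past the image grid (one plausible choice)
    return positions

print(assign_positions([("text", None), ("image", (2, 2)), ("text", None)]))
```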

The model supports automatic resolution and aspect ratio selection via special tokens, allowing it to infer appropriate image shapes from context or explicit user cues.
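A tiny illustration of shape selection driven by special tokens; the token names and resolution buckets below are hypothetical.

```python
# Hypothetical aspect-ratio buckets and special tokens; names are illustrative only.
RATIO_TOKENS = {
    "<ratio_1:1>":  (1024, 1024),
    "<ratio_16:9>": (1344, 768),
    "<ratio_9:16>": (768, 1344),
    "<ratio_auto>": None,  # model infers the shape from the prompt context
}

def resolve_shape(user_token: str, model_choice=(1024, 1024)):
    """If the user pins a ratio token, honor it; with the auto token, the
    model's own predicted shape is used instead (a stub value here)."""
    shape = RATIO_TOKENS.get(user_token)
    return shape if shape is not None else model_choice

print(resolve_shape("<ratio_16:9>"))   # (1344, 768)
print(resolve_shape("<ratio_auto>"))   # model-chosen shape
```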

Training Regimen and Post-Training Optimization

Training is organized into four progressive stages, with increasing image resolution and dataset quality. The initial stage aligns text and image modalities at low resolution, followed by fine-tuning of the vision encoder, joint training at higher resolutions, and finally high-resolution training with reasoning data for CoT capabilities. Instruction tuning is performed after pretraining, using instruction templates for text-to-image, language modeling, and CoT data.
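As a rough illustration, the progressive schedule can be written down as a staged configuration; the resolutions, trainable-component lists, and data mixes below are placeholders rather than the paper's reported settings.

```python
# Hypothetical staged training schedule mirroring the four stages described
# above; resolutions and data mixes are illustrative placeholders.
STAGES = [
    {"name": "stage1_align",     "image_res": 256,  "trainable": ["projectors", "llm"], "data": ["t2i_low_res", "captioning"]},
    {"name": "stage2_vit",       "image_res": 256,  "trainable": ["vision_encoder"],    "data": ["understanding"]},
    {"name": "stage3_joint",     "image_res": 512,  "trainable": ["all"],               "data": ["t2i", "understanding"]},
    {"name": "stage4_hires_cot", "image_res": 1024, "trainable": ["all"],               "data": ["t2i_high_quality", "t2t", "t2ti_reasoning"]},
]

for s in STAGES:
    print(f"{s['name']}: res={s['image_res']}, train={s['trainable']}, data={s['data']}")
```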

Post-training involves a multi-stage process:

  • SFT on curated, high-quality human-annotated data.
  • DPO to suppress structural deficiencies using preference signals from paired high/low-quality samples (a minimal loss sketch follows this list).
  • MixGRPO for online RL optimization, improving aesthetics, realism, and text-image alignment.
  • SRPO for gradient-guided realism and aesthetic enhancement, leveraging differentiable reward signals.
  • ReDA, a novel reward distribution alignment algorithm, to minimize divergence from a high-reward distribution, further improving visual quality.
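
Below is a minimal sketch of the generic DPO objective referenced above, applied to paired high/low-quality samples. The beta value and the toy log-probabilities are placeholders, and this generic formulation is not claimed to match the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard Direct Preference Optimization loss on sequence log-probabilities.

    logp_* come from the policy being trained, ref_logp_* from a frozen
    reference model; beta controls how far the policy may drift. This is the
    generic DPO objective, shown only to illustrate the preference stage.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy tensors standing in for summed token log-probabilities of image sequences.
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-130.0]),
                torch.tensor([-125.0]), torch.tensor([-128.0]))
print(loss.item())  # ~0.40 for this toy example
```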

Evaluation and Empirical Results

HunyuanImage 3.0 is evaluated using both automatic and human-centric metrics. The Structured Semantic Alignment Evaluation (SSAE) metric leverages LLMs and MLLMs to assess text-image alignment across 12 fine-grained semantic fields, using 500 diverse prompts and 3,500 key points. HunyuanImage 3.0 achieves performance on par with leading closed-source models in all fields.
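A hedged sketch of one plausible way to aggregate key-point judgments into per-field and overall SSAE scores; the field names and the simple mean aggregation are assumptions, not the paper's exact protocol.

```python
from collections import defaultdict

def aggregate_ssae(judgments):
    """Aggregate binary key-point judgments into per-field and overall scores.

    `judgments` is a list of (semantic_field, is_aligned) pairs produced by an
    LLM/MLLM judge for each key point extracted from a prompt. Averaging per
    field and then across fields is one plausible aggregation; the paper's
    exact protocol may differ.
    """
    per_field = defaultdict(list)
    for sem_field, ok in judgments:
        per_field[sem_field].append(float(ok))
    field_scores = {f: sum(v) / len(v) for f, v in per_field.items()}
    overall = sum(field_scores.values()) / len(field_scores)
    return field_scores, overall

scores, overall = aggregate_ssae([
    ("entity", True), ("entity", False), ("style", True), ("relation", True),
])
print(scores, round(overall, 3))
```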

In the GSB (Good/Same/Bad) evaluation, with 1,000 prompts and over 100 professional evaluators, HunyuanImage 3.0 demonstrates a 14.10% win rate over HunyuanImage 2.1 and positive margins over Seedream 4.0, Nano Banana, and GPT-Image, establishing it as the most powerful open-source text-to-image model to date.
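For context, one common GSB convention computes the relative win rate as (Good − Bad) / Total; whether the paper uses exactly this definition is an assumption, and the vote counts below are hypothetical numbers chosen only to illustrate the arithmetic.

```python
def gsb_win_rate(good: int, same: int, bad: int) -> float:
    """Relative win rate of model A over model B from Good/Same/Bad votes.

    (good - bad) / total is one common GSB convention; whether the paper uses
    exactly this definition is an assumption here.
    """
    total = good + same + bad
    return (good - bad) / total

# Hypothetical vote counts that would yield a ~14.1% relative win rate.
print(round(gsb_win_rate(good=450, same=241, bad=309), 3))  # 0.141
```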

Expert activation analysis reveals increasing specialization of MoE experts for different modalities in deeper layers, suggesting that MoE architectures can enhance multimodal modeling by distributing responsibilities among specialized experts.

Implications and Future Directions

HunyuanImage 3.0 demonstrates that large-scale, open-source multimodal models can achieve parity with leading closed-source systems in both image generation quality and semantic alignment. The unified autoregressive framework, dual-encoder integration, and advanced attention and position encoding mechanisms provide a robust foundation for future multimodal research.

The model's native support for Chain-of-Thought reasoning and automatic resolution selection opens avenues for more controllable, context-aware image generation. The comprehensive data curation and captioning pipeline sets a new standard for dataset quality in generative modeling.

Practically, HunyuanImage 3.0 enables researchers and practitioners to build upon a state-of-the-art foundation for tasks such as text-to-image, image editing, and multimodal reasoning. The open release of code and weights is poised to foster further innovation and reproducibility in the field.

Future work includes extending the model to support image-to-image tasks and further enhancing its reasoning and editing capabilities. The demonstrated effectiveness of MoE specialization and advanced RL-based post-training suggests promising directions for scaling and refining multimodal generative models.

Conclusion

HunyuanImage 3.0 represents a significant advancement in open-source multimodal modeling, unifying image understanding and generation within a scalable, autoregressive MoE framework. Through meticulous data curation, innovative architecture, and rigorous training and evaluation, the model achieves state-of-the-art performance in text-to-image generation. Its release is expected to catalyze further research and application development in multimodal AI.
