
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model (2503.07703v1)

Published 10 Mar 2025 in cs.CV

Abstract: Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, adeptly managing text prompts in both Chinese and English and supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances accuracy and richness in image description. In particular, Seedream is integrated with a self-developed bilingual LLM as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

Summary

Seedream 2.0: A Bilingual Image Generation Foundation Model

The evolution of diffusion models has significantly advanced the field of image generation, with models such as Flux, SD3.5, and Midjourney leading the charge in commercial applications. Yet, these models exhibit certain limitations, including biases towards aesthetics over functional correctness, inadequate text rendering, and a poor grasp of cultural nuances, particularly those rooted in Chinese traditions. Seedream 2.0 emerges as a solution to these constraints, presenting an adept bilingual text-to-image foundation model capable of generating high-fidelity images with both Chinese and English prompts.

Seedream 2.0 integrates a self-developed bilingual LLM as a text encoder, giving it native support for understanding and rendering text accurately in both languages. This capability stems from learning directly from vast datasets of culturally relevant content. The model's architecture includes several innovations: a Glyph-Aligned ByT5 mechanism for precise character-level text rendering, a Scaled ROPE for generalization to untrained resolutions, and a multi-phase post-training stage combining supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align outputs with human artistic preferences.
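
The summary does not give the exact formulation of Scaled ROPE, but a minimal sketch of one plausible reading is shown below: 2D rotary position indices are centered on the image and rescaled by the ratio of the training-resolution token grid to the current grid, so an untrained resolution falls inside the position range seen during training. The function name and the base_size, patch, and theta parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def scaled_rope_2d(height, width, dim, base_size=512, patch=16, theta=10000.0):
    """Hypothetical sketch of a scaled 2D rotary position embedding.

    Positions are centered on the image and rescaled by the ratio of the
    training token grid (base_size // patch) to the current grid, so that
    images at untrained resolutions reuse the position range seen during
    training. Returns cos/sin tables of shape (num_tokens, dim // 2).
    """
    h_tokens, w_tokens = height // patch, width // patch
    base_tokens = base_size // patch

    # Center each axis at zero and rescale to the training token range
    # (assumed scaling rule, not taken from the paper).
    ys = (np.arange(h_tokens) - h_tokens / 2) * (base_tokens / h_tokens)
    xs = (np.arange(w_tokens) - w_tokens / 2) * (base_tokens / w_tokens)

    # Half of the rotary channels follow the y coordinate, half follow x.
    half = dim // 2
    inv_freq = 1.0 / (theta ** (np.arange(0, half, 2) / half))

    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    ang_y = grid_y.reshape(-1, 1) * inv_freq            # (H*W, half // 2)
    ang_x = grid_x.reshape(-1, 1) * inv_freq            # (H*W, half // 2)
    angles = np.concatenate([ang_y, ang_x], axis=-1)    # (H*W, half)

    return np.cos(angles), np.sin(angles)

# Example: a 1024x768 image with a 512px square training resolution.
cos, sin = scaled_rope_2d(1024, 768, dim=128)
print(cos.shape)  # (3072, 64): 64x48 token grid, 64 rotary angle channels
```

In the usual rotary-embedding setup, attention queries and keys would be rotated by these cosine/sine tables before the dot product; the sketch only illustrates how rescaling the position grid could let a model trained at one resolution extrapolate to others.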

Extensive experiments highlight Seedream 2.0's state-of-the-art proficiency across several dimensions, including prompt-following, aesthetics, text rendering, and structural correctness. The model notably excels at generating content that reflects nuanced Chinese cultural elements and demonstrates superior text rendering, particularly in complex scenarios involving intricate Chinese characters. An outstanding ELO score further attests to its alignment with human preferences across a comprehensive measure of aesthetic quality, text-image alignment, and structural integrity.

Further, Seedream 2.0's extensibility is showcased through its adaptation into instruction-based image editing models like SeedEdit. This adaptation exploits the model's underlying textual conditioning capabilities, enhancing its utility for professionals and creatives working with image manipulation tasks.

In terms of practical implications, Seedream 2.0's bilingual proficiency and cultural understanding make it applicable across diverse fields, including design, art, media, and advertising. Incorporated into platforms such as Doubao and Dreamina, the model offers tools for enhancing productivity and creativity in both professional and everyday settings, with ongoing development expected in instruction-based image editing and customization.

Looking ahead, the adoption and further evolution of Seedream 2.0 could promote its integration into more specialized artistic and design processes, exploring deeper interactions between image generation models and local cultural content. The foundation set by Seedream 2.0 for understanding bilingual prompts and rendering complex text structures points towards a future where AI systems can intuitively grasp and represent diverse cultural contexts with increasing accuracy.
