CosmicMan: A Text-to-Image Foundation Model for Humans (2404.01294v1)

Published 1 Apr 2024 in cs.CV

Abstract: We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean resolution of 1488x1255, and attached with precise text annotations deriving from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into down-streaming tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion model, and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze.

Summary

The paper introduces CosmicMan, a specialized T2I model that generates high-fidelity human images with enhanced text-image alignment.
It leverages the innovative Annotate Anyone paradigm to produce the CosmicMan-HQ dataset with 6M images and 115M attribute annotations.
The Daring training framework, featuring data discretization and HOLA loss, significantly improves output quality and alignment for human-centric tasks.

CosmicMan: Pioneering the Specialization of Text-to-Image Models in Human Image Generation

Introduction to CosmicMan

Text-to-image (T2I) foundation models such as DALL-E, Imagen, and Stable Diffusion (SD) have significantly advanced image generation. Trained on extensive image-text datasets with sophisticated generative algorithms, they produce images with remarkable fidelity and detail. For human-centric content, however, they fall short: general-purpose models yield inferior quality and text-image misalignment for humans, and no foundation model has focused specifically on human subjects.

To address this, we introduce CosmicMan, a T2I foundation model dedicated to generating high-fidelity human images. CosmicMan outperforms general-purpose models by ensuring meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions for human images.

CosmicMan-HQ Dataset Construction

Much of CosmicMan's effectiveness stems from the CosmicMan-HQ dataset, constructed via a new data production paradigm named Annotate Anyone, which emphasizes human-AI collaboration. The paradigm is designed to keep producing high-quality human-centric data over time, matching the demanding requirements of human image generation.

Annotate Anyone Paradigm

Annotate Anyone introduces a systematic, scalable approach to data collection and annotation that leverages both human expertise and AI capabilities. This paradigm involves two primary stages:

Flowing Data Sourcing: By continuously monitoring a broad spectrum of internet sources alongside recycling academic datasets such as LAION-5B, SHHQ, and DeepFashion, Annotate Anyone ensures a diverse and expansive data pool.
Human-in-the-loop Data Annotation: This iterative process involves human annotators refining AI-generated labels, focusing on attributes that fail to meet a predefined accuracy threshold, significantly reducing manual annotation costs while improving label quality.
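
The routing step in the human-in-the-loop stage can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `AttributeStats` class, the attribute names, the audited-sample counts, and the 0.9 threshold are all assumptions made for the example. It only shows the idea of sending attributes whose estimated accuracy falls below a threshold back to human annotators.

```python
from dataclasses import dataclass


@dataclass
class AttributeStats:
    """Audit result for one attribute (hypothetical structure)."""
    name: str
    correct: int  # AI labels confirmed correct in a human-audited sample
    audited: int  # size of the audited sample

    @property
    def accuracy(self) -> float:
        # An attribute with no audited samples is treated as unverified (0.0).
        return self.correct / self.audited if self.audited else 0.0


def attributes_needing_review(stats, threshold=0.9):
    """Return names of attributes whose audited accuracy is below the threshold."""
    return [s.name for s in stats if s.accuracy < threshold]


# Illustrative numbers only; they are not taken from the paper.
stats = [
    AttributeStats("hair_color", 97, 100),
    AttributeStats("sleeve_length", 81, 100),
    AttributeStats("shoe_type", 88, 100),
]
to_review = attributes_needing_review(stats, threshold=0.9)
print(to_review)
```

Only the flagged attributes would go back for manual correction, which is where the cost saving described above would come from under this reading.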

The outcome is the CosmicMan-HQ 1.0 dataset, which comprises 6 million high-quality real-world human images (mean resolution 1488x1255) annotated with text derived from 115 million attributes, providing a robust foundation for the CosmicMan model.

Decomposed-Attention-Refocusing (Daring) Training Framework

CosmicMan leverages the Daring training framework, which is designed to be both effective and straightforward to integrate into downstream tasks. Key innovations of Daring include:

Data Discretization: By decomposing dense text descriptions into fixed groups aligned with human body structure, CosmicMan can more effectively learn the intricate relationships between textual concepts and their corresponding visual representations.
HOLA Loss: The Human Body and Outfit Guided Loss for Alignment (HOLA) focuses on improving text-image alignment at the group level, enhancing the model's ability to generate images conforming to detailed descriptions.
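
As a rough illustration of data discretization, the sketch below sorts per-attribute annotations into a fixed set of groups that follow body structure and emits one short text per group. The group names, the `GROUP_OF` mapping, and the attribute names are hypothetical choices for this example; the source states only that descriptions are decomposed into fixed groups aligned with human body structure.

```python
from collections import defaultdict

# Hypothetical mapping from attribute name to body-structure group.
GROUP_OF = {
    "hair_color": "head",
    "hair_style": "head",
    "top_type": "upper_body",
    "sleeve_length": "upper_body",
    "bottom_type": "lower_body",
    "shoe_type": "shoes",
}

# Fixed group order, so every description is decomposed the same way.
GROUP_ORDER = ["head", "upper_body", "lower_body", "shoes"]


def discretize(attributes):
    """Split a flat {attribute: phrase} dict into per-group text snippets."""
    grouped = defaultdict(list)
    for name, phrase in attributes.items():
        group = GROUP_OF.get(name)
        if group is not None:  # attributes without a known group are skipped
            grouped[group].append(phrase)
    # Keep only groups that received at least one phrase, in the fixed order.
    return {g: ", ".join(grouped[g]) for g in GROUP_ORDER if grouped[g]}


example = {
    "hair_color": "black hair",
    "top_type": "white shirt",
    "sleeve_length": "long sleeves",
    "shoe_type": "sneakers",
}
print(discretize(example))
```

A group-level alignment objective such as HOLA could then operate on each group's text separately rather than on the whole description at once.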

Evaluation and Applications

In comparisons with state-of-the-art foundation models, CosmicMan generates human images with improved fidelity and text-image alignment. Extensive ablation studies validate the contributions of the Annotate Anyone paradigm and the Daring training framework to the model's performance.

Furthermore, application tests in 2D human image editing and 3D human reconstruction highlight the practical advantages of CosmicMan as a specialized foundation model for human-centric tasks.

Conclusion and Future Directions

CosmicMan represents a significant step forward in the specialization of text-to-image foundation models for human-centered applications. By addressing the unique challenges of human image generation, CosmicMan sets a new benchmark for future research in this domain.

As part of our long-term commitment, we plan to continually update both the CosmicMan-HQ dataset and the CosmicMan model, ensuring they remain at the forefront of advancements in human image generation technology.

Tweets

https://twitter.com/_akhaliq/status/1775176061220712815

https://twitter.com/arankomatsuzaki/status/1775157234864996683

https://twitter.com/javaeeeee1/status/1776604542257299764

https://twitter.com/yukitaylor00/status/1775321337277579426

https://twitter.com/arxivsanitybot/status/1775877828632072516