CosmicMan: A Text-to-Image Foundation Model for Humans

(arXiv:2404.01294)
Published Apr 1, 2024 in cs.CV

Abstract

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models, which struggle with inferior quality and text-image misalignment for humans, CosmicMan generates photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential to the final results of trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean resolution of 1488x1255, paired with precise text annotations derived from 115 million attributes of diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic: easy to integrate into downstream tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion models and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing the continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem with ease.

Figure: Comparison between CosmicMan-SDXL and the SDXL pretrained model in 2D human editing using T2I-Adapter.

Overview

  • CosmicMan is a new text-to-image (T2I) foundation model specialized in generating high-fidelity human images, outperforming general-purpose models in terms of appearance, structure, and text-image alignment.

  • The model is powered by the CosmicMan-HQ dataset, built with the Annotate Anyone paradigm, which combines human expertise and AI for continuous, high-quality human-centric data creation.

  • CosmicMan employs the Decomposed-Attention-Refocusing (Daring) training framework, featuring Data Discretization and HOLA Loss techniques for enhanced learning and alignment in image generation.

  • The model shows superior performance in generating human images over existing foundation models and offers practical advantages in applications like 2D image editing and 3D human reconstruction.

CosmicMan: Pioneering the Specialization of Text-to-Image Models in Human Image Generation

Introduction to CosmicMan

The advent of text-to-image (T2I) foundation models such as DALL-E, Imagen, and Stable Diffusion (SD) has significantly advanced image generation. These models, benefiting from extensive image-text datasets and sophisticated generative algorithms, have shown an impressive ability to generate images with remarkable fidelity and detail. However, their application to human-centric content generation exposes a critical limitation: the lack of a specialized foundation model focused exclusively on human subjects.

To address this, we introduce CosmicMan, a T2I foundation model dedicated to generating high-fidelity human images. CosmicMan outperforms general-purpose models by ensuring meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions for human images.

CosmicMan-HQ Dataset Construction

The effectiveness of CosmicMan stems from the CosmicMan-HQ dataset, constructed via a novel data production paradigm named Annotate Anyone, emphasizing human-AI collaboration. This paradigm ensures the ongoing creation of high-quality human-centric data, aligning with the complex requirements of human image generation.

Annotate Anyone Paradigm

Annotate Anyone introduces a systematic, scalable approach to data collection and annotation that leverages both human expertise and AI capabilities. This paradigm involves two primary stages:

  1. Flowing Data Sourcing: By continuously monitoring a broad spectrum of internet sources alongside recycling academic datasets such as LAION-5B, SHHQ, and DeepFashion, Annotate Anyone ensures a diverse and expansive data pool.
  2. Human-in-the-loop Data Annotation: In this iterative process, human annotators refine AI-generated labels, focusing only on attributes that fail to meet a predefined accuracy threshold. This significantly reduces manual annotation costs while improving label quality (see the sketch after this list).
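As a rough illustration of the human-in-the-loop stage, the sketch below shows one plausible form of the control loop: attributes whose AI-predicted labels fall below an accuracy threshold are routed to human annotators, and the verified labels are fed back to improve the AI annotator. The function names and the threshold value are assumptions for illustration, not the authors' actual implementation.

```python
# Hedged sketch of a human-in-the-loop annotation loop.
# Function names and the threshold are hypothetical, not the paper's code.

ACCURACY_THRESHOLD = 0.95  # assumed per-attribute accuracy target


def annotate_batch(images, attribute_names, ai_annotator, human_pool):
    """Label a batch of images, escalating low-accuracy attributes to humans."""
    labels = {attr: ai_annotator.predict(images, attr) for attr in attribute_names}

    for attr in attribute_names:
        # Estimate AI accuracy on a small human-verified sample.
        accuracy = human_pool.spot_check(images, attr, labels[attr])
        if accuracy < ACCURACY_THRESHOLD:
            # Only the failing attribute is escalated, so manual cost stays
            # proportional to where the AI annotator is weak.
            labels[attr] = human_pool.annotate(images, attr)
            # Verified labels later fine-tune the AI annotator,
            # closing the data-flywheel loop.
            ai_annotator.add_training_data(images, attr, labels[attr])

    return labels
```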

The outcome is the CosmicMan-HQ dataset, which comprises 6 million high-resolution images annotated with 115 million attributes, providing a robust foundation for the CosmicMan model.
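To make "attributes in diverse granularities" concrete, a single annotation record might look roughly like the following Python dictionary. The field names and values are hypothetical illustrations; the actual CosmicMan-HQ schema is not spelled out in this summary.

```python
# Hypothetical sketch of one CosmicMan-HQ annotation record; the real schema
# and field names may differ from this illustration.
record = {
    "image_id": "00000001",
    "resolution": (1488, 1255),  # the dataset's mean resolution is 1488x1255
    "attributes": {
        "whole_body": {"gender": "woman", "age": "young adult"},
        "head": {"hair_color": "black", "hair_length": "shoulder-length"},
        "upper_body": {"garment": "denim jacket", "sleeve_length": "long"},
        "lower_body": {"garment": "pleated skirt", "color": "white"},
        "shoes": {"type": "sneakers"},
    },
    # A dense caption can be assembled from the grouped attributes.
    "caption": (
        "A young adult woman with shoulder-length black hair, wearing a denim "
        "jacket over a white pleated skirt and sneakers."
    ),
}
```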

Decomposed-Attention-Refocusing (Daring) Training Framework

CosmicMan leverages the Daring training framework, which is designed to be both effective and straightforward to integrate into downstream tasks. Key innovations of Daring include:

  • Data Discretization: By decomposing dense text descriptions into fixed groups aligned with human body structure, CosmicMan can more effectively learn the intricate relationships between textual concepts and their corresponding visual representations.
  • HOLA Loss: The Human Body and Outfit Guided Loss for Alignment (HOLA) improves text-image alignment at the group level, enhancing the model's ability to generate images that conform to detailed descriptions (a minimal sketch of this group-level idea follows the list).
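The sketch below illustrates the group-level alignment idea, assuming a latent-diffusion setup where per-token cross-attention maps are available and a human-parsing mask exists for each body or outfit group. The mean aggregation and the MSE form are assumptions made for illustration; the paper's actual HOLA formulation may differ in detail.

```python
import torch.nn.functional as F


def group_alignment_loss(cross_attn, group_token_ids, group_masks):
    """Group-level attention alignment, in the spirit of the HOLA loss.

    cross_attn:      (B, T, H, W) cross-attention maps, one per text token.
    group_token_ids: dict mapping group name -> list of token indices that
                     belong to that group (e.g. the "upper body" phrase).
    group_masks:     dict mapping group name -> (B, H, W) parsing mask of the
                     body/outfit region the group describes.
    """
    loss = 0.0
    for group, token_ids in group_token_ids.items():
        # Aggregate attention over the group's tokens (assumption: mean).
        attn = cross_attn[:, token_ids].mean(dim=1)  # (B, H, W)
        mask = group_masks[group].float()
        # Match the mask resolution to the attention map if they differ.
        if mask.shape[-2:] != attn.shape[-2:]:
            mask = F.interpolate(mask[:, None], size=attn.shape[-2:],
                                 mode="nearest")[:, 0]
        # Encourage each group's attention to concentrate on its own region.
        loss = loss + F.mse_loss(attn, mask)
    return loss / max(len(group_token_ids), 1)
```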

Evaluation and Applications

Compared with state-of-the-art foundation models, CosmicMan demonstrates superior capabilities in generating human images with improved fidelity and alignment. Extensive ablation studies validate the contributions of the Annotate Anyone paradigm and the Daring training framework to the model's performance.

Furthermore, application tests in 2D human image editing and 3D human reconstruction highlight the practical advantages of CosmicMan as a specialized foundation model for human-centric tasks.
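Because Daring adds no extra modules, a CosmicMan-SDXL checkpoint should drop into standard SDXL tooling. The snippet below is a minimal sketch using the Hugging Face diffusers library; the checkpoint path "cosmicman/CosmicMan-SDXL" is a placeholder assumption rather than a confirmed model ID, so check the official release for the actual weights.

```python
# Hedged sketch: loading a CosmicMan-SDXL checkpoint into a stock SDXL pipeline.
# The model path is a placeholder, not a confirmed Hugging Face hub ID.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "cosmicman/CosmicMan-SDXL",  # placeholder; see the official release
    torch_dtype=torch.float16,
).to("cuda")

prompt = ("A young woman with shoulder-length black hair, wearing a denim "
          "jacket and a white pleated skirt, full-body photo, studio lighting")
image = pipe(prompt, height=1024, width=1024).images[0]
image.save("cosmicman_sample.png")
```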

Conclusion and Future Directions

CosmicMan represents a significant step forward in the specialization of text-to-image foundation models for human-centered applications. By addressing the unique challenges of human image generation, CosmicMan sets a new benchmark for future research in this domain.

As part of our long-term commitment, we plan to continually update both the CosmicMan-HQ dataset and the CosmicMan model, ensuring they remain at the forefront of advancements in human image generation technology.

References
  1. nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision, pages 8948–8957
  2. Improving image generation with better captions. 2023.
  3. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset

  4. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299
  5. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10
  6. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis
  7. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19982–19993
  8. Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
  9. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023a
  10. Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
  11. DeepFloyd. Deepfloyd-if
  12. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE
  13. Taming transformers for high-resolution image synthesis
  14. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations
  15. Flickr. Flickr application programming interface (api), 2023. Accessed: 2023-11-18.
  16. Stylegan-human: A data-centric odyssey of human generation. In European Conference on Computer Vision, pages 1–19. Springer
  17. GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
  18. Orthoplanes: A novel representation for better 3d-awareness of gans. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22996–23007
  19. Prompt-to-Prompt Image Editing with Cross Attention Control
  20. CLIPScore: A Reference-free Evaluation Metric for Image Captioning
  21. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30
  22. Denoising Diffusion Probabilistic Models
  23. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
  24. Imagededup. https://github.com/idealo/imagededup

  25. Fashionpedia: Ontology, segmentation, and an attribute localization dataset. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 316–332. Springer
  26. Text2human: Text-driven controllable human image generation. ACM Transactions on Graphics (TOG), 41(4):1–11
  27. Ifqa: Interpretable face quality assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3444–3453
  28. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–629, 2023a.
  29. HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation
  30. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR
  31. Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):3260–3271
  32. Divide and bind your attention for improved generative semantic nursing
  33. Fashiontex: Controllable virtual try-on with text and texture. In ACM SIGGRAPH 2023 Conference Proceedings
  34. Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning
  35. Microsoft coco: Common objects in context. In ECCV 2014
  36. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309
  37. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104
  38. Decoupled Weight Decay Regularization
  39. Midjourney. Midjourney
  40. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM International Conference on Multimedia
  41. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
  42. Fice: Text-conditioned fashion image editing with guided gan inversion
  43. Pixabay. Pixabay application programming interface (api), 2023. Accessed: 2023-11-18.
  44. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models
  45. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  46. DreamFusion: Text-to-3D using 2D Diffusion
  47. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors
  48. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, pages 8821–8831
  49. Hierarchical text-conditional image generation with clip latents
  50. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment
  51. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695
  52. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510
  53. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems
  54. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization
  55. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294
  56. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL
  57. TextCaps: A dataset for image captioning with reading comprehension. 2020.
  58. EVA-CLIP: Improved Training Techniques for CLIP at Scale
  59. Unsplash. Unsplash application programming interface (api), 2023. Accessed: 2023-11-18.
  60. Neural discrete representation learning. In Advances in Neural Information Processing Systems
  61. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023a.
  62. Decompose and realign: Tackling condition misalignment in text-to-image diffusion models, 2023b
  63. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
  64. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
  65. 4k4d: Real-time 4d view synthesis at 4k resolution. 2023.
  66. 3dhumangan: 3d-aware human image generation with 3d pose mapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23008–23019
  67. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023a.
  68. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023b.
  69. Diffcloth: Diffusion based garment synthesis and manipulation via structural cross-modal semantic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23154–23163, 2023c.
  70. Generative adversarial network for text-to-face synthesis and manipulation with pretrained bert model. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 01–08
