
CosmicMan: A Text-to-Image Foundation Model for Humans (2404.01294v1)

Published 1 Apr 2024 in cs.CV

Abstract: We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results of trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean resolution of 1488x1255, attached with precise text annotations derived from 115 million attributes at diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into downstream tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion models and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing the continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem with ease.

References (70)
  1. nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision, pages 8948–8957, 2019.
  2. Improving image generation with better captions. 2023.
  3. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
  4. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
  5. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  6. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.
  7. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19982–19993, 2023.
  8. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. arXiv preprint arXiv:2310.18235, 2023.
  9. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023a.
  10. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023b.
  11. DeepFloyd. Deepfloyd-if, 2023.
  12. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  13. Taming transformers for high-resolution image synthesis, 2020.
  14. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2023.
  15. Flickr. Flickr application programming interface (api), 2023. Accessed: 2023-11-18.
  16. Stylegan-human: A data-centric odyssey of human generation. In European Conference on Computer Vision, pages 1–19. Springer, 2022.
  17. Geneval: An object-focused framework for evaluating text-to-image alignment. arXiv preprint arXiv:2310.11513, 2023.
  18. Orthoplanes: A novel representation for better 3d-awareness of gans. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22996–23007, 2023.
  19. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  20. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
  21. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  22. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
  23. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023.
  24. Imagededup. https://github.com/idealo/imagededup, 2019.
  25. Fashionpedia: Ontology, segmentation, and an attribute localization dataset. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 316–332. Springer, 2020.
  26. Text2human: Text-driven controllable human image generation. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022.
  27. Ifqa: Interpretable face quality assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3444–3453, 2023.
  28. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–629, 2023a.
  29. Humansd: A native skeleton-guided diffusion model for human image generation. arXiv preprint arXiv:2304.04269, 2023b.
  30. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  31. Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):3260–3271, 2020.
  32. Divide and bind your attention for improved generative semantic nursing, 2023.
  33. Fashiontex: Controllable virtual try-on with text and texture. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
  34. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. arXiv preprint arXiv:2111.10603, 2021.
  35. Microsoft coco: Common objects in context. In ECCV 2014, 2014.
  36. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
  37. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016.
  38. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  39. Midjourney. Midjourney, 2023.
  40. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM International Conference on Multimedia, 2023.
  41. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  42. Fice: Text-conditioned fashion image editing with guided gan inversion, 2023.
  43. Pixabay. Pixabay application programming interface (api), 2023. Accessed: 2023-11-18.
  44. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016.
  45. Sdxl: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  46. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  47. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  48. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, pages 8821–8831, 2021.
  49. Hierarchical text-conditional image generation with clip latents, 2022.
  50. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment, 2023.
  51. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  52. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510, 2023.
  53. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
  54. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172, 2019.
  55. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  56. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
  57. Textcaps: a dataset for image captioning with reading comprehension. 2020.
  58. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
  59. Unsplash. Unsplash application programming interface (api), 2023. Accessed: 2023-11-18.
  60. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.
  61. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023a.
  62. Decompose and realign: Tackling condition misalignment in text-to-image diffusion models, 2023b.
  63. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
  64. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
  65. 4k4d: Real-time 4d view synthesis at 4k resolution. 2023.
  66. 3dhumangan: 3d-aware human image generation with 3d pose mapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23008–23019, 2023.
  67. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023a.
  68. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023b.
  69. Diffcloth: Diffusion based garment synthesis and manipulation via structural cross-modal semantic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23154–23163, 2023c.
  70. Generative adversarial network for text-to-face synthesis and manipulation with pretrained bert model. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pages 01–08, 2021.

Summary

  • The paper introduces CosmicMan, a specialized T2I model that generates high-fidelity human images with enhanced text-image alignment.
  • It leverages the innovative Annotate Anyone paradigm to produce the CosmicMan-HQ dataset with 6M images and 115M attribute annotations.
  • The Daring training framework, featuring data discretization and HOLA loss, significantly improves output quality and alignment for human-centric tasks.

CosmicMan: Pioneering the Specialization of Text-to-Image Models in Human Image Generation

Introduction to CosmicMan

The advent of text-to-image (T2I) foundation models like DALLE, Imagen, and Stable Diffusion (SD) has significantly advanced image generation. Benefiting from extensive image-text datasets and sophisticated generative algorithms, these models produce images with remarkable fidelity and detail. Their application to human-centric content generation, however, exposes a critical limitation: the lack of a specialized foundation model focused exclusively on human subjects.

To address this, we introduce CosmicMan, a T2I foundation model dedicated to generating high-fidelity human images. CosmicMan outperforms general-purpose models by ensuring meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions for human images.

CosmicMan-HQ Dataset Construction

The effectiveness of CosmicMan stems from the CosmicMan-HQ dataset, constructed via a novel data production paradigm named Annotate Anyone, emphasizing human-AI collaboration. This paradigm ensures the ongoing creation of high-quality human-centric data, aligning with the complex requirements of human image generation.

Annotate Anyone Paradigm

Annotate Anyone introduces a systematic, scalable approach to data collection and annotation that leverages both human expertise and AI capabilities. This paradigm involves two primary stages:

  1. Flowing Data Sourcing: By continuously monitoring a broad spectrum of internet sources alongside recycling academic datasets such as LAION-5B, SHHQ, and DeepFashion, Annotate Anyone ensures a diverse and expansive data pool.
  2. Human-in-the-loop Data Annotation: In this iterative process, human annotators refine AI-generated labels, focusing on attributes that fail to meet a predefined accuracy threshold. This significantly reduces manual annotation cost while improving label quality; a minimal sketch of the loop appears after this list.
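
The paper does not publish pseudocode for this annotation loop, so the following is only a minimal sketch of how threshold-based routing between AI and human annotators might work. All names here (ai_annotate, human_verify, ACCURACY_THRESHOLD) and the threshold value are hypothetical, not part of the released system.

```python
import random

# Hypothetical sketch of the human-in-the-loop annotation stage:
# AI models propose attribute labels with confidences; only attributes
# whose confidence falls below a threshold are routed to human annotators.

ACCURACY_THRESHOLD = 0.9  # assumed cutoff; the paper's actual value may differ

def ai_annotate(image_id: str) -> dict[str, tuple[str, float]]:
    """Stand-in for the AI annotators: returns {attribute: (label, confidence)}."""
    return {
        "hair_color": ("brown", random.uniform(0.5, 1.0)),
        "upper_garment": ("t-shirt", random.uniform(0.5, 1.0)),
        "body_pose": ("standing", random.uniform(0.5, 1.0)),
    }

def human_verify(attribute: str, label: str) -> str:
    """Stand-in for a human annotator correcting a low-confidence label."""
    return label  # a real system would collect a corrected label here

def annotate(image_id: str) -> dict[str, str]:
    """Route each AI label either straight through or via human review."""
    labels = {}
    for attr, (label, conf) in ai_annotate(image_id).items():
        if conf < ACCURACY_THRESHOLD:
            label = human_verify(attr, label)  # humans refine weak labels
        labels[attr] = label
    return labels

print(annotate("img_000001"))
```

The point of the design is that human effort concentrates only where the AI annotators are unreliable, which is what makes the flywheel cost-effective over time.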

The outcome is the CosmicMan-HQ dataset, which comprises 6 million high-resolution images annotated with 115 million attributes, providing a robust foundation for the CosmicMan model.
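
The paper reports attribute counts and granularities rather than a public schema; a hypothetical record along these lines may help make the annotation structure concrete. Every field name below is illustrative, not the dataset's actual format.

```python
# Hypothetical CosmicMan-HQ-style record; field names are illustrative only.
record = {
    "image": "images/000001.jpg",
    "resolution": (1488, 1255),  # the dataset's mean resolution, per the paper
    "attributes": {              # attributes at several granularities
        "global": {"gender": "female", "age_group": "adult"},
        "head":   {"hair_color": "black", "hair_length": "long"},
        "upper":  {"garment": "denim jacket", "sleeve": "long"},
        "lower":  {"garment": "skirt", "length": "knee-length"},
    },
    "caption": "A woman with long black hair wearing a denim jacket "
               "and a knee-length skirt.",
}
print(record["attributes"]["upper"])
```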

Decomposed-Attention-Refocusing (Daring) Training Framework

CosmicMan leverages the Daring training framework, which is designed to be both effective and straightforward to integrate into downstream tasks. Key innovations of Daring include:

  • Data Discretization: By decomposing dense text descriptions into fixed groups aligned with human body structure, CosmicMan can more effectively learn the intricate relationships between textual concepts and their corresponding visual representations.
  • HOLA Loss: The Human Body and Outfit Guided Loss for Alignment (HOLA) improves text-image alignment at the group level, enhancing the model's ability to generate images that conform to detailed descriptions. A toy sketch of both ideas follows this list.
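
The paper defines Daring over decomposed cross-attention features inside the diffusion model; without the released implementation, the PyTorch sketch below only gestures at what fixed-group caption discretization and a group-level refocusing loss can look like. The group list, tensor shapes, mask-based objective, and all function names are assumptions, not the authors' method.

```python
import torch

# (1) Data discretization: split a dense caption into fixed groups
# aligned with human body structure. Group names are illustrative.
GROUPS = ["whole_body", "head", "upper_clothes", "lower_clothes", "shoes"]

def discretize_caption(phrases: dict[str, str]) -> list[str]:
    """Order caption phrases by the fixed body-structure groups."""
    return [phrases.get(g, "") for g in GROUPS]

# (2) A HOLA-style group-level alignment loss: push the cross-attention
# mass of each group's tokens toward that group's (parsed) body region.
def hola_style_loss(attn: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """
    attn:  (num_groups, H, W) cross-attention maps pooled over each group's tokens
    masks: (num_groups, H, W) binary masks for the matching body regions
    """
    # Normalize each group's attention map to a probability distribution.
    attn = attn / attn.sum(dim=(1, 2), keepdim=True).clamp(min=1e-8)
    inside = (attn * masks).sum(dim=(1, 2))  # attention mass inside the region
    return (1.0 - inside).mean()             # penalize mass leaking outside

# Toy usage with random tensors in place of real attention maps and parses.
print(discretize_caption({"head": "long black hair", "upper_clothes": "denim jacket"}))
attn = torch.rand(len(GROUPS), 16, 16)
masks = (torch.rand(len(GROUPS), 16, 16) > 0.5).float()
print(hola_style_loss(attn, masks))
```

Note that a loss of this shape adds no parameters or modules; it only re-weights where existing cross-attention lands, which matches the paper's claim that Daring enforces refocusing without extra modules.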

Evaluation and Applications

Compared with state-of-the-art foundation models, CosmicMan demonstrates superior fidelity and text-image alignment when generating human images. Extensive ablation studies validate the contributions of the Annotate Anyone paradigm and the Daring training framework to the model's performance.

Furthermore, application tests in 2D human image editing and 3D human reconstruction highlight the practical advantages of CosmicMan as a specialized foundation model for human-centric tasks.

Conclusion and Future Directions

CosmicMan represents a significant step forward in the specialization of text-to-image foundation models for human-centered applications. By addressing the unique challenges of human image generation, CosmicMan sets a new benchmark for future research in this domain.

As part of our long-term commitment, we plan to continually update both the CosmicMan-HQ dataset and the CosmicMan model, ensuring they remain at the forefront of advancements in human image generation technology.