Magic Clothing: Controllable Garment-Driven Image Synthesis (2404.09512v2)

Published 15 Apr 2024 in cs.CV

Abstract: We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task. Aiming at generating customized characters wearing the target garments with diverse text prompts, the image controllability is the most critical issue, i.e., to preserve the garment details and maintain faithfulness to the text prompts. To this end, we introduce a garment extractor to capture the detailed garment features, and employ self-attention fusion to incorporate them into the pretrained LDMs, ensuring that the garment details remain unchanged on the target character. Then, we leverage the joint classifier-free guidance to balance the control of garment features and text prompts over the generated results. Meanwhile, the proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with other extensions like ControlNet and IP-Adapter to enhance the diversity and controllability of the generated characters. Furthermore, we design Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency of the target image to the source garment. Extensive experiments demonstrate that our Magic Clothing achieves state-of-the-art results under various conditional controls for garment-driven image synthesis. Our source code is available at https://github.com/ShineChen1024/MagicClothing.


Summary

  • The paper’s main contribution is a garment extractor module that integrates fine-grained apparel features into latent diffusion denoising via self-attention fusion.
  • It introduces joint classifier-free guidance to balance text-prompt fidelity with precise garment-detail preservation, yielding state-of-the-art garment-driven synthesis results.
  • The plug-in compatibility of Magic Clothing enables seamless integration with various finetuned LDMs, enhancing versatility in e-commerce and virtual environment applications.

Magic Clothing: An Advanced Approach for Garment-Driven Image Synthesis Using Latent Diffusion Models

Introduction

The paper presents Magic Clothing, a novel approach leveraging latent diffusion models (LDMs) for garment-driven image synthesis, a challenging task with considerable potential in fields such as e-commerce and virtual environments. Unlike traditional subject-driven image synthesis, which primarily focuses on general conditions like pose or appearance, garment-driven synthesis demands fine-grained preservation of apparel details in the generated images while adhering to specific text prompts. To address this, the paper introduces a garment extractor that integrates detailed garment features into the denoising process of LDMs through self-attention fusion. It further proposes joint classifier-free guidance to balance text-prompt fidelity against garment-feature preservation during synthesis.

Methodology

Garment Extractor and Fusion Mechanism:

  • The paper's core contribution is a garment extractor module that captures detailed garment features and injects them into the synthesis process. The extractor shares the UNet architecture of the denoising network, and its intermediate features are merged into the LDM's denoising pass through self-attention fusion (a minimal sketch follows this list).
  • This fusion keeps the fine-grained garment features intact in the generated character images, maintaining high fidelity to both the input garment image and the accompanying text prompt.
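
The sketch below illustrates the general idea of self-attention fusion as described above: tokens from the garment branch are appended to the keys and values of a self-attention layer in the denoising UNet, so the character branch can attend to and copy garment details. The `FusedSelfAttention` class and its wiring are illustrative assumptions, not the repository's actual implementation, which reuses the pretrained UNet's own attention projections.

```python
import torch
import torch.nn as nn

class FusedSelfAttention(nn.Module):
    """Illustrative sketch of self-attention fusion (not the official code):
    garment tokens from the garment extractor are concatenated with the
    denoising UNet's hidden states on the key/value side, so garment details
    can be attended to and transferred onto the generated character."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden_states: torch.Tensor,
                garment_states: torch.Tensor) -> torch.Tensor:
        # hidden_states:  (B, N, C) tokens from a denoising-UNet block
        # garment_states: (B, M, C) tokens from the matching garment-extractor block
        kv = torch.cat([hidden_states, garment_states], dim=1)
        fused, _ = self.attn(query=hidden_states, key=kv, value=kv)
        return hidden_states + fused  # residual connection keeps the original path
```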

Joint Classifier-free Guidance (CFG):

  • To balance garment-detail fidelity against adherence to the text prompt, the paper extends standard classifier-free guidance to a joint formulation over both conditions.
  • The text and garment conditions are modeled jointly during training (each can be dropped, as in standard classifier-free guidance), so at inference separate guidance scales can weight the two controls and trade prompt adherence against garment preservation (a hedged sketch follows this list).
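
Below is a minimal sketch of one common way to combine two conditions under classifier-free guidance. The `joint_cfg` helper, the `unet(...)` call signature, and the guidance-scale values are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def joint_cfg(unet, z_t, t, text_emb, null_text_emb, garment_feats,
              s_text=7.5, s_garment=2.5):
    """Hedged sketch of joint classifier-free guidance: combine an
    unconditional pass, a garment-only pass, and a fully conditioned pass
    into one noise prediction. The call signature of `unet` and the default
    guidance scales are illustrative assumptions."""
    eps_uncond = unet(z_t, t, text=null_text_emb, garment=None)
    eps_garment = unet(z_t, t, text=null_text_emb, garment=garment_feats)
    eps_full = unet(z_t, t, text=text_emb, garment=garment_feats)
    return (eps_uncond
            + s_garment * (eps_garment - eps_uncond)   # pull towards the garment
            + s_text * (eps_full - eps_garment))       # then towards the text prompt
```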

Plug-in Module Compatibility and Extensibility

A significant advantage of Magic Clothing is its compatibility as a plug-in module. This design choice allows it to be combined with various finetuned LDMs and extensions such as ControlNet and IP-Adapter, enabling additional control over factors like style, pose, and facial features. This extensibility showcases the method's versatility in generating diverse character images under varied conditional controls without compromising garment detail quality.

Experiments and Results

Comprehensive evaluations demonstrate Magic Clothing's superior performance over existing methods in garment-driven image synthesis. Through qualitative comparisons and quantitative metrics, including the newly proposed Matched-Points-LPIPS (MP-LPIPS) metric designed for this task, the paper shows that the method produces state-of-the-art results under various conditional controls.

MP-LPIPS Metric:

  • The MP-LPIPS metric addresses the need for a robust measure of garment consistency in generated images: corresponding points between the source garment and the generated image are matched, and local patches around them are compared perceptually, so the score reflects garment fidelity without being unduly influenced by unrelated factors such as pose or background (a hedged sketch follows).
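
The following sketch shows how a Matched-Points-LPIPS-style score could be computed. The point-matching step is left as a placeholder (`match_points`), and the patch size, helper names, and overall structure are assumptions, not the paper's implementation.

```python
import torch
import lpips  # perceptual similarity package (pip install lpips)

lpips_fn = lpips.LPIPS(net="alex")  # standard LPIPS backbone

def crop_patch(img: torch.Tensor, xy, size: int = 64) -> torch.Tensor:
    """Crop a (3, size, size) patch centred on pixel coordinates xy.
    Assumes the point lies at least size // 2 pixels from the border."""
    x, y = xy
    h = size // 2
    return img[:, y - h:y + h, x - h:x + h]

def mp_lpips(garment_img: torch.Tensor, generated_img: torch.Tensor,
             match_points, n_points: int = 16) -> torch.Tensor:
    """garment_img, generated_img: (3, H, W) tensors scaled to [-1, 1].
    match_points: placeholder callable returning [(src_xy, tgt_xy), ...]."""
    pairs = match_points(garment_img, generated_img, n_points)
    scores = []
    for src_xy, tgt_xy in pairs:
        p_src = crop_patch(garment_img, src_xy).unsqueeze(0)
        p_tgt = crop_patch(generated_img, tgt_xy).unsqueeze(0)
        scores.append(lpips_fn(p_src, p_tgt))
    return torch.stack(scores).mean()
```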

Conclusion and Future Directions

Magic Clothing's approach to garment-driven image synthesis marks a significant advancement in the field, particularly in addressing the nuanced demand for detail preservation in synthesized images. Its plug-in compatibility further enhances its application potential, promising exciting developments in personalized virtual dressing and avatar creation. Looking forward, refining this model with more extensive training datasets and exploring its integration with more advanced LDMs could unlock even higher-quality synthesis capabilities, broadening its utility across various domains.
