MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis (2407.02329v2)

Published 2 Jul 2024 in cs.CV

Abstract: We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image, each accurately placed at predefined positions with attributes such as category, color, and shape, strictly following user specifications. MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation. To address attribute leakage, we propose the Multi-Instance Generation Controller (MIGC). MIGC generates multiple instances through a divide-and-conquer strategy, breaking down multi-instance shading into single-instance tasks with singular attributes, later integrated. To provide more types of instance descriptions, we developed MIGC++. MIGC++ allows attribute control through text & images and position control through boxes & masks. Lastly, we introduced the Consistent-MIG algorithm to enhance the iterative MIG ability of MIGC and MIGC++. This algorithm ensures consistency in unmodified regions during the addition, deletion, or modification of instances, and preserves the identity of instances when their attributes are changed. We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods. Extensive experiments on these benchmarks, along with the COCO-Position benchmark and DrawBench, demonstrate that our methods substantially outperform existing techniques, maintaining precise control over aspects including position, attribute, and quantity. Project page: https://github.com/limuloo/MIGC.

Authors (5)
  1. Dewei Zhou
  2. You Li
  3. Fan Ma
  4. Zongxin Yang
  5. Yi Yang

Summary

An Examination of "MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis"

Introduction

"MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis" presents a compelling approach for addressing the Multi-Instance Generation (MIG) task in image synthesis, a domain that significantly enriches the quality and control of generative models. The paper introduces a suite of novel methods including MIGC, MIGC++, and the Consistent-MIG algorithm that advance the capabilities of existing image generation models by addressing critical challenges such as attribute leakage, restricted instance description formats, and limited iterative generation capabilities.

Core Contributions

The paper proposes the Multi-Instance Generation Controller (MIGC) and its enhanced version, MIGC++, to address the challenges in MIG. The fundamental principle behind MIGC is the divide-and-conquer strategy, which breaks down the complex task of multi-instance shading into manageable single-instance sub-tasks. Each of these tasks is handled independently to prevent attribute leakage and then aggregated to produce a coherent multi-instance output. MIGC++ extends this framework by incorporating more flexible instance descriptions and introducing a Refined Shader for detailed attribute control.
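
To make the decomposition concrete, the following minimal sketch illustrates the divide-and-conquer idea: each instance is shaded against its own description only, restricted to its region, and the per-instance results are then fused. This is not the authors' implementation; `shade_fn` and `aggregate_fn` are placeholder callables standing in for the actual shading and aggregation modules.

```python
def multi_instance_shading(image_feat, instance_embeds, instance_masks, shade_fn, aggregate_fn):
    """Divide-and-conquer sketch: shade each instance in isolation, then fuse.

    image_feat:      intermediate U-Net feature map, shape (B, C, H, W)
    instance_embeds: one condition embedding per instance (e.g. an encoded text prompt)
    instance_masks:  one (B, 1, H, W) binary mask per instance marking its region
    shade_fn:        placeholder single-instance shading op (e.g. cross-attention
                     against exactly one description)
    aggregate_fn:    placeholder fusion op that merges the per-instance results
    """
    per_instance = []
    for embed, mask in zip(instance_embeds, instance_masks):
        shaded = shade_fn(image_feat, embed)       # attend to ONE description only
        per_instance.append(shaded * mask)         # keep the result inside its region
    return aggregate_fn(per_instance, image_feat)  # merge into one coherent feature map
```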

Methodological Overview

MIGC

MIGC centers on an Instance Shader built from three main components: Enhance Attention (EA), Layout Attention (LA), and the Shading Aggregation Controller (SAC).

  1. Enhance Attention (EA): Designed to address issues of instance merging and missing instances by enhancing attribute embedding with positional information.
  2. Layout Attention (LA): Functions like a self-attention mechanism but restricts interactions to within each instance's region, so the resulting shading template respects the specified layout (a masking sketch follows at the end of this subsection).
  3. Shading Aggregation Controller (SAC): Dynamically merges shading results from multiple instances and the overall shading template, adapting aggregation weights across different blocks and sample steps.

MIGC deploys these components within the U-Net at the mid-blocks and deep up-blocks, where they are most effective during the high-noise sampling steps.
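
As a rough illustration of the Layout Attention idea, the sketch below builds a region-restricted attention mask from per-instance masks so that spatial tokens attend only to tokens inside the same instance region. It is an assumption-laden simplification, not the paper's exact formulation.

```python
import torch

def layout_attention_mask(instance_masks):
    """Allow token i to attend to token j only if they share an instance region.

    instance_masks: (N, H, W) binary masks, one per instance (a background mask
                    can be appended so uncovered tokens still attend somewhere).
    Returns a boolean (H*W, H*W) matrix usable as an attention mask.
    """
    n, h, w = instance_masks.shape
    flat = instance_masks.reshape(n, h * w).float()   # (N, HW)
    shared = torch.einsum("ni,nj->ij", flat, flat)    # regions shared by each token pair
    return shared > 0                                 # True where attention is allowed
```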

MIGC++

MIGC++ builds on MIGC by allowing instance attributes to be specified through both text and image, and instance positions through both boxes and masks. The model employs Multimodal Enhance Attention (MEA) and a Refined Shader:

  1. Multimodal Enhance Attention (MEA): Extends EA so that instances described through different modalities (text or image) can be shaded simultaneously (a routing sketch follows this list).
  2. Refined Shader: Leverages pre-trained cross-attention layers and image projectors to refine each instance according to its detailed attributes.
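
The multimodal routing can be pictured with the sketch below; `text_encoder` and `image_projector` are hypothetical stand-ins (for example, a CLIP text encoder and an IP-Adapter-style image projector), not the paper's actual modules.

```python
def encode_instance_condition(desc, text_encoder, image_projector):
    """Route each instance description to the matching encoder.

    desc is either {"text": "a red car"} or {"image": reference_tensor}. Both paths
    end in a shared embedding space, so a single shading op can consume either modality.
    """
    if "text" in desc:
        return text_encoder(desc["text"])    # e.g. CLIP text embedding
    return image_projector(desc["image"])    # e.g. IP-Adapter-style image tokens
```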

Consistent-MIG Algorithm

The Consistent-MIG algorithm augments the iterative generation capabilities of both MIGC and MIGC++ by ensuring:

  1. Consistency of Unmodified Areas: Replaces unmodified regions with the corresponding results from previous iterations, keeping the background stable while instances are added, deleted, or modified (see the latent-blending sketch after this list).
  2. Consistency of Identity: Uses the self-attention mechanism to preserve an instance's identity when its attributes are changed across iterations.
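
The first point can be pictured as a per-step latent blend. The sketch below is a minimal illustration under the assumption that a binary edit mask marks the regions being changed; it is not the authors' exact procedure.

```python
def blend_unmodified_regions(current_latent, previous_latent, edit_mask):
    """Keep previous-round latents outside the edited region at each denoising step.

    current_latent, previous_latent: (B, C, H, W) latents at the same timestep, from the
                                     current and the previous generation round
    edit_mask: (B, 1, H, W), 1 inside regions being added, removed, or modified
    """
    return edit_mask * current_latent + (1.0 - edit_mask) * previous_latent
```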

Experimental Results

The efficacy of the proposed methods was evaluated on the new COCO-MIG benchmark (with COCO-MIG-BOX and COCO-MIG-MASK variants) and the Multimodal-MIG benchmark. Experiments across these benchmarks, together with the existing COCO-Position benchmark and DrawBench, show that MIGC and MIGC++ significantly outperform state-of-the-art methods in positional accuracy, attribute adherence, and overall image quality.

  1. COCO-MIG-BOX: MIGC and MIGC++ demonstrated significant improvements in both Instance Success Ratio and Mean Intersection over Union (MIoU; a box-IoU sketch follows this list). MIGC++ further refined attribute control and positional accuracy.
  2. COCO-MIG-MASK: MIGC++ surpassed previous methods, particularly when generating images with a higher number of instances.
  3. COCO-Position: MIGC exhibited superior positional accuracy, approaching the performance of real images, while MIGC++ achieved the highest overall score.
  4. DrawBench: MIGC++ achieved exceptional performance in attribute control accuracy, significantly better than current state-of-the-art methods.
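
For context on the positional metrics, MIoU roughly averages the intersection-over-union between each generated instance (as detected in the output) and its target box. A minimal box-IoU helper, assuming axis-aligned (x1, y1, x2, y2) coordinates, could look like this:

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```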

Implications and Future Directions

The advancements introduced in this paper carry substantial practical and theoretical implications for image synthesis. Practically, the enhanced control over instance attributes and positions can benefit applications in gaming, digital art, and automated design. Theoretically, the modular design and divide-and-conquer approach of MIGC and MIGC++ provide a robust framework for further work on multi-instance image generation.

Future research could extend these methods to more complex and dynamic scenarios, such as video generation, where maintaining temporal consistency and handling occlusions are critical. Additionally, exploring the integration of these techniques with other generative models, such as GANs or transformers, could yield even more versatile and powerful image synthesis capabilities.

Conclusion

The paper "MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis" marks a significant step forward in the domain of image synthesis by addressing key challenges in the Multi-Instance Generation task. Through the introduction of MIGC, MIGC++, and the Consistent-MIG algorithm, the authors have provided a robust framework that sets new standards in controlling the attributes, positions, and iterative generation of instances in synthetic images. These contributions are poised to drive further research and applications in the field.