MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis (2407.02329v2)
Abstract: We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image, each accurately placed at a predefined position with attributes such as category, color, and shape, strictly following user specifications. MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation. To address attribute leakage, we propose the Multi-Instance Generation Controller (MIGC). MIGC generates multiple instances through a divide-and-conquer strategy, breaking multi-instance shading into single-instance shading subtasks, each handling a single attribute, whose results are then integrated. To support more diverse instance descriptions, we develop MIGC++, which allows attribute control through both text and images and position control through both boxes and masks. Lastly, we introduce the Consistent-MIG algorithm to enhance the iterative MIG capability of MIGC and MIGC++. This algorithm ensures consistency in unmodified regions during the addition, deletion, or modification of instances, and preserves the identity of instances when their attributes are changed. We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods. Extensive experiments on these benchmarks, along with the COCO-Position benchmark and DrawBench, demonstrate that our methods substantially outperform existing techniques while maintaining precise control over position, attribute, and quantity. Project page: https://github.com/limuloo/MIGC.
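To make the divide-and-conquer shading idea concrete, here is a minimal sketch of how per-instance shading and fusion could look: each instance is shaded independently against its own description and then pasted back into its region. The tensor shapes, the toy `cross_attention` helper, and the mask-based fusion rule are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of divide-and-conquer shading: shade each instance separately
# against its own description, then fuse the per-instance results.
# Shapes, the cross_attention helper, and the fusion rule are assumptions.
import torch
import torch.nn.functional as F


def cross_attention(img_feats, text_embeds):
    """Toy single-head cross-attention: image tokens attend to text tokens.

    img_feats:   (N, C) flattened spatial image features (N = H*W)
    text_embeds: (T, C) token embeddings of one instance description
    """
    scale = img_feats.shape[-1] ** 0.5
    attn = F.softmax(img_feats @ text_embeds.T / scale, dim=-1)
    return attn @ text_embeds  # (N, C) shaded features


def divide_and_conquer_shading(img_feats, instance_embeds, instance_masks, bg_embeds):
    """Shade each instance in isolation, then integrate the results.

    img_feats:       (N, C) image features
    instance_embeds: list of (T_i, C) text embeddings, one per instance
    instance_masks:  list of (N, 1) binary masks derived from boxes/masks
    bg_embeds:       (T, C) embeddings of the global/background prompt
    """
    fused = cross_attention(img_feats, bg_embeds)      # background shading
    for embeds, mask in zip(instance_embeds, instance_masks):
        shaded = cross_attention(img_feats, embeds)    # single-instance shading
        fused = mask * shaded + (1.0 - mask) * fused   # paste inside its region
    return fused
```

Keeping each instance's shading confined to its own region in this way is one simple means of preventing one instance's attributes from bleeding into another's, which is the attribute-leakage problem the abstract describes.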
Authors: Dewei Zhou, You Li, Fan Ma, Zongxin Yang, Yi Yang