MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation (2404.02790v1)

Published 3 Apr 2024 in cs.CV

Abstract: Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques which often require hand-drawn masks. Nonetheless, pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multi-layer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training-free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising a background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decomposition for instance discovery and extraction, instance completion to reconstruct occluded areas, and image re-assembly. We use our pipeline to create the MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image decompositions in terms of style, composition and complexity. With MuLAn, we provide the first photorealistic resource offering instance decomposition and occlusion information for high-quality images, opening up new avenues for text-to-image generative AI research. With this, we aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/.
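The abstract describes re-assembling a background layer and isolated instance layers into a flat RGB image. A minimal sketch of what such a re-assembly step could look like is standard "over" alpha compositing applied bottom-up through the layer stack; this is an illustrative stand-in, not the paper's actual implementation, and the function name and data layout here are assumptions:

```python
def composite_layers(layers):
    """Flatten a stack of RGBA layers (background first, instances on top)
    into an RGB image using 'over' alpha compositing.

    Each layer is a 2D grid of (r, g, b, a) tuples with channels in [0, 1].
    This is a hypothetical sketch of a re-assembly step, not MuLAn's code.
    """
    h, w = len(layers[0]), len(layers[0][0])
    # Start from a black canvas; the opaque background layer overwrites it.
    out = [[(0.0, 0.0, 0.0) for _ in range(w)] for _ in range(h)]
    for layer in layers:
        for y in range(h):
            for x in range(w):
                r, g, b, a = layer[y][x]
                br, bg, bb = out[y][x]
                # 'Over' operator: blend the layer onto the accumulated image.
                out[y][x] = (
                    a * r + (1.0 - a) * br,
                    a * g + (1.0 - a) * bg,
                    a * b + (1.0 - a) * bb,
                )
    return out
```

Because each instance carries its own alpha channel, layers can be reordered, removed, or edited independently before flattening, which is the kind of layer-wise editing the dataset is meant to enable.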

Authors (8)
  1. Petru-Daniel Tudosiu
  2. Yongxin Yang
  3. Shifeng Zhang
  4. Fei Chen
  5. Steven McDonagh
  6. Gerasimos Lampouras
  7. Ignacio Iacobacci
  8. Sarah Parisot