Transparent Image Layer Diffusion using Latent Transparency (2402.17113v4)
Abstract: We present LayerDiffuse, an approach that enables large-scale pretrained latent diffusion models to generate transparent images. The method can generate a single transparent image or multiple transparent layers. It learns a "latent transparency" that encodes alpha-channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset, with minimal changes to the pretrained model's original latent distribution. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it on the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop scheme. We show that latent transparency can be applied to different open-source image generators, or adapted to various conditional control systems to achieve applications such as foreground/background-conditioned layer generation, joint layer generation, and structural control of layer contents. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report that the quality of our generated transparent images is comparable to real commercial transparent assets such as Adobe Stock.
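The core idea in the abstract — encoding the alpha channel as a small additive offset on the pretrained model's latent, so the adjusted latent stays close to the original distribution — can be illustrated with a minimal sketch. All names below (`encode_offset`, `add_latent_transparency`, the `scale` parameter) are illustrative assumptions for exposition, not the paper's actual learned encoder or API.

```python
# Toy sketch of the "latent transparency" offset idea: alpha values are
# mapped to a small offset that is added to the latent, rather than
# replacing or restructuring it. (Hypothetical stand-in for the paper's
# learned offset encoder; not the real implementation.)

def encode_offset(alpha, scale=0.1):
    """Map alpha values in [0, 1] to a small per-element latent offset."""
    return [scale * (a - 0.5) for a in alpha]

def add_latent_transparency(latent, alpha):
    """Adjusted latent = original latent + transparency offset."""
    offset = encode_offset(alpha)
    return [z + o for z, o in zip(latent, offset)]

latent = [0.3, -1.2, 0.8, 0.0]   # toy latent from a pretrained encoder
alpha  = [1.0, 1.0, 0.0, 0.5]    # toy per-element alpha channel

adjusted = add_latent_transparency(latent, alpha)

# Because the offset is bounded by scale/2, the adjusted latent stays
# near the original latent distribution:
drift = max(abs(a - z) for a, z in zip(adjusted, latent))
print(drift)
```

Keeping the offset small is what lets the pretrained model be finetuned on the adjusted latents without destroying its original generation quality.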