LooseControl: Lifting ControlNet for Generalized Depth Conditioning (2312.03079v1)
Abstract: We present LooseControl, which enables generalized depth conditioning for diffusion-based image generation. ControlNet, the state-of-the-art method for depth-conditioned image generation, produces remarkable results but relies on access to detailed depth maps for guidance, and creating such exact depth maps is challenging in many scenarios. This paper introduces a generalized form of depth conditioning that enables many new content-creation workflows. Specifically, we allow (C1) scene boundary control, for loosely specifying scenes with only boundary conditions, and (C2) 3D box control, for specifying the layout locations of target objects rather than their exact shape and appearance. Using LooseControl together with text guidance, users can create complex environments (e.g., rooms, street views) by specifying only scene boundaries and the locations of primary objects. We further provide two editing mechanisms to refine the results: (E1) 3D box editing lets the user refine an image by changing, adding, or removing boxes while freezing its style, yielding minimal changes apart from those induced by the edited boxes; and (E2) attribute editing proposes editing directions that change one particular aspect of the scene, such as the overall object density or a specific object. Extensive tests and comparisons with baselines demonstrate the generality of our method. We believe LooseControl can become an important design tool for easily creating complex environments, and can be extended to other forms of guidance channels. Code and more information are available at https://shariqfarooq123.github.io/loose-control/.
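Since ControlNet consumes a dense depth image, a LooseControl-style workflow can be pictured as first rendering the loose conditions (scene boundaries, 3D boxes) into a coarse depth map and then running a depth-conditioned diffusion pipeline on it. Below is a minimal sketch of that idea using the diffusers library. The box format, the boxes_to_depth helper, and the use of the public lllyasviel/sd-controlnet-depth checkpoint in place of the authors' LooseControl-tuned weights are all illustrative assumptions, not the paper's released interface.

```python
# Illustrative sketch only: a coarse "boxes -> depth map -> depth ControlNet"
# pipeline. boxes_to_depth and the (x0, y0, x1, y1, z) box format are
# hypothetical stand-ins for LooseControl's actual conditioning; the
# checkpoint ids below are the public ones, not the paper's weights.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def boxes_to_depth(boxes, size=(512, 512)):
    """Rasterize (x0, y0, x1, y1, z) boxes into a coarse depth image.
    Boxes are drawn far-to-near so nearer boxes overwrite farther ones."""
    depth = np.zeros(size, dtype=np.float32)
    for x0, y0, x1, y1, z in sorted(boxes, key=lambda b: -b[4]):
        depth[y0:y1, x0:x1] = 1.0 / z  # nearer objects -> brighter values
    depth /= max(depth.max(), 1e-6)    # normalize to [0, 1]
    return Image.fromarray((255 * depth).astype(np.uint8)).convert("RGB")

# Public depth ControlNet; a LooseControl-tuned variant would go here.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

# Two boxes stand in for primary objects; the prompt carries style and content.
layout = boxes_to_depth([(60, 260, 220, 460, 2.0), (300, 300, 470, 480, 3.5)])
image = pipe("a cozy living room, soft morning light",
             image=layout, num_inference_steps=30).images[0]
image.save("loose_layout.png")
```

The point of the sketch is the division of labor the abstract describes: the rendered depth map pins down only layout and rough extent, while the text prompt determines appearance, which is exactly the kind of loose guidance LooseControl targets.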