CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models (2310.19784v2)
Abstract: Incorporating a customized object into image generation is an appealing capability of text-to-image models. However, existing optimization-based and encoder-based methods suffer from drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect. To overcome these limitations, we introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the customization process. This integration facilitates the adjustment of spatial positions and viewpoints, yielding diverse outputs while effectively preserving object identity. Moreover, we introduce delicate designs that enable location control and flexible background control through textual descriptions or user-defined images, overcoming the limitations of existing 3D novel view synthesis methods. We further design a dataset construction pipeline that better handles real-world objects and complex backgrounds. Equipped with these designs, our method enables zero-shot object customization without test-time optimization, offering simultaneous control over viewpoint, location, and background. As a result, CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs.
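
The abstract describes conditioning generation jointly on object identity, target viewpoint, target location, and background (given as text or as an image). As a rough illustration of what such a conditioning interface could look like (this is not the authors' code; the pose parameterization, mask construction, and all function names below are assumptions), here is a minimal sketch that packs a Zero-1-to-3-style relative camera pose, a bounding-box location mask, and a background specification into a single conditioning dictionary:

```python
import numpy as np

def make_pose_embedding(d_azimuth_deg, d_elevation_deg, d_distance):
    """Relative camera pose encoded as sin/cos of angle deltas plus a
    distance offset (hypothetical layout, in the spirit of Zero-1-to-3)."""
    az = np.deg2rad(d_azimuth_deg)
    el = np.deg2rad(d_elevation_deg)
    return np.array([np.sin(az), np.cos(az), np.sin(el), np.cos(el), d_distance],
                    dtype=np.float32)

def make_location_mask(bbox, latent_hw=(64, 64)):
    """Binary mask at latent resolution marking where the object should appear.
    bbox is (x0, y0, x1, y1) in normalized [0, 1] image coordinates."""
    h, w = latent_hw
    mask = np.zeros((1, h, w), dtype=np.float32)
    x0, y0, x1, y1 = bbox
    mask[:, int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def build_conditioning(object_latent, bbox, d_azimuth, d_elevation, d_distance,
                       background_prompt=None, background_latent=None):
    """Collect the control signals a generator would consume: object identity
    (latent of the reference image), target viewpoint, target location, and
    either a text description or a user-provided image for the background."""
    return {
        "object_latent": object_latent,              # identity condition
        "pose": make_pose_embedding(d_azimuth, d_elevation, d_distance),
        "location_mask": make_location_mask(bbox),
        "background_prompt": background_prompt,      # text-controlled background
        "background_latent": background_latent,      # image-controlled background
    }

# Example: place the object in the right half of the frame, rotated 30 degrees
# in azimuth, with a text-described background.
cond = build_conditioning(
    object_latent=np.zeros((4, 64, 64), dtype=np.float32),  # placeholder latent
    bbox=(0.5, 0.25, 0.95, 0.75),
    d_azimuth=30.0, d_elevation=0.0, d_distance=0.0,
    background_prompt="on a wooden table in a sunlit kitchen",
)
print({k: (v.shape if isinstance(v, np.ndarray) else v) for k, v in cond.items()})
```

In a real pipeline these signals would be injected into a latent diffusion model (e.g., cross-attention for the pose and background text, channel concatenation for the mask and object latent); the sketch only shows how the controls described in the abstract might be assembled.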