ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion (2403.18818v1)
Abstract: Diffusion models have revolutionized image editing but often generate images that violate physical laws, particularly the effects of objects on the scene, e.g., occlusions, shadows, and reflections. By analyzing the limitations of self-supervised approaches, we propose a practical solution centered on a \q{counterfactual} dataset. Our method involves capturing a scene before and after removing a single object, while minimizing other changes. By fine-tuning a diffusion model on this dataset, we are able to not only remove objects but also their effects on the scene. However, we find that applying this approach for photorealistic object insertion requires an impractically large dataset. To tackle this challenge, we propose bootstrap supervision; leveraging our object removal model trained on a small counterfactual dataset, we synthetically expand this dataset considerably. Our approach significantly outperforms prior methods in photorealistic object removal and insertion, particularly at modeling the effects of objects on the scene.
- Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
- Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022.
- Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023.
- Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
- Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting gan. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10680–10687, 2020.
- Argan: Attentive recurrent generative adversarial network for shadow detection and removal. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10213–10222, 2019.
- Auto-exposure fusion for single-image shadow removal. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10571–10580, 2021.
- Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
- An image is worth more than a thousand words: Towards disentanglement in the wild. Advances in Neural Information Processing Systems, 34:9216–9228, 2021.
- Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- Shadowdiffusion: When degradation prior meets diffusion model for shadow removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14049–14058, 2023.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Shadow generation for composite image in real-world scenes. In Proceedings of the AAAI conference on artificial intelligence, pages 914–922, 2022.
- Mask-shadowgan: Learning to remove shadows from unpaired data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2472–2481, 2019.
- Image fine-grained inpainting. arXiv preprint arXiv:2002.02609, 2020.
- Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
- Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):1–14, 2017.
- Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
- Dc-shadownet: Single-image hard and soft shadow removal using unsupervised domain-classifier guided network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5027–5036, 2021.
- A contrastive objective for learning disentangled representations. In European Conference on Computer Vision, pages 579–595. Springer, 2022.
- Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020.
- Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Putting people in their place: Affordance-aware human insertion into scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17089–17099, 2023.
- Shadow removal via shadow image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8578–8587, 2019.
- From shadow segmentation to shadow removal. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 264–281. Springer, 2020.
- David K. Lewis. Counterfactuals. Blackwell, Malden, Mass., 1973.
- Arshadowgan: Shadow generative adversarial network for augmented reality in single light scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8139–8148, 2020a.
- Image inpainting for irregular holes using partial convolutions. In Proceedings of the European conference on computer vision (ECCV), pages 85–100, 2018.
- Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 725–741. Springer, 2020b.
- Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- From shadow generation to shadow removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4927–4936, 2021.
- Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124. PMLR, 2019.
- Omnimatte: Associating objects and their effects in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4507–4515, 2021.
- Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
- Latent feature-guided diffusion models for shadow removal. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4313–4322, 2024.
- Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
- Aim 2020 challenge on image extreme inpainting. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 716–741. Springer, 2020.
- Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- Structureflow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE/CVF international conference on computer vision, pages 181–190, 2019.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
- Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
- Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Objectstitch: Object compositing with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18310–18319, 2023.
- Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Style-guided shadow removal. In European Conference on Computer Vision, pages 361–378. Springer, 2022.
- Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1788–1797, 2018.
- Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18359–18369, 2023.
- Instance shadow detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1880–1889, 2020.
- Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. arXiv preprint arXiv:2207.09814, 2022.
- Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
- Learning pyramid-context encoder network for high-quality image inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1486–1494, 2019.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
- Shadowgan: Shadow synthesis for virtual objects with conditional adversarial networks. Computational Visual Media, 5:105–115, 2019.
- Hive: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
- Bijective mapping network for shadow removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5627–5636, 2022a.
- Efficient model-driven network for shadow removal. In Proceedings of the AAAI conference on artificial intelligence, pages 3635–3643, 2022b.