Magic Insert: Style-Aware Drag-and-Drop (2407.02489v1)
Abstract: We present Magic Insert, a method for dragging-and-dropping subjects from a user-provided image into a target image of a different style in a physically plausible manner while matching the style of the target image. This work formalizes the problem of style-aware drag-and-drop and presents a method for tackling it by addressing two sub-problems: style-aware personalization and realistic object insertion in stylized images. For style-aware personalization, our method first fine-tunes a pretrained text-to-image diffusion model using LoRA and learned text tokens on the subject image, and then infuses it with a CLIP representation of the target style. For object insertion, we use Bootstrapped Domain Adaption to adapt a domain-specific photorealistic object insertion model to the domain of diverse artistic styles. Overall, the method significantly outperforms traditional approaches such as inpainting. Finally, we present a dataset, SubjectPlop, to facilitate evaluation and future progress in this area. Project page: https://magicinsert.github.io/
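The style-aware personalization step fine-tunes the diffusion model with LoRA, i.e., a frozen weight matrix plus a trainable low-rank residual. The following is a minimal NumPy sketch of that update rule only; the layer sizes, rank, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical sizes: a single 8x8 layer with LoRA rank 2.
rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
B = np.zeros((d_out, rank))             # LoRA factor, initialized to zero
A = rng.standard_normal((rank, d_in))   # LoRA factor, randomly initialized

def adapted_forward(x, scale=1.0):
    """Forward pass through the adapted layer: (W + scale * B @ A) @ x."""
    return (W + scale * (B @ A)) @ x

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts out as a no-op on W,
# so fine-tuning begins exactly at the pretrained model's behavior.
assert np.allclose(adapted_forward(x), W @ x)
```

During fine-tuning only `A` and `B` (and, in the paper's setup, learned text tokens) receive gradients, so the subject is captured in a small residual that can later be combined with style conditioning.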