Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model (2405.10316v1)
Abstract: Visual In-Context Learning (ICL) has emerged as a promising research area owing to its ability to accomplish various tasks from only a few example pairs through analogical reasoning. However, training-based visual ICL struggles to generalize to unseen tasks and requires collecting a diverse task dataset. Existing inference-based visual ICL methods, on the other hand, rely solely on textual prompts, which fail to capture fine-grained contextual information from the given examples, and converting images into text prompts is time-consuming. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of the semantic-level analogy guided by text prompts. Our method works out of the box, requiring no fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
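To make the two attention operations named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes the 2x2 grid layout commonly used for visual analogies (A and A' on top, B and the inpainted B' on the bottom); the quadrant indexing, tensor shapes, and function names are illustrative assumptions only.

```python
# Sketch of self-attention cloning (SAC) and cross-attention masking (CAM)
# as standalone tensor operations. All layout and shape choices here are
# assumptions for illustration, not the paper's actual pipeline hooks.
import torch

def quadrant_ids(h: int, w: int) -> torch.Tensor:
    """Label each token of an h*w attention grid with its quadrant:
    0 = A (top-left), 1 = A' (top-right), 2 = B (bottom-left), 3 = B' (bottom-right)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return ((ys >= h // 2).long() * 2 + (xs >= w // 2).long()).flatten()

def self_attention_cloning(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Copy the A -> A' self-attention pattern onto the B -> B' block, so the
    known example pair guides the structure of the region being inpainted.
    attn: (heads, h*w, h*w). Assumes even h and w, so all quadrants match."""
    q = quadrant_ids(h, w)
    out = attn.clone()
    src = attn[:, q == 0][:, :, q == 1]  # how A attends to A'
    rows = (q == 2).nonzero().squeeze(1)
    cols = (q == 3).nonzero().squeeze(1)
    out[:, rows.unsqueeze(1), cols.unsqueeze(0)] = src  # paste onto B -> B'
    return out

def cross_attention_masking(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Zero out text-to-image cross-attention everywhere except the B'
    quadrant, so the GPT-4V-generated prompt only steers the inpainted region.
    attn: (heads, h*w, n_text_tokens)."""
    mask = (quadrant_ids(h, w) == 3).float()  # 1 inside B', 0 elsewhere
    return attn * mask.view(1, -1, 1)

# Toy usage on random attention maps for an 8x8 latent grid.
h = w = 8
self_attn = torch.softmax(torch.randn(4, h * w, h * w), dim=-1)
cross_attn = torch.softmax(torch.randn(4, h * w, 16), dim=-1)
self_attn = self_attention_cloning(self_attn, h, w)
cross_attn = cross_attention_masking(cross_attn, h, w)
```

In this reading, SAC transfers the structural relationship observed in the example pair to the unknown quadrant, while CAM confines the textual guidance so the prompt does not perturb the three given images; both operate purely on attention maps inside a frozen inpainting model, which is consistent with the abstract's claim that no fine-tuning or optimization is needed.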
Authors: Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao