
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model (2405.10316v1)

Published 16 May 2024 in cs.CV and cs.GR

Abstract: Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.

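The setup described in the abstract lends itself to a short sketch: in the spirit of image analogies (A : A' :: B : B'), the example pair (A, A') and the query B can be arranged on a single canvas whose remaining cell is masked and filled in by a pretrained inpainting diffusion model. The snippet below is a minimal illustration, not the authors' implementation: the 2x2 grid layout, the checkpoint name, and the placeholder prompt are assumptions, and the paper's SAC and CAM attention operations are indicated only as comments.

```python
# A minimal sketch (not the authors' code) of an Analogist-style inference
# setup: place the example pair (A, A') and the query B in a 2x2 grid,
# mask the bottom-right cell, and let a pretrained inpainting diffusion
# model generate B'. Checkpoint name, grid layout, and prompt are
# assumptions for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def make_grid(a, a_prime, b, cell=256):
    """Compose A | A' over B | (masked) into one canvas plus its mask."""
    canvas = Image.new("RGB", (2 * cell, 2 * cell))
    for img, pos in [(a, (0, 0)), (a_prime, (cell, 0)), (b, (0, cell))]:
        canvas.paste(img.resize((cell, cell)), pos)
    mask = Image.new("L", (2 * cell, 2 * cell), 0)
    mask.paste(255, (cell, cell, 2 * cell, 2 * cell))  # inpaint bottom-right
    return canvas, mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

a, a_prime, b = (Image.open(p) for p in ("A.png", "A_prime.png", "B.png"))
canvas, mask = make_grid(a, a_prime, b, cell=256)

# The paper additionally (1) clones self-attention from the A->A' cells
# onto the B->B' cells (SAC) via hooks on the UNet's attention layers, and
# (2) masks cross-attention so the text prompt attends only to the
# inpainted cell (CAM). The prompt below is a stand-in for the text that
# GPT-4V would generate from the example images.
result = pipe(prompt="a stylized version of the photo",
              image=canvas, mask_image=mask).images[0]
b_prime = result.crop((256, 256, 512, 512))  # extract the generated B'
b_prime.save("B_prime.png")
```

Because the whole procedure runs at inference time on a frozen inpainting model, no fine-tuning or task-specific training data is involved, which is what makes the method "out-of-the-box" in the abstract's sense.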
Authors (5)
  1. Zheng Gu (7 papers)
  2. Shiyuan Yang (5 papers)
  3. Jing Liao (100 papers)
  4. Jing Huo (45 papers)
  5. Yang Gao (761 papers)
Citations (4)