IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks (2312.01771v1)

Published 4 Dec 2023 in cs.CV

Abstract: In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. At inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10% AP for Foreground Segmentation, over +5% AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance. Project page is available at https://jerryxu.net/IMProv.
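The inference setup described in the abstract — tiling an input-output example and a query onto one canvas, then masking the cell the model should inpaint — can be sketched as below. This is a minimal illustrative mock-up, not the paper's actual code: the grid layout, the `build_prompt_grid` helper, and the toy image representation (2D lists of pixel values) are all assumptions for demonstration.

```python
def fit(img, cell):
    # Crop/pad a toy "image" (list of pixel rows) to cell x cell; 0 pads gaps.
    # A real pipeline would resize instead of crop/pad.
    return [[img[r][c] if r < len(img) and c < len(img[0]) else 0
             for c in range(cell)] for r in range(cell)]

def build_prompt_grid(example_input, example_output, query_input, cell=4):
    """Arrange one in-context example and a query into a 2x2 grid.

    Layout (mirroring the textual prompt "Left: input image,
    Right: foreground segmentation"):
        top-left:    example input    top-right:    example output
        bottom-left: query input      bottom-right: blank, to be inpainted
    Returns the grid and a boolean mask over the region the model fills in.
    """
    a = fit(example_input, cell)
    b = fit(example_output, cell)
    q = fit(query_input, cell)
    blank = [[0] * cell for _ in range(cell)]

    top = [ra + rb for ra, rb in zip(a, b)]        # example pair, side by side
    bottom = [rq + rz for rq, rz in zip(q, blank)]  # query next to masked cell
    grid = top + bottom

    # True marks the bottom-right cell, which the generative model inpaints.
    mask = [[r >= cell and c >= cell for c in range(2 * cell)]
            for r in range(2 * cell)]
    return grid, mask
```

In the actual system the masked region is filled by the trained masked generative transformer, optionally conditioned on the text prompt; here the mask simply marks where that prediction would go.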

Authors (7)
  1. Jiarui Xu (33 papers)
  2. Yossi Gandelsman (28 papers)
  3. Amir Bar (31 papers)
  4. Jianwei Yang (93 papers)
  5. Jianfeng Gao (344 papers)
  6. Trevor Darrell (324 papers)
  7. Xiaolong Wang (243 papers)
Citations (3)