Interfacing Foundation Models' Embeddings (2312.07532v2)
Abstract: Foundation models possess strong capabilities in reasoning and memorization across modalities. To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings, with unified image- and dataset-level understanding spanning modality and granularity. As shown in the teaser figure, a lightweight transformer interface, without tuning any foundation model weights, is sufficient for segmentation, grounding, and retrieval in an interleaved manner. The proposed interface has the following favorable attributes: (1) Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights. (2) Interleavable. Benefiting from multi-task, multi-modal training, the proposed interface creates an interleaved shared embedding space. (3) Extendable. The proposed interface adapts to new tasks and new models. In light of the interleaved embedding space, we introduce FIND-Bench, which adds new training and evaluation annotations to the COCO dataset for interleaved segmentation and retrieval. To our knowledge, ours is the first work to align foundation models' embeddings for interleaved understanding. Our approach achieves state-of-the-art performance on FIND-Bench and competitive performance on standard retrieval and segmentation settings.
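To make the idea of an embedding interface concrete, below is a minimal sketch (not the authors' implementation) of a small trainable transformer that reads tokens from frozen vision and language encoders and projects them into one shared space usable for retrieval-style matching. All module names, dimensions, and the mean-pooling choice are illustrative assumptions.

```python
# Sketch only: a lightweight trainable "interface" over frozen foundation-model
# embeddings. Dimensions and pooling are assumptions, not the FIND architecture.
import torch
import torch.nn as nn


class EmbeddingInterface(nn.Module):
    def __init__(self, vision_dim=1024, text_dim=768, shared_dim=512,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Project each frozen encoder's tokens into a common width.
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        # Lightweight transformer operating on the concatenated token stream.
        layer = nn.TransformerEncoderLayer(d_model=shared_dim, nhead=num_heads,
                                           batch_first=True)
        self.interface = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (B, Nv, vision_dim) from a frozen vision encoder
        # text_tokens:   (B, Nt, text_dim) from a frozen language model
        v = self.vision_proj(vision_tokens)
        t = self.text_proj(text_tokens)
        fused = self.interface(torch.cat([v, t], dim=1))
        # Pool each modality's span back out of the fused sequence.
        v_emb = fused[:, : v.shape[1]].mean(dim=1)
        t_emb = fused[:, v.shape[1]:].mean(dim=1)
        return (nn.functional.normalize(v_emb, dim=-1),
                nn.functional.normalize(t_emb, dim=-1))


if __name__ == "__main__":
    # Usage: cosine similarity between pooled embeddings gives a retrieval score.
    model = EmbeddingInterface()
    v = torch.randn(2, 196, 1024)   # stand-in for frozen vision features
    t = torch.randn(2, 32, 768)     # stand-in for frozen text features
    v_emb, t_emb = model(v, t)
    print((v_emb * t_emb).sum(-1))  # per-pair similarity
```

Only the projection and interface layers would be trained in such a setup; the foundation encoders stay frozen, which is the property the abstract emphasizes.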