CaMML: Context-Aware Multimodal Learner for Large Models (2401.03149v3)
Abstract: In this work, we introduce the Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs). CaMML, a lightweight module, is crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, which show exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) by a noticeable margin without integrating any external resources. Moreover, we have conducted extensive ablation studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling real-world challenging cases. Code and models are available at: https://github.com/amazon-science/camml.
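The abstract describes CaMML as a lightweight, hierarchical module that folds retrieved multimodal context examples into an LMM while keeping long contexts tractable. The snippet below is a minimal, hypothetical sketch of that idea: a Perceiver-style resampler that first compresses each retrieved example and then compresses the pooled result into a fixed token budget. The class names (`ContextResampler`, `HierarchicalContextModule`), dimensions, and the specific two-level design are illustrative assumptions, not the paper's actual implementation (the released repository linked above contains the real one).

```python
# Hypothetical sketch: compress retrieved multimodal context examples into a fixed
# number of latent tokens that could be prepended to an LMM's input sequence.
# Names, sizes, and the two-level hierarchy are assumptions for illustration only.
import torch
import torch.nn as nn


class ContextResampler(nn.Module):
    """Cross-attention block mapping a variable-length sequence to n_latents tokens."""

    def __init__(self, dim: int, n_latents: int, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> (batch, n_latents, dim)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(queries, tokens, tokens)
        return out + self.ff(out)


class HierarchicalContextModule(nn.Module):
    """Two-level compression: each context example first, then all examples jointly.

    Compressing per example before merging keeps cost roughly linear in the number
    of retrieved examples, which is the scalability property the abstract alludes to.
    """

    def __init__(self, dim: int = 768, per_example_latents: int = 32, final_latents: int = 64):
        super().__init__()
        self.per_example = ContextResampler(dim, per_example_latents)
        self.cross_example = ContextResampler(dim, final_latents)

    def forward(self, context_examples: list) -> torch.Tensor:
        # context_examples: list of (1, seq_len_i, dim) interleaved image/text features
        compressed = [self.per_example(x) for x in context_examples]  # each (1, 32, dim)
        merged = torch.cat(compressed, dim=1)                         # (1, 32 * k, dim)
        return self.cross_example(merged)                             # (1, 64, dim)


if __name__ == "__main__":
    module = HierarchicalContextModule()
    examples = [torch.randn(1, n, 768) for n in (120, 85, 300)]  # fake retrieved examples
    print(module(examples).shape)  # torch.Size([1, 64, 768])
```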