MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples (2312.06363v3)
Abstract: Although In-Context Learning (ICL) brings remarkable performance gains to LLMs, the improvements remain lower than those of fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms the traditional fine-tuning strategy and the vanilla ICT method, which directly takes the concatenation of all information from different modalities as input. Our implementation is available at: https://github.com/KDEGroup/MMICT.
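The abstract describes a specific data flow: in-context demonstrations contribute visual-guided textual features, the query contributes textual-guided visual features, and both are produced by the shared M-Hub before being handed to a frozen MM-LLM. The sketch below is a minimal, hypothetical rendering of that flow; the module names (`MHub`, `build_mmict_input`), dimensions, and the use of a single cross-attention layer are illustrative assumptions rather than the authors' released design, which lives in the repository linked above.

```python
# Minimal, hypothetical sketch of the MMICT input construction described in the abstract.
# All names and dimensions are assumptions; see https://github.com/KDEGroup/MMICT for the
# authors' actual implementation.
import torch
import torch.nn as nn


class MHub(nn.Module):
    """Assumed unified hub: one modality's features attend over the other's.

    - visual-guided textual features: text queries attend over visual keys/values
    - textual-guided visual features: visual queries attend over textual keys/values
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query_feats, context_feats, context_feats)
        return fused


def build_mmict_input(hub: MHub,
                      demo_text: torch.Tensor, demo_visual: torch.Tensor,
                      query_text: torch.Tensor, query_visual: torch.Tensor) -> torch.Tensor:
    """Concatenate in-context demonstration features with query features.

    Demonstrations supply visual-guided *textual* features; the query supplies
    textual-guided *visual* features, mirroring the description in the abstract.
    The result would be fed to a frozen MM-LLM as a soft-prompt sequence.
    """
    demo_feats = hub(demo_text, demo_visual)     # text attends to vision
    query_feats = hub(query_visual, query_text)  # vision attends to text
    return torch.cat([demo_feats, query_feats], dim=1)


if __name__ == "__main__":
    hub = MHub(dim=256)
    # Toy batch: 1 sample, 8 text tokens and 16 visual patches per item, dim 256.
    demo_text, demo_visual = torch.randn(1, 8, 256), torch.randn(1, 16, 256)
    query_text, query_visual = torch.randn(1, 8, 256), torch.randn(1, 16, 256)
    prompt = build_mmict_input(hub, demo_text, demo_visual, query_text, query_visual)
    print(prompt.shape)  # torch.Size([1, 24, 256])
```

In this reading, the contrast with "vanilla ICT" is that the raw demonstration and query tokens are not simply concatenated across modalities; each side is first fused through the shared hub before concatenation.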
Authors: Tao Chen, Enwei Zhang, Yuting Gao, Ke Li, Xing Sun, Yan Zhang, Hui Li, Rongrong Ji