MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples (2312.06363v3)

Published 11 Dec 2023 in cs.AI, cs.CL, and cs.LG

Abstract: Although In-Context Learning (ICL) brings remarkable performance gains to LLMs, the improvements remain lower than fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms the traditional fine-tuning strategy and the vanilla ICT method that directly takes the concatenation of all information from different modalities as input. Our implementation is available at: https://github.com/KDEGroup/MMICT.
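The abstract describes the M-Hub being reused in two directions: demonstrations enter as visual-guided textual features, while the query is represented by textual-guided visual features, and the model is tuned on outputs conditioned on both. The following is a minimal, hypothetical PyTorch sketch of such a training step, not the authors' released implementation (see the GitHub link above): the M-Hub is approximated by a single cross-attention layer, the frozen MM-LLM decoder by a linear head, and all names and tensor shapes are illustrative assumptions.

```python
# Hypothetical MMICT-style training step (illustrative only; the real M-Hub,
# MM-LLM, and demonstration format live in https://github.com/KDEGroup/MMICT).
import torch
import torch.nn as nn


class MHubSketch(nn.Module):
    """Toy stand-in for the Multi-Modal Hub: fuses one modality with another via
    cross-attention, so the same module can yield visual-guided textual features
    (text queries attend to vision) or textual-guided visual features
    (vision queries attend to text)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feats, context_feats):
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return fused


def mmict_step(m_hub, head, demo_vis, demo_txt, query_vis, query_txt,
               target_ids, vocab_size):
    """One in-context tuning step on a batch of (demonstration, query) pairs.
    demo_*, query_*: modality features of shape (B, L, D); target_ids: (B, T)."""
    # Demonstrations contribute visual-guided textual features ...
    demo_ctx = m_hub(demo_txt, demo_vis)
    # ... while the query is encoded as textual-guided visual features.
    query_rep = m_hub(query_vis, query_txt)
    # Concatenate demonstrations and query; a real MM-LLM decoder would sit
    # here, a linear head keeps the sketch self-contained and runnable.
    seq = torch.cat([demo_ctx, query_rep], dim=1)
    logits = head(seq)[:, -target_ids.size(1):, :]
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), target_ids.reshape(-1))


# Shape check with random tensors.
B, L, D, T, V = 2, 8, 256, 8, 1000
m_hub, head = MHubSketch(D), nn.Linear(D, V)
loss = mmict_step(m_hub, head,
                  torch.randn(B, L, D), torch.randn(B, L, D),
                  torch.randn(B, L, D), torch.randn(B, L, D),
                  torch.randint(0, V, (B, T)), V)
loss.backward()
```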

Authors (8)
  1. Tao Chen (397 papers)
  2. Enwei Zhang (9 papers)
  3. Yuting Gao (25 papers)
  4. Ke Li (722 papers)
  5. Xing Sun (93 papers)
  6. Yan Zhang (954 papers)
  7. Hui Li (1004 papers)
  8. Rongrong Ji (315 papers)
Citations (2)