Towards Robust Instruction Tuning on Multimodal Large Language Models (2402.14492v2)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: Fine-tuning LLMs on multi-task instruction-following data has proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent work on high-quality instruction-following data generation and selection requires considerable human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce INSTRAUG, an automatic instruction augmentation method for multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG significantly improves the alignment of multimodal LLMs (MLLMs) across 12 multimodal tasks, with benefits comparable to scaling up the training data multiple times.
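The abstract sketches the core idea: a small pool of generic meta instructions is applied to existing task instructions to multiply the size of an instruction-following dataset. The snippet below is a minimal, hypothetical sketch of that kind of template-based augmentation, assuming a simple list-of-dicts data format and invented meta-instruction templates; it is not the paper's actual INSTRAUG implementation.

```python
import random

# Hypothetical illustration of meta-instruction-based augmentation.
# The templates and data fields below are assumptions for illustration only.

META_INSTRUCTIONS = [
    lambda inst: f"Please {inst[0].lower()}{inst[1:]}",        # polite rephrasing
    lambda inst: f"{inst} Answer in a single short phrase.",   # add an answer-format hint
    lambda inst: f"Task: {inst}",                              # prepend a task marker
    lambda inst: f"You are shown an image. {inst}",            # add modality context
]

def augment(dataset, factor=4, seed=0):
    """Expand an instruction-following dataset by rewriting each example's
    instruction with randomly sampled meta instructions."""
    rng = random.Random(seed)
    augmented = []
    for example in dataset:
        for _ in range(factor):
            rewrite = rng.choice(META_INSTRUCTIONS)
            augmented.append({**example, "instruction": rewrite(example["instruction"])})
    return augmented

if __name__ == "__main__":
    toy = [{"image": "coco_0001.jpg",
            "instruction": "Describe the main object in the image.",
            "answer": "a dog"}]
    for ex in augment(toy):
        print(ex["instruction"])
```

With a larger template pool and `factor` set accordingly, the same pattern yields the order-of-magnitude dataset expansion the abstract describes.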

References (56)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Medic: a multi-task learning dataset for disaster image classification. Neural Computing and Applications, 35(3):2609–2632.
  3. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  4. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
  5. Qwen technical report. arXiv preprint arXiv:2309.16609.
  6. Introducing our multimodal models.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  9. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  10. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  11. Instructblip: Towards general-purpose vision-language models with instruction tuning.
  12. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335.
  13. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617.
  14. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905.
  15. Unnatural instructions: Tuning language models with (almost) no human labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada. Association for Computational Linguistics.
  16. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  17. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624.
  18. Segment anything. arXiv preprint arXiv:2304.02643.
  19. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73.
  20. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
  21. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
  22. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
  23. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
  24. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259.
  25. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  26. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
  27. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  28. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685.
  29. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.
  30. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
  31. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214.
  32. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  33. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
  34. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
  35. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326.
  36. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047.
  37. Mirac Suzgun and Adam Tauman Kalai. 2024. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954.
  38. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 3(6):7.
  39. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  40. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  41. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  42. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR.
  43. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR.
  44. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
  45. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  46. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  47. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  48. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333.
  49. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  50. MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445–11465, Toronto, Canada. Association for Computational Linguistics.
  51. Dataset pruning: Reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329.
  52. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502.
  53. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
  54. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
  55. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
  56. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004.
Authors (3)
  1. Wei Han (202 papers)
  2. Hui Chen (298 papers)
  3. Soujanya Poria (138 papers)