
CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning (2403.04343v1)

Published 7 Mar 2024 in cs.AI

Abstract: Visual instruction tuning is a key training stage of large multimodal models (LMMs). Nevertheless, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance, due to differing instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon where learning one task potentially enhances performance on other tasks owing to overlapping knowledge domains, and (2) Intra-Task Difficulty, the learning difficulty within a single task. By quantifying these two dimensions with performance-based metrics, task balancing is enabled by assigning greater weights to tasks that contribute substantially to others, receive minimal contributions from others, and exhibit high intra-task difficulty. Experiments show that CoTBal yields superior overall performance in multi-task visual instruction tuning.
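
Since the abstract states the weighting principle only in words, a small sketch may help. The snippet below is a hypothetical Python illustration, not the paper's actual algorithm: the function name `cotbal_weights`, the contribution matrix, the difficulty vector, and the softmax normalization are all assumptions made for clarity.

```python
# A minimal sketch (not the paper's exact formulation) of CoTBal-style
# task weighting, assuming hypothetical performance-based inputs.
import numpy as np

def cotbal_weights(contribution: np.ndarray, difficulty: np.ndarray) -> np.ndarray:
    """Combine the two balancing dimensions into per-task weights.

    contribution[i, j]: estimated gain on task j from training on task i
        (off-diagonal entries only; the metric itself is assumed here).
    difficulty[i]: estimated learning difficulty of task i, e.g. the
        validation gap left after single-task tuning (also an assumption).
    """
    n = contribution.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    gives = np.where(off_diag, contribution, 0.0).sum(axis=1)     # i -> others
    receives = np.where(off_diag, contribution, 0.0).sum(axis=0)  # others -> i
    # Favor tasks that give much, receive little, and are hard to learn.
    score = gives - receives + difficulty
    # Softmax into positive weights, rescaled so a uniform mix averages 1.
    w = np.exp(score - score.max())
    return n * w / w.sum()

# Toy usage with 3 tasks; values are made up for illustration.
contrib = np.array([[0.0, 0.3, 0.1],
                    [0.1, 0.0, 0.2],
                    [0.0, 0.1, 0.0]])
diff = np.array([0.5, 0.2, 0.8])
print(cotbal_weights(contrib, diff))  # approx. [1.30, 0.64, 1.06]
```

The softmax normalization here is just one convenient way to turn the combined scores into positive loss or sampling weights; the paper quantifies both dimensions with its own performance-based metrics.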

Authors (4)
  1. Yanqi Dai (5 papers)
  2. Dong Jing (7 papers)
  3. Nanyi Fei (14 papers)
  4. Zhiwu Lu (51 papers)
Citations (1)
