CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning (2403.04343v1)
Abstract: Visual instruction tuning is a key training stage of large multimodal models (LMMs). However, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance, owing to differing instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon whereby learning one task can enhance performance on other tasks, owing to overlapping knowledge domains, and (2) Intra-Task Difficulty, the learning difficulty within a single task. By quantifying these two dimensions with performance-based metrics, task balancing is enabled by assigning greater weights to tasks that offer substantial contributions to others, receive minimal contributions from others, and exhibit high intra-task difficulty. Experiments show that CoTBal yields superior overall performance in multi-task visual instruction tuning.
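To make the weighting intuition concrete, below is a minimal Python sketch of how the two balancing dimensions could be combined into task weights. This is an illustrative assumption, not the paper's actual formulation: the function name `cotbal_weights`, the matrix form of the contribution metric, the `alpha` trade-off parameter, and the softmax normalization are all hypothetical; the paper defines its own performance-based metrics and combination rule.

```python
import numpy as np

def cotbal_weights(contribution, difficulty, alpha=0.5):
    """Hypothetical sketch of the task-weighting idea in the abstract.

    contribution: (T, T) matrix; contribution[i, j] is an estimated
        (performance-based) benefit that learning task i brings to task j.
    difficulty:   (T,) vector of intra-task difficulty scores.
    alpha:        assumed trade-off between the two balancing dimensions.
    """
    contribution = np.asarray(contribution, dtype=float)
    difficulty = np.asarray(difficulty, dtype=float)

    # Contribution each task offers to the others (diagonal excluded).
    offered = contribution.sum(axis=1) - np.diag(contribution)
    # Contribution each task receives from the others.
    received = contribution.sum(axis=0) - np.diag(contribution)

    # Upweight tasks that offer much, receive little, and are hard to learn,
    # matching the three criteria stated in the abstract.
    score = alpha * (offered - received) + (1 - alpha) * difficulty

    # Softmax-normalize scores into mixing weights for the multi-task loss.
    score = score - score.max()  # subtract max for numerical stability
    weights = np.exp(score) / np.exp(score).sum()
    return weights

# Toy example with three tasks (e.g., VQA, captioning, OCR-style QA).
C = np.array([[1.0, 0.6, 0.2],
              [0.3, 1.0, 0.1],
              [0.1, 0.2, 1.0]])
d = np.array([0.4, 0.7, 0.9])
print(cotbal_weights(C, d))  # weights sum to 1 over the three tasks
```

Under this sketch, the resulting weights would scale each task's loss (or sampling probability) during instruction tuning; the softmax is one of several plausible normalizations.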