Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection (2402.12501v1)

Published 19 Feb 2024 in cs.CL

Abstract: Data selection in instruction tuning emerges as a pivotal process for acquiring high-quality data and training instruction-following LLMs, but it is still a new and unexplored research area for vision-language models (VLMs). Existing data selection approaches on LLMs either rely on a single unreliable score or use downstream tasks for selection, which is time-consuming and can lead to potential over-fitting on the chosen evaluation datasets. To address this challenge, we introduce a novel dataset selection method, Self-Filter, that utilizes the VLM itself as a filter. This approach is inspired by the observation that VLMs benefit from training with the most challenging instructions. Self-Filter operates in two stages. In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM. In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity. Comprehensive experiments on LLaVA and MiniGPT-4 show that Self-Filter can reach better results than the full-data setting with merely about 15% of the samples, and can achieve superior performance against competitive baselines.

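The abstract describes Self-Filter's second stage only at a high level: once the score net has been co-trained with the VLM, its difficulty scores are used to keep the most challenging instructions while penalizing near-duplicates to preserve diversity. The sketch below is a minimal, hypothetical rendering of that selection step under assumed details; the greedy loop, the cosine-similarity penalty, and all names (select_samples, penalty) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_samples(difficulty, features, k, penalty=0.5):
    """Greedy difficulty-based selection with a similarity penalty.

    difficulty : (N,) per-instruction difficulty scores (stage-1 score net output).
    features   : (N, D) instruction embeddings used to measure similarity
                 (assumed here; the paper's representation may differ).
    k          : number of samples to keep (e.g. ~15% of the pool).
    penalty    : weight of the similarity penalty (hypothetical knob).
    """
    # Cosine similarity between all pairs of samples.
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T

    selected = []
    adjusted = difficulty.astype(float)
    for _ in range(k):
        idx = int(np.argmax(adjusted))
        selected.append(idx)
        adjusted[idx] = -np.inf          # never pick the same sample twice
        # Down-weight samples similar to the one just chosen to keep diversity.
        adjusted -= penalty * sim[:, idx]
    return selected

# Toy usage: keep the 150 hardest-but-diverse samples out of 1000.
scores = np.random.rand(1000)
embs = np.random.randn(1000, 256)
kept = select_samples(scores, embs, k=150)
```

Setting k to roughly 15% of the pool mirrors the budget reported in the abstract, but the true difficulty scores and the exact form of the diversity penalty come from the co-trained score net described in stage one, which this sketch does not reproduce.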