
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (2405.15574v4)

Published 24 May 2024 in cs.CV

Abstract: The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to multifaceted information required for diverse capabilities, including fundamental image understanding, real-world knowledge about common-sense and non-object concepts (e.g., charts, diagrams, symbols, signs, and math problems), and step-by-step procedures for solving complex questions. Drawing from the multifaceted information, we present a new efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages multifaceted rationale to enhance understanding and answering capabilities. To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity. We introduce a new concept of traversal of rationale that facilitates efficient embedding of rationale. Subsequently, the backbone multimodal LLM (MLM) is trained to generate answers with the aid of rationale. Through these steps, Meteor achieves significant improvements in vision language performances across multiple evaluation benchmarks requiring diverse capabilities, without scaling up the model size or employing additional vision encoders and computer vision models.

Overview of Mamba-based Traversal of Rationales in LLVMs

The paper introduces a novel large language and vision model (LLVM), Mamba-based traversal of rationales (Meteor), which seeks to enhance vision-language performance through the integration of multifaceted rationales. The method relies neither on scaling up the model size nor on additional vision encoders or computer vision models at inference time. Its central contribution is the traversal of rationale mechanism, which uses the Mamba architecture to embed lengthy rationales with linear time complexity.
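
To see why linear scaling in sequence length matters for embedding long rationales, the following is a minimal sketch of the plain state-space recurrence that underlies architectures like Mamba; it omits Mamba's selective, input-dependent parameterization, and all names and dimensions are illustrative.

```python
import torch

def ssm_scan(x, A, B, C):
    """x: [T, d_in]; A: [d_state, d_state]; B: [d_state, d_in]; C: [d_out, d_state].
    One left-to-right pass over the sequence, so cost is O(T) in sequence length."""
    h = torch.zeros(A.shape[0])
    outputs = []
    for x_t in x:                 # single scan over the T steps
        h = A @ h + B @ x_t       # fixed-size hidden-state update
        outputs.append(C @ h)     # per-step output
    return torch.stack(outputs)   # [T, d_out]

# Example: a 1,000-step "rationale" of 16-dim features with a 32-dim state.
y = ssm_scan(torch.randn(1000, 16), torch.eye(32) * 0.9,
             torch.randn(32, 16), torch.randn(8, 32))
```

Each step touches a fixed-size hidden state exactly once, so the cost of embedding a rationale grows linearly with its length, in contrast to self-attention, whose cost grows quadratically with the number of tokens.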

Methodology

1. Data Curation and Rationale Generation:

The authors compiled a dataset of 2.1 million question-answer pairs drawn from various visual instruction tuning datasets. To generate detailed rationales, these question-answer pairs were processed with the Claude Haiku API and then refined through human review assisted by GPT-4V. The curation yielded 1.1 million question-rationale-answer triples spanning a wide range of tasks, including fundamental image understanding, common-sense knowledge, and complex problem-solving procedures.
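
The curation pipeline can be pictured roughly as the loop below. This is a minimal sketch: the prompt wording, the `generate` callable (standing in for a Claude Haiku API wrapper), and the acceptance check (standing in for the human review assisted by GPT-4V) are illustrative assumptions, not the authors' actual procedure.

```python
# Minimal sketch of the rationale-curation loop; prompt wording and the
# acceptance check are illustrative, not the authors' actual procedure.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class Triple:
    question: str
    rationale: str
    answer: str

def curate_rationales(
    qa_pairs: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],  # e.g. a thin wrapper around the Claude Haiku API
) -> List[Triple]:
    triples = []
    for question, answer in qa_pairs:
        prompt = (
            "Write a detailed step-by-step rationale for the following.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        rationale = generate(prompt)
        # Stand-in for the paper's human review assisted by GPT-4V: keep only
        # rationales that are non-trivial and consistent with the answer.
        if len(rationale.split()) > 20 and answer.lower() in rationale.lower():
            triples.append(Triple(question, rationale, answer))
    return triples
```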

2. Model Architecture:

Meteor integrates several key components (a sketch of how they compose follows the list):

  • A vision encoder based on the CLIP-L/14 model for extracting visual features.
  • The Mamba architecture, referred to as Meteor-Mamba, designed for embedding lengthy rationales efficiently.
  • A backbone multimodal LLM (Meteor-MLM) built upon InternLM2-7B, which leverages the embedded rationales for answer generation.
  • Vision and tor projectors that adapt feature dimensions between the vision encoder, Meteor-Mamba, and Meteor-MLM.
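
A rough picture of how these pieces compose is sketched below; module internals, hidden sizes, and argument names are placeholders, and only the data flow between the listed components is intended to reflect the description above.

```python
# Illustrative composition of Meteor's components (not the authors' code).
# Hidden sizes, argument names, and module internals are placeholders.
import torch.nn as nn

class Projector(nn.Module):
    """Small MLP mapping one component's feature width onto another's."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):
        return self.net(x)

class MeteorSketch(nn.Module):
    def __init__(self, vision_encoder, mamba, mlm,
                 vision_dim=1024, mamba_dim=2048, mlm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a CLIP-L/14 backbone
        self.vision_proj = Projector(vision_dim, mlm_dim)
        self.mamba = mamba                     # Meteor-Mamba: rationale embedder
        self.tor_proj = Projector(mamba_dim, mlm_dim)
        self.mlm = mlm                         # Meteor-MLM: InternLM2-7B backbone

    def forward(self, image, text_tokens, rationale_tokens):
        vis = self.vision_proj(self.vision_encoder(image))   # [B, Nv, mlm_dim]
        tor = self.tor_proj(self.mamba(rationale_tokens))    # [B, Nr, mlm_dim]
        # The backbone MLM conditions on projected vision features, projected
        # rationale (<tor>) features, and the question tokens to produce answers.
        return self.mlm(vision_features=vis, tor_features=tor, text_tokens=text_tokens)
```

Treating both projectors as small MLPs is an assumption; the essential point is that visual features and Mamba-embedded rationale features are both mapped into the backbone MLM's embedding space before answer generation.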

Training and Inference

The training process consists of two principal steps (a sketch of the traversal mechanism follows the list):

  1. Embedding Rationales: Meteor-Mamba is trained to embed the lengthy rationales autoregressively using the proposed traversal of rationale, in which special <tor> tokens segment the rationales and carry their information into Meteor-MLM.
  2. Vision-Language Training: Subsequently, the entire Meteor architecture is trained on the question-answer pairs, enabling the model to generate answers supported by the embedded rationales without requiring explicit rationales at inference.
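
The traversal itself can be pictured with the small helper below; the fixed segment length, the example token id, and the exact token handling are illustrative assumptions rather than the paper's prescribed implementation.

```python
# Minimal sketch of "traversal of rationale" segmentation (illustrative;
# the segment length and exact token handling are assumptions).
from typing import List

def traverse_rationale(rationale_ids: List[int], tor_id: int,
                       segment_len: int = 64) -> List[int]:
    """Split a long tokenized rationale into segments, appending a <tor>
    token after each one, so Meteor-Mamba can embed the rationale piece by
    piece and Meteor-MLM can later condition on the <tor> positions."""
    traversed = []
    for start in range(0, len(rationale_ids), segment_len):
        traversed.extend(rationale_ids[start:start + segment_len])
        traversed.append(tor_id)
    return traversed

# e.g. traverse_rationale(list(range(200)), tor_id=50000, segment_len=64)
# -> 200 rationale tokens interleaved with four <tor> tokens (id is illustrative).
```

Sequences of this form are what Meteor-Mamba would embed autoregressively in step 1, with the features at the <tor> positions handed to Meteor-MLM during step 2.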

Results and Evaluation

Meteor demonstrates substantial improvements across multiple benchmarks, including MME, AI2D, MathVista, and MM-Vet. In evaluations against a range of existing open-source LLVMs, Meteor consistently outperforms them, showing its ability to handle diverse tasks that require intricate understanding and reasoning.

For instance, on the MME benchmark, which covers multifaceted image-understanding tasks, Meteor achieves significantly higher scores than models such as LLaVA-Next-7B and InternLM-XC-7B. Moreover, in the more challenging evaluations (Tables 2(a) and 2(b)), Meteor surpasses other state-of-the-art models on complex benchmarks such as MMStar and MathVerse, further underscoring the efficacy of embedding multifaceted rationales.

Implications and Future Directions

The results indicate that embedding multifaceted rationales considerably enhances a model's ability to handle complex vision-language tasks, making the Meteor architecture a valuable alternative to increasing model size or adding encoders. The approach also mitigates hallucination, as evidenced by its performance on the POPE and HallusionBench benchmarks.

Future Developments

While Meteor has demonstrated impressive results with a 7B model, similar methodologies could be adapted to even smaller models (in the 1-3B parameter range) by leveraging techniques such as mixture of depths and layer analysis. This could further democratize access to highly capable vision-language models with reduced computational requirements. Rationale embedding also offers a promising avenue for improving the interpretability and robustness of generative AI systems, potentially extending to more domains that require nuanced understanding and reasoning.

Authors (4)
  1. Byung-Kwan Lee (14 papers)
  2. Chae Won Kim (10 papers)
  3. Beomchan Park (6 papers)
  4. Yong Man Ro (91 papers)
Citations (15)