A Survey of Multimodal Large Language Model from A Data-centric Perspective (2405.16640v2)

Published 26 May 2024 in cs.AI, cs.CL, cs.CV, and cs.MM

Abstract: Multimodal LLMs (MLLMs) enhance the capabilities of standard LLMs by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal role in the development and refinement of these models. In this survey, we comprehensively review the literature on MLLMs from a data-centric perspective. Specifically, we explore methods for preparing multimodal data during the pretraining and adaptation phases of MLLMs. Additionally, we analyze the evaluation methods for the datasets and review the benchmarks for evaluating MLLMs. Our survey also outlines potential future research directions. This work aims to provide researchers with a detailed understanding of the data-driven aspects of MLLMs, fostering further exploration and innovation in this field.

A Survey of Multimodal LLMs from a Data-centric Perspective

The paper "A Survey of Multimodal LLMs from a Data-centric Perspective" presents a comprehensive exploration of Multimodal LLMs (MLLMs) with a focus on the pivotal role of data. The authors systematically review existing literature to shed light on the methodologies for preparing, processing, and evaluating multimodal data, while also outlining potential future research directions. This essay provides an expert overview of the key contributions, findings, and implications of this paper.

Overview of MLLMs and Data-centric AI

The integration of multimodal data, such as text, images, audio, video, and 3D environments, significantly enhances the capabilities of traditional LLMs. MLLMs such as GPT-4, Flamingo, and BLIP-2 demonstrate compelling performance on traditional multimodal tasks while retaining robust language understanding. In contrast to model-centric approaches, which focus primarily on architectural enhancements, a data-centric perspective emphasizes iterative improvement of the datasets themselves. This paradigm shift toward data-centric AI is motivated by the recognition that the quality, diversity, and representativeness of training data are instrumental to the success of AI models.

Data Collection and Processing

The paper details the main sources for data collection: common webpages, social media, academic papers, books, and professional domains. Meticulous curation is needed to turn these raw sources into usable datasets:

  1. Common Webpages: Leveraging large-scale web crawls like CommonCrawl for diverse data ingestion.
  2. Social Media: Extracting rich multimodal data from platforms like Stack Exchange, Reddit, and YouTube.
  3. Academic Papers: Utilizing vast corpora like arXiv and S2ORC for high-quality scholarly content.
  4. Books: Mining extensive libraries like Project Gutenberg and BookCorpus for textual richness.
  5. Professional Sources: Domain-specific datasets derived from legal, medical, and financial resources.

Data processing is equally critical. The authors discuss filtering methods that eliminate low-quality or irrelevant data and deduplication techniques that minimize redundancy. For instance, textual filtering tools like FastText and LangDetect enforce language consistency, while semantic deduplication uses embedding models like Sentence-BERT to identify and remove semantically similar content. Enhancing multimodal datasets then involves augmenting textual descriptions and improving image or video quality, as demonstrated by methods used in LLaVA-1.5 and BLIVA.
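To make the deduplication step concrete, here is a minimal sketch of embedding-based semantic deduplication in the spirit of the methods surveyed (e.g., SemDeDup). The sentence-transformers checkpoint and the 0.9 similarity cutoff are illustrative assumptions, not settings prescribed by the paper.

```python
# Minimal sketch: greedy semantic deduplication over text embeddings.
# Model choice and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

captions = [
    "A dog runs across a grassy field.",
    "A dog is running through the grass.",  # near-duplicate of the first
    "Two children play chess indoors.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(captions, normalize_embeddings=True)  # unit-norm rows

SIM_THRESHOLD = 0.9  # pairs above this cosine similarity count as duplicates
kept: list[int] = []
for i in range(len(captions)):
    # keep item i only if it is not too close to anything already kept
    if all(float(np.dot(emb[i], emb[j])) < SIM_THRESHOLD for j in kept):
        kept.append(i)

print([captions[i] for i in kept])  # the near-duplicate caption is dropped
```

The same greedy pass extends to image-text pairs by swapping in a multimodal encoder for the text-only model.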

Data-centric Pre-training

The pre-training phase typically proceeds in two stages: the LLM backbone and the modality encoders are first trained independently, and these components are then integrated via input projectors. The resulting MLLM is further trained on a mixture of domains and modalities; for example, choosing a good ratio of image-caption pairs to interleaved image-text documents can enhance both zero-shot and few-shot performance.

Data selection further improves training efficiency and model performance. Techniques such as active-learning-based selection and pre-training data filtering prioritize high-quality samples using metrics like CLIP scores or more advanced learned filters.
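As an illustration of score-based pre-training selection, the following is a hedged sketch of CLIP-score filtering for image-caption pairs using the Hugging Face CLIP interface. The checkpoint and the 0.25 cutoff are assumptions chosen for illustration (web-scale pipelines use comparable thresholds), not values fixed by the survey.

```python
# Hedged sketch: keep an image-caption pair only if its CLIP similarity
# clears a cutoff. Checkpoint and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_pair(image: Image.Image, caption: str,
              threshold: float = 0.25) -> bool:
    """Simple accept/reject filter over candidate image-caption pairs."""
    return clip_score(image, caption) >= threshold
```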

Data-centric Adaptation

Adaptation involves fine-tuning models using multimodal instruction-response datasets. Notable methodologies include:

  1. Supervised Fine-tuning (SFT): Constructing instruction-response pairs across various downstream tasks, including captioning, question answering, reasoning, and classification.
  2. Data Selection for SFT: Employing coreset-based, gradient-based, and LLM-based methods to refine datasets toward high-quality instructional data (see the sketch after this list).
  3. Human Preference Alignment: Reinforcement Learning from Human Feedback (RLHF), as used in InstructGPT, aligns model responses with human values, promoting accuracy and ethical behavior.
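To make the coreset idea in item 2 concrete, below is a minimal k-center-greedy sketch over instruction embeddings. The random placeholder embeddings stand in for whatever encoder one actually uses; this is one common coreset heuristic, not the survey's specific algorithm.

```python
# Minimal sketch: k-center greedy coreset selection over instruction
# embeddings. Placeholder embeddings stand in for a real encoder.
import numpy as np

def k_center_greedy(emb: np.ndarray, k: int) -> list[int]:
    """Pick k indices that cover the embedding space: each new pick is
    the point farthest from everything selected so far."""
    selected = [0]  # arbitrary seed point
    dists = np.linalg.norm(emb - emb[0], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

# Usage: choose 100 diverse instruction-response pairs out of 10,000.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 64))  # placeholder instruction embeddings
subset = k_center_greedy(emb, k=100)
```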

Evaluation of MLLMs

The paper emphasizes data evaluation metrics for assessing dataset quality, such as the following (simple embedding-based proxies for diversity and similarity are sketched after this list):

  • Diversity: Evaluating the range of concepts represented in the dataset.
  • Quality: Ensuring the factual accuracy and relevance of data points.
  • Similarity: Measuring how well datasets align with the target distribution.
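As rough illustrations of the diversity and similarity criteria above, the following sketches compute simple embedding-based proxies. These are assumptions chosen for clarity, not the exact formulations (e.g., the Vendi score) reviewed in the paper.

```python
# Hedged sketches: embedding-based proxies for dataset diversity and
# similarity to a target distribution. Both are illustrative, not the
# survey's exact metrics.
import numpy as np

def diversity(emb: np.ndarray) -> float:
    """Mean pairwise cosine *distance*; higher means more varied content."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    mean_off_diag = (sims.sum() - n) / (n * (n - 1))  # diagonal entries are 1
    return 1.0 - float(mean_off_diag)

def similarity_to_target(emb: np.ndarray, target_emb: np.ndarray) -> float:
    """Cosine similarity between the mean embedding of a candidate dataset
    and that of a sample from the target distribution."""
    a, b = emb.mean(axis=0), target_emb.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```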

The authors also review benchmark datasets used for evaluating MLLMs. These datasets, tailored for tasks such as captioning, question answering, and reasoning, serve as critical tools for measuring model performance across different dimensions.

Implications and Future Directions

The research highlights significant implications for both theoretical advancements and practical applications of AI. The findings underscore the value of high-quality, well-curated datasets in improving model robustness and generalizability. The paper calls for further studies on optimal data selection strategies, the development of MLLM-specific data processing systems, and an in-depth exploration of scaling laws for data quantity and quality.

Conclusion

In conclusion, the paper provides a thorough survey of MLLMs from a data-centric perspective, offering valuable insights into data collection, processing, pre-training, and adaptation. It underscores the importance of data in shaping the capabilities of MLLMs and sets the stage for future research aimed at optimizing data-centric methodologies in AI. This comprehensive review is an essential reference for researchers aiming to advance the field of multimodal AI.

Authors (15)
  1. Tianyi Bai
  2. Hao Liang
  3. Binwang Wan
  4. Ling Yang
  5. Bozhou Li
  6. Yifan Wang
  7. Bin Cui
  8. Conghui He
  9. Binhang Yuan
  10. Wentao Zhang
  11. Yanran Xu
  12. Xi Li
  13. Shiyu Li
  14. Ping Huang
  15. Jiulong Shan