What matters when building vision-language models? (2405.02246v1)

Published 3 May 2024 in cs.CV and cs.AI

Abstract: The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Understanding Vision-Language Models: Insights and Innovations from the Idefics2 Study

The Significance of Backbone Models

In the field of vision-language models (VLMs), the backbone models play a crucial role. These are the pre-trained networks the VLM builds upon, usually a vision part (an image encoder) and a language part (an LLM). The paper rigorously explores how the choice of these backbones influences the overall performance of the VLM.

  • LLM Impact: The choice of LLM has a larger effect than the choice of vision encoder. For instance, upgrading from one LLM to a more advanced version boosted performance noticeably more than comparable upgrades on the vision side.
  • Vision Model Observations: Better vision encoders also improved performance, but the gains were less pronounced than those from LLM upgrades.

These findings emphasize the importance of selecting high-quality backbone models, particularly in the language domain, to drive superior VLM performance.

Architectural Choices: Fully Autoregressive vs. Cross-Attention

Choosing the right architecture for integrating visual and textual information is pivotal. The paper compares two prominent architectures:

  • Fully Autoregressive Architecture: Concatenates the (projected) outputs of the vision encoder with the text embeddings and feeds the joint sequence to the LLM. It performs well, especially when all components are trainable, but can suffer from training instabilities.
  • Cross-Attention Architecture: Interleaves dedicated cross-attention layers within the LLM so that the text hidden states attend to the visual features. It performs particularly well when the vision and language backbones are kept frozen during VLM training, but benefits less than the fully autoregressive approach when all parts are trainable. A minimal code sketch contrasting the two fusion strategies follows this list.
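
To make the contrast concrete, the sketch below shows the two fusion strategies in PyTorch. It is a minimal illustration with assumed module names and dimensions, not the paper's exact implementation (Idefics2 additionally uses pooling and an MLP modality projection before concatenation).

```python
import torch
import torch.nn as nn

class FullyAutoregressiveFusion(nn.Module):
    """Project image-encoder outputs into the LLM embedding space and
    concatenate them with the text embeddings; the decoder then attends
    over the joint sequence autoregressively."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.projector = nn.Linear(vision_dim, text_dim)  # modality projection

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, n_visual_tokens, vision_dim)
        # text_embeds:  (batch, n_text_tokens,  text_dim)
        visual_tokens = self.projector(vision_feats)
        return torch.cat([visual_tokens, text_embeds], dim=1)

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style block interleaved between (typically frozen) LLM
    layers: text hidden states cross-attend to the visual features, with
    a tanh gate initialized at zero so training starts from the
    unmodified language model."""

    def __init__(self, text_dim: int, vision_dim: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            text_dim, n_heads, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(text_hidden, vision_feats, vision_feats)
        return text_hidden + torch.tanh(self.gate) * attended
```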

While the fully autoregressive approach initially showed training instabilities when the backbones were unfrozen, adapting the pre-trained backbones with Low-Rank Adaptation (LoRA) stabilized training and significantly improved performance, making it a strong choice for building efficient and powerful VLMs.
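
As a rough illustration of the LoRA idea, the snippet below wraps a frozen linear layer with a trainable low-rank update; a real training setup would typically apply this to the attention projections of both backbones via a library such as PEFT. The shapes and hyperparameters here are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Keep the pretrained weight frozen and learn a low-rank update
    (alpha / r) * B @ A, so only a small fraction of parameters are trained."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: adapt a hypothetical 4096-dimensional attention projection.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 8, 4096))
```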

Efficiency and Performance Trade-offs

Efficiency in training and inference is as important as model performance. The research highlights several strategies to balance these aspects:

  • Reducing Visual Tokens: Using a trainable pooling layer to reduce the number of visual tokens (the image features passed to the LLM) improved both efficiency and performance, challenging the assumption that very high visual token counts are necessary. A sketch of such a learned pooling module follows this list.
  • Handling Image Resolutions: Adaptive resolution handling, where images maintain their original aspect ratio and are processed in various resolutions, provided flexibility and memory savings without sacrificing performance.
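
The token-reduction idea can be sketched as a perceiver-style learned pooling: a fixed set of learned queries cross-attends to however many patch features the encoder produces, so variable resolutions and aspect ratios still yield a fixed number of visual tokens. The module below is a simplified stand-in for the resampler described in the paper; the dimensions and number of queries are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """A small set of learned queries cross-attends to the image encoder's
    patch features, compressing them to a fixed number of visual tokens
    regardless of the input resolution."""

    def __init__(self, dim: int, n_queries: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_patches, dim); n_patches varies with resolution/aspect ratio
        queries = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(queries, patch_feats, patch_feats)
        return pooled  # (batch, n_queries, dim)

# Example: 1024 patch features from a high-resolution image become 64 visual tokens.
pooler = LearnedPooling(dim=768)
print(pooler(torch.randn(2, 1024, 768)).shape)  # torch.Size([2, 64, 768])
```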

These strategies enable more efficient model training and deployment, particularly when handling diverse and large-scale visual data.

Implications and Future Directions

The findings from the Idefics2 paper pave the way for more purposeful and informed design choices in the development of VLMs. Understanding the impact of model architectures, backbone selections, and efficiency strategies not only helps in building better models but also in fine-tuning them for specialized applications.

Looking ahead, these insights could influence future research directions, particularly in exploring new architectures and training methodologies that further optimize the balance between performance and computational efficiency. The potential applications of such enhanced VLMs are extensive, ranging from improved interactive AI systems to advanced content analysis tools.

Conclusion

The Idefics2 paper provides a comprehensive evaluation of various critical aspects in the design and implementation of vision-LLMs. By systematically testing and comparing different approaches, it offers valuable insights that contribute to the advancement of this technology, setting a benchmark for future endeavors in the AI and machine learning community.

Authors (4)
  1. Hugo Laurençon (11 papers)
  2. Léo Tronchon (5 papers)
  3. Matthieu Cord (129 papers)
  4. Victor Sanh (21 papers)