UnIVAL: Unified Model for Image, Video, Audio and Language Tasks (2307.16184v2)

Published 30 Jul 2023 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: LLMs have made the ambitious quest for generalist agents significantly far from being a fantasy. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small- to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question that we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy dataset sizes or models with billions of parameters, the ~0.25B parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows competitive performance to existing state-of-the-art approaches across image and video-text tasks. The feature representations learned from image and video-text modalities allow the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing their benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.

A Unified Model for Diverse Multimodal Tasks

This paper proposes a unified model designed to address tasks across multiple modalities, including image, video, audio, and text. The aim is to tame the diversity and heterogeneity of multimodal tasks, which are typically handled by separate, task-specific models. The work contributes to the burgeoning field of unified models, with a particular focus on scalability and efficiency.

Model Architecture and Methodology

The model employs a sequence-to-sequence neural architecture specifically designed to handle the representation and transformation of different modalities into a unified token-based input format. Notably, it leverages a relatively moderate model size of approximately 0.25 billion parameters, significantly smaller than many existing multi-billion parameter models in the field. This reduction in size is achieved without sacrificing the model’s ability to handle multiple modalities, an important consideration for resource-constrained environments.
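
To make this unification concrete, every task is expressed as conditional text generation over a shared token vocabulary. The sketch below shows how different tasks could be framed as (prompt, target) pairs in such a format; the prompt wording and the discretized location tokens are illustrative assumptions, not the paper's exact templates.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedExample:
    """One training instance in a shared sequence-to-sequence format.
    Prompt phrasing and the <loc_i> bin tokens are illustrative assumptions."""
    modality_input: Optional[str]   # path / id of the image, video, or audio clip
    prompt: str                     # task instruction, expressed as plain text
    target: str                     # answer, caption, or discretized box tokens

examples = [
    UnifiedExample("img_001.jpg", "What does the image describe?", "A dog chasing a ball."),
    UnifiedExample("img_002.jpg", "Which region does 'the red car' describe?",
                   "<loc_12> <loc_48> <loc_301> <loc_407>"),
    UnifiedExample("vid_007.mp4", "What does the video describe?", "A person slicing vegetables."),
    UnifiedExample("aud_003.wav", "What does the audio describe?", "Rain falling on a metal roof."),
]
```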

Key to the architecture is a linear connection layer that tokenizes non-textual data, such as images and audio, by mapping encoder features into the shared input space of a pretrained LLM, which forms the core of the system's processing. The model is trained with a next-token prediction objective, allowing it to both understand and generate coherent language outputs across diverse tasks.
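
As a rough illustration of this design, the sketch below projects features from a modality encoder into the embedding space of a text decoder through a single linear layer; the module names, dimensions, and feature shapes are assumptions made for the example, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LinearModalityConnector(nn.Module):
    """Map features from a modality encoder into the token-embedding space of a
    sequence-to-sequence language model. Dimensions are illustrative assumptions."""

    def __init__(self, feature_dim: int = 1024, lm_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(feature_dim, lm_dim)

    def forward(self, modality_features: torch.Tensor) -> torch.Tensor:
        # modality_features: (batch, num_patches_or_frames, feature_dim)
        # returns pseudo-token embeddings: (batch, seq_len, lm_dim)
        return self.proj(modality_features)

# Example: image patch features become "visual tokens" prepended to text embeddings.
connector = LinearModalityConnector()
image_feats = torch.randn(2, 49, 1024)   # e.g. a 7x7 grid of patch features
text_embeds = torch.randn(2, 20, 768)    # embedded text prompt tokens
visual_tokens = connector(image_feats)
lm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # unified token sequence
```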

Performance Evaluation

The paper reports competitive performance across a suite of standard benchmarks. On visual grounding, the model achieves state-of-the-art results on the RefCOCO, RefCOCO+, and RefCOCOg datasets. On vision-language evaluations such as VQAv2 (visual question answering) and image captioning on MSCOCO, it rivals, and often outperforms, existing approaches trained on larger datasets.

Innovative Contributions

Among the novel contributions of this paper is the demonstration of multimodal curriculum learning, a systematic way to incorporate multiple modalities into training efficiently. Modalities are introduced incrementally, so the model never has to process all data types at once; this curtails computational cost and, as the results show, can improve generalization to new or less familiar modalities. A minimal sketch of such a staged schedule follows.
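
The following sketch illustrates the idea of staged multimodal training under assumed dataset names and a generic `train_one_stage` helper; the actual schedule, data mixture, and task-balancing weights used by UnIVAL differ.

```python
# Illustrative curriculum: start from text/image tasks, then add video,
# reusing the weights from the previous stage each time.
# Dataset names and the train_one_stage() helper are hypothetical.

from typing import Dict, List

CURRICULUM: List[Dict] = [
    {"stage": 1, "modalities": ["text", "image"],
     "datasets": ["image_caption", "vqa"]},
    {"stage": 2, "modalities": ["text", "image", "video"],
     "datasets": ["image_caption", "vqa", "video_caption", "video_qa"]},
]

def run_curriculum(model, curriculum, train_one_stage):
    for stage in curriculum:
        print(f"Stage {stage['stage']}: training on {stage['modalities']}")
        # Each stage initializes from the previous stage's weights and only
        # adds the new modality's data, instead of training on everything at once.
        model = train_one_stage(model, datasets=stage["datasets"])
    return model
```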

Moreover, the paper explores downstream task adaptation via weight interpolation, merging the expertise of models fine-tuned on different multimodal tasks by averaging their parameters. Because the finetuned models share a single unified architecture and pretrained initialization, their weights can be interpolated directly, which proves particularly useful for out-of-distribution generalization; a sketch of this merging operation appears below.
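
As a minimal sketch (assuming two checkpoints with identical architectures, finetuned from the same pretrained weights), uniform linear interpolation of parameters looks like the following; the interpolation coefficient and the checkpoint names in the commented usage are placeholders.

```python
import torch

def interpolate_weights(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible checkpoints: (1 - alpha) * A + alpha * B.
    Both models must share the same architecture and a common pretrained
    initialization for the merge to be meaningful."""
    merged = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        if param_a.dtype.is_floating_point:
            merged[name] = (1.0 - alpha) * param_a + alpha * param_b
        else:
            merged[name] = param_a  # keep integer buffers (e.g. position ids) from A
    return merged

# Hypothetical usage: merge a captioning-finetuned and a VQA-finetuned checkpoint.
# caption_ckpt = torch.load("unival_caption.pt")
# vqa_ckpt = torch.load("unival_vqa.pt")
# model.load_state_dict(interpolate_weights(caption_ckpt, vqa_ckpt, alpha=0.5))
```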

Implications and Future Directions

The implications of this research extend both practically and theoretically. Practically, it informs the development of generalist agents capable of performing diverse multimodal tasks—an encouraging step towards more sophisticated and versatile AI applications. Theoretically, it sparks discourse on the trade-offs between model size, efficiency, and performance, emphasizing the viability of streamlined models for robust task execution.

Future research directions suggested by the paper include scaling the model while extending the unification strategy to more complex data interactions and tasks. Reducing hallucinations and improving the handling of complex instructions also remain important challenges. Exploring additional training and curriculum strategies may further improve generalization, especially to new or unseen modalities.

In sum, this paper advances the quest for a unified multimodal model, offering insights into efficient training, effective integration of diverse data types, and flexible task adaptation that are relevant to both academic inquiry and real-world application development.

References (163)
  1. nocaps: novel object captioning at scale. In ICCV, 2019.
  2. Git re-basin: Merging models modulo permutation symmetries. In ICLR, 2022.
  3. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  4. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1728–1738, 2021.
  5. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Winter Conference on Applications of Computer Vision, 2022.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
  8. Rich Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
  9. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3558–3568, 2021.
  10. Audio captioning based on transformer and pre-trained cnn. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), pp.  21–25, Tokyo, Japan, November 2020a.
  11. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020b.
  12. A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems, 35:31333–31346, 2022a.
  13. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.
  14. Uniter: Universal image-text representation learning. In European conference on computer vision, pp.  104–120. Springer, 2020c.
  15. Vindlu: A recipe for effective video-and-language pretraining. arXiv preprint arXiv:2212.05051, 2022.
  16. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, pp. 1931–1942. PMLR, 2021.
  17. Fusing finetuned models for better pretraining. arXiv preprint arXiv:2204.03044, 2022.
  18. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  19. Seasoning model soups for robustness to adversarial and natural distribution shifts. In CVPR, 2023.
  20. Elastic weight removal for faithful and abstractive dialogue generation. arXiv preprint arXiv:2303.17574, 2023.
  21. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. arXiv preprint arXiv:2210.07688, 2022.
  22. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023a.
  23. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  2136–2148, Dubrovnik, Croatia, May 2023b. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.156.
  24. Improving selective visual question answering by learning from your peers. In CVPR, 2023.
  25. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023.
  26. Cogview: Mastering text-to-image generation via transformers. NeurIPS, 2021.
  27. ColD fusion: Collaborative descent for distributed multitask finetuning. In ACL, 2023.
  28. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  29. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387, 2021.
  30. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  31. Clotho: An audio captioning dataset. In ICASSP, 2020.
  32. Magma–multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021.
  33. The role of permutation invariance in linear mode connectivity of neural networks. In ICLR, 2022.
  34. Audio captioning based on combined audio and semantic embeddings. In 2020 IEEE International Symposium on Multimedia (ISM), pp. 41–48, 2020. doi: 10.1109/ISM.2020.00014.
  35. Linear mode connectivity and the lottery ticket hypothesis. In ICML, 2020.
  36. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
  37. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020.
  38. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
  39. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6904–6913, 2017.
  40. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018.
  41. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, 2018a.
  42. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.  6546–6555, 2018b.
  43. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  16000–16009, 2022.
  44. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.
  45. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  46. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  47. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  17980–17989, 2022.
  48. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
  49. Unifying multimodal transformer for bi-directional image and text generation. In ICM, 2021.
  50. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  6700–6709, 2019.
  51. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. arXiv preprint arXiv:2005.08271, 2020a.
  52. Multi-modal dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp.  958–959, 2020b.
  53. Patching open-vocabulary models by interpolating weights. In NeurIPS, 2022.
  54. Editing models with task arithmetic. In ICLR, 2023.
  55. Averaging weights leads to wider optima and better generalization. In UAI, 2018.
  56. Exploring the benefits of training expert language models over instruction tuning. arXiv preprint arXiv:2302.03202, 2023.
  57. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916. PMLR, 2021.
  58. REPAIR: Renormalizing permuted activations for interpolation repair. In ICLR, 2023.
  59. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1780–1790, 2021.
  60. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  61. Audiocaps: Generating captions for audios in the wild. In NAACL-HLT, 2019a.
  62. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  119–132, Minneapolis, Minnesota, June 2019b. Association for Computational Linguistics. doi: 10.18653/v1/N19-1011. URL https://aclanthology.org/N19-1011.
  63. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pp. 5583–5594. PMLR, 2021.
  64. Automated audio captioning using transfer learning and reconstruction latent space similarity regularization. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  7722–7726. IEEE, 2022.
  65. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023.
  66. A transformer-based audio captioning model with keyword estimation. Proc. Interspeech 2020, pp.  1977–1981, 2020.
  67. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. ACM, 2020.
  68. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp.  706–715, 2017a.
  69. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017b.
  70. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.
  71. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  7331–7341, 2021.
  72. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  3045–3059, 2021.
  73. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, 2020.
  74. Align and prompt: Video-and-language pre-training with entity prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  4953–4963, 2022a.
  75. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021a.
  76. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022b.
  77. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  78. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022c.
  79. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  80. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409, 2020a.
  81. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp.  121–137. Springer, 2020b.
  82. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021b.
  83. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  7492–7500, 2018.
  84. Modular and parameter-efficient multimodal fusion with prompting. In Findings of the Association for Computational Linguistics: ACL 2022, pp.  2976–2985, 2022.
  85. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  17949–17958, 2022.
  86. Microsoft coco: Common objects in context. In European conference on computer vision, pp.  740–755. Springer, 2014.
  87. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
  88. Leveraging pre-trained bert for audio captioning. In 2022 30th European Signal Processing Conference (EUSIPCO), pp.  1145–1149. IEEE, 2022.
  89. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852, 2023b.
  90. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
  91. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022a.
  92. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022b.
  93. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
  94. Mapl: Parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. arXiv preprint arXiv:2210.07179, 2022.
  95. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
  96. Merging models with Fisher-weighted averaging. In NeurIPS, 2022.
  97. Audio captioning transformer. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), 2021.
  98. Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162, 2022.
  99. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, 2019.
  100. What is being transferred in transfer learning? NeurIPS, 2020.
  101. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
  102. Im2text: Describing images using 1 million captioned photographs. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pp.  1143–1151, Red Hook, NY, USA, 2011. Curran Associates Inc. ISBN 9781618395993.
  103. Task arithmetic in the tangent space: Improved editing of pre-trained models. arXiv preprint arXiv:2305.12827, 2023.
  104. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
  105. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  106. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  107. Watch, listen and tell: Multi-modal weakly supervised dense event captioning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  8908–8917, 2019.
  108. Diverse weight averaging for out-of-distribution generalization. In NeurIPS, 2022.
  109. Model Ratatouille: Recycling diverse models for out-of-distribution generalization. In ICML, 2023a.
  110. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. arXiv preprint arXiv:2306.04488, 2023b.
  111. Zero-shot text-to-image generation. In ICML, 2021.
  112. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  113. Object hallucination in image captioning. In EMNLP, 2018.
  114. Improved techniques for training gans. NeurIPS, 2016.
  115. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  116. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  17959–17968, 2022.
  117. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  118. Efficient vision-language pretraining with visual concepts and hierarchical alignment. In 33rd British Machine Vision Conference (BMVC), 2022.
  119. ep-alm: Efficient perceptual augmentation of language models. arXiv preprint arXiv:2303.11403, 2023a.
  120. Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning. arXiv preprint arXiv:2310.00647, 2023b.
  121. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  15638–15650, 2022.
  122. Curriculum learning for data-efficient vision-language alignment. arXiv preprint arXiv:2207.14525, 2022.
  123. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
  124. An empirical study of multimodal model merging. arXiv preprint arXiv:2304.14933, 2023.
  125. Effects of word-frequency based pre-and post-processings for audio captioning. arXiv preprint arXiv:2009.11436, 2020.
  126. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, pp.  4858–4862, 2021.
  127. Unifying language learning paradigms. arXiv preprint arXiv:2205.05131, 2022.
  128. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021.
  129. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  130. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  131. GIT: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022a. ISSN 2835-8856. URL https://openreview.net/forum?id=b4tMhpN0JC.
  132. Bidirectional attentive fusion with context gating for dense video captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  7190–7198, 2018a.
  133. All in one: Exploring unified video-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  6598–6608, 2023a.
  134. Omnivl: One foundation model for image-language and video-language tasks. arXiv preprint arXiv:2209.07526, 2022b.
  135. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023b.
  136. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022c.
  137. What language model architecture and pretraining objective works best for zero-shot generalization? In ICML, 2022d.
  138. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022e.
  139. Watch, listen, and describe: Globally and locally aligned cross-modal attentions for video captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp.  795–801, New Orleans, Louisiana, June 2018b. Association for Computational Linguistics. doi: 10.18653/v1/N18-2125. URL https://aclanthology.org/N18-2125.
  140. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.
  141. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.
  142. Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV, 2022.
  143. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
  144. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pp.  1645–1653, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450349062. doi: 10.1145/3123266.3123427. URL https://doi.org/10.1145/3123266.3123427.
  145. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  146. A crnn-gru based reinforcement learning approach to audio captioning. 2020.
  147. Investigating local and global information for automated audio captioning with transfer learning. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  905–909. IEEE, 2021.
  148. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773, 2022.
  149. Resolving interference when merging models. arXiv preprint arXiv:2306.01708, 2023.
  150. Just ask: Learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1686–1697, 2021a.
  151. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS 2022-36th Conference on Neural Information Processing Systems, 2022.
  152. Crossing the format boundary of text and boxes: Towards unified vision-language modeling. arXiv preprint arXiv:2111.12085, 2021b.
  153. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  154. Modeling context in referring expressions. In European Conference on Computer Vision, pp.  69–85. Springer, 2016.
  155. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  156. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. PMLR, 2021.
  157. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
  158. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  16375–16387, 2022.
  159. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5579–5588, 2021.
  160. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  161. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
  162. Well-classified examples are underestimated in classification with deep neural networks. In AAAI, 2022.
  163. Generalized decoding for pixel, image, and language. In CVPR, 2023.
Authors (4)
  1. Mustafa Shukor (27 papers)
  2. Corentin Dancette (14 papers)
  3. Matthieu Cord (129 papers)
  4. Alexandre Rame (8 papers)
Citations (32)