
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training (2401.00849v1)

Published 1 Jan 2024 in cs.CV

Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of LLMs, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce the contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (COSMO), strategically partitioning the LLM into dedicated unimodal text processing and adept multimodal data handling components. COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how Howto-Interlink7M enhances model performance in image-text tasks. With 34% learnable parameters and utilizing 72% of the available data, our model demonstrates significant superiority over OpenFlamingo. For instance, in the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65%. The contributions of COSMO and Howto-Interlink7M are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.

Introduction

AI research has been progressively advancing towards models that can understand and process not only text but also multimodal information—data that combines text with other forms such as images or videos. This development has come with its own set of challenges, especially in aligning and processing such multimodal data effectively.

COSMO Framework

Addressing these challenges, the paper introduces COSMO, a COntrastive-Streamlined MultimOdal model that integrates a contrastive loss into an autoregressive text generation model. COSMO stands out by dividing the LLM into two segments: one focused on processing text alone and the other on fusing multimodal information. This split enables COSMO to handle both unimodal and multimodal tasks efficiently, with a marked reduction in learnable parameters and performance improvements across 14 downstream datasets spanning image, text, and video data.
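To make the design concrete, here is a minimal PyTorch sketch of the general idea, not the authors' implementation: the lower layers of the model process text alone, the upper layers also see projected visual features, and training combines a CLIP-style contrastive loss with a language-modeling loss. The class names, the simple concatenation-based fusion, the mean pooling, and the loss weighting are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' implementation): the lower
# half of a Transformer stack handles text on its own, the upper half also
# sees projected visual features, and training combines a CLIP-style
# contrastive loss with a language-modeling loss. Causal masking and token
# shifting are omitted to keep the sketch short.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosmoStyleModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=8, n_heads=8, vis_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layers = [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                  for _ in range(n_layers)]
        half = n_layers // 2
        self.text_layers = nn.ModuleList(layers[:half])    # unimodal text processing
        self.fusion_layers = nn.ModuleList(layers[half:])  # multimodal fusion
        self.vis_proj = nn.Linear(vis_dim, d_model)        # project frozen vision features
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style

    def forward(self, tokens, vis_feats):
        # tokens: (B, T) token ids; vis_feats: (B, N, vis_dim) visual tokens.
        x = self.embed(tokens)
        for layer in self.text_layers:                     # text-only half
            x = layer(x)
        txt_emb = F.normalize(x.mean(dim=1), dim=-1)       # pooled text embedding
        v = self.vis_proj(vis_feats)
        img_emb = F.normalize(v.mean(dim=1), dim=-1)       # pooled visual embedding
        h = torch.cat([v, x], dim=1)                       # naive fusion: prepend visual tokens
        for layer in self.fusion_layers:                   # multimodal half
            h = layer(h)
        logits = self.lm_head(h[:, v.size(1):])            # predictions for the text positions
        return logits, txt_emb, img_emb, self.logit_scale.exp()

def cosmo_style_loss(logits, targets, txt_emb, img_emb, scale, alpha=1.0):
    # Language-modeling loss over the text positions.
    lm = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # Symmetric InfoNCE contrastive loss between pooled visual/text embeddings.
    sim = scale * img_emb @ txt_emb.t()
    labels = torch.arange(sim.size(0), device=sim.device)
    con = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
    return lm + alpha * con  # alpha is an assumed weighting, not the paper's value
```

In this sketch the contrastive embeddings come from the text-only half while only the upper layers see visual tokens, which is one simple way to realize the unimodal/multimodal split described above; the paper's actual fusion mechanism and pooling may differ.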

Howto-Interlink7M Dataset

A key hurdle for training such models is the scarcity of high-quality long-text multimodal datasets. The paper addresses this by introducing Howto-Interlink7M, a novel interleaved video-text dataset. Built by annotating segments of instructional videos, it stands out for detailed, high-quality captions that preserve narrative coherence across video clips. Training on this dataset further improves COSMO's performance on a range of image-text and video-text tasks.
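To make the interleaved format concrete, below is a hedged sketch of how a Howto-Interlink7M-style sample could be represented and flattened into a training sequence. The field names, the <visual> placeholder token, and the example content are assumptions for illustration, not the released schema.

```python
# Illustrative sketch of an interleaved video-text sample; the field names,
# the <visual> placeholder, and the example content are assumptions, not the
# released Howto-Interlink7M schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    start: float   # clip start time (seconds) within the source video
    end: float     # clip end time (seconds)
    caption: str   # detailed caption written to preserve the video's narrative

@dataclass
class InterleavedVideoDoc:
    video_id: str
    clips: List[Clip]  # kept in temporal order so narrative coherence is preserved

def to_interleaved_text(doc: InterleavedVideoDoc, visual_token: str = "<visual>") -> str:
    """Flatten a video document into an interleaved training sequence:
    one visual placeholder per clip followed by its caption, in order."""
    pieces = []
    for clip in doc.clips:
        pieces.append(visual_token)          # stands in for the clip's visual features
        pieces.append(clip.caption.strip())
    return " ".join(pieces)

# Example usage with made-up content:
doc = InterleavedVideoDoc(
    video_id="demo_0001",
    clips=[
        Clip(0.0, 12.5, "A person gathers flour, eggs, and sugar on the counter."),
        Clip(12.5, 30.0, "They whisk the ingredients into a smooth batter."),
    ],
)
print(to_interleaved_text(doc))
```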

Performance and Evaluation

Compared with OpenFlamingo, a similar autoregressive vision-language model, COSMO shows a pronounced improvement in performance despite using fewer learnable parameters and a smaller sample of the training data. The advantage is particularly noticeable on challenging tasks such as 4-shot Flickr captioning, where COSMO outperforms OpenFlamingo by a significant margin.
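For context, few-shot comparisons of this kind are typically run by building an interleaved prompt from a handful of demonstration pairs and letting the model continue for the query image. The sketch below shows one way to assemble such a 4-shot captioning prompt; the template, the <image> placeholder, and the example captions are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
# Hedged sketch of assembling a 4-shot captioning prompt in the interleaved
# style used by Flamingo-like models; the template and the <image> placeholder
# are illustrative assumptions, not the paper's exact evaluation setup.
from typing import List, Tuple

def build_few_shot_prompt(demos: List[Tuple[str, str]], image_token: str = "<image>") -> str:
    """demos: (image_id, reference_caption) pairs used as in-context examples.
    The image ids stand in for visual features fed to the model alongside the
    text; only the text side of the prompt is built here."""
    parts = [f"{image_token} Caption: {caption.strip()}" for _img_id, caption in demos]
    parts.append(f"{image_token} Caption:")  # the model continues here for the query image
    return "\n".join(parts)

# Example with four made-up demonstrations:
demos = [
    ("img_1", "A dog leaps to catch a frisbee in a park."),
    ("img_2", "Two children build a sandcastle on the beach."),
    ("img_3", "A cyclist rides past a row of market stalls."),
    ("img_4", "A man in a red jacket climbs a snowy ridge."),
]
print(build_few_shot_prompt(demos))
```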

Conclusion

The integration of contrastive loss into multimodal learning frameworks, along with the development of high-quality, long-text datasets, represents a promising direction for AI research. The advancements made by COSMO and the Howto-Interlink7M dataset not only set new standards for multimodal tasks but also open up extensive opportunities for future research, especially in the field of long-text data applications. The release of the trained models and datasets is eagerly anticipated, with the potential to catalyze further research in the field.

References (63)
  1. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
  2. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  3. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  4. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  5. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Scaling transformer to 1m tokens and beyond with rmt. arXiv preprint arXiv:2304.11062, 2023.
  8. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  9. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190–200, 2011.
  10. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
  11. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  12. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  13. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  14. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  15. Magma–multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021.
  16. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  17. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  18. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  19. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  20. OpenCLIP, 2021.
  21. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022.
  22. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  23. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
  24. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  25. Obelics: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  26. Tvr: A large-scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 447–463. Springer, 2020.
  27. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
  28. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  29. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020.
  30. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016.
  31. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  32. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  33. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  34. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.
  35. OpenAI. Gpt-4 technical report. 2023.
  36. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011.
  37. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  38. Category-specific video summarization. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 540–555. Springer, 2014.
  39. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  40. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  41. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3202–3212, 2015.
  42. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  43. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  44. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  45. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  46. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  47. Together.xyz. Releasing 3b and 7b redpajama incite family of models including base, instruction-tuned and chat models. https://www.together.xyz/blog/redpajama-models-v1, 2023.
  48. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
  49. Cédric Villani. Topics in optimal transportation. American Mathematical Soc., 2021.
  50. Too large; data reduction for vision-language pre-training. ICCV, 2023a.
  51. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
  52. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
  53. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023b.
  54. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022a.
  55. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022b.
  56. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022c.
  57. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  58. Vidchapters-7m: Video chapters at scale. arXiv preprint arXiv:2309.13952, 2023.
  59. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  60. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634–23651, 2021.
  61. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  62. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  63. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
Authors (8)
  1. Alex Jinpeng Wang (20 papers)
  2. Linjie Li (89 papers)
  3. Kevin Qinghong Lin (28 papers)
  4. Jianfeng Wang (149 papers)
  5. Kevin Lin (98 papers)
  6. Zhengyuan Yang (86 papers)
  7. Lijuan Wang (133 papers)
  8. Mike Zheng Shou (165 papers)