Multimodal Transformer With a Low-Computational-Cost Guarantee (2402.15096v1)

Published 23 Feb 2024 in cs.LG, cs.CV, and cs.MM

Abstract: Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers suffer from the quadratic complexity of multi-head attention in the input sequence length, which becomes especially costly as the number of modalities increases. To address this, we introduce the Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning a different multimodal attention pattern to each attention head, LoCoMT can flexibly control multimodal signals and theoretically guarantees a lower computational cost than existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL, demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.
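
Illustrative sketch. The core idea described in the abstract (assigning a different attention pattern to each head, so that some heads mix modalities while others stay within a single modality) can be made concrete with a small example. The sketch below is not the paper's implementation: the pattern names ("intra", "full"), the helper functions, and the use of masking are assumptions made here for illustration only. In particular, masking a full score matrix does not by itself save FLOPs; an efficient implementation would compute only the allowed attention blocks per head.

```python
# Hypothetical sketch of per-head multimodal attention patterns (PyTorch).
# Not the LoCoMT implementation; pattern names and helpers are invented here.
import torch
import torch.nn.functional as F

def build_pattern_mask(pattern, len_a, len_b):
    """Boolean (L, L) mask over two concatenated modalities; True = attention allowed."""
    L = len_a + len_b
    if pattern == "full":                        # every token attends to every token
        return torch.ones(L, L, dtype=torch.bool)
    if pattern == "intra":                       # block-diagonal: within-modality only
        mask = torch.zeros(L, L, dtype=torch.bool)
        mask[:len_a, :len_a] = True              # modality A attends to A
        mask[len_a:, len_a:] = True              # modality B attends to B
        return mask
    raise ValueError(f"unknown pattern: {pattern}")

def per_head_masked_attention(q, k, v, head_patterns, len_a, len_b):
    """q, k, v: (batch, heads, L, d_head); head_patterns: one pattern string per head."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (B, H, L, L)
    for h, pattern in enumerate(head_patterns):
        mask = build_pattern_mask(pattern, len_a, len_b).to(scores.device)
        scores[:, h] = scores[:, h].masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                         # (B, H, L, d_head)

# Toy usage: 2 audio tokens + 3 video tokens, 4 heads, two heads restricted to intra-modality.
B, H, La, Lb, d = 1, 4, 2, 3, 8
q = k = v = torch.randn(B, H, La + Lb, d)
out = per_head_masked_attention(q, k, v, ["intra", "intra", "full", "full"], La, Lb)
print(out.shape)  # torch.Size([1, 4, 5, 8])
```

Under this reading, heads restricted to intra-modality blocks would need to score far fewer query-key pairs than full heads, which is where a reduced overall cost could come from; the exact pattern set and cost guarantee are specified in the paper itself.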
