High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning (2203.01311v4)
Abstract: Many real-world problems are inherently multimodal, from the spoken language, gestures, and paralinguistics humans use to communicate, to the force, proprioception, and visual sensors on robots. While there has been an explosion of interest in multimodal learning, these methods have focused on a small set of modalities, primarily language, vision, and audio. To accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. Since adding a new model for every new modality becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? This paper proposes two new information-theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar two modalities {X1,X2} are by measuring how much information can be transferred from X1 to X2, while (2) interaction heterogeneity studies how similarly two pairs of modalities {X1,X2}, {X3,X4} interact by measuring how much information can be transferred from fusing {X1,X2} to {X3,X4}. We show the importance of these two metrics as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas. Not only does HighMMT outperform prior methods on the performance-efficiency tradeoff, but it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning.
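To make the transfer-based view of heterogeneity concrete, the following is a minimal sketch (not the authors' released code) of how one might estimate modality heterogeneity between X1 and X2 as a transfer gap: pretrain a small classifier on X1, fine-tune it on X2, and compare against training on X2 from scratch. The `make_model` factory, the data loaders, and the use of training loss as the performance proxy are assumptions for illustration only; the actual metrics in the paper are defined information-theoretically over transfer performance.

```python
# Hedged sketch of a transfer-gap proxy for modality heterogeneity.
# Assumes both modalities have been mapped into a shared input format
# (e.g. tokenized sequences) so one architecture can consume either.
import torch
import torch.nn as nn

def fit(model, loader, epochs=3, lr=1e-3):
    """Minimal supervised training loop; returns the last epoch's average loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    avg = 0.0
    for _ in range(epochs):
        total, n = 0.0, 0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item()
            n += 1
        avg = total / max(n, 1)
    return avg

def modality_heterogeneity(make_model, loader_x1, loader_x2):
    """Proxy for d(X1 -> X2): transfer loss on X2 minus from-scratch loss on X2.

    `make_model()` is a hypothetical factory returning a fresh classifier.
    A larger (more positive) value means less information transfers from X1
    to X2, i.e. the two modalities are more heterogeneous.
    """
    # Reference: train on X2 from scratch.
    scratch_loss = fit(make_model(), loader_x2)
    # Transfer: pretrain on X1, then fine-tune the same weights on X2.
    model = make_model()
    fit(model, loader_x1)
    transfer_loss = fit(model, loader_x2)
    return transfer_loss - scratch_loss
```

Under this reading, interaction heterogeneity would be estimated analogously, with the pretraining and fine-tuning stages operating on fused pairs {X1,X2} and {X3,X4} rather than single modalities; modality pairs with low heterogeneity are then candidates for parameter sharing.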
- Paul Pu Liang
- Yiwei Lyu
- Xiang Fan
- Jeffrey Tsaw
- Yudong Liu
- Shentong Mo
- Dani Yogatama
- Louis-Philippe Morency
- Ruslan Salakhutdinov