Interpretable Tensor Fusion
Abstract: Conventional machine learning methods are predominantly designed to predict outcomes based on a single data type. However, practical applications may encompass data of diverse types, such as text, images, and audio. We introduce interpretable tensor fusion (InTense), a multimodal learning method for training neural networks to simultaneously learn multimodal data representations and their interpretable fusion. InTense can separately capture both linear combinations and multiplicative interactions of diverse data types, thereby disentangling higher-order interactions from the individual effects of each modality. InTense provides interpretability out of the box by assigning relevance scores to modalities and their associations. The approach is theoretically grounded and yields meaningful relevance scores on multiple synthetic and real-world datasets. Experiments on six real-world datasets show that InTense outperforms existing state-of-the-art multimodal interpretable approaches in terms of accuracy and interpretability.
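To make the distinction between additive and multiplicative fusion concrete, the following is a minimal sketch and not the InTense implementation. It assumes two modality embeddings that are already computed, builds the multiplicative term as a flattened outer product, and fits a plain least-squares linear head instead of a neural network. The scoring rule used here (the L2 norm of each block's weights, normalized to sum to one) is an illustrative assumption; the function names `fuse`, `fit_linear_head`, and `relevance_scores` are hypothetical.

```python
import numpy as np


def fuse(z_a, z_b):
    """Build unimodal blocks plus one multiplicative interaction block.

    z_a has shape (n, d_a) and z_b has shape (n, d_b). The interaction
    block is the per-sample outer product flattened to (n, d_a * d_b).
    """
    inter = np.einsum("ni,nj->nij", z_a, z_b).reshape(z_a.shape[0], -1)
    return {"a": z_a, "b": z_b, "a*b": inter}


def fit_linear_head(blocks, y):
    """Fit one least-squares weight vector and split it back per block."""
    names = list(blocks)
    features = np.concatenate([blocks[name] for name in names], axis=1)
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    weights, start = {}, 0
    for name in names:
        width = blocks[name].shape[1]
        weights[name] = w[start:start + width]
        start += width
    return weights


def relevance_scores(block_weights):
    """Assumed scoring rule: normalized L2 norm of each block's weights."""
    norms = {name: float(np.linalg.norm(w)) for name, w in block_weights.items()}
    total = sum(norms.values())
    return {name: norm / total for name, norm in norms.items()}


# Synthetic data whose target depends only on a cross-modal product.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(200, 2))
z_b = rng.normal(size=(200, 3))
y = z_a[:, 0] * z_b[:, 0]

blocks = fuse(z_a, z_b)
weights = fit_linear_head(blocks, y)
scores = relevance_scores(weights)
print(scores)
```

Because the synthetic target is purely a product of one feature from each modality, almost all of the score mass in this toy setup falls on the `a*b` block, which is the kind of separation between individual and interaction effects the abstract describes.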