- The paper proposes Low-rank Multimodal Fusion (LMF), which decomposes the fusion weight tensor into modality-specific low-rank factors, avoiding the exponential complexity of explicit tensor fusion.
- It scales linearly in the number of modalities and uses far fewer parameters, substantially cutting computation time and memory compared to traditional tensor-based fusion methods.
- Empirical validation on sentiment analysis, speaker trait analysis, and emotion recognition tasks demonstrates LMF’s efficiency and competitive accuracy.
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors
This paper presents a novel approach to multimodal fusion called Low-rank Multimodal Fusion (LMF). The primary challenge it addresses is the exponential growth in computational cost that arises when unimodal representations are combined into a single high-dimensional tensor via outer products. Traditional tensor-based fusion methods, while effective, quickly become computationally prohibitive as the number of modalities grows; a sketch of this blow-up follows. The authors propose a low-rank decomposition of the fusion weights to mitigate these issues, enhancing efficiency without sacrificing performance.
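To make the blow-up concrete, here is a minimal PyTorch sketch of explicit outer-product fusion in the style of Tensor Fusion Networks. The embedding sizes are hypothetical, not taken from the paper; the point is that the fused tensor's size is the product of the per-modality dimensions.

```python
import torch

# Explicit tensor fusion (TFN-style), with hypothetical dimensions.
# A constant 1 is appended to each embedding so unimodal and bimodal
# interactions are preserved alongside the trimodal ones.
z_text = torch.cat([torch.randn(64), torch.ones(1)])    # (65,)
z_video = torch.cat([torch.randn(16), torch.ones(1)])   # (17,)
z_audio = torch.cat([torch.randn(8), torch.ones(1)])    # (9,)

# The fused representation is the full outer product of the three.
fused = torch.einsum("i,j,k->ijk", z_text, z_video, z_audio)
print(fused.shape)   # torch.Size([65, 17, 9]) -> 9,945 entries
# A linear output layer over this tensor needs 9,945 * d_out weights,
# and every additional modality multiplies that count again.
```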
LMF operates by decomposing the fusion weight tensor into modality-specific low-rank factors, so that cost scales linearly with the number of modalities. Thanks to this decomposition, fusion never has to materialize the high-dimensional tensor: the output is computed directly from the input representations as an elementwise product of per-modality projections, as sketched below. This significantly reduces computational overhead during both training and inference.
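The following is a minimal PyTorch sketch of that fusion step. It mirrors the structure of the paper's decomposition, but the dimensions, rank, and randomly initialized parameters are hypothetical; in the actual model the factors are learned end-to-end.

```python
import torch

rank, d_out = 4, 32
dims = {"text": 64, "video": 16, "audio": 8}

# Modality-specific low-rank factors (learnable parameters in practice).
factors = {m: torch.randn(rank, d + 1, d_out) for m, d in dims.items()}
fusion_weights = torch.randn(rank)   # mixes the rank-many slices
fusion_bias = torch.randn(d_out)

def lmf_fuse(inputs):
    """Fuse unimodal embeddings without materializing the outer-product tensor.

    inputs: dict mapping modality name -> (batch, d_m) tensor.
    """
    fused = None
    for m, z in inputs.items():
        # Append a constant 1 so unimodal and bimodal interactions survive.
        z1 = torch.cat([z, torch.ones(z.size(0), 1)], dim=1)   # (B, d_m + 1)
        proj = torch.einsum("bd,rdo->rbo", z1, factors[m])     # (rank, B, d_out)
        fused = proj if fused is None else fused * proj        # elementwise product
    # Collapse the rank dimension with the learned mixing weights.
    return torch.einsum("r,rbo->bo", fusion_weights, fused) + fusion_bias

batch = {m: torch.randn(2, d) for m, d in dims.items()}
print(lmf_fuse(batch).shape)   # torch.Size([2, 32])
```

Each rank slice contributes one elementwise product across modalities, and the fusion weights mix the slices; this is algebraically equivalent to contracting the full weight tensor with the outer-product tensor, without ever forming either.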
The efficacy of LMF is validated empirically across three tasks: multimodal sentiment analysis, speaker trait analysis, and emotion recognition. The model consistently achieves competitive results, often outperforming more computationally intensive tensor-based methods. On the CMU-MOSI dataset, for example, LMF achieves lower Mean Absolute Error (MAE) and higher Pearson correlation than prior state-of-the-art models, demonstrating robust performance across a variety of metrics.
Key contributions include LMF’s linear scaling, which makes it practical in settings with many modalities, and its substantially smaller parameter count compared to the Tensor Fusion Network (TFN), reducing the cost of the fusion step from exponential to linear in the number of modalities; a rough comparison follows.
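As a back-of-envelope illustration (using the same hypothetical dimensions as above, not figures from the paper), counting only the fusion-layer weights:

```python
# Hypothetical unimodal embedding sizes, output size, and rank.
dims = {"text": 64, "video": 16, "audio": 8}
d_out, rank = 32, 4

# TFN: one weight per entry of the full (d_m + 1) outer-product tensor,
# per output unit -- the count is a *product* over modalities.
tfn_params = d_out
for d in dims.values():
    tfn_params *= d + 1
print(tfn_params)   # 32 * 65 * 17 * 9 = 318,240

# LMF: rank-many factor matrices per modality -- a *sum* over modalities.
lmf_params = sum(rank * (d + 1) * d_out for d in dims.values())
print(lmf_params)   # 4 * (65 + 17 + 9) * 32 = 11,648
```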
Theoretical and practical implications of LMF are significant. By achieving efficient fusion with less computational burden, this approach opens possibilities for deploying multimodal AI systems on resource-constrained platforms. The linear scalability and parameter efficiency render it adaptable to diverse and evolving multimodal datasets.
Looking forward, the same low-rank decomposition could be extended to attention mechanisms in neural networks, which would similarly benefit from reduced memory and computational demands. This research thus provides a foundational step towards more accessible and efficient multimodal AI applications.