MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences (2010.11985v2)
Abstract: Human communication is multimodal in nature: opinions and emotions are expressed through multiple modalities such as language, voice, and facial expressions. Data in this domain exhibits complex multi-relational and temporal interactions, and learning from it is a fundamentally challenging research problem. In this paper, we propose the Modal-Temporal Attention Graph (MTAG), an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data. We first introduce a procedure that converts unaligned multimodal sequence data into a graph with heterogeneous nodes and edges, capturing the rich interactions across modalities and through time. We then design a novel graph fusion operation, called MTAG fusion, together with a dynamic pruning and read-out technique, to process this modal-temporal graph efficiently and capture its various interactions. By learning to focus only on the important interactions within the graph, MTAG achieves state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks while using significantly fewer model parameters.
- Jianing Yang
- Yongxin Wang
- Ruitao Yi
- Yuying Zhu
- Azaan Rehman
- Amir Zadeh
- Soujanya Poria
- Louis-Philippe Morency
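
The graph-construction step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_modal_temporal_graph`, the per-step timestamps, the fully connected layout, and the labelling of each edge by (source modality, target modality, temporal direction) are all assumptions made here for illustration. The abstract states only that nodes and edges are heterogeneous and that edges capture interactions across modalities and through time.

```python
from itertools import product


def build_modal_temporal_graph(sequences):
    """Turn unaligned per-modality sequences into a typed graph (illustrative only).

    sequences: dict mapping a modality name to a list of (timestamp, feature)
    pairs. The lists may have different lengths, i.e. they are unaligned.
    Returns (nodes, edges); each edge is (src_index, dst_index, edge_type).
    """
    # One node per time step of each modality; the node keeps its modality label.
    nodes = []
    for modality, steps in sequences.items():
        for timestamp, feature in steps:
            nodes.append({"modality": modality, "time": timestamp, "feature": feature})

    # Assumption: connect every ordered pair of distinct nodes and leave it to a
    # later pruning step to drop unimportant edges.
    edges = []
    for src, dst in product(range(len(nodes)), repeat=2):
        if src == dst:
            continue
        dt = nodes[dst]["time"] - nodes[src]["time"]
        if dt > 0:
            direction = "forward"   # source precedes target
        elif dt < 0:
            direction = "backward"  # source follows target
        else:
            direction = "same"      # both share a timestamp
        edge_type = (nodes[src]["modality"], nodes[dst]["modality"], direction)
        edges.append((src, dst, edge_type))
    return nodes, edges


if __name__ == "__main__":
    # Two modalities sampled at different rates (toy values).
    toy = {
        "language": [(0.0, [1.0]), (1.0, [2.0])],
        "audio": [(0.0, [0.1]), (0.5, [0.2]), (1.0, [0.3])],
    }
    nodes, edges = build_modal_temporal_graph(toy)
    print(len(nodes), "nodes,", len(edges), "typed edges")
```

Typing each edge by modality pair and temporal direction is one simple way to let a downstream attention layer keep separate parameters per interaction type. A dense graph like this grows quadratically with the number of time steps, which is the kind of cost that a pruning step such as the one mentioned in the abstract would address.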