Learning Factorized Multimodal Representations

Published 16 Jun 2018 in cs.LG, cs.CL, cs.CV, and stat.ML | (1806.06176v3)

Abstract: Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

Citations (361)

Summary

  • The paper introduces the Multimodal Factorization Model (MFM) that separates shared discriminative factors from modality-specific generative factors.
  • It leverages a Bayesian network with encoder-decoder structures and Wasserstein approximations to jointly optimize generative and discriminative objectives.
  • Experimental results on datasets like MNIST+SVHN and real-world multimodal time series demonstrate superior accuracy and resilience to missing modalities compared to baseline models.

Learning Factorized Multimodal Representations

Introduction

The study of multimodal machine learning focuses on developing models capable of leveraging multiple sources of data, such as text, images, and audio. "Learning Factorized Multimodal Representations" introduces the Multimodal Factorization Model (MFM), which captures both intra-modal and cross-modal interactions while remaining resilient to noisy and missing modalities. The model partitions multimodal data into discriminative factors shared across modalities and modality-specific generative factors, trained with a joint generative-discriminative objective.

Figure 1: Illustration of the proposed Multimodal Factorization Model with three modalities.

Multimodal Factorization Model

Model Architecture

MFM assumes a Bayesian network that uses conditional independence to separate shared discriminative components from modality-specific generative components. Generative and discriminative components jointly handle inference, with encoder-decoder structures organized to maximize the distributional similarity between observed and generated data using Wasserstein distance approximations.
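
To make the factorization concrete, below is a minimal PyTorch-style sketch of the structure described above: each modality has its own encoder producing a modality-specific generative factor, all modalities jointly produce a shared discriminative factor, each modality is reconstructed from its own generative factor plus the shared factor, and the label is predicted from the shared factor alone. The `MFMSketch` name, layer sizes, and deterministic encoders are illustrative assumptions, not the paper's exact architecture (which uses approximate inference and Wasserstein-style matching).

```python
import torch
import torch.nn as nn

class MFMSketch(nn.Module):
    """Illustrative factorization: modality-specific generative factors z_a[m]
    plus one shared multimodal discriminative factor z_y (sizes are arbitrary)."""

    def __init__(self, input_dims, z_a_dim=32, z_y_dim=32, num_classes=2):
        super().__init__()
        # One encoder per modality for the modality-specific generative factor.
        self.enc_a = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, z_a_dim))
            for d in input_dims)
        # Joint encoder over all modalities for the shared discriminative factor.
        self.enc_y = nn.Sequential(
            nn.Linear(sum(input_dims), 64), nn.ReLU(), nn.Linear(64, z_y_dim))
        # One decoder per modality, conditioned on (z_a[m], z_y).
        self.dec = nn.ModuleList(
            nn.Sequential(nn.Linear(z_a_dim + z_y_dim, 64), nn.ReLU(), nn.Linear(64, d))
            for d in input_dims)
        # Label predictor reads only the shared discriminative factor.
        self.clf = nn.Linear(z_y_dim, num_classes)

    def forward(self, xs):  # xs: list of per-modality tensors, shapes [B, d_m]
        z_a = [enc(x) for enc, x in zip(self.enc_a, xs)]
        z_y = self.enc_y(torch.cat(xs, dim=-1))
        recons = [dec(torch.cat([za, z_y], dim=-1)) for dec, za in zip(self.dec, z_a)]
        return recons, self.clf(z_y), z_a, z_y
```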

Implementation Details

The model uses nonlinear neural networks within its encoder $Q(\mathbf{Z} \mid \mathbf{X}_{1:M}, \mathbf{Y})$ and deterministic functions for its decoder modules. The optimization objective balances a generative reconstruction loss against a discriminative label-prediction loss, keeping performance robust even when only some modalities are available.

Figure 2: (a) MFM generative network structure/datasets, (b) classification accuracy, (c) conditional generation for digits.
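
A hedged sketch of the joint objective described above: a weighted sum of per-modality reconstruction terms and a label-prediction term. The paper matches generated and observed data with Wasserstein-distance approximations; the squared-error terms and the `lambda_gen` weight below are simplifications for illustration.

```python
import torch.nn.functional as F

def mfm_loss(recons, logits, xs, labels, lambda_gen=0.1):
    """Discriminative cross-entropy plus generative reconstruction terms
    (a stand-in for the paper's Wasserstein-based generative objective)."""
    gen_loss = sum(F.mse_loss(r, x) for r, x in zip(recons, xs))
    disc_loss = F.cross_entropy(logits, labels)
    return disc_loss + lambda_gen * gen_loss
```

In training, `recons` and `logits` would come from a forward pass of the `MFMSketch` above, and this loss would be minimized jointly over encoder, decoder, and classifier parameters.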

Experimental Studies

Synthetic Image Dataset

The MNIST+SVHN dataset pairs MNIST and SVHN digit images that share a label but differ markedly in visual style. MFM outperformed unimodal and multimodal baselines in classification accuracy while also supporting flexible conditional generation across modalities.

Real-world Multimodal Time Series

MFM outperformed existing models on six multimodal datasets, achieving state-of-the-art results in speaker trait, sentiment, and emotion recognition. Its factorized representations allowed it to surpass contemporary frameworks such as MFN and BC-LSTM.

Figure 3: Ablation study results: Impact of omitting design components.

Handling Missing Modalities

Surrogate inference allows MFM to reconstruct missing modalities while maintaining robust predictive performance. Empirical tests showed strong resilience to incomplete data, with MFM notably outperforming purely generative or purely discriminative approaches across test sets.

Figure 4: Analysis of multimodal representations using inference methods.
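
One way to picture surrogate inference, assuming the sketch components above: a small network maps the observed modalities to estimates of the missing modality's factors, after which the trained decoder can reconstruct that modality and the classifier can still predict from the shared factor. The names and architecture below are illustrative, not the paper's exact surrogate networks.

```python
import torch
import torch.nn as nn

class SurrogateInference(nn.Module):
    """Estimate a missing modality's generative factor and the shared
    discriminative factor from the modalities that are observed."""

    def __init__(self, observed_dims, z_a_dim=32, z_y_dim=32):
        super().__init__()
        d = sum(observed_dims)
        self.to_z_a = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, z_a_dim))
        self.to_z_y = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, z_y_dim))

    def forward(self, observed_xs):  # list of tensors for the observed modalities
        h = torch.cat(observed_xs, dim=-1)
        return self.to_z_a(h), self.to_z_y(h)
```

The estimated factors could then be passed to the corresponding decoder from the earlier sketch, e.g. `x_m_hat = model.dec[m](torch.cat([z_a_hat, z_y_hat], dim=-1))`, to approximate the missing modality.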

Conclusion

MFM marks a meaningful advance in multimodal learning, particularly through its factorization that cleanly separates shared discriminative factors from modality-specific generative factors. Its gains in prediction accuracy and its ability to reconstruct missing modalities establish it as a strong framework for multimodal learning. Future work may extend MFM to unsupervised and semi-supervised settings, broadening its applicability.
