A Formal Exploration of Multimodal Learning with Severely Missing Modality
The paper "SMIL: Multimodal Learning with Severely Missing Modality" provides a systematic inquiry into the domain of multimodal learning under constraints of incomplete data availability. This research asserts its unique position by addressing the complexities introduced when significant proportions of training data are devoid of one or more modalities—a scenario hitherto underexplored in the scholarly discourse.
Core Contributions and Methodology
The paper introduces SMIL, a method that uses a Bayesian meta-learning framework to handle missing modalities in both the training and testing phases while remaining efficient. SMIL explicitly targets scenarios in which up to 90% of training instances lack a complete set of modalities, making it relevant to real-world applications where privacy constraints and data acquisition costs limit what can be collected.
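To make the "severely missing" setting concrete, the sketch below builds a toy two-modality dataset in which the text features survive for only 10% of the training samples. This is purely illustrative: the function name, feature dimensions, and masking convention are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def mask_modality(features: np.ndarray, available_ratio: float, seed: int = 0):
    """Keep this modality for roughly `available_ratio` of the samples; zero it
    out elsewhere and return a boolean mask marking where it is still present."""
    rng = np.random.default_rng(seed)
    available = rng.random(features.shape[0]) < available_ratio
    masked = features.copy()
    masked[~available] = 0.0  # placeholder meaning "modality absent"
    return masked, available

# Toy two-modality dataset: image features are always present, while the
# text features remain for only ~10% of the 1000 training samples.
image_feats = np.random.randn(1000, 512).astype(np.float32)
text_feats = np.random.randn(1000, 300).astype(np.float32)
text_feats_masked, text_available = mask_modality(text_feats, available_ratio=0.1)
print(f"text modality present for {text_available.mean():.0%} of samples")
```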
The theoretical design of SMIL is based on two pivotal components:
- Missing Modality Reconstruction: A reconstruction network generates feature representations for the absent modalities. Because the missing modalities are approximated through a Bayesian strategy, the method avoids imputation schemes that assume fully observed data.
- Uncertainty-Guided Feature Regularization: To counteract the inherent bias in the reconstructed features, SMIL applies a feature regularization mechanism driven by a Bayesian neural network. This meta-regularization injects a stochastic perturbation that enriches feature learning, in contrast to the deterministic approaches that dominate the current literature. Both components are sketched in code below.
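The following PyTorch sketch shows the general shape of these two components under simplifying assumptions. It is not the paper's architecture: SMIL meta-learns both networks within a Bayesian framework, whereas this toy version uses a fixed feed-forward reconstructor and plain Gaussian noise. All class names, layer sizes, and the 512/300 feature dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class ModalityReconstructor(nn.Module):
    """Illustrative stand-in for a reconstruction network: predicts a feature
    vector for the missing modality from the observed modality's feature."""
    def __init__(self, obs_dim: int = 512, miss_dim: int = 300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, miss_dim)
        )

    def forward(self, observed_feat: torch.Tensor) -> torch.Tensor:
        return self.net(observed_feat)

class StochasticRegularizer(nn.Module):
    """Toy uncertainty-guided regularizer: predicts a per-dimension scale and
    perturbs the reconstructed feature with Gaussian noise, so downstream
    layers never see a single deterministic imputation."""
    def __init__(self, feat_dim: int = 300):
        super().__init__()
        self.log_sigma = nn.Linear(feat_dim, feat_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        sigma = torch.exp(self.log_sigma(feat))
        return feat + sigma * torch.randn_like(feat)  # reparameterized noise injection

# Usage: an image feature stands in for the observed modality; the text
# feature is reconstructed, perturbed, and would then be fused downstream.
recon = ModalityReconstructor()
reg = StochasticRegularizer()
image_feat = torch.randn(8, 512)
pseudo_text_feat = reg(recon(image_feat))
```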
Experimental Evaluation
The empirical validation of SMIL uses three benchmarks: MM-IMDb, CMU-MOSI, and avMNIST. The results show that SMIL frequently outperforms traditional generative baselines such as autoencoders and GANs when modality availability is severely limited. For instance, with only 10% of the text modality available on CMU-MOSI, SMIL achieved higher classification accuracy and F1 scores than the baseline models.
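For readers who want to reproduce this style of evaluation, the snippet below shows how the reported metrics (classification accuracy and F1) are typically computed with scikit-learn. The labels and predictions here are synthetic placeholders, not outputs of SMIL or any baseline, and the 15% error rate is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels and predictions standing in for a missing-modality test run.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_pred = y_true.copy()
flip = rng.random(500) < 0.15          # corrupt ~15% of predictions
y_pred[flip] = 1 - y_pred[flip]

print("accuracy:", accuracy_score(y_true, y_pred))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```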
Implications and Future Directions
Practically, SMIL paves the way for robust multimodal systems that can operate in environments where some modalities are only partially available, as is typical in intelligent tutoring systems, robotics, and healthcare. Theoretically, it challenges the conventional reliance on full-modality datasets and encourages a shift toward more versatile models capable of inference from incomplete multimodal inputs.
Looking ahead, this research opens several avenues for development in AI. Future inquiries could aim to integrate more comprehensive prior knowledge into the Bayesian framework, potentially enriching the feature reconstruction process. Additionally, exploring how these methods scale with the growing complexity of multimodal datasets could provide critical insights beneficial for deploying AI systems in dynamic, real-world applications.
In summary, the paper on SMIL broadens the scope of multimodal learning research by using Bayesian meta-learning not only to cope with missing modalities but to optimize the learning process under them. It marks a substantial step toward resilient AI systems that adapt to incomplete data.