Multimodal Generative Models for Scalable Weakly-Supervised Learning (1802.05335v3)

Published 14 Feb 2018 in cs.LG and stat.ML

Abstract: Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider two case studies, one of learning image transformations---edge detection, colorization, segmentation---as a set of modalities, followed by one of machine translation between two languages. We find appealing results across this range of tasks.

Multimodal Generative Models for Scalable Weakly-Supervised Learning: A Review

This paper introduces a novel approach to multimodal learning with weak supervision using the Multimodal Variational Autoencoder (MVAE). It addresses the challenges of learning joint representations from multimodal data, which is often incomplete or weakly labeled. The proposed MVAE architecture efficiently handles missing data by sharing parameters across different combinations of existing modalities, utilizing a product-of-experts (PoE) inference network, and employing a sub-sampled training paradigm to optimize the evidence lower bound (ELBO).
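
For reference, a weighted multimodal ELBO consistent with this description is sketched below; the per-modality weights \lambda_i and the KL weight \beta are common choices in this family of models, and their exact role in the paper's objective is an assumption here.

\[
\mathrm{ELBO}(x_1,\dots,x_N) = \mathbb{E}_{q_\phi(z \mid x_1,\dots,x_N)}\left[\sum_{i=1}^{N} \lambda_i \log p_\theta(x_i \mid z)\right] - \beta\,\mathrm{KL}\left[q_\phi(z \mid x_1,\dots,x_N)\,\|\,p(z)\right]
\]

where the x_i are the observed modalities, z is the shared latent variable, and q_\phi is the product-of-experts inference network described below.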

Methodology

The MVAE model extends the variational autoencoder (VAE) framework to handle multiple modalities by assuming conditional independence among modalities given a latent variable. This assumption allows the use of an efficient product-of-experts approach to combine variational distributions, reducing the number of inference networks and parameters significantly compared to previous multimodal models.
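
Because each expert is typically a diagonal Gaussian, the product of experts has a closed form: the joint precision is the sum of the expert precisions, and the joint mean is the precision-weighted average of the expert means. The sketch below illustrates this combination step in NumPy; the function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def product_of_gaussian_experts(mus, logvars):
    """Combine diagonal Gaussian experts N(mu_i, sigma_i^2) into one Gaussian.

    The product of Gaussian densities is itself Gaussian: its precision is
    the sum of the expert precisions, and its mean is the precision-weighted
    average of the expert means.

    mus, logvars: arrays of shape (num_experts, latent_dim)
    """
    precisions = np.exp(-np.asarray(logvars))            # 1 / sigma_i^2
    joint_precision = precisions.sum(axis=0)
    joint_var = 1.0 / joint_precision
    joint_mu = (np.asarray(mus) * precisions).sum(axis=0) * joint_var
    return joint_mu, np.log(joint_var)

# Example: a prior expert N(0, I) combined with two modality experts
# (hypothetical values, latent_dim = 4).
latent_dim = 4
prior_mu, prior_logvar = np.zeros(latent_dim), np.zeros(latent_dim)
image_mu, image_logvar = np.full(latent_dim, 0.5), np.full(latent_dim, -1.0)
label_mu, label_logvar = np.full(latent_dim, -0.2), np.full(latent_dim, 0.3)

mu, logvar = product_of_gaussian_experts(
    [prior_mu, image_mu, label_mu],
    [prior_logvar, image_logvar, label_logvar],
)
```

Including the prior as an additional expert means the same combination rule covers any subset of observed modalities: a missing modality simply contributes no expert to the product.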

The paper considers two formulations: a quotient-of-experts variant, denoted MVAE-Q, in which the joint posterior is written with quotient terms, and the preferred product-of-experts formulation, which is more numerically stable because it approximates each quotient with a learned per-modality expert rather than computing the quotient directly. Inference-network parameters are shared across all combinations of available modalities, so the model handles missing data without training a separate network for every modality subset.
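
The relationship between the two formulations can be sketched as follows (the notation is mine, but the identity follows directly from Bayes' rule and the conditional-independence assumption):

\[
p(z \mid x_1,\dots,x_N) \;\propto\; p(z)\prod_{i=1}^{N} p(x_i \mid z)
\;=\; p(z)\prod_{i=1}^{N} \frac{p(z \mid x_i)\,p(x_i)}{p(z)}
\;\propto\; p(z)\prod_{i=1}^{N} \frac{p(z \mid x_i)}{p(z)}
\]

The right-hand side is a quotient of experts. Approximating each quotient p(z \mid x_i)/p(z) with a learned expert \tilde{q}(z \mid x_i) yields the product-of-experts posterior q(z \mid x_1,\dots,x_N) \propto p(z)\prod_i \tilde{q}(z \mid x_i), in which the prior acts as an extra expert and no unstable quotient terms appear.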

A key innovation of the MVAE is the sub-sampled training paradigm. This involves leveraging fully-observed examples as partially-observed ones by randomly sub-sampling them, thereby simulating realistic weak supervision conditions. This paradigm supports robust learning even when complete supervision is scarce, as often encountered in real-world multimodal datasets.
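
A minimal sketch of this paradigm is given below: for each fully-observed example, ELBO terms are evaluated on the complete set of modalities, on each modality alone, and on k randomly drawn subsets, and the terms are summed into one training objective. The modality names and the helper itself are hypothetical; the sketch only shows which subsets contribute terms.

```python
import random

def elbo_subsets(modalities, k=1):
    """Enumerate the modality subsets whose ELBO terms are summed for one
    fully-observed training example under the sub-sampled paradigm.

    modalities: list of modality names observed for this example
    k: number of additional random non-empty subsets to draw (assumed)
    """
    subsets = [tuple(modalities)]                 # term for the full joint observation
    subsets += [(m,) for m in modalities]         # one term per single modality
    for _ in range(k):                            # k extra randomly sub-sampled terms
        size = random.randint(1, len(modalities))
        subsets.append(tuple(random.sample(modalities, size)))
    return subsets

# Example with hypothetical modality names:
# elbo_subsets(["image", "label"], k=1)
# -> [("image", "label"), ("image",), ("label",), (<one random subset>)]
```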

Experimental Results

The paper presents experimental validation on four datasets (MNIST, FashionMNIST, MultiMNIST, and CelebA), achieving performance comparable to or better than state-of-the-art multimodal models while using substantially fewer parameters. Notably, the MVAE handles missing data robustly and learns effectively from datasets with low levels of complete supervision.

In particular, the MVAE is shown to excel in scenarios where modalities benefit from shared statistical strength, such as the CelebA dataset, where individual attributes are treated as distinct modalities. Case studies on image transformations (edge detection, colorization, and segmentation) treated as modalities, and on machine translation between two languages, further illustrate the general applicability of the MVAE to complex multimodal mappings.

Implications and Future Directions

The MVAE's ability to generalize across modality configurations under incomplete supervision has substantial implications for many AI fields, including computer vision and natural language processing. It enables scalable learning of joint representations without requiring prohibitively large amounts of fully annotated data, making it particularly appealing for applications in resource-constrained settings.

Looking forward, extensions of the MVAE could explore synergies with more expressive architectures, such as transformers, to further enhance representational capacity in unsupervised and semi-supervised settings. Studying the theoretical properties of the PoE posterior within other generative model families might also open new avenues for efficient multimodal learning.

In conclusion, the contribution of the MVAE in efficiently learning joint distributions from multimodal data under weak supervision represents a significant enhancement in the toolkit for AI researchers dealing with complex, multimodal datasets.

Authors (2)
  1. Mike Wu (30 papers)
  2. Noah Goodman (57 papers)
Citations (346)