- The paper proposes AMDIM, a self-supervised model that maximizes mutual information across augmented views to improve feature representation.
- It introduces multiscale feature optimization and a modified ResNet encoder with Noise-Contrastive Estimation for efficient learning.
- The approach achieves significant accuracy gains on benchmarks like ImageNet and STL10, indicating scalability and potential for unsupervised segmentation.
An Analysis of "Learning Representations by Maximizing Mutual Information Across Views"
The paper "Learning Representations by Maximizing Mutual Information Across Views" presents a novel approach to self-supervised representation learning. The authors, Philip Bachman, R Devon Hjelm, and William Buchwalter from Microsoft Research, propose maximizing mutual information (MI) between features extracted from multiple views of a shared context. The intuition is that features which remain predictive across views must capture high-level factors, such as the presence of particular objects or events, rather than view-specific noise.
Methodological Contributions
The authors' proposed model, termed Augmented Multiscale Deep InfoMax (AMDIM), extends the classic Deep InfoMax (DIM) approach in significant ways:
- Multiple Views via Augmentation:
- The key advance here is the generation of multiple views of an input through data augmentation, for example by repeatedly applying random crops and color distortions to the same image. Conceptually, this mirrors observing a scene from different locations or through different modalities such as visual, auditory, or tactile input.
- The approach optimizes a mutual information bound between features extracted from these augmented views, which compels feature vectors to capture vital high-level details.
- Multiscale Feature Optimization:
- AMDIM introduces the concept of multiscale mutual information: MI is maximized between features drawn from multiple scales of the encoder simultaneously, rather than only between a single global summary and local features as in the original DIM.
- The optimization harnesses Noise-Contrastive Estimation (NCE), leveraging large-scale negative sampling for efficient mutual information computation.
- Powerful Encoder Architecture:
- The architecture modifies the standard ResNet design to control receptive-field growth and keep feature distributions stationary across spatial positions. Notably, padding is avoided because it makes feature statistics depend on position, destabilizing the feature distributions.
- Mixture-based Representations:
- The paper also explores mixture-based features, which exhibited emergent segmentation-like behavior, allowing the model to make more granular predictions reminiscent of natural scene segmentation.
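The contrastive objective described above can be illustrated with a minimal NumPy sketch. This is a simplified, hypothetical toy version, not the authors' implementation: AMDIM scores feature pairs across multiple scales of two augmented views, whereas this sketch uses a single global feature vector per view, with matched rows as positives and all other batch items as negatives.

```python
import numpy as np

def infonce_loss(view_a, view_b, temperature=0.1):
    """NCE-style lower bound on MI between features from two augmented views.

    view_a, view_b: (batch, dim) arrays; row i of each is a feature from a
    different augmented view of the same input (the positive pair). All
    cross-pairs (i, j) with i != j act as large-scale negative samples.
    """
    # L2-normalize so pairwise scores are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (batch, batch) score matrix
    # Softmax cross-entropy with the diagonal (matched pairs) as targets.
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # lower loss => tighter MI bound
```

Minimizing this loss pushes matched-view features together and mismatched ones apart, which is what forces the encoder to keep only information shared across views.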
Numerical Results
The AMDIM model demonstrates strong empirical results across popular benchmark datasets:
- ImageNet:
- Achieved 68.1% accuracy under linear evaluation, surpassing prior results by more than 12% and concurrent work by 7%. With an MLP classifier, accuracy rose further to 69.5%.
- STL10:
- The model attained 94.2% accuracy with linear evaluation, marking a significant improvement over prior self-supervised methods.
- CIFAR10 and CIFAR100:
- On CIFAR10, AMDIM achieved 93.1% accuracy with MLP evaluation; on CIFAR100, it reached 72.8%.
- Places205:
- The model demonstrated generalization by achieving 55% accuracy on this dataset using representations learned on ImageNet.
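The linear evaluation protocol behind these numbers (train a linear classifier on frozen encoder features, then measure its test accuracy) can be sketched as follows. This is a hedged toy version: it uses a closed-form ridge-regression probe on one-hot labels for simplicity, rather than the logistic-regression probes typically trained in practice.

```python
import numpy as np

def linear_probe(train_feats, train_labels, test_feats, l2=1e-3):
    """Fit a linear map from frozen features to one-hot labels and
    classify test features by argmax over the predicted scores."""
    n_classes = int(train_labels.max()) + 1
    Y = np.eye(n_classes)[train_labels]        # one-hot targets, (n, classes)
    X = train_feats                            # frozen features,  (n, dim)
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + l2*I)^(-1) X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)
    return (test_feats @ W).argmax(axis=1)     # predicted class per test row
```

Because the encoder stays frozen, probe accuracy measures how linearly separable the self-supervised features already are, which is why it is the standard benchmark for comparing representation learners.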
Practical and Theoretical Implications
The results highlight several implications:
- Practical Utility:
- AMDIM provides a robust self-supervised learning framework that can significantly improve feature learning without labeled data. This reduction in dependency on annotated datasets potentially lowers training costs and broadens applicability.
- Scalability:
- The architecture and approach are computationally feasible, making it reproducible on standard hardware setups like NVIDIA Tesla V100 GPUs.
- Generative Capabilities:
- The emergent segmentation outcomes when using mixture-based representations suggest potential applications in unsupervised image segmentation and similar tasks.
Future Developments
Further research could explore several directions:
- Domain Generalization:
- Applying AMDIM to diverse fields such as video analysis, audio processing, and multimodal integration could reveal additional strengths and any potential limitations.
- Refinement and Regularization:
- Investigating the nuances of regularization in the NCE-based mutual information bounds could yield new insights and performance enhancements.
- Infrastructure and Scalability:
- Continued development on the model's scalability and infrastructure optimization could facilitate larger-scale deployments, enhancing its viability in commercial applications.
In conclusion, this paper provides substantial contributions to the field of self-supervised learning by introducing an innovative approach to maximizing mutual information across augmented views. The significant performance gains alongside its practical methods mark AMDIM as a noteworthy advancement in representation learning.