- The paper introduces DCCAE, a hybrid model that combines reconstruction and correlation objectives for effective multi-view learning.
- The paper demonstrates that correlation-based methods (DCCA and the proposed DCCAE) outperform autoencoder-based approaches on tasks with noisy, view-specific inputs, such as noisy MNIST clustering and phonetic recognition.
- The paper analyzes stochastic optimization of the correlation objective, showing that sufficiently large minibatches make these methods practical at scale.
An Expert Overview of "On Deep Multi-View Representation Learning: Objectives and Optimization"
The paper "On Deep Multi-View Representation Learning: Objectives and Optimization" by Wang, Arora, Livescu, and Bilmes addresses the problem of learning representations from multi-view data using deep neural networks (DNNs). Multi-view learning involves scenarios where data from several modalities or sources, known as "views," are available during the training phase, but only one view is available for downstream tasks or testing. The authors investigate multiple existing and novel methodologies to optimize and evaluate representation learning in such settings. The methodologies explored include autoencoder-based approaches, correlation-based approaches, and their proposed hybrid model—deep canonically correlated autoencoders (DCCAE).
Objectives in Multi-View Representation Learning
The primary methodologies evaluated include:
- Autoencoder-Based Approaches: The objective is to learn compact representations that can accurately reconstruct the original inputs. In the split autoencoder (SplitAE), a single encoding of one view must reconstruct both views, so the representation is pushed to capture information shared across views.
- Correlation-Based Approaches: Canonical Correlation Analysis (CCA) and its deep variant (DCCA) seek projections of the two views that are maximally correlated. Unlike reconstruction objectives, the focus here is on cross-view agreement; for jointly Gaussian views, maximizing canonical correlation is closely related to maximizing the mutual information between the projections.
- Hybrid Approaches: The proposed deep canonically correlated autoencoders (DCCAE) merge the two objectives, optimizing a weighted combination of cross-view correlation and within-view reconstruction error so that the learned features balance the properties of both approaches.
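To make the combination concrete, the DCCAE objective (written here up to notational details; f and g are the two view encoders, p and q the decoders, and λ > 0 the trade-off hyperparameter) couples a CCA term over the encodings with the two reconstruction errors:

```latex
\min_{W_f,\, W_g,\, W_p,\, W_q,\, U,\, V}\;
  -\frac{1}{N}\,\mathrm{tr}\!\left(U^\top f(X)\, g(Y)^\top V\right)
  + \frac{\lambda}{N}\sum_{i=1}^{N}
    \Big( \lVert x_i - p(f(x_i)) \rVert^2 + \lVert y_i - q(g(y_i)) \rVert^2 \Big)
```

subject to the usual CCA whitening constraints on the projected encodings, e.g. U^T((1/N) f(X) f(X)^T + r_x I)U = I and the analogue for V. A large λ pushes the model toward autoencoder-like behavior, while λ → 0 recovers DCCA.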
Methodological Comparison
The paper compares these approaches across various tasks, demonstrating distinct advantages and limitations through empirical evaluations:
- MNIST Digits Task: The paper assesses clustering and classification on a two-view noisy variant of MNIST (a sketch of the construction follows this list). CCA-based methods outperform autoencoder-based methods because the only information shared across the views is the digit identity, so correlation objectives capture class structure while discarding view-specific rotation and pixel noise. Notably, DCCA and DCCAE achieve the highest clustering and classification accuracy, demonstrating the efficacy of correlation-based objectives.
- Speech Recognition: Using the Wisconsin X-ray Microbeam (XRMB) corpus, the paper evaluates whether representations learned from paired acoustic and articulatory measurements (with only acoustics available at test time) improve phonetic recognition. DCCA and DCCAE again perform best, underlining their ability to learn representations useful for downstream tasks.
- Word Embeddings: The authors extend the evaluation to word similarity tasks using multilingually trained word embeddings. Here, DCCAE excels by balancing reconstruction and correlation, achieving strong agreement with human similarity judgments.
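For reference, here is a minimal NumPy sketch of the two-view noisy MNIST construction as the paper describes it: view 1 is a randomly rotated digit, and view 2 is a different image of the same class corrupted with uniform pixel noise. The exact constants (the ±45° rotation range, the [0, 1] noise) follow my reading of the paper's setup, and the helper name is illustrative.

```python
import numpy as np
from scipy.ndimage import rotate

def make_noisy_mnist_views(images, labels, seed=0):
    """Two-view noisy MNIST; images are (n, 28, 28) floats in [0, 1]."""
    rng = np.random.default_rng(seed)
    n = len(images)
    # View 1: each digit rotated by a random angle in [-45, 45] degrees.
    angles = rng.uniform(-45.0, 45.0, size=n)
    view1 = np.stack([rotate(img, a, reshape=False) for img, a in zip(images, angles)])
    # View 2: a randomly chosen image of the *same class*, plus independent
    # uniform pixel noise, truncated back to [0, 1]. The two views thus
    # share only the digit identity.
    view2 = np.empty_like(images)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        partners = rng.permutation(idx)
        view2[idx] = images[partners] + rng.uniform(0.0, 1.0, size=images[partners].shape)
    return np.clip(view1, 0.0, 1.0), np.clip(view2, 0.0, 1.0)
```

Because the rotation lives only in view 1 and the noise only in view 2, an objective that retains only cross-view correlated information is forced toward the class label, which is exactly the behavior the clustering results reward.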
Novelty and Contributions
The paper's key contributions include:
- Proposed Deep Canonically Correlated Autoencoders (DCCAE): DCCAE combines the reconstruction objective of autoencoders and the correlation objective of CCA, demonstrating superior performance across multiple tasks.
- Empirical Evaluations: Through comprehensive experiments, the authors establish the comparative strengths of various methodologies, highlighting scenarios where each method excels.
- Stochastic Optimization Analysis: The paper explores the practicality of stochastic approaches for DCCA, providing a theoretical analysis of the optimization error and empirically validating different minibatch sizes and optimization strategies.
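The correlation objective being optimized stochastically is the trace norm (sum of singular values) of the whitened cross-covariance of the two encodings, as in Andrew et al.'s DCCA. Below is a minimal PyTorch sketch, assuming minibatch encodings h1 and h2 of shape (n, d) with n well above d so the covariance estimates are stable; that is precisely the large-minibatch requirement the paper's analysis concerns.

```python
import torch

def neg_total_correlation(h1, h2, r=1e-4):
    """Negative sum of canonical correlations of two minibatch encodings."""
    n, d = h1.shape
    h1 = h1 - h1.mean(dim=0, keepdim=True)   # center within the minibatch
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    eye = torch.eye(d, dtype=h1.dtype, device=h1.device)
    s11 = h1.T @ h1 / (n - 1) + r * eye      # regularized covariance, view 1
    s22 = h2.T @ h2 / (n - 1) + r * eye      # regularized covariance, view 2
    s12 = h1.T @ h2 / (n - 1)                # cross-covariance

    def inv_sqrt(s):                         # inverse matrix square root via eigh
        evals, evecs = torch.linalg.eigh(s)
        return evecs @ torch.diag(evals.clamp_min(1e-12).rsqrt()) @ evecs.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # Total correlation = trace norm of T; negate so optimizers can minimize.
    return -torch.linalg.svdvals(t).sum()
```

Minimizing this with SGD or Adam over minibatches is the stochastic regime the paper studies; its empirical finding is that sufficiently large minibatches estimate the covariances well enough for stochastic training to be competitive with full-batch optimization.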
Theoretical and Practical Implications
The findings have several theoretical and practical implications:
- Autoencoder vs. Correlation Objectives: The results suggest that correlation-based objectives are often more effective than reconstruction-focused objectives at extracting useful representations from noisy inputs, because they retain only the information shared across views rather than everything needed to reproduce each input.
- Hybrid Models: The introduction and success of DCCAE suggest that hybrid models can effectively combine the strengths of different objectives. Future research can explore other hybrid forms that may balance varying learning criteria.
- Optimization Strategies: The analysis of stochastic optimization for DCCA offers insights into the balance between computational efficiency and model performance, valuable for large-scale applications.
Future Directions
Given the promising results of the DCCA and DCCAE models, several future directions are suggested:
- Architecture Exploration: Extending the DCCA/DCCAE framework to other network architectures such as convolutional and recurrent networks could yield further improvements and applications across different data modalities and tasks.
- Stronger Constraints: Investigating constraints beyond uncorrelatedness—such as independence—could further refine feature extraction processes in multi-view learning.
- Integration in Supervised Learning: Applying CCA-based objectives as regularizers in supervised or semi-supervised learning frameworks could exploit the multi-view nature in more supervised settings, potentially boosting performance on various recognition tasks.
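As a purely hypothetical illustration of that last direction (not something the paper implements), the correlation term from the earlier sketch could be added to a supervised loss as a regularizer; alpha is an assumed trade-off weight.

```python
import torch.nn.functional as F

def supervised_multiview_loss(logits, targets, h1, h2, alpha=0.1):
    # Cross-entropy on the primary view, plus the (negative) total
    # correlation of the two views' encodings as a regularizer.
    # Reuses neg_total_correlation from the sketch above; alpha is hypothetical.
    return F.cross_entropy(logits, targets) + alpha * neg_total_correlation(h1, h2)
```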
In conclusion, the paper by Wang et al. highlights the value of combining correlation and reconstruction objectives for multi-view representation learning, which is particularly relevant for applications that deal with multimodal data. Their proposed DCCAE model stands out as a balanced and robust approach, opening avenues for further research and optimization in the field.