Overview of Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders
The paper "Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders" addresses a significant challenge in machine learning: the ability to classify unseen or scarcely seen classes through generalized zero-shot learning (GZSL) and generalized few-shot learning (GFSL) frameworks. This work introduces a novel approach using aligned Variational Autoencoders (VAEs) to improve knowledge transfer across modalities, presenting strong results across multiple datasets.
Methodology
The authors propose a model termed CADA-VAE (Cross- and Distribution Aligned Variational Autoencoder), which learns a shared latent space for image features and class embeddings without requiring visual data from unseen classes during training. The innovation of CADA-VAE lies in its use of modality-specific VAEs whose latent distributions are explicitly aligned. This alignment occurs through two main mechanisms, Cross-Alignment (CA) and Distribution-Alignment (DA), both sketched in code after the list below.
- Cross-Alignment (CA): A cross-reconstruction objective requires each modality's decoder to reconstruct its data from the latent code produced by the other modality's encoder (for the same class), ensuring the encoded representations retain the multimodal information shared across modalities.
- Distribution-Alignment (DA): The model minimizes the 2-Wasserstein distance between the Gaussian latent distributions predicted by the different modality encoders, further forcing the latent spaces to encode consistent information across modalities.
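The sketch below illustrates both objectives in PyTorch, assuming diagonal Gaussian posteriors and an L1 reconstruction loss; the encoder/decoder names are illustrative and do not come from the authors' released code. In the paper, the CA and DA terms are weighted by factors (γ and δ) and added to the standard β-weighted VAE losses of both modalities.

```python
# A minimal sketch of the two alignment losses (illustrative names; this is
# not the authors' released implementation).
import torch
import torch.nn.functional as F

def cross_alignment_loss(x_img, x_att, z_img, z_att, img_dec, att_dec):
    """CA: decode each modality from the OTHER modality's latent code
    (same class), so the latents must carry shared information."""
    img_from_att = img_dec(z_att)   # class-embedding latent -> image features
    att_from_img = att_dec(z_img)   # image latent -> class embedding
    return F.l1_loss(img_from_att, x_img) + F.l1_loss(att_from_img, x_att)

def distribution_alignment_loss(mu_img, sigma_img, mu_att, sigma_att):
    """DA: closed-form 2-Wasserstein distance between two diagonal
    Gaussians: W2 = sqrt(||mu1 - mu2||^2 + ||sigma1 - sigma2||^2)."""
    mu_term = (mu_img - mu_att).pow(2).sum(dim=1)
    sigma_term = (sigma_img - sigma_att).pow(2).sum(dim=1)
    return (mu_term + sigma_term).sqrt().mean()
```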
The combined effect of these alignment objectives is a latent space whose features are discriminative for both seen and unseen classes; the final classification is performed by a simple softmax classifier trained on sampled latent features, as illustrated below.
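As a hedged illustration of that classification stage, the sketch below assembles a latent-space training set: seen-class latents are sampled from encoded image features and unseen-class latents from encoded class embeddings, after which an ordinary softmax classifier can be trained on the result. The (mu, sigma) encoder interface and all function names are assumptions made for this example.

```python
import torch

def build_latent_training_set(img_enc, att_enc, seen_imgs, seen_labels,
                              unseen_atts, unseen_labels, n_samples=100):
    """Assemble latent features for a softmax classifier: encoded image
    features for seen classes, encoded class embeddings for unseen ones."""
    with torch.no_grad():
        # Seen classes: one latent sample per image feature.
        mu_s, sigma_s = img_enc(seen_imgs)
        z_seen = mu_s + sigma_s * torch.randn_like(sigma_s)
        # Unseen classes: draw several samples per class embedding via the
        # reparameterization trick to balance the training set.
        mu_u, sigma_u = att_enc(unseen_atts)
        z_unseen = torch.cat([mu_u + sigma_u * torch.randn_like(sigma_u)
                              for _ in range(n_samples)])
        y_unseen = unseen_labels.repeat(n_samples)
    return (torch.cat([z_seen, z_unseen]),
            torch.cat([seen_labels, y_unseen]))
```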
Results
CADA-VAE establishes state-of-the-art performance on multiple benchmark datasets, including CUB, SUN, AWA1, AWA2, and, notably, the large-scale ImageNet dataset. In the GZSL setting, the model yields a harmonic-mean accuracy of 52.4% on CUB, 40.6% on SUN, 64.1% on AWA1, and 63.9% on AWA2. It surpasses previous models, especially those relying on GAN-based feature generation or traditional compatibility-function approaches.
Importantly, on ImageNet, CADA-VAE achieves higher unseen-class accuracy than previous methods by leveraging the aligned latent space to handle a large and varied search space. The improvements are consistent across splits with varying levels of granularity and class balance, demonstrating the method's scalability and robustness.
Implications
The introduction of CADA-VAE represents a significant stride in multimodal learning, particularly for tasks where labeled data is scarce. The proposed model not only achieves higher accuracy but also offers a more stable training process than GAN-based techniques.
Theoretically, the alignment of latent spaces provides an effective way to harness shared information across modalities, mitigating issues such as the projection domain shift problem. Practically, this research opens avenues for applying similar alignment techniques to more fine-grained and complex multimodal tasks, such as visual question answering and multimodal sentiment analysis.
Future Directions
The results and methodologies presented in this paper lay the groundwork for several future research directions:
- Extension to More Modalities: Implementing CADA-VAE with more than two modalities could lead to even more versatile models capable of integrating information from various sources like text, audio, and video.
- Exploration of Different Alignment Techniques: Investigating alternative or additional alignment objectives might enhance the discriminative power of the latent space, facilitating better knowledge transfer.
- Application in Real-World Scenarios: Leveraging these findings in practical applications, such as autonomous systems or personalized recommendation engines, could drastically improve performance where annotated data is limited.
In conclusion, the alignment methods introduced in CADA-VAE mark a promising shift in zero- and few-shot learning, demonstrating strong performance across diverse datasets and paving the way for future innovations in AI and machine learning.