Semantic Autoencoder for Zero-Shot Learning
The paper "Semantic Autoencoder for Zero-Shot Learning" by Elyor Kodirov, Tao Xiang, and Shaogang Gong introduces a novel method for zero-shot learning (ZSL) leveraging an encoder-decoder architecture. The proposed model, termed Semantic AutoEncoder (SAE), aims to address the common issue of domain shift in ZSL by incorporating a reconstruction constraint.
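Concretely, the paper couples the two projections in a single objective: with visual features $X$, semantic representations $S$ (e.g., attribute vectors), and an encoder projection $W$ whose transpose serves as the decoder, SAE minimizes

$$\min_{W}\; \|X - W^{\top} S\|_F^2 \;+\; \lambda\, \|W X - S\|_F^2,$$

where the first term is the reconstruction (decoder) loss, the second term forces encoded features to match their semantic representation, and $\lambda$ balances the two.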
Key Contributions
- Novel Encoder-Decoder Model: The SAE model differs from a conventional autoencoder in that its latent layer is constrained to be a semantically meaningful representation, such as class attributes or word vectors. Unlike most existing ZSL approaches, which learn a projection from the visual space to the semantic space or vice versa, SAE jointly optimizes both projection directions.
- Linear and Symmetric Architecture: Both the encoder and decoder are linear, and their weights are tied: the decoder applies the transpose of the encoder's projection matrix to reconstruct the input visual features. This keeps the model computationally efficient, which is crucial for large-scale visual recognition tasks.
- Efficient Optimization Algorithm: The learning problem reduces to a Sylvester equation, which the paper solves with the Bartels-Stewart algorithm; the solver's complexity is independent of the number of training samples. This keeps the model scalable to large datasets, a significant advantage over many contemporary ZSL approaches.
- Performance and Robustness: Extensive experiments on six benchmark datasets demonstrate that SAE outperforms existing state-of-the-art ZSL models. Furthermore, SAE also shows superior performance in supervised clustering tasks, highlighting the generalizability of the model beyond the ZSL context.
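Under this linear, tied-weight formulation, training reduces to solving a single Sylvester equation. A minimal sketch of that step (the function name and the use of `scipy.linalg.solve_sylvester` are illustrative choices, not the authors' released code):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def train_sae(X, S, lam=0.2):
    """Learn the SAE projection W for the objective
        min_W ||X - W^T S||_F^2 + lam * ||W X - S||_F^2
    by solving the Sylvester equation
        (S S^T) W + W (lam * X X^T) = (1 + lam) * S X^T.

    X: d x N matrix of visual features (one column per sample)
    S: k x N matrix of semantic vectors (attributes / word vectors)
    Returns W: k x d encoder matrix; W.T acts as the decoder.
    """
    A = S @ S.T                      # k x k, independent of N after this product
    B = lam * (X @ X.T)              # d x d
    C = (1 + lam) * (S @ X.T)        # k x d
    return solve_sylvester(A, B, C)  # Bartels-Stewart under the hood
```

Note that once the small Gram matrices `A`, `B`, and `C` are formed, the solve itself involves only k x k and d x d matrices, which is why the optimization cost does not grow with the training set size.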
Numerical Results
The empirical results are notable:
- On benchmarks like Animals with Attributes (AwA) and Caltech-UCSD Birds-200-2011 (CUB), SAE achieved accuracy improvements ranging between 3.5% and 6.5% over the best previous methods.
- In large-scale datasets, for instance, ILSVRC2012/ILSVRC2010, SAE achieved a hit@5 accuracy that was 8.8% higher than the top competing method, showcasing its robustness in handling extensive and diverse class distributions.
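At test time, zero-shot prediction only requires encoding an unseen-class sample into the semantic space and matching it against class prototypes. A minimal sketch, assuming a learned projection `W` and cosine-similarity matching (the paper also evaluates the reverse direction, matching in the visual feature space via the decoder):

```python
import numpy as np

def zsl_predict(W, X_test, S_unseen):
    """Classify test samples by encoding visual features into the
    semantic space (s_hat = W x) and assigning each sample to the
    nearest unseen-class prototype by cosine similarity.

    W:        k x d learned encoder matrix
    X_test:   d x M test features (one column per sample)
    S_unseen: k x C semantic prototypes of the unseen classes
    Returns an array of M predicted class indices into S_unseen.
    """
    S_hat = W @ X_test                                    # k x M encodings
    S_hat = S_hat / np.linalg.norm(S_hat, axis=0, keepdims=True)
    P = S_unseen / np.linalg.norm(S_unseen, axis=0, keepdims=True)
    sims = P.T @ S_hat                                    # C x M cosine scores
    return np.argmax(sims, axis=0)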
Implications and Future Developments
Practical Implications:
- The reduced computational cost of SAE, compared with competing models, makes it particularly attractive for real-world applications where computational resources are a limiting factor.
- The applicability of SAE to supervised clustering means that this method can be extended to various unsupervised and semi-supervised learning scenarios, enriching its utility across different machine learning tasks.
Theoretical Implications:
- The integration of a reconstruction constraint into the learning objective mitigates domain shift, a pervasive problem in ZSL. This demonstrates a pathway for further research into more complex, nonlinear reconstruction constraints and their potential benefits.
- Future work could explore the application of deep learning versions of SAE, where convolutional layers could be used to enhance the encoder and decoder structures, leveraging the latest advancements in neural network architectures.
Conclusion
The "Semantic Autoencoder for Zero-Shot Learning" paper presents an innovative approach to ZSL by embedding a reconstruction mechanism into an encoder-decoder model. The method’s efficiency and effectiveness, as evidenced by substantial empirical evaluation, suggest a significant step forward in the domain of zero-shot learning. The broader applicability to supervised clustering further cements its value, marking it as a noteworthy contribution to the fields of computer vision and machine learning. Future research will undoubtedly benefit from building on the foundational concepts introduced in this work.