- The paper tackles supervision collapse by integrating self-supervised learning to preserve generalized embeddings in few-shot scenarios.
- The paper introduces the CrossTransformer architecture that leverages spatial correspondence for localized, part-based comparisons, improving adaptability to domain shifts.
- The paper validates its approach on the Meta-Dataset benchmark, achieving state-of-the-art performance in transferring from ImageNet to diverse vision tasks.
The paper "CrossTransformers: Spatially-Aware Few-Shot Transfer" investigates the challenges associated with few-shot learning in the context of computer vision, specifically focusing on the phenomena of supervision collapse. Despite the advancements in training deep learning models with large datasets such as ImageNet, the ability of such models to transfer learned representations to tasks with limited data availability remains suboptimal. The authors present two main contributions to address this limitation: integrating self-supervised learning into the training of embeddings, and the introduction of the CrossTransformer architecture for few-shot image classification.
When deep learning models must adapt to new classes or domain shifts with minimal data, performance degrades markedly due to what the authors term "supervision collapse": the network discards information that is not directly useful for the training objective, leaving embeddings poorly suited to new tasks. To mitigate this, the paper incorporates SimCLR, a self-supervised learning algorithm, by reformulating its instance-discrimination pretext task as training episodes, which encourages the embedding to retain more general-purpose features.
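To make the instance-discrimination idea concrete, the following is a minimal sketch of SimCLR's contrastive (NT-Xent) objective over two augmented views of a batch of images. It is not the paper's exact episodic formulation; the function name, temperature value, and batch layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over two augmented views of the same batch.

    z1, z2: [B, D] embeddings of two augmentations of the same B images.
    Each image's other view is its positive; the remaining 2B - 2 embeddings
    in the batch act as negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2B, D], unit-norm
    sim = z @ z.t() / temperature                         # [2B, 2B] similarities
    sim.fill_diagonal_(float("-inf"))                     # never match an embedding to itself
    b = z1.shape[0]
    # The positive for index i is i + B (and vice versa for the second view).
    targets = torch.cat([torch.arange(b) + b, torch.arange(b)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```

In the paper's setting, an objective of this kind is mixed into episodic training so that each image instance effectively acts as its own class, pushing the embedding to preserve information beyond what the base-class labels demand.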
The second contribution is a novel Transformer-inspired architecture, termed CrossTransformers, that exploits spatial correspondence to improve robustness against task and domain shift. Unlike traditional Prototypical Networks, which classify using a single global embedding per image, CrossTransformers compute coarse spatial correspondences between a query image and the support-set images of each class, enabling localized, part-based feature comparisons. This exploits the compositional structure of objects in vision tasks and improves the model's adaptability in few-shot learning scenarios.
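The sketch below illustrates the core idea under stated assumptions: an attention map between query and support spatial locations builds a query-aligned class prototype, and a spatially-aware distance to that prototype serves as the class score. It assumes backbone feature maps have already been flattened into per-location vectors; the class name, head dimensions, and single-block formulation are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


class CrossTransformerSketch(torch.nn.Module):
    """Minimal sketch of spatially-aware, attention-based class scoring."""

    def __init__(self, in_dim, key_dim=128, value_dim=128):
        super().__init__()
        self.query_head = torch.nn.Linear(in_dim, key_dim)    # queries from the query image
        self.key_head = torch.nn.Linear(in_dim, key_dim)      # keys from support images
        self.value_head = torch.nn.Linear(in_dim, value_dim)  # values from both

    def score_class(self, query_feats, support_feats):
        """query_feats:   [P, C]     spatial features of one query image (P locations)
        support_feats: [N, S, C]  spatial features of N support images of one class
        Returns a negative distance usable as a classification logit."""
        q = self.query_head(query_feats)                        # [P, K]
        k = self.key_head(support_feats.flatten(0, 1))          # [N*S, K]
        v = self.value_head(support_feats.flatten(0, 1))        # [N*S, V]

        # Coarse spatial correspondence: each query location attends over
        # every spatial location of every support image of the class.
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # [P, N*S]

        # Query-aligned class prototype: one value vector per query location.
        prototype = attn @ v                                      # [P, V]

        # Spatially-aware distance between the query's own values and the
        # aligned prototype; a smaller distance means a higher class score.
        q_values = self.value_head(query_feats)                   # [P, V]
        return -((q_values - prototype) ** 2).sum()
```

In an episode, a score of this form would be computed for every class in the support set, with a softmax over the per-class scores producing the few-shot prediction.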
The empirical evaluation uses the Meta-Dataset benchmark and shows that CrossTransformers achieve state-of-the-art performance when transferring from ImageNet to novel, diverse vision datasets. The improvements are substantial, particularly on datasets with complex intra-class variability, highlighting the model's ability to factorize visual tasks into more transferable, part-level components.
The implications of this research are twofold. Practically, it advances the capability of vision models to generalize over new tasks with limited labeled data, impacting applications from home robotics to industrial vision systems. Theoretically, it marks a shift towards the integration of spatially-aware architectures within the few-shot learning paradigm, paving the way for further explorations into finer-grained correspondence mechanisms within transfer learning.
Future directions could include more sophisticated correspondence mechanisms to further mitigate supervision collapse, and broader use of self-supervised learning within transformer-based architectures to improve few-shot transfer. It will also be important to examine how these techniques extend to tasks such as detection and segmentation, where spatial awareness should play a significant role. The paper's contributions lay a solid foundation for these directions, promising more adaptable and versatile AI systems across diverse real-world applications.