- The paper introduces the integrative few-shot classification and segmentation (FS-CS) task, proposing the iFSL framework with Attentive Squeeze Network (ASNet) to jointly learn classification and segmentation from few examples.
- ASNet employs hypercorrelation construction, an attentive squeeze layer with self-attention, and multi-layer fusion to build reliable class-wise foreground maps for both classification and segmentation.
- Experiments show that iFSL and ASNet achieve significant improvements over existing methods on standard benchmarks like Pascal-5$^i$ and COCO-20$^i$, demonstrating flexibility and robustness.
Integrative Few-Shot Learning for Classification and Segmentation
This paper addresses the integrative task of few-shot classification and segmentation (FS-CS). The authors propose to combine few-shot classification (FS-C) and few-shot segmentation (FS-S) into one integrative task, where models learn to both classify and segment target objects in a query image given only a few support examples. The task aims to resolve deficiencies in conventional few-shot learning setups by allowing multi-label and background-aware classification and segmentation, which are more aligned with real-world scenarios where objects can be absent or belong to multiple classes.
To tackle FS-CS, the authors introduce the integrative few-shot learning (iFSL) framework, which trains a model to produce a class-wise foreground map for each support class and uses these maps for both classification and segmentation. This integrative formulation allows training with either weak supervision (class tags) or strong supervision (segmentation annotations). The central component of the framework is the attentive squeeze network (ASNet), which uses deep semantic correlation and global self-attention to produce a reliable foreground map for each target class.
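The dual readout from class-wise foreground maps can be illustrated with a minimal NumPy sketch. The function name `ifsl_inference`, the thresholds, and the background fallback rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def ifsl_inference(foreground_maps, cls_threshold=0.5):
    """Derive multi-label classification and segmentation from
    class-wise foreground maps.

    foreground_maps: (N, H, W) array of per-class foreground
    probabilities in [0, 1], one map per support class.
    Returns (class_presence, segmentation), where segmentation uses
    label 0 for background and 1..N for the N support classes.
    """
    # Multi-label classification: a class counts as present if its
    # spatially averaged foreground probability clears a threshold,
    # so zero, one, or several classes may be predicted.
    class_presence = foreground_maps.mean(axis=(1, 2)) >= cls_threshold

    # Segmentation: per pixel, take the most likely class map, but
    # fall back to background when no map is confident enough.
    best = foreground_maps.argmax(axis=0)
    confident = foreground_maps.max(axis=0) >= 0.5
    segmentation = np.where(confident, best + 1, 0)
    return class_presence, segmentation
```

Note how this readout is inherently multi-label and background-aware: an all-background query simply yields no present classes and an all-zero mask, which the rigid FS-C and FS-S setups cannot express.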
ASNet's design has three key components:
- Hypercorrelation Construction: ASNet computes correlation tensors between a query and support feature pyramids, thus enabling a dense evaluation of semantic similarity across images.
- Attentive Squeeze Layer: This layer implements a high-order self-attention mechanism, transforming support correlation tensors to foreground maps while preserving spatial dimensions relevant to the query image.
- Multi-layer Fusion: The model merges outputs from multiple layers to integrate multi-scale semantic features, progressively aggregating contextual information from coarse to fine levels.
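The three components above can be sketched in a few lines of NumPy. This is a toy illustration, not ASNet itself: the correlation step mirrors standard cosine-similarity hypercorrelation, while `attentive_squeeze` and `fuse_pyramid` only capture the squeeze-and-fuse idea with fixed (non-learned) attention weights and nearest-neighbor upsampling; all function names are hypothetical:

```python
import numpy as np

def hypercorrelation(query_feat, support_feat):
    """Cosine-similarity correlation between a query feature map
    (C, Hq, Wq) and a support feature map (C, Hs, Ws).
    Returns a 4D tensor (Hq, Wq, Hs, Ws), clamped at zero."""
    c = query_feat.shape[0]
    q = query_feat.reshape(c, -1)
    s = support_feat.reshape(c, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
    corr = np.clip(q.T @ s, 0.0, None)  # (Hq*Wq, Hs*Ws)
    return corr.reshape(query_feat.shape[1:] + support_feat.shape[1:])

def attentive_squeeze(corr):
    """Collapse the support spatial dimensions with softmax attention,
    yielding a foreground score map at the query resolution (Hq, Wq).
    ASNet learns this squeeze with high-order self-attention; here the
    attention weights come directly from the correlation values."""
    hq, wq = corr.shape[:2]
    flat = corr.reshape(hq, wq, -1)
    w = np.exp(flat)
    w /= w.sum(axis=-1, keepdims=True)
    return (w * flat).sum(axis=-1)

def fuse_pyramid(score_maps):
    """Coarse-to-fine fusion: upsample every score map to the finest
    resolution (nearest neighbor via np.kron) and average."""
    th, tw = score_maps[0].shape
    out = np.zeros((th, tw))
    for m in score_maps:
        out += np.kron(m, np.ones((th // m.shape[0], tw // m.shape[1])))
    return out / len(score_maps)
```

The key property the sketch preserves is that the squeeze keeps the query's spatial layout intact while condensing the support evidence, which is what makes the output usable as a per-class foreground map.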
The authors demonstrate that iFSL, together with ASNet, achieves substantial improvements over existing few-shot segmentation methods on standard benchmarks. Experiments on Pascal-5$^i$ and COCO-20$^i$ show that ASNet consistently outperforms other techniques in both 1-way and multi-way setups, verifying its flexibility and effectiveness in generalizing across different few-shot learning tasks.
Furthermore, the paper offers a detailed analysis of the robustness of FS-CS against task transfer, demonstrating that models trained on FS-CS generalize well to FS-C and FS-S tasks, overcoming limitations in each. The empirical results confirm that FS-CS enables a flexible and realistic evaluation framework by removing rigid constraints found in traditional setups, such as strict class exclusiveness in FS-C or mandatory target class presence in FS-S.
Future research paths emerging from this work include exploring efficient ways to improve segmentation performance using only class tags when segmentation annotations are unavailable, potentially opening avenues for weakly supervised few-shot learning. Additionally, further optimizations of ASNet's architecture could enhance segmentation map precision, making it suitable for even larger and more challenging datasets, thus broadening its application in practical AI systems.
Overall, the paper's introduction of the FS-CS task and its proposed methodological framework significantly advance the capacity of few-shot learning to deal with realistic, varied, and challenging scenarios encountered in computer vision tasks. The numerical results and theoretical insights presented create valuable benchmarks and set new directions for further explorations in few-shot learning and its applications.