- The paper introduces Super-features, a mid-level representation that minimizes redundancy and captures distinct image patterns.
- It leverages the Local Feature Integration Transformer (LIT) to iteratively refine learnable templates so that they focus on discriminative image patterns.
- The training strategy using contrastive and decorrelation losses achieves notable improvements in mean average precision on landmark retrieval benchmarks.
Evaluation of Mid-Level Features for Deep Image Retrieval
The paper proposes a novel architecture for image retrieval built on mid-level features, termed "Super-features," that improve retrieval performance. These Super-features are produced by an iterative attention mechanism, the Local Feature Integration Transformer (LIT), which refines a set of learnable templates over several iterations so that they focus on discriminative image patterns. The approach addresses two issues of existing retrieval systems: the redundancy of traditional local features and the discrepancy between how such systems are trained and how they perform inference.
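To make the mechanism concrete, here is a minimal sketch of an iterative cross-attention module in the spirit of LIT, written in PyTorch. The layer layout, dimensions, and hyperparameters (number of templates, number of iterations) are illustrative assumptions rather than the authors' implementation; the point is only to show learnable templates being refined by repeated cross-attention over local CNN features.

```python
import torch
import torch.nn as nn

class IterativeTemplateAttention(nn.Module):
    """Hypothetical LIT-style module: templates iteratively attend over local features."""

    def __init__(self, dim=128, num_templates=64, num_iterations=3):
        super().__init__()
        # Learnable templates, shared across images, refined per image at inference.
        self.templates = nn.Parameter(torch.randn(num_templates, dim) * 0.02)
        self.num_iterations = num_iterations
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_x = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, local_feats):
        # local_feats: (B, N, dim) CNN features flattened over spatial positions.
        B = local_feats.size(0)
        x = self.norm_x(local_feats)
        k, v = self.to_k(x), self.to_v(x)
        t = self.templates.unsqueeze(0).expand(B, -1, -1)       # (B, T, dim)
        attn = None
        for _ in range(self.num_iterations):
            q = self.to_q(self.norm_t(t))
            # Each template attends over the N spatial positions.
            attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)  # (B, T, N)
            t = t + attn @ v                    # cross-attention update of templates
            t = t + self.mlp(self.norm_t(t))    # lightweight feed-forward refinement
        # The refined templates act as the per-image Super-features; the attention
        # maps are returned so a decorrelation loss can be applied to them.
        return t, attn
```

In use, the local features would come from flattening the spatial grid of a backbone feature map, and each refined template plays the role of one Super-feature for that image.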
Key Contributions
- Super-feature Representation: The introduction of Super-features represents a shift from traditional local feature aggregation to a more refined, mid-level representation. These Super-features are less redundant and focus on distinct patterns within images, providing an efficient and compact representation for retrieval tasks.
- Iterative Attention Module (LIT): LIT is designed to process local features extracted from a CNN and iteratively refine a set of templates. This process results in an ordered set of Super-features that attend to specific patterns across different images of the same class.
- Training Framework: The paper proposes a training strategy that operates directly on Super-features using a contrastive loss and an attention decorrelation loss. The contrastive loss pulls Super-features from matching image pairs together while pushing away non-matching elements, and the decorrelation loss promotes diverse attention distributions among Super-features, enhancing their distinctiveness (see the loss sketch after this list).
- Evaluation and Results: The method was extensively tested on landmark retrieval benchmarks, including the Revisited Oxford (ROxford) and Revisited Paris (RParis) datasets, showing notable improvements in mean average precision (mAP) over existing state-of-the-art methods while requiring significantly less memory.
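The following sketch illustrates the two training objectives at the level of individual Super-features, assuming the matching Super-features of an image pair have already been put in correspondence (the matching step is omitted). The margin, the cosine-similarity formulation, and the exact decorrelation penalty are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, margin=0.5):
    # anchor, positive: (D,) matched Super-features from two images of the same landmark.
    # negatives: (K, D) non-matching Super-features.
    anchor = F.normalize(anchor, dim=-1)
    pos_sim = torch.dot(anchor, F.normalize(positive, dim=-1))
    neg_sim = F.normalize(negatives, dim=-1) @ anchor            # (K,)
    # Pull the matching pair together, push non-matching elements beyond the margin.
    return (1.0 - pos_sim) + torch.clamp(neg_sim - margin, min=0.0).sum()

def attention_decorrelation_loss(attn):
    # attn: (T, N) attention maps of the T Super-features over N spatial positions.
    attn = F.normalize(attn, dim=-1)
    gram = attn @ attn.t()                                       # (T, T) pairwise overlaps
    off_diag = gram - torch.diag_embed(torch.diagonal(gram))
    # Penalize overlapping attention so each Super-feature attends to a distinct pattern.
    return off_diag.pow(2).sum() / (attn.size(0) * (attn.size(0) - 1))
```

The decorrelation term penalizes overlap between attention maps, which is what pushes different Super-features to cover different image regions rather than collapsing onto the same pattern.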
Implications and Future Directions
The implications of using Super-features in image retrieval extend beyond landmark recognition, potentially transforming image retrieval tasks across various domains. The reduction in redundancy and memory usage makes the approach scalable and practical for large-scale applications.
Future research could explore the integration of this approach with more complex datasets and diverse image types, leveraging the ordered nature of Super-features for other tasks, such as object detection or segmentation. Additionally, expanding the application of the LIT module to other domains, such as video analysis or 3D object recognition, presents intriguing possibilities.
Conclusion
This work provides a robust framework for advancing deep image retrieval with mid-level features, along with a detailed analysis of their benefits over traditional local-feature aggregation. It is a meaningful contribution to computer vision, particularly to the efficiency and effectiveness of image retrieval systems, and invites further exploration within the broader context of deep learning research.