Large-Scale Image Retrieval with Attentive Deep Local Features (1612.06321v4)

Published 19 Dec 2016 in cs.CV

Abstract: We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature). The new feature is based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset. To identify semantically useful local features for image retrieval, we also propose an attention mechanism for keypoint selection, which shares most network layers with the descriptor. This framework can be used for image retrieval as a drop-in replacement for other keypoint detectors and descriptors, enabling more accurate feature matching and geometric verification. Our system produces reliable confidence scores to reject false positives; in particular, it is robust against queries that have no correct match in the database. To evaluate the proposed descriptor, we introduce a new large-scale dataset, referred to as Google-Landmarks dataset, which involves challenges in both database and query such as background clutter, partial occlusion, multiple landmarks, objects in variable scales, etc. We show that DELF outperforms the state-of-the-art global and local descriptors in the large-scale setting by significant margins. Code and dataset can be found at the project webpage: https://github.com/tensorflow/models/tree/master/research/delf .

Summary

The paper introduces DELF, a novel CNN-based descriptor that leverages attention to enhance keypoint selection for image retrieval.
The method applies multi-scale dense feature extraction using a ResNet50-based FCN over a 7-scale image pyramid, capturing detailed local features.
Experimental results on the Google-Landmarks dataset show that DELF outperforms traditional global and local descriptors, particularly under occlusion and clutter.

Large-Scale Image Retrieval with Attentive Deep Local Features

The paper "Large-Scale Image Retrieval with Attentive Deep Local Features" introduces DELF (DEep Local Feature), a local feature descriptor optimized for large-scale image retrieval. The framework employs convolutional neural networks (CNNs) and proposes an attention mechanism for selecting keypoints. This design facilitates more accurate feature matching and robust geometric verification, which is particularly valuable for datasets exhibiting various challenges, such as background clutter and partial occlusion.

Overview

Proposed Methodology

Dense Localized Feature Extraction

The DELF descriptor uses a fully convolutional network (FCN), with its base model derived from ResNet50, to extract dense localized features from images. To handle scale variation, the FCN is applied to an image pyramid of 7 scales. Because the network's receptive field is fixed in pixels, each pyramid level yields features that describe image regions of a different size.
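The pyramid construction can be illustrated with a minimal sketch. It assumes scales spaced a factor of √2 apart over the range 0.25 to 2.0, which gives the 7 levels mentioned above; the function names and the 640×480 example image are hypothetical and not taken from the DELF code.

```python
import math


def pyramid_scales(min_scale=0.25, max_scale=2.0, factor=math.sqrt(2)):
    """Return geometrically spaced scales from min_scale up to max_scale."""
    scales = []
    s = min_scale
    # Small tolerance so floating-point drift does not drop the last level.
    while s <= max_scale * (1 + 1e-9):
        scales.append(s)
        s *= factor
    return scales


def scaled_sizes(width, height, scales):
    """Pixel size of the image at each pyramid level, rounded to integers."""
    return [(max(1, round(width * s)), max(1, round(height * s))) for s in scales]


# Hypothetical 640x480 input image.
scales = pyramid_scales()
sizes = scaled_sizes(640, 480, scales)
for s, (w, h) in zip(scales, sizes):
    print(f"scale {s:.3f}: {w} x {h}")
```

Each resized image would then be passed through the same network, so the smallest level describes large regions of the original image and the largest level describes fine detail.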

Attention-Based Keypoint Selection

A novel attention mechanism selects semantically meaningful keypoints from the dense features. The attention model is designed as a 2-layer CNN with a softplus activation function and is trained with weak supervision, relying only on image-level class labels. Because it shares most of its network layers with the feature descriptor, scoring keypoints adds little computation on top of feature extraction.
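As a rough illustration of how such a scoring network can drive keypoint selection, the sketch below scores each local feature with a two-layer per-location network, pools features by their scores, and keeps the top-scoring locations. It assumes 1×1 convolutions (written as plain linear layers, since a 1×1 convolution acts on each location independently) and a ReLU on the hidden layer; the weights, feature values, and function names are toy placeholders, not the trained DELF model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def attention_score(feature, w1, b1, w2, b2):
    """Score one local feature: linear -> ReLU -> linear -> softplus."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, feature)) + b)
        for row, b in zip(w1, b1)
    ]
    return softplus(sum(w * h for w, h in zip(w2, hidden)) + b2)


def attention_pool(features, scores):
    """Attention-weighted sum of features (one vector per image)."""
    dim = len(features[0])
    return [sum(s * f[d] for s, f in zip(scores, features)) for d in range(dim)]


def select_keypoints(scores, k):
    """Indices of the k highest-scoring locations, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy data: three 2-D local features and hand-picked weights (assumptions).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
w1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
w2, b2 = [1.0, 1.0], 0.0

scores = [attention_score(f, w1, b1, w2, b2) for f in features]
pooled = attention_pool(features, scores)
top2 = select_keypoints(scores, 2)
```

In a weakly supervised setup of this kind, the pooled vector would feed an image-level classifier during training, so the scores learn to emphasize features that help predict the class label; at retrieval time only the highest-scoring locations are kept as keypoints.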

Both the fine-tuned descriptors and the attention-based keypoint selection are shown to improve retrieval performance significantly, which suggests that the attention model learns to ignore irrelevant image regions.

Dataset and Evaluation

A new dataset, Google-Landmarks, was introduced to evaluate the effectiveness of DELF. The dataset comprises over 1 million landmark images from 12,894 unique landmarks. It is notably more challenging than previous datasets because of its diversity and its inclusion of distractor queries, that is, queries with no correct match in the database. This scale and complexity allow for a robust validation of the system's performance.

Experimental Results

Accuracy Assessment

The proposed DELF system outperforms existing global descriptors, such as DIR and siaMAC, as well as traditional local descriptors like CONGAS, on the introduced Google-Landmarks dataset. Precision-recall curves indicate that DELF retains higher precision across various recall levels. Both the fine-tuning and attention components were crucial in achieving these improvements, with the attention mechanism playing a particularly significant role.
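For readers unfamiliar with how such curves are built from confidence scores, here is a minimal sketch using the standard definitions of precision and recall over pooled retrieval results. It is not necessarily the exact evaluation protocol of the paper; the scores, labels, and `num_relevant` value are made-up illustration data.

```python
def precision_recall_curve(results, num_relevant):
    """Sweep a score threshold from high to low over pooled results.

    results: list of (score, is_correct) pairs gathered over all queries.
    num_relevant: total number of correct matches that could be retrieved.
    Returns one (threshold, precision, recall) tuple per result.
    Tied scores are not grouped; each result gets its own curve point.
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    curve = []
    true_positives = 0
    for rank, (score, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            true_positives += 1
        curve.append((score, true_positives / rank, true_positives / num_relevant))
    return curve


# Made-up results: (confidence score, whether the retrieved image is correct).
results = [(0.9, True), (0.8, True), (0.4, False), (0.3, True), (0.1, False)]
curve = precision_recall_curve(results, num_relevant=4)
for threshold, precision, recall in curve:
    print(f"score >= {threshold:.1f}: precision {precision:.2f}, recall {recall:.2f}")
```

Results returned for distractor queries are always incorrect, so a system keeps precision high only if it assigns them low scores; this is why reliable confidence scores matter in this setting.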

Comparison with Global and Local Features

The paper provides a comprehensive comparison between DELF and state-of-the-art global and local descriptors. DELF consistently outperforms these methods, especially under conditions involving occlusion and background clutter. Fine-tuning the descriptor on landmark images and applying the attention mechanism together improve its discriminative capability.

Implications and Future Research

The research underscores the importance of attention-based keypoint selection in large-scale image retrieval tasks. Semantically aware keypoint selection marks a significant advance over traditional approaches. Future research could explore integrating these mechanisms with other feature extraction models and scaling the attention mechanism to even larger and more diverse datasets.

Additionally, the approach could be extended to other computer vision applications beyond landmark recognition, such as object detection and segmentation. The principles established here could drive the development of more robust and efficient systems capable of operating in real-world scenarios with substantial variability and complexity.

Conclusion

The paper presents DELF, which combines CNN-based local features with an attention mechanism for keypoint selection and improves retrieval accuracy and robustness over the compared global and local descriptors. Its evaluation on the newly proposed Google-Landmarks dataset indicates that attention-based keypoint selection is a promising foundation for future image retrieval systems.
