
Deep Image Retrieval: Learning global representations for image search (1604.01325v2)

Published 5 Apr 2016 in cs.CV

Abstract: We propose a novel approach for instance-level image retrieval. It produces a global and compact fixed-length representation for each image by aggregating many region-wise descriptors. In contrast to previous works employing pre-trained deep networks as a black box to produce features, our method leverages a deep architecture trained for the specific task of image retrieval. Our contribution is twofold: (i) we leverage a ranking framework to learn convolution and projection weights that are used to build the region features; and (ii) we employ a region proposal network to learn which regions should be pooled to form the final global descriptor. We show that using clean training data is key to the success of our approach. To that aim, we use a large scale but noisy landmark dataset and develop an automatic cleaning approach. The proposed architecture produces a global image representation in a single forward pass. Our approach significantly outperforms previous approaches based on global descriptors on standard datasets. It even surpasses most prior works based on costly local descriptor indexing and spatial verification. Additional material is available at www.xrce.xerox.com/Deep-Image-Retrieval.

Citations (783)

Summary

  • The paper introduces an end-to-end learnable R-MAC representation that optimizes convolution and projection weights for enhanced retrieval accuracy.
  • It leverages a region proposal network to select salient image regions, effectively reducing background noise in instance-level retrieval.
  • The method achieves high mAP on benchmark datasets like Oxford and Paris, demonstrating efficiency and scalability for large-scale image search.

Overview of "Deep Image Retrieval: Learning Global Representations for Image Search"

The paper "Deep Image Retrieval: Learning Global Representations for Image Search," authored by Albert Gordo, Jon Almazán, Jérôme Revaud, and Diane Larlus, introduces a novel approach to instance-level image retrieval. The approach produces a compact, fixed-length global representation for each image by aggregating many region-wise descriptors. Its contributions are twofold: it leverages a ranking framework to learn the convolution and projection weights that build the region features, and it uses a region proposal network to learn which regions should be pooled into the final global descriptor.

Key Contributions

  1. End-to-End Learning of R-MAC Representation: While traditional image retrieval methods often use pre-trained networks as off-the-shelf feature extractors, this paper trains an architecture specifically for the retrieval task. The architecture builds on regional maximum activations of convolutions (R-MAC), which aggregates descriptors from several image regions into a fixed-length vector. The authors turn R-MAC into an end-to-end learnable representation by backpropagating a ranking loss through the network to optimize the convolution and projection weights, yielding significant gains in retrieval accuracy.
  2. Utilization of Region Proposal Network (RPN): The proposed approach replaces the rigid grid of regions used in traditional methods with regions proposed by an RPN trained to localize regions of interest. This not only enhances coverage but also reduces the influence of background clutter. The RPN ensures regions tightly cover objects of interest, leading to improved retrieval performance.
  3. Clean Training Data: Emphasizing the importance of clean training data, the authors develop an automatic cleaning process for a large-scale but noisy landmark dataset. This process involves keypoint matching and a graph-based approach to retain images with consistent visual content while removing unrelated images.
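
The first two contributions can be illustrated together: per-region max-pooling with a learned projection, aggregated into one descriptor, trained under a triplet ranking loss. The following is a minimal NumPy sketch; the region list, identity-initialized projection, and margin value are illustrative placeholders rather than the paper's exact configuration.

```python
import numpy as np

def l2n(v, eps=1e-8):
    """l2-normalize a vector."""
    return v / (np.linalg.norm(v) + eps)

def rmac(fmap, regions, proj):
    """Aggregate per-region max-pooled activations into one global descriptor.

    fmap:    (C, H, W) convolutional feature map
    regions: list of (y0, y1, x0, x1) boxes (rigid grid or RPN proposals)
    proj:    (C, C) learned projection applied to each region descriptor
    """
    agg = np.zeros(fmap.shape[0])
    for (y0, y1, x0, x1) in regions:
        r = fmap[:, y0:y1, x0:x1].max(axis=(1, 2))  # per-channel max pooling
        agg += l2n(proj @ l2n(r))                   # normalize, project, sum
    return l2n(agg)                                 # final l2 normalization

def triplet_ranking_loss(q, p, n, margin=0.1):
    """Ranking loss: pull a matching pair closer than a non-matching one."""
    d_pos = np.sum((q - p) ** 2)
    d_neg = np.sum((q - n) ** 2)
    return max(0.0, margin + d_pos - d_neg)
```

Because every step (max pooling, normalization, projection, summation) is differentiable, gradients of the ranking loss can flow back through the aggregation into the convolutional weights, which is what makes the representation learnable end to end.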

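The cleaning step in point 3 reduces, in essence, to a graph problem: images are nodes, verified keypoint matches are edges, and each landmark keeps only its largest connected component. A minimal pure-Python sketch, assuming the expensive pairwise keypoint matching and spatial verification has already produced the list of matching pairs:

```python
from collections import defaultdict

def largest_consistent_component(num_images, verified_pairs):
    """Keep the largest group of mutually consistent images.

    verified_pairs: (i, j) index pairs that passed pairwise keypoint
    matching and spatial verification (the expensive step, not shown).
    """
    adj = defaultdict(set)
    for i, j in verified_pairs:
        adj[i].add(j)
        adj[j].add(i)
    seen, best = set(), []
    for start in range(num_images):
        if start in seen:
            continue
        stack, comp = [start], []       # depth-first search from 'start'
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(comp) > len(best):
            best = comp
    return sorted(best)
```

Images outside the largest component (unrelated photos, duplicates of other scenes) are discarded, which is what the authors mean by retaining only images with consistent visual content.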
Experimental Results

The proposed method is evaluated on the standard Oxford 5k, Paris 6k, and INRIA Holidays datasets, where it outperforms not only prior global-descriptor approaches but also most techniques based on costly local descriptor indexing and spatial verification. The key findings can be summarized as follows:

  • Significant performance gains: The architecture, fine-tuned for image retrieval with a ranking loss, achieves a mean Average Precision (mAP) of 83.1% on Oxford 5k, 87.1% on Paris 6k, and 89.1% on Holidays after query expansion (QE). These results mark a notable improvement over existing global-descriptor methods.
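
For reference, the mAP figures above average per-query precision over the ranked results. A simple sketch for binary relevance labels follows; note it ignores the "junk" image handling of the official Oxford/Paris evaluation protocol.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k taken at each relevant rank.

    ranked_relevance: list of 0/1 labels in ranked order, 1 = relevant.
    """
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at this relevant rank
    return sum(precisions) / max(hits, 1)

def mean_average_precision(all_rankings):
    """mAP: average the per-query AP over all queries."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)
```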

  • Efficiency: Working on high-resolution images, the method encodes approximately five images per second on a single Nvidia K40 GPU. Because the compact descriptor is produced in a single forward pass, the approach offers an efficient and scalable solution for large-scale image retrieval.
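
Scalability at query time follows from the descriptors being compact and l2-normalized: ranking a database is a single matrix-vector product. The sketch below also shows one common form of query expansion (averaging the query with its top-ranked neighbors and re-querying); whether this matches the paper's exact QE variant is an assumption, and the `k` value is illustrative.

```python
import numpy as np

def rank_database(query, db):
    """Rank database images by similarity to the query.

    query: (D,) l2-normalized global descriptor
    db:    (N, D) l2-normalized database descriptors
    Unit-length descriptors make the dot product equal to cosine
    similarity, so one matrix-vector product scores the whole database.
    """
    sims = db @ query
    order = np.argsort(-sims)           # best matches first
    return order, sims[order]

def query_expansion(query, db, k=10):
    """Average QE: re-query with the mean of the query and its top-k hits."""
    order, _ = rank_database(query, db)
    expanded = query + db[order[:k]].sum(axis=0)
    expanded /= np.linalg.norm(expanded)  # re-normalize before re-querying
    return rank_database(expanded, db)
```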

Implications and Future Directions

The implications of this research are both practical and theoretical. Practically, the remarkable retrieval accuracy combined with the efficiency of the image representation makes this approach highly suitable for real-world applications involving large image databases. Theoretically, it sets a precedent for end-to-end learning of image representations optimized for specific tasks, moving beyond traditional usage of pre-trained networks.

Future developments in AI might explore further enhancements in region proposal networks, perhaps integrating more sophisticated models that can predict more accurate regions even in cluttered environments. Another interesting direction would be investigating unsupervised or semi-supervised methods for cleaning large-scale datasets, reducing the reliance on manually curated or cleaned data.

Overall, this research underscores the significance of task-specific training and the utilization of region-based models for image retrieval, presenting an effective and scalable solution that stands to influence future advancements in the domain.