Detect-to-Retrieve: Efficient Regional Aggregation for Image Search (1812.01584v2)

Published 4 Dec 2018 in cs.CV

Abstract: Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes $86k$ images with manually curated boxes from $15k$ unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially with no dimensionality increase, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data available at the project webpage: https://github.com/tensorflow/models/tree/master/research/delf.

Summary

The paper introduces a novel regional aggregated selective match kernel (R-ASMK) that improves image retrieval accuracy using fewer, curated regions.
It leverages a custom-trained landmark detector and a new dataset of 86,000 images to enable focused and efficient regional indexing.
Experiments on Revisited Oxford and Paris datasets show significant mean average precision gains, advancing state-of-the-art retrieval methods.

Detect-to-Retrieve: Efficient Regional Aggregation for Image Search

The paper "Detect-to-Retrieve: Efficient Regional Aggregation for Image Search" addresses image retrieval, focusing on retrieving object instances from cluttered scenes more efficiently and accurately. It selects image regions with a custom-trained landmark detector and aggregates the information from those regions into a single image representation.

Summary of Contributions

Landmark Dataset Development: Retrieval benchmarks lack bounding-box annotations for the objects of interest. To fill this gap, the authors build on the Google Landmarks dataset and release 86,000 images with manually curated bounding boxes covering 15,000 unique landmarks. The dataset is used to train a landmark detector, which in turn enables more focused regional indexing.
Regional Aggregated Selective Match Kernel (R-ASMK): The central technical contribution is a regional aggregated selective match kernel. R-ASMK combines information from the detected regions into one holistic image representation, improving retrieval accuracy substantially with no increase in dimensionality. It even outperforms systems that index each image region independently.
Enhanced Image Retrieval System: The complete system, which incorporates R-ASMK, improves on the previous state of the art by significant margins in mean average precision on the Revisited Oxford and Paris datasets.
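The idea of aggregating regions without growing the representation can be illustrated with a small sketch. It is not the authors' implementation and may differ from the paper's exact formulation: it assumes a toy codebook of visual words, sums descriptor residuals per word inside each region, merges the per-region results, and scores two images with an ASMK-style selectivity function. The function names, the parameters `alpha` and `tau`, and the toy data are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def nearest_word(desc, codebook):
    """Index of the codebook centroid closest to desc (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(desc, codebook[k])),
    )


def add_into(table, word, vec):
    """Accumulate vec into table[word] element-wise."""
    if word in table:
        table[word] = [a + b for a, b in zip(table[word], vec)]
    else:
        table[word] = list(vec)


def aggregate_region(descriptors, codebook):
    """Per visual word, sum the residuals of one region and normalize."""
    sums = {}
    for desc in descriptors:
        k = nearest_word(desc, codebook)
        add_into(sums, k, [d - c for d, c in zip(desc, codebook[k])])
    return {k: l2_normalize(v) for k, v in sums.items()}


def aggregate_image(regions, codebook):
    """Merge per-region aggregates into one unit vector per visual word."""
    merged = {}
    for descriptors in regions:
        for k, v in aggregate_region(descriptors, codebook).items():
            add_into(merged, k, v)
    return {k: l2_normalize(v) for k, v in merged.items()}


def selectivity(u, alpha=3.0, tau=0.0):
    """Keep only matches above tau and sharpen them with exponent alpha."""
    return math.copysign(abs(u) ** alpha, u) if u > tau else 0.0


def raw_score(x, y, alpha, tau):
    """Sum of selectivity-weighted dot products over shared visual words."""
    return sum(
        selectivity(sum(a * b for a, b in zip(x[k], y[k])), alpha, tau)
        for k in x.keys() & y.keys()
    )


def similarity(x, y, alpha=3.0, tau=0.0):
    """Normalized kernel score; an image scores 1.0 against itself."""
    self_x = raw_score(x, x, alpha, tau)
    self_y = raw_score(y, y, alpha, tau)
    if self_x <= 0 or self_y <= 0:
        return 0.0
    return raw_score(x, y, alpha, tau) / math.sqrt(self_x * self_y)


# Toy data (hypothetical): two visual words in 2-D.
codebook = [[0.0, 0.0], [10.0, 10.0]]
# image_a has two regions; image_b and image_c have one region each.
image_a = aggregate_image([[[1.0, 0.0], [0.0, 1.0]], [[11.0, 10.0]]], codebook)
image_b = aggregate_image([[[10.0, 11.0]]], codebook)
image_c = aggregate_image([[[12.0, 10.0]]], codebook)
```

In this sketch the representation holds at most one vector per visual word, so its size depends on the codebook and not on the number of regions.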

Methodology

The work builds on convolutional neural network (CNN) features, which yield compact embeddings suited to efficient similarity computation. For re-ranking, a stage traditionally dominated by hand-crafted features and geometric verification, it likewise relies on CNN-based representations to improve accuracy.

The paper argues that existing region selection techniques, which are either uniform or class-agnostic, are inefficient and memory intensive because they produce many irrelevant regions. The Detect-to-Retrieve (D2R) approach instead keeps a small number of regions proposed by the trained landmark detector and uses them to refine the image's representation.
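Detector-driven region selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the detector output format, the score threshold of 0.5, the choice to always keep the whole image as one region, and the function names are all hypothetical.

```python
def select_regions(detections, image_size, score_threshold=0.5):
    """Return the whole-image box plus every detection above the threshold.

    detections: list of ((x0, y0, x1, y1), score) pairs from some detector.
    image_size: (width, height) in pixels.
    """
    width, height = image_size
    # Assumption: the full image is always kept as the first region.
    regions = [(0, 0, width, height)]
    regions += [box for box, score in detections if score >= score_threshold]
    return regions


def features_in_region(keypoints, region):
    """Indices of local features whose (x, y) location falls inside region."""
    x0, y0, x1, y1 = region
    return [
        i for i, (x, y) in enumerate(keypoints)
        if x0 <= x <= x1 and y0 <= y <= y1
    ]


# Hypothetical detector output: one confident box and one weak box.
detections = [((10, 10, 50, 50), 0.9), ((0, 0, 5, 5), 0.2)]
regions = select_regions(detections, image_size=(100, 80))
keypoints = [(20, 20), (70, 70), (3, 3)]
per_region = [features_in_region(keypoints, r) for r in regions]
```

Only the confident box survives the threshold, so the image is described by two regions instead of the many a uniform grid or class-agnostic proposal method would produce.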

Experimental Evaluation

The authors first evaluate landmark detectors built on the SSD and Faster R-CNN frameworks and trained on the new dataset. The detectors reach high mean average precision, which supports the reliability of models derived from the dataset. The evaluation then moves to image retrieval, measuring the gains from regional selection and regional aggregation.

Regional Search: D2R achieves higher mean average precision with notably fewer regions than memory-intensive alternatives such as RMACB and Selective Search.
Regional Aggregation: R-ASMK avoids any memory increase and still outperforms methods that index regions separately. It does so by aggregating the local descriptors of the detected regions into a single image representation.
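Since the results above are reported as mean average precision, a short worked example may help. The sketch below uses the plain non-interpolated definition of average precision over a ranked list; it is an assumption that this simple form is adequate for illustration, and it does not reproduce the official evaluation protocol of the Revisited Oxford and Paris benchmarks. Function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated AP for one query.

    ranked_relevance: booleans for the ranked results, best match first.
    num_relevant: total number of relevant database images for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant if num_relevant else 0.0


def mean_average_precision(queries):
    """Mean of per-query AP; queries is a list of (ranking, num_relevant)."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


# Two toy queries (hypothetical rankings).
queries = [([True, False, True], 2), ([False, True], 1)]
ap_first = average_precision(*queries[0])   # (1/1 + 2/3) / 2
ap_second = average_precision(*queries[1])  # (1/2) / 1
map_score = mean_average_precision(queries)
```

A ranking that places relevant images earlier raises each precision term, which is why better region selection and aggregation show up directly in this metric.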

Implications and Future Directions

The work has both practical and conceptual implications. Practically, more accurate and efficient retrieval in cluttered scenes could benefit applications such as autonomous navigation, augmented reality, and digital asset management. Conceptually, it challenges the usual separation of detection and retrieval by combining them in a single pipeline.

Future work might apply the approach to other datasets, integrate additional models, and target real-time retrieval systems. Experimenting with different CNN architectures or further optimizing the aggregated selective match kernel could yield additional gains in accuracy and efficiency.

In summary, the paper shows that integrating regional detection into the retrieval pipeline yields an image search system that is both more accurate and more efficient.
