- The paper introduces integral max-pooling of CNN activations to derive compact image descriptors that improve both retrieval accuracy and object localization.
- It leverages convolutional-layer features for effective re-ranking, outperforming descriptors built from fully connected layers on benchmarks such as Oxford5k and Paris6k.
- The approach efficiently balances speed and accuracy through approximate max-pooling localization, yielding high mAP scores with reduced computational cost.
Particular Object Retrieval with Integral Max-Pooling of CNN Activations
Overview
The paper "Particular Object Retrieval with Integral Max-Pooling of CNN Activations" presents a comprehensive approach aimed at addressing the limitations of conventional CNN-based image retrieval systems. Utilizing compact vectors derived from convolutional neural network (CNN) activations, the proposed methodology re-examines both the initial search and re-ranking stages in image retrieval pipelines. This paper integrates a new method to handle max-pooling on convolutional layer activations using integral images, thereby enhancing object localization and ultimately retrieval performance.
Initial Search and Re-Ranking Methodology
The authors focus on compact yet highly discriminative feature vectors computed from convolutional-layer activations over multiple image regions, without feeding multiple inputs through the network. Key contributions include:
- A compact representation derived from convolutional layers that encodes multiple image regions.
- An extension of integral images for efficient (approximate) max-pooling over convolutional activations (sketched below).
- Utilization of the bounding boxes generated via integral max-pooling for image re-ranking.
These strategies yield significant improvements in CNN-based retrieval, providing results that compete favorably with traditional methods on challenging datasets such as Oxford5k and Paris6k.
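To make the integral max-pooling idea concrete, here is a minimal sketch of how per-channel max-pooling over a rectangular region can be approximated with integral images. It assumes a pre-computed, post-ReLU (non-negative) convolutional feature map of shape (C, H, W); the function name and the choice of exponent are illustrative, not taken from the authors' code.

```python
import numpy as np

def approx_max_pool(features, box, alpha=10.0):
    """Approximate per-channel max-pooling over a rectangular region.

    features : (C, H, W) post-ReLU convolutional activations
    box      : (y0, x0, y1, x1), inclusive spatial bounds
    alpha    : exponent; (sum of x^alpha)^(1/alpha) approaches the max as alpha grows

    The alpha-powered activations are summed over the box in O(1) per channel
    using integral images (2-D cumulative sums), so many boxes can be pooled
    without revisiting the activations.
    """
    # Integral images of the alpha-powered activations; in a retrieval system
    # these would be computed once per image and reused for every candidate box.
    integral = (features ** alpha).cumsum(axis=1).cumsum(axis=2)
    # Zero-pad one row/column so the inclusion-exclusion below needs no edge cases.
    integral = np.pad(integral, ((0, 0), (1, 0), (1, 0)))

    y0, x0, y1, x1 = box
    box_sum = (integral[:, y1 + 1, x1 + 1] - integral[:, y0, x1 + 1]
               - integral[:, y1 + 1, x0] + integral[:, y0, x0])
    # Power sum raised to 1/alpha: an approximation of the per-channel maximum.
    return box_sum ** (1.0 / alpha)
```

Once the integral images are in place, pooling any of thousands of candidate windows costs only a handful of array lookups per channel, which is what makes the localization step in the re-ranking stage affordable.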
Related Work
The paper builds upon several advances in CNN-based representations for image retrieval. Previous work predominantly built descriptors from fully connected layer activations, whereas this paper advocates max-pooled convolutional-layer activations, citing improved generalization and compactness. The approach is aligned with trends seen in Fast R-CNN and Faster R-CNN for object detection, but retains a specific focus on particular object retrieval.
Experimental Evaluation
Compact Representation: MAC and R-MAC
The Maximum Activations of Convolutions (MAC) vector and the Regional MAC (R-MAC) vector provide compact image representations. MAC vectors are ℓ2-normalized, PCA-whitened, and ℓ2-normalized again, while R-MAC vectors achieve superior performance by aggregating regional MAC features computed at multiple scales.
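The following is a rough illustration of how these descriptors are assembled, not the authors' implementation: the R-MAC region-sampling grid is a simplified stand-in for the paper's multi-scale scheme, and the pca argument is assumed to behave like a fitted scikit-learn PCA with whiten=True.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def mac(features):
    """MAC: per-channel spatial maximum over a (C, H, W) feature map,
    followed by l2-normalization (PCA-whitening and re-normalization
    would be applied on top, as described above)."""
    return l2_normalize(features.max(axis=(1, 2)))

def rmac(features, levels=3, pca=None):
    """R-MAC: max-pool square regions sampled at several scales, l2-normalize
    (and optionally whiten) each regional vector, sum them, and re-normalize."""
    C, H, W = features.shape
    regional_vectors = []
    for level in range(1, levels + 1):
        side = max(1, int(2 * min(H, W) / (level + 1)))  # regions shrink with level
        ys = np.linspace(0, H - side, level).astype(int)  # simplified uniform grid
        xs = np.linspace(0, W - side, level).astype(int)
        for y in ys:
            for x in xs:
                region = features[:, y:y + side, x:x + side]
                v = l2_normalize(region.max(axis=(1, 2)))
                if pca is not None:
                    v = l2_normalize(pca.transform(v[None, :])[0])
                regional_vectors.append(v)
    return l2_normalize(np.sum(regional_vectors, axis=0))
```

Because the regional vectors are summed rather than concatenated, the R-MAC descriptor keeps the same dimensionality as a single MAC vector, which keeps the initial search stage compact.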
Performance Metrics
Performance evaluations on Oxford5k, Paris6k, Oxford105k, and Paris106k datasets demonstrate substantial improvements in mean Average Precision (mAP). The proposed methods are shown to outperform other contemporary compact descriptors by significant margins. For instance, with VGG16, the R-MAC approach achieves an mAP of 66.9% on Oxford5k and 83.0% on Paris6k.
Localization Efficiency
The Approximate Max-pooling Localization (AML) framework enables efficient object localization by using the extended integral-image technique to approximate max-pooling over arbitrary rectangular regions of the convolutional maps. The method balances localization accuracy against computational cost and markedly speeds up re-ranking of the top retrieved images. AML achieves an overlap (IoU) of 51.3% with much lower computational effort than an exhaustive window search.
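Below is a coarse sketch of the window search behind this idea, reusing the integral-image trick from the earlier sketch; the stride, the minimum window size, and the omission of the paper's coarse-to-fine refinement are simplifications for illustration, and query_vec is assumed to be the ℓ2-normalized MAC descriptor of the query object.

```python
import numpy as np

def aml_localize(features, query_vec, alpha=10.0, step=3, min_size=4):
    """Sweep rectangular windows over a (C, H, W) conv feature map, describe each
    with an approximate MAC vector, and keep the window whose descriptor best
    matches the normalized query vector (a coarse grid search only)."""
    C, H, W = features.shape
    # Integral images of alpha-powered activations, computed once per image.
    integral = np.pad((features ** alpha).cumsum(axis=1).cumsum(axis=2),
                      ((0, 0), (1, 0), (1, 0)))

    def window_vec(y0, x0, y1, x1):
        # O(1) per-channel pooled descriptor of the window via inclusion-exclusion.
        s = (integral[:, y1 + 1, x1 + 1] - integral[:, y0, x1 + 1]
             - integral[:, y1 + 1, x0] + integral[:, y0, x0])
        v = s ** (1.0 / alpha)
        return v / (np.linalg.norm(v) + 1e-12)

    best_score, best_box = -np.inf, None
    for y0 in range(0, H - min_size + 1, step):
        for x0 in range(0, W - min_size + 1, step):
            for y1 in range(y0 + min_size - 1, H, step):
                for x1 in range(x0 + min_size - 1, W, step):
                    score = float(window_vec(y0, x0, y1, x1) @ query_vec)
                    if score > best_score:
                        best_score, best_box = score, (y0, x0, y1, x1)
    return best_box, best_score
```

In a re-ranking pipeline, the best-scoring box would both provide the refined similarity used to re-order the shortlist and serve as the localization output whose IoU is reported above.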
Implications and Future Directions
The integration of CNN-based features into compact yet efficient representations and re-ranking mechanisms lays a pathway for practical and scalable image retrieval systems. Notably, applications could extend beyond image retrieval to domains requiring precise object localization and identification within large-scale databases.
Future research directions might include leveraging end-to-end fine-tuning of CNNs specifically for similarity-based retrieval tasks and exploring more sophisticated pooling strategies. The emphasis on scalable, fast processing pipelines will continue to drive innovation in retrieval systems, allowing for integration into real-world applications involving massive datasets.
Conclusion
In summary, this paper offers a nuanced exploration of CNN activations for particular object retrieval, significantly pushing the boundaries of previous approaches. By effectively revisiting and enhancing initial search as well as re-ranking stages, the authors present a compelling case for the adoption of their methods in advanced image retrieval systems. The interplay of efficiency and high retrieval performance marks this paper as a benchmark for future explorations in the field.