- The paper proposes the SPoC descriptor, showing that sum pooling of deep convolutional features yields superior performance compared to traditional methods.
- It demonstrates that deep convolutional features offer strong discriminative power in a compact 256-dimensional space, enhancing efficiency and reducing overfitting.
- The study challenges conventional aggregation techniques by advocating for revised approaches when using neural-derived descriptors, with significant implications for mobile image retrieval.
Aggregating Deep Convolutional Features for Image Retrieval
The paper "Aggregating Deep Convolutional Features for Image Retrieval" by Artem Babenko and Victor Lempitsky addresses the challenge of constructing compact yet effective global descriptors for image retrieval. The authors leverage deep convolutional neural networks (CNNs) to extract and aggregate features, showing that these surpass traditional hand-crafted descriptors such as SIFT.
Overview of Methodology
The authors explore several methods for aggregating local features obtained from deep convolutional layers. Unlike fully-connected layer features, convolutional features can be treated as local descriptors akin to traditional SIFT features. However, Babenko and Lempitsky demonstrate that the statistical properties of these deep features are distinct and require reevaluation of existing aggregation methods.
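The idea of treating convolutional activations as a set of local descriptors can be sketched as follows. This is an illustrative example with made-up dimensions (the grid size depends on the network and input resolution), not the authors' exact setup:

```python
import numpy as np

# Hypothetical activations from a network's last convolutional layer:
# an H x W spatial grid with C channels per position.
H, W, C = 37, 37, 512
activations = np.random.rand(H, W, C).astype(np.float32)

# Each spatial position yields one C-dimensional local descriptor,
# analogous to a dense grid of SIFT descriptors over the image.
local_descriptors = activations.reshape(H * W, C)
print(local_descriptors.shape)  # (1369, 512)
```

Any aggregation scheme (sum pooling, Fisher vectors, VLAD, triangulation embedding) can then be applied to this set of local descriptors.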
Key Findings
- Sum Pooling Aggregation: The authors identify that simple sum pooling is highly effective for aggregating deep features, resulting in a descriptor they term SPoC (Sum-Pooled Convolutional features). This method significantly outperforms more complex techniques like Fisher vectors and triangular embeddings.
- Discriminative Properties: Deep convolutional features exhibit high discriminative ability even without high-dimensional embedding. This property enables simpler and more efficient aggregation strategies without sacrificing performance.
- Centering Prior: Incorporating a Gaussian centering prior improves retrieval performance by emphasizing features closer to the image center, particularly benefiting datasets where objects of interest are central.
- Comparison with the State-of-the-Art: SPoC descriptors advance the state-of-the-art on standard benchmarks (e.g., the Oxford and Holidays datasets), achieving higher mean average precision (mAP) while remaining compact (e.g., 256 dimensions after PCA compression with whitening).
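The findings above can be combined into a single aggregation function. The sketch below follows the SPoC recipe (weighted sum pooling, L2 normalization, optional PCA-whitening projection); the function name, the `sigma_frac` parameter, and the specific Gaussian width are illustrative assumptions, not values from the paper:

```python
import numpy as np

def spoc_descriptor(feature_map, sigma_frac=3.0, pca_matrix=None):
    """Sketch of SPoC aggregation for one image.

    feature_map: (H, W, C) activations from the last conv layer.
    sigma_frac:  controls the width of the Gaussian centering prior
                 (illustrative parameterization).
    pca_matrix:  optional (D, C) PCA-whitening projection, e.g., D = 256.
    """
    H, W, C = feature_map.shape

    # Gaussian centering prior: weight each spatial position by its
    # distance to the image center, emphasizing central features.
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    sigma_y, sigma_x = H / sigma_frac, W / sigma_frac
    weights = np.exp(-((ys - cy) ** 2 / (2 * sigma_y ** 2)
                       + (xs - cx) ** 2 / (2 * sigma_x ** 2)))

    # Sum pooling of the weighted local descriptors.
    desc = (feature_map * weights[:, :, None]).reshape(-1, C).sum(axis=0)

    # L2 normalization.
    desc = desc / (np.linalg.norm(desc) + 1e-12)

    # Optional PCA compression with whitening, followed by re-normalization.
    if pca_matrix is not None:
        desc = pca_matrix @ desc
        desc = desc / (np.linalg.norm(desc) + 1e-12)
    return desc
```

In practice the PCA-whitening matrix would be learned on a held-out set of descriptors; it is left as an input here to keep the sketch self-contained.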
Practical Implications
The introduction of SPoC enhances image retrieval in applications where computational efficiency and storage are crucial, such as mobile and embedded devices. Its low dimensionality and reduced susceptibility to overfitting make SPoC a compelling choice in these contexts.
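The efficiency argument is concrete: with L2-normalized compact descriptors, scoring an entire database reduces to a single matrix-vector product. A minimal sketch with a synthetic database (sizes and the matching index are made up for illustration):

```python
import numpy as np

# Hypothetical database of N compact 256-d descriptors, L2-normalized.
rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = db[42]  # pretend the query image matches database item 42 exactly

# For unit-norm vectors, cosine similarity is just a dot product, so
# ranking the whole database is one matrix-vector multiply.
scores = db @ query
top5 = np.argsort(-scores)[:5]
print(top5[0])  # 42
```

At 256 float32 dimensions, each descriptor occupies 1 KB, which is what makes on-device storage of large galleries feasible.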
Theoretical Implications
This work challenges the notion of extending traditional techniques directly to new feature types. The findings urge reconsideration of methods like VLAD or Fisher Vectors when transitioning from hand-crafted to learned features, highlighting the unique characteristics of neural-derived features.
Future Directions
The insights presented in this paper open avenues for further exploration in fine-tuning deep networks for specific retrieval tasks or integrating sum pooling within more sophisticated deep learning frameworks. Additionally, the application of SPoC in multimodal retrieval systems or its adaptation for non-visual data could represent promising research trajectories.
In conclusion, the authors offer a thorough examination of deep feature aggregation for image retrieval and propose a practical, robust solution with SPoC. Their analysis underscores the importance of adapting aggregation strategies to the distinct nature of deep features, enhancing the efficacy of CNN-based image retrieval systems.