SIFT Meets CNN: A Decade Survey of Instance Retrieval (1608.01807v2)

Published 5 Aug 2016 in cs.CV

Abstract: In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors (de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single-pass of an image to the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.

Authors: Liang Zheng, Yi Yang, Qi Tian

Citations (680)

Summary

- The paper presents a comprehensive survey that contrasts SIFT-based methods with evolving CNN-based approaches, emphasizing improvements in retrieval performance.
- It details methodologies including large, medium, and small codebooks for SIFT and hybrid, pre-trained, and fine-tuned strategies for CNN, highlighting trade-offs in efficiency and accuracy.
- The study outlines future directions toward generalized retrieval systems and end-to-end learning, underscoring the drive for more adaptive computer vision solutions.

Overview of "SIFT Meets CNN: A Decade Survey of Instance Retrieval"

This paper surveys instance retrieval methods developed over the past decade, highlighting the transition from SIFT-based methods to those based on convolutional neural networks (CNNs). The field has evolved significantly, driven both by advances in hand-crafted features such as SIFT and by the emergence of deep learning techniques.

Categories and Methodologies

The paper divides instance retrieval methods into two broad categories: SIFT-based and CNN-based approaches. SIFT-based methods are further grouped by the size of the codebook they use: large, medium-sized, or small.

SIFT-based Methods:
- Large Codebooks: Offer high discriminative power, at the cost of more expensive codebook training and feature quantization. Techniques such as hierarchical k-means and approximate k-means are used to handle these large vocabularies efficiently.
- Medium-sized Codebooks: Use Hamming Embedding (HE) to improve the discriminative ability of visual words, balancing recall and precision.
- Small Codebooks: Employ encoding techniques such as VLAD and Fisher Vector for compact representations, focusing on reducing memory footprint and improving efficiency.
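
As a rough illustration of the small-codebook case, the sketch below shows VLAD-style aggregation: each local descriptor is assigned to its nearest codebook center, the residuals are summed per center, and the concatenated result is L2-normalized. The function names, the toy 2-D descriptors, and the two-center codebook are assumptions made for illustration only. In practice the descriptors would be 128-D SIFT vectors and the codebook would be learned with k-means.

```python
import math


def nearest_center(desc, centers):
    """Index of the codebook center closest to desc (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(desc, centers[k])),
    )


def vlad(descriptors, centers):
    """Aggregate local descriptors into one VLAD-style vector.

    descriptors: list of equal-length float lists (stand-ins for SIFT).
    centers:     list of codebook centers (assumed to be given).
    Returns a flat, L2-normalized list of length len(centers) * dim.
    """
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for desc in descriptors:
        idx = nearest_center(desc, centers)
        for j in range(dim):
            # Accumulate the offset of the descriptor from its center.
            residuals[idx][j] += desc[j] - centers[idx][j]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy data: two centers, three descriptors (hypothetical values).
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
image_vector = vlad(descriptors, centers)
```

The output length depends only on the codebook size and descriptor dimension, not on how many local features the image has, which is what makes this kind of representation compact.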

CNN-based Methods:
- Hybrid Methods: Integrate CNN features into traditional patch-based retrieval frameworks, using techniques like VLAD on CNN descriptors.
- Pre-trained Models: Leverage existing CNNs trained on large datasets like ImageNet to extract global or regional features.
- Fine-tuned Models: Adapt CNNs to specific retrieval tasks using targeted datasets, yielding highly discriminative features.
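
To make the pre-trained, single-pass idea more concrete, here is a minimal sketch of turning a convolutional activation tensor into a global descriptor and ranking database images by cosine similarity. It assumes the activations have already been extracted by some CNN and are given as nested lists of shape C x H x W. The tiny hand-written tensors, the function names, and the choice of max or sum pooling are illustrative assumptions, not the survey's prescribed pipeline.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def global_descriptor(feature_maps, pooling="max"):
    """Pool a C x H x W activation tensor into a C-dimensional vector.

    feature_maps: list of C channels, each a list of H rows of W floats
                  (assumed to come from a convolutional layer).
    pooling:      "max" or "sum" over all spatial positions of a channel.
    """
    if pooling not in ("max", "sum"):
        raise ValueError("pooling must be 'max' or 'sum'")
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(max(values) if pooling == "max" else sum(values))
    return l2_normalize(pooled)


def rank_by_cosine(query, database):
    """Database indices sorted by decreasing cosine similarity.

    Vectors are assumed to be L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query, vec)) for vec in database]
    return sorted(range(len(database)), key=lambda i: -scores[i])


# Hypothetical 2-channel, 2x2 activations for a query and two images.
query_maps = [[[3.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
image_a = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
image_b = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 2.0], [0.0, 0.0]]]

query_vec = global_descriptor(query_maps)
db_vecs = [global_descriptor(image_b), global_descriptor(image_a)]
ranking = rank_by_cosine(query_vec, db_vecs)
```

A regional variant would apply the same pooling to sub-windows of each channel instead of the whole map.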

Key Findings and Experimental Results

The survey underscores significant improvements in retrieval performance, especially with the introduction of CNN-based methods. Fine-tuned CNN models, in particular, have shown state-of-the-art results on specific tasks such as landmark retrieval, benefiting from large training datasets and sophisticated learning techniques.

- CNN-based techniques demonstrate higher efficiency in feature extraction with GPUs and offer competitive accuracy across varied datasets.
- SIFT-based methods maintain relevance, especially in scenarios involving grayscale images or severe occlusions, owing to their local descriptor robustness.
- Compact representations are increasingly favored due to their efficiency with approximate nearest neighbor search methods.
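
One way to see why compact vectors pair well with approximate nearest neighbor search is a small hashing sketch. The example below uses random-hyperplane hashing to turn a real-valued descriptor into a short binary code and compares codes by Hamming distance. This technique is chosen here only as a simple stand-in, and the summary does not say which approximate search methods the survey covers. All names, the 16-bit code length, and the fixed seed are assumptions.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)  # fixed seed so codes are reproducible
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(vec, hyperplanes):
    """Pack one bit per hyperplane: 1 if vec lies on its positive side."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def search(query_code, db_codes, top_k=3):
    """Indices of the top_k database codes closest in Hamming distance."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return order[:top_k]


# Hypothetical 4-D descriptor and its negation as a two-item database.
planes = make_hyperplanes(dim=4, n_bits=16)
vec = [0.5, -1.0, 2.0, 0.1]
code = binary_code(vec, planes)
opposite = binary_code([-x for x in vec], planes)
nearest = search(code, [opposite, code], top_k=1)
```

Each image is stored as a single integer, and comparing two images costs one XOR and a bit count, which is the kind of saving that makes compact codes attractive at scale.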

Implications and Future Directions

The transition from SIFT to CNN in instance retrieval reflects broader trends in computer vision towards end-to-end learning systems. This shift not only enhances retrieval accuracy but also streamlines the feature extraction process.

Future research is directed towards creating more generalized retrieval systems applicable across diverse datasets, as well as specialized systems fine-tuned for specific tasks like pedestrian or vehicle retrieval. The development of large-scale instance-level datasets will be crucial in driving forward both generic and specialized retrieval capabilities. Additionally, novel CNN architectures and transfer learning strategies hold potential for further improving the adaptability and accuracy of retrieval systems.

This survey is a useful reference for researchers seeking to understand the evolution and current state of instance retrieval, and a guide for future work in the field.
