- The paper presents an unsupervised fine-tuning technique that automatically selects hard positive and negative training examples using 3D models.
- It employs a siamese CNN architecture trained with a contrastive loss, followed by a learned (discriminative) whitening step, to produce compact image representations.
- Evaluations on benchmark datasets demonstrate significant performance gains over traditional methods with reduced memory and computational requirements.
CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples
This paper addresses the challenge of improving CNN-based image retrieval without manual annotation. The authors propose an unsupervised fine-tuning technique that leverages Bag-of-Words (BoW) retrieval and Structure-from-Motion (SfM) to guide the selection of training data for CNNs.
Methodology
The authors utilize a fully automated process to construct 3D models from large collections of unordered images. These models aid in selecting both hard positive and negative training samples, significantly enhancing the CNN’s performance in image retrieval tasks.
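The selection logic can be illustrated with a minimal sketch. The function name `select_training_tuple` and the use of cluster IDs as a stand-in for 3D-model membership are illustrative assumptions, not the paper's exact procedure; the idea shown is picking a hard positive from the query's own 3D model and the most similar images from other models as hard negatives.

```python
import numpy as np

def select_training_tuple(q_idx, descriptors, cluster_ids, n_neg=5):
    """Illustrative sketch of hard-example selection: a hard positive from
    the query's own 3D model (here approximated by a cluster id) and the
    most similar descriptors from *other* clusters as hard negatives.
    Descriptors are assumed L2-normalized, one per row."""
    sims = descriptors @ descriptors[q_idx]            # cosine similarities
    same = cluster_ids == cluster_ids[q_idx]
    same[q_idx] = False
    # hard positive: co-clustered image whose descriptor matches least
    pos_candidates = np.where(same)[0]
    pos = pos_candidates[np.argmin(sims[pos_candidates])]
    # hard negatives: most similar images from different clusters
    neg_candidates = np.where(~same & (np.arange(len(sims)) != q_idx))[0]
    negs = neg_candidates[np.argsort(-sims[neg_candidates])][:n_neg]
    return pos, negs
```

Selecting the *least* similar positive and the *most* similar negatives is what makes the tuples hard, which the authors identify as key to effective fine-tuning.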
The approach decomposes into the following key steps:
- Training Data Selection: A BoW-based retrieval system clusters images, and SfM builds 3D models from the clusters, yielding reliable matching graphs. Training tuples of query, positive, and negative images are drawn from these models, with an emphasis on variability and difficulty of the examples.
- Network Architecture: A siamese network trained with a contrastive loss fine-tunes a pretrained CNN's image representation, improving on off-the-shelf CNN features by training on the mined hard examples.
- Whitening and Dimensionality Reduction: Linear discriminant projections are learned from the same training data. This discriminative whitening is more stable and performs better than traditional PCA-whitening.
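The contrastive loss at the core of the siamese training can be written in a few lines. This is a minimal sketch on precomputed, L2-normalized descriptors; the margin value is illustrative, not the paper's exact setting, and in practice the loss is backpropagated through the CNN rather than applied to fixed vectors.

```python
import numpy as np

def contrastive_loss(d_q, d_i, is_positive, margin=0.7):
    """Contrastive loss on a pair of L2-normalized descriptors:
    pull matching pairs together, push non-matching pairs apart
    until they are at least `margin` apart (margin is illustrative)."""
    dist = np.linalg.norm(d_q - d_i)
    if is_positive:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2
```

Negatives that are already farther apart than the margin contribute zero loss, which is why mining *hard* negatives (close impostors) matters: easy negatives provide no gradient signal.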
Results
The paper presents a comprehensive evaluation on standard datasets such as Oxford Buildings, Paris, and Holidays, both with and without additional distractors. Notably, the fine-tuned CNNs set new state-of-the-art results in image retrieval across descriptor dimensionalities ranging from 16D to 512D.
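How short codes like 16D or 32D are obtained can be sketched as learned whitening followed by truncation. The function `learn_whitening` below is an assumed, simplified rendition of discriminative whitening: whiten by the covariance of matching-pair differences, rotate by PCA in the whitened space, and keep the top components; the exact estimator in the paper may differ.

```python
import numpy as np

def learn_whitening(pairs, X, out_dim=32):
    """Simplified sketch of discriminatively learned whitening.
    `pairs` are index pairs of matching descriptors (rows of X).
    Returns a projection P mapping D-dim descriptors to out_dim codes."""
    # intra-class covariance from matching-pair differences
    diffs = np.stack([X[i] - X[j] for i, j in pairs])
    CS = diffs.T @ diffs / len(diffs)
    # inverse square root via eigendecomposition (small ridge for stability)
    w, V = np.linalg.eigh(CS + 1e-6 * np.eye(CS.shape[0]))
    W = V @ np.diag(w ** -0.5) @ V.T
    # PCA rotation in the whitened space, truncated to out_dim
    Xw = (X - X.mean(0)) @ W
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    return W @ Vt[:out_dim].T      # final projection: D -> out_dim

# usage sketch: project, then re-normalize to get short codes
# codes = (X - X.mean(0)) @ P
# codes /= np.linalg.norm(codes, axis=1, keepdims=True)
```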
Key points include:
- The proposed approach significantly surpasses existing methods for compact representations and extremely short codes (e.g., 16D and 32D vectors).
- Noteworthy improvements are observed for both AlexNet and VGG; fine-tuning either network with the proposed pipeline yields substantial gains over its off-the-shelf features.
- Compared to state-of-the-art systems employing query expansion and local features, the proposed CNN method achieves competitive results with reduced memory and computational requirements.
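The reduced memory and computational cost follows from the representation itself: with L2-normalized short codes, search is a single matrix-vector product. A minimal sketch (the function name `search` is illustrative):

```python
import numpy as np

def search(query_code, db_codes, top_k=5):
    """Retrieval with compact L2-normalized codes: one matrix-vector
    product plus a sort. A 32-D float32 code costs 128 bytes per image,
    versus thousands of local features for BoW-style systems."""
    scores = db_codes @ query_code       # cosine similarity
    return np.argsort(-scores)[:top_k]   # indices of best matches
```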
Implications and Future Directions
This research holds significant implications for unsupervised learning techniques in computer vision, suggesting robust ways to optimize CNN performance without human annotation. The methods discussed could lead to more efficient processing of large-scale image datasets, significantly impacting web-scale visual search applications.
Future directions could explore integrating the proposed fine-tuning with more complex architectures or extending to different image domains beyond landmark recognition. Additionally, developing methods to tailor post-processing for other representations such as R-MAC within the same framework holds promise for further performance gains.
In summary, the paper presents a compelling advancement in leveraging CNNs for image retrieval tasks, marking notable improvements in automated system efficiency and effectiveness.