Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking (1803.11285v1)

Published 29 Mar 2018 in cs.CV

Abstract: In this paper we address issues with image retrieval benchmarking on standard and popular Oxford 5k and Paris 6k datasets. In particular, annotation errors, the size of the dataset, and the level of challenge are addressed: new annotation for both datasets is created with an extra attention to the reliability of the ground truth. Three new protocols of varying difficulty are introduced. The protocols allow fair comparison between different methods, including those using a dataset pre-processing stage. For each dataset, 15 new challenging queries are introduced. Finally, a new set of 1M hard, semi-automatically cleaned distractors is selected. An extensive comparison of the state-of-the-art methods is performed on the new benchmark. Different types of methods are evaluated, ranging from local-feature-based to modern CNN based methods. The best results are achieved by taking the best of the two worlds. Most importantly, image retrieval appears far from being solved.

Citations (347)

Summary

  • The paper’s main contribution is the re-evaluation of the Oxford and Paris benchmarks by correcting ground truth errors to improve evaluation accuracy.
  • It introduces three evaluation protocols—Easy, Medium, and Hard—with a curated distractor set from YFCC100M to simulate real-world retrieval challenges.
  • Extensive tests on both classical and CNN-based methods highlight that merging local features with CNN descriptors could enhance image retrieval robustness.

Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking

The paper "Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking" presents a comprehensive re-evaluation of the well-established Oxford and Paris image retrieval datasets. These datasets have been pivotal in evaluating image retrieval methods over the years but have reached a state where many retrieval methods achieve near-perfect results, raising concerns about the utility of these benchmarks in stimulating further advancements in the field.

Ground Truth Annotation and Benchmarking

One of the key contributions of this paper is the reassessment of the ground truth annotations for the Oxford 5k and Paris 6k datasets. The authors identify and correct annotation errors, including both false positives and false negatives, which had previously led to misleading performance evaluations, and re-annotate both datasets with extra attention to the reliability of the ground truth. To restore the benchmarks' discriminative power, they propose three new evaluation protocols of varying difficulty (Easy, Medium, and Hard). These protocols redefine which images count as positive, which as negative, and which are ignored for a given query, allowing a more nuanced evaluation of image retrieval methods; the mapping is sketched below.
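
In the new annotation, each relevant database image is labeled as an easy or hard positive, or marked unclear and excluded from scoring; each protocol then decides which labels count as positives and which are ignored. Below is a minimal Python sketch of that logic, with the average-precision computation written from its standard definition rather than taken from the authors' released evaluation code:

```python
# Label-to-protocol mapping, mirroring the paper's annotation scheme:
# "easy"/"hard" positives and "unclear" images excluded from scoring.
# Anything without a label is treated as a negative.
PROTOCOLS = {
    # protocol: (labels counted as positive, labels ignored when scoring)
    "easy":   ({"easy"},         {"hard", "unclear"}),
    "medium": ({"easy", "hard"}, {"unclear"}),
    "hard":   ({"hard"},         {"easy", "unclear"}),
}

def average_precision(ranked_ids, labels, protocol):
    """AP for one query: ignored images are removed from the ranking,
    then standard AP is computed over the remaining list."""
    positive, ignored = PROTOCOLS[protocol]
    kept = [i for i in ranked_ids if labels.get(i) not in ignored]
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(kept, start=1):
        if labels.get(image_id) in positive:
            hits += 1
            precision_sum += hits / rank
    n_pos = sum(1 for lab in labels.values() if lab in positive)
    return precision_sum / n_pos if n_pos else 0.0

# Toy example: under Medium, "a" and "b" are positives and "c" is ignored.
labels = {"a": "easy", "b": "hard", "c": "unclear", "d": "negative"}
print(average_precision(["a", "c", "b", "d"], labels, "medium"))  # -> 1.0
```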

New Queries and Distractor Set

To enhance the robustness of the evaluation framework, 15 new and challenging queries are introduced for each dataset, making the task more reflective of real-world conditions. Additionally, the authors introduce a distractor set of 1 million images curated from YFCC100M and semi-automatically cleaned to ensure the absence of unintended landmark images. This setup yields a more demanding and reliable benchmark for assessing the efficacy of image retrieval algorithms at larger scale; the cleaning idea is sketched below.
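
The cleaning is semi-automatic: retrieval methods flag candidate distractors that may depict the annotated landmarks, and flagged images are then verified by hand. The sketch below illustrates only the automatic flagging stage; the use of a single global descriptor and the 0.6 similarity threshold are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def flag_distractor_candidates(landmark_descs, candidate_descs, threshold=0.6):
    """Flag candidate distractors whose global descriptor is suspiciously
    similar to any annotated landmark image; flagged images go to manual
    review instead of straight into the 1M distractor set.

    Both descriptor matrices are assumed L2-normalized row-wise, so the
    dot product equals cosine similarity. The threshold is a placeholder.
    """
    sims = candidate_descs @ landmark_descs.T         # (n_candidates, n_landmarks)
    return np.where(sims.max(axis=1) > threshold)[0]  # indices to check by hand
```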

Extensive Evaluation of Retrieval Methods

The paper conducts a thorough evaluation of both classical local-feature-based and modern CNN-based retrieval methods. Particular attention is paid to how different approaches, such as local feature aggregation and CNN-based global descriptors, perform across the new protocols. Methods employing fine-tuned CNNs, particularly those using generalized mean-pooling (GeM), perform very well under the new settings; GeM is sketched below. Nevertheless, despite these improvements, retrieval in the challenging setups, especially at large scale, remains a significant hurdle.
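
GeM pooling raises feature-map activations to a power p before spatial averaging and then takes the p-th root, interpolating between average pooling (p = 1) and max pooling (large p). A minimal PyTorch sketch, with p fixed rather than learned as it is in the fine-tuned networks:

```python
import torch

def gem(x: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized mean (GeM) pooling of a CNN feature map.

    x: (batch, channels, H, W) activations, assumed non-negative (post-ReLU).
    p = 1 recovers average pooling; large p approaches max pooling. In the
    fine-tuned networks p is a learned parameter; it is fixed here.
    """
    pooled = x.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)
    # L2-normalize so dot products between descriptors are cosine similarities.
    return torch.nn.functional.normalize(pooled, dim=-1)
```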

Critical Insights and Future Directions

The evaluation highlights that image retrieval is far from solved, particularly in challenging scenarios and on large-scale databases. While CNN-based methods have made tremendous progress, combining their strengths with the geometric robustness of local features is an exciting direction for future research; one plausible two-stage combination is sketched below. The improved benchmark provides concrete incentives to explore hybrid methods that leverage both local and global feature representations to enhance retrieval robustness and accuracy.
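
A common way to combine the two worlds, shown here purely as an illustration rather than as the paper's specific pipeline, is to retrieve a shortlist with global descriptors and then re-rank it by spatial verification with local features. The sketch uses OpenCV's SIFT and RANSAC homography estimation:

```python
import cv2
import numpy as np

def global_shortlist(query_vec, db_vecs, k=100):
    """Stage 1: rank the database by cosine similarity of L2-normalized
    global descriptors (e.g. GeM vectors) and keep the top-k candidates."""
    return np.argsort(-(db_vecs @ query_vec))[:k]

def inlier_count(img_query, img_candidate):
    """Stage 2: spatially verify a candidate by matching SIFT features
    and counting RANSAC homography inliers; the shortlist is re-ranked
    by this count, with ties broken by the global similarity."""
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(img_query, None)
    kc, dc = sift.detectAndCompute(img_candidate, None)
    if dq is None or dc is None:
        return 0
    pairs = cv2.BFMatcher().knnMatch(dq, dc, k=2)
    good = [m for m, *rest in pairs
            if rest and m.distance < 0.75 * rest[0].distance]  # Lowe ratio test
    if len(good) < 4:  # a homography needs at least 4 correspondences
        return 0
    src = np.float32([kq[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kc[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0
```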

Conclusion

Through meticulous re-evaluation and the introduction of more challenging queries, distractors, and protocols, this paper revitalizes the utility of the Oxford and Paris benchmarks and sets a new standard for image retrieval evaluation. The call to harmonize CNN-based advances with the stability of local features outlines a promising avenue for future work, paving the way for more adaptive and accurate image retrieval systems suited to complex real-world environments.