- The paper introduces BM25-V, a sparse retrieval system for images leveraging SAE-derived visual words and BM25 scoring.
- It employs a two-stage architecture, first retrieving candidates with sparse BM25 and then refining rankings via dense reranking.
- BM25-V offers high recall, interpretability, and a compact, efficiently scalable index, with significant compression of visual features.
Sparse Auto-Encoder Visual Word Scoring for Image Retrieval with BM25-V
Introduction
The paper "Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval" (2603.05781) proposes BM25-V, a sparse retrieval system for images that combines Sparse Auto-Encoder (SAE)-derived visual words with Okapi BM25 scoring. BM25-V operates as an efficient, interpretable, and highly scalable first-stage image retriever. It is designed to address practical shortcomings of dense retrieval (limited interpretability, compute cost, and loss of fine-grained spatial evidence) by leveraging sparse, semantically rich features extracted from Vision Transformer (ViT) patch tokens.
Methodology
SAE Visual Words and Zipfian Distribution
BM25-V employs a SAE trained on ViT patch features to produce monosemantic visual word activations. Each image is decomposed into patch-level sparse vectors, which are aggregated via sum pooling to yield image-level visual word frequency vectors. Notably, the distribution of these frequencies exhibits heavy-tailed (Zipfian-like) statistics, analogous to term frequency distributions in text retrieval. This parallel is rigorously established, enabling direct application of BM25's inverse document frequency (IDF) weighting in the visual domain.
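The pooling step can be sketched as follows. This is a minimal illustration only: the array shapes, the top-k truncation, and the function name are assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def image_visual_word_frequencies(patch_codes: np.ndarray, top_k: int = 16):
    """Sum-pool patch-level SAE activations into an image-level
    visual-word frequency vector, then keep the top-k entries.

    patch_codes: (num_patches, vocab_size) non-negative sparse activations.
    Returns (word_ids, frequencies) for the k strongest visual words.
    """
    freqs = patch_codes.sum(axis=0)               # aggregate over patches
    word_ids = np.argsort(freqs)[::-1][:top_k]    # strongest visual words first
    return word_ids, freqs[word_ids]

# Toy example: 4 patches over a vocabulary of 8 visual words
rng = np.random.default_rng(0)
codes = np.maximum(rng.normal(size=(4, 8)), 0)    # ReLU-style sparse codes
ids, f = image_visual_word_frequencies(codes, top_k=3)
```

The resulting frequency vector plays the role of a bag-of-words term-frequency vector, which is what makes the BM25 machinery applicable.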
Two-Stage Retrieval Architecture
BM25-V forms the first stage of a two-stage pipeline:
- Sparse BM25-V candidate retrieval: Each image is indexed by its top-k visual words. At query time, retrieval is performed via sparse inverted index lookup, scoring only candidates sharing active visual words with the query.
- Dense reranking: The top-K candidates are reranked using MAP-pooled dense embeddings with cosine similarity. Both patch features (for BM25-V) and pooled embeddings (for rerank) are computed in a single backbone pass.
This design decouples coarse semantics and local discriminative evidence, providing both high-recall candidate pruning and interpretable retrieval signals.
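The sparse first stage can be sketched with a toy inverted index and the standard Okapi BM25 weighting. This is a schematic sketch under assumptions: the class and method names (`BM25VIndex`, `search`), the parameter defaults, and the exact BM25 variant are illustrative, not the paper's implementation.

```python
import math
from collections import defaultdict

class BM25VIndex:
    """Toy inverted index over images represented as visual-word frequencies."""
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(list)  # word_id -> [(image_id, tf), ...]
        self.doc_len = {}                  # image_id -> total pooled frequency

    def add(self, image_id, word_freqs):
        """word_freqs: dict mapping visual word id -> pooled frequency."""
        self.doc_len[image_id] = sum(word_freqs.values())
        for w, tf in word_freqs.items():
            self.postings[w].append((image_id, tf))

    def search(self, query_freqs, top_n=5):
        N = len(self.doc_len)
        avg_len = sum(self.doc_len.values()) / N
        scores = defaultdict(float)
        for w in query_freqs:                      # only the query's active words
            df = len(self.postings[w])
            if df == 0:
                continue
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            for image_id, tf in self.postings[w]:
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[image_id] / avg_len)
                scores[image_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

index = BM25VIndex()
index.add("img_a", {3: 5.0, 7: 2.0})
index.add("img_b", {3: 1.0, 9: 4.0})
index.add("img_c", {9: 3.0})
results = index.search({3: 2.0, 7: 1.0})   # img_a shares both active words
```

Only images sharing at least one active visual word with the query are ever scored, which is what keeps per-query cost proportional to the query's k active words and their document frequencies.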
Interpretability and Index Efficiency
BM25-V retrieval decisions are inherently interpretable: each image's sparse representation is quantized and indexed by visual words with explicit IDF scores. The inverted index structure enables efficient candidate querying, sharding, and dynamic updates. Memory requirements for the sparse index are minimal (96 bytes/image for k=16 visual words), attaining 48× compression over full dense embeddings and avoiding the nontrivial accuracy–memory trade-offs of product quantization.
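The stated footprint is consistent with a simple byte layout. The specific encoding below (4-byte word id plus 2-byte quantized count per posting, and a 1152-dimensional float32 dense baseline) is an assumption chosen to reproduce the paper's totals, not a documented detail.

```python
# Back-of-envelope check of the stated index footprint.
K = 16                       # visual words kept per image
BYTES_PER_POSTING = 4 + 2    # assumed: 4-byte word id + 2-byte quantized count
sparse_bytes = K * BYTES_PER_POSTING   # per-image sparse index cost

DENSE_DIM = 1152             # assumed dims implied by 96 B x 48 = 4608 B float32
dense_bytes = DENSE_DIM * 4  # full float32 dense embedding
compression = dense_bytes / sparse_bytes
```

Under these assumptions the sparse entry is 96 bytes/image and the compression ratio over the dense embedding is exactly 48x, matching the figures above.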
Experimental Results
BM25-V is extensively evaluated across seven fine-grained benchmarks (CUB-200, Cars-196, FGVC-Aircraft, Oxford-IIIT Pets, Oxford Flowers-102, DTD, Food-101) and generalizes zero-shot across domains—SAE trained on ImageNet-1K transfers without fine-tuning. Major results include:
- Recall@100 and Recall@200: BM25-V achieves Recall@200 ≥ 0.993 and Recall@100 comparable to dense Recall@10 across all datasets.
- Two-stage recall and rank-1 accuracy: Dense reranking of BM25-V candidates matches dense search accuracy within 0.2% on average (R@1: $0.857$ vs. $0.859$), and exceeds it on DTD (+0.7%) and Flowers-102 (+0.1%).
- Sparse retrieval costs: Per-query scoring operations are O(k⋅df) (1,000–30,000 ops per query at benchmark scale), with index build times and dynamic updates vastly more efficient than HNSW or IVF+PQ.
Ablations and Empirical Observations
- Sparsity k: Maintaining low patch-level sparsity (e.g., k≤32) preserves discriminative Zipfian structure required for effective IDF scoring. Overly dense activations lead to collapse in discriminative power and retrieval performance.
- Distributional validation: SAE visual word frequencies are more heavy-tailed (α∈[1.20,2.32]) than text Zipf (α≈1), supporting IDF-based suppression of pervasive, low-information dimensions.
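The heavy-tailed claim can be checked with a simple rank–frequency fit. Least-squares regression in log–log space is one common estimator of the Zipf exponent; the paper's fitting procedure may differ.

```python
import numpy as np

def zipf_exponent(frequencies):
    """Estimate the Zipf exponent alpha of a rank-frequency curve,
    assuming f(r) ~ r^(-alpha), via least squares in log-log space."""
    f = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    f = f[f > 0]                       # log of zero counts is undefined
    ranks = np.arange(1, len(f) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(f), 1)
    return -slope

# Synthetic sanity check: frequencies generated exactly as r^(-1.5)
freqs = np.arange(1, 1001, dtype=float) ** -1.5
alpha = zipf_exponent(freqs)
```

On exactly power-law data the estimator recovers the generating exponent; on real visual-word counts the fitted α would be read off the same way.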
Efficiency and Scalability
BM25-V's sparse index provides linear scalability and real-time update capability, in contrast to graph-based indices such as HNSW, which require expensive rebuilds. The inverted-index approach enables natural sharding, sub-linear query times with WAND pruning, and practical deployment at industrial scale. Sparse scoring cost remains competitive with, or below, IVF+PQ as dataset size increases.
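The WAND-style pruning mentioned above rests on a simple test: a candidate can be skipped whenever the sum of per-term score upper bounds for its active query terms cannot beat the current top-k threshold. The sketch below is schematic; the names and the surrounding cursor machinery of a full WAND implementation are omitted.

```python
def can_skip(candidate_terms, term_upper_bounds, theta):
    """WAND-style pruning test: if the best score a candidate could
    possibly achieve (sum of per-term upper bounds for the query terms
    it contains) cannot exceed the current top-k threshold theta, the
    full BM25 score need not be computed."""
    best_possible = sum(term_upper_bounds[t] for t in candidate_terms)
    return best_possible <= theta

# Illustrative upper bounds for two visual words active in a query
ub = {1: 0.5, 2: 0.3}
skip_high = can_skip([1, 2], ub, theta=1.0)   # best possible 0.8 <= 1.0
skip_low = can_skip([1, 2], ub, theta=0.5)    # best possible 0.8 > 0.5
```

As the threshold θ rises during retrieval, more candidates fail this test, which is the source of the sub-linear query times claimed above.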
Practical and Theoretical Implications
BM25-V advances interpretable, efficient, and high-recall image retrieval by directly exploiting Zipfian statistics in deep visual features. The approach bridges classical IR theory and modern deep architectures, showing that monosemantic sparse features are amenable to principled probabilistic relevance weighting. The separation of local (sparse, IDF-weighted) and global (dense, semantic) signals supports hybrid retrieval systems, mirroring the widespread adoption of sparse–dense hybrids in text IR.
Practically, BM25-V is deployable in scenarios demanding high explainability, auditable decision traces, and scalable candidate selection (e.g., medical, forensic, e-commerce search). The sparse–dense hybrid achieves state-of-the-art accuracy with minimal computational and memory overhead, and provides attribution at the visual word level.
Theoretically, these findings suggest opportunities for broader application of sparse retrieval paradigms in vision, further integration of interpretable SAE features, and principled weighting mechanisms derived from corpus statistics. Extending BM25-V to multimodal search systems and dense-index scale-up with intelligent pruning are promising directions.
Conclusion
BM25-V demonstrates that SAE-derived sparse visual words, scored with Okapi BM25, constitute an efficient, interpretable, and high-recall retrieval channel for images. The Zipfian distribution of visual word activations validates IDF weighting as a principled signal-suppression mechanism. The proposed two-stage pipeline achieves near-dense retrieval accuracy with minimal overhead and provides token-level attribution for each retrieval event. These results establish BM25-V as an effective solution for real-world large-scale image retrieval and motivate further exploration of sparse autoencoder paradigms in visual and multimodal IR.