FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval

Published 31 Mar 2026 in cs.IR and cs.CL | (2604.00242v1)

Abstract: Document retrieval identifies relevant documents but does not provide fine-grained evidence cues, such as specific relevant spans. A possible solution is to apply an LLM after retrieval; however, this introduces significant computational overhead and limits practical deployment. We propose FGR-ColBERT, a modification of the ColBERT retrieval model that integrates fine-grained relevance signals distilled from an LLM directly into the retrieval function. Experiments on MS MARCO show that FGR-ColBERT (110M) achieves a token-level F1 of 64.5, exceeding the 62.8 of Gemma 2 (27B), despite being approximately 245 times smaller. At the same time, it preserves retrieval effectiveness (99% relative Recall@50) and remains efficient, incurring only a ~1.12x latency overhead compared to the original ColBERT.


Authors (2)

Summary

The paper presents a lightweight ColBERT extension that integrates token-level relevance attribution to match LLM-extracted evidence spans.
It employs a joint training objective that combines binary cross-entropy on token-level labels with KL divergence on document-level scores, adding span identification while preserving retrieval effectiveness.
The model achieves a token-level F1 of 64.5 with a small latency overhead (≈1.12x) while keeping Recall@50 at 97.1, about 99% of the original ColBERT's.

Fine-Grained Relevance Extraction in Retrieval: An Analysis of FGR-ColBERT

Introduction

The task of identifying relevant documents in response to a query, which is core to information retrieval (IR), has advanced considerably with multi-vector dense retrievers such as ColBERT, which leverage efficient bi-encoder architectures. However, for applications that demand explainability or direct evidence, document-level granularity is insufficient; users and downstream systems often require fine-grained evidence spans (e.g., specific sentences or phrases) that directly answer or support the query. Recent approaches have offloaded this function to LLMs via post-retrieval reranking or span annotation, which adds significant latency and computational cost.

"FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval" (2604.00242) introduces FGR-ColBERT, a minimal extension to ColBERT that enables fine-grained, token-level relevance estimation directly within the retrieval process, sidestepping the need for separate heavyweight LLM calls. On human-annotated spans, the model slightly exceeds the token-level F1 of Gemma 2 (27B) (64.5 vs. 62.8) at a ~245x smaller parameter count, while adding only a small latency overhead (≈1.12x) and a minimal drop in recall.

Approach: Integrating Fine-Grained Supervision into ColBERT

FGR-ColBERT modifies the late interaction mechanism of ColBERT. Instead of limiting the aggregation to document-level relevance, the model also computes per-token relevance probabilities aligned with span-level cues. Both query and document tokens are first transformed by a lightweight feed-forward network with residual connections. The max-similarity aggregation is then applied along the other axis: where standard ColBERT takes, for each query token, the maximum similarity over document tokens, here each document token receives its maximum similarity over all query tokens, and that score is passed through a sigmoid to yield a token-level relevance probability. The resulting vector of probabilities directly marks which document tokens belong to an evidence span.
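The two readouts described above can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: it assumes token embeddings have already been computed, omits the feed-forward transform, and applies the sigmoid to the raw maximum similarity with no scaling or bias (the source does not specify these details). All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(query_vecs, doc_vecs):
    """sim[i][j] = dot product of query token i and document token j."""
    return [[dot(q, d) for d in doc_vecs] for q in query_vecs]

def document_score(sim):
    """Standard ColBERT late interaction: for each query token take the
    maximum over document tokens, then sum over query tokens."""
    return sum(max(row) for row in sim)

def token_relevance(sim):
    """Token-level readout: for each document token take the maximum over
    query tokens, then squash with a sigmoid (assumed: no scale or bias)."""
    n_doc = len(sim[0])
    max_per_doc_token = [max(row[j] for row in sim) for j in range(n_doc)]
    return [1.0 / (1.0 + math.exp(-s)) for s in max_per_doc_token]

# Toy embeddings: 2 query tokens, 3 document tokens, dimension 2.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[2.0, 0.0], [0.0, -3.0], [0.5, 0.5]]

sim = similarity_matrix(query_vecs, doc_vecs)
score = document_score(sim)   # one number per (query, document) pair
probs = token_relevance(sim)  # one probability per document token
```

Both readouts share the same similarity matrix, so in this sketch the token-level probabilities need no additional similarity computation beyond what document scoring already requires.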

Figure 1: (a) Traditional ColBERT with post-hoc LLM extraction versus FGR-ColBERT's integrated relevance supervision. (b) Late interaction modified for token-level scoring without sacrificing document-level precision.

Training is conducted with a joint objective: document-level distillation loss (KL divergence to a cross-encoder re-ranker) is combined with a binary cross-entropy loss on token-level signals distilled from LLM-labeled evidence spans. Supervision for token-level relevance is applied exclusively to positive (plausible match) instances, intentionally biasing the model towards always producing at least one evidence span per passage.
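A minimal sketch of such a joint objective follows, in plain Python. The source does not give the loss weighting, the direction of the KL term, or the batching scheme, so `token_weight`, the use of KL(teacher || student) over softmax-normalized scores, and the simple averaging are all assumptions made for illustration; function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) over the candidate passages of one query."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over the tokens of one passage."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def joint_loss(teacher_scores, student_scores, token_probs, token_labels,
               is_positive, token_weight=1.0):
    """Document-level KL plus token-level BCE on positive passages only."""
    doc_loss = kl_divergence(teacher_scores, student_scores)
    positives = [i for i, pos in enumerate(is_positive) if pos]
    if not positives:
        return doc_loss
    tok_loss = sum(token_bce(token_probs[i], token_labels[i])
                   for i in positives) / len(positives)
    return doc_loss + token_weight * tok_loss

# Two candidate passages for one query; only the first is a positive.
loss = joint_loss(
    teacher_scores=[2.0, 0.0],
    student_scores=[2.0, 0.0],
    token_probs=[[0.5, 0.5], [0.9, 0.1]],
    token_labels=[[1, 0], [0, 0]],
    is_positive=[True, False],
)
```

Because the token-level term skips negatives, the token probabilities predicted for the second passage have no effect on the loss in this example.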

Dataset Construction via LLM Distillation

Fine-grained annotation at scale is prohibitively expensive if conducted manually. The authors leverage Gemma 2, a 27B-parameter LLM with strong alignment to human span annotation, to construct the MS-MARCO-Gemma datasets for both training and development. For deeper validation, a smaller set with triple human annotation is compiled from MS MARCO’s dev split.

LLM outputs are supplied as token-level supervision, providing the "ground truth" for the binary cross-entropy span identification head. This design enables effective transfer of LLM capabilities to a much smaller and retriever-aligned model, without the inefficiency of querying the LLM at retrieval time.
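One way to turn LLM-extracted spans into token-level labels is sketched below. The source does not describe the LLM's output format or the alignment procedure, so this assumes the spans arrive as verbatim substrings of the passage, and it uses a whitespace tokenizer as a stand-in for the retriever's subword tokenizer. All function names are hypothetical.

```python
import re

def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, start, end) triples.
    Stand-in for the retriever's actual subword tokenizer."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def find_span(passage, span_text):
    """Character offsets of the first verbatim occurrence, or None."""
    start = passage.find(span_text)
    return None if start < 0 else (start, start + len(span_text))

def spans_to_token_labels(passage, span_texts):
    """Label a token 1 if it overlaps any located span, else 0."""
    char_spans = [s for s in (find_span(passage, t) for t in span_texts) if s]
    labels = []
    for _, tok_start, tok_end in tokenize_with_offsets(passage):
        overlaps = any(tok_start < e and tok_end > s for s, e in char_spans)
        labels.append(1 if overlaps else 0)
    return labels

passage = "The Eiffel Tower is in Paris and opened in 1889."
labels = spans_to_token_labels(passage, ["opened in 1889"])
```

Matching on character offsets rather than token strings keeps the earlier, unrelated "in" unlabeled; spans that cannot be located verbatim are skipped in this sketch rather than guessed.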

Results: Efficiency, Effectiveness, and Relevance Plausibility

Quantitative evaluation demonstrates three main findings:

Token-level plausibility (F1): FGR-ColBERT achieves a token-level F1 of 64.5 on human-annotated data, compared to 62.8 for Gemma 2 (27B) and 51.7 for the original ColBERT.
Retrieval recall: Recall@50 is 97.1 versus 98 for the original ColBERT, a drop of roughly one point (about 99% relative Recall@50) after adding span identification.
Model efficiency: The architecture incurs no index size increase (as transformations are done on-the-fly) and only ~1.12x latency overhead, as confirmed by measured inference times.

Figure 2: Qualitative demonstration of three passage-query examples, highlighting FGR-ColBERT’s per-token scores in strong alignment with LLM-derived relevance spans.

Qualitative analysis confirms that high token-level scores are assigned to truly relevant evidence phrases, with spurious or irrelevant text receiving appropriately low scores. These fine-grained cues are useful for explanation, post-hoc answer extraction, and explainable AI systems.
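For reference, a token-level F1 of the kind reported above can be computed as in the following sketch. The exact evaluation protocol (decision threshold, per-passage versus pooled averaging) is not given in the source; the 0.5 threshold and the single-passage computation here are assumptions, and the function name is hypothetical.

```python
def token_f1(pred_probs, gold_labels, threshold=0.5):
    """F1 between thresholded token predictions and binary gold labels
    for a single passage."""
    pred = [1 if p >= threshold else 0 for p in pred_probs]
    tp = sum(1 for p, g in zip(pred, gold_labels) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold_labels) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold_labels) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Four tokens: two correct hits, one miss, one false alarm.
f1 = token_f1([0.9, 0.8, 0.2, 0.6], [1, 1, 1, 0])
```

In this toy case precision and recall are both 2/3, so the F1 is also 2/3.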

Implications and Future Directions

FGR-ColBERT bridges the gap between fast multi-vector retrieval and evidence-oriented inference, directly exposing token-level explanations as a byproduct of retrieval. The approach demonstrates that model distillation from LLMs for fine-grained supervision can be accomplished in lightweight architectures (~110M parameters), making advanced IR systems more deployable in latency-critical or resource-constrained settings.

Potential avenues for future work include:

Robustness Evaluation: Testing transfer and generalization on heterogeneous benchmarks such as BEIR, to assess the broad applicability of LLM-distilled fine-grained signals.
Long-context Retrieval: Extending to long-document settings (e.g., with LongEmbed-like architectures) where span identification is more challenging and beneficial.
Interactive and Explainable QA: Leveraging token-level relevance for answer highlighting, rationale generation, or verdict justification in human-facing applications.
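As a rough illustration of the answer-highlighting use case, token-level probabilities can be turned into highlighted spans by grouping consecutive above-threshold tokens. This is a hypothetical post-processing sketch, not something described in the source; the threshold, the bracket markup, and the function name are assumptions.

```python
def highlight(tokens, probs, threshold=0.5):
    """Wrap each run of consecutive above-threshold tokens in brackets."""
    pieces = []
    run = []
    for tok, p in zip(tokens, probs):
        if p >= threshold:
            run.append(tok)
        else:
            if run:
                pieces.append("[" + " ".join(run) + "]")
                run = []
            pieces.append(tok)
    if run:  # close a run that reaches the end of the passage
        pieces.append("[" + " ".join(run) + "]")
    return " ".join(pieces)

tokens = ["Paris", "is", "the", "capital", "of", "France"]
probs = [0.9, 0.2, 0.1, 0.7, 0.8, 0.95]
highlighted = highlight(tokens, probs)
```

Here `highlighted` is the string "[Paris] is the [capital of France]".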

Conclusion

FGR-ColBERT demonstrates that fine-grained, token-level relevance attribution can be achieved efficiently within dense retrieval architectures by distilling LLM supervision, while maintaining retrieval effectiveness and incurring low computational overhead. The approach points toward fast, explainable retrieval systems that do not need expensive post-hoc LLM inference to identify evidence spans, and it leaves robustness across benchmarks and long-document settings as open directions.
