Efficient Listwise Reranking with Compressed Document Representations

Published 29 Apr 2026 in cs.IR | (2604.26483v1)

Abstract: Reranking, the process of refining the output from a first-stage retriever, is often considered computationally expensive, especially when using LLMs. A common approach to mitigate this cost involves utilizing smaller LLMs or controlling input length. Inspired by recent advances in document compression for retrieval-augmented generation (RAG), we introduce RRK, an efficient and effective listwise reranker compressing documents into multi-token fixed-size embedding representations. Our simple training via distillation shows that this combination of rich compressed representations and listwise reranking yields a highly efficient and effective system. In particular, our 8B-parameter model runs 3x-18x faster than smaller rerankers (0.6-4B parameters) while matching or outperforming them in effectiveness. The efficiency gains are even more striking on long-document benchmarks, where RRK widens its advantage further.



Summary

The paper introduces RRK, a listwise reranker trained jointly with a document compressor, which uses soft, LLM-driven compression to capture semantic relevance.
RRK runs 3x–18x faster than smaller rerankers, with larger gains on long-document benchmarks, while maintaining or improving nDCG scores compared to state-of-the-art models.
The framework decouples reranking cost from document length, offering robust performance across diverse IR benchmarks and practical scalability in latency-sensitive scenarios.

Efficient Listwise Reranking with Compressed Document Representations: Technical Summary

Motivation and Context

Reranking is central to Information Retrieval (IR) pipelines: it refines the set of documents returned by a first-stage retriever to produce more precise rankings for user queries. Despite the strong performance of LLM-based rerankers, the quadratic attention complexity of processing textual inputs constrains their efficiency, especially for long documents or high-throughput applications. Recent advances in retrieval-augmented generation (RAG) have introduced document compression techniques to mitigate these computational costs, prompting exploration of their utility in reranking settings.

RRK Framework: Core Innovations

The RRK (compressed ReRanKer) framework integrates document compression, specifically leveraging multi-token embeddings produced by a LoRA-finetuned PISCO model, with a listwise reranking approach. RRK operates in two phases: offline, documents are compressed into fixed-size memory token embeddings; online, these embeddings are concatenated with the query for input to a listwise LLM reranker.
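The online phase described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `build_reranker_input` and `memory_store` are hypothetical, embeddings are plain Python lists, and any separator or prompt tokens the actual model adds are omitted.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def build_reranker_input(
    query_embeddings: Sequence[Vector],
    candidate_ids: Sequence[str],
    memory_store: Dict[str, List[Vector]],
) -> List[Vector]:
    """Concatenate query token embeddings with precomputed memory embeddings.

    memory_store stands in for the offline index: it maps each document id
    to its fixed number of memory-token embeddings (hypothetical layout).
    """
    sequence: List[Vector] = list(query_embeddings)
    for doc_id in candidate_ids:
        # No document text is processed online; only the stored vectors are read.
        sequence.extend(memory_store[doc_id])
    return sequence


# Toy setup: 3 documents, 2 memory tokens each, embedding dimension 4.
DIM = 4
NUM_MEMORY_TOKENS = 2
memory_store = {
    "d1": [[0.1] * DIM for _ in range(NUM_MEMORY_TOKENS)],
    "d2": [[0.2] * DIM for _ in range(NUM_MEMORY_TOKENS)],
    "d3": [[0.3] * DIM for _ in range(NUM_MEMORY_TOKENS)],
}
query_embeddings = [[1.0] * DIM for _ in range(5)]  # a 5-token query

sequence = build_reranker_input(query_embeddings, ["d2", "d1", "d3"], memory_store)
print(len(sequence))  # 5 query positions + 3 documents * 2 memory tokens = 11
```

The input length depends on the number of candidates and memory tokens, not on how long the original documents were, which is the property the framework exploits.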

Critically, RRK diverges from prior works that rely on IR-based embeddings (e.g., PE-Rank, E2RANK) by employing soft compression learned directly from LLMs, yielding semantically rich, task-adaptive representations. The compressor and the listwise reranker are trained jointly via distillation from a teacher reranker, so the ranking loss is backpropagated into the compressor and the compressed representations are shaped by the ranking objective.
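To make the distillation idea concrete, here is a small sketch of one common listwise distillation objective: the KL divergence between the teacher's and the student's softmax distributions over the candidate list. The summary does not state which loss RRK actually uses, so the choice of KL, the `temperature` parameter, and the function names are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of candidate scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kl_distillation(
    teacher_scores: Sequence[float],
    student_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over one query's candidate list (assumed loss)."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [4.0, 1.0, 0.5]        # teacher reranker scores for three candidates
student_good = [3.9, 1.1, 0.4]   # student that roughly agrees with the teacher
student_bad = [0.5, 1.0, 4.0]    # student that reverses the ranking

print(listwise_kl_distillation(teacher, student_good))
print(listwise_kl_distillation(teacher, student_bad))
```

In joint training, the gradient of such a loss would flow through the reranker into the compressor; the pure-Python version here only computes the loss value.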

From a complexity perspective, RRK reduces the attention cost from $O((|q|+k|d|)^2)$ to $O((2|q|+k(l+1))^2)$, where $|q|$ is the query length, $k$ the number of candidate documents, $|d|$ the document length, and $l$ the number of memory tokens per document. Because $l$ is much smaller than $|d|$, this yields substantial efficiency gains.
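A quick way to get a feel for these expressions is to plug in numbers. The sketch below evaluates the two sequence lengths and the ratio of their squares. The values for query length, candidate count, and document length are hypothetical, and the ratio covers only the quadratic attention term, so it is not a prediction of measured wall-clock speedup.

```python
def full_text_length(query_len: int, num_docs: int, doc_len: int) -> int:
    """Sequence length when documents are fed as text: |q| + k|d|."""
    return query_len + num_docs * doc_len


def compressed_length(query_len: int, num_docs: int, memory_tokens: int) -> int:
    """Sequence length with compressed documents: 2|q| + k(l + 1)."""
    return 2 * query_len + num_docs * (memory_tokens + 1)


def attention_cost_ratio(
    query_len: int, num_docs: int, doc_len: int, memory_tokens: int
) -> float:
    """Ratio of the quadratic attention terms (full text / compressed)."""
    full = full_text_length(query_len, num_docs, doc_len)
    compressed = compressed_length(query_len, num_docs, memory_tokens)
    return (full / compressed) ** 2


# Hypothetical setting: 50 candidates of 512 tokens each, 8 memory tokens.
print(full_text_length(32, 50, 512))   # 32 + 50 * 512 = 25632
print(compressed_length(32, 50, 8))    # 2 * 32 + 50 * 9 = 514

short_query_ratio = attention_cost_ratio(32, 50, 512, 8)
long_query_ratio = attention_cost_ratio(512, 50, 512, 8)
print(round(short_query_ratio), round(long_query_ratio))
```

Because the query term is not compressed, the ratio shrinks as the query grows toward document length, which illustrates why the advantage depends on queries being short.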

Empirical Results and Numerical Performance

Extensive evaluation across standard IR benchmarks (TREC-DL 2019/2020, BeIR) reveals several key results:

Efficiency: RRK's 8B-parameter model runs 3x-18x faster than smaller state-of-the-art rerankers (0.6-4B parameters), including both listwise and pointwise variants, and up to 17x faster on long-document datasets such as MS MARCO DOC DL19/20.
Effectiveness: RRK matches or outperforms baselines in nDCG@10, achieving 58.4 (BeIR average) with compressed representations, versus ModernBERT-large (57.2) and Qwen3-4B (58.4-60.2 with costly input lengths).
Compression Robustness: RRK attains up to 256x compression and maintains effectiveness on long documents, an attribute not preserved by competing methods; effectiveness does not degrade with increased document length.
Reranker Comparison: RRK consistently outperforms PE-Rank and achieves comparable or better effectiveness than E2RANK, while running substantially faster due to the avoidance of decoding steps and expensive query computations.
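For readers unfamiliar with the effectiveness metric reported above, nDCG@10 can be computed as in the sketch below. It uses the linear-gain form of DCG (some evaluations use an exponential gain instead) and, as a simplifying assumption, derives the ideal ranking from the same list of graded judgments, so it presumes every relevant document appears in the ranked list. Reported figures such as 58.4 correspond to this value multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Graded relevance labels in the order the reranker returned the documents.
system_ranking = [3, 2, 0, 1, 0]
print(ndcg_at_k(system_ranking, k=10))
print(ndcg_at_k([3, 2, 1, 0, 0], k=10))  # 1.0, already in ideal order
```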

The training regime employs distillation from robust teacher models (SPLADE-V3, Jina-v3), using datasets such as MS MARCO and BGE-M3, with combined training leading to competitive generalization across benchmarks. Ablations confirm that joint fine-tuning of compressor and reranker is essential for maintaining high performance; frozen compressor setups yield substantial drops in effectiveness.

Technical Analysis and Distinctiveness

RRK's efficiency is tightly bound to its input compression strategy, which decouples reranking cost from document length at the price of a storage overhead for the precomputed embeddings (e.g., 230GB for the MS MARCO collection with float16 encoding and 8 memory tokens). Quantization can reduce this overhead. The framework's scalability and generalization are demonstrated by its strong performance on both short and long-document datasets, suggesting robustness with respect to document heterogeneity.
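The storage cost of a precomputed embedding index follows from simple arithmetic: documents times memory tokens times embedding dimension times bytes per value. The sketch below illustrates this with hypothetical numbers; the collection size and `hidden_dim` are assumptions chosen for a round example and are not an attempt to reproduce the 230GB figure.

```python
def embedding_store_bytes(
    num_docs: int, memory_tokens: int, hidden_dim: int, bytes_per_value: int
) -> int:
    """Raw size of an index holding memory_tokens vectors per document."""
    return num_docs * memory_tokens * hidden_dim * bytes_per_value


# Hypothetical collection: 1M documents, 8 memory tokens, 4096-dim vectors.
float16_size = embedding_store_bytes(1_000_000, 8, 4096, bytes_per_value=2)
int8_size = embedding_store_bytes(1_000_000, 8, 4096, bytes_per_value=1)

print(float16_size / 1e9)  # 65.536 GB with float16
print(int8_size / 1e9)     # 32.768 GB with 8-bit quantization
```

The size scales linearly in each factor, so halving the bytes per value or the number of memory tokens halves the index, independent of the original document lengths.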

Theoretical implications include the demonstration that LLM-driven soft compression yields finer-grained representations than those inherited from first-stage dense retrieval models, supporting the hypothesis that listwise reranking objectives synergize with task-adaptive compression to preserve semantic fidelity.

Practical Implications and Limitations

Practically, RRK makes large-scale LLM reranking viable for latency-sensitive deployments and IR tasks facing long-context requirements. The approach enables configurations where larger models (8B) operate at speeds previously only accessible to much smaller architectures, broadening the feasible application domains for LLM rerankers.

Limitations persist. The efficiency advantage is contingent upon short queries; when query length is comparable to document length (as in the BRIGHT dataset), RRK's speed gains diminish. The approach requires substantial storage for the compressed representations, and smaller models (1-4B parameters) currently cannot be used without sacrificing effectiveness, indicating that further research is needed on low-dimensional LLM compression and quantized embedding storage.

Comparison to Prior Art and Future Directions

RRK differs from PE-Rank and E2RANK in that it avoids reliance on fixed retrieval embeddings and instead learns compression jointly with reranking from LLMs. Those approaches pursue the same efficiency goal but trade semantic expressiveness for computational practicality. RRK's empirical results indicate that task-adaptive, multi-token compression can overcome the effectiveness degradation typically observed in IR-only compression paradigms.

Future research directions include improving RRK performance with smaller models, reducing storage overhead with advanced quantization schemes, and extending the approach to handle longer and more complex queries. RRK-style compression could also be adapted to cross-task retrieval scenarios, potentially bringing gains in speed and effectiveness to both standard retrieval and retrieval-augmented generation settings.

Conclusion

The RRK framework advances efficient reranking by tightly integrating LLM-based compressed document representations with listwise ranking objectives. Empirical and theoretical analysis demonstrates that RRK dramatically reduces reranking latency without compromising effectiveness, offering a scalable solution for IR pipelines confronted by demanding efficiency constraints. The findings underline the value of soft, task-adaptive compression in preserving relevance signals, and set the stage for further exploration of compressed representation learning in large-scale AI retrieval systems.
