- The paper introduces RRK, a framework that jointly trains a compressor and a listwise reranker, using soft LLM-driven compression to capture semantic relevance.
- RRK achieves 3x–18x speed improvements on long-document benchmarks while maintaining or improving nDCG scores compared to state-of-the-art models.
- The framework decouples reranking cost from document length, offering robust performance across diverse IR benchmarks and practical scalability in latency-sensitive scenarios.
Efficient Listwise Reranking with Compressed Document Representations: Technical Summary
Motivation and Context
Reranking is central to Information Retrieval (IR) pipelines: it refines the set of documents retrieved in a first stage to produce more precise rankings for user queries. Despite the strong performance of LLM-based rerankers, the quadratic cost of attention over long textual inputs constrains their efficiency, especially for long documents or high-throughput applications. Recent advances in retrieval-augmented generation (RAG) have introduced document compression techniques to mitigate these computational costs, prompting exploration of their utility in reranking settings.
RRK Framework: Core Innovations
The RRK (compressed ReRanKer) framework integrates document compression, specifically leveraging multi-token embeddings produced by a LoRA-finetuned PISCO model, with a listwise reranking approach. RRK operates in two phases: offline, documents are compressed into fixed-size memory token embeddings; online, these embeddings are concatenated with the query for input to a listwise LLM reranker.
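The two-phase flow described above can be sketched as follows. This is only an illustration of the data flow, not the paper's implementation: the mean-pooling `compress_document` is a hypothetical stand-in for the LoRA-finetuned compressor, and the function names, the toy hidden size, and the document lengths are all assumptions made for the example.

```python
import numpy as np

NUM_MEMORY_TOKENS = 8  # l in the text; illustrative value
HIDDEN_DIM = 16        # toy width; a real LLM hidden size is much larger


def compress_document(token_embeddings, num_memory_tokens=NUM_MEMORY_TOKENS):
    """Offline phase (stand-in): map a variable-length document to a fixed
    number of memory embeddings by mean-pooling contiguous token chunks.
    The real compressor is a fine-tuned LLM; pooling only mimics its
    output shape."""
    if token_embeddings.shape[0] < num_memory_tokens:
        raise ValueError("document is shorter than the number of memory tokens")
    chunks = np.array_split(token_embeddings, num_memory_tokens, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


def build_reranker_input(query_embeddings, compressed_docs):
    """Online phase: concatenate the query embeddings with the cached memory
    embeddings of all candidates into one sequence for a listwise reranker."""
    return np.concatenate([query_embeddings, *compressed_docs], axis=0)


rng = np.random.default_rng(0)
# Three documents of very different lengths (40, 300, 1200 tokens).
documents = [rng.normal(size=(n_tokens, HIDDEN_DIM)) for n_tokens in (40, 300, 1200)]
index = [compress_document(doc) for doc in documents]  # computed once, offline
query = rng.normal(size=(5, HIDDEN_DIM))                # embedded at query time
reranker_input = build_reranker_input(query, index)
print(reranker_input.shape)  # (29, 16): 5 query tokens + 3 documents x 8 memory tokens
```

Whatever the original document length, each candidate contributes the same number of positions to the reranker input, which is what lets the online cost stay independent of document length.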
Critically, RRK diverges from prior work that relies on IR-based embeddings (e.g., PE-Rank, E2RANK) by employing soft compression learned directly from LLMs, yielding semantically rich, task-adaptive representations. The compressor and the listwise reranker are trained jointly via distillation from a teacher reranker, with the compressor updated by backpropagating the ranking loss, which enables RRK to capture nuanced relevance dynamics.
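To make the distillation idea concrete, here is a minimal sketch of one common listwise distillation objective: the KL divergence between the teacher's and the student's softmax distributions over the candidate list. The summary does not state which ranking loss RRK uses, so this choice, the function names, and the `temperature` parameter are assumptions for illustration.

```python
import numpy as np


def log_softmax(scores):
    """Numerically stable log-softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def listwise_distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """KL(teacher || student) over the candidates of each query.

    Assumed objective for illustration; not confirmed as the loss used by RRK.
    """
    student = log_softmax(np.asarray(student_scores, dtype=float) / temperature)
    teacher = log_softmax(np.asarray(teacher_scores, dtype=float) / temperature)
    kl_per_list = np.sum(np.exp(teacher) * (teacher - student), axis=-1)
    return float(np.mean(kl_per_list))


teacher = [3.0, 1.0, 0.5]          # teacher reranker scores for 3 candidates
close_student = [2.9, 1.1, 0.4]    # nearly the same ordering and margins
reversed_student = [0.5, 1.0, 3.0] # ordering flipped

print(listwise_distillation_loss(close_student, teacher))
print(listwise_distillation_loss(reversed_student, teacher))
```

In an actual joint-training setup this loss would be computed in an automatic-differentiation framework so that its gradient flows through the reranker into the compressor; NumPy is used here only to show the quantity being minimized.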
From a complexity perspective, RRK reduces the attention scope from $O((|q| + k|d|)^2)$ to $O((2|q| + k(l+1))^2)$, where $l$ (memory tokens) is significantly less than $|d|$ (document length), yielding substantial efficiency gains.
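A small worked calculation shows how the two attention terms compare. It takes k to be the number of candidate documents in the list, as the formula implies; the concrete values of |q|, k, |d|, and l below are illustrative assumptions, not settings from the paper. The ratio covers only the quadratic attention term, so it is not a prediction of wall-clock speedup, which also depends on costs that grow linearly with sequence length.

```python
def attention_cost_full(query_len, num_docs, doc_len):
    """Quadratic attention term when full document text is in the input."""
    return (query_len + num_docs * doc_len) ** 2


def attention_cost_compressed(query_len, num_docs, num_memory_tokens):
    """Quadratic attention term with l memory tokens per document,
    following the expression (2|q| + k(l + 1))^2 from the text."""
    return (2 * query_len + num_docs * (num_memory_tokens + 1)) ** 2


def attention_ratio(query_len, num_docs, doc_len, num_memory_tokens):
    """How many times smaller the compressed attention term is."""
    full = attention_cost_full(query_len, num_docs, doc_len)
    compressed = attention_cost_compressed(query_len, num_docs, num_memory_tokens)
    return full / compressed


# Illustrative setting: 50 candidates, 512-token documents, 8 memory tokens.
short_query_ratio = attention_ratio(query_len=32, num_docs=50, doc_len=512, num_memory_tokens=8)
# Same setting, but with a query as long as a document.
long_query_ratio = attention_ratio(query_len=512, num_docs=50, doc_len=512, num_memory_tokens=8)

print(f"short query: {short_query_ratio:.1f}x smaller attention term")
print(f"long query:  {long_query_ratio:.1f}x smaller attention term")
```

Because the query length enters the compressed term with a factor of 2 and is not itself compressed, the advantage shrinks as queries grow relative to documents.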
Extensive evaluation across standard IR benchmarks (TREC-DL 2019/2020, BeIR) reveals several key results:
- Efficiency: RRK's 8B-parameter model runs 3x-18x faster than smaller state-of-the-art rerankers (0.4-4B), including both listwise and pointwise variants, and up to 17x faster on long-document datasets such as MS MARCO DOC DL19/20.
- Effectiveness: RRK matches or outperforms baselines in nDCG@10, achieving 58.4 (BeIR average) with compressed representations, versus ModernBERT-large (57.2) and Qwen3-4B (58.4-60.2 with costly input lengths).
- Compression Robustness: RRK attains up to 256x compression and maintains effectiveness on long documents, an attribute not preserved by competing methods; effectiveness does not degrade with increased document length.
- Reranker Comparison: RRK consistently outperforms PE-Rank and achieves comparable or better effectiveness than E2RANK, while running substantially faster due to the avoidance of decoding steps and expensive query computations.
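For readers less familiar with the effectiveness metric quoted above, the following is a minimal sketch of nDCG@10 for a single query. It assumes the exponential-gain variant and computes the ideal ranking only from the relevance labels passed in; evaluation toolkits differ on both points, and the summary does not say which variant the paper uses. The reported figures appear to be on a 0 to 100 scale, whereas this function returns a value between 0 and 1.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(position + 1)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance labels of the documents, in the order the reranker returned them.
perfect_ranking = [3, 2, 1, 0]
worst_ranking = [0, 1, 2, 3]

print(ndcg_at_k(perfect_ranking))
print(ndcg_at_k(worst_ranking))
```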
The training regime employs distillation from robust teacher models (SPLADE-V3, Jina-v3), using datasets such as MS MARCO and BGE-M3, with combined training leading to competitive generalization across benchmarks. Ablations confirm that joint fine-tuning of compressor and reranker is essential for maintaining high performance; frozen compressor setups yield substantial drops in effectiveness.
Technical Analysis and Distinctiveness
RRK's efficiency is tightly bound to its input compression strategy, which decouples reranking cost from document length at the price of a storage overhead for the precomputed embeddings (e.g., 230GB for the MS MARCO collection with float16 encoding and 8 memory tokens). This overhead can be reduced through quantization. The framework's scalability and generalization are demonstrated by its strong performance on both short and long-document datasets, suggesting robustness to document heterogeneity.
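The storage cost of a precomputed embedding index follows from simple arithmetic, sketched below. The collection size and hidden dimension are hypothetical round numbers chosen for illustration; they are not the paper's values, and the example does not attempt to reproduce the 230GB figure. Any per-vector metadata that a real quantization scheme might store (such as scale factors) is ignored.

```python
def compressed_index_bytes(num_docs, num_memory_tokens, hidden_dim, bytes_per_value):
    """Raw size of the offline index: one hidden_dim-wide vector per memory token."""
    return num_docs * num_memory_tokens * hidden_dim * bytes_per_value


# Hypothetical collection: 1,000,000 documents, 8 memory tokens, hidden size 4096.
float16_bytes = compressed_index_bytes(1_000_000, 8, 4096, bytes_per_value=2)
int8_bytes = compressed_index_bytes(1_000_000, 8, 4096, bytes_per_value=1)

print(f"float16: {float16_bytes / 1e9:.1f} GB")
print(f"int8:    {int8_bytes / 1e9:.1f} GB")
```

The size grows linearly in each factor, so halving the bytes per value (for example, float16 to 8-bit quantization) or the number of memory tokens halves the index.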
On the theoretical side, the results suggest that LLM-driven soft compression yields finer-grained representations than those inherited from first-stage dense retrieval models, supporting the hypothesis that listwise reranking objectives synergize with task-adaptive compression to preserve semantic fidelity.
Practical Implications and Limitations
Practically, RRK makes large-scale LLM reranking viable for latency-sensitive deployments and IR tasks facing long-context requirements. The approach enables configurations where larger models (8B) operate at speeds previously only accessible to much smaller architectures, broadening the feasible application domains for LLM rerankers.
Limitations persist. The efficiency advantage is contingent upon short queries; when query length matches document length (as in the BRIGHT dataset), RRK's speed gains diminish. The approach requires substantial storage for the compressed representations, and it currently cannot be applied to smaller models (1-4B parameters) without sacrificing effectiveness, indicating that further research is needed in low-dimensional LLM compression and quantized embedding storage.
Comparison to Prior Art and Future Directions
RRK differs from PE-Rank and E2RANK in two respects: it avoids reliance on fixed retrieval embeddings, and it learns compression jointly with reranking from LLMs. PE-Rank and E2RANK pursue the same efficiency goal but exhibit a trade-off between semantic expressiveness and computational practicality. RRK's empirical results indicate that task-adaptive, multi-token compression can overcome the effectiveness degradation typically observed in IR-only compression paradigms.
Future research directions include improving RRK's performance with smaller models, reducing storage overhead with more advanced quantization schemes, and extending the approach to longer and more complex queries. RRK-style compression could also be adapted to cross-task retrieval scenarios, potentially offering gains in speed and effectiveness in both standard and retrieval-augmented generation contexts.
Conclusion
The RRK framework advances efficient reranking by tightly integrating LLM-based compressed document representations with listwise ranking objectives. The reported results show that RRK runs 3x-18x faster than smaller state-of-the-art rerankers while matching or improving their nDCG@10, offering a scalable solution for IR pipelines with demanding efficiency constraints. The findings underline the value of soft, task-adaptive compression in preserving relevance signals, and set the stage for further exploration of compressed representation learning in large-scale retrieval systems.