Layer-wise Token Compression for Efficient Document Reranking

Published 20 May 2026 in cs.IR | (2605.20683v1)

Abstract: Transformer-based document cross-encoder rerankers are a central component of modern information retrieval systems. Despite their success, these models suffer from high computational costs due to processing long query-document sequences at inference time. A known approach to improve efficiency is token compression, which consists of aggregating groups of tokens together in the initial embedding layer, reducing the effective number of tokens, and making the computation faster. While token compression has proven to be successful for bi-encoder retrievers, we empirically observed that this approach may be ineffective for cross-encoder rerankers. In this paper, we propose Layer-wise Token Compression (LTC), which applies adaptive token pooling at intermediate transformer layers. Through extensive ablation studies on MS MARCO passage and document ranking tasks, we demonstrate that compression at middle layers preserves ranking quality while increasing inference QPS by up to 25% for passage ranking and up to 116% for document ranking. We also extend LTC to listwise LLM rerankers and show that the same approach can be easily applied to long-context listwise reranking, where the QPS improvements are even greater. More surprisingly, when applying rerankers trained on short passages to long-document ranking tasks, models trained with compression outperform their uncompressed counterparts, suggesting that compression may act as a beneficial regularizer that encourages length-invariant representations.

Summary

  • The paper introduces LTC, which compresses tokens in intermediate layers to retain early query–document interactions while boosting efficiency.
  • It demonstrates that applying compression at middle layers achieves significant throughput gains (up to 116% increase in QPS) with minimal nDCG@10 loss.
  • Compression appears to act as a regularizer: models trained with it generalize well to longer documents and to out-of-domain datasets.

Layer-wise Token Compression for Efficient Transformer-based Reranking

Introduction

Transformer-based cross-encoder rerankers deliver state-of-the-art performance in document and passage retrieval but remain computationally prohibitive for production environments due to quadratic self-attention complexity with respect to input sequence length. While input-level token compression has proven effective for bi-encoder architectures, transfer of these strategies to cross-encoder rerankers yields major effectiveness degradation, attributable to disruption of early query–document token interactions crucial for relevance modeling. The present work introduces Layer-wise Token Compression (LTC), in which adaptive token pooling is performed in intermediate transformer layers, maintaining fine-grained interaction in the lower stack while enabling significant acceleration in later layers.

Methodology: Layer-wise Token Compression

LTC operates by compressing token representations at a designated transformer layer $\ell^*$, after early layers process the full input to capture fine-grained matching signals. The compression module $\mathcal{C}$ employs 1D adaptive average pooling, reducing the sequence length by a user-selectable rate $r$ (i.e., retaining $n' = \lfloor n \cdot r \rfloor$ of the original $n$ tokens). Attention masks and position indices are subsequently adjusted to reflect the compressed sequence.
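The pooling step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the window rule is the common adaptive-average-pooling convention (output $i$ averages inputs $\lfloor i n / n' \rfloor$ through $\lceil (i+1) n / n' \rceil - 1$), and renumbering positions as $0, \dots, n'-1$ is an assumption, since the summary does not say how position indices are adjusted.

```python
import math

def adaptive_avg_pool_1d(hidden, out_len):
    """Average-pool a list of n vectors down to out_len vectors.

    Output i averages inputs floor(i*n/out_len) .. ceil((i+1)*n/out_len) - 1.
    """
    n = len(hidden)
    if not 1 <= out_len <= n:
        raise ValueError("out_len must be between 1 and the sequence length")
    dim = len(hidden[0])
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # integer ceil((i+1)*n/out_len)
        window = hidden[start:end]
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled

def compress(hidden, rate):
    """Keep n' = floor(n * rate) pooled tokens (at least one: an assumption)."""
    n_out = max(1, math.floor(len(hidden) * rate))
    pooled = adaptive_avg_pool_1d(hidden, n_out)
    attention_mask = [1] * n_out        # every pooled token is a real token
    position_ids = list(range(n_out))   # assumed renumbering 0..n'-1
    return pooled, attention_mask, position_ids

# Ten one-dimensional "token vectors" compressed with r = 0.4 give four tokens.
tokens = [[float(i)] for i in range(10)]
pooled, mask, positions = compress(tokens, 0.4)
print(pooled)  # [[1.0], [3.0], [6.0], [8.0]]
```

In a real model this step would run on the hidden states leaving layer $\ell^*$, and the remaining layers would see only the pooled sequence.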

For listwise LLM rerankers, LTC distinguishes query and instruction tokens from the documents and applies compression selectively to document-token positions using a document mask, so that tokens from different documents are never pooled together. The result is a framework compatible with both pointwise and listwise transformer-based reranking architectures.
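A sketch of the selective variant follows, under stated assumptions. The summary mentions only a document mask; the per-token `segment_ids` list used here (0 for instruction and query tokens, k ≥ 1 for the k-th document) is a hypothetical representation chosen so that each document span can be pooled on its own. Function names are likewise illustrative.

```python
import math
from itertools import groupby

def pool_span(span, out_len):
    """Adaptive average pooling of one contiguous span of token vectors."""
    n, dim = len(span), len(span[0])
    out = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # integer ceiling
        window = span[start:end]
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out

def compress_documents_only(hidden, segment_ids, rate):
    """Pool each document span separately; leave segment 0 untouched."""
    out_hidden, out_segments = [], []
    pos = 0
    for seg, run in groupby(segment_ids):
        length = len(list(run))
        span = hidden[pos:pos + length]
        pos += length
        if seg != 0:  # document tokens: compress within this span only
            span = pool_span(span, max(1, math.floor(length * rate)))
        out_hidden.extend(span)
        out_segments.extend([seg] * len(span))
    return out_hidden, out_segments

# Two query tokens followed by two four-token documents, compressed with r = 0.5.
tokens = [[float(i)] for i in range(10)]
segments = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
new_tokens, new_segments = compress_documents_only(tokens, segments, 0.5)
print(new_tokens)    # [[0.0], [1.0], [2.5], [4.5], [6.5], [8.5]]
print(new_segments)  # [0, 0, 1, 1, 2, 2]
```

Because pooling windows never cross a span boundary, no pooled vector mixes content from two documents or from the query.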

Empirical Results: Pointwise Reranking

Comprehensive evaluation was performed using Qwen3-0.6B (28 layers) as a pointwise reranker trained on MS MARCO passage datasets. Effectiveness (nDCG@10) and inference throughput (queries per second, QPS) were monitored on the TREC Deep Learning 2019/2020 test sets.
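For readers unfamiliar with the two metrics, here is a small sketch of how they are typically computed. It assumes a linear gain (the graded relevance label itself) and a $\log_2$ rank discount, which is one common nDCG convention; the summary does not state which variant or evaluation tool was used, and the function names are illustrative.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k gains (rank 1 has discount 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc id -> graded relevance label."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def queries_per_second(num_queries, elapsed_seconds):
    """Throughput: queries fully reranked per second of wall-clock time."""
    return num_queries / elapsed_seconds

qrels = {"d1": 3, "d2": 2, "d3": 0}
print(ndcg_at_k(["d1", "d2", "d3"], qrels))  # 1.0 for a perfect ranking
print(queries_per_second(200, 4.0))          # 50.0
```

The reported nDCG@10 would be the mean of the per-query values over the test set.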

Early-stage compression, particularly at the embedding or initial transformer layers, was found to dramatically degrade ranking performance. Aggressively compressing at layer 2 with $r = 0.2$ reduced nDCG@10 from 0.727 to 0.638 on DL19, a statistically significant 12.2% decrease. In contrast, compression applied at intermediate layers ($\ell^* = 8$–$14$) with moderate rates ($r = 0.4$–$0.8$) preserved effectiveness nearly indistinguishable from baseline while delivering significant speedups, e.g., a 25% increase in QPS without measurable effectiveness loss (Figure 1).

Figure 1: Passage ranking with LTC—nDCG@10 (left) and QPS (right) as a function of compression rate and target layer.
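A back-of-the-envelope cost model can help explain why both the target layer and the input length affect the speedup. The sketch below is a toy model, not something taken from the paper: it assumes each layer costs $n^2 + \alpha n$ units (a quadratic attention term plus a linear term), where `linear_weight` ($\alpha$) is a hypothetical constant, and it ignores batching, memory bandwidth, and every other overhead. It should not be expected to reproduce the reported QPS figures.

```python
import math

def layer_cost(n, linear_weight):
    """Toy per-layer cost: quadratic attention term plus a linear term."""
    return n * n + linear_weight * n

def estimated_speedup(num_layers, compress_layer, rate, seq_len, linear_weight):
    """Ratio of uncompressed cost to cost with compression after compress_layer."""
    full = layer_cost(seq_len, linear_weight)
    reduced = layer_cost(math.floor(seq_len * rate), linear_weight)
    before = compress_layer * full                  # layers seeing all n tokens
    after = (num_layers - compress_layer) * reduced  # layers seeing n' tokens
    return (num_layers * full) / (before + after)

# 28 layers as in Qwen3-0.6B; alpha = 1000 is an arbitrary illustrative value.
for layer in (8, 14):
    for n in (512, 4096):
        s = estimated_speedup(28, layer, 0.4, n, linear_weight=1000.0)
        print(f"layer={layer:2d} n={n:4d} estimated speedup={s:.2f}x")
```

Under these assumptions, compressing earlier leaves more layers running on the short sequence, and longer inputs benefit more because the quadratic term dominates, which is qualitatively consistent with the larger gains reported for document ranking.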

The embedding-layer Jasper approach (Zhang et al., 18 Nov 2025) achieves QPS comparable to LTC, but its nDCG@10 is substantially lower (0.5832 vs. 0.727), supporting the hypothesis that token interactions captured in early layers are indispensable in cross-encoder architectures.

Applying LTC-trained Qwen3 models to document ranking (longer inputs, unseen during training) showed that compression not only improved QPS (by up to 116%) but also improved or maintained effectiveness compared to baseline (Figure 2). This indicates strong robustness to input length and supports the interpretation that LTC acts as a regularizer promoting length-invariant representations.

Figure 2: Document ranking performance and QPS for LTC-trained rerankers; middle-layer compression outperforms baseline for long-sequence inputs.

An ablation showed that applying LTC only at inference, without compression-aware training, caused catastrophic performance degradation under aggressive configurations (Figure 3). Effective LTC therefore requires compression-aware finetuning to adapt model representations to token loss.

Figure 3: Zero-shot inference-only LTC yields severe breakdown in nDCG@10 versus LTC-aware trained models, especially for aggressive early-layer compression.

Zero-shot BEIR evaluation further indicated that LTC does not hinder out-of-domain generalization; in fact, mild compression slightly improved mean nDCG@10 across datasets such as NFCorpus, FiQA, and TREC-COVID, likely because regularization from compression counteracts overfitting.

Listwise Reranking with Large LMs

LTC was extended to listwise reranking with Mistral-7B-Instruct models. In this configuration, only document tokens are compressed, while instruction and query tokens remain unaltered. On MS MARCO passage and document ranking tasks, most LTC settings improved or preserved effectiveness with substantial throughput increases (+27% to +73% QPS), and several configurations delivered statistically significant nDCG@10 gains over the baseline (Figures 4 and 5).

Figure 4: Listwise passage ranking: nDCG@10 (left) and QPS (right) as a function of LTC configuration; several compressed settings provide significant effectiveness and efficiency improvements.

Figure 5: Listwise document ranking: effectiveness and QPS; selective LTC delivers throughput improvements up to nearly 2x with minimal or no effectiveness loss.

Implications and Future Directions

The LTC strategy demonstrates that efficiency/effectiveness trade-offs in transformer reranking can be controlled by selecting appropriate compression rates and target layers. In production IR deployments, where model throughput remains critical, LTC provides a principled pathway to substantial cost reduction while preserving SOTA effectiveness, and it supports robust generalization to longer documents and out-of-distribution datasets.

Theoretically, the result that LTC acts as a regularizer for robustness warrants deeper analysis. Future research directions include: (1) automating layer and rate selection (e.g., gradient-based or neural architecture search), (2) integrating LTC with orthogonal compression strategies such as quantization, KV cache compression, or prompt reduction, and (3) probing the representational invariances induced by LTC via attention and hidden norm analyses.

Conclusion

Layer-wise Token Compression (LTC) offers a robust and empirically validated approach for accelerating transformer-based reranking systems. By situating token pooling at intermediate transformer layers, LTC preserves necessary early token interactions and enables aggressive sequence length reduction for downstream blocks, producing up to 116% improvement in computational throughput without sacrificing retrieval effectiveness. LTC’s architectural generality allows immediate application to listwise reranking with large LMs, and the observed regularization effects support robustness under length variation and domain shift. Broad adoption of LTC across industrial and research retrieval stacks appears both practical and advantageous.
