Multi-Layer Representations in Retrieval

Updated 5 October 2025
  • Multi-layer Representations (MLR) is a method that builds document embeddings by combining hidden states from various encoder layers to capture diverse linguistic features.
  • The approach employs pooling strategies such as self-contrastive, average, and scalar mix pooling to reduce computational overhead while preserving retrieval quality.
  • MLR outperforms traditional dual encoders by aggregating complementary information across layers, resulting in improved tradeoffs between retrieval effectiveness and efficiency.

Multi-layer representations (MLR) are data representations that integrate information from multiple hierarchical layers of a neural architecture or multi-stage algorithmic process. In dense passage retrieval, MLR specifically refers to forming document (or passage) vectors by extracting and combining hidden states from multiple encoder layers, rather than relying solely on the final output layer. This approach acknowledges that different layers of pre-trained language models encode varied linguistic and semantic knowledge, and that collectively these can yield richer, more discriminative document embeddings for information retrieval tasks.

1. Motivation for Multi-Layer Representations in Dense Retrieval

Dense retrieval systems commonly use a dual encoder framework, where both queries and documents are encoded into vectors, with document representations typically drawn from the [CLS] token of the last transformer layer. However, it is empirically established that intermediate layers of deep pre-trained models such as BERT and T5 capture distinct aspects of linguistic phenomena: lower and middle layers encode syntactic and lexical information, while upper layers encode more abstract, task-specific and semantic features. MLR leverages this vertical diversity by constructing representations from a selected set of layers (e.g., layers 10 and 12 in a 12-layer model), thereby capturing a broader range of information and improving retrieval effectiveness (Xie et al., 28 Sep 2025).

2. Construction of Multi-Layer Document Representations

Let $d$ be a document tokenized into $T$ tokens and encoded through an $L$-layer transformer, yielding hidden states $h_i^{(l)} \in \mathbb{R}^D$ for $l = 0, \ldots, L$ and $i = 0, \ldots, T$. Typically, $h_0^{(l)}$ (corresponding to the [CLS] token) is selected per layer. For a subset $S = \{ l_1, l_2, \ldots, l_m \}$ with $l_m = L$, the document is represented by:

$$\mathrm{ED}(d) = \left\{ h_0^{(l)} \mid l \in S \right\}$$

During retrieval, a query $q$ is embedded as $h_q$, and the document similarity is calculated as:

$$\mathrm{sim}(q, d) = \max_{h \in \mathrm{ED}(d)} \left( h_q^\top h \right)$$

This max-inner-product scoring enables multi-vector (MV) retrieval while exploiting the discriminative capacity of multiple layers.
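
As a concrete illustration, the sketch below builds $\mathrm{ED}(d)$ from the [CLS] hidden states of a selected layer subset and scores a query by max inner product. It assumes a Hugging Face BERT-style encoder; the model name, the layer subset $S = \{10, 12\}$, and the helper names (`encode_document`, `encode_query`, `score`) are illustrative choices, not taken from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative choices: a 12-layer BERT encoder and the layer subset S = {10, 12}.
MODEL_NAME = "bert-base-uncased"
LAYERS = (10, 12)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
encoder.eval()

@torch.no_grad()
def encode_document(text: str) -> torch.Tensor:
    """Return ED(d): the [CLS] hidden state from each selected layer, shape (|S|, D)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # hidden_states is a tuple of L+1 tensors (embedding layer + L transformer layers), each (1, T, D).
    hidden_states = encoder(**inputs).hidden_states
    return torch.stack([hidden_states[l][0, 0] for l in LAYERS])  # [CLS] is token position 0

@torch.no_grad()
def encode_query(text: str) -> torch.Tensor:
    """Return the query embedding h_q: the last layer's [CLS] hidden state, shape (D,)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return encoder(**inputs).last_hidden_state[0, 0]

def score(h_q: torch.Tensor, ed: torch.Tensor) -> torch.Tensor:
    """sim(q, d) = max over h in ED(d) of h_q^T h."""
    return (ed @ h_q).max()

# Example usage
doc_ed = encode_document("Dense retrieval encodes passages into vectors.")
h_q = encode_query("how does dense passage retrieval work?")
print(score(h_q, doc_ed))
```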

3. Pooling Strategies for Single-Vector Efficiency

Using multiple vectors per document increases both storage and computational burden, which can be prohibitive for web-scale retrieval systems. To address this, the following pooling strategies compress the multi-layer representations into a single vector $h_p(d)$, retaining the retrieval speed of standard dual encoders without quality loss (a code sketch of these variants follows the list):

  • Self-Contrastive Pooling: At inference, only the last layer's [CLS] embedding $h^{(L)}$ is used. During training, a regularization loss aligns $h^{(L)}$ with all candidate vectors:

$$L_{\text{reg}}(q, d^+) = -\log\left( \frac{\exp\left(h_q^\top h_p(d^+)\right)}{\sum_{h \in \mathrm{ED}(d^+)} \exp\left(h_q^\top h\right)} \right)$$

The total training loss is thus the standard in-batch contrastive loss plus this regularization term.

  • Average Pooling: Computes the simple arithmetic mean:

$$h_p(d) = \frac{1}{m} \sum_{l \in S} h^{(l)}$$

  • Scalar Mix Pooling: Learns a weight vector $a \in \mathbb{R}^m$ with softmax normalization:

$$h_p(d) = \sum_{l \in S} \mathrm{softmax}(a)_l \, h^{(l)}$$

This parameterization allows the network to learn which layers contain the most salient information for retrieval, in a data-driven manner.
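
The pooling variants above can be sketched as follows, assuming the per-layer [CLS] vectors have already been stacked into a tensor of shape (batch, m, D); the `ScalarMix` module and the function names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarMix(nn.Module):
    """Scalar mix pooling: a learned softmax-weighted sum over the selected layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(num_layers))  # a in R^m

    def forward(self, ed: torch.Tensor) -> torch.Tensor:
        # ed: (batch, m, D) stack of per-layer [CLS] vectors.
        weights = F.softmax(self.a, dim=0)               # softmax(a)
        return torch.einsum("m,bmd->bd", weights, ed)    # sum_l softmax(a)_l * h^(l)

def average_pool(ed: torch.Tensor) -> torch.Tensor:
    """Average pooling: h_p(d) = (1/m) * sum over l in S of h^(l)."""
    return ed.mean(dim=1)

def self_contrastive_loss(h_q: torch.Tensor, ed_pos: torch.Tensor) -> torch.Tensor:
    """Regularizer aligning the pooled vector (here, the last layer's [CLS], since l_m = L)
    with all candidate vectors in ED(d+): negative log-softmax over the inner products."""
    h_p = ed_pos[:, -1]                                  # last layer's [CLS] serves as h_p(d+)
    logits = torch.einsum("bd,bmd->bm", h_q, ed_pos)     # h_q^T h for every h in ED(d+)
    target = torch.einsum("bd,bd->b", h_q, h_p)          # h_q^T h_p(d+)
    return (torch.logsumexp(logits, dim=1) - target).mean()

# Example: m = 2 selected layers, D = 768
# ed = torch.randn(4, 2, 768); h_q = torch.randn(4, 768)
# pooled = ScalarMix(num_layers=2)(ed)
# reg = self_contrastive_loss(h_q, ed)
```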

4. Empirical Analysis: Layer Selection and Redundancy

Experiments with BERT and similar encoders reveal that representations from the last few layers (e.g., the top two or four) provide the largest retrieval gains. Adding vectors from more distant lower layers (e.g., uniformly sampling from $\{3, 6, 9, 12\}$) can be beneficial but often introduces redundancy, as these vectors may contain correlated information. In 2- or 4-vector MLR setups, combinations such as $S = \{10, 12\}$ or $S = \{9, 10, 11, 12\}$ yield consistent and robust improvements. Adding too many vectors eventually leads to performance degradation due to this redundancy effect.
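
One simple way to probe such redundancy (a diagnostic sketch of the general idea, not a procedure described in the paper) is to inspect the pairwise cosine similarities among the per-layer [CLS] vectors; layer pairs with near-unit similarity contribute little complementary information.

```python
import torch
import torch.nn.functional as F

def layer_similarity(ed: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between the per-layer [CLS] vectors in ED(d).
    ed: (m, D) stack of layer vectors; returns an (m, m) similarity matrix."""
    normed = F.normalize(ed, dim=-1)
    return normed @ normed.T

# Example: reusing encode_document from the earlier sketch
# print(layer_similarity(encode_document("Dense retrieval encodes passages into vectors.")))
```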

5. Comparisons with ME-BERT, ColBERT, and Dual Encoders

| Model | Layer Usage | Vector Granularity | Storage/Speed | Effectiveness |
|---|---|---|---|---|
| Dual Encoder | Last layer, [CLS] only | Single vector | Efficient | Baseline |
| ME-BERT | Last layer, multiple tokens | Multi-vector from one layer | Increased overhead | Improves with $m$ |
| ColBERT | Token-level, last layer | All tokens, last layer | Highest overhead | High, but slower |
| MLR | Multiple layers, [CLS] | Multiple or single (pooled) | Comparable to dual encoder | Matches or exceeds others |

MLR improves over dual encoders by aggregating complementary layer information, and it outperforms ME-BERT with fewer vectors owing to cross-layer diversity. When combined with pooling, MLR matches the efficiency of dual encoders and, with self-contrastive, average, or scalar mix pooling, closely rivals or exceeds the performance of ColBERT and ME-BERT, especially in single-vector configurations.

6. Synergy with Retrieval-Oriented Pre-training and Hard Negative Mining

MLR benefits significantly from being initialized with retrieval-oriented pre-trained models such as RetroMAE. Furthermore, two-stage training with hard negative mining, in which challenging distractor passages are iteratively harvested and used to update the retrieval model, further enhances MLR performance. The integration is seamless: MLR preserves the gains from hard negative mining and retrieval-oriented pre-training without requiring changes to the model architecture or to those training procedures.
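
As an illustration of the mining step in such a two-stage regimen, the following sketch selects, for each query, the top-scoring passages that are not labeled relevant to use as hard negatives in the next training stage; the function name and the random embeddings in the usage example are placeholders, and the paper's exact pipeline (index construction, iteration schedule) is not reproduced here.

```python
import torch

def mine_hard_negatives(query_embs: torch.Tensor,
                        doc_embs: torch.Tensor,
                        positives: list[set[int]],
                        k: int = 5) -> list[list[int]]:
    """For each query, return the top-k highest-scoring documents that are NOT
    labeled relevant; these serve as hard negatives for the next training stage.
    query_embs: (Q, D), doc_embs: (N, D) pooled single-vector MLR embeddings."""
    scores = query_embs @ doc_embs.T                   # inner-product retrieval scores
    ranked = scores.argsort(dim=1, descending=True)    # per-query ranking of all documents
    hard_negs = []
    for qi in range(query_embs.size(0)):
        negs = [int(d) for d in ranked[qi] if int(d) not in positives[qi]][:k]
        hard_negs.append(negs)
    return hard_negs

# Example with random placeholder embeddings (illustration only):
q = torch.randn(2, 768)
docs = torch.randn(100, 768)
print(mine_hard_negatives(q, docs, positives=[{3}, {7, 8}], k=5))
```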

Empirical results demonstrate that single-vector MLR models using self-contrastive pooling, when combined with these advanced training regimens, achieve a +1.0% top-5 accuracy improvement over RetroMAE and similarly outperform dual-encoder and multi-vector baselines, all while retaining the computational efficiency of single-vector retrieval.

7. Implications and Outlook for MLR in Retrieval Systems

MLR provides a principled and empirically validated approach for fully utilizing the hierarchical feature structure of transformer-based encoders in dense passage retrieval. By leveraging layer-diverse representations, pooling for efficiency, and compatibility with retrieval-specific training, MLR achieves an improved tradeoff between retrieval quality and system scalability. This suggests that future passage retrieval systems should design for cross-layer feature aggregation rather than exclusive reliance on final-layer outputs. A plausible implication is that similar MLR strategies could benefit other tasks dependent on vector representations (such as document ranking, re-ranking, or zero-shot retrieval) where information from various representational depths is valuable.

The integration of MLR with pooling, retrieval-specific pre-training, and negative mining further underscores its versatility and potential as a baseline for future dense retrieval research and large-scale applications (Xie et al., 28 Sep 2025).
