Multi-Layer Representations in Retrieval
- Multi-layer Representations (MLR) is a method that builds document embeddings by combining hidden states from various encoder layers to capture diverse linguistic features.
- The approach employs pooling strategies such as self-contrastive, average, and scalar mix pooling to reduce computational overhead while preserving retrieval quality.
- MLR outperforms traditional dual encoders by aggregating complementary information across layers, resulting in improved tradeoffs between retrieval effectiveness and efficiency.
Multi-layer representations (MLR) are data representations that integrate information from multiple hierarchical layers of a neural architecture or multi-stage algorithmic process. In dense passage retrieval, MLR specifically refers to forming document (or passage) vectors by extracting and combining hidden states from multiple encoder layers, rather than relying solely on the final output layer. This approach acknowledges that different layers of pre-trained language models encode varied linguistic and semantic knowledge, and that collectively these can yield richer, more discriminative document embeddings for information retrieval tasks.
1. Motivation for Multi-Layer Representations in Dense Retrieval
Dense retrieval systems commonly use a dual encoder framework, where both queries and documents are encoded into vectors, with document representations typically drawn from the [CLS] token of the last transformer layer. However, it is empirically established that intermediate layers of deep pre-trained models such as BERT and T5 capture distinct aspects of linguistic phenomena: lower and middle layers encode syntactic and lexical information, while upper layers encode more abstract, task-specific and semantic features. MLR leverages this vertical diversity by constructing representations from a selected set of layers (e.g., layers 10 and 12 in a 12-layer model), thereby capturing a broader range of information and improving retrieval effectiveness (Xie et al., 28 Sep 2025).
2. Construction of Multi-Layer Document Representations
Let $d$ be a document tokenized into $n$ tokens and encoded through an $L$-layer transformer, yielding hidden states $h_i^{(l)}$ for layers $l = 1, \dots, L$ and token positions $i = 1, \dots, n$. Typically, $h_1^{(l)}$ (corresponding to the [CLS] token) is selected per layer. For a subset of layers $S \subseteq \{1, \dots, L\}$ with $|S| = k$, the document is represented by the set of per-layer [CLS] vectors:

$$D = \big\{\, h_1^{(l)}(d) : l \in S \,\big\}.$$
During retrieval, a query $q$ is embedded as a single vector (its last-layer [CLS] state, $h_1^{(L)}(q)$), and the query-document similarity is calculated as the maximum inner product over the document's layer vectors:

$$s(q, d) = \max_{l \in S} \big\langle h_1^{(L)}(q),\, h_1^{(l)}(d) \big\rangle.$$
This max-inner-product scoring enables multi-vector (MV) retrieval that exploits the complementary discriminative capacity of the selected layers.
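The construction above can be sketched with a standard Hugging Face encoder. The snippet below is a minimal illustration rather than the paper's implementation: it uses `bert-base-uncased` as a stand-in encoder, adopts the layer subset $\{10, 12\}$ from the example above, and assumes the query is embedded with the last-layer [CLS] vector only.

```python
# Minimal sketch of MLR document encoding and max-inner-product scoring.
# The model name and layer subset are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"   # stand-in encoder
LAYER_SUBSET = [10, 12]            # S: layers whose [CLS] states form the document vectors

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def encode_document(text: str) -> torch.Tensor:
    """Return a (|S|, hidden_size) matrix of per-layer [CLS] vectors."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = encoder(**inputs)
    # hidden_states[0] is the embedding layer; hidden_states[l] is the output of layer l.
    cls_per_layer = [outputs.hidden_states[l][:, 0, :] for l in LAYER_SUBSET]
    return torch.cat(cls_per_layer, dim=0)                     # (|S|, H)

@torch.no_grad()
def encode_query(text: str) -> torch.Tensor:
    """Queries use only the last layer's [CLS] vector, as in a standard dual encoder."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = encoder(**inputs)
    return outputs.hidden_states[-1][:, 0, :].squeeze(0)       # (H,)

def mlr_score(query_vec: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """Max inner product over the document's per-layer vectors."""
    return (doc_vecs @ query_vec).max()

# Usage: score a single query-document pair.
doc = encode_document("Dense retrieval encodes passages into vectors.")
query = encode_query("what is dense passage retrieval")
print(float(mlr_score(query, doc)))
```

In a deployed system, the per-layer document vectors would be precomputed and placed in an approximate nearest-neighbor index offline; the max over layers is then taken at scoring time.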
3. Pooling Strategies for Single-Vector Efficiency
Using multiple vectors per document increases both storage and computational burden, which can be prohibitive for web-scale retrieval systems. To address this, the following pooling strategies compress the multi-layer representation into a single vector $\bar{d}$, recovering the retrieval speed of standard dual encoders without quality loss (a code sketch of all three strategies follows this list):
- Self-Contrastive Pooling: At inference, only the last layer's [CLS] embedding $h_1^{(L)}$ is used as the document vector. During training, a self-contrastive regularization term aligns $h_1^{(L)}$ with the candidate vectors from the other selected layers, so that the retained vector absorbs their information; the total loss is thus the standard in-batch contrastive loss plus this regularization.
- Average Pooling: Computes the simple arithmetic mean of the selected layers' [CLS] vectors:

$$\bar{d} = \frac{1}{|S|} \sum_{l \in S} h_1^{(l)}.$$
- Scalar Mix Pooling: Learns a weight vector $w \in \mathbb{R}^{|S|}$, normalizes it with a softmax, $\alpha = \mathrm{softmax}(w)$, and forms the weighted combination:

$$\bar{d} = \sum_{l \in S} \alpha_l\, h_1^{(l)}.$$
This parameterization allows the network to learn which layers contain the most salient information for retrieval, in a data-driven manner.
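A minimal PyTorch sketch of the three strategies follows, assuming the per-layer [CLS] vectors have already been stacked into a (batch, |S|, hidden) tensor with the last layer in the final slot. The self-contrastive term here is one reasonable reading of the description above (anchor the last-layer vector against the other layers' vectors, contrasted with in-batch negatives); the paper's exact loss may differ in its normalization and temperature.

```python
# Sketch of the three pooling strategies for collapsing per-layer [CLS]
# vectors (batch x |S| x hidden) into a single document vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarMixPooling(nn.Module):
    """Learn softmax-normalized layer weights and return their weighted sum."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer in S

    def forward(self, layer_cls: torch.Tensor) -> torch.Tensor:
        # layer_cls: (batch, |S|, hidden)
        alpha = F.softmax(self.weights, dim=0)                 # (|S|,)
        return torch.einsum("l,bld->bd", alpha, layer_cls)     # (batch, hidden)

def average_pooling(layer_cls: torch.Tensor) -> torch.Tensor:
    """Arithmetic mean of the per-layer [CLS] vectors."""
    return layer_cls.mean(dim=1)

def self_contrastive_reg(layer_cls: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Regularizer pulling the last-layer [CLS] vector toward the other layers'
    vectors of the same document, contrasted against other documents in the batch.
    This is an assumed form of the loss, not necessarily the paper's exact one."""
    anchor = layer_cls[:, -1, :]                               # (batch, hidden), last layer
    loss = 0.0
    for l in range(layer_cls.size(1) - 1):
        targets = layer_cls[:, l, :]                           # (batch, hidden)
        logits = anchor @ targets.t() / temperature            # (batch, batch)
        labels = torch.arange(logits.size(0))                  # own document is the positive
        loss = loss + F.cross_entropy(logits, labels)
    return loss / max(layer_cls.size(1) - 1, 1)

# Usage with dummy per-layer [CLS] vectors for a batch of 4 documents, |S| = 2:
layer_cls = torch.randn(4, 2, 768)
pooled = ScalarMixPooling(num_layers=2)(layer_cls)   # or average_pooling(layer_cls)
reg = self_contrastive_reg(layer_cls)
```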
4. Empirical Analysis: Layer Selection and Redundancy
Experiments with BERT and similar encoders reveal that representations from the last few layers (e.g., the top two or four) provide the largest retrieval gains. Adding vectors from more distant lower layers (e.g., by sampling uniformly across the full layer stack) can be beneficial but often introduces redundancy, as these vectors may contain correlated information. In 2- or 4-vector MLR setups, combinations drawn from the upper layers yield consistent and robust improvements, while adding too many vectors eventually degrades performance due to this redundancy effect.
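One way to observe this redundancy empirically is to measure how similar the per-layer [CLS] vectors of the same documents are. The diagnostic below is illustrative only and is not the paper's evaluation protocol.

```python
# Illustrative check of cross-layer redundancy: average pairwise cosine
# similarity between per-layer [CLS] vectors over a small corpus.
import torch
import torch.nn.functional as F

def layer_redundancy(layer_cls: torch.Tensor) -> torch.Tensor:
    """layer_cls: (num_docs, |S|, hidden). Returns a (|S|, |S|) matrix of
    mean cosine similarities between layers, averaged over documents."""
    normed = F.normalize(layer_cls, dim=-1)                   # unit-normalize each vector
    sims = torch.einsum("nld,nmd->nlm", normed, normed)       # per-doc layer-by-layer cosines
    return sims.mean(dim=0)                                   # average over documents

# Example with dummy vectors for 100 documents and 4 candidate layers:
print(layer_redundancy(torch.randn(100, 4, 768)))
```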
5. Comparisons with ME-BERT, ColBERT, and Dual Encoders
| Model | Layer Usage | Vector Granularity | Storage/Speed | Effectiveness |
|---|---|---|---|---|
| Dual Encoder | Last layer [CLS] only | Single | Efficient | Baseline |
| ME-BERT | Last layer, multiple tokens | Multi-vector from one layer | Increased overhead | Improves with the number of vectors m |
| ColBERT | Token-level, last layer | All tokens, last layer | Highest overhead | High, but slower |
| MLR | Multiple layers ([CLS]) | Multiple/single (pooled) | Comparable to dual encoder | Matches or exceeds others |
MLR improves over dual encoders by aggregating complementary layer information, and it outperforms ME-BERT with fewer vectors thanks to cross-layer diversity. When combined with pooling (self-contrastive, average, or scalar mix), MLR matches the efficiency of dual encoders while rivaling or exceeding the performance of ColBERT and ME-BERT, especially in single-vector configurations.
6. Synergy with Retrieval-Oriented Pre-training and Hard Negative Mining
MLR benefits significantly from being initialized with retrieval-oriented pre-trained models such as RetroMAE. Furthermore, two-stage training with hard negative mining—where challenging distractor passages are iteratively harvested and used to update the retrieval model—further enhances MLR performance. The integration is seamless: MLR preserves all gains from techniques such as hard negative mining and retrieval-oriented pre-training without requiring architectural changes or retraining.
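As a concrete illustration of the mining step, the sketch below selects, for each training query, the top-scoring passages that are not labeled relevant, assuming queries and passages have already been encoded. The encoder, training loop, and iteration schedule are out of scope here, and the function is a simplified stand-in, not the paper's pipeline.

```python
# Minimal sketch of hard negative mining with an already-encoded corpus.
import torch

def mine_hard_negatives(query_vecs: torch.Tensor,
                        passage_vecs: torch.Tensor,
                        relevant_ids: list[set[int]],
                        k: int = 8) -> list[list[int]]:
    """query_vecs: (Q, H); passage_vecs: (P, H); relevant_ids[i] holds the
    gold passage ids for query i. Returns k hard-negative ids per query."""
    scores = query_vecs @ passage_vecs.t()               # (Q, P) inner-product scores
    ranked = scores.argsort(dim=1, descending=True)      # highest-scoring passages first
    hard_negatives = []
    for i, order in enumerate(ranked):
        negs = [int(p) for p in order if int(p) not in relevant_ids[i]][:k]
        hard_negatives.append(negs)
    return hard_negatives

# Usage with dummy encodings: 2 queries, 1000 passages, 768-dim vectors.
negs = mine_hard_negatives(torch.randn(2, 768), torch.randn(1000, 768),
                           relevant_ids=[{3}, {17}], k=8)
```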
Empirical results demonstrate that single-vector MLR models using self-contrastive pooling, when combined with these advanced training regimens, achieve +1.0% top-5 accuracy improvements over RetroMAE and similarly outperform dual encoders and multi-vector baselines, all while retaining the computational efficiency of single-vector retrieval.
7. Implications and Outlook for MLR in Retrieval Systems
MLR provides a principled and empirically validated approach for fully utilizing the hierarchical feature structure of transformer-based encoders in dense passage retrieval. By leveraging layer-diverse representations, pooling for efficiency, and compatibility with retrieval-specific training, MLR achieves an improved tradeoff between retrieval quality and system scalability. This suggests that future passage retrieval systems should design for cross-layer feature aggregation rather than exclusive reliance on final-layer outputs. A plausible implication is that similar MLR strategies could benefit other tasks dependent on vector representations (such as document ranking, re-ranking, or zero-shot retrieval) where information from various representational depths is valuable.
The integration of MLR with pooling, retrieval-specific pre-training, and negative mining further underscores its versatility and potential as a baseline for future dense retrieval research and large-scale applications (Xie et al., 28 Sep 2025).