Multi-Layer Representations (MLR)

Updated 26 February 2026

MLR are intermediate layer outputs that capture low- to high-level abstractions, enabling enhanced interpretability and effective feature transfer.
They employ aggregation techniques such as average pooling, scalar-mix, and layer-attentive pooling to yield robust results across tasks like retrieval and speaker verification.
Extracted from models like RNNs, CNNs, and Transformers, MLR provide detailed insights into syntactic, semantic, and task-specific dynamics essential for model design.

Multi-Layer Representations (MLR) encompass the vector activations, statistics, or pooled summaries generated at each intermediate layer of a deep model (RNN, CNN, Transformer, sparse model, etc.). Each layer transforms its input into a new representational space, and the collection of these layerwise outputs exposes the progression from low-level to high-level abstractions while encoding diverse syntactic, semantic, structural, and task-specific information. MLR serve both as a foundation for deeper interpretability and as building blocks for downstream fusion, pooling, and transfer procedures.

1. Formal Definitions and Mathematical Foundations

In deep sequence models (RNNs, CNNs, Transformers), the $\ell$-th layer outputs a vector sequence

$h^{(\ell)} = [h^{(\ell)}_1, h^{(\ell)}_2, \dots, h^{(\ell)}_T] \in \mathbb{R}^{T \times d}$

for a sequence of length $T$ and hidden size $d$. In RNNs, $h^{(\ell)}_t$ is the recurrent state at step $t$; in CNNs, it is the output of the $\ell$-th convolution at position $t$; in Transformers, it is the output of the $\ell$-th encoder or decoder block after the self-attention and feed-forward modules. Transformers additionally generate self-attention matrices $A^{(\ell)} \in \mathbb{R}^{T \times T}$, where $A^{(\ell)}_{ij}$ encodes the attention weight from $i$ to $j$ at layer $\ell$, and for encoder-decoder models, cross-attention matrices $C^{(\ell)} \in \mathbb{R}^{T_t \times T_e}$ link targets to sources.

In sparse and feed-forward models, MLR generalizes to cascades of linear or convolutional dictionaries $D_1, \dots, D_L$ and associated codes, e.g.,

$x = D_1 \gamma_1, \quad \gamma_1 = D_2 \gamma_2, \dots, \gamma_{L-1} = D_L \gamma_L \,,$

with each intermediate code $\gamma_i$ constituting the MLR for layer $i$ (Aberdam et al., 2018).
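The cascade above can be illustrated with a minimal NumPy sketch. The layer widths, the random Gaussian dictionaries, and the helper name `synthesize` are assumptions made for illustration only. In particular, random dictionaries do not keep the intermediate codes sparse, which the multi-layer sparse model requires; the sketch only shows how each $\gamma_i$ is obtained from the deepest code and that $x = D_1 D_2 \cdots D_L \gamma_L$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical widths: signal dimension 20, code dimensions 30 and 40 (L = 2).
dims = [20, 30, 40]
# dictionaries[i] plays the role of D_{i+1}: it maps gamma_{i+1} to gamma_i
# (with gamma_0 = x).
dictionaries = [rng.standard_normal((dims[i], dims[i + 1]))
                for i in range(len(dims) - 1)]

# Deepest code gamma_L with only a few nonzero entries.
gamma_L = np.zeros(dims[-1])
support = rng.choice(dims[-1], size=4, replace=False)
gamma_L[support] = rng.standard_normal(4)


def synthesize(dictionaries, gamma_L):
    """Return [x, gamma_1, ..., gamma_L] by applying D_L, ..., D_1 in turn."""
    codes = [gamma_L]
    for D in reversed(dictionaries):
        codes.append(D @ codes[-1])
    return codes[::-1]


codes = synthesize(dictionaries, gamma_L)
x, gamma_1 = codes[0], codes[1]
# The signal equals the product of all dictionaries applied to the deepest code.
effective = dictionaries[0] @ dictionaries[1]
print(np.allclose(x, effective @ gamma_L))
```

Each entry of `codes` after the first is the MLR of one layer in this toy setting.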

For LLMs, MLR at the sentence or document level are commonly extracted as pooled representations: for BERT, the [CLS] state $h^{(\ell)}(D) := h_{\text{CLS}}^{(\ell)}(D)$; for T5, the average of the token-wise states $h_i^{(\ell)}$ (Xie et al., 28 Sep 2025).
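As a rough illustration of the two pooling conventions just described, the sketch below operates on an array of stacked hidden states rather than on a real model. The array shape `(num_layers, T, d)`, the function names, and the assumption that the [CLS] token sits at index 0 are all illustrative choices, not part of the cited work.

```python
import numpy as np


def cls_pool(hidden, cls_index=0):
    """hidden: (num_layers, T, d) -> (num_layers, d), one token per layer."""
    return hidden[:, cls_index, :]


def mean_pool(hidden, mask):
    """Average over non-padding tokens; mask is (T,) with 1 for real tokens."""
    mask = np.asarray(mask, dtype=hidden.dtype)
    summed = (hidden * mask[None, :, None]).sum(axis=1)
    return summed / mask.sum()


rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 6, 8))   # 4 layers, 6 tokens, d = 8
mask = np.array([1, 1, 1, 1, 0, 0])       # last two positions are padding

cls_vecs = cls_pool(hidden)               # BERT-style, one vector per layer
mean_vecs = mean_pool(hidden, mask)       # T5-style, one vector per layer
print(cls_vecs.shape, mean_vecs.shape)
```

Either function yields one document vector per layer, which is the input expected by the cross-layer aggregation strategies discussed later.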

2. Extraction and Aggregation Methodologies

Extraction of MLR entails running a forward pass, recording each layer's activations for sentences and tokens. Outputs can be normalized (e.g., z-score), then pooled to obtain per-sentence or per-token features. Storage conventionally follows structured JSON schemas, capturing arrays of layerwise embeddings or attention statistics (Escolano et al., 2019).
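A minimal sketch of the normalize-then-store step follows. The JSON field names (`sentence`, `layers`, `layer`, `embedding`) are a hypothetical schema chosen for illustration; the cited work may use a different layout. The pooled features are random placeholders standing in for recorded activations.

```python
import json
import numpy as np


def zscore(features, eps=1e-8):
    """Standardize each feature dimension across the batch (axis 0)."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


def to_record(sentence, layer_vectors):
    """Hypothetical schema: one record per sentence, one embedding per layer."""
    return {
        "sentence": sentence,
        "layers": [{"layer": i, "embedding": v.tolist()}
                   for i, v in enumerate(layer_vectors)],
    }


rng = np.random.default_rng(0)
sentences = ["a b c", "d e", "f g h i"]
# pooled[l] holds the layer-l sentence vectors, shape (num_sentences, d).
pooled = [rng.standard_normal((3, 5)) for _ in range(2)]
normalized = [zscore(layer) for layer in pooled]

records = [to_record(s, [layer[i] for layer in normalized])
           for i, s in enumerate(sentences)]
payload = json.dumps(records)
print(len(payload) > 0)
```

Normalizing per layer, rather than across layers, keeps layers with different activation scales comparable when they are later pooled or projected.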

Aggregation strategies depend on the application:

Pooling: Average-pooling, max-pooling, scalar-mix (learned weights), or attention/gating over layers. For aggregating vectors from a set $S$ of layers, $\{h^{(l)}(D)\}_{l \in S}$:
- Average: $r_D = (1/|S|) \sum_{l \in S} h^{(l)}(D)$
- Max: $[r_D]_j = \max_{l \in S} [h^{(l)}(D)]_j$
- Scalar-mix: $r_D = \sum_{l \in S} \operatorname{softmax}(a)_l \cdot h^{(l)}(D)$
- Layer-attentive pooling (LAP): Dynamic, per-time, per-layer gates applied prior to max-pooling (Kim et al., 15 Dec 2025).
Visualization: Dimensionality reduction (e.g., UMAP with $n_\text{neighbors}=15$, $\text{min\_dist}=0.1$) projects high-dimensional MLR vectors into 2D for analysis of layer evolution, language clustering, or semantic drift (Escolano et al., 2019).
Fusion for prediction: In NMT, representation fusion modules combine all decoder (or encoder) layer outputs, using average, feedforward, or multi-hop self-attention mechanisms (Wang et al., 2020).
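The pooling formulas listed above translate directly into NumPy. The sketch below assumes the selected layer vectors are stacked in an array of shape `(|S|, d)`. The final function, `gated_layer_max`, is only a loose illustration of the idea of per-frame, per-layer gates followed by a max over layers: its sigmoid gate with one weight vector per layer is a hypothetical parameterization, not the LAP module of Kim et al.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()


def average_pool(layers):
    """layers: (S, d) -> (d,), r_D = (1/|S|) * sum_l h^(l)(D)."""
    return layers.mean(axis=0)


def max_pool(layers):
    """Coordinate-wise maximum over the selected layers."""
    return layers.max(axis=0)


def scalar_mix(layers, a):
    """Weighted sum with softmax-normalized scalars a (one per layer)."""
    w = softmax(a)
    return np.tensordot(w, layers, axes=1)


def gated_layer_max(frames, gate_w):
    """frames: (L, T, d), gate_w: (L, d) -> (T, d).

    Hypothetical gate: sigmoid(<frame, gate_w[l]>) per layer and frame,
    applied before taking the maximum over layers.
    """
    logits = np.einsum("ltd,ld->lt", frames, gate_w)
    gates = 1.0 / (1.0 + np.exp(-logits))
    return (gates[..., None] * frames).max(axis=0)


rng = np.random.default_rng(0)
layers = rng.standard_normal((3, 8))        # 3 selected layers, d = 8
frames = rng.standard_normal((3, 5, 8))     # 3 layers, 5 frames, d = 8
gate_w = rng.standard_normal((3, 8))

r_avg = average_pool(layers)
r_max = max_pool(layers)
r_mix = scalar_mix(layers, np.zeros(3))     # equal scalars reduce to the average
r_gated = gated_layer_max(frames, gate_w)
print(np.allclose(r_mix, r_avg), r_gated.shape)
```

In practice the scalars `a` and the gate weights would be learned jointly with the downstream objective; here they are fixed or random purely to exercise the shapes.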

3. Applications Across Domains

MLR enable concrete advances and interpretability across diverse tasks:

Sequence Models and NLP: Layerwise analysis in Transformer models reveals where task-relevant semantics or syntactic phenomena emerge, and supports multilingual transfer, bias analysis, and monitoring of transfer bottlenecks (Escolano et al., 2019, Kaneko et al., 2019).
Dense Retrieval: Multi-vector and single-vector retrieval architectures leverage MLR—selecting two to four intermediate layers, or pooling with self-contrastive objectives—to improve retrieval accuracy over conventional last-layer approaches (Xie et al., 28 Sep 2025).
Speaker Verification: Layer-attentive pooling dynamically aggregates per-layer outputs from pre-trained models (e.g., WavLM), using per-frame gates and max-pooling for robust speaker embeddings, yielding superior efficiency and state-of-the-art accuracy (Kim et al., 15 Dec 2025).
Sparse Coding and Representation Learning: Multi-layer sparse models unify synthesis and analysis views; holistic pursuit algorithms leverage joint constraints across all layers for improved recovery and reduced error under noise (Aberdam et al., 2018). Variance-regularized deep decoders in autoencoders produce more stable and interpretable codes (Evtimova et al., 2021).
Hierarchical Matrix Factorization: Multi-layer NMF yields feature hierarchies for document/image clustering, minimizing reconstruction loss and improving classification (Song et al., 2013).
Deep Relational Reasoning: Multi-layer relation networks (MLRN) successively refine relational information, enabling higher-order logic without combinatorial explosion and outperforming flat architectures on tasks such as bAbI QA (Jahrens et al., 2018).

4. Experimental Protocols and Empirical Findings

Empirical protocols rely on training or inference-specific strategies:

Visualization Tasks: After batchwise extraction and normalization, 2D UMAP projections expose meaningful clusters at both sentence/token granularity (e.g., gender bias clustering in contextual ELMo, multilingual overlap at interlingua layers, decoder semantic drift) (Escolano et al., 2019).
Retrieval and Pooling: Performance peaks with selection of last 2–4 layers from BERT/T5 models, with self-contrastive pooling producing the strongest single-vector representations for retrieval—improving SQuAD Top-5 from 38.77% (dual encoder) to 42.47% (MLR) (Xie et al., 28 Sep 2025).
Speaker Verification: LAP+ASTP achieves EER as low as 0.37% on VoxCeleb1 (WavLM Large), outperforming larger backends at a fraction of the compute cost. Time-dynamic gating ensures that the speaker embedding is constructed from the layer most discriminative at each frame (Kim et al., 15 Dec 2025).
Translation and NMT: Fusion of decoder MLR via self-attention increases BLEU by up to 0.92 over a strong Transformer+RestartAdam baseline on De $\to$ En, and up to 0.56 BLEU on Zh $\to$ En. Self-attention fusion with learned layer embeddings generally outperforms average or feedforward pooling (Wang et al., 2020).
Sparse Coding Hierarchies: Holistic pursuit reduces recovery error by up to 50%, particularly as mid-layer cosparsity increases, with recovery matching theoretical $(s_L - r)/s_L$ predictions (Aberdam et al., 2018).
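To make the layer-fusion idea from the translation item above more concrete, the following sketch computes a per-position attention distribution over layers, with a learned embedding added to each layer's output before scoring. It is a single-hop simplification with a single query vector; the names `fuse_layers`, `layer_emb`, and `query` are hypothetical, and the module of Wang et al. (multi-hop self-attention) is more elaborate than this.

```python
import numpy as np


def fuse_layers(layers, layer_emb, query):
    """Attention over layers at every position.

    layers: (L, T, d), layer_emb: (L, d), query: (d,)
    Returns the fused sequence (T, d) and the layer weights (L, T).
    """
    d = layers.shape[-1]
    keys = layers + layer_emb[:, None, :]          # inject layer identity
    scores = keys @ query / np.sqrt(d)             # (L, T)
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over layers
    fused = (weights[..., None] * layers).sum(axis=0)
    return fused, weights


rng = np.random.default_rng(0)
L, T, d = 6, 4, 16
layers = rng.standard_normal((L, T, d))
layer_emb = rng.standard_normal((L, d))
query = rng.standard_normal(d)

fused, weights = fuse_layers(layers, layer_emb, query)
# With a zero query every layer gets weight 1/L, i.e. plain average fusion.
fused_uniform, _ = fuse_layers(layers, layer_emb, np.zeros(d))
print(fused.shape, np.allclose(fused_uniform, layers.mean(axis=0)))
```

The zero-query case shows that average fusion is a special case of attention fusion, which is one way to read the reported gains of attention-based fusion over averaging.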

5. Interpretability, Analysis, and Model Design Impact

Layerwise representations expose where and how linguistic, semantic, or speaker-specific signals are concentrated:

Interpreting Layer Function: Final layers become increasingly task-specialized, while lower/intermediate layers encode structural or local features (syntax, word-form, phonetics) (Xie et al., 28 Sep 2025, Kaneko et al., 2019, Kim et al., 15 Dec 2025).
Bias and Fairness Analysis: By visualizing layerwise contextual word embeddings, Escolano et al. (2019) demonstrate occupation- and context-specific gender separation, allowing targeted bias mitigation.
Design Guidance: Selection of the optimal layer (or fusion) for a task can now be guided by both quantitative and visual inspection, informing transfer learning, feature selection, and architecture modification.

6. Theoretical Considerations, Guarantees, and Limitations

Optimization Properties: In multi-layer sparse coding, the joint (holistic) optimization enables improved uniqueness and recovery guarantees, lowering error bounds due to additional mid-layer structure (Aberdam et al., 2018).
Hierarchical Models: Multi-layer NMF with proper smoothing strictly improves reconstruction loss over flat NMF with the same feature budget; a theoretical sketch is given by Song et al. (2013).
Capacity and Generalization: Dense MLR fusion (e.g., multi-hop self-attention) and time-dynamic per-layer gating regularize and stabilize training, enabling deeper and more parameter-efficient architectures (Wang et al., 2020, Kim et al., 15 Dec 2025).
Limitations: Overly large numbers of pooled vectors can degrade accuracy in dense retrieval, and static weighted averages may suppress higher-layer discriminative content in speech applications (Xie et al., 28 Sep 2025, Kim et al., 15 Dec 2025).

7. Broader Implications and Future Directions

MLR are central to model interpretability, adaptation, and efficiency:

Architecture Agnosticism: MLR extraction and fusion techniques are generally model-agnostic, able to be incorporated into RNNs, CNNs, Transformers, and sparse coding pipelines.
Adaptive Aggregation: There is growing emphasis on dynamic pooling (e.g., adaptive gating/max, relation diffusion, task-specific attention) rather than static or uniform weighting.
Model Compression and Scaling: Research addresses resource efficiency—seeking to minimize additional parameters or runtime in fusion heads, distillation, or pooling mechanisms (Wang et al., 2020).
Compositional Reasoning: Multi-layer relational and fusion modules push toward compositional reasoning capabilities, structured transfer, and explainable predictions in complex, multi-level tasks (Jahrens et al., 2018).

The ability to understand, aggregate, and exploit MLR is foundational for advances in interpretability, transfer, data efficiency, and robust generalization across computational domains.