MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

Published 27 Apr 2026 in cs.CL | (2604.24374v1)

Abstract: Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3, Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed in extreme low-dimensional settings.


Authors (7)

Summary

The paper introduces a unified training strategy (MIPIC) that combines self-distilled intra-relational alignment (SIA) and progressive information chaining (PIC) to enhance compressed embedding quality.
MIPIC improves semantic density and geometric consistency in low-dimensional regimes, outperforming prior methods on tasks like STS, NLI, and classification.
The approach trades higher training cost for zero added inference latency, making it well-suited for dynamic, memory-constrained NLP deployments.

Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining: A Technical Perspective

Problem Motivation and Prior Work

Matryoshka Representation Learning (MRL) provides adaptive embeddings that support inference-level truncation to suit variable computational budgets, enabling a single model to yield nested representations at multiple dimensionalities without retraining. However, standard MRL methods predominantly supervise truncated prefixes either independently or via sentence-level alignment, often neglecting the internal arrangement of semantic information and token-level structural relations across both embedding dimension and model depth. This neglect limits semantic density, especially under extreme compression regimes. The paper "MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining" (2604.24374) addresses these limitations by introducing a unified training strategy to explicitly coordinate cross-dimensional structural coherence and progressive depth-wise semantic consolidation.
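
To make the inference-level truncation concrete, the following is a minimal sketch of how a nested embedding is typically consumed: keep the first `dim` coordinates and use them as the smaller embedding. The function name `truncate_embedding`, the L2 re-normalization step, and the 768-wide random vectors are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and L2-normalize the result.

    Re-normalizing is an assumption here; it is common when embeddings
    are compared with cosine similarity.
    """
    prefix = embedding[..., :dim]
    norm = np.linalg.norm(prefix, axis=-1, keepdims=True)
    return prefix / np.clip(norm, 1e-12, None)

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))  # random stand-in for 4 sentence embeddings

# One set of full embeddings serves every budget: no retraining, only slicing.
nested = {d: truncate_embedding(full, d) for d in (16, 32, 64, 768)}
for d, emb in nested.items():
    print(d, emb.shape)
```

Because every smaller embedding is a prefix of the larger one, the quality of the leading coordinates determines how well the model behaves at small budgets.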

Framework Overview: MIPIC

MIPIC integrates two key mechanisms:

Self-Distilled Intra-Relational Alignment (SIA): SIA enforces cross-dimensional structural consistency using token-level geometric and attention-based self-distillation. Leveraging Centered Kernel Alignment (CKA) with hard top-$k$ selection, it selectively aligns salient intra-relational patterns between full and truncated representations, targeting information condensation in prefix dimensions.
Progressive Information Chaining (PIC): PIC incrementally propagates task-relevant semantic signals from deep to shallow layers through scaffolded alignment checkpoints using InfoNCE-based local mutual information maximization. This enables early-stage low-dimensional representations to incorporate core task signals, stabilizing semantic compression at reduced capacity.

The combined approach yields structurally robust and semantically dense Matryoshka representations suitable for deployment at any computational budget.

Mechanistic Details

SIA: Token-Level Structure Preservation

SIA departs from naive prefix alignment by focusing on token-level intra-relations:

Attention Distribution Matching: SIA constructs attention distributions based on [CLS]-to-token similarities, using the full-dimensional [CLS] as an anchor for softmax-based attention score computation. This guides lower-dimensional prefixes to preserve the ordering of token importance, enforced through KL divergence minimization between teacher (full-dimensional) and student (sliced, up-projected) distributions.
Top-$k$ Hidden State Alignment via CKA: Rather than aligning all token hidden states, which induces information saturation and noise, SIA performs hard selection of the top-$k$ tokens (nested per prefix dimension) based on teacher-side importance. CKA alignment between these top-$k$ tokens in the full and truncated spaces ensures geometric consistency. CKA's invariance to orthogonal transformations and isotropic scaling is crucial for robust cross-dimensional alignment.
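
The two SIA terms can be sketched for a single sentence as follows. This is a hedged NumPy illustration rather than the authors' implementation: the scaled dot-product scoring, the linear (rather than kernel) form of CKA, the single fixed `k`, the random `up_proj` matrix, and the names `cls_attention` and `sia_losses` are all assumptions made for the example. NumPy computes values only, so no gradients or stop-gradient handling appear here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cls_attention(cls_vec, tokens):
    """Softmax over [CLS]-to-token dot products (scaling is an assumption)."""
    return softmax(tokens @ cls_vec / np.sqrt(cls_vec.shape[0]))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over tokens."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def linear_cka(x, y):
    """Linear CKA between x (n, p) and y (n, q), rows paired by token."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro")
                          * np.linalg.norm(y.T @ y, "fro")))

def sia_losses(cls_full, tokens_full, dim, up_proj, k):
    # Teacher: full-width tokens scored against the full-width [CLS] anchor.
    teacher = cls_attention(cls_full, tokens_full)
    # Student: sliced prefix, up-projected back to the full width.
    student = cls_attention(cls_full, tokens_full[:, :dim] @ up_proj)
    kl = kl_divergence(teacher, student)
    # Hard top-k selection using teacher-side importance.
    top_idx = np.argsort(-teacher)[:k]
    cka = linear_cka(tokens_full[top_idx], tokens_full[top_idx, :dim])
    return kl, 1.0 - cka  # CKA is a similarity, so 1 - CKA acts as a loss

rng = np.random.default_rng(0)
num_tokens, full_dim, prefix_dim, top_k = 12, 64, 16, 8
tokens_full = rng.normal(size=(num_tokens, full_dim))
cls_full = rng.normal(size=full_dim)
up_proj = rng.normal(size=(prefix_dim, full_dim)) / np.sqrt(prefix_dim)

kl_loss, cka_loss = sia_losses(cls_full, tokens_full, prefix_dim, up_proj, top_k)
print(f"KL loss: {kl_loss:.4f}, CKA loss: {cka_loss:.4f}")
```

CKA compares the two spaces through their token-by-token similarity structure, so the full and truncated hidden states can be aligned directly even though their widths differ.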

PIC: Depth-Wise Semantic Consolidation

PIC incorporates depth-wise semantic flow:

Scaffolded Checkpoints: At each checkpoint layer and corresponding embedding dimension, PIC maximizes local mutual information between adjacent representations. A nonlinear projection bridges the dimensionality gap between checkpoints, and the InfoNCE alignment uses in-batch negatives.
Progressive Condensation: By focusing supervision only on selected Matryoshka subspaces, PIC avoids over-regularization while providing coarse-to-fine semantic guidance, ensuring discriminative structure is captured early without sacrificing representational flexibility.
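
A single PIC alignment step between two adjacent checkpoints might look like the sketch below. It is an assumption-laden illustration: the two-layer tanh projector, the temperature of 0.05, the dimensions, and the names `project` and `info_nce` are hypothetical choices, and the random arrays stand in for real checkpoint representations.

```python
import numpy as np

def l2_normalize(x):
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)

def info_nce(queries, keys, temperature=0.05):
    """InfoNCE with in-batch negatives.

    Row i of `queries` is paired with row i of `keys`; every other row
    in the batch serves as a negative.
    """
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature  # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def project(x, w1, w2):
    """Hypothetical nonlinear projector bridging two embedding widths."""
    return np.tanh(x @ w1) @ w2

rng = np.random.default_rng(0)
batch, shallow_dim, deep_dim, hidden = 8, 32, 64, 48
shallow = rng.normal(size=(batch, shallow_dim))  # earlier checkpoint, narrower prefix
deep = rng.normal(size=(batch, deep_dim))        # deeper checkpoint, wider prefix
w1 = rng.normal(size=(shallow_dim, hidden)) / np.sqrt(shallow_dim)
w2 = rng.normal(size=(hidden, deep_dim)) / np.sqrt(hidden)

# The shallow representation is projected up and pulled toward its deeper
# counterpart, while the other items in the batch are pushed away.
loss = info_nce(project(shallow, w1, w2), deep)
print(f"PIC-style InfoNCE loss: {loss:.4f}")
```

Minimizing InfoNCE maximizes a lower bound on the mutual information between the paired representations, which is how the "local mutual information" objective is usually realized in practice.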

Training Objective

MIPIC's loss function is:

$\mathcal{L}_{\text{MIPIC}} = \alpha \mathcal{L}_{\text{MRL}} + (1-\alpha)\big[\mathcal{L}_{\text{SIA}} + \mathcal{L}_{\text{PIC}}\big]$

where $\alpha$ balances standard multi-prefix supervision and auxiliary self-distillation losses.
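
As a small worked example of the weighting, the sketch below evaluates the combined objective for made-up loss values. The restriction of `alpha` to the interval [0, 1] is an assumption suggested by the convex-combination form; the function name `mipic_loss` and the numbers are purely illustrative.

```python
def mipic_loss(l_mrl: float, l_sia: float, l_pic: float, alpha: float) -> float:
    """Combine the three terms as alpha * L_MRL + (1 - alpha) * (L_SIA + L_PIC)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * l_mrl + (1.0 - alpha) * (l_sia + l_pic)

# Illustrative values only, not results from the paper.
total = mipic_loss(l_mrl=0.8, l_sia=0.3, l_pic=0.5, alpha=0.5)
mrl_only = mipic_loss(l_mrl=0.8, l_sia=0.3, l_pic=0.5, alpha=1.0)
print(total, mrl_only)
```

Setting `alpha` to 1 recovers plain multi-prefix MRL supervision, while smaller values shift weight toward the two auxiliary terms.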

Experimental Analysis

Extensive evaluation on diverse benchmarks—including STS (semantic similarity), NLI (natural language inference), and classification tasks—was conducted over a spectrum of backbone architectures (TinyBERT-6L, BERT-base, BGEM3, Qwen3-0.6B). Comparative baselines included MRL and ESE.

Key empirical findings:

Extreme Low-Dimensional Regimes: MIPIC exhibits significant accuracy advantages at aggressively truncated prefix dimensions (16 and 32), outperforming MRL and ESE in classification, similarity, and NLI tasks.
Full Capacity: At untruncated, full-sized representations, MIPIC maintains parity with baselines, demonstrating that structural and semantic organization does not degrade overall embedding quality.
Scalability and Generalization: MIPIC scales efficiently to large embedding models (BGEM3, Qwen3-0.6B), consistently retaining task-relevant knowledge under compression and generalizing robustly to out-of-domain benchmarks (STS12–16, SciTail).

Ablations confirm the synergy between SIA and PIC: SIA is essential for geometric structure under compression, while PIC is crucial for semantic density across depth. Progressive dimension scaling (rather than non-bottlenecked checkpoints) further improves Matryoshka efficiency.

Training Costs: MIPIC incurs higher training latency due to additional losses, with throughput reductions of 45–62% versus baselines. Crucially, inference latency is unaffected, as auxiliary projectors are discarded post-training.
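
To put the throughput figures in wall-clock terms, the sketch below converts a fractional throughput reduction into a training-time multiplier. It assumes throughput is measured in samples per second and that the number of training samples is unchanged; the helper name `time_multiplier` is hypothetical.

```python
def time_multiplier(throughput_reduction: float) -> float:
    """Training-time factor implied by a fractional throughput reduction.

    If throughput falls to (1 - r) of the baseline, processing the same
    number of samples takes 1 / (1 - r) times as long.
    """
    if not 0.0 <= throughput_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - throughput_reduction)

low = time_multiplier(0.45)   # about 1.82x the baseline training time
high = time_multiplier(0.62)  # about 2.63x the baseline training time
print(f"{low:.2f}x to {high:.2f}x")
```

Under these assumptions, the reported 45–62% reduction corresponds to roughly 1.8 to 2.6 times the baseline training time.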

Theoretical and Practical Implications

MIPIC's coordinated structural and semantic chaining unlocks efficient, robust representation learning with adaptive inference-level dimensionality. By explicitly organizing semantic hierarchy across nested subspaces and network depth, MIPIC mitigates semantic drift and geometric disarray inherent in legacy MRL methods. The framework is well-suited for large-scale retrieval, memory-constrained deployment, and scenarios demanding dynamic embedding adaptation. Its compositional approach establishes a foundation for further research into hierarchical representation learning, progressive knowledge condensation, and efficient distillation protocols.

Potential future directions include:

Extending the framework to multi-modal Matryoshka architectures and decoder-only generative models.
Exploring alternative scheduling and checkpoint strategies for depth-wise semantic transfer.
Integrating advanced token selection algorithms for top-$k$ alignment, such as differentiable sampling or context-sensitive selection.
Generalizing the self-distillation approach to continual learning and plug-and-play adaptation in evolving NLP pipelines.

Conclusion

MIPIC introduces principled mechanisms—SIA and PIC—for organizing semantic information across dimensions and depth, yielding Matryoshka embeddings that excel in both full and compressed regimes. Rigorous benchmarking confirms substantial efficacy improvements relative to established baselines, particularly under severe truncation. While higher training costs are incurred, zero additional inference latency and enhanced representation quality justify the approach. The theoretical and empirical contributions substantially advance the state of semantic compression and adaptive embedding design in NLP (2604.24374).
