Maximal Coding Rate Reduction Principle
- Maximal Coding Rate Reduction is an information-theoretic framework that encourages within-class compressibility and between-class separability by maximizing the difference between global and per-class coding rates.
- It employs a log-determinant formulation rooted in rate–distortion theory to enforce a union-of-subspaces geometry, where each class occupies a nearly orthogonal low-dimensional subspace.
- The approach has practical applications in clustering, metric learning, federated learning, and model compression, with scalable algorithmic instantiations overcoming computational challenges.
The Maximal Coding Rate Reduction (MCR²) Principle is a unified, information-theoretic framework for learning structured feature representations from high-dimensional data. It provides an objective function that encourages within-class compressibility and between-class separability by maximizing the difference between the coding rate of the overall feature set and the coding rates of individual class subsets. Rooted in classical rate–distortion theory, MCR² produces a geometry in which each class occupies a low-dimensional subspace, and these subspaces are mutually near-orthogonal, thus enabling robust, interpretable, and discriminative representation learning in both supervised and unsupervised settings (Lu et al., 2023, Yu et al., 2020).
1. Information-Theoretic Foundations and Coding Rate Formulation
MCR² is derived from the rate–distortion function for Gaussian sources. For a finite dataset $Z = [z_1, \ldots, z_m] \subset \mathbb{R}^d$, the (upper-bounded) per-sample coding rate under squared distortion $\varepsilon^2$ is

$$R(Z, \varepsilon) = \frac{1}{2} \log \det\!\Big(I + \frac{d}{m \varepsilon^2}\, Z Z^{\top}\Big).$$
For data partitioned into $k$ clusters encoded by diagonal membership masks $\Pi = \{\Pi_j\}_{j=1}^{k}$ (with $\mathrm{tr}(\Pi_j) = m_j$ samples in cluster $j$), one codes each cluster individually, yielding the per-cluster rate

$$R_c(Z, \varepsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\mathrm{tr}(\Pi_j)}{2m} \log \det\!\Big(I + \frac{d}{\mathrm{tr}(\Pi_j)\, \varepsilon^2}\, Z \Pi_j Z^{\top}\Big).$$
The coding rate reduction is then defined as

$$\Delta R(Z, \Pi, \varepsilon) = R(Z, \varepsilon) - R_c(Z, \varepsilon \mid \Pi).$$

Maximizing $\Delta R$ directly incentivizes compressible within-class structure and expansive, mutually separated class subspaces (Lu et al., 2023).
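The two rates and their difference can be computed directly. A minimal NumPy sketch (hard labels stand in for the diagonal masks $\Pi_j$, and `eps` is the distortion $\varepsilon$; the toy data at the end are synthetic examples, not from any cited experiment):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z, eps) = 1/2 * logdet(I + d/(m eps^2) Z Z^T) for Z in R^{d x m}."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    # slogdet is numerically safer than log(det(...))
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet

def coding_rate_per_class(Z, labels, eps=0.5):
    """R_c(Z, eps | Pi): per-class rates weighted by class size."""
    d, m = Z.shape
    total = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        mc = Zc.shape[1]
        alpha_c = d / (mc * eps ** 2)
        _, logdet = np.linalg.slogdet(np.eye(d) + alpha_c * Zc @ Zc.T)
        total += (mc / (2 * m)) * logdet
    return total

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R - R_c: large when classes are compact and mutually spread."""
    return coding_rate(Z, eps) - coding_rate_per_class(Z, labels, eps)

# Toy check: two classes on orthogonal axes vs. everything collapsed onto e1.
labels = np.array([0, 0, 0, 1, 1, 1])
ortho = np.zeros((4, 6)); ortho[0, :3] = 1.0; ortho[1, 3:] = 1.0
collapsed = np.zeros((4, 6)); collapsed[0, :] = 1.0
```

On the orthogonal configuration $\Delta R = \log 9 - \tfrac{1}{2}\log 17 \approx 0.78$, while the fully collapsed configuration gives $\Delta R = 0$: separation between class subspaces is rewarded, collapse is not.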
2. Formal MCR² Objective and Optimization
Given a data matrix $Z = [z_1, \ldots, z_m] \in \mathbb{R}^{d \times m}$ of feature vectors $z_i = f_\theta(x_i)$, the MCR² objective is to choose the parameters $\theta$ (and possibly a partition $\Pi$) that solve

$$\max_{\theta,\, \Pi \in \Omega} \; \Delta R\big(Z(\theta), \Pi, \varepsilon\big) = R\big(Z(\theta), \varepsilon\big) - R_c\big(Z(\theta), \varepsilon \mid \Pi\big),$$

subject to the per-class normalization $\|Z_j(\theta)\|_F^2 = m_j$ for $j = 1, \ldots, k$, where $\Omega$ is the simplex of soft assignments. For known labels, $\Pi$ is fixed as the class indicator matrix (Lu et al., 2023).
The gradients required for stochastic or full-batch optimization consist of

$$\frac{\partial R}{\partial Z} = \frac{d}{m\varepsilon^2}\Big(I + \frac{d}{m\varepsilon^2}\, Z Z^{\top}\Big)^{-1} Z$$

for the global rate, and analogous terms for each class rate. Optimization over $\theta$ uses projected gradient ascent (SGD, Adam). If $\Pi$ is unknown (as in clustering), one alternates gradient steps between $Z$ (for representation) and $\Pi$ (for assignment), exploiting the fact that $\Delta R$ is concave in $\Pi$ (Lu et al., 2023).
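The closed-form gradient of the global rate is easy to verify numerically. A short sketch with a finite-difference check on a single entry (the data and step size `h` are arbitrary illustrative choices):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z, eps) = 1/2 * logdet(I + alpha Z Z^T), alpha = d/(m eps^2)."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet

def coding_rate_grad(Z, eps=0.5):
    """Closed form: dR/dZ = alpha * (I + alpha Z Z^T)^{-1} Z."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    return alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)

# Forward finite difference on one entry of Z agrees with the closed form.
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 8))
g = coding_rate_grad(Z)
h = 1e-6
Zp = Z.copy()
Zp[2, 3] += h
fd = (coding_rate(Zp) - coding_rate(Z)) / h
```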
3. Subspace Geometry and Theoretical Guarantees
Maximizing the MCR² objective rigorously induces a union-of-subspaces geometry with desirable analytic properties:
- Within-class compressibility: Each class's features are driven to occupy a low-rank subspace (small per-class log det).
- Between-class diversity: The aggregated feature matrix is forced to span the maximal volume permitted by the ambient dimension under the packing resolution $\varepsilon$, making the class subspaces nearly orthogonal.
- Global optima: At sufficiently high precision, the optimal embedding decomposes into mutually orthogonal, maximally spread class subspaces, as proven by explicit block-orthogonality and two-level SVD structure for critical points (2406.01909).
Importantly, every critical point of the objective is either a local maximizer with these orthogonality properties or a strict saddle, ensuring that standard first-order optimization algorithms converge to maximizers almost surely (2406.01909). This "benign landscape" underscores the tractability and stability of the objective.
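The benign-landscape property can be illustrated on a toy problem: projected gradient ascent on $\Delta R$ over unit-sphere features, from a random start, steadily increases the objective while the two classes drift toward orthogonal subspaces. A sketch (the step size, iteration count, and problem sizes are illustrative choices, not tuned values):

```python
import numpy as np

def value_and_grad(Z, labels, eps=0.5):
    """Delta R = R - R_c and its gradient in Z, assembled in closed form."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    A = np.eye(d) + alpha * Z @ Z.T
    _, ld = np.linalg.slogdet(A)
    val = 0.5 * ld
    grad = alpha * np.linalg.solve(A, Z)          # gradient of the global rate
    for c in np.unique(labels):
        idx = labels == c
        Zc = Z[:, idx]
        mc = Zc.shape[1]
        ac = d / (mc * eps ** 2)
        Ac = np.eye(d) + ac * Zc @ Zc.T
        _, ldc = np.linalg.slogdet(Ac)
        val -= (mc / (2 * m)) * ldc               # subtract per-class rate
        grad[:, idx] -= (mc / m) * ac * np.linalg.solve(Ac, Zc)
    return val, grad

rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
Z = rng.standard_normal((4, 6))
Z /= np.linalg.norm(Z, axis=0)                    # project onto the unit sphere
v0, _ = value_and_grad(Z, labels)
for _ in range(500):                              # projected gradient ascent
    _, g = value_and_grad(Z, labels)
    Z = Z + 0.1 * g
    Z /= np.linalg.norm(Z, axis=0)
v1, _ = value_and_grad(Z, labels)
# Cross-class coherence shrinks as the classes approach orthogonal subspaces.
coherence = np.abs(Z[:, labels == 0].T @ Z[:, labels == 1]).max()
```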
4. Algorithmic Instantiations and Scalability
The direct evaluation of the log-det terms for each class subset in $R_c$ scales linearly with the number of classes, imposing a significant computational burden. Variational forms and blockwise approximations have been proposed to reduce this cost, such as expressing $\log\det$ via spectral-function variational forms, which enables scaling MCR² to datasets with hundreds or thousands of classes (Baek et al., 2022). For practical training, large-scale settings employ mini-batch evaluations, moving-average covariances, and weight decay (Lu et al., 2023).
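One elementary cost reduction, separate from the variational forms cited above, exploits the determinant identity $\det(I_d + \alpha Z Z^{\top}) = \det(I_m + \alpha Z^{\top} Z)$ (Weinstein–Aronszajn) so the log-det is always evaluated on the smaller Gram matrix. A sketch, with a naive reference implementation for comparison:

```python
import numpy as np

def coding_rate_cheap(Z, eps=0.5):
    """R(Z, eps) evaluated on the smaller Gram matrix: O(min(d, m)^3)."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    G = Z @ Z.T if d <= m else Z.T @ Z            # pick the cheaper side
    _, logdet = np.linalg.slogdet(np.eye(G.shape[0]) + alpha * G)
    return 0.5 * logdet

def coding_rate_naive(Z, eps=0.5):
    """Reference: always builds the d x d matrix."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet

# Wide batches (m >> d) and tall features (d >> m) both agree with the naive form.
rng = np.random.default_rng(0)
wide = rng.standard_normal((8, 200))
tall = rng.standard_normal((200, 8))
```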
White-box deep networks, such as "ReduNet," directly unroll projected gradient ascent steps for MCR² into explicit network layers, enabling closed-form layer-wise update rules and giving rise to interpretable, statically-constructed architectures (Chan et al., 2020, Chan et al., 2021). Shift- and translation-invariance can be enforced via circulant structures, leading to multi-channel convolutional operators in the feature update rules (Chan et al., 2021).
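A single unrolled "layer" in the ReduNet spirit can be sketched as one ascent step built from an explicit expansion operator $E$ and per-class compression operators $C_j$, which are exactly the matrices appearing in the gradient of $\Delta R$. This is a simplified sketch after Chan et al. (2020), not a faithful reproduction of the published architecture; in particular, the final per-feature normalization and the step size `eta` are this sketch's choices:

```python
import numpy as np

def redunet_layer(Z, labels, eta=0.5, eps=0.5):
    """One unrolled MCR^2 ascent step with explicit layer operators (sketch)."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    E = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T)   # expansion operator
    G = E @ Z
    for c in np.unique(labels):
        idx = labels == c
        Zc = Z[:, idx]
        mc = Zc.shape[1]
        ac = d / (mc * eps ** 2)
        Cc = ac * np.linalg.inv(np.eye(d) + ac * Zc @ Zc.T)  # compression operator
        G[:, idx] -= (mc / m) * Cc @ Zc
    Z_next = Z + eta * G
    return Z_next / np.linalg.norm(Z_next, axis=0)           # back to the sphere

# "Stacking" a few layers applies repeated ascent steps to the features.
rng = np.random.default_rng(1)
labels = np.array([0, 0, 0, 1, 1, 1])
Z_out = rng.standard_normal((4, 6))
Z_out /= np.linalg.norm(Z_out, axis=0)
for _ in range(3):
    Z_out = redunet_layer(Z_out, labels)
```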
5. Applications: Clustering, Metric Learning, Communications, and Beyond
The MCR² principle forms the foundation for various interpretable approaches in clustering, metric learning, federated representation learning, and communications:
- Clustering/representation: Directly maximizes $\Delta R$ for clustering and learns unsupervised embeddings via alternating updates over features and partitions. Graph Cut-guided MCR² (CgMCR) integrates a differentiable normalized-cut module with the MCR² objective for state-of-the-art clustering accuracy (He et al., 2024).
- Metric learning: Anti-Collapse Loss is a proxy-based MCR²-inspired loss designed for deep metric learning; it prevents embedding space collapse by maximizing the coding rate across learned proxies (Jiang et al., 2024).
- Federated learning: MCR² supports decentralized (label-aware) representation learning (e.g., the FLOW algorithm), producing compact, discriminative embeddings without cross-entropy loss (Cervino et al., 2022).
- Task-oriented communications: In multi-device edge inference systems, MCR² serves as a surrogate objective for both learning and communication (e.g., MIMO precoder/channel optimization), promoting maximal downstream inference accuracy via geometric class separation (Cai et al., 2023).
- Model compression: In natural language and vision domains, layers appended to frozen encoders (e.g., SBERT) can be trained using MCR² to achieve low-dimensional, compact, and cluster-preserving embeddings (Ševerdija et al., 2023).
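As a small numerical illustration of the anti-collapse idea from the metric-learning application above, the plain coding rate $R$ already scores a spread-out proxy set strictly higher than a collapsed one (the proxy matrices below are synthetic examples, not learned proxies):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z, eps) = 1/2 * logdet(I + d/(m eps^2) Z Z^T)."""
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    _, logdet = np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)
    return 0.5 * logdet

spread = np.eye(4)                                   # four orthogonal unit proxies
collapsed = np.repeat(np.eye(4)[:, :1], 4, axis=1)   # four identical proxies
```

Here $R(\text{spread}) = 2\log 5 \approx 3.22$ versus $R(\text{collapsed}) = \tfrac{1}{2}\log 17 \approx 1.42$, so maximizing the coding rate across proxies penalizes embedding-space collapse.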
6. Comparative Perspective and Practical Impact
Compared to conventional losses (cross-entropy, contrastive, contractive, information bottleneck), MCR² explicitly and jointly enforces subspace compressibility and class-separability at the feature level. Unlike cross-entropy, which only indirectly shapes representations, or information bottleneck, which may eliminate structure necessary for discrimination, MCR² leverages rate–distortion theory to learn robust, diverse, and discriminative embeddings (Yu et al., 2020).
Empirical studies demonstrate the robustness of MCR² objectives to label noise, the ability to recover the correct subspace geometry even under heavy corruption, and performance on par or superior to cross-entropy-based networks in clustering and classification, particularly in the presence of degenerate or subspace-like class distributions (2406.01909, Yu et al., 2020, Lu et al., 2023).
7. Extensions, Approximations, and Ongoing Research
Recent advances address analytical and computational challenges inherent in MCR². Adaptive regularized ReduNet architectures employ refined approximations of the Gaussian rate–distortion function, resulting in superior accuracy and convergence compared to standard ReduNet (Huang et al., 2025). Variational log-det formulations further allow for scalable optimization and broader applicability, including normalizing flows and system identification (Baek et al., 2022). The geometric structure uncovered by global optima analysis continues to motivate new objective designs and theoretical investigations in both white-box and black-box deep learning frameworks.
References:
- (Lu et al., 2023)
- (Yu et al., 2020)
- (Chan et al., 2020)
- (Chan et al., 2021)
- (2406.01909)
- (He et al., 2024)
- (Baek et al., 2022)
- (Cai et al., 2023)
- (Cervino et al., 2022)
- (Jiang et al., 2024)
- (Huang et al., 2025)
- (Ševerdija et al., 2023)