Memory Utilization Regularizer (MUR)

Updated 16 March 2026

Memory Utilization Regularizer (MUR) is a differentiable penalty that maximizes internal memory usage in sequence models using an entropy-based Effective State-Size (ESS) metric.
It regularizes the gap between the theoretical maximal state size and the observed ESS, thereby promoting efficient long-range recall capabilities.
Empirical evaluations show that appropriate MUR settings enhance recurrent model performance while mitigating issues like state collapse and degenerate identity regimes.

A Memory Utilization Regularizer (MUR) is a differentiable penalty designed to encourage sequence models—specifically those parameterized as input-invariant or input-varying linear operators—to maximize their internal memory utilization as quantified by the Effective State-Size (ESS) metric. Developed within the general framework for sequence architecture analysis, MUR promotes the realization of a model’s available memory capacity, enabling more efficient and effective use of recurrent state for long-range recall tasks. The central mechanism relies on regularizing the gap between entropy-based effective state rank and the theoretical maximal state size, and it applies to a range of computational units, including attention variants, convolutions, and recurrent architectures (Parnichkun et al., 28 Apr 2025).

1. Mathematical Foundations: ESS and MUR Construction

Let $T \in \mathbb{R}^{d\ell \times d\ell}$ denote a causal linear operator governing a sequence model, mapping an input $u$ to an output $y = T u$, with $d$ the hidden dimension and $\ell$ the sequence length. The operator is block-structured with $d\times d$ submatrices $T_{ij}$ mapping from step $j$ to step $i$.

Effective State-Size (ESS)

Input-invariant (LTI/LTV) case: For each time $i$, form the off-diagonal submatrix $H_i = T_{i:, :i-1} \in \mathbb{R}^{d(\ell-i) \times d i}$, which lies in the strictly lower block-triangular part of $T$ and maps inputs before step $i$ to outputs from step $i$ onward. The ESS at time $i$ is the minimal recurrent state size: $ESS_i = \operatorname{rank} H_i$.
Input-varying operators: Where $T$ depends on $u$ via a featurizer $f_T(u)$, define $H_i(u) = [f_T(u)]_{i:, :i-1}$ and $ESS_i(u) = \operatorname{rank} H_i(u)$.
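As an illustration of the input-invariant case, the sketch below builds $T$ for a scalar ($d = 1$) linear recurrence and computes $\operatorname{rank} H_i$ at every step. The helper names `lti_operator` and `ess_per_step`, the pole values, and the rank tolerance are assumptions made for this example and do not come from the source; NumPy is used only for the rank computation.

```python
import numpy as np

def lti_operator(poles, seq_len):
    """Causal operator T (d = 1) with impulse response sum_k pole_k ** t."""
    T = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        for j in range(i + 1):
            T[i, j] = sum(p ** (i - j) for p in poles)
    return T

def ess_per_step(T, d=1, tol=1e-10):
    """rank(H_i) for i = 1 .. seq_len - 1 (0-indexed steps).

    H_i keeps the rows of steps >= i and the columns of steps < i.
    """
    seq_len = T.shape[0] // d
    return [
        int(np.linalg.matrix_rank(T[d * i:, :d * i], tol=tol))
        for i in range(1, seq_len)
    ]

one_mode = ess_per_step(lti_operator([0.9], seq_len=8))
two_modes = ess_per_step(lti_operator([0.9, 0.5], seq_len=8))
print(one_mode)   # [1, 1, 1, 1, 1, 1, 1]
print(two_modes)  # [1, 2, 2, 2, 2, 2, 1]
```

The rank matches the number of recurrent modes wherever $H_i$ has at least that many rows and columns; at the first and last steps the shape of $H_i$ caps the rank at 1.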

Since the integer-valued rank is non-differentiable, a smooth surrogate based on the singular value decomposition (SVD) is used. Define the normalized spectrum $p_j = \sigma_j / \sum_k \sigma_k$, where $\sigma_j$ are the singular values of $H_i$. The entropy-ESS is $ESS_i^\mathrm{ent} = \exp(-\sum_j p_j \log p_j)$. This construction yields a differentiable approximation to the rank that is amenable to gradient-based optimization.
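A minimal sketch of how the entropy-ESS behaves on a few hand-picked spectra may help build intuition. The function name `entropy_ess` and the example singular values are illustrative assumptions; the formula is the one defined above, with zero singular values skipped because $p \log p \to 0$ as $p \to 0$.

```python
import math

def entropy_ess(singular_values):
    """exp of the Shannon entropy of the L1-normalized singular values."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values]
    # Terms with p = 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)

flat = entropy_ess([1.0, 1.0, 1.0, 1.0])        # equal spectrum: about 4, the rank
padded = entropy_ess([1.0, 1.0, 0.0, 0.0])      # two zero values: about 2
peaked = entropy_ess([1.0, 0.01, 0.01, 0.01])   # one dominant value: between 1 and 2
print(flat, padded, peaked)
```

With equal nonzero singular values the surrogate recovers the integer rank, while a sharply peaked spectrum is scored close to 1 even though its integer rank is 4.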

Regularizer Formulation

Given the maximum achievable rank $r_\mathrm{max} = \min(\text{rows}(H_i), \text{cols}(H_i))$ or some known theoretical state size $TSS$, the per-step MUR loss is

$L_i^{MUR} = (r_\mathrm{max} - ESS_i^\mathrm{ent})^2$

Aggregating over all time steps, layers, and batch entries, the total MUR penalty becomes

$L_{MUR} = \lambda \sum_{\ell,b,i} (r_\mathrm{max}^{(\ell)} - ESS_{i,b,\ell}^{\mathrm{ent}})^2$

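with regularization weight $\lambda \geq 0$ (Parnichkun et al., 28 Apr 2025). In this sum the index $\ell$ ranges over layers rather than denoting the sequence length, $b$ ranges over batch entries, and $i$ ranges over time steps.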
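The aggregation can be checked on a toy case. The sketch below assumes the entropy-ESS values have already been computed and are stored as a nested list indexed by layer, batch entry, and time step; the function name `mur_total` and all numbers are hypothetical.

```python
def mur_total(ess, r_max, lam):
    """ess[layer][batch][step] -> lam * sum of squared gaps to the per-layer r_max."""
    total = 0.0
    for layer, per_batch in enumerate(ess):
        for per_step in per_batch:
            for value in per_step:
                total += (r_max[layer] - value) ** 2
    return lam * total

# Hypothetical values: 2 layers, 1 batch entry, 2 time steps.
ess = [[[3.0, 2.0]], [[7.0, 8.0]]]
r_max = [4.0, 8.0]
penalty = mur_total(ess, r_max, lam=1e-2)
print(penalty)  # approximately 0.06 = 0.01 * (1 + 4 + 1 + 0)
```

A step whose ESS already equals $r_\mathrm{max}$ contributes nothing, so the penalty is driven entirely by under-utilized steps.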

2. Algorithmic Differentiation and Backpropagation

All steps in the MUR construction are differentiable and compatible with GPU-based auto-differentiation systems. The main computational steps and gradients are:

Entropy-ESS: $ESS_i^{\mathrm{ent}} = \exp(H_\mathrm{ent}(P_i))$, $H_\mathrm{ent}(P_i) = -\sum_m P_i^m \log P_i^m$, $P_i = s_i /\|s_i\|_1$, where $s_i$ is the vector of singular values $\sigma_i^m$ of $H_i$.
The backpropagation steps are:
- $\frac{\partial L_{MUR}}{\partial ESS_i^{\mathrm{ent}}} = -2\lambda(r_\mathrm{max} - ESS_i^{\mathrm{ent}})$
- $\frac{\partial ESS_i^{\mathrm{ent}}}{\partial H_\mathrm{ent}} = ESS_i^{\mathrm{ent}}$
- $\frac{\partial H_\mathrm{ent}}{\partial P_i^m} = -(\log P_i^m + 1)$
- $\frac{\partial P_i^m}{\partial \sigma_i^n} = \delta_{mn}/\|s_i\|_1 - \sigma_i^m/\|s_i\|_1^2$, so every singular value influences every $P_i^m$ through the normalization
- $\frac{\partial \sigma_i^m}{\partial H_i} = u_i^m (v_i^m)^\top$ via the SVD $H_i = U \Sigma V^\top$, where $u_i^m$ and $v_i^m$ are the $m$-th left and right singular vectors (valid when $\sigma_i^m$ is a simple singular value)
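The chain of derivatives above, up to the singular values, can be sanity-checked against central finite differences. This is a minimal sketch in plain Python that assumes strictly positive singular values and a single time step; the function names and test values are illustrative, and the last step (from $\sigma$ to $H_i$) is left to an autodiff library.

```python
import math

def mur_loss(sigma, r_max, lam):
    """lam * (r_max - entropy-ESS)^2 for one vector of singular values."""
    total = sum(sigma)
    p = [s / total for s in sigma]
    ess = math.exp(-sum(q * math.log(q) for q in p))
    return lam * (r_max - ess) ** 2

def mur_grad_wrt_sigma(sigma, r_max, lam):
    """Gradient of mur_loss with respect to sigma, following the listed chain rule."""
    total = sum(sigma)
    p = [s / total for s in sigma]
    ess = math.exp(-sum(q * math.log(q) for q in p))
    dL_dess = -2.0 * lam * (r_max - ess)
    dess_dh = ess
    dh_dp = [-(math.log(q) + 1.0) for q in p]
    grad = []
    for n in range(len(sigma)):
        # dP^m/dsigma^n = delta_mn / total - sigma^m / total^2
        dh_dsigma_n = sum(
            dh_dp[m] * ((1.0 if m == n else 0.0) / total - sigma[m] / total ** 2)
            for m in range(len(sigma))
        )
        grad.append(dL_dess * dess_dh * dh_dsigma_n)
    return grad

def finite_difference_grad(sigma, r_max, lam, step=1e-6):
    """Central differences, used only to verify the analytic gradient."""
    grad = []
    for n in range(len(sigma)):
        up, down = list(sigma), list(sigma)
        up[n] += step
        down[n] -= step
        grad.append((mur_loss(up, r_max, lam) - mur_loss(down, r_max, lam)) / (2 * step))
    return grad

sigma = [3.0, 1.0, 0.5]
analytic = mur_grad_wrt_sigma(sigma, r_max=3.0, lam=1.0)
numeric = finite_difference_grad(sigma, r_max=3.0, lam=1.0)
print(analytic, numeric)
```

In this example the gradient is positive for the largest singular value and negative for the smallest, so a gradient descent step flattens the spectrum and raises the entropy-ESS towards $r_\mathrm{max}$.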

This pipeline allows training to directly optimize memory utilization subject to the capacity of each sequence model layer.

3. Implementation Workflow and Recommended Hyperparameters

A PyTorch implementation computes entropy-ESS over the submatrices $H_i$ for all relevant time steps, averaging the squared penalty over the batch and accumulating it across time steps and layers, as in the following template:

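import torch

def compute_entropy_ess(H, eps=1e-8):
    # Only the singular values are needed; svdvals is differentiable and skips U and V.
    S = torch.linalg.svdvals(H)
    # Normalize the spectrum into a probability vector P.
    P = S / (S.sum(dim=-1, keepdim=True) + eps)
    # Shannon entropy of P; eps keeps the log finite for zero singular values.
    H_ent = -(P * (P + eps).log()).sum(dim=-1)
    # exp(entropy) is the smooth surrogate for rank(H).
    return H_ent.exp()

def mur_penalty(T_matrices, lambda_mur, r_max, d, seq_len):
    # T_matrices: one (..., d*seq_len, d*seq_len) causal operator per layer.
    loss = 0.
    for T in T_matrices:
        for i in range(1, seq_len):
            # Rows of steps >= i, columns of steps < i.
            H_i = T[..., d*i:, :d*i]
            ESS = compute_entropy_ess(H_i)
            # Mean over batch dimensions, accumulated over steps and layers.
            loss = loss + ((r_max - ESS)**2).mean()
    return lambda_mur * loss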

To handle large $d\ell$, batched or truncated SVD is recommended. Typical hyperparameters:

$\lambda$ (MUR weight): $1 \times 10^{-3}$ to $1 \times 10^{-2}$
$\epsilon$ (numerical stability in entropy): $1 \times 10^{-8}$
$r_\mathrm{max}^{(\ell)}$: set to the theoretical state size $TSS^{(\ell)}$ or to $\min(\text{rows}(H_i), \text{cols}(H_i))$

Initialization of recurrent $A$ matrices near identity (e.g., $A = I +$ small noise) is empirically beneficial. For state-space featurizers, normalization may be necessary to ensure ESS tracks TSS growth.
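One possible way to realize the batched SVD mentioned above is to zero-pad every $H_i$ to the full $d\ell \times d\ell$ size, since padding with zeros leaves the nonzero singular values unchanged, and then run a single batched decomposition. This masking trick is an assumption of this sketch rather than the source's method, and the helper names are hypothetical. NumPy is used to keep the example dependency-light; `torch.linalg.svdvals` also accepts batched input.

```python
import numpy as np

def entropy_ess(S, eps=1e-8):
    """Entropy-ESS along the last axis of an array of singular values."""
    P = S / (S.sum(axis=-1, keepdims=True) + eps)
    return np.exp(-(P * np.log(P + eps)).sum(axis=-1))

def ess_loop(T, d, seq_len):
    """Reference: one SVD per time step on the exact submatrix H_i."""
    return np.array([
        entropy_ess(np.linalg.svd(T[d * i:, :d * i], compute_uv=False))
        for i in range(1, seq_len)
    ])

def ess_batched(T, d, seq_len):
    """Zero-pad every H_i to full size and run a single batched SVD."""
    steps = np.arange(1, seq_len)
    block = np.arange(d * seq_len) // d                      # time step of each row/column
    row_mask = block[None, :, None] >= steps[:, None, None]  # keep rows of steps >= i
    col_mask = block[None, None, :] < steps[:, None, None]   # keep columns of steps < i
    padded = T[None] * (row_mask & col_mask)                 # (seq_len - 1, d*l, d*l)
    return entropy_ess(np.linalg.svd(padded, compute_uv=False))

rng = np.random.default_rng(0)
d, seq_len = 2, 5
T = np.tril(rng.standard_normal((d * seq_len, d * seq_len)))
print(np.allclose(ess_loop(T, d, seq_len), ess_batched(T, d, seq_len)))
```

The padded variant trades extra floating-point work and memory, since every matrix in the batch is $d\ell \times d\ell$, for a single library call, so whether it pays off depends on the hardware and on $\ell$.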

4. Empirical Effects and Observations

Empirical evaluation demonstrates MUR’s substantive impact on memory-heavy tasks. Without MUR, models such as gated linear attention (GLA) and weighted linear attention (WLA) exhibit “state collapse” and underperform simple linear attention (LA) models. Introducing MUR with moderate $\lambda$ enables these models to recover ESS and surpass LA in long-range recall, specifically on the multi-query associative recall task. An inverted-U-shaped relationship is observed between test accuracy and $\lambda$: too small a value has negligible effect, a moderate value yields the best performance, and excessive regularization ($\lambda$ too large) pushes the model towards a degenerate identity regime, reducing task accuracy (Parnichkun et al., 28 Apr 2025).

5. Limitations and Practical Caveats

Using ESS-based regularization introduces several computational and modeling considerations:

Cost: Full SVD per time step incurs $O(\ell^3)$ overhead in the sequence length (for fixed $d$); truncated or batched SVD should be employed for large-scale models.
Differentiability: Tolerance-ESS, which counts the singular values above a fixed threshold, is not differentiable; entropy-ESS is used for gradient-based optimization but may be numerically unstable if the singular value spectrum is sharply peaked.
Applicability: Input-varying operators may not fully satisfy assumptions for causal minimal realization; entropy-ESS is still a lower bound on required state.
Regularization Strength: Excessive $\lambda$ leads to degenerate solutions (identity regime), defeating the intended memory benefit.
Averaging: With long sequences or batch averaging, careful selection of $\lambda$ and the averaging window is needed to avoid over-penalizing or under-regularizing.
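The differentiability caveat can be made concrete with a small comparison. The sketch assumes that tolerance-ESS is a hard count of singular values above a threshold `tol`; the function names, the threshold, and the spectra are illustrative choices, not values from the source.

```python
import math

def tolerance_ess(singular_values, tol=1e-3):
    """Hard count of singular values above tol (assumed definition)."""
    return sum(1 for s in singular_values if s > tol)

def entropy_ess(singular_values):
    """exp of the Shannon entropy of the L1-normalized singular values."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values]
    return math.exp(-sum(p * math.log(p) for p in probs if p > 0.0))

base = [1.0, 0.5, 0.2]
nudged = [1.0, 0.5, 0.21]   # smallest singular value increased slightly

print(tolerance_ess(base), tolerance_ess(nudged))  # 3 3
print(entropy_ess(base) < entropy_ess(nudged))     # True
```

The hard count does not react to the perturbation, so it provides no gradient signal, whereas the entropy-based surrogate increases as the spectrum becomes flatter.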

6. Theoretical and Practical Significance

MUR operationalizes the distinction between a model’s theoretical memory capacity and the portion actually utilized during sequence processing. Unlike raw visualizations or naïve memory capacity measures, entropy-ESS provides a continuous, actionable, and interpretable metric for recurrent state analysis. Leveraging MUR establishes a pathway for principled architecture design, initialization, and distillation, directly targeting the efficiency-accuracy frontier for sequence models. Properly calibrated, MUR facilitates improved long-range information recall without incurring significant computational or efficiency penalties (Parnichkun et al., 28 Apr 2025).


References

Parnichkun et al. (2025). Quantifying Memory Utilization with Effective State-Size.
