
Memory Utilization Regularizer (MUR)

Updated 16 March 2026
  • Memory Utilization Regularizer (MUR) is a differentiable penalty that maximizes internal memory usage in sequence models using an entropy-based Effective State-Size (ESS) metric.
  • It regularizes the gap between the theoretical maximal state size and the observed ESS, thereby promoting efficient long-range recall capabilities.
  • Empirical evaluations show that appropriate MUR settings enhance recurrent model performance while mitigating issues like state collapse and degenerate identity regimes.

A Memory Utilization Regularizer (MUR) is a differentiable penalty designed to encourage sequence models—specifically those parameterized as input-invariant or input-varying linear operators—to maximize their internal memory utilization as quantified by the Effective State-Size (ESS) metric. Developed within the general framework for sequence architecture analysis, MUR promotes the realization of a model’s available memory capacity, enabling more efficient and effective use of recurrent state for long-range recall tasks. The central mechanism relies on regularizing the gap between entropy-based effective state rank and the theoretical maximal state size, and it applies to a range of computational units, including attention variants, convolutions, and recurrent architectures (Parnichkun et al., 28 Apr 2025).

1. Mathematical Foundations: ESS and MUR Construction

Let $T \in \mathbb{R}^{d\ell \times d\ell}$ denote a causal linear operator governing a sequence model, mapping an input $u$ to an output $y = T u$, with $d$ the hidden dimension and $\ell$ the sequence length. The operator is block-structured with $d \times d$ submatrices $T_{ij}$ mapping from step $j$ to step $i$.

Effective State-Size (ESS)

  • Input-invariant (LTI/LTV) case: For each time $i$, form the strictly lower triangular submatrix $H_i = T_{i:, :i-1} \in \mathbb{R}^{d(\ell-i) \times d i}$. The ESS at time $i$ is the minimal recurrent state size: $ESS_i = \operatorname{rank} H_i$.
  • Input-varying operators: Where $T$ depends on $u$ via a featurizer $f_T(u)$, define $H_i(u) = [f_T(u)]_{i:, :i-1}$ and $ESS_i(u) = \operatorname{rank} H_i(u)$.
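The rank definition can be illustrated on a toy operator. The sketch below is an assumption-laden example, not code from the paper: it takes $d = 1$, builds $T$ from the impulse response of a hypothetical two-state diagonal recurrence (decay rates $1/2$ and $1/3$ chosen arbitrarily), and computes each $\operatorname{rank} H_i$ exactly with rational arithmetic so that no numerical tolerance is involved. Indexing is 0-based, matching the slicing `T[i:, :i]`.

```python
from fractions import Fraction


def exact_rank(rows):
    """Rank of a matrix (list of rows) via Gaussian elimination over rationals."""
    m = [list(r) for r in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy operator (d = 1): T[i][j] = a1**(i-j) + a2**(i-j) for i >= j.
a1, a2 = Fraction(1, 2), Fraction(1, 3)
seq_len = 6
T = [[(a1 ** (i - j) + a2 ** (i - j)) if i >= j else Fraction(0)
      for j in range(seq_len)]
     for i in range(seq_len)]

# H_i keeps the rows for steps >= i and the columns for steps < i.
ess = [exact_rank([row[:i] for row in T[i:]]) for i in range(1, seq_len)]
print(ess)  # → [1, 2, 2, 2, 1]
```

The interior values equal 2, the state size of the underlying recurrence, while the first and last steps are capped at 1 because $H_i$ has only one column or one row there.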

Since the integer-valued rank is non-differentiable, a smooth surrogate based on singular value decomposition (SVD) is used. Define the normalized spectrum $p_j = \sigma_j / \sum_k \sigma_k$, where $\sigma_j$ are the singular values of $H_i$. The entropy-ESS is $ESS_i^\mathrm{ent} = \exp(-\sum_j p_j \log p_j)$. This construction yields a differentiable approximation to the rank that is amenable to gradient-based optimization.
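To see how the entropy surrogate behaves relative to the hard rank, the following minimal sketch evaluates it directly on hand-picked spectra. The function name `entropy_ess` and the example singular values are illustrative assumptions; zero singular values are skipped, which corresponds to the convention $0 \log 0 = 0$.

```python
import math


def entropy_ess(singular_values):
    """exp(Shannon entropy) of the normalized singular-value spectrum."""
    total = sum(singular_values)
    # Drop exact zeros: they contribute nothing under the 0 * log(0) = 0 convention.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


flat = entropy_ess([1.0, 1.0, 1.0, 1.0])         # flat spectrum: equals the rank, 4
peaked = entropy_ess([1.0, 1e-6, 1e-6, 1e-6])    # rank 4, but barely above 1
mixed = entropy_ess([2.0, 1.0, 1.0, 0.0])        # rank 3, entropy-ESS = 2**1.5
print(flat, peaked, mixed)
```

A flat spectrum recovers the rank exactly, while uneven spectra score below their rank, which is the gap the regularizer is meant to close.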

Regularizer Formulation

Given the maximum achievable rank $r_\mathrm{max} = \min(\text{rows}(H_i), \text{cols}(H_i))$ or some known theoretical state size $TSS$, the per-step MUR loss is

$$L_i^{MUR} = (r_\mathrm{max} - ESS_i^\mathrm{ent})^2$$

Aggregating over all time steps $i$, layers $\ell$, and batch entries $b$ (in this sum $\ell$ indexes layers rather than the sequence length), the total MUR penalty becomes

$$L_{MUR} = \lambda \sum_{\ell,b,i} (r_\mathrm{max}^{(\ell)} - ESS_{i,b,\ell}^{\mathrm{ent}})^2$$

with regularization weight $\lambda \geq 0$ (Parnichkun et al., 28 Apr 2025).
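As a quick sanity check of the per-step loss, consider a hypothetical block $H_i$ with $r_\mathrm{max} = 4$ whose singular values are $(1, 1, 0, 0)$. These numbers are chosen for illustration and are not taken from the paper; the convention $0 \log 0 = 0$ is assumed.

```latex
\begin{aligned}
p &= \left(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0\right), \\
ESS_i^{\mathrm{ent}} &= \exp\!\left(-2 \cdot \tfrac{1}{2} \log \tfrac{1}{2}\right) = \exp(\log 2) = 2, \\
L_i^{MUR} &= (4 - 2)^2 = 4 .
\end{aligned}
```

With $\lambda = 10^{-2}$ this step would contribute $0.04$ to the total penalty, whereas a flat spectrum $(1, 1, 1, 1)$ gives $ESS_i^{\mathrm{ent}} = 4$ and zero penalty.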

2. Algorithmic Differentiation and Backpropagation

All steps in the MUR construction are differentiable and compatible with GPU-based auto-differentiation systems. The main computational steps and gradients are:

  • Entropy-ESS: $ESS_i^{\mathrm{ent}} = \exp(H_\mathrm{ent}(P_i))$, with $H_\mathrm{ent}(P_i) = -\sum_m P_i^m \log P_i^m$ and $P_i = s_i / \|s_i\|_1$, where $s_i$ is the vector of singular values $\sigma_i^m$ of $H_i$.
  • Backpropagation steps are:
    • $\frac{\partial L_{MUR}}{\partial ESS_i^{\mathrm{ent}}} = -2\lambda(r_\mathrm{max} - ESS_i^{\mathrm{ent}})$
    • $\frac{\partial ESS_i^{\mathrm{ent}}}{\partial H_\mathrm{ent}} = ESS_i^{\mathrm{ent}}$
    • $\frac{\partial H_\mathrm{ent}}{\partial P_i^m} = -(\log P_i^m + 1)$
    • $\frac{\partial P_i^m}{\partial \sigma_i^n} = \delta_{mn}/\|s_i\|_1 - \sigma_i^m/\|s_i\|_1^2$, which reduces to $1/\|s_i\|_1 - \sigma_i^m/\|s_i\|_1^2$ for $n = m$
    • $\frac{\partial \sigma_i^m}{\partial H_i} = u_i^m (v_i^m)^\top$ via the SVD $H_i = U \Sigma V^\top$, where $u_i^m$ and $v_i^m$ are the $m$-th left and right singular vectors
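Chaining the partial derivatives above and summing over all $m$ (the normalization couples every $P_i^m$ to every singular value) collapses to the closed form $\partial ESS_i^{\mathrm{ent}} / \partial \sigma_i^n = -(ESS_i^{\mathrm{ent}} / \|s_i\|_1)(\log P_i^n + H_\mathrm{ent})$. The sketch below checks this expression against a central finite difference on an arbitrary spectrum. The function names, the spectrum, and the values of `r_max` and `lam` are illustrative assumptions.

```python
import math


def entropy_ess(sigmas):
    """Return (ESS, entropy, normalized spectrum, L1 norm) for positive sigmas."""
    total = sum(sigmas)
    probs = [s / total for s in sigmas]
    h_ent = -sum(p * math.log(p) for p in probs)
    return math.exp(h_ent), h_ent, probs, total


def mur_loss(sigmas, r_max, lam):
    """Per-step MUR loss as a function of the singular values."""
    ess = entropy_ess(sigmas)[0]
    return lam * (r_max - ess) ** 2


def mur_grad(sigmas, r_max, lam):
    """Analytic gradient of the per-step loss with respect to each sigma."""
    ess, h_ent, probs, total = entropy_ess(sigmas)
    # Closed form of dESS/dsigma_n obtained from the chain rule.
    d_ess = [-(ess / total) * (math.log(p) + h_ent) for p in probs]
    return [-2.0 * lam * (r_max - ess) * g for g in d_ess]


sigmas = [3.0, 1.0, 0.5]
analytic = mur_grad(sigmas, r_max=3, lam=1e-2)

# Central finite differences as an independent check.
h = 1e-6
numeric = []
for n in range(len(sigmas)):
    up, down = list(sigmas), list(sigmas)
    up[n] += h
    down[n] -= h
    numeric.append((mur_loss(up, 3, 1e-2) - mur_loss(down, 3, 1e-2)) / (2 * h))

max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, max_err)
```

The sign pattern matches the intent of the regularizer: the loss gradient is positive for the dominant singular value and negative for the smallest one, so a descent step flattens the spectrum.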

This pipeline allows training to directly optimize memory utilization subject to the capacity of each sequence model layer.

A PyTorch implementation computes entropy-ESS over submatrices $H_i$ for all relevant time steps, averaging the squared penalty per batch and layer, as in the following template:

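```python
import torch


def compute_entropy_ess(H, eps=1e-12):
    """Entropy-based effective state size of H (batched over leading dims)."""
    # Only the singular values are needed, so U and V are not computed.
    S = torch.linalg.svdvals(H)
    # Normalize the spectrum into a probability vector.
    P = S / (S.sum(dim=-1, keepdim=True) + eps)
    # Shannon entropy; eps guards log(0) for vanishing singular values.
    H_ent = -(P * (P + eps).log()).sum(dim=-1)
    return H_ent.exp()


def mur_penalty(T_matrices, lambda_mur, r_max, d, seq_len):
    """MUR penalty summed over layers and time steps, averaged over the batch."""
    loss = 0.0
    for T in T_matrices:  # one (..., d*seq_len, d*seq_len) operator per layer
        for i in range(1, seq_len):
            # Rows for steps >= i, columns for steps < i.
            H_i = T[..., d * i:, : d * i]
            ESS = compute_entropy_ess(H_i)
            loss = loss + ((r_max - ESS) ** 2).mean()
    return lambda_mur * loss
```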
To handle large $d\ell$, batched or truncated SVD is recommended. Typical hyperparameters:

  • $\lambda$ (MUR weight): $1 \times 10^{-3}$ to $1 \times 10^{-2}$
  • $\epsilon$ (numerical stability in entropy): $1 \times 10^{-8}$
  • $r_\mathrm{max}^{(\ell)}$: set to the theoretical state size $TSS^{(\ell)}$ or $\min(\text{dim}(H_i))$

Initialization of recurrent $A$ matrices near identity (e.g., $A = I +$ small noise) is empirically beneficial. For state-space featurizers, normalization may be necessary to ensure ESS tracks TSS growth.
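To check that the penalty is differentiable end to end, the sketch below runs one backward pass on a toy operator. Everything in it is an assumption for illustration: the helper names `entropy_ess` and `toy_mur_penalty`, the random lower-triangular $T$ standing in for a real model's operator, the tiny sizes, and the choice of $r_\mathrm{max} = \min(\text{rows}(H_i), \text{cols}(H_i))$ per step. The weight $10^{-3}$ is taken from the typical range listed above.

```python
import torch


def entropy_ess(H, eps=1e-12):
    """exp(entropy) of the normalized singular values, batched over leading dims."""
    s = torch.linalg.svdvals(H)
    p = s / (s.sum(dim=-1, keepdim=True) + eps)
    return torch.exp(-(p * (p + eps).log()).sum(dim=-1))


def toy_mur_penalty(T, d, seq_len, lam):
    """Hypothetical single-layer penalty with r_max = min(rows, cols) per step."""
    loss = T.new_zeros(())
    for i in range(1, seq_len):
        H_i = T[..., d * i:, : d * i]  # rows for steps >= i, columns for steps < i
        r_max = min(H_i.shape[-2], H_i.shape[-1])
        loss = loss + ((r_max - entropy_ess(H_i)) ** 2).mean()
    return lam * loss


torch.manual_seed(0)
d, seq_len, batch = 2, 6, 3
# Stand-in parameters; a real model would produce T from its own weights.
raw = torch.randn(batch, d * seq_len, d * seq_len, requires_grad=True)
T = torch.tril(raw)  # zero out the upper triangle so the operator is causal

penalty = toy_mur_penalty(T, d, seq_len, lam=1e-3)
penalty.backward()
print(float(penalty), raw.grad.shape)
```

Gradients reach only the entries that appear in some $H_i$, so the diagonal blocks and the masked upper triangle receive zero gradient from this term.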

4. Empirical Effects and Observations

Empirical evaluation demonstrates MUR’s substantive impact on memory-heavy tasks. Without MUR, models such as GLA and WLA exhibit “state collapse” and underperform compared to simple LA models. Introducing MUR with moderate $\lambda$ enables these models to recover ESS and surpass LA in long-range recall, specifically on the multi-query associative recall task. Test accuracy follows an inverted-U relationship with $\lambda$: too small a value has negligible effect, a moderate value yields the best performance, and excessive regularization ($\lambda$ too large) pushes the model toward a degenerate identity regime, reducing task accuracy (Parnichkun et al., 28 Apr 2025).

5. Limitations and Practical Caveats

Using ESS-based regularization introduces several computational and modeling considerations:

  • Cost: Full SVD per time step incurs $O(\ell^3)$ overhead; truncated or batched SVD should be employed for large-scale models.
  • Differentiability: Tolerance-ESS (the count of singular values above a threshold) is not differentiable; entropy-ESS is used for gradient-based optimization but may be numerically unstable if the singular value spectrum is sharply peaked.
  • Applicability: Input-varying operators may not fully satisfy assumptions for causal minimal realization; entropy-ESS is still a lower bound on required state.
  • Regularization Strength: Excessive $\lambda$ leads to degenerate solutions (identity regime), defeating the intended memory benefit.
  • Averaging: With long sequences or batch averaging, careful selection of $\lambda$ and the averaging window is needed to avoid over-penalizing or under-regularizing.

6. Theoretical and Practical Significance

MUR operationalizes the distinction between a model’s theoretical memory capacity and the portion actually utilized during sequence processing. Unlike raw visualizations or naïve memory capacity measures, entropy-ESS provides a continuous, actionable, and interpretable metric for recurrent state analysis. Leveraging MUR establishes a pathway for principled architecture design, initialization, and distillation, directly targeting the efficiency-accuracy frontier for sequence models. Properly calibrated, MUR facilitates improved long-range information recall without incurring significant computational or efficiency penalties (Parnichkun et al., 28 Apr 2025).

