
Output Embedding Centering (OEC)

Updated 6 January 2026
  • Output Embedding Centering (OEC) is a methodology that subtracts the global mean from embedding vectors to reveal relative variations.
  • It enhances spectral properties by eliminating the rank-one spike in uncentered data, leading to more accurate PCA/SVD outcomes.
  • OEC improves LLM training stability through methods like μ-centering and μ-loss, offering robustness with minimal computational overhead.

Output Embedding Centering (OEC) refers to a set of methodologies for subtracting the mean output embedding vector before performing downstream operations such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), or using logits in LLM training. The OEC paradigm ensures that the resulting embeddings capture relative variations rather than being dominated by global mean offsets. This centering process has both theoretical and practical consequences, including improved spectral properties, stable training dynamics, and robust mitigation of logit divergence, particularly in deep learning pipelines and LLM pretraining (Kim et al., 2023; Stollenwerk et al., 5 Jan 2026).

1. Mathematical Foundation of Output Embedding Centering

Let $X \in \mathbb{R}^{n \times p}$ be a data matrix with rows $x_i \in \mathbb{R}^p$. The mean embedding (row-mean vector) $\mu$ is defined as:

$$\mu = \frac{1}{n} X^\top 1_n$$

where $1_n \in \mathbb{R}^n$ is the all-ones vector. The centered data matrix is

$$X_c = X - 1_n \mu^\top$$

This guarantees each column of $X_c$ has mean zero: $1_n^\top X_c = 0$. In the context of embedding-based models, OEC refers to subtracting this mean embedding from each output vector prior to further processing.

Centering is indispensable for PCA/SVD because the uncentered second-moment matrix,

$$X^\top X = X_c^\top X_c + n \mu \mu^\top$$

contains a rank-one "spike" $n \mu \mu^\top$ due to the mean, distorting the principal axes and eigen-spectrum (Kim et al., 2023).
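
This decomposition is easy to verify numerically. The following NumPy sketch (synthetic data and shapes chosen purely for illustration, not taken from the cited papers) centers a matrix and checks both the zero-mean property of $X_c$ and the rank-one split of $X^\top X$:

import numpy as np

# Synthetic check of  X^T X = X_c^T X_c + n * mu mu^T  (sizes are illustrative)
rng = np.random.default_rng(0)
n, p = 500, 16
X = rng.normal(size=(n, p)) + 3.0          # global offset so that mu != 0

mu = X.mean(axis=0)                        # mu = (1/n) X^T 1_n
Xc = X - mu                                # X_c = X - 1_n mu^T (via broadcasting)

assert np.allclose(Xc.mean(axis=0), 0.0)   # every column of X_c has mean zero
lhs = X.T @ X
rhs = Xc.T @ Xc + n * np.outer(mu, mu)     # rank-one "mean spike" term
assert np.allclose(lhs, rhs)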

2. OEC in PCA/SVD-Based Embedding Pipelines

SVD of the centered matrix yields $X_c = U_c \Sigma_c V_c^\top$; for uncentered data, $X = U \Sigma V^\top$ with respective spectral values. Absent centering, the first right singular vector $v_1$ maximizes $x^\top X^\top X x$ over $\|x\| = 1$, aligning with the mean direction $\mu$. Proposition 1 (the "parallel" condition) establishes that if the first centered singular vector $\bar{v}_1$ is parallel to $\mu$, then $v_1 = \bar{v}_1 = \mu/\|\mu\|$ (Kim et al., 2023).

If $\bar{v}_1 \parallel \mu$, discarding the leading component from the uncentered embedding (keeping $v_2, \ldots, v_{k+1}$) retrieves the centered $k$-dimensional embedding up to sign: $X V_{2:k+1} = X_c V_{c,1:k}$. More generally, if the spans of the first $k$ right singular vectors of $X_c$ and $X$ differ only by an orthogonal $k \times k$ change of basis, then their embeddings are related by an orthogonal transformation.

Spectrally, the eigenvalues of $X_c^\top X_c$ interlace those of $X^\top X$. The top eigenvalue $\sigma_1^2$ of $X^\top X$ includes the mean component $n \|\mu\|^2$, "soaking up" variance and relegating relative structure to lower-order components. Thus, without centering, principal axes are prone to over-represent the mean, obscuring meaningful structure (Kim et al., 2023).
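
The spectral effect can be illustrated on synthetic data with a deliberately large mean offset (all names and magnitudes below are illustrative assumptions, not taken from Kim et al., 2023): the leading uncentered singular vector aligns with $\mu$, and its squared singular value is dominated by $n\|\mu\|^2$, whereas the centered spectrum contains no such spike.

import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 32
X = rng.normal(size=(n, p)) + 5.0 * rng.normal(size=p)    # strong fixed mean offset

mu = X.mean(axis=0)
Xc = X - mu

s_unc = np.linalg.svd(X, compute_uv=False)
s_cen = np.linalg.svd(Xc, compute_uv=False)
v1 = np.linalg.svd(X, full_matrices=False)[2][0]          # first right singular vector

print("cos(v1, mu)           :", abs(v1 @ mu) / np.linalg.norm(mu))  # ~1
print("sigma_1^2, uncentered :", s_unc[0] ** 2)                      # ~ n*||mu||^2 + noise
print("n * ||mu||^2          :", n * (mu @ mu))
print("sigma_1^2, centered   :", s_cen[0] ** 2)                      # far smaller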

3. OEC for Stable LLM Pretraining: Geometric Diagnosis and Formalism

In LLMs, especially decoder-only architectures, output logits $l_i = e_i \cdot h$ are formed from output embeddings $e_i \in \mathbb{R}^H$ and the hidden state $h \in \mathbb{R}^H$. At large learning rates, output-logit divergence occurs: some logits $l_j$ tend to $+\infty$ or $-\infty$, precipitating training collapse. Existing solutions such as z-loss regularization ($\mathcal{L}_z = \alpha \log^2(Z)$ with $Z = \sum_j \exp(l_j)$) suppress positive logit divergence but allow negative drift (Stollenwerk et al., 5 Jan 2026).
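
For comparison with the OEC variants below, here is a minimal sketch of the z-loss term as defined above (PyTorch; the batch shape and the use of logsumexp for numerical stability are implementation assumptions, not the paper's code):

import torch

def z_loss(logits: torch.Tensor, alpha: float = 1e-4) -> torch.Tensor:
    # L_z = alpha * log^2(Z) with Z = sum_j exp(l_j), averaged over all positions
    log_Z = torch.logsumexp(logits, dim=-1)   # log Z, computed stably
    return alpha * log_Z.pow(2).mean()

logits = torch.randn(4, 16, 1000)             # (batch, sequence, vocabulary), illustrative
print(z_loss(logits))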

OEC directly targets the source: the anisotropic drift of the mean output embedding $\mu = (1/V) \sum_{i=1}^{V} e_i$. If $\|\mu\|$ is uncontrolled, it induces an unbounded global shift in the logits. Centering ensures $\mu \approx 0$, bounding logit values and preventing divergence.

There are two primary OEC variants:

  • μ-centering: Deterministically subtract $\mu$ from every $e_i$ after each optimization step:

$$e_i^\star = e_i - \mu$$

This ensures $\bar{l}^\star = 0$ and, by the shift invariance of softmax, does not alter the loss or the predicted probabilities (see the sketch after this list).

  • μ-loss: Regularize the mean norm by adding $\mathcal{L}_\mu = \lambda \|\mu\|^2$ to the standard negative log-likelihood:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{NLL}} + \mathcal{L}_\mu$$

This penalizes excessive drift in $\|\mu\|$; $\lambda$ is typically $10^{-4}$.
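
A minimal sketch of the softmax-invariance argument behind μ-centering (sizes and tensors are illustrative, not the paper's code): subtracting $\mu$ shifts every logit by the same scalar $\mu \cdot h$, so the predicted distribution, and hence the NLL loss, is unchanged, while the mean centered logit becomes zero.

import torch

torch.manual_seed(0)
V, H = 1000, 64                        # illustrative vocabulary and hidden sizes
E = torch.randn(V, H) + 0.5            # output embeddings with a nonzero mean
h = torch.randn(H)                     # one hidden state

mu = E.mean(dim=0)
logits = E @ h                         # l_i = e_i . h
logits_c = (E - mu) @ h                # l_i* = (e_i - mu) . h
shift = mu @ h                         # common offset shared by all logits

assert torch.allclose(logits - shift, logits_c, atol=1e-4)
assert torch.allclose(torch.softmax(logits, -1), torch.softmax(logits_c, -1), atol=1e-6)
print("mean centered logit:", logits_c.mean().item())   # ~0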

4. Theoretical Guarantees and Algorithmic Implementation

Theorem 3.5 of (Stollenwerk et al., 5 Jan 2026) shows that μ-centering provably tightens the bound on the maximum logit. Assuming the dot products $e_i \cdot \mu$ lie in $[\|\mu\|^2 - B_-, \|\mu\|^2 + B_+]$, centering reduces the effective spread by a ratio $B_{\text{ratio}} \leq 1$, enforcing $\max_i |l_i^\star| \leq \max_i |l_i|$.

For μ-loss, since an unbounded $\|\mu\|$ incurs an unbounded regularization penalty, the optimizer keeps $\mu$ close to zero, thereby bounding the logits.

A generic PyTorch-style implementation of both OEC variants is:

import torch

# Training loop with both OEC variants; `model.output_embeddings` stands for the
# output (unembedding) weight matrix of shape (vocab_size, hidden_dim).
for batch in data_loader:
    logits, loss_nll = model(batch)
    if use_mu_loss:
        # mu-loss: penalize the squared norm of the mean output embedding
        mu = model.output_embeddings.mean(dim=0)
        loss_mu = lambda_mu * mu.pow(2).sum()
        loss = loss_nll + loss_mu
    else:
        loss = loss_nll
    loss.backward()
    optimizer.step()
    if use_mu_centering:
        # mu-centering: deterministically subtract the mean embedding after the step
        with torch.no_grad():
            E = model.output_embeddings
            mu = E.mean(dim=0, keepdim=True)
            E.sub_(mu)
    optimizer.zero_grad()

Computational cost is minimal: μ-centering adds approximately 0.5% overhead and μ-loss even less; both remain below 1%.

5. Best Practices, Hyperparameter Sensitivity, and Experimental Outcomes

OEC does not require changes to learning-rate schedules. μ-centering is hyperparameter-free and stable across all tested learning rates. For μ-loss, $\lambda \geq 10^{-4}$ is recommended, and results are insensitive to tuning across the range $10^{-4}$ to $10^{-1}$. By contrast, z-loss requires careful tuning of $\alpha$: the commonly used default is $\alpha = 10^{-4}$, while $\alpha \approx 10^{-1}$ was optimal in practice.

Empirical results reported in (Stollenwerk et al., 5 Jan 2026) demonstrate:

  • Optimal test loss: All methods (baseline, z-loss, μ-loss, μ-centering) reach essentially identical minimum values.
  • Learning rate sensitivity (LRS): μ-loss and μ-centering yield the lowest LRS, indicating greatest stability. For the largest Transformer (221M params), LRS for μ-loss is 0.056, for μ-centering is 0.061, for z-loss is 0.109, and for baseline is 0.412.
  • Convergence under large learning rates: μ-based OEC variants remain stable at $\eta = 0.3$, while baseline and z-loss diverge for $\eta \gtrsim 0.003$ and $\eta \gtrsim 0.1$, respectively.
  • Mean embedding norm diagnostics: μ-centering enforces $\|\mu\| = 0$; μ-loss holds $\|\mu\| \approx 0$. Baseline and z-loss allow $\|\mu\|$ to grow with the learning rate.
  • Runtime overhead: μ-centering and μ-loss are competitive (<1% overhead), outperforming z-loss (+0.8% to +6.4%).

Method              LRS     Runtime Overhead
Baseline            0.412   0%
z-loss (α = 10⁻⁴)   0.109   +0.8%
μ-loss (λ = 10⁻⁴)   0.056   +0.2%
μ-centering         0.061   +0.3%

6. Connections to Broader Embedding Theory and Robustness

OEC unifies several principles underlying dimensionality reduction and representation learning. In embedding-based pipelines (e.g., word embeddings, graph embeddings, deep representation vectors), failure to center results in the first axis being dominated by the global mean. Subtracting the mean ensures alignment of downstream principal axes with genuine structural directions, facilitates orthogonal invariance, and stabilizes spectral properties—removing a rank-one “mean spike” and restoring equivalence between covariance-based and SVD-based PCA (Kim et al., 2023).

When mean centering is impractical, discarding the first singular direction in uncentered SVD is an effective heuristic, closely matching centered PCA outcomes with an error bounded by the top centered singular value.
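
A sketch of this heuristic on synthetic data with a dominant mean direction and a clear low-rank signal (all parameters are illustrative assumptions, not taken from Kim et al., 2023): dropping the first uncentered singular direction yields a $k$-dimensional embedding whose span nearly coincides with that of centered PCA, as measured by the cosines of the principal angles between the two subspaces.

import numpy as np

rng = np.random.default_rng(2)
n, p, k = 2000, 50, 5
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))               # k strong latent directions
X = signal + 0.1 * rng.normal(size=(n, p)) + 30.0 * rng.normal(size=p)   # + noise + large mean

Xc = X - X.mean(axis=0)
Vt = np.linalg.svd(X, full_matrices=False)[2]        # right singular vectors of X
Vct = np.linalg.svd(Xc, full_matrices=False)[2]      # right singular vectors of X_c

emb_drop1 = X @ Vt[1:k + 1].T                        # uncentered SVD, leading direction dropped
emb_pca = Xc @ Vct[:k].T                             # centered PCA embedding

Q1 = np.linalg.qr(emb_drop1)[0]                      # orthonormal bases of the two subspaces
Q2 = np.linalg.qr(emb_pca)[0]
cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
print("cosines of principal angles:", np.round(cosines, 4))   # all close to 1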

7. Implications, Limitations, and Prospects

OEC (both μ-centering and μ-loss) offers a theoretically grounded, low-overhead solution to output-logit divergence by addressing the root cause: anisotropic drift of output embeddings. A plausible implication is the broad applicability of OEC to other settings where mean drift affects representations, including vision models, multimodal architectures, and robust unsupervised learning. Hyperparameter robustness and ease of integration suggest immediate utility across modern LLM training frameworks, with minimal alterations to existing pipelines (Stollenwerk et al., 5 Jan 2026).

No significant controversies are identified around the necessity of centering for SVD/PCA or its stabilizing effect on output logits. Nevertheless, the choice between deterministic μ-centering and regularization-based μ-loss may depend on downstream compatibility and system constraints.

Output Embedding Centering constitutes a canonical best practice for modern embedding analysis and stable model training.
