
Output Embedding Centering (OEC)

Updated 6 January 2026
  • Output Embedding Centering (OEC) is a methodology that subtracts the global mean from embedding vectors to reveal relative variations.
  • It enhances spectral properties by eliminating the rank-one spike in uncentered data, leading to more accurate PCA/SVD outcomes.
  • OEC improves LLM training stability through methods like μ-centering and μ-loss, offering robustness with minimal computational overhead.

Output Embedding Centering (OEC) refers to a set of methodologies for subtracting the mean output embedding vector before performing downstream operations such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), or using logits in LLM training. The OEC paradigm ensures that the resulting embeddings capture relative variations rather than being dominated by global mean offsets. This centering process has both theoretical and practical consequences, including improved spectral properties, stable training dynamics, and robust mitigation of logit divergence, particularly in deep learning pipelines and LLM pretraining (Kim et al., 2023; Stollenwerk et al., 5 Jan 2026).

1. Mathematical Foundation of Output Embedding Centering

Let $X \in \mathbb{R}^{n \times p}$ be a data matrix with rows $x_i \in \mathbb{R}^p$. The mean embedding (row-mean vector) $\mu$ is defined as:

$$\mu = \frac{1}{n} X^\top 1_n$$

where $1_n \in \mathbb{R}^n$ is the all-ones vector. The centered data matrix is

$$X_c = X - 1_n \mu^\top$$

This guarantees each column of $X_c$ has mean zero: $1_n^\top X_c = 0$. In the context of embedding-based models, OEC refers to subtracting this mean embedding from each output vector prior to further processing.

Centering is indispensable for PCA/SVD because the uncentered second-moment matrix,

$$X^\top X = X_c^\top X_c + n \mu \mu^\top$$

contains a rank-one "spike" $n \mu \mu^\top$ due to the mean, distorting the principal axes and eigen-spectrum (Kim et al., 2023).
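
This decomposition is easy to verify numerically. The following NumPy sketch (synthetic data and shapes chosen purely for illustration, not taken from the cited papers) centers a matrix and checks both the zero-mean property of $X_c$ and the rank-one split of $X^\top X$:

import numpy as np

# Synthetic check of  X^T X = X_c^T X_c + n * mu mu^T  (sizes are illustrative)
rng = np.random.default_rng(0)
n, p = 500, 16
X = rng.normal(size=(n, p)) + 3.0          # global offset so that mu != 0

mu = X.mean(axis=0)                        # mu = (1/n) X^T 1_n
Xc = X - mu                                # X_c = X - 1_n mu^T (via broadcasting)

assert np.allclose(Xc.mean(axis=0), 0.0)   # every column of X_c has mean zero
lhs = X.T @ X
rhs = Xc.T @ Xc + n * np.outer(mu, mu)     # rank-one "mean spike" term
assert np.allclose(lhs, rhs)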

2. OEC in PCA/SVD-Based Embedding Pipelines

SVD of the centered matrix yields $X_c = U_c \Sigma_c V_c^\top$; for uncentered data, $X = U \Sigma V^\top$ with respective spectral values. Absent centering, the first right singular vector $v_1$ maximizes $x^\top X^\top X x$ over $\|x\| = 1$, aligning with the mean direction $\mu$. Proposition 1 (the "parallel" condition) establishes that if the first centered singular vector $\bar{v}_1$ is parallel to $\mu$, then $v_1 = \bar{v}_1 = \mu/\|\mu\|$ (Kim et al., 2023).

If $\bar{v}_1 \parallel \mu$, discarding the leading component from the uncentered embedding (keeping $v_2, \ldots, v_{k+1}$) retrieves the centered $k$-dimensional embedding up to sign: $X V_{2:k+1} = X_c V_{c,1:k}$. More generally, if the spans of the first $k$ right singular vectors of $X_c$ and $X$ differ only by an orthogonal $k \times k$ change of basis, then their embeddings are related by an orthogonal transformation.

Spectrally, the eigenvalues of $X_c^\top X_c$ interlace those of $X^\top X$. The top eigenvalue $\sigma_1^2$ of $X^\top X$ includes the mean component $n \|\mu\|^2$, "soaking up" variance and relegating relative structure to lower-order components. Thus, without centering, principal axes are prone to over-represent the mean, obscuring meaningful structure (Kim et al., 2023).
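
The spectral effect can be illustrated on synthetic data with a deliberately large mean offset (all names and magnitudes below are illustrative assumptions, not taken from Kim et al., 2023): the leading uncentered singular vector aligns with $\mu$, and its squared singular value is dominated by $n\|\mu\|^2$, whereas the centered spectrum contains no such spike.

import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 32
X = rng.normal(size=(n, p)) + 5.0 * rng.normal(size=p)    # strong fixed mean offset

mu = X.mean(axis=0)
Xc = X - mu

s_unc = np.linalg.svd(X, compute_uv=False)
s_cen = np.linalg.svd(Xc, compute_uv=False)
v1 = np.linalg.svd(X, full_matrices=False)[2][0]          # first right singular vector

print("cos(v1, mu)           :", abs(v1 @ mu) / np.linalg.norm(mu))  # ~1
print("sigma_1^2, uncentered :", s_unc[0] ** 2)                      # ~ n*||mu||^2 + noise
print("n * ||mu||^2          :", n * (mu @ mu))
print("sigma_1^2, centered   :", s_cen[0] ** 2)                      # far smaller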

3. OEC for Stable LLM Pretraining: Geometric Diagnosis and Formalism

In LLMs, especially decoder-only architectures, output logits $l_i = e_i \cdot h$ are formed from output embeddings $e_i \in \mathbb{R}^H$ and the hidden state $h \in \mathbb{R}^H$. At large learning rates, output-logit divergence occurs: some logits $l_j$ tend to $+\infty$ or $-\infty$, precipitating training collapse. Existing solutions such as z-loss regularization ($\mathcal{L}_z = \alpha \log^2(Z)$ with $Z = \sum_j \exp(l_j)$) suppress positive logit divergence but allow negative drift (Stollenwerk et al., 5 Jan 2026).
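
For comparison with the OEC variants below, here is a minimal sketch of the z-loss term as defined above (PyTorch; the batch shape and the use of logsumexp for numerical stability are implementation assumptions, not the paper's code):

import torch

def z_loss(logits: torch.Tensor, alpha: float = 1e-4) -> torch.Tensor:
    # L_z = alpha * log^2(Z) with Z = sum_j exp(l_j), averaged over all positions
    log_Z = torch.logsumexp(logits, dim=-1)   # log Z, computed stably
    return alpha * log_Z.pow(2).mean()

logits = torch.randn(4, 16, 1000)             # (batch, sequence, vocabulary), illustrative
print(z_loss(logits))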

OEC directly targets the source: the anisotropic drift of the mean output embedding $\mu = (1/V) \sum_{i=1}^{V} e_i$. If $\|\mu\|$ is uncontrolled, it induces an unbounded global shift in the logits. Centering ensures $\mu \approx 0$, bounding logit values and preventing divergence.

There are two primary OEC variants:

  • μ-centering: Deterministically subtract $\mu$ from every $e_i$ after each optimization step:

$$e_i^\star = e_i - \mu$$

This ensures $\bar{l}^\star = 0$ and, by the shift invariance of softmax, does not alter the loss or the predicted probabilities (see the sketch after this list).

  • μ-loss: Regularize the mean norm by adding $\mathcal{L}_\mu = \lambda \|\mu\|^2$ to the standard negative log-likelihood:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{NLL}} + \mathcal{L}_\mu$$

This penalizes excessive drift in $\|\mu\|$; $\lambda$ is typically $10^{-4}$.
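
A minimal sketch of the softmax-invariance argument behind μ-centering (sizes and tensors are illustrative, not the paper's code): subtracting $\mu$ shifts every logit by the same scalar $\mu \cdot h$, so the predicted distribution, and hence the NLL loss, is unchanged, while the mean centered logit becomes zero.

import torch

torch.manual_seed(0)
V, H = 1000, 64                        # illustrative vocabulary and hidden sizes
E = torch.randn(V, H) + 0.5            # output embeddings with a nonzero mean
h = torch.randn(H)                     # one hidden state

mu = E.mean(dim=0)
logits = E @ h                         # l_i = e_i . h
logits_c = (E - mu) @ h                # l_i* = (e_i - mu) . h
shift = mu @ h                         # common offset shared by all logits

assert torch.allclose(logits - shift, logits_c, atol=1e-4)
assert torch.allclose(torch.softmax(logits, -1), torch.softmax(logits_c, -1), atol=1e-6)
print("mean centered logit:", logits_c.mean().item())   # ~0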

4. Theoretical Guarantees and Algorithmic Implementation

Theorem 3.5 of (Stollenwerk et al., 5 Jan 2026) shows that μ-centering provably tightens the bound on the maximum logit. Assuming the dot products $e_i \cdot \mu$ lie in $[\|\mu\|^2 - B_-, \|\mu\|^2 + B_+]$, centering reduces the effective spread by a ratio $B_{\text{ratio}} \leq 1$, enforcing $\max_i |l_i^\star| \leq \max_i |l_i|$.

For μ-loss, since an unbounded $\|\mu\|$ incurs an unbounded regularization penalty, the optimizer keeps $\mu$ close to zero, thereby bounding the logits.

A generic PyTorch-style implementation of both OEC variants is:

import torch

# Training loop with both OEC variants; `model.output_embeddings` stands for the
# output (unembedding) weight matrix of shape (vocab_size, hidden_dim).
for batch in data_loader:
    logits, loss_nll = model(batch)
    if use_mu_loss:
        # mu-loss: penalize the squared norm of the mean output embedding
        mu = model.output_embeddings.mean(dim=0)
        loss_mu = lambda_mu * mu.pow(2).sum()
        loss = loss_nll + loss_mu
    else:
        loss = loss_nll
    loss.backward()
    optimizer.step()
    if use_mu_centering:
        # mu-centering: deterministically subtract the mean embedding after the step
        with torch.no_grad():
            E = model.output_embeddings
            mu = E.mean(dim=0, keepdim=True)
            E.sub_(mu)
    optimizer.zero_grad()

Computational cost is minimal: μ-centering adds approximately 0.5% overhead and μ-loss even less; both remain below 1%.

5. Best Practices, Hyperparameter Sensitivity, and Experimental Outcomes

OEC does not require changes to learning-rate schedules. μ-centering is hyperparameter-free and stable across all tested learning rates. For μ-loss, $\lambda \geq 10^{-4}$ is recommended, and results are insensitive to tuning across the range $10^{-4}$ to $10^{-1}$. By contrast, z-loss requires careful tuning of $\alpha$: the commonly used default is $\alpha = 10^{-4}$, while $\alpha \approx 10^{-1}$ was optimal in practice.

Empirical results reported in (Stollenwerk et al., 5 Jan 2026) demonstrate:

  • Optimal test loss: All methods (baseline, z-loss, μ-loss, μ-centering) reach essentially identical minimum values.
  • Learning rate sensitivity (LRS): μ-loss and μ-centering yield the lowest LRS, indicating greatest stability. For the largest Transformer (221M params), LRS for μ-loss is 0.056, for μ-centering is 0.061, for z-loss is 0.109, and for baseline is 0.412.
  • Convergence under large learning rates: μ-based OEC variants remain stable at $\eta = 0.3$, while baseline and z-loss diverge for $\eta \gtrsim 0.003$ and $\eta \gtrsim 0.1$, respectively.
  • Mean embedding norm diagnostics: μ-centering enforces $\|\mu\| = 0$; μ-loss holds $\|\mu\| \approx 0$. Baseline and z-loss allow $\|\mu\|$ to grow with the learning rate.
  • Runtime overhead: μ-centering and μ-loss are competitive (<1% overhead), outperforming z-loss (+0.8% to +6.4%).

Method              LRS     Runtime Overhead
Baseline            0.412   0%
z-loss (α = 10⁻⁴)   0.109   +0.8%
μ-loss (λ = 10⁻⁴)   0.056   +0.2%
μ-centering         0.061   +0.3%

6. Connections to Broader Embedding Theory and Robustness

OEC unifies several principles underlying dimensionality reduction and representation learning. In embedding-based pipelines (e.g., word embeddings, graph embeddings, deep representation vectors), failure to center results in the first axis being dominated by the global mean. Subtracting the mean ensures alignment of downstream principal axes with genuine structural directions, facilitates orthogonal invariance, and stabilizes spectral properties—removing a rank-one “mean spike” and restoring equivalence between covariance-based and SVD-based PCA (Kim et al., 2023).

When mean centering is impractical, discarding the first singular direction in uncentered SVD is an effective heuristic, closely matching centered PCA outcomes with an error bounded by the top centered singular value.
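
A sketch of this heuristic on synthetic data with a dominant mean direction and a clear low-rank signal (all parameters are illustrative assumptions, not taken from Kim et al., 2023): dropping the first uncentered singular direction yields a $k$-dimensional embedding whose span nearly coincides with that of centered PCA, as measured by the cosines of the principal angles between the two subspaces.

import numpy as np

rng = np.random.default_rng(2)
n, p, k = 2000, 50, 5
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))               # k strong latent directions
X = signal + 0.1 * rng.normal(size=(n, p)) + 30.0 * rng.normal(size=p)   # + noise + large mean

Xc = X - X.mean(axis=0)
Vt = np.linalg.svd(X, full_matrices=False)[2]        # right singular vectors of X
Vct = np.linalg.svd(Xc, full_matrices=False)[2]      # right singular vectors of X_c

emb_drop1 = X @ Vt[1:k + 1].T                        # uncentered SVD, leading direction dropped
emb_pca = Xc @ Vct[:k].T                             # centered PCA embedding

Q1 = np.linalg.qr(emb_drop1)[0]                      # orthonormal bases of the two subspaces
Q2 = np.linalg.qr(emb_pca)[0]
cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
print("cosines of principal angles:", np.round(cosines, 4))   # all close to 1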

7. Implications, Limitations, and Prospects

OEC (both μ-centering and μ-loss) offers a theoretically grounded, low-overhead solution to output-logit divergence by addressing the root cause: anisotropic drift of output embeddings. A plausible implication is the broad applicability of OEC to other settings where mean drift affects representations, including vision models, multimodal architectures, and robust unsupervised learning. Hyperparameter robustness and ease of integration suggest immediate utility across modern LLM training frameworks, with minimal alterations to existing pipelines (Stollenwerk et al., 5 Jan 2026).

No significant controversies are identified around the necessity of centering for SVD/PCA or its stabilizing effect on output logits. Nevertheless, the choice between deterministic μ-centering and regularization-based μ-loss may depend on downstream compatibility and system constraints.

Output Embedding Centering constitutes a canonical best practice for modern embedding analysis and stable model training.
