
FixNorm for Word Embeddings

Updated 11 December 2025
  • The paper introduces FixNorm, which iteratively centers and renormalizes word embeddings to achieve an isotropic, zero-mean distribution, improving downstream performance.
  • FixNorm resolves embedding space anisotropy and non-isomorphism by enforcing unit-norms and mean-centering, thereby reducing arbitrary scaling and translation discrepancies.
  • Empirical results demonstrate that FixNorm boosts translation accuracy by up to 7 percentage points and improves intrinsic similarity measures significantly.

FixNorm is an iterative normalization procedure designed to address geometric incompatibilities in word embedding spaces, particularly for tasks requiring the alignment of monolingual embeddings across languages. It operates by enforcing two constraints on embedding matrices: all word vectors are forced to unit length, and the global mean of all vectors is set to zero. Iteratively applying these constraints until convergence results in an embedding distribution that is maximally isotropic and centered, which significantly improves the quality of downstream tasks such as cross-lingual mapping and intrinsic word similarity evaluations. The mathematical foundation, convergence properties, and empirical gains of FixNorm have been established in the context of both cross-lingual and monolingual embedding optimization (Zhang et al., 2019, Carrington et al., 2019).

1. Mathematical Foundations and Iterative Procedure

Let $W=\{\mathbf{w}_1,\dots,\mathbf{w}_N\}\subset\mathbb{R}^d$ denote a set of $N$ word embeddings, stacked as the rows of an $N\times d$ matrix. FixNorm alternates two core operations:

  • Mean Centering: The empirical centroid $\boldsymbol\mu = \frac{1}{N}\sum_{j=1}^{N}\mathbf{w}_j$ is computed and subtracted from each vector: $\mathbf{w}_i \leftarrow \mathbf{w}_i - \boldsymbol\mu$.
  • Unit-Norm Normalization: Each vector is rescaled to unit Euclidean length: $\mathbf{w}_i \leftarrow \mathbf{w}_i/\|\mathbf{w}_i\|$.

Because normalizing alters the mean and centering alters the norms, these steps are applied in alternation. Convergence is detected when the maximum norm deviation $\max_i \left|\,\|\mathbf{w}_i\|-1\,\right|$ and the norm of the mean $\|\boldsymbol\mu\|_2$ both fall below a threshold $\epsilon$ (e.g., $10^{-4}$). Convergence typically occurs within 5–10 iterations.

The key insight is that, while one application suffices to roughly isotropize the embedding cloud, multiple iterations are necessary for strict satisfaction of both constraints, yielding a fixed point at which every $\mathbf{w}_i$ has unit norm and the collection has zero mean (Zhang et al., 2019).
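
Stated compactly, the fixed point reached at convergence satisfies both constraints simultaneously:

$$\|\mathbf{w}_i\| = 1 \quad \text{for all } i = 1,\dots,N, \qquad \frac{1}{N}\sum_{i=1}^{N}\mathbf{w}_i = \mathbf{0}.$$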

2. Resolution of Embedding Space Anisotropy and Non-Isomorphism

Pre-trained monolingual word embeddings characteristically exhibit anisotropic distributions, with vectors clustering into narrow "blobs" rather than spreading uniformly over the hypersphere. These anisotropies differ across languages, complicating their alignment via orthogonal transformations, which are the foundation of many cross-lingual mapping techniques.

  • Mean centering addresses the first moment shift, eliminating global displacement differences between languages.
  • Unit-length normalization eliminates second-moment variability, neutralizing the effects of variable vector lengths due to differing word frequencies or semantic densities.

Iteration is crucial: recentering and renormalization must be repeated to converge to a configuration in which both constraints are strictly satisfied. At this stage, embeddings from distinct languages can be more effectively aligned via orthogonal mappings, as the primary obstacles (mean and scale discrepancies) have been removed (Zhang et al., 2019).
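
A small numerical illustration of why iteration matters (toy random vectors rather than trained embeddings; the sizes are arbitrary): a single center-then-normalize pass leaves a small residual mean, which repeated passes drive toward zero.

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(loc=1.0, scale=2.0, size=(2000, 300))    # off-center, unnormalized cloud

for t in range(10):
    W = W - W.mean(axis=0)                              # mean centering
    W = W / np.linalg.norm(W, axis=1, keepdims=True)    # unit-norm normalization
    # Normalization reintroduces a small mean offset; it shrinks as t grows.
    print(t, np.linalg.norm(W.mean(axis=0)))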

3. Identifiability of Word Embeddings and Invariance Classes

The identifiability issues in word embedding optimization arise from mismatches between the invariances of the training objective ($f$) and the evaluation metric ($g$) (Carrington et al., 2019):

  • Optimization objective invariance: $f(X,U,V)$ (e.g., SGNS, GloVe) is invariant under transformation of the embeddings $V$ by any invertible matrix $C \in \mathrm{GL}(d)$. Thus, the solution set is $V^{*}C$ for arbitrary $C$.
  • Evaluation function invariance: Typical metrics $g$ (e.g., cosine similarity) are invariant only to $c\cdot O(d)$, i.e., global orthogonal transformations combined with a global positive scaling.

The implication is that, without normalization, the space of solutions for $V$ contains arbitrary anisotropic scalings that change intrinsic evaluations even though they leave the training objective unchanged. Once the FixNorm (unit-norm) constraint is imposed, the only residual symmetries are global orthogonal rotations, which are themselves invariances of $g$. This closes the identifiability gap and eliminates spurious degrees of freedom in model selection and comparison (Carrington et al., 2019).
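
A minimal numerical sketch of this invariance gap, using random vectors in place of trained embeddings: cosine similarity is unchanged by a global rotation plus positive scaling, but an arbitrary invertible (anisotropic) map generally alters it.

import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))          # toy "embeddings": 5 vectors in R^4

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Global orthogonal rotation plus a positive scale: an invariance of cosine similarity.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
c = 3.7
print(np.isclose(cos(V[0], V[1]), cos(c * V[0] @ Q, c * V[1] @ Q)))   # True

# Arbitrary invertible map in GL(d): leaves the training objective's solution
# set unchanged, but cosine similarity is not preserved.
C = np.diag([1.0, 5.0, 0.2, 1.0])
print(np.isclose(cos(V[0], V[1]), cos(V[0] @ C, V[1] @ C)))           # typically False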

4. Algorithmic Realizations

The standard implementation of FixNorm is an outer loop applying mean centering and unit-norm normalization until convergence. The following NumPy sketch captures the process (Zhang et al., 2019):

import numpy as np

def FixNorm(W, T_max=10, epsilon=1e-4):
    """Iteratively mean-center and unit-normalize the rows of W (an N x d array)."""
    for t in range(T_max):
        mu = W.mean(axis=0)                      # empirical centroid
        W = W - mu                               # mean centering
        norms = np.linalg.norm(W, axis=1)        # row norms after centering
        W = W / norms[:, None]                   # unit-norm normalization
        # Stop once both constraints already held (up to epsilon) at the start
        # of this iteration.
        if np.max(np.abs(norms - 1)) < epsilon and np.linalg.norm(mu) < epsilon:
            break
    return W
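
A brief usage sketch (synthetic vectors standing in for real embeddings) that reuses the function above and checks both constraints on the result:

rng = np.random.default_rng(42)
W = rng.normal(loc=0.5, scale=2.0, size=(1000, 300))         # deliberately off-center, unnormalized

W_fixed = FixNorm(W)
print(np.abs(np.linalg.norm(W_fixed, axis=1) - 1.0).max())   # max norm deviation (near 0)
print(np.linalg.norm(W_fixed.mean(axis=0)))                  # norm of the mean (near 0)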

For optimization-based embedding models, post-hoc projection onto the unit sphere (for each vector) or full rowspace orthonormalization (e.g., $V \leftarrow (VV^{T})^{-1/2}V$) can be adopted as an alternative. For both classes, FixNorm can be imposed as a hard constraint, with the unit-norm projection reducing the solution-space invariance from $\mathrm{GL}(d)$ to $O(d)$ (Carrington et al., 2019).
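
A possible realization of the rowspace orthonormalization alternative is sketched below. It assumes a $d \times N$ orientation for $V$ (embedding dimensions as rows, words as columns) so that $VV^{T}$ is a small $d \times d$ matrix; this orientation, the function name, and the stabilizing eps term are illustrative assumptions, not details fixed by the sources.

import numpy as np

def whiten_rows(V, eps=1e-10):
    """Rowspace orthonormalization V <- (V V^T)^{-1/2} V for a d x N matrix V."""
    G = V @ V.T                                   # d x d Gram matrix
    evals, evecs = np.linalg.eigh(G)              # symmetric eigendecomposition
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return inv_sqrt @ V                           # rows of the result are orthonormal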

5. Empirical Impact and Ablation Results

In cross-lingual word translation tasks (such as those evaluated with MUSE benchmarks and Procrustes alignment), FixNorm yields robust performance gains:

  • Across five language pairs, FixNorm increased top-1 word translation accuracy by 3–7 percentage points over baselines performing only single normalization or none.
  • English–Japanese alignment saw translation accuracy improve from 2% to 44% with FixNorm (Zhang et al., 2019).
  • Ablations indicate: mean centering alone yields a 1–2 point improvement; unit-norm alone, 2–3 points; single combined step, 4 points; full iterative convergence, 6–7 points.
  • Comparable gains appear in intrinsic word similarity (WordSim-353, SimLex-999) for monolingual spaces when moving from unconstrained to FixNorm-constrained embeddings (e.g., GloVe Spearman correlation improves from 0.601 to 0.641), and further gains are possible, though with some risk of overfitting, by optimizing over permissible scalings (Carrington et al., 2019).

Residual artifacts, such as outlier vectors or persistent mean bias, are effectively eliminated, improving reliability and stability in downstream retrieval and classification.

Setting                          Metric             Score
GloVe (unconstrained)            Spearman (WS353)   0.601
GloVe + FixNorm                  Spearman (WS353)   0.641
GloVe + FixNorm + scaling opt.   Spearman (WS353)   0.679

6. Practical Recommendations and Hyperparameter Guidance

Empirical usage suggests the following protocol (Zhang et al., 2019):

  • Ten iterations ($T_{\max} = 10$) and tolerance $\epsilon = 10^{-4}$ suffice in most cases, with full convergence achieved within a few seconds for vocabulary sizes $N \approx 200{,}000$ and $d \approx 300$ on modern hardware.
  • FixNorm should be applied as a preprocessing step before any linear or orthogonal alignment for cross-lingual tasks (a sketch of this workflow follows the list below).
  • In resource-constrained settings, mini-batch processing can be used for norm computation, but global mean centering requires a full pass.
  • The procedure can be terminated early if no vectors exhibit grossly non-unit norms and the mean vector norm is sufficiently small; otherwise, perform several additional iterations.
  • If the embedding cloud reveals remaining anisotropies ("holes" or outliers), further iterations of FixNorm are justified.
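
As a hedged end-to-end sketch of the cross-lingual workflow recommended above: FixNorm is applied to both monolingual spaces, then an orthogonal (Procrustes) map is fit on a bilingual seed dictionary. The FixNorm function from Section 4 is reused; the synthetic data, variable names, and seed-dictionary format are illustrative assumptions rather than details prescribed by the sources.

import numpy as np

def procrustes(X_src, Y_tgt):
    """Orthogonal map W minimizing ||X_src @ W - Y_tgt||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Illustrative setup: src_emb and tgt_emb are N x d monolingual embedding matrices,
# and seed_src_idx / seed_tgt_idx index a bilingual seed dictionary.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5000, 300))
tgt_emb = rng.normal(size=(5000, 300))
seed_src_idx = np.arange(200)
seed_tgt_idx = np.arange(200)

# 1. FixNorm each monolingual space independently (the preprocessing step).
src_emb = FixNorm(src_emb)
tgt_emb = FixNorm(tgt_emb)

# 2. Fit the orthogonal alignment on the seed pairs only.
W_map = procrustes(src_emb[seed_src_idx], tgt_emb[seed_tgt_idx])

# 3. Map the full source space into the target space for retrieval/translation.
src_in_tgt = src_emb @ W_map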

A plausible implication is that enforcing FixNorm as a standard preprocessing step harmonizes embedding geometry with the invariance structure of most evaluation functions, removing an arbitrary source of ambiguity from the model selection process (Carrington et al., 2019).

7. Theoretical Significance and Broader Context

FixNorm addresses the fundamental geometric mismatch between embedding optimization objectives and evaluation functions. By reducing the solution space’s symmetry group from $\mathrm{GL}(d)$ to $O(d)$, it eliminates identifiability issues that otherwise allow arbitrary scaling directions that distort intrinsic evaluations. The process is minimal in computational overhead, devoid of trainable parameters, and universally applicable irrespective of the embedding generation algorithm.

Empirical and theoretical support for FixNorm demonstrates improved comparability, reliability, and interpretability of embedding spaces in both monolingual and cross-lingual contexts, substantiating its utility as a canonical normalization step for word embeddings (Zhang et al., 2019, Carrington et al., 2019).
