Static Word Embedding Averaging

Updated 7 April 2026

Static word embedding averaging is a technique that combines multiple vector representations using arithmetic averaging to form a single static embedding.
It typically involves aligning the source embeddings (e.g., via orthogonal Procrustes or generalized Procrustes analysis, GPA) so that semantic structure is preserved, which tends to improve performance on tasks such as word similarity and analogy.
Practical applications include aggregating different training runs and source algorithms, denoising vectors, and distilling contextual embeddings into stable, efficient representations.

Static word embedding averaging denotes a family of techniques in which multiple vector representations of a word are combined, typically via arithmetic averaging, to form a single, fixed-dimensional static embedding. The method applies across different contexts, including aggregating source embeddings from distinct algorithms, pooling over repeated model initializations, and constructing static vectors from contextual models. Despite its conceptual simplicity, careful alignment and weighting can substantially impact the final embedding’s quality, stability, and downstream utility.

1. Mathematical Formulations and Variants

The fundamental operation in static word embedding averaging is, for a vocabulary $\mathcal{V}$ and $K$ source embedding sets $\{E^{(k)}\}$ of dimension $d$, to produce for each word $w \in \mathcal{V}$: $z_w = \frac{1}{K} \sum_{k=1}^K x_w^{(k)}$, where $x_w^{(k)} \in \mathbb{R}^d$ is the source embedding of $w$ in the $k$-th set (Coates et al., 2018). This is often applied to pre-trained, static (non-contextual) embeddings. If source vectors have differing dimensionalities, zero-padding is applied to achieve a uniform $d$ (Coates et al., 2018).
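The averaging rule with zero-padding can be illustrated with a minimal NumPy sketch. The function name `average_embeddings`, the dictionary-based storage, and the toy vectors are assumptions made for illustration; they do not come from the cited work.

```python
import numpy as np

def average_embeddings(sources, vocab):
    """Average each word's source vectors, zero-padding to the largest dimension."""
    d = max(len(next(iter(src.values()))) for src in sources)
    averaged = {}
    for word in vocab:
        total = np.zeros(d)
        for src in sources:
            vec = np.asarray(src[word], dtype=float)
            total[: vec.shape[0]] += vec  # shorter vectors are implicitly zero-padded
        averaged[word] = total / len(sources)
    return averaged

# Toy sources with different dimensionalities (values are made up).
source_a = {"cat": [1.0, 0.0, 2.0], "dog": [0.0, 1.0, 1.0]}
source_b = {"cat": [3.0, 2.0], "dog": [2.0, 1.0]}

z = average_embeddings([source_a, source_b], vocab=["cat", "dog"])
print(z["cat"])
```

The sketch assumes every word in `vocab` is present in every source; in practice the vocabulary intersection would be taken first.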

When sources differ by orthogonal transformation (as is common with repeated stochastic model runs or independent embeddings), naive averaging in the raw spaces can destroy geometric structure unless coordinate systems are aligned beforehand (Jawanpuria et al., 2020, Dev et al., 2018, Caciularu et al., 2021).

For contextual embeddings, static aggregation is performed by averaging token-level hidden states extracted from multiple contexts: $z_w = \frac{1}{N_w} \sum_{i=1}^{N_w} h_{w,i}^{(\ell)}$, with $h_{w,i}^{(\ell)}$ being the vector for occurrence $i$ of $w$ at layer $\ell$ and $N_w$ the number of sampled occurrences (Li et al., 2020, Sarıtaş et al., 2024, Wada et al., 2025).
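Pooling over occurrences can be sketched as follows, assuming the hidden states for one layer have already been extracted from a contextual model and are paired with their tokens. The function `pool_occurrences` and the toy values are hypothetical; extraction from an actual model is not shown.

```python
import numpy as np

def pool_occurrences(tokens, hidden_states):
    """Mean-pool the hidden states of every occurrence of each token."""
    sums, counts = {}, {}
    for tok, h in zip(tokens, hidden_states):
        h = np.asarray(h, dtype=float)
        if tok not in sums:
            sums[tok] = np.zeros_like(h)
            counts[tok] = 0
        sums[tok] += h
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

# Toy data: "bank" occurs twice with different contextual vectors.
tokens = ["bank", "river", "bank"]
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]

static_vectors = pool_occurrences(tokens, hidden_states)
print(static_vectors["bank"])
```

Words split into several subword pieces would need an additional pooling step before this one; that step is omitted here.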

2. Space Alignment: Geometric and Procrustes Approaches

Averaging is predicated on embeddings being expressed in a compatible space. Standard algorithms such as CBOW, GloVe, and FastText produce embeddings up to an arbitrary orthogonal rotation or reflection; thus, embeddings from different sources or initialization seeds generally inhabit misaligned coordinate frames.

Alignment strategies include:

Orthogonal Procrustes Analysis: Given two embedding sets $X, Y \in \mathbb{R}^{n \times d}$ over a shared vocabulary of $n$ words, find $R \in O(d)$ to minimize $\|XR - Y\|_F$, yielding $R = UV^\top$ with $U \Sigma V^\top = \mathrm{SVD}(X^\top Y)$ (Dev et al., 2018, Rettenmeier, 2020).
Generalized Procrustes Analysis (GPA): For $K$ sets, seek orthogonal $R_k$ and consensus $Y$ minimizing $\sum_{k=1}^{K} \|E^{(k)} R_k - Y\|_F^2$, solved by alternating minimization: update $Y$ by averaging aligned $E^{(k)} R_k$, update $R_k$ via SVD (Caciularu et al., 2021).
Mahalanobis-based Alignment: Learn per-source orthogonal rotations $U_k$ and a shared symmetric positive-definite metric $B$ by binary classification loss, maximizing similarity of same-word pairs across sources and dissimilarity otherwise (Jawanpuria et al., 2020).
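The two-source orthogonal Procrustes solution has a closed form via the singular value decomposition. The sketch below recovers a known rotation on synthetic data; the function name `procrustes_rotation` and the random matrices are illustrative assumptions.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Return the orthogonal R that minimizes ||X R - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 8))                  # target embedding set (rows = words)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
X = Y @ Q.T                                   # source set: Y in a rotated frame

R = procrustes_rotation(X, Y)
print(np.linalg.norm(X @ R - Y))              # residual after alignment
```

Rows of `X` and `Y` must refer to the same words in the same order. SciPy offers `scipy.linalg.orthogonal_procrustes` for the same problem.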

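Averaging proceeds on the transformed vectors: $z_w = \frac{1}{K} \sum_{k=1}^{K} T_k x_w^{(k)}$, where $T_k$ denotes the alignment transformation learned for the $k$-th source (an orthogonal rotation, optionally composed with the shared metric). This step ensures geometry preservation and amplifies discriminative dimensions (Jawanpuria et al., 2020).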
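For more than two sources, the alternating scheme of generalized Procrustes analysis followed by averaging can be sketched as below. This is a minimal illustration under stated assumptions: the consensus is initialized with the first source, the iteration count is fixed, and the data are synthetic rotated copies of one matrix with small noise. It does not implement the Mahalanobis metric.

```python
import numpy as np

def gpa_average(sources, n_iter=10):
    """Alternate between per-source rotations and the consensus (mean of aligned sources)."""
    consensus = sources[0].copy()  # assumption: start from the first source's frame
    rotations = [np.eye(consensus.shape[1]) for _ in sources]
    for _ in range(n_iter):
        for k, E in enumerate(sources):
            # Orthogonal Procrustes step: rotate source k onto the current consensus.
            U, _, Vt = np.linalg.svd(E.T @ consensus)
            rotations[k] = U @ Vt
        # Consensus step: average the aligned sources.
        consensus = np.mean([E @ R for E, R in zip(sources, rotations)], axis=0)
    return consensus, rotations

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 10))
sources = []
for _ in range(4):
    Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # random orthogonal frame
    sources.append(base @ Q + 0.01 * rng.normal(size=base.shape))

consensus, rotations = gpa_average(sources)
raw_err = np.mean([np.linalg.norm(E - consensus) for E in sources])
aligned_err = np.mean([np.linalg.norm(E @ R - consensus)
                       for E, R in zip(sources, rotations)])
print(raw_err, aligned_err)
```

The returned `consensus` is the aligned average itself, so no separate averaging pass is needed after convergence.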

3. Statistical, Geometric, and Theoretical Justification

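Naive averaging faces skepticism due to the lack of guaranteed coordinate alignment across independently trained models. However, theoretical analysis shows that in high-dimensional spaces, the angle between independently chosen random vectors converges to $90^\circ$ (“concentration of measure”), so cross-terms vanish in expectation. Thus, for two embedding sources and a word pair $(u, v)$ with per-source difference vectors $\delta^{(k)} = x_u^{(k)} - x_v^{(k)}$,

$\|z_u - z_v\|^2 = \frac{1}{4}\left(\|\delta^{(1)}\|^2 + \|\delta^{(2)}\|^2 + 2\,\|\delta^{(1)}\|\,\|\delta^{(2)}\| \cos\theta\right) \approx \frac{1}{4}\left(\|\delta^{(1)}\|^2 + \|\delta^{(2)}\|^2\right)$

because $\cos\theta \approx 0$, where $\theta$ is the angle between $\delta^{(1)}$ and $\delta^{(2)}$ (Coates et al., 2018).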

Empirically, this justifies why simple averaging—despite coordinate incomparability—preserves most distance relationships (Coates et al., 2018). Proper alignment before averaging, however, further guarantees that synonym and analogy structures are preserved or enhanced (Dev et al., 2018, Jawanpuria et al., 2020).
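The near-orthogonality argument can be checked numerically. The sketch below assumes that independent Gaussian vectors stand in for the per-source difference vectors of a word pair. Real embeddings are not independent Gaussians, so this only illustrates the geometric effect; it does not reproduce any published experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d = 1000, 300

# Difference vectors of the same word pair in two independent sources.
delta1 = rng.normal(size=(n_pairs, d))
delta2 = rng.normal(size=(n_pairs, d))

norm1 = np.linalg.norm(delta1, axis=1)
norm2 = np.linalg.norm(delta2, axis=1)
cos_theta = np.sum(delta1 * delta2, axis=1) / (norm1 * norm2)

# Exact distance between averaged vectors vs. the approximation that drops the cross-term.
exact = 0.5 * np.linalg.norm(delta1 + delta2, axis=1)
approx = 0.5 * np.sqrt(norm1**2 + norm2**2)
rel_error = np.abs(exact - approx) / exact

print(np.abs(cos_theta).mean(), rel_error.mean())
```

With $d = 300$ the mean absolute cosine is small, and the relative error of the approximation is smaller still; both shrink further as $d$ grows.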

4. Application Regimes

Aggregating Multiple Embedding Algorithms

Averaging embeddings from distinct algorithms (GloVe, word2vec, fastText), after alignment, yields meta-embeddings that harness complementary statistical properties of each source. Geometry-aware averaging via orthogonal rotation and Mahalanobis scaling consistently outperforms raw averaging and each constituent embedding on word similarity (gains of +1–3 Spearman points) and analogy accuracy (+1–4) (Jawanpuria et al., 2020). Simple coordinate-wise averaging can be competitive, especially when sources are approximately isotropic and dimensions are well-normalized (Coates et al., 2018).

Denoising Across Multiple Training Runs

Averaging over multiple runs (ensembling) of the same embedding algorithm stabilizes local neighborhoods, improves consistency of rare word representations, and enhances analogy task performance by up to +4 points (Rettenmeier, 2020, Caciularu et al., 2021). Alignment via Procrustes or GPA is critical as raw runs are misaligned (Caciularu et al., 2021). Averaging across $K$ aligned models reduces the variance of cosine similarities to approximately $1/K$ of its single-run value (Rettenmeier, 2020).
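The variance-reduction effect can be illustrated with a small simulation. It assumes each training run yields an already-aligned copy of a fixed "true" vector plus independent Gaussian noise. That noise model, the function name `cosine_variance`, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def cosine_variance(u, v, n_runs, sigma, trials, rng):
    """Variance over trials of cos(u_hat, v_hat), where each estimate averages n_runs noisy copies."""
    noise_u = rng.normal(scale=sigma, size=(trials, n_runs, u.size)).mean(axis=1)
    noise_v = rng.normal(scale=sigma, size=(trials, n_runs, v.size)).mean(axis=1)
    u_hat, v_hat = u + noise_u, v + noise_v
    cos = np.sum(u_hat * v_hat, axis=1) / (
        np.linalg.norm(u_hat, axis=1) * np.linalg.norm(v_hat, axis=1)
    )
    return cos.var()

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)

variances = {k: cosine_variance(u, v, k, sigma=0.3, trials=500, rng=rng)
             for k in (1, 4, 16)}
print(variances)
```

Under this noise model the variance of the cosine similarity falls steadily as more runs are averaged.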

Creating Static Embeddings from Contextual Models

Contextual models (BERT, Sentence Transformers) admit the derivation of a single static vector for each word by averaging token-level hidden states across sampled occurrences. Simple mean-pooled contextualized vectors (“Aggregate”) can outperform BERT’s static input embeddings and conventional static models on property induction, provided masking is employed to decouple the target word from idiosyncratic co-occurrence (Li et al., 2020). Filtering idiosyncratic occurrences via nearest-neighbor search yields additional gains.

Naive averaging, however, yields poor analogy and similarity scores in morphologically rich settings or with insufficient occurrence diversity. Distillation-based methods (e.g., X2Static) that explicitly optimize a static lookup table against contextual statistics markedly outperform mean pooling (Sarıtaş et al., 2024). Principal component analysis and norm adjustment further refine the static space for semantic and cross-lingual sentence representation, especially when combined with knowledge distillation from a sentence transformer (Wada et al., 2025).

5. Empirical Comparisons and Benchmarks

Table: Representative empirical results for static word embedding averaging across major regimes.

Regime	Evaluation	Naive AVG	Aligned AVG / Enhanced	Best Constituent
Multi-algorithm (Jawanpuria et al., 2020, Dev et al., 2018)	WS353 (Spearman)	69.3	70.6–71.1	67.6
Multi-run (Caciularu et al., 2021)	RW rare-word	18.1	20.2	19.2
Sentence embedding (Wada et al., 2025)	STS15 (Spearman)	—	83.1	81.5
Twitter sentiment (Samad et al., 2020)	Tweet AUC (%)	86.66	88.37 (sum/weighted)	—

On Turkish, naive averaging of BERT/ELMo contextual vectors yielded MRR ≪ 0.3 on analogies, inferior to Word2Vec/FastText, with distilled X2Static BERT exceeding 0.77 (Sarıtaş et al., 2024).

6. Limitations, Pitfalls, and Best Practices

Alignment Is Critical: Naive averaging assumes aligned coordinate axes, and misalignment often degrades structure; proper alignment is tractable for $K \geq 2$ sources (Dev et al., 2018, Caciularu et al., 2021).
Weighted or Geometry-Aware Averaging: Dimensional weighting (e.g., via Mahalanobis or variance) or pre-removal of high-variance principal components can prevent collapse of discriminative information (Jawanpuria et al., 2020, Samad et al., 2020, Wada et al., 2025).
Domain and Corpus Caution: Pooling sources trained on distinct corpora or languages requires vocabulary intersection and scale normalization (Dev et al., 2018).
Contextual Aggregation Is Not Sufficient: For contextual models, naive mean-pooling across occurrences underperforms compared to distillation-based static embeddings or masking-averaging plus filtering (Li et al., 2020, Sarıtaş et al., 2024).
Diminishing Returns: Variance reduction in multi-run averaging follows approximately $1/K$ for $K$ runs, but downstream performance plateaus beyond a certain number of runs; validation on the target task is needed (Rettenmeier, 2020, Caciularu et al., 2021).
Bias in Cosine Similarity: Averaging can shift cosine similarities upward; explicit normalization after averaging is recommended (Rettenmeier, 2020).
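The scale-normalization and post-averaging normalization recommendations can be combined in a short sketch. It assumes the sources are already aligned and share the same vocabulary order and dimensionality; `normalized_average` is a hypothetical helper, not a function from any cited work.

```python
import numpy as np

def normalized_average(sources, eps=1e-12):
    """Unit-normalize each source's rows, average across sources, then renormalize."""
    unit = [E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), eps)
            for E in sources]
    Z = np.mean(unit, axis=0)
    return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), eps)

# Two toy sources on very different scales (values are made up).
E1 = np.array([[3.0, 4.0], [1.0, 0.0]])
E2 = np.array([[0.0, 10.0], [0.0, 50.0]])

Z = normalized_average([E1, E2])
print(np.linalg.norm(Z, axis=1))
```

Normalizing before averaging keeps a large-norm source from dominating the mean; normalizing afterwards puts the averaged vectors back on the unit sphere.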

7. Extensions and Modern Developments

Recent work leverages context-averaged and refined static embeddings for sentence-level semantic representation via principal component removal and knowledge distillation, matching or surpassing transformer sentence encoders on semantic similarity while being computationally efficient at inference time (Wada et al., 2025). Such pipelines entail:

Averaging token vectors across many contexts for each word.
Sentence-level principal component analysis and “all-but-the-top” component removal.
Refinement via knowledge distillation (monolingual) or contrastive learning (cross-lingual).
Downstream sentence representation by simple word averaging and optional L2-normalization.
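The post-processing and sentence-averaging steps above can be sketched as follows. This is a generic illustration of "all-but-the-top" component removal applied to a word-embedding matrix, followed by mean pooling with optional L2-normalization; the exact procedure in the cited pipeline may differ, and the distillation step is omitted. The function names, vocabulary and random vectors are assumptions.

```python
import numpy as np

def all_but_the_top(E, n_components=1):
    """Center the embeddings and remove their top principal components."""
    centered = E - E.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    top = Vt[:n_components]                   # dominant directions, shape (n_components, d)
    return centered - centered @ top.T @ top  # subtract projections onto those directions

def sentence_vector(words, table, normalize=True):
    """Average the static vectors of known words, optionally L2-normalizing the result."""
    vecs = [table[w] for w in words if w in table]
    if not vecs:
        raise ValueError("no known words in sentence")
    s = np.mean(vecs, axis=0)
    norm = np.linalg.norm(s)
    return s / norm if normalize and norm > 0 else s

rng = np.random.default_rng(0)
vocab = ["static", "word", "embedding", "averaging", "is", "simple"]
E = rng.normal(size=(len(vocab), 4)) + 2.0    # shared offset mimics a dominant common direction

refined = all_but_the_top(E, n_components=1)
table = dict(zip(vocab, refined))
s = sentence_vector(["static", "embedding", "averaging", "unknown"], table)
print(s)
```

Out-of-vocabulary words are skipped in this sketch; other fallback strategies are possible.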

This trend suggests that static word embedding averaging, when geometrically regularized and integrated with advanced post-processing, remains essential for constructing efficient, high-quality representations in resource-constrained or latency-sensitive settings.