Bag-of-Words Superposition (BOWS)
- BOWS is a representational technique that embeds sparse bag-of-words vectors into lower-dimensional overlapping latent spaces, leveraging feature correlations for constructive interference.
- It employs tied-weight autoencoders and reconstruction loss minimization to reveal semantic clusters and cyclic embedding geometry in both text and image domains.
- BOWS enhances multi-vocabulary image retrieval by probabilistically weighting overlapping visual word assignments to reduce false matches and improve precision.
Bag-of-Words Superposition (BOWS) encompasses a class of representational and algorithmic techniques in which multiple high-dimensional, sparse (or discrete) feature vectors, typically bag-of-words encodings, are embedded within a lower-dimensional latent space such that features overlap, or "superpose," in the representation. It is motivated by the observation that neural networks and information retrieval systems can represent far more features than they have embedding dimensions, and it provides a systematic setting for studying and exploiting this phenomenon. Theoretical work has clarified how feature correlations in real-world data change the geometry and function of superposition relative to the classical, uncorrelated regime. In parallel, BOWS is central to scalable multi-vocabulary image retrieval, where correlated vocabulary overlaps require probabilistic handling to mitigate over-counting.
1. Formal Construction and Mathematical Framework
In the canonical setting for BOWS as introduced by Prieto et al., each data sample (typically a text context) is represented as a binary bag-of-words vector $x \in \{0,1\}^V$, with $V$ the vocabulary size. Encoding to a lower-dimensional latent space of size $m$ is performed by a tied-weight autoencoder:
- Encoder: $h = W x$
- Decoder: $\hat{x} = \sigma(W^\top h + b)$
Here, $W \in \mathbb{R}^{m \times V}$ contains the embedding vectors $W_i$ as columns, $b \in \mathbb{R}^V$ is an optional offset, and $\sigma$ is the output nonlinearity (identity for linear AEs, ReLU for nonlinear AEs). Training seeks to minimize the mean-squared reconstruction loss:
- Linear AE: $\mathcal{L} = \mathbb{E}_x \, \| x - (W^\top W x + b) \|_2^2$
- ReLU AE: $\mathcal{L} = \mathbb{E}_x \, \| x - \mathrm{ReLU}(W^\top W x + b) \|_2^2$
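The encoder, decoder, and loss above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the weight matrix is stored as `m` rows of length `V` (`V` words, `m` latent dimensions), and the toy sizes, random initialization, and three-sample batch are assumptions chosen only to make the example runnable.

```python
import random

def encode(W, x):
    """Latent code h = W x, with W given as m rows of length V."""
    return [sum(w * xv for w, xv in zip(row, x)) for row in W]

def decode(W, h, b, relu=False):
    """Reconstruction sigma(W^T h + b); tied weights reuse the encoder matrix W."""
    V = len(W[0])
    pre = [sum(W[r][v] * h[r] for r in range(len(W))) + b[v] for v in range(V)]
    return [max(0.0, p) for p in pre] if relu else pre

def mse_loss(W, b, batch, relu=False):
    """Batch mean of the squared reconstruction error ||x - x_hat||^2."""
    total = 0.0
    for x in batch:
        x_hat = decode(W, encode(W, x), b, relu)
        total += sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return total / len(batch)

# Hypothetical toy setup: V = 4 words squeezed into m = 2 latent dimensions.
random.seed(0)
V, m = 4, 2
W = [[random.gauss(0.0, 0.5) for _ in range(V)] for _ in range(m)]
b = [0.0] * V
batch = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]

linear_loss = mse_loss(W, b, batch, relu=False)  # identity output
relu_loss = mse_loss(W, b, batch, relu=True)     # ReLU output
```

With an identity weight matrix and no bottleneck (`m` equal to `V`), `mse_loss` returns zero, which is the non-superposed limit of the same construction.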
This setup generalizes to other domains—e.g., image retrieval—where the superposition occurs at the level of multi-vocabulary feature assignments, with each feature potentially participating in multiple "votes" or matches.
2. Correlated Superposition: Signal, Interference, and Constructive Effects
The classical "uncorrelated superposition" paradigm assumes feature covariance is nearly diagonal, leading to the conclusion that interference among superposed features is predominantly harmful noise, to be minimized by orthogonalization and filtered via inhibitory nonlinearities (e.g., ReLU with negative bias).
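BOWS explicitly models the impact of realistic correlations via the interference decomposition for reconstructing word $i$, written here for the pre-activation output of a tied-weight autoencoder with embedding vectors $W_i$ (the columns of $W$) and offset $b$:

$$\hat{x}_i = \underbrace{\|W_i\|^2 \, x_i}_{\text{signal}} + \underbrace{\sum_{j \neq i} (W_i^\top W_j) \, x_j}_{\text{interference}} + b_i$$

- In the correlated setting (low-rank feature covariance $\Sigma$), interference can align with the signal. The optimal encoding (for the linear AE, $b = 0$, $\sigma = \mathrm{id}$) satisfies $W^\top W = P_m$, the projector onto the top-$m$ PCs of $\Sigma$. Thus:

$$\hat{x}_i = (P_m)_{ii} \, x_i + \sum_{j \neq i} (P_m)_{ij} \, x_j$$

Here, interference is proportional to $(P_m)_{ij}$, which is set by the covariance structure (constructive, not random), so sets of mutually correlated features can reinforce rather than corrupt each other's reconstructive signal (Prieto et al., 10 Mar 2026).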
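A small worked case makes the constructive effect concrete. The sketch below splits the linear reconstruction of one word into its own contribution (signal) and the contribution of all other active words (interference). The two-word, one-dimensional setup is a hypothetical example, not taken from the source: both words always co-occur, so their embeddings share one direction and the Gram matrix of the embeddings is the projector onto (1, 1)/sqrt(2).

```python
import math

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def decompose(W_cols, x, i):
    """Return (signal, interference) for the linear reconstruction of word i.

    W_cols[j] is the embedding vector of word j; the offset is taken as zero.
    """
    signal = dot(W_cols[i], W_cols[i]) * x[i]
    interference = sum(dot(W_cols[i], W_cols[j]) * x[j]
                       for j in range(len(x)) if j != i)
    return signal, interference

# Two perfectly correlated words compressed into a single latent dimension.
a = 1.0 / math.sqrt(2.0)
W_cols = [[a], [a]]

# Both words present: the signal alone recovers only half of x_0,
# and the interference from word 1 supplies the other half.
signal, interference = decompose(W_cols, [1, 1], i=0)
x_hat_0 = signal + interference

# Only word 0 present (the correlation is violated): no help arrives.
signal_alone, interference_alone = decompose(W_cols, [1, 0], i=0)
```

In the co-occurring case the two terms add up to a full reconstruction, while in the violated case word 0 is reconstructed at half strength. This is the trade the low-rank solution makes: it is accurate on typical, correlated inputs at the expense of atypical ones.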
3. Feature Vector Geometry Induced by Superposition
The geometry of embedding vectors in BOWS is shaped by both the statistical structure of the data and the dimension of the bottleneck:
- Low-rank regime: Correlated features form clusters or lie on low-dimensional manifolds (principal subspace arrangement). Empirically, UMAP projections of from AEs trained on internet text exhibit clear semantic clusters (verbs, names, etc.) for .
- Antipodal arrangement: As the latent dimension $m$ increases, pairs of anti-correlated features become antipodal ($W_i \approx -W_j$), and embeddings resemble regular polytopes. Further increase to $m \geq V$, with $V$ the vocabulary size, yields orthogonal, non-superposed embeddings.
- Cyclic and semantic clusters: The empirical covariance matrix for months of the year is nearly circulant, resulting in top-2 principal components forming a geometric circle. The learned encoder weights for months also trace out an equiangular 12-gon, with related seasonal terms (e.g., "Christmas") aligned near correlated months (Prieto et al., 10 Mar 2026).
| Regime | Embedding Geometry | Emergent Structure |
|---|---|---|
| Correlated/Low-rank | Principal subspace, clusters | Semantic clusters, cycles |
| Sparse/Anticorr. | Antipodal, polytope | Orthogonal axes for rare features |
| High-Dim ($m \geq V$: latent size at least vocabulary size) | Nearly orthogonal | No superposition |
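The link between a circulant covariance and a circular embedding can be checked numerically. The sketch below builds an assumed toy covariance for 12 month features that depends only on circular distance (an exponential decay, with the row mean subtracted so the constant mode carries no variance). It then verifies that the frequency-1 cosine and sine vectors are eigenvectors with the largest eigenvalue, and that the resulting 2-D coordinates lie on a circle. The kernel shape and decay rate are illustrative choices, not values from the source.

```python
import math

N = 12  # one feature per month

def circ_dist(i, j):
    """Distance between months i and j measured around the year."""
    d = abs(i - j)
    return min(d, N - d)

# Assumed toy kernel: exponential decay in circular distance, mean-centered.
raw = [math.exp(-circ_dist(0, d) / 2.0) for d in range(N)]
mean_raw = sum(raw) / N
cov = [[math.exp(-circ_dist(i, j) / 2.0) - mean_raw for j in range(N)]
       for i in range(N)]

def fourier_eigenvalue(f):
    """Eigenvalue of the symmetric circulant matrix at frequency f (from row 0)."""
    return sum(cov[0][d] * math.cos(2 * math.pi * f * d / N) for d in range(N))

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

cos1 = [math.cos(2 * math.pi * k / N) for k in range(N)]
sin1 = [math.sin(2 * math.pi * k / N) for k in range(N)]
lam1 = fourier_eigenvalue(1)

# Largest deviation from the eigenvector equation cov @ v = lam1 * v.
residual_cos = max(abs(y - lam1 * c) for y, c in zip(matvec(cov, cos1), cos1))
residual_sin = max(abs(y - lam1 * s) for y, s in zip(matvec(cov, sin1), sin1))

# Coordinates of month k in the top-2 eigenspace: a regular 12-gon.
points = list(zip(cos1, sin1))
radii = [math.hypot(px, py) for px, py in points]
```

Because any symmetric circulant matrix is diagonalized by Fourier modes, the circle is a consequence of the covariance structure alone; the particular kernel only determines which frequency pair comes out on top.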
4. Role of Regularization, Bottleneck, and Empirical Metrics
weight decay strongly biases solutions toward low-rank (principal subspace) encodings when , minimizing for a given reconstruction error. This effect drives both linear and nonlinear AEs towards solutions exploiting constructive interference, leading to observed clustering and cyclical arrangements in weight space and resemblances to phenomena in LLMs (Prieto et al., 10 Mar 2026):
- Empirical protocol: WikiText-103 corpus with words, context window , varied, 20-epoch autoencoder training with and without weight decay.
- Metrics: Per-word (reconstruction), one-hot (no context), Fraction of Explained Variance (FEV) for linear probes.
- Observations: 81% of cases show improved performance from interference; "Beatles" and member surnames, despite near-zero one-hot , achieve in context. Linear decoding of month features yields .
Rare, weakly correlated features revert to classic interference filtering and near-orthogonal packing, while strongly correlated features remain in the low-rank principal subspace. The transition between these regimes is governed by the interplay between , data covariance, and weight-norm regularization.
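The source names Fraction of Explained Variance (FEV) as a probe metric without giving a formula. The sketch below uses one common definition, one minus the ratio of residual to total sum of squares; whether this matches the exact definition used by Prieto et al. is an assumption. The sample targets and predictions are made up for illustration.

```python
def fraction_explained_variance(targets, predictions):
    """FEV = 1 - SS_res / SS_tot for two equal-length sequences of numbers."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    mean_t = sum(targets) / len(targets)
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    if ss_tot == 0.0:
        raise ValueError("targets have zero variance; FEV is undefined")
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return 1.0 - ss_res / ss_tot

# Hypothetical probe output for one word's presence across four contexts.
targets = [1.0, 0.0, 1.0, 0.0]
predictions = [0.9, 0.1, 0.8, 0.2]
fev = fraction_explained_variance(targets, predictions)

# Predicting the mean everywhere explains none of the variance.
fev_baseline = fraction_explained_variance(targets, [0.5] * 4)
```

Under this definition a value near one means the probe recovers the feature almost exactly, and a value near zero means it does no better than predicting the mean.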
5. BOWS in Multi-Vocabulary Image Retrieval
BOWS also describes the phenomenon of superposing evidence from multiple codebook vocabularies in large-scale bag-of-words image retrieval. Each SIFT descriptor is assigned to visual words (in vocabularies), with all inverted-list appearances superposed. Correlation among the vocabularies (i.e., overlapping visual word assignments for a feature) leads to over-counting, increasing false matches. The Bayes merging method mitigates this via probabilistic weighting:
- For a database feature found in the intersection of lists for query feature , compute
and use as the feature’s vote.
- The final form for down-weighting exploits cardinality ratios and prior odds:
where , , , and is the database size (Zheng et al., 2014).
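The idea of replacing a full vote with a posterior probability can be illustrated with plain Bayes' rule. The sketch below is not the exact expression of Zheng et al.: it does not reproduce their cardinality-ratio estimates, and the function name, parameters, and numeric inputs are all hypothetical. It only shows how a prior and two likelihoods turn a hit in the intersection of several inverted lists into a vote between zero and one.

```python
def posterior_true_match(prior_true, p_hit_given_true, p_hit_given_false):
    """P(true match | feature found in the intersection), by Bayes' rule."""
    numerator = p_hit_given_true * prior_true
    denominator = numerator + p_hit_given_false * (1.0 - prior_true)
    if denominator == 0.0:
        return 0.0
    return numerator / denominator

def accumulate_votes(hits, prior_true, p_hit_given_true, p_hit_given_false):
    """Sum posterior-weighted votes per image.

    hits is a list of image ids, one per database feature found in the
    intersection of the inverted lists for some query feature.
    """
    weight = posterior_true_match(prior_true, p_hit_given_true, p_hit_given_false)
    scores = {}
    for image_id in hits:
        scores[image_id] = scores.get(image_id, 0.0) + weight
    return scores

# Hypothetical numbers: true matches are rare, and false matches land in the
# intersection far less often than true ones.
w = posterior_true_match(prior_true=0.01,
                         p_hit_given_true=0.8,
                         p_hit_given_false=0.05)
scores = accumulate_votes(["img_a", "img_a", "img_b"], 0.01, 0.8, 0.05)
```

Naive superposition would count every hit as a full vote in each vocabulary; here each hit contributes only its estimated probability of being a true match, which is what limits the over-counting caused by correlated vocabularies.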
Empirical results show that Bayes merging substantially improves mean average precision (mAP) and N–S scores over naïve superposition, at a moderate increase in runtime and memory usage.
6. Revising the Superposition Paradigm and Interpretability Implications
BOWS necessitates a revision of the standard superposition model in machine learning:
- Interference is not universally deleterious. In the presence of correlated data, neural networks can leverage constructive interference, arranging embedding vectors to ensure co-active features amplify the correct decoding, as observed with semantic clusters and cycles in real LLMs.
- Nonlinearities such as ReLU, alongside negative bias, remain essential for filtering harmful interference but do not alone account for the full range of geometric structures observed.
- The distinction between "presence-coding" (binary detection for token presence) and "value-coding" (encoding continuous latent factors) is critical; cyclic, semantic, and low-rank phenomena in real models may arise directly from input statistics and regularization.
BOWS thus serves both as a normative benchmark for mechanistic interpretability research and as an analytic framework for understanding emergent feature geometry in LLMs and computer vision systems (Prieto et al., 10 Mar 2026, Zheng et al., 2014).