From Data Statistics to Feature Geometry: How Correlations Shape Superposition
Abstract: A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real LLMs yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations-feature-geometry.
Explain it Like I'm 14
What is this paper about?
This paper looks at how neural networks "pack" lots of ideas into a small mental space. In many models, there are more concepts to remember than there are slots to store them. So the model overlaps concepts in the same space, a bit like layering multiple transparent images on top of each other. This overlap is called superposition.
Most past work said this overlap creates "interference" (like static on a radio) that the model must block using a function called ReLU (which keeps only positive signals). But the authors show that, in real data where related ideas often appear together, interference isn't always bad. Sometimes it helps, like two voices harmonizing. They introduce a simple, controlled setup, called BOWS (Bag-of-Words Superposition), to study this. It explains why models form meaningful patterns like clusters (similar words close together) and circles (like the months arranged in a loop).
What questions did the researchers ask?
They asked:
- How do models arrange many related features (like words) when they must share limited space?
- Can interference between features be helpful, not just harmful?
- When will a model choose to "lean into" shared patterns instead of trying to separate everything?
- Do these choices explain real patterns seen in LLMs, like semantic clusters and circular structures (months of the year)?
- How can we tell apart patterns caused by co-occurrence (things happen together) from patterns caused by encoding continuous values (like angles or map coordinates)?
How did they study it?
They built a simple but realistic testbed and ran controlled experiments.
- Bag-of-Words Superposition (BOWS): They turned internet text into "bag-of-words" vectors. Each vector says which words appear in a chunk of text (1 = present, 0 = absent). This keeps real-life co-occurrence patterns (e.g., "December" often appears near "Christmas").
- Autoencoders: They trained small "compress-and-reconstruct" models:
- A linear autoencoder (no nonlinearity), which acts like finding the main trends in the data.
- A ReLU autoencoder, which can block negative interference (think: keep only helpful, positive parts).
- Bottleneck and weight decay: They limited the model's "memory size" (the bottleneck) and sometimes used weight decay (a gentle penalty that encourages simpler, smaller weights). Both push the model to share space efficiently.
- Synthetic tests: They first made fake data with 12 features arranged in a circle (like months) to see how models handle structured correlations.
- Real tests: They applied the approach to real text (WikiText-103) and visualized the learned feature directions using simple 2D projections to see patterns.
- Case studies: They examined:
- Months of the year (do they form a circle?),
- Semantic clusters (do similar words group together?),
- The Beatles (do related words support each other's reconstruction?),
- Months vs Roman numerals (do patterns vanish at different model sizes?),
- "Value-coding" tasks (modular addition circles and city map coordinates), to show when circles/maps appear without co-occurrence.
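To make the testbed concrete, here is a minimal numpy sketch in the spirit of BOWS (the toy corpus, the PCA-based encoder, and the threshold value are my own illustrative choices, not the paper's code or data): build binary bag-of-words vectors, compress them by projecting onto a few principal components with tied weights, then decode with a ReLU plus a negative bias to filter residue.

```python
import numpy as np

# Toy corpus with built-in co-occurrence structure (hypothetical example,
# not the paper's WikiText data).
docs = [
    "december christmas snow", "december snow cold", "christmas gifts december",
    "july sun beach", "july beach swim", "sun swim july",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[1.0 if w in d.split() else 0.0 for w in vocab] for d in docs])

# Encoder: orthogonal projection onto the top-m principal components.
m = 2
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:m]                                  # tied weights: encode z = W @ (x - mu)

z = (X - mu) @ W.T                          # compressed codes: m numbers per doc
linear_recon = z @ W + mu                   # linear decode (overlap can be constructive)
relu_recon = np.maximum(linear_recon - 0.2, 0.0)  # ReLU + negative bias filters residue
```

With m = 2 the winter-themed and summer-themed word groups each tend to share a latent direction, so correlated words reinforce one another in the linear reconstruction, while the threshold suppresses small spurious activations.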
Key ideas explained in plain terms
- Superposition: Storing more features than there are dimensions by letting them share directions.
- Interference: When overlapping features affect each other; this can be bad (static) or good (harmony).
- ReLU: A gate that keeps positive signals and blocks negative ones, preventing false alarms.
- Constructive interference: When related features overlap in ways that reinforce each other (e.g., "December" boosting "Christmas").
- Principal components (PCA): The main directions in which the data varies, like the biggest trends in a crowd.
- Bottleneck: A small number of "lanes" for many features; forces sharing and clever packing.
- Weight decay: Encourages simpler, smaller weights; nudges the model to find shared structure.
- Presence-coding vs value-coding:
- Presence-coding: "Is this word here?" (binary detection)
- Value-coding: "What is the value/position?" (like an angle or coordinates). Value-coding can create circles/maps even when there are no co-occurrence patterns.
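The presence-coding vs value-coding distinction can be made concrete with a small sketch (the construction below is my own illustration, not code from the paper): presence-coding gives each month its own detector dimension, while value-coding places the twelve months on a circle via the sine and cosine of the month's position, so neighbours are close even though no co-occurrence data was used.

```python
import numpy as np

months = np.arange(12)

# Presence-coding: one binary detector dimension per month (12-dim one-hots).
presence = np.eye(12)

# Value-coding: each month's position encoded as an angle (only 2 dims).
theta = 2 * np.pi * months / 12
value = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Under value-coding, adjacent months are closer than opposite ones,
# purely because of the encoded value, not co-occurrence statistics.
d_adjacent = np.linalg.norm(value[0] - value[1])   # Jan vs Feb
d_opposite = np.linalg.norm(value[0] - value[6])   # Jan vs Jul
```

This is why the paper warns that circles in representation space can have two different origins and need different diagnostics.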
What did they find, and why is it important?
- Interference can be helpful when features are correlated
- In toy examples with circularly related features and in real text, both linear and ReLU autoencoders often arrange features so that related ones share directions.
- This lets interference become constructive: the overlap helps the model reconstruct what's present with less effort.
- Two strategies can work together
- The model often combines:
- Constructive arrangement (pack related features together so they help each other), and
- ReLU filtering (block any leftover harmful interference and avoid false positives).
- Example: With "Beatles" words, related names (Lennon, McCartney, etc.) often improve reconstruction when they appear together; when similar context appears without "Beatles," the ReLU blocks false activation.
- Real patterns match what's seen in LLMs
- Semantic clusters: When the model's memory is tight or weights are kept small, word features group by meaning (verbs together, sports terms together, etc.).
- Circular structures: The months of the year form a circle because their co-occurrence follows the seasons. The model mirrors this structure, and "December" can help reconstruct "Christmas" in context.
- Not all circles come from co-occurrence; sometimes it's value-coding
- In modular addition (math with wrap-around) and city coordinates, circles/maps appear because the model encodes continuous values (like sine/cosine or latitude/longitude), not because words co-occur.
- When the authors "ablate" (remove) everything except these value-coding directions, the model still works well, which shows these value features are doing the heavy lifting.
- Different features "de-superpose" at different speeds
- As the model's memory grows, some groups (like months) become nearly independent (orthogonal) sooner than others (like Roman numerals).
- This shows real data is a mix: some features rely more on constructive sharing; others can be split apart with more capacity.
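The "constructive sharing" finding above can be reduced to a two-feature arithmetic sketch (the vectors and numbers are hypothetical, chosen only to illustrate the mechanism): weight decay shrinks each feature's norm below 1, and the positive overlap with a correlated partner makes up the shortfall when both features are active at once.

```python
import numpy as np

# Hypothetical 2-D directions for two correlated features ("December", "Christmas"):
# weight decay has shrunk each squared norm to ~0.7, and the two directions
# overlap positively (dot product ~0.3) because the features co-activate.
w_dec = np.array([0.83666, 0.0])
w_chr = np.array([0.35857, 0.75593])

def decode(active_dirs, bias=0.0):
    """Embed the active features as a sum of their directions, then read each
    feature back out with ReLU(w . x + bias)."""
    x = np.sum(active_dirs, axis=0)
    return [max(float(w @ x) + bias, 0.0) for w in (w_dec, w_chr)]

alone = decode([w_dec])[0]            # ~0.70: undershoots the target of 1
together = decode([w_dec, w_chr])[0]  # ~1.00: the partner's overlap fills the gap
```

A negative bias on top of this would additionally zero out small spurious dot products from unrelated features, which is the ReLU-filtering half of the story.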
So what's the big picture?
- New perspective: Superposition isn't just a problem to be fixed; it can be a strategy. When features are related, the model can arrange them so overlap helps rather than hurts.
- Explains real observations: The paper helps explain why LLMs show semantic clusters and neat circles (like months) in their internal spaces.
- Practical impact:
- Better interpretability tools: Knowing when overlap is helpful can guide how we design and evaluate feature-finding methods (like sparse autoencoders).
- Training choices: Tight bottlenecks and weight decay encourage efficient, constructive sharing, which is useful for compact or robust models.
- Safer edits and robustness: Understanding feature geometry may help with knowledge editing and defending against adversarial tricks.
Limitations and future work: BOWS is intentionally simple and doesn't capture everything about LLMs. The next steps include analyzing more realistic settings and precisely predicting when a model will prefer constructive sharing versus strict separation.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper's account of how correlations shape superposition.
- Formal dominance conditions: Derive precise, testable conditions (in terms of the data covariance spectrum, latent size, weight decay strength, and decoder bias) that predict when constructive interference (linear superposition) will dominate over interference filtering (non-linear superposition) and vice versa, including bounds on reconstruction error as a function of spectral decay.
- Transition thresholds with capacity: Characterize and validate the critical latent dimension(s) at which the model transitions from PCA-like circular structure to antipodal/non-linear regimes; study how these thresholds scale with feature count, spectral gaps, and feature sparsity.
- Beyond tied, single-layer AEs: Extend analysis and experiments to untied-weight decoders, deeper autoencoders, and architectures closer to transformer blocks to test whether linear superposition emerges under realistic model inductive biases.
- Application to real LM activations: Directly test in pretrained LLM hidden states whether the constructive-interference account explains observed clusters/cycles; quantify how much of the geometry can be attributed to low-rank co-activation structure versus other factors.
- Robustness to distribution shift: Quantify how constructive interference degrades when test-time co-activation patterns deviate from training (e.g., altered seasonality, topic reweighting); report false-positive/false-negative rates as a function of bias thresholds and correlation mismatch.
- Systematic regularization study: Map how different regularizers (L2, L1, sparsity penalties, orthogonality constraints), dropout, and initialization affect the balance between constructive and filtering solutions and the resulting feature geometry.
- Nonlinearity choice: Evaluate how GELU, leaky-ReLU, soft-thresholding, or sigmoid affect the coexistence of constructive interference and filtering, including how learned biases adapt across nonlinearities.
- Quantifying constructive vs destructive interference: Develop a general contribution-decomposition metric that partitions each featureโs reconstruction into self vs contextual contributions and positive vs negative interference, and report distributions across the vocabulary.
- Reliable geometry metrics beyond UMAP: Supplement visualizations with quantitative, reproducible measures (e.g., PCA variance explained, pairwise cosine matrices, anisotropy indices, silhouette scores, mutual information with human annotations) and report seed variability.
- Causal tests on corpus statistics: Perform interventions that preserve marginals but scramble co-occurrences (e.g., permuting time-related words) to causally verify that observed circles/clusters arise from covariance structure rather than visualization artifacts.
- Sensitivity to BOWS design choices: Probe how vocabulary size, context window size, binary vs count features, TF-IDF weighting, and inclusion of stop-words/subwords affect the covariance spectrum and the learned geometry.
- Higher-order interactions: Assess whether higher-order (beyond pairwise) co-activations meaningfully shape geometry; construct datasets with controlled higher-order structure and test resulting arrangements.
- Negative correlations and exclusivity: Study geometry for negatively correlated or mutually exclusive features (e.g., antonyms, disjoint categories) to see if specific antagonistic structures emerge and how ReLUs handle them.
- Rare-word regime under Zipf's law: Systematically analyze how frequency and context diversity predict whether a feature becomes linear-superposed, non-linear, or nearly orthogonal; model the frequency-correlation-geometry trade-off.
- Sample complexity and generalization: Establish how many samples are required to learn a projector close to the true principal subspace and to keep constructive interference aligned at test time; provide concentration bounds or empirical scaling laws.
- Bias-setting theory: Develop predictions for learned negative biases from data statistics (e.g., means, variances, off-subspace residuals) to control false positives; test biasโthreshold calibration procedures.
- Downstream impact: Connect geometry to task performance by measuring whether constructive interference improves downstream metrics (e.g., perplexity, classification, probing) and whether forcing PCA-like geometry via regularization helps or hurts.
- Cross-corpora and modalities: Validate findings across multiple corpora (beyond WikiText/OpenWebText) and modalities (vision/audio) where co-occurrence structure differs, and test whether constructive interference consistently creates semantic clusters.
- Relation to classic embeddings: Compare AE-induced geometry with Word2Vec/GloVe/PMI factorization on the same corpora; isolate when similarities arise from shared low-rank co-occurrence structure versus differences due to reconstruction objectives.
- SAEs on BOWS as a benchmark: Operationalize BOWS as a benchmark for sparse autoencoders by defining quantitative geometry recovery metrics (e.g., alignment with known PCs, cluster ordering scores) and evaluating SAEs' ability to recover ground-truth structures.
- Distinguishing presence- vs value-coding in practice: Propose diagnostics and pipelines to separate manifolds arising from value codes (e.g., sine/cosine angle codes) from correlation-driven superposition in real models; apply to calendar/rotation/geography features in LLMs.
- Ablations in BOWS: Mirror the value-coding ablations by ablating principal-subspace vs orthogonal components in BOWS-trained AEs to quantify how much reconstruction depends on low-rank projections vs ReLU filtering.
- Scaling laws in high dimensions: Explore how geometry evolves as feature count and latent dimension scale to realistic LM sizes; assess whether PCA-like structures persist or break under extreme overcompleteness.
- Training dynamics and path dependence: Investigate whether solutions converge to different geometries depending on optimization schedule, learning rates, or early stopping, and whether phase transitions occur during training.
- Formal links to constrained PCA: Provide a rigorous connection between ReLU AEs with weight decay and constrained PCA (or projectors with thresholds), clarifying when the non-linear model effectively implements a soft projector.
- Interaction with adversarial robustness and editing: Empirically test whether constructive-interference geometries alter adversarial vulnerability or the locality of knowledge edits by measuring transfer/interference among correlated features.
- Tokenization and sequential structure: Move beyond bag-of-words to subword tokenization and sequential models that capture order/syntax; examine whether constructive interference remains a primary driver of geometry when order-sensitive correlations are present.
- Untested hyperparameter spaces: Report systematic sweeps over latent dimension, weight decay, and biases (rather than snapshots) with confidence intervals to ensure robustness of the claimed regimes.
- Generalized decoders: Compare per-feature linear decoders to joint multi-output decoders and to sparse readouts to ensure the "linear superposition" designation is not an artifact of decoder choice.
- Practical detection algorithms: Develop tools to automatically identify cycles, clusters, and antipodal pairs in learned features and to classify each featureโs regime (linear-superposed vs non-linear vs orthogonal) at scale.
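Several of these gaps, the last item in particular, presuppose simple geometry classifiers. A minimal sketch of one is below (the thresholds and labels are my own, not from the paper): normalize feature directions and bucket each pair by cosine similarity into "cluster" (candidate constructive pair), "antipodal", or "separate".

```python
import numpy as np

def classify_pairs(W, hi=0.7):
    """Bucket feature-direction pairs by cosine similarity.
    Rows of W are feature directions; hi is a hypothetical threshold."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = Wn @ Wn.T                      # pairwise cosine matrix
    out = {}
    for i in range(len(W)):
        for j in range(i + 1, len(W)):
            c = C[i, j]
            out[(i, j)] = "cluster" if c > hi else ("antipodal" if c < -hi else "separate")
    return out

W = np.array([[1.0, 0.0],    # feature 0
              [0.9, 0.1],    # feature 1: nearly parallel to 0 -> cluster
              [-1.0, 0.0],   # feature 2: opposite of 0 -> antipodal
              [0.0, 1.0]])   # feature 3: orthogonal to 0 -> separate
labels = classify_pairs(W)
```

Detecting cycles would need more than pairwise cosines (e.g., inspecting the top-2 PCA projection of a candidate cluster), which is exactly the tooling gap the item describes.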
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper's findings, code, and training recipes. They focus on exploiting constructive interference in correlated features, distinguishing linear vs non-linear superposition, and recognizing value-coding features.
- Interpretability toolkit upgrade for LLMs and foundation models
- What: Add BOWS-based diagnostics (from the paper's code) to existing pipelines: (i) linear vs non-linear superposition R² tests via linear decoders on selected feature sets; (ii) geometry dashboards (PCA/UMAP, off-diagonal Frobenius norms) to detect semantic clusters and circular structures; (iii) bias/threshold checks for ReLU filtering.
- Sector(s): Software/AI research, safety.
- Tools/products/workflows: "BOWSBench" module integrated into interpretability suites; CI jobs that flag geometry shifts after fine-tuning; feature-geometry reports in model cards.
- Assumptions/dependencies: Linear Representation Hypothesis (LRH) approximately holds for targeted features; access to hidden activations and weights; tied or inspectable weights; negative biases available or emulatable in decoders.
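One plausible shape for the R² diagnostic mentioned above (my own sketch, not the paper's implementation): fit an ordinary least-squares linear decoder from latents back to features and report each feature's coefficient of determination. Features with R² near 1 are candidates for linear superposition; features well below 1 need the non-linearity.

```python
import numpy as np

def per_feature_r2(Z, X):
    """Least-squares linear decoder from latents Z (n x m) to features X (n x d);
    returns R^2 per feature: 1 - residual SS / total SS."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])      # append a bias column
    B, *_ = np.linalg.lstsq(Z1, X, rcond=None)
    ss_res = np.square(X - Z1 @ B).sum(axis=0)
    ss_tot = np.square(X - X.mean(axis=0)).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = (rng.random((200, 5)) < 0.3).astype(float)      # 5 synthetic binary "features"
Z = X @ rng.standard_normal((5, 3))                 # a rank-3 linear latent code
r2 = per_feature_r2(Z, X)                           # one score per feature
```

Because the latent code here is a linear function of the features with rank 3 < 5, the scores fall between 0 and 1 rather than all reaching 1, which is the regime the diagnostic is meant to flag.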
- Better sparse autoencoder (SAE) training recipes for feature discovery
- What: Train SAEs with tighter bottlenecks and modest weight decay to bias toward constructive (low-rank) superposition when data are correlated; monitor weight norms and rank proxies to avoid over-orthogonalization.
- Sector(s): Software, academia.
- Tools/products/workflows: Updated SAE configs; norm/rank dashboards; curriculum to alternate bottleneck size and weight decay; SAE evaluation using BOWS with known ground-truth geometry.
- Assumptions/dependencies: Availability of correlated features in the target layer; compute for hyperparameter sweeps; willingness to accept anisotropic feature clusters when beneficial.
- Safer knowledge editing and fine-tuning via cluster-aware adjustments
- What: When editing a concept (e.g., "December"), propagate edits across its correlated cluster (e.g., other months, season words) to preserve constructive interference and avoid regressions; verify with linear-superposition tests.
- Sector(s): NLP products, AI Ops.
- Tools/products/workflows: Geometry-aware editing scripts; batch re-biasing for ReLU thresholds; regression tests that compare one-hot vs contextual reconstructions.
- Assumptions/dependencies: Identifiable clusters/cycles in the target layer; access to model internals; automated evaluation sets reflecting real co-occurrence patterns.
- Low-rank compression and distillation guided by data covariance
- What: Use PCA/linear AEs to compress representations where feature covariance is approximately low-rank; exploit constructive interference to preserve semantics with fewer dimensions; deploy in on-device or latency-sensitive settings.
- Sector(s): Mobile/edge AI, enterprise software.
- Tools/products/workflows: Low-rank adapters; layer-wise PCA distillation; post-training projection matrices embedded as lightweight projectors.
- Assumptions/dependencies: Spectrum concentration in target layers; acceptable accuracy-latency trade-offs; calibration to avoid false positives on residual variance.
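As a sketch of the compression idea (synthetic data and illustrative dimensions of my own choosing): when activations concentrate in a low-dimensional principal subspace, storing the m-dimensional code z = V_m x instead of the full n-dimensional vector cuts memory roughly by a factor of n/m while keeping the reconstruction error close to the mass of the discarded spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, samples = 64, 8, 500

# Synthetic activations with an approximately low-rank covariance:
# strong signal in m latent directions plus small isotropic noise.
basis = np.linalg.qr(rng.standard_normal((n, m)))[0]       # n x m orthonormal frame
acts = rng.standard_normal((samples, m)) @ basis.T
acts += 0.05 * rng.standard_normal((samples, n))           # residual variance

# Fit the projector on the data and store only m floats per activation.
mu = acts.mean(axis=0)
_, _, Vt = np.linalg.svd(acts - mu, full_matrices=False)
Vm = Vt[:m]                                                # m x n projector rows
codes = (acts - mu) @ Vm.T                                 # compressed (samples x m)
recon = codes @ Vm + mu                                    # decompressed (samples x n)

rel_err = np.linalg.norm(acts - recon) / np.linalg.norm(acts)  # small: spectrum is concentrated
```

The 8x storage reduction here (64 dims down to 8) depends entirely on the spectrum-concentration assumption flagged above; with a flat spectrum the same projector would discard real signal.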
- Retrieval and search indexing with bag-of-words constructive compression
- What: For document-level search or RAG, index documents via BoW+PCA encodings tuned to capture semantic clusters and cycles (e.g., time/season topics), improving recall and memory footprint.
- Sector(s): Information retrieval, enterprise search.
- Tools/products/workflows: Indexers that build co-occurrence covariance and project onto top PCs; hybrid BoW+embedding pipelines.
- Assumptions/dependencies: Document domains with stable co-occurrence structure; precomputed covariance remains valid over time; monitoring drift.
- Adversarial robustness and red-teaming diagnostics
- What: Add tests that distinguish constructive vs harmful interference; evaluate whether weight decay and bottlenecks reduce vulnerability by aligning interference with signal; probe features that flip under adversarial contexts.
- Sector(s): Security, AI safety.
- Tools/products/workflows: Robustness evals reporting interference alignment scores; off-diagonal norm tracking under adversarial prompts.
- Assumptions/dependencies: Attack scenarios that exploit interference; representative adversarial contexts; minimal false sense of security (non-correlated regimes still need ReLU filtering).
- Time-series and seasonal modeling with feature cycles
- What: Encode seasonal/cyclical structure (months/days) using linear superposition to reduce dimensionality while preserving seasonality for forecasting and anomaly detection.
- Sector(s): Energy, retail, finance (forecasting).
- Tools/products/workflows: Preprocessors that project seasonal indicators onto cyclic PCs; lighter models for seasonal components.
- Assumptions/dependencies: Clear periodic components; stable seasonal co-occurrence; integration with existing ML stacks.
- Recommender systems: capacity sharing for correlated items
- What: Encourage constructive interference by applying weight decay and controlled bottlenecks in item/user embeddings, allowing correlated items to share dimensions without harming ranking.
- Sector(s): E-commerce, media.
- Tools/products/workflows: Embedding training with norm constraints; interference-aware negative sampling; cluster-level calibration.
- Assumptions/dependencies: Sufficiently correlated item groups; careful mitigation of popularity bias.
- Robotics and mapping: value-coding probes for spatial variables
- What: Use linear probes to verify value-coding for coordinates or angles (e.g., sin/cos), ensuring models learned the intended continuous variables, and ablate non-value subspaces for explainability checks.
- Sector(s): Robotics, autonomous systems.
- Tools/products/workflows: Probe libraries; ablation tests that preserve value-coding while removing orthogonal subspaces.
- Assumptions/dependencies: Tasks that require continuous variables; robust linear decodability.
- Education and training materials on superposition regimes
- What: Use BOWS notebooks to teach linear vs non-linear superposition, constructive interference, and value-coding; illustrate cycles (months) and clusters in small AEs.
- Sector(s): Education, workforce upskilling.
- Tools/products/workflows: Course modules, assignments using the paper's repo; visual dashboards.
- Assumptions/dependencies: Classroom compute and familiarity with AEs.
- Governance and audits: representationโgeometry reports
- What: Include feature-geometry diagnostics (clusters/cycles, linear-vs-non-linear R²) in model audits and documentation; track shifts post-fine-tuning.
- Sector(s): Policy, compliance, enterprise ML governance.
- Tools/products/workflows: Audit checklists and standardized plots; regression thresholds for geometry metrics.
- Assumptions/dependencies: Access to representations; organizational willingness to include interpretability criteria in go/no-go gates.
Long-Term Applications
These applications require further research, scaling, or ecosystem maturation (e.g., broader agreement on interpretability standards, generalization from AEs/BOWS to large-scale LMs and multimodal models).
- Superposition-aware architectures and regularizers
- What: Build training objectives that explicitly encourage constructive interference for correlated features while penalizing harmful interference; dynamic biasing for ReLU thresholds based on estimated residuals.
- Sector(s): Software/AI platforms.
- Tools/products/workflows: New loss terms (e.g., covariance-aligned projectors), adaptive weight decay, learnable thresholding layers.
- Assumptions/dependencies: Reliable covariance estimation during training; stability in deep architectures beyond tied AEs.
- Automated geometry-aware knowledge editors
- What: Tools that identify correlated clusters and cyclic structures and apply minimal edits (projector updates, bias shifts) that preserve constructive interference while altering targeted knowledge.
- Sector(s): AI Ops, NLP products.
- Tools/products/workflows: "Geometry Editor" with constraints/solvers; sandboxed simulation of contextual reconstructions vs one-hot reconstructions.
- Assumptions/dependencies: Accurate cluster discovery; robust simulators that extrapolate to deployment contexts.
- Standard benchmarks and certifications for feature geometry
- What: Industry-wide benchmarks (extending BOWS) and certifications that require reporting on superposition regimes (linear vs non-linear prevalence, interference utilization).
- Sector(s): Policy, standards, procurement.
- Tools/products/workflows: Shared datasets with known geometry; certification rubrics; third-party audit services.
- Assumptions/dependencies: Community consensus; reproducibility across model families; regulator buy-in.
- Hardware and systems that exploit low-rank constructive interference
- What: Accelerators with fast low-rank projectors and dynamic biasing to benefit from covariance-aligned representations; memory hierarchies tuned for projector reuse.
- Sector(s): Semiconductors, cloud providers.
- Tools/products/workflows: Kernel libraries for projector inference; compiler passes that detect and fuse projection patterns.
- Assumptions/dependencies: Persistent lowโrank structure in production workloads; benefits outweigh complexity.
- Multimodal generalization
- What: Apply constructive interference principles to vision/audio (e.g., co-occurring visual attributes, phoneme-word co-activations), guiding feature geometry and compression.
- Sector(s): Multimodal AI (vision, speech).
- Tools/products/workflows: Cross-modal BOWS analogs (e.g., bag-of-attributes); joint projectors across modalities.
- Assumptions/dependencies: Robust co-occurrence statistics; alignment with downstream task performance.
- Bias and safety interventions using cluster maps
- What: Map and monitor harmful or biased associations as correlated clusters; reconfigure geometry to weaken undesirable constructive interference while preserving utility.
- Sector(s): Public policy, platform safety.
- Tools/products/workflows: "Bias-cluster" dashboards; geometry-preserving debiasing operations (e.g., nullspace projections).
- Assumptions/dependencies: Reliable identification of harmful clusters; minimal side effects on benign correlations.
- Clinical NLP with correlation-aware representations
- What: Use constructive interference to represent symptom clusters and comorbidities efficiently, supporting summarization and decision support while minimizing false positives via thresholding.
- Sector(s): Healthcare.
- Tools/products/workflows: Clinical AEs trained with weight decay; safety-critical ReLU threshold calibration; audit logs for edits to medical concept clusters.
- Assumptions/dependencies: High-quality labeled or weakly supervised clinical corpora; regulatory validation; privacy controls.
- Finance: factor modeling via learned constructive interference
- What: Align embeddings with correlated risk/alpha factors using low-rank projectors; enable compact models that preserve factor structure and improve interpretability of signals.
- Sector(s): Finance.
- Tools/products/workflows: Representation audits against known factor covariance; geometry-aware portfolio construction features.
- Assumptions/dependencies: Stable factor correlations; stringent validation against overfitting and regime changes.
- Continual and multitask learning with shared low-rank cores
- What: Share capacity across correlated tasks/features through a learned projector core; add task-specific ReLU thresholds to suppress residuals.
- Sector(s): Enterprise ML, AutoML.
- Tools/products/workflows: Adapter stacks composed of projectors + threshold layers; task-aware covariance tracking.
- Assumptions/dependencies: Task correlation; effective avoidance of negative transfer.
- Human-in-the-loop UIs for feature geometry exploration
- What: Interfaces that visualize clusters and cycles, let users test one-hot vs contextual reconstructions, and propose safe edit plans that maintain constructive interference.
- Sector(s): Productivity tools, ML platforms.
- Tools/products/workflows: Interactive PCA/UMAP canvases; R² probe widgets; "edit impact" previews.
- Assumptions/dependencies: Usable performance at scale; clear mental models for non-experts.
Notes on overarching assumptions and limitations:
- Constructive interference is most effective when feature covariance is approximately low-rank; spectrum concentration is a key dependency.
- Results are demonstrated in tied-weight autoencoders with ReLU decoders and bag-of-words data; generalization to large transformers and untied architectures requires additional validation.
- The value-coding vs presence-coding distinction is crucial: geometry from value codes can exist without superposition and requires different diagnostics and interventions.
Glossary
- Anisotropic superposition: A feature arrangement where related features cluster rather than minimizing pairwise overlaps, contrary to isotropic/regular arrangements. "as well as anisotropic superposition, where related features cluster together rather than minimizing dot products"
- Antipodal pairs: A geometry where features are placed as opposite (negatively correlated) directions so one suppresses the other via a nonlinearity. "represents features as antipodal pairs"
- Bag-of-Words Superposition (BOWS): A controlled framework that encodes binary bag-of-words text in superposition to study realistic feature correlations. "we introduce Bag-of-Words Superposition (BOWS), a framework in which an autoencoder is trained to encode binary bag-of-words representations of internet text in superposition."
- Bottlenecks (tight bottlenecks): Strong compression regimes where the latent dimension is much smaller than the number of features. "these solutions emerge prominently under tight bottlenecks or weight decay"
- Co-activation patterns: Patterns of features that tend to be active together, used to arrange features so interference is constructive. "arranging features according to their co-activation patterns naturally gives rise to semantic clusters"
- Coefficient of determination (R²): A per-feature measure of reconstruction quality comparing predicted and true values. "we define the per-feature coefficient of determination"
- Constructive interference: Interference among features that aligns with and reinforces the target signal rather than harming it. "interference can be constructive rather than just noise to be filtered out."
- Cyclical structures: Circular arrangements of related features (e.g., months) in representation space. "cyclical structures which have been observed in real LLMs"
- Dictionary learning: Methods that learn sparse, interpretable components (atoms) composing representations. "sparse dictionary learning approaches like sparse autoencoders to decompose model activations into an overcomplete basis of linear features"
- Frobenius norm (off-diagonal): A matrix norm used here to quantify interference via the magnitude of off-diagonal entries. "off-diagonal Frobenius norms"
- Latent dimension (m): The size of the hidden representation (bottleneck) in the autoencoder. "with varying latent dimensions m."
- Linear decoder: A linear mapping used to reconstruct features from the latent representation. "train a linear decoder (without a ReLU) to reconstruct their inputs."
- Linear Representation Hypothesis (LRH): The hypothesis that high-level concepts are linearly represented in model activations. "Definition 4 (Linear Representation Hypothesis)."
- Linear superposition: A regime where correlated features can be recovered with a linear decoder due to low-rank structure. "We refer to the regime in which low-rank structure in the data supports constructive interference as linear superposition."
- Low-rank structure: Data covariance concentrated in a few principal components, enabling efficient projection-based reconstruction. "including approximately low-rank structure"
- Non-linear autoencoder: An autoencoder that uses a nonlinearity (e.g., ReLU) in its reconstruction pathway. "non-linear autoencoders can exploit interference constructively"
- Non-linear superposition: Superposition where accurate recovery requires a non-linear decoder and cannot be achieved linearly. "as an example of non-linear superposition."
- Orthogonal complement: The subspace perpendicular to a chosen feature subspace; often ablated to test reliance on particular codes. "zeros its orthogonal complement"
- Orthogonal projector: A linear operator projecting data onto a subspace, such as the top principal components. "the orthogonal projector onto the top-m principal components"
- Overcomplete basis: A set of representing directions exceeding the ambient dimensionality, necessitating superposition. "over-complete basis"
- PCA (Principal Component Analysis): A method that identifies directions of maximal variance; used to reveal circular structures in features. "PCA applied directly to the 12 month dimensions"
- Pointwise Mutual Information (PMI): A measure of word association used in embedding theory to relate co-occurrence and vector factorization. "Pointwise Mutual Information (PMI) matrix"
- Presence-coding features: Features that act as detectors for discrete properties, decodable by linear classifiers. "Presence-coding features."
- Principal subspace: The subspace spanned by the leading principal components identified by PCA. "projection onto the principal subspace"
- Regular polytopes: Highly symmetric geometric arrangements where pairwise dot products are minimized or uniform. "yielding local structures like regular polytopes."
- ReLU (Rectified Linear Unit): A nonlinearity that zeroes negative inputs, used to suppress harmful interference. "ReLU filters out interference"
- ReLU-based filtering: Using ReLU and biases to eliminate negative or spurious activations arising from interference. "ReLU-based filtering remains important for suppressing harmful interference"
- Semantic clusters: Groupings of features by meaning that emerge in representation space under correlations and constraints. "giving rise to semantic clusters"
- Sparse autoencoders (SAEs): Autoencoders trained with sparsity to recover interpretable features from activations. "sparse autoencoders (SAEs)"
- Superposition: Representing more features than dimensions by sharing directions, allowing interference among features. "arranging them in superposition to form an overcomplete basis"
- Tied weights: An architecture where decoder weights are the transpose of encoder weights. "For tied-weight AEs"
- UMAP: A non-linear dimensionality reduction technique used to visualize learned feature geometries. "UMAP projections of the word embeddings"
- Value-coding features: Features that linearly encode continuous variables (e.g., angles, coordinates) used for computation. "we say that a representation h(x) contains a value-coding feature"
- Weight decay: L2 regularization that penalizes large weights, biasing models toward low-norm solutions. "more prevalent in models trained with weight decay"