From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
Abstract: Activation decomposition methods in LLMs are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
Explain it Like I'm 14
Overview
This paper explores a new way to understand what’s going on inside LLMs (like Llama and Gemma) when they think. Instead of looking for single “straight-line” directions for ideas inside the model, the authors map the model’s hidden activity into local “regions” that each have their own shape and small set of important directions. They use a method called Mixture of Factor Analyzers (MFA) to do this. Their goal is to find clearer, more useful chunks of meaning and control the model’s behavior more reliably.
Key Objectives
The paper asks:
- Can we break down an LLM's internal activity into small, understandable regions instead of just single, global directions?
- Do these regions capture complex ideas better, especially when the ideas aren’t simple or linear?
- Can this regional breakdown help us find where specific concepts live in the model and steer the model to produce certain kinds of text more effectively?
Methods and Approach
The authors use a technique from statistics called Mixture of Factor Analyzers (MFA) to study the “activation space” of LLMs. An activation is the model’s internal signal at a specific layer while processing text; you can think of activation space as a giant map of how the model “represents” ideas.
Here’s an everyday analogy:
- Imagine a city map. Instead of one long straight highway representing “sports,” the city has neighborhoods: “football,” “hockey,” “national teams,” “leagues,” etc. Each neighborhood has a center (like the town square) and a few main streets that define how you move around inside that neighborhood.
What MFA does:
- It splits the activation map into many “Gaussian regions” (think bubbles or neighborhoods of related meaning).
- Each region has:
- A centroid: the center point of that neighborhood (the “average” activation for that region).
- A low-dimensional subspace: a small set of directions that describe how things vary inside the region (like main streets pointing toward “league,” “register,” or “commission” inside the “National” neighborhood).
- When the model processes a piece of text, MFA assigns it to one or more regions and explains it as:
- How close it is to the region’s center.
- How it moves within that region (its local offset along those few directions).
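The two-part decomposition above (responsibilities, then centroid plus local offset) can be sketched numerically. This is a minimal toy illustration, not the authors' released code: the sizes, random parameters, and the `decompose` helper are made up for the example, and the noise covariance is fixed to the identity as the paper assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, R = 8, 3, 2  # toy sizes: activation dim, number of regions, latent rank

# Toy MFA parameters: mixture weights, centroids, loadings (noise Psi = I).
pi = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, D))
W = rng.normal(scale=0.3, size=(K, D, R))

def log_gaussian(x, mean, cov):
    """Log-density of a multivariate Gaussian."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(cov, d))

def decompose(x):
    """Assign x to regions, then split it into centroid + local offset."""
    covs = [W[k] @ W[k].T + np.eye(D) for k in range(K)]
    logp = np.array([np.log(pi[k]) + log_gaussian(x, mu[k], covs[k])
                     for k in range(K)])
    resp = np.exp(logp - logp.max())
    resp /= resp.sum()                      # posterior responsibilities
    k = int(resp.argmax())                  # dominant region
    # Posterior mean of latent factors: E[z|x] = W^T (W W^T + I)^{-1} (x - mu)
    z = W[k].T @ np.linalg.solve(covs[k], x - mu[k])
    offset = W[k] @ z                       # local movement within the region
    return k, resp, offset

x = mu[1] + W[1] @ rng.normal(size=R)       # a sample generated near region 1
k, resp, offset = decompose(x)
print(k, np.round(resp, 3))
```

The printed responsibilities sum to one, and the activation is explained as the winning region's centroid plus a small offset inside its low-rank subspace.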
They trained large MFA models with thousands of regions on two LLMs (Llama-3.1-8B and Gemma-2-2B), using millions of examples. Then they compared MFA to popular “dictionary learning” methods, especially sparse autoencoders (SAEs), which try to explain activations using a global list of directions.
A few key terms, simplified:
- Activation space: the internal “map” of the model’s thoughts.
- Gaussian region: a bubble where similar meanings cluster together.
- Centroid: the center of a bubble; represents the broad theme of that region.
- Subspace/loadings: the few main directions inside a bubble; these capture structured variations (like genre vs. subgenre).
- Responsibilities: how much a given activation belongs to each region.
Main Findings and Why They Matter
- Regions capture complex ideas better than isolated directions.
- The authors found two types of regions:
- Broad regions: cover big topics like emotions or movie genres.
- Narrow regions: focus on specific tokens or forms, like uses of the word “National.”
- Broad regions tend to show more semantic variation (differences in meaning), while narrow ones show more syntactic variation (differences in format, punctuation, or capitalization).
- Concepts form neighborhoods made of multiple regions.
- Related regions cluster together, like different emotions (“happiness,” “anger,” “surprise”) tiling an “emotions” neighborhood. This matches how real concepts are complex and not always captured by a single straight-line direction.
- MFA’s decomposition is more interpretable than SAEs’.
- MFA explains an activation with just a region center and a small local offset. These parts were labeled “interpretable” far more often than the many small features SAEs rely on.
- In tests, about 96% of MFA’s main contributions were interpretable, versus about 29% for SAEs.
- MFA helps find and manipulate concepts inside models.
- Localization: On tasks that check if you can find where a concept is stored (like Continent, Country, Language), MFA beat simple baselines (like PCA and SAEs) and was often competitive with strong supervised methods. It performed best on “Continent” and did well overall on RAVEL and MCQA benchmarks.
- Steering: When trying to make the model output align with a concept (for example, pushing it toward a genre), moving toward MFA centroids often worked better than using SAE features or supervised difference-in-means methods. It produced outputs that were both more concept-aligned and fluent.
- More regions sharpen meaning without always increasing overall steering performance.
- With more components, regions get narrower and more specific. That can split a broad concept across more bubbles, which is useful for understanding but doesn’t always boost steering scores further. Still, MFA’s centroid-based steering generally outperformed alternatives.
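The steering operations described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are invented here, and the alpha values are arbitrary (the paper sweeps alpha and scores concept alignment against fluency).

```python
import numpy as np

def steer_toward_centroid(h, centroid, alpha):
    """Interpolate a hidden state toward a region centroid.

    alpha = 0 leaves h unchanged; alpha = 1 lands exactly on the centroid.
    """
    return (1.0 - alpha) * h + alpha * centroid

def steer_additive(h, direction, alpha):
    """Standard additive intervention: add alpha times a feature direction."""
    return h + alpha * direction

h = np.zeros(4)                            # toy hidden state
c = np.array([2.0, 0.0, -2.0, 4.0])        # toy region centroid
print(steer_toward_centroid(h, c, 0.5))    # halfway to the centroid
```

Centroid interpolation keeps the edited state between two points the model has actually visited, which is one intuition for why it tends to preserve fluency better than adding an arbitrary direction.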
Why this matters:
- It shows that “local geometry” (neighborhoods with their own small sets of directions) is a powerful way to understand and control LLMs, especially for concepts that aren’t simple straight lines.
Implications and Potential Impact
- Better interpretability: This local-region view makes model internals easier to understand, helping researchers see where and how concepts live.
- More precise control: You can steer the model by moving toward a region’s center for broad themes or adjusting local directions for fine-grained changes. That’s useful for safer, more targeted generation.
- Scalable tools: MFA works at large scale across different layers and models, and the authors released code and trained models for community use.
- Dual-use caution: The same tools that improve transparency and control could be misused to bypass safeguards or amplify harmful behaviors. Responsible use and strong safety practices are essential.
In short, the paper argues that LLMs organize information in local, low-dimensional regions. By focusing on these regions—rather than single global directions—we can discover concepts more clearly, find where they are stored, and steer models more effectively.
Knowledge Gaps
Below is a concise list of the paper’s unresolved knowledge gaps, limitations, and open questions. Each item is phrased concretely to guide follow-up research.
- Model and layer coverage: Results are reported for only two model families (Llama-3.1-8B, Gemma-2-2B) and two layers per model; evaluate MFA across more architectures (e.g., Mistral, GPT-NeoX), depths, and positions to assess generality.
- Fixed uniform latent rank: The method uses a single rank R=10 for all components; develop adaptive per-component rank estimation (e.g., via MDL/BIC or cross-validation) and test its impact on interpretability and performance.
- Simplified noise model: A component-shared diagonal noise covariance Ψ = I is assumed; compare against component-specific diagonal noise, full covariance, and heavy-tailed (e.g., Student-t) noise to model non-Gaussian local structure.
- Gaussianity assumption: Validate whether local regions are well-approximated by Gaussians; benchmark alternatives (GMM with full/diagonal covariances, mixture of PPCA, mixture of VAEs, normalizing-flow mixtures) on the same tasks.
- Training procedure: MFA is trained with gradient descent from K-means initialization; compare to EM-based training, analyze convergence behavior, sensitivity to initialization, and robustness to hyperparameters.
- Scalability and inference cost: Computing responsibilities over up to 32K components per activation is expensive; quantify runtime/memory and prototype approximate gating/ANN search to reduce per-token component evaluation.
- Assignment sparsity: Responsibilities are soft over all components; study hard assignment or sparsity-regularized responsibilities to improve interpretability, reduce compute, and assess effects on localization/steering.
- Hyperparameter selection: Provide principled procedures to choose K and R (e.g., via held-out likelihood, BIC/AIC, stability under bootstrap) and characterize how these choices trade off narrow vs. broad components.
- Local intrinsic dimension: Measure intrinsic dimensionality per region and align component rank to empirical estimates; report distributions and how they vary by layer/model.
- Subspace axis selection: Because W is rotationally invariant, propose methods (e.g., varimax, supervised rotations, canonical correlation to labeled variables) to produce stable, interpretable axes within each subspace.
- Loadings-based steering: Steering experiments focus on centroids; systematically evaluate within-region subspace interventions (choosing v data-driven or learned) and compare their coherence and concept alignment to centroid steering.
- On-manifold control: Quantify off-manifold risks of centroid interpolation and additive subspace moves (e.g., likelihood drop, density estimates), and develop constraints or projections that keep interventions in high-density regions.
- Multi-Gaussian concept aggregation: Formalize methods to cluster neighboring components into higher-level “constellations,” and evaluate coverage, purity, and causal efficacy of aggregated concepts for localization and steering.
- Neighborhood semantics quantification: Move beyond qualitative BFS examples by measuring semantic coherence of neighborhoods with external labels, clustering metrics (e.g., NMI/ARI), and retrieval performance.
- Generalization across data and languages: MFAs are trained on The Pile; test generalization to other domains (code, biomedical, legal), multilingual corpora, and OOD distributions, and identify failure modes.
- Position and token-type effects: Analyze how responsibilities and subspace structure vary with sequence position and token categories (function vs. content tokens), beyond the last-position steering setup.
- Causal localization without DBM: MFA’s localization uses DBM to select bases; ablate DBM to measure MFA’s direct causal isolation ability and disentangle contributions of centroids vs. loadings.
- Mechanistic alignment: Link centroids/subspaces to concrete circuits (attention heads, MLP neurons) via causal tracing/patching, and test whether MFA-derived features map to known mechanistic pathways.
- Interpretability labeling validity: The broad/narrow and semantic/syntactic labels rely partly on GPT-5-mini; expand human validation, report inter-annotator agreement at scale, and analyze LLM-judge biases.
- Interpretability fraction metric: The IF metric weights by feature norm; test whether IF correlates with causal impact (mediation analyses, counterfactual influence functions) rather than just reconstruction magnitude.
- Comparison to subspace baselines: Empirically compare MFA against recent subspace-oriented methods (e.g., Huang & Hahn, Sun et al., Tiblias et al.) and classic MoPCA/GMM baselines on the same benchmarks.
- Frequency artifacts: Assess whether centroids primarily capture token frequency or semantics; control for frequency (e.g., reweighting, stratified sampling) and test semantic robustness under frequency shifts.
- Capacity effects: Increasing K did not consistently improve steering; analyze whether capacity primarily splits broad concepts, and design aggregation or regularization to maintain actionable concept granularity.
- Robustness to fine-tuning: Examine how MFAs trained on a base LM transfer after fine-tuning or RLHF; measure drift of centroids/subspaces and propose incremental updating or amortized adapters.
- Safety evaluation: Given dual-use potential, benchmark steering on sensitive concepts, quantify misuse risk, and develop guardrails (e.g., safety filters for component selection, policy constraints on interventions).
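Several of the gaps above (multi-Gaussian aggregation, neighborhood semantics) build on the kNN graph over centroids that the paper constructs. A minimal sketch of that construction, with toy 2-D centroids standing in for the real high-dimensional ones; `bfs_neighborhood` is a hypothetical helper for the BFS exploration the paper describes qualitatively:

```python
import numpy as np
from collections import deque

def knn_graph(centroids, k):
    """Connect each centroid to its k nearest neighbors (Euclidean)."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # no self-edges
    nbrs = np.argsort(d, axis=1)[:, :k]
    adj = {i: set(nbrs[i]) for i in range(len(centroids))}
    for i, js in list(adj.items()):         # symmetrize the graph
        for j in js:
            adj[j].add(i)
    return adj

def bfs_neighborhood(adj, start, radius):
    """Collect all components within `radius` hops of `start`."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == radius:
            continue
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# Two well-separated clusters of toy centroids.
c = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
adj = knn_graph(c, k=2)
print(sorted(bfs_neighborhood(adj, 0, radius=2)))   # stays inside cluster one
```

Clustering or BFS over such a graph is one concrete starting point for the "constellation" aggregation the gap list calls for.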
Glossary
- Ablation: A controlled experiment that removes or restricts model components to assess their contribution to performance. "We also ablate MFA on Gemma-2-2B to identify whether the causal variables reside in the loadings or centroids."
- Activation decomposition: The process of breaking model activations into interpretable components or factors. "Activation decomposition methods in LLMs are tightly coupled"
- Activation space: The high-dimensional vector space formed by model activations at a given layer or position. "MFA maps the activation space into Gaussian regions"
- Additive intervention: An operation that steers a model by adding a scaled feature vector to a hidden state. "we use the standard additive intervention (adding α times the feature direction)"
- Causal localization: Identifying and targeting internal variables or representations whose manipulation causally changes model behavior. "Overall, MFA performs strongly on causal localization"
- Centroid: The mean vector of a Gaussian component that anchors a local region in activation space. "utilizing MFA centroids steers better than SAE features"
- Desiderata-Based Masking (DBM): A supervised method that learns sparse masks over a basis to isolate features aligned with a target variable. "we utilize Desiderata-Based Masking (DBM) on top of MFA's components"
- Difference-in-means (DiffMeans): A supervised technique that computes a concept direction as the mean difference between activating and neutral examples. "supervised difference-in-means (DiffMeans)"
- Dictionary learning: Learning a set of basis directions so activations can be represented as sparse combinations of those directions. "the predominant dictionary learning method."
- Diagonal noise covariance: An assumption in generative models where noise across observed dimensions is independent, making the covariance matrix diagonal. "set the (component-shared) diagonal noise covariance to Ψ = I_D."
- Factor Analysis (FA): A generative probabilistic model that explains observed covariance via a few latent factors plus diagonal noise. "Factor Analysis (FA) FA is a statistical method"
- Gaussian Mixture Models (GMMs): Probabilistic models representing data as a mixture of Gaussian distributions. "MFA is a low-rank variant of GMMs, making it more efficient and providing a local low-dimensional structure."
- Harmonic mean: An averaging method used to combine multiple scores, especially when balancing trade-offs. "aggregate the concept and fluency scores with a harmonic mean as the final score."
- Intrinsic dimension: The effective dimensionality needed to describe local variation in data. "a conservative approximation to the local intrinsic dimension of each region"
- k-Nearest Neighbors (kNN) graph: A graph where each node connects to its k closest neighbors, used here to connect nearby Gaussian centroids. "we construct a kNN graph using Euclidean distance between centroids"
- Latent coordinates: The coordinates of an activation in a component’s low-dimensional latent subspace. "compute the component's latent coordinates"
- Latent factors: Unobserved variables that generate observed data and explain correlations between dimensions. "with latent factors z ~ N(0, I)"
- Loadings: Columns of the factor loading matrix W that map latent factors to changes in observed dimensions. "commonly referred to as the loadings"
- Manifold hypothesis: The idea that high-dimensional data concentrate near lower-dimensional manifolds. "consistent with a manifold hypothesis"
- Mixture of Factor Analyzers (MFA): A mixture model where each component is a factor analyzer modeling a local low-dimensional Gaussian region. "we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative"
- Mixture weights: The prior probabilities of selecting each component in a mixture model. "The mixture weights are initialized uniformly, π_k = 1/K for all k."
- Negative log-likelihood: A loss function minimized during training that corresponds to maximizing the model’s likelihood of the data. "minimizing the negative log-likelihood with gradient descent"
- Orthogonal rotation invariance: The property that certain models (like FA) are unchanged under orthogonal rotations of their loading matrices. "W is invariant to orthogonal rotations."
- Posterior mean: The expected value of latent variables given observed data under a probabilistic model. "using the posterior mean under FA"
- Posterior responsibilities: The probabilities that a given sample was generated by each component in a mixture model. "assigning each activation to components via posterior responsibilities"
- Residual stream: The running hidden state pathway in Transformer architectures that carries information across layers. "extracted from the residual stream at a fixed layer"
- Sparse autoencoders (SAEs): Autoencoders trained with sparsity constraints to discover interpretable feature directions. "sparse autoencoders (SAEs), the predominant dictionary learning method."
- Steering: Controlling a model’s outputs by intervening on internal activations or directions. "localization and steering benchmarks show that MFA outperforms"
- Subspace (low-rank subspace): A lower-dimensional linear space capturing the principal modes of variation within a region. "learns a low-rank subspace that captures dominant modes of variation."
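Several glossary entries (loadings, latent factors, latent coordinates, posterior mean) fit together in one standard Factor Analysis identity. A minimal numeric sketch, assuming the identity noise covariance Ψ = I used in the paper; the sizes and random parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
D, R = 6, 2                      # observed dimension, latent rank

W = rng.normal(size=(D, R))      # loadings: map latent factors to observations
mu = rng.normal(size=D)          # component centroid

# Generative model: x = mu + W z + eps, with z ~ N(0, I_R), eps ~ N(0, I_D).
z_true = rng.normal(size=R)
x = mu + W @ z_true + 0.01 * rng.normal(size=D)   # nearly noise-free sample

# Posterior mean of the latent coordinates:
#   E[z | x] = W^T (W W^T + I)^{-1} (x - mu)
cov = W @ W.T + np.eye(D)
z_hat = W.T @ np.linalg.solve(cov, x - mu)
print(np.round(z_hat, 2), np.round(z_true, 2))
```

The recovered coordinates are a shrunken version of the true latent factors; because W is only identified up to orthogonal rotation, individual coordinates are not meaningful on their own, which is why the paper interprets the subspace as a whole.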
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that leverage MFA’s region-and-subspace decomposition, centroid steering, and the demonstrated gains in causal localization and controllable generation.
- Region-aware controllable generation
- What: Integrate centroid interpolation to steer outputs toward desired high-level themes (centroids) and use local subspace offsets for fine-grained adjustments.
- Sectors: software, creative media/marketing, education, customer support, legal.
- Tools/workflows: “CentroidSteer” operator with alpha sweeps; a small “RegionProbe” hook to compute responsibilities and pick candidate centroids; A/B evaluation on concept and fluency scores as in the paper.
- Assumptions/dependencies: Access to hidden states during inference; integration hooks in the inference stack; curated centroid labels for target styles.
- Targeted safety and moderation guardrails
- What: Identify regions associated with toxic/off-policy content and interpolate away from them; optionally apply local subspace offsets to restore coherence after suppression.
- Sectors: content platforms, safety, policy compliance.
- Tools/workflows: Region blacklist/whitelist; centroid-based “pull-away” interventions; dashboard to monitor region activations in production.
- Assumptions/dependencies: Accurate labeling of disallowed regions; continuous monitoring to avoid false positives and mode collapse.
- Bias localization and mitigation
- What: Use MFA+DBM bases to isolate variables like Country/Language/Continent (RAVEL) and attenuate or correct them when undesired.
- Sectors: HR tech, finance, healthcare, public sector.
- Tools/workflows: DBM over MFA components; unit tests using RAVEL/MCQA-style probes; “bias heatmaps” over region neighborhoods.
- Assumptions/dependencies: Clear bias metrics and acceptance thresholds; governance procedure for interventions.
- Mechanistic debugging and regression testing for LMs
- What: Track how region responsibilities and local subspaces change across model updates; catch regressions in internal mechanisms even when external metrics look stable.
- Sectors: software engineering/MLOps, model dev teams, academia.
- Tools/workflows: “RegionDrift” dashboard; per-release MFA diffing; regression gates triggered by region-distribution shifts.
- Assumptions/dependencies: Repeatable activation sampling; storage and versioning of MFA artifacts.
- Prompt-engineering copilot
- What: Recommend the minimal centroid(s) and small local offsets to satisfy a semantic intent, reducing prompt complexity and trial-and-error.
- Sectors: general software, marketing, education, CX.
- Tools/workflows: Intent-to-centroid lookup; neighborhood BFS over centroid graphs to compose multi-Gaussian concepts.
- Assumptions/dependencies: Curated “concept atlas” mapping business intents to centroid clusters.
- Retrieval routing and tool-use orchestration
- What: Use responsibilities to detect when the model is in regions that historically benefit from external tools/RAG (e.g., factual lookup) and route appropriately.
- Sectors: enterprise search, customer support, software with tool-augmented LMs.
- Tools/workflows: Region→tool routing tables; latency-aware pre-checks at selected layers (e.g., 1/3 and 2/3 depth as in the paper).
- Assumptions/dependencies: Mapping between regions and tool efficacy; early-layer activation hooks.
- Concept library construction for interpretability
- What: Build human-readable catalogs of centroids (broad vs. narrow) and local subspaces (semantic vs. syntactic variation) per layer.
- Sectors: academia, safety, model vendors.
- Tools/workflows: “RegionDB” concept atlas; automated sampling of high-likelihood contexts; neighborhood graphs; LLM-aided descriptions validated with human spot checks.
- Assumptions/dependencies: Annotation pipeline; governance for updating labels; acceptance of subspace-over-direction interpretability.
- Efficient task adaptation with region-conditioned adapters
- What: Attach lightweight adapters (e.g., LoRA) that only activate in specific regions; improve sample efficiency and reduce interference.
- Sectors: MLOps, enterprise ML.
- Tools/workflows: Region-gated adapters; per-region finetuning data mined from high-responsibility contexts.
- Assumptions/dependencies: Adapter support in serving stack; high-quality region-to-task alignment.
- Production monitoring and audit for compliance
- What: Log region responsibilities for regulated workflows (e.g., healthcare/legal summaries) to provide post-hoc explanations and audit trails.
- Sectors: healthcare, finance, public sector.
- Tools/workflows: Compliance logs of region activations; periodic audit reports summarizing risky-region exposure.
- Assumptions/dependencies: Data retention policies; privacy-preserving logging; regulator acceptance of region-level explanations.
- Evaluation and benchmarking augmentation
- What: Use MFA as a strong unsupervised baseline for causal localization/mediation in research and internal evaluations; compare against DBM/DAS.
- Sectors: academia, model vendors, evaluation startups.
- Tools/workflows: MIB, RAVEL, MCQA test harnesses; ablations separating centroid vs. loadings contributions.
- Assumptions/dependencies: Reproducible datasets; consistent layer choices.
- Personalization and tone/style control
- What: Map personas (concise, empathetic, formal, playful) to centroid clusters and interpolate accordingly; use subspace offsets for micro-adjustments (hedging, intensity).
- Sectors: CX, education tech, productivity.
- Tools/workflows: Persona→centroid mapping; live alpha tuning based on user feedback loops.
- Assumptions/dependencies: Persona definitions; guardrail overrides if personas conflict with policies.
- Latency and cost optimization via early exits
- What: If responsibilities at mid-layer indicate the target region is confidently reached, skip later expensive computations for certain classes of tasks.
- Sectors: infra/systems.
- Tools/workflows: Confidence thresholds on responsibilities; early-exit policies per endpoint.
- Assumptions/dependencies: Stable correlation between region assignment and final answer quality; careful offline calibration.
- Data curation and synthetic dataset generation
- What: Sample high-likelihood contexts from target regions to build small, high-quality datasets for finetuning or evaluation.
- Sectors: data ops, model training.
- Tools/workflows: Region-centric miners; synthetic data generation with centroid steering to increase coverage.
- Assumptions/dependencies: Avoiding feedback loops and distributional collapse; deduplication and quality filters.
- Safety red-teaming and jailbreak analysis
- What: Identify region neighborhoods activated during known jailbreaks and construct targeted tests/steers to defuse them.
- Sectors: safety, security.
- Tools/workflows: Attack→region mapping; differential steering experiments; logging during red-team sessions.
- Assumptions/dependencies: Ethical handling; dual-use risk management.
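The region-to-tool routing idea above reduces to a small gating check over responsibilities. A hypothetical sketch: the region ids, tool names, threshold, and the `route` helper are all invented for illustration and would come from the region-to-tool mapping an organization builds.

```python
import numpy as np

# Hypothetical routing table from region ids to tools; values are made up.
REGION_TO_TOOL = {3: "retrieval", 7: "calculator"}
THRESHOLD = 0.6

def route(responsibilities):
    """Pick a tool if a mapped region is confidently active, else None."""
    k = int(np.argmax(responsibilities))
    if responsibilities[k] >= THRESHOLD and k in REGION_TO_TOOL:
        return REGION_TO_TOOL[k]
    return None

resp = np.zeros(10)
resp[3] = 0.8                 # region 3 dominates with high confidence
resp[1] = 0.2
print(route(resp))            # routes to the retrieval tool
```

The same pattern covers the early-exit and monitoring applications: the check runs on responsibilities computed at a hooked layer, and the threshold is calibrated offline.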
Long-Term Applications
These opportunities require additional research, scaling, or ecosystem support (e.g., provider APIs, standards, cross-model alignment).
- Standardized regional control APIs
- What: Model providers expose supported layer hooks to query responsibilities and apply centroid/subspace interventions safely.
- Sectors: model platforms, tooling vendors.
- Dependencies: Provider cooperation; safe-guarded interfaces; latency budgets and rate limits.
- Cross-model region alignment and transfer
- What: Map centroids and local subspaces across models to enable portable steering and consistent behavior in multi-model stacks.
- Sectors: interoperability, enterprise ML.
- Dependencies: Alignment algorithms for heterogeneous spaces; shared benchmarks; possible alignment probes or teacher-student protocols.
- Training-time objectives for local geometry
- What: Architectures and losses that encourage clean, stable regional structure and semantically disentangled local subspaces.
- Sectors: model R&D.
- Dependencies: Scalability; no degradation of core capabilities; evaluation standards for local intrinsic dimension and interpretability.
- Formal causal guardrails
- What: Verified, policy-driven masks over regions/local subspaces that provably block classes of behaviors while preserving task performance.
- Sectors: safety, regulated industries.
- Dependencies: Robust causal abstraction methods; certification criteria; adversarial testing.
- Regulatory audits via concept atlases
- What: Region-based interpretability artifacts accepted by regulators for model-risk, fairness, and transparency audits.
- Sectors: finance, healthcare, public sector.
- Dependencies: Standards bodies; audit methodology; evidence that region metrics correlate with outcomes.
- Multi-modal extensions
- What: Apply MFA to VLMs/robotics policies to get region-aware control over styles of perception/description and tool selection.
- Sectors: robotics, healthcare imaging, media.
- Dependencies: Demonstrations that local subspaces generalize across modalities; efficient activation capture for multi-modal stacks.
- Privacy and PII containment at region level
- What: Detect and suppress regions associated with PII recall; enforce data minimization at the representation layer.
- Sectors: compliance, privacy tech.
- Dependencies: High-precision PII region labeling; verification against leaker benchmarks; low false-positive rate.
- Watermarking and provenance via regional signatures
- What: Embed subtle, robust regional signatures for provenance and tamper detection.
- Sectors: content authenticity, IP protection.
- Dependencies: Robustness to paraphrase and model edits; low impact on quality.
- Region-aware dynamic compute and routing
- What: Combine with MoE to route tokens to experts based on region; allocate compute where local variability is highest.
- Sectors: systems efficiency, cloud ML.
- Dependencies: Serving infrastructure; stability under distribution shift; feedback control to avoid oscillations.
- Continual learning with region gates
- What: Detect emerging domains as new or shifting regions and adapt via targeted data collection/adapters.
- Sectors: enterprise ML, long-lived assistants.
- Dependencies: Drift detectors on responsibilities; safe adaptation policies; forgetting controls.
- Explainable agent design
- What: Use regional states as the agent’s “working memory slots” for planning and tool invocation; expose interpretable state to users.
- Sectors: autonomous agents, enterprise workflows.
- Dependencies: Stable mapping from regions to actionable intentions; human factors research.
- Open community MFA hubs
- What: Shared, versioned MFAs (per model/layer) with vetted annotations, neighborhood graphs, and steering recipes.
- Sectors: open-source ecosystem, academia.
- Dependencies: Maintenance resources; licensing clarity; reproducible training pipelines.
Common Assumptions and Dependencies Across Applications
- Access to hidden activations at selected layers (closed APIs may block this; self-hosted/open models favored).
- Compute/storage to train and serve MFAs (paper used 100M activations; production variants may use fewer with careful sampling).
- Layer selection matters; concepts can concentrate at mid/late layers (paper: ~1/3 and ~2/3 depth).
- Choice of K (components) and rank R trades off specificity vs. coverage; broad concepts may fragment at high K.
- Gaussian/local subspace assumptions: subspaces are meaningful but rotationally invariant; interpret the subspace as a whole, not single loadings.
- Annotation quality for centroids/subspaces (LLM-as-judge can bootstrap, but human validation is advisable in safety-critical domains).
- Dual-use risk: the same tools that enable control and transparency can aid evasion; require governance and red-team review before deployment.