Exemplar Partitioning for Mechanistic Interpretability

Published 14 May 2026 in cs.LG | (2605.14347v2)

Abstract: We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from LLM activations with $\sim 10^3\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim$20% of EP regions match an SAE feature at $F_1 > 0.5$, and EP one-hot probes retain $\sim$97% of raw-activation probe accuracy at $\ell_0 = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_1$ reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at $\sim 10^3\times$ less build compute.

Authors (1)

Summary

  • The paper introduces Exemplar Partitioning, an unsupervised method that builds activation space dictionaries via leader clustering and Voronoi partitions.
  • It demonstrates that EP retains nearly 97% of raw activation probe accuracy while reducing computational cost by roughly 10^3 times compared to sparse autoencoders.
  • The method enables causal manipulation and robust out-of-distribution detection by anchoring dictionary regions with immutable, real observed activations.

Exemplar Partitioning: An Unsupervised Approach for Efficient Mechanistic Interpretability

Methodological Foundations

The paper introduces Exemplar Partitioning (EP), an unsupervised method for extracting interpretable dictionaries from LLM activation spaces via leader clustering. Unlike sparse autoencoders (SAEs), which require extensive training, fixed dictionary sizes, and bundle reconstruction with sparsity objectives, EP constructs its dictionary as a Voronoi partition by streaming activations and anchoring each region with a real observed exemplar. The dictionary size emerges in accordance with the geometry of activation space rather than being imposed a priori.

The clustering uses centered cosine distance on the unit sphere, with calibration based on the percentile of pairwise distances within a corpus. This percentile, denoted $p$, governs the resolution: lower $p$ yields finer partitions with more regions, while higher $p$ produces coarser dictionaries. Each activation is normalized relative to the corpus center, and new regions are spawned only when activations are not within the threshold distance of existing exemplars. Importantly, exemplars are immutable once selected, enabling direct comparisons across layers, checkpoints, and even models.
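
The calibration step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: it assumes the corpus center is the mean of a calibration sample and that cosine distance is defined as one minus cosine similarity. The function names `center_and_normalize` and `calibrate_threshold` are hypothetical, and the random array stands in for real activations.

```python
import numpy as np

def center_and_normalize(acts, center):
    """Subtract the corpus center, then scale each row to unit length."""
    shifted = acts - center
    norms = np.linalg.norm(shifted, axis=1, keepdims=True)
    return shifted / np.clip(norms, 1e-12, None)

def calibrate_threshold(sample, p):
    """Return (center, threshold), where threshold is the p-th percentile of
    pairwise centered cosine distances within the calibration sample."""
    center = sample.mean(axis=0)            # assumption: center = sample mean
    unit = center_and_normalize(sample, center)
    cos_dist = 1.0 - unit @ unit.T          # assumption: distance = 1 - cosine
    upper = np.triu_indices(len(sample), k=1)  # count each unordered pair once
    return center, float(np.percentile(cos_dist[upper], p))

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 16))         # stand-in for real activations
center, tau_fine = calibrate_threshold(sample, p=1)
_, tau_coarse = calibrate_threshold(sample, p=10)
```

A lower percentile gives a smaller distance threshold, which is why lower $p$ leads to more, finer regions.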

Empirical Characterization and Comparative Analysis

EP dictionaries are shown to be locally interpretable and causally actionable. Regions in the Gemma-2-2B model are content- and function-coherent, and both content (e.g., ordinal regions) and function (e.g., code position) axes recur across partitions. The paper also reports partial agreement between EP and SAE features: approximately 20% of EP regions (at $p = 10$) attain $F_1 > 0.5$ in token overlap with an SAE feature. This indicates a shared core, though EP and SAEs diverge outside this intersection: EP captures broad density-based regions, whereas SAEs splinter them into linearly separable, sparse directions.
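
To make the token-overlap comparison concrete, the sketch below computes $F_1$ between the set of tokens assigned to an EP region and the set of tokens on which an SAE feature fires, then takes the best-matching feature per region. How "firing" is defined (for example, activation above zero) is an assumption here, and the helper names and toy arrays are hypothetical.

```python
import numpy as np

def f1_overlap(region_mask, feature_mask):
    """F1 between two boolean token masks of equal length."""
    region_mask = np.asarray(region_mask, dtype=bool)
    feature_mask = np.asarray(feature_mask, dtype=bool)
    true_pos = np.sum(region_mask & feature_mask)
    denom = region_mask.sum() + feature_mask.sum()
    # F1 = 2*TP / (|region| + |feature|); define as 0 when both are empty.
    return 2.0 * true_pos / denom if denom else 0.0

def best_f1_per_region(region_ids, feature_active, n_regions):
    """For each region, the highest F1 against any feature column."""
    best = np.zeros(n_regions)
    for r in range(n_regions):
        mask = region_ids == r
        best[r] = max(f1_overlap(mask, feature_active[:, f])
                      for f in range(feature_active.shape[1]))
    return best

# Toy data: six tokens, three regions, two features.
region_ids = np.array([0, 0, 0, 1, 1, 2])
feature_active = np.array([[1, 0], [1, 0], [1, 0],
                           [0, 1], [0, 0], [0, 1]], dtype=bool)
best = best_f1_per_region(region_ids, feature_active, n_regions=3)
matched_fraction = float(np.mean(best > 0.5))  # share of regions with F1 > 0.5
```

In this toy case region 0 matches feature 0 exactly, while regions 1 and 2 both compete for feature 1, so only two of the three regions clear the 0.5 threshold.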

EP retains nearly all of the linearly decodable label identity measured by probing, with the $p = 10$ one-hot code retaining $\sim 97\%$ of raw-activation probe accuracy. Notably, EP does this at $\sim 10^3\times$ less computational cost than the SAE dictionaries, since it needs no gradient steps or large training budgets.
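
A one-hot probe can be illustrated with a very small example. Because a one-hot code has exactly one active entry, any classifier that reads it reduces to a lookup from region to label, and on the training set the best such lookup is each region's most frequent label. The sketch below uses that majority readout as a stand-in for a trained linear probe; the paper's actual probe training may differ, and all names and data here are hypothetical.

```python
import numpy as np

def one_hot(ids, n):
    """Encode integer ids as rows with a single 1 (l0 = 1)."""
    codes = np.zeros((len(ids), n))
    codes[np.arange(len(ids)), ids] = 1.0
    return codes

def fit_majority_readout(region_ids, labels, n_regions, n_classes):
    """Per-region class counts, then the most frequent class per region."""
    counts = one_hot(region_ids, n_regions).T @ one_hot(labels, n_classes)
    return counts.argmax(axis=1)

# Toy training data: region index and class label per token.
train_regions = np.array([0, 0, 0, 1, 1, 2, 2, 2])
train_labels = np.array([0, 0, 1, 1, 1, 0, 0, 0])
readout = fit_majority_readout(train_regions, train_labels, 3, 2)

# Held-out tokens are classified from their region index alone.
test_regions = np.array([0, 1, 2, 1])
test_labels = np.array([0, 1, 0, 0])
accuracy = float(np.mean(readout[test_regions] == test_labels))
```

Comparing this accuracy with that of a probe trained on raw activations gives the kind of retention ratio quoted above.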

Behavioural Localisation and Causal Manipulation

EP also supports behavioral localization. Refusal behavior in instruction-tuned Gemma concentrates in a single heavily loaded region, and ablating that region's exemplar changes the held-out refusal rate by as much as $-0.96$ at $p = 12$, while matched non-refusal regions yield a null effect. Exemplar directions are more causally precise for ablation than mean-member directions, highlighting the utility of real observed activations as anchors.
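
One common way to ablate a direction is to project activations onto the subspace orthogonal to it. The sketch below does this for an exemplar direction, under the assumption that the direction is the exemplar minus the corpus center and that the projection is applied in centered coordinates; the paper's exact intervention may differ. The function name and random data are hypothetical.

```python
import numpy as np

def ablate_direction(acts, exemplar, center):
    """Remove the component of each centered activation along the
    (centered, unit-normalized) exemplar direction."""
    direction = exemplar - center
    direction = direction / np.linalg.norm(direction)
    coeff = (acts - center) @ direction        # projection coefficient per row
    return acts - np.outer(coeff, direction)   # subtract that component

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))                 # stand-in for real activations
center = acts.mean(axis=0)
exemplar = acts[0]                             # pretend row 0 is the exemplar
ablated = ablate_direction(acts, exemplar, center)

direction = (exemplar - center) / np.linalg.norm(exemplar - center)
residual = (ablated - center) @ direction      # approximately zero everywhere
```

Swapping `exemplar` for the mean of a region's members gives the mean-member variant that the paragraph above compares against.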

Additionally, the nearest-exemplar distance offers a nonparametric out-of-distribution (OOD) signal: activations from random tokens or underrepresented distributions (e.g., Bulgarian Wikipedia) are systematically further from EP dictionary exemplars, with the gap increasing as partition resolution tightens and decreasing with layer depth.
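
The nearest-exemplar score needs only the stored exemplars and the corpus center. A minimal sketch follows, assuming exemplars are already centered and unit-normalized and that the distance is one minus cosine similarity; the function name and toy vectors are hypothetical.

```python
import numpy as np

def nearest_exemplar_distance(acts, exemplars, center):
    """Cosine distance from each centered activation to its closest exemplar.
    `exemplars` is assumed to hold centered, unit-norm rows."""
    shifted = acts - center
    norms = np.linalg.norm(shifted, axis=1, keepdims=True)
    unit = shifted / np.clip(norms, 1e-12, None)
    return (1.0 - unit @ exemplars.T).min(axis=1)

# Toy dictionary with two exemplars in a 3-d space, center at the origin.
exemplars = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
center = np.zeros(3)
acts = np.array([[0.9, 0.1, 0.0],    # close to the first exemplar
                 [0.0, 0.0, 1.0]])   # orthogonal to every exemplar
scores = nearest_exemplar_distance(acts, exemplars, center)
```

Higher scores mean the activation is farther from everything the dictionary has seen; a deployment would still need a threshold chosen on held-out in-distribution data.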

Dynamics Across Layers, Domains, and Training Checkpoints

EP saturation sizes quantify the representational complexity inherent to activation distributions within domains and layers. Chat inputs saturate EP dictionaries at larger sizes with greater growth across network depth; code inputs are consistently more compact. Saturated region counts correlate with representation complexity at each layer and domain.

Cross-checkpoint matching elucidates training-induced drift in activation space. Hungarian matching of EP dictionaries across base and instruction-tuned Gemma models highlights substantial re-anchoring, with only a small subset of persistent universal regions (e.g., mathematical tasks) remaining. Instruction tuning fragments the previously unitary final-position representation, promoting content-discriminating axes and anchoring refusal along decision-time regions.
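
Hungarian matching between two dictionaries can be sketched with SciPy's `linear_sum_assignment`, using exemplar cosine similarity as the score to maximize. The toy exemplars and the 0.9 cutoff for calling a region "preserved" are assumptions for illustration, not values from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_dictionaries(exemplars_a, exemplars_b):
    """One-to-one matching that maximizes total cosine similarity.
    Both inputs are assumed to hold unit-norm exemplar rows."""
    sim = exemplars_a @ exemplars_b.T
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return rows, cols, sim[rows, cols]

# Toy dictionaries: two directions survive (in a different order),
# and one direction in the tuned model is new.
base = np.eye(3)
tuned = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.6, 0.8]])
rows, cols, sims = match_dictionaries(base, tuned)

preserved = sims >= 0.9   # hypothetical cutoff for a "preserved" direction
```

Matched pairs with high similarity are candidates for directions preserved through finetuning, while low-similarity pairs point to re-anchored or newly introduced regions.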

Stability, Information Retention, and Limitations

Leader clustering via streaming introduces seed dependence, but region stability is quantifiable using the log-concentrated effective sample size ($D_i$). Regions with high member count and coherence are reproducible across seed shuffles. EP's lossy compression discards nearly all raw coordinates but preserves nearly all linearly-probeable label identity under extremely tight one-hot encoding constraints.

EP is currently committed to centered cosine geometry; the role of magnitude, non-cosine metrics, and hierarchical activation structure is left to further work. Streaming order remains a source of instability for large or uniformly represented regions, and exploratory experiments with alternative exemplar selection protocols are proposed.

Practical and Theoretical Implications

EP offers an inexpensive route to unsupervised feature discovery and mechanistic interpretability in LLMs, combining interpretable regions, causal manipulability, cross-checkpoint commensurability, and domain-resolved saturation with roughly $10^3\times$ less build compute than SAEs. The paper demonstrates its practical utility for concept identification, OOD detection, and model behavior steering on Gemma-2-2B.

Theoretically, EP exposes the density-vs-linear-separability duality in representation learning, providing a vantage point to interrogate the geometry of activation space without the constraints of reconstruction objectives or fixed dictionary sizes. EP serves not only as a feature dictionary but as a direct measurement of activation space geometry and its evolution under training.

The implications extend to efficient model-wide inventorying, precise causal intervention, and rigorous assessments of representational drift—a substantial advance in transparency and interpretability of large networked systems.

Conclusion

Exemplar Partitioning is a rigorous, tractable, and interpretable object for unsupervised analysis of LLM activation spaces. It preserves most linearly-decodable information, enables efficient region-local causal manipulation, and produces dictionaries commensurable across architectures and training checkpoints. EP’s geometric construction, density-aware commitment, and efficiency open several promising avenues for interpretability research, future metric and protocol extensions, and real-time large-scale activation analysis. Its dual contribution as an interpretability tool and a geometric assay of activation structure is poised to facilitate both practical and theoretical progress in AI transparency and mechanistic understanding.

Explain it Like I'm 14

Overview

This paper introduces a simple, fast way to find and name “features” inside an LLM by looking at its internal signals while it reads text. The method is called Exemplar Partitioning (EP). Instead of training a complicated helper model, EP groups similar internal signals together and uses a real example from each group as its “label.” This makes the groups easy to compare across layers and model versions, works with far less data and compute than common alternatives, and still finds useful, human-meaningful patterns.

Key Questions the Paper Asks

  • Can we build an understandable “dictionary” of what’s going on inside an LLM without heavy training?
  • Do those dictionary entries line up with real ideas or behaviors (like refusal to answer harmful questions)?
  • How does this new method compare to the popular approach called sparse autoencoders (SAEs)?
  • Can these groups help with practical tasks, like telling when an input is out-of-distribution (weird or unfamiliar), or seeing how fine-tuning changes the model?

How the Method Works (in everyday terms)

Think of the model’s internal activity at a moment (an “activation”) as an arrow pointing somewhere in a huge space. Similar meanings or functions often make arrows point in similar directions.

EP builds a map of this space using real examples:

  • First, it peeks at a small sample of activations to set a “closeness” rule. Instead of a fixed number, it uses a percentile p (like “the closest 1% of pairs”), so the rule makes sense across different layers or models.
  • It recenters and scales activations so that “direction” matters most (this is like asking, “which way is the arrow pointing?” rather than “how long is it?”).
  • As activations stream in, EP compares each new activation to the saved “exemplars” (the example arrows that already define regions):
    • If the new activation is close enough to some exemplar, it joins that exemplar’s neighborhood.
    • If it’s not close to any, it becomes a new exemplar and starts a new neighborhood.
  • EP stops when new activations stop creating new neighborhoods (“saturation”).
  • The result is a Voronoi-like partition: every point in the space belongs to the nearest exemplar’s region. You can think of it like a map where every location belongs to the closest post office.
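
The steps above can be sketched as a short streaming loop. This is a simplified illustration, not the paper's code: it assumes activations have already been recentered and scaled to unit length, uses one minus cosine similarity as the distance, and treats "saturation" as a fixed number of consecutive activations (`window`, a hypothetical parameter) that create no new neighborhood.

```python
import numpy as np

def build_dictionary(unit_acts, threshold, window):
    """Leader clustering over a stream of unit-length activations.

    Each activation joins its nearest exemplar if it is within `threshold`;
    otherwise it becomes a new exemplar. The loop stops early once `window`
    activations in a row have created no new exemplar.
    """
    exemplars = [unit_acts[0]]        # the first activation seeds the map
    counts = [1]
    since_new = 0
    for x in unit_acts[1:]:
        dists = 1.0 - np.stack(exemplars) @ x
        nearest = int(np.argmin(dists))
        if dists[nearest] <= threshold:
            counts[nearest] += 1      # join an existing neighborhood
            since_new += 1
        else:
            exemplars.append(x)       # start a new neighborhood
            counts.append(1)
            since_new = 0
        if since_new >= window:       # saturation: nothing new for a while
            break
    return np.stack(exemplars), np.array(counts)

# Toy stream: arrows clustered tightly around three directions.
rng = np.random.default_rng(0)
stream = np.eye(3)[rng.integers(0, 3, size=300)] + 0.01 * rng.normal(size=(300, 3))
stream /= np.linalg.norm(stream, axis=1, keepdims=True)
exemplars, counts = build_dictionary(stream, threshold=0.1, window=50)
```

Note that exemplars are never moved or averaged after they are saved, and that shuffling the stream can change which points become exemplars.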

Why this is helpful:

  • The “label” for each region is a real, seen activation (the exemplar), not a learned average. That keeps the regions stable and human-inspectable.
  • Because exemplars are actual saved points, regions can be directly compared across layers, model checkpoints, and even different models.
  • EP requires much less data and compute than training an SAE. It’s a one-pass, streaming clustering process.

A small, helpful detail:

  • EP also tracks how tight each region is (“coherence”) and how many members it has. A simple score that combines these helps predict which regions are stable if you reshuffle the data order.

Main Findings

1) EP finds meaningful, human-readable features

On the AxBench test (which checks if you can detect many named concepts inside models), EP scores 0.881 (on a scale where higher is better), which is:

  • Much better than a common SAE baseline (0.755),
  • Close to a stronger SAE-based method (0.911),
  • And EP does this with about 1,000× less build compute and a smaller search pool.

This means EP’s simple “nearest exemplar” idea is good enough to detect a wide range of concepts.
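
The score reported here is AUROC: the chance that a randomly chosen token with the concept gets a higher detector score than a randomly chosen token without it. The sketch below computes it directly from that definition. The scores and labels are made up for illustration; in practice the score might be, for example, similarity to a chosen exemplar, which is an assumption rather than something stated above.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical detector scores for six tokens; 1 = concept present.
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auroc(scores, labels)   # 8 of 9 positive/negative pairs are ordered correctly
```

A score of 0.5 means the detector is no better than guessing, and 1.0 means every positive outranks every negative.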

2) EP can localize and causally affect behavior (refusal)

In an instruction-tuned model, activations related to “refusal” (like not answering harmful questions) bunch up into a specific EP region. If you project activations away from that region’s exemplar direction, refusal on new inputs collapses strongly, while doing the same with matched non-refusal regions doesn’t. This suggests the exemplar captures a precise behavior direction that you can steer.

3) EP preserves most information needed for simple probes

Even if you throw away everything except “which region is closest” (a 1-hot code), EP keeps about 97–98% of the accuracy of a standard linear probe that reads directly from raw activations. Translation: the identity that simple probes read out mostly lives in “which neighborhood you’re in,” not in all the tiny details.

4) EP and SAEs see a shared core but also different things

  • About 20% of EP regions match a sparse autoencoder feature well (at a common overlap score threshold).
  • EP tends to capture dense, content-anchored regions (real clusters in space).
  • SAEs focus on directions that help linearly reconstruct signals. They agree where “dense clusters” are also “linearly separable,” but differ elsewhere, so the methods are complementary.

5) EP gives a built-in out-of-distribution (OOD) signal

How far a new activation is from its nearest exemplar is a free OOD score. Random-token activations are noticeably farther away than normal data; underrepresented data (like some non-English Wikipedia) also tends to be farther than the training-like data. The signal varies by layer and is sharper in earlier layers.

6) EP shows how much “space” a domain uses

The number of regions needed at a fixed scale (saturation size) differs by domain and layer. For example, “chat” needs more regions than “code.” This gives a rough measure of how diversely the model represents different kinds of inputs at different depths.

7) EP reveals how fine-tuning rearranges internal space

Matching regions between a base model and its instruction-tuned version shows many directions get re-anchored. For instance, a direction related to harmful content in the base model seems to be promoted into a decision-time refusal direction after instruction tuning.

Why This Matters

  • Practical and efficient: EP is easy to build (one streaming pass, no heavy training) yet still uncovers useful, causal features. That lowers the barrier to studying model internals.
  • Transparent and comparable: Because entries are real exemplars, you can track the same regions across layers and training checkpoints, making it easier to see how models change.
  • Complements existing tools: EP captures density-based structure; SAEs capture linearly separable factors. Using both can give a fuller picture of what the model has learned.
  • Useful signals for safety and reliability: EP naturally provides an OOD score and can isolate behavior directions (like refusal), which can help with monitoring and steering models.

In short, Exemplar Partitioning turns the model’s internal space into a map of real, example-anchored neighborhoods. It’s simple, fast, and surprisingly powerful—good enough to detect concepts, localize behaviors, and track model changes—making it a promising addition to the interpretability toolkit.

Knowledge Gaps, Limitations, and Open Questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper’s Exemplar Partitioning (EP) approach.

  • Geometry choice is untested: quantify how alternative metrics (Euclidean, magnitude-aware cosine, Mahalanobis, hyperbolic, or a learned metric) affect region semantics, AxBench performance, OOD scores, and causal interventions, especially for magnitude-dependent features.
  • Calibration centre bias: assess sensitivity to how the corpus mean used for centring is estimated (sample size, domain-specific or layer-specific centring) and evaluate adaptive re-centering at inference under distribution shift.
  • Automatic selection of the percentile threshold p: develop unsupervised criteria (e.g., the distribution of the stability statistic D_i, MDL/BIC, saturation “elbow,” OOD separation) and multi-scale/hierarchical procedures to avoid manual p sweeps.
  • Saturation stopping rule: characterize how the saturation window W and batch size impact dictionary size, stability, and coverage; propose statistically grounded stopping criteria.
  • Seed/streaming-order dependence: design and test order-robust variants (reservoir sampling, multi-pass k-center, consensus across seeds, merge/prune of near-duplicate regions) and provide variance estimates across seeds.
  • Missing baselines vs standard clustering: compare EP to k-means/mini-batch k-means (cosine), k-center greedy, DP-means, hierarchical clustering, and VQ/VAE codebooks on interpretability, stability, and downstream tasks.
  • Fairness of SAE comparisons: equalize training data (build EP and SAEs on the same activation stream) and include modern SAE variants (TopK, JumpReLU) and full SAEBench metrics to isolate method effects from data differences.
  • Reconstruction/composition capacity: systematically evaluate whether n-sparse EP readouts (top-n distances) can recover reconstruction metrics (core/ravel) and quantify trade-offs between sparsity and fidelity.
  • Formal link between density cones and linear separability: derive conditions under which a linear feature corresponds to a single EP region, and define quantitative measures for density-vs-linearity overlap/mismatch.
  • Cross-layer/cross-model alignment: extend matching beyond base→instruction finetune to different sizes/families; test alignment methods (CCA/Procrustes/OT) to compare dictionaries across representational spaces.
  • Causal intervention robustness: generalize exemplar ablation tests to multiple behaviours, layers, and models; measure side effects on unrelated capabilities, dose-response curves, and compare exemplar vs mean vs SAE feature interventions.
  • OOD detection rigor: report ROC/AUPRC and calibrated thresholds on standard, diverse OOD corpora (multilingual, domain shift, adversarial perturbations); benchmark against Mahalanobis, energy-based, and kNN baselines; study layer-ensemble scoring.
  • Multilingual and code-domain coverage: investigate why code concepts lag and whether domain-specific calibration/centres or metrics improve performance; expand multilingual evaluation beyond a single Bulgarian sample.
  • Sequence-level structure: EP treats tokens independently—evaluate variants that capture multi-token or trajectory-level features (e.g., clustering activation paths across layers, n-gram exemplars).
  • Activation subspace choices: extend analyses beyond residual stream (e.g., MLP activations, attention outputs, K/V caches) and compare dictionary properties and utility across subspaces.
  • Coherence and stability metrics: precisely define C_i and D_i, compare to alternatives (silhouette, intra-cluster variance), derive confidence intervals, and set principled thresholds for filtering unstable regions.
  • Exemplar quality and outliers: quantify susceptibility to outlier-induced regions; develop robust seeding, pruning/merging rules, and outlier-resistant distances; evaluate impact on interpretability and downstream tasks.
  • Scalability of nearest-exemplar search: characterize time/memory complexity as K grows; evaluate approximate nearest neighbor indices (e.g., HNSW) and distributed build/inference strategies.
  • Region labeling and human evaluation: standardize post-hoc labeling beyond logit lens (e.g., probe-based summaries, exemplar summaries), conduct human ratings of monosemanticity, and compare to SAE interpretability audits.
  • Multi-scale/hierarchical EP: construct and evaluate hierarchical dictionaries (vary p across levels), define consistent cross-scale mappings (split/merge criteria), and test multi-resolution readouts.
  • Privacy/memorization risks: storing real activation exemplars may encode sensitive content—assess leakage risks and propose privacy-preserving variants (e.g., DP noise, synthetic surrogates, exemplar obfuscation).
  • Centre drift at inference: quantify how stale centres degrade assignment and OOD scoring on novel domains, and evaluate on-the-fly centre updates without rebuilding dictionaries.
  • AxBench selection heuristic: compare the contrastive mean-cosine selector to validation-AUROC, margin-based, or learned meta-selectors; analyze per-concept failure modes and selector generalization.
  • Hyperparameter sensitivity: report sensitivity to calibration batch size, context length, extraction batch size, and activation budget; provide practical guidelines for robust builds.
  • Validating “representation complexity” via saturation: test whether saturated K correlates with downstream difficulty/information content after controlling for token entropy/length; formalize and justify the metric.
  • EP–SAE hybrids: explore initialising SAEs with EP regions, constraining SAE decoders to EP partitions, or using EP distances as features to improve reconstruction while retaining interpretability.
  • Use of distance features: quantify how many nearest-exemplar distances (n) are needed to approach raw-activation probe accuracy and how this affects interpretability and sparsity.
  • Matching methodology: enhance cross-dictionary matching beyond cosine-only Hungarian (e.g., neighborhood-graph-aware matching) and evaluate reliability under seed/metric changes.
  • Handling tokenizer/glyph artifacts: develop preprocessing or filtering to avoid regions dominated by formatting/template/glyph tokens and assess impact on concept detection.
  • Reproducibility and release details: provide full build configs (seeds, budgets, centres, calibration sets) and ablation scripts to enable exact replication and controlled comparisons.

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that follow directly from the paper’s findings and method, along with sectors, potential tools/workflows, and key dependencies or assumptions.

  • EP-based out-of-distribution (OOD) monitor for LLM inference
    • Sectors: software platforms, safety & security, enterprise AI, finance, healthcare, education
    • What: Use nearest-exemplar distance as a “free” OOD score at inference (per token or per span) to detect distribution shift, anomalous inputs (e.g., random-token-like), or under-represented languages/domains.
    • Tools/products/workflows: “EP-Guard” runtime library that hooks into a chosen layer (e.g., L12 or L20) and emits OOD scores; dashboards showing per-layer distance histograms by traffic segment.
    • Dependencies/assumptions: Access to model activations at a fixed layer; a calibrated EP dictionary (choice of percentile p, layer); the OOD gap varies by layer and resolution; calibration data should approximate production distribution for threshold setting.
  • Label-free concept detectors for moderation, compliance, and routing
    • Sectors: content moderation, policy/compliance, social media, enterprise knowledge management, code review
    • What: Select EP regions as concept detectors (as in AxBench) for topics like safety categories, sensitive PII, or domain routing (text/code/math). Achieves high AUROC with ~1,000× less build compute than SAEs.
    • Tools/products/workflows: EP dictionary explorer for concept screening; batch scoring of documents/chats; inference-time hooks for automated routing to specialized models.
    • Dependencies/assumptions: Concept selection uses labels at selection time; performance depends on EP resolution (p), chosen layer, and alignment between build corpus and production data; code concepts were harder than text concepts in reported results.
  • Behavior steering via exemplar-direction interventions
    • Sectors: enterprise assistants, customer support automation, product safety
    • What: Identify and intervene on behavior-localized regions (e.g., refusal) by projecting off an exemplar direction to tune responses (e.g., reduce excessive refusals in compliant domains).
    • Tools/products/workflows: “EP-Steer” operator providing per-turn, per-layer interventions; A/B testing harness for side-effect checks.
    • Dependencies/assumptions: Interventions can have unintended effects; requires careful evaluation; strongest effect reported near specific layers (e.g., L20) and models (instruction-tuned Gemma).
  • Model update auditing and change tracking across checkpoints
    • Sectors: MLOps, governance, policy/regulatory audits
    • What: Use cross-checkpoint EP matching to quantify which activation directions persist vs. are re-anchored after finetuning; localize where alignment changes happen (e.g., refusal routing emergence).
    • Tools/products/workflows: “EP-Drift” dashboard showing Hungarian-matched region cosines, per-region stability (D_i), and behavioral cross-tabs.
    • Dependencies/assumptions: Requires dictionaries built with the same protocol and data stream; seed effects mitigated by filtering on region stability (D_i).
  • Representation complexity and coverage monitoring via saturation
    • Sectors: data/ML platform engineering, LLMOps, education/research
    • What: Track saturated dictionary size and growth curves by domain (e.g., chat/code/math) as a coarse measure of learned representation complexity and domain coverage at a fixed geometric scale.
    • Tools/products/workflows: “EP-Saturation” reports during model eval and data selection; alerts when new traffic induces unsaturated growth.
    • Dependencies/assumptions: Fixed calibration percentile p; differences are layer- and domain-dependent; saturation window choice affects reported counts.
  • Lightweight probing and analytics using EP one-hot codes
    • Sectors: research, evaluation/benchmarking, product analytics
    • What: Replace raw activations with 1-sparse EP codes that retain ~97–98% of probe accuracy at $\ell_0 = 1$ for fast label-decoding analytics and regression monitoring without training SAEs.
    • Tools/products/workflows: Rapid regression probes for new model checkpoints; quick-turnaround audits for concept decodability.
    • Dependencies/assumptions: Probing evaluates linear decodability (not reconstruction); performance depends on layer choice and partition resolution.
  • Failure forensics and explainability-by-examples
    • Sectors: safety, QA, customer support tools
    • What: For a misbehavior, map its activation to a nearest region and inspect the region’s exemplars/members to understand “what the model thought” (content/function-coherent neighborhoods).
    • Tools/products/workflows: EP dictionary browser that shows region exemplars, member counts, coherence, and nearest-neighbor tokens/snippets.
    • Dependencies/assumptions: Requires curated build corpus and representative exemplars; interpretability depends on region coherence and stability.
  • Data curation and targeted finetuning
    • Sectors: data engineering, model training
    • What: Use nearest-exemplar distance and region statistics (coherence, counts) to identify under-covered parts of activation space; prioritize data acquisition or finetuning on those regions.
    • Tools/products/workflows: “EP-Curation” scoring for active data selection; region-aware sampling strategies.
    • Dependencies/assumptions: Assumes calibration corpus approximates intended use; region stability filtering recommended.
  • Activation-cache clustering and retrieval hygiene
    • Sectors: serving infrastructure, systems
    • What: Cluster KV-cache entries or activation caches by EP region IDs to improve cache locality, eviction policies, or route similar contexts together.
    • Tools/products/workflows: Cache sharding/eviction tuned by region indices; per-region hit-rate analytics.
    • Dependencies/assumptions: Engineering integration with serving stack; empirical validation required on latency and hit-rate trade-offs.
  • Fairness and multilingual distribution monitoring
    • Sectors: policy/compliance, global product operations
    • What: Track nearest-exemplar distance by language/user segment to surface under-represented groups (e.g., larger distances observed for Bulgarian Wikipedia vs. Pile).
    • Tools/products/workflows: Segment-level OOD dashboards; data augmentation triggers for high-distance segments.
    • Dependencies/assumptions: Sensitive to build corpus composition; distance gaps vary by layer/resolution; requires fairness-aware thresholds and reviews.

Long-Term Applications

These use cases need additional research, scaling, or productization beyond what the paper demonstrates.

  • EP-guided training and regularization
    • Sectors: model training, safety
    • What: Incorporate EP signals (e.g., coherence, saturation, behavior-region localization) into training objectives to increase separability of safe vs. risky regions or to encourage desired geometry.
    • Dependencies/assumptions: Requires differentiable proxies or periodic EP rebuilds; careful evaluation to avoid performance regressions.
  • Standardized interpretability artifacts for governance
    • Sectors: policy/regulation, auditing, certification
    • What: Publish EP dictionaries (with exemplars, stability metrics, behavioral cross-tabs) as a standardized, comparable artifact across versions/vendors for update disclosures and audits.
    • Dependencies/assumptions: Sector consensus on protocol (layer, percentile p, corpora); privacy controls for exemplars; regulator acceptance.
  • EP-anchored guardrails and stateful controllers
    • Sectors: safety, robotics, automotive, enterprise assistants
    • What: Build runtime controllers that watch activation-region trajectories, applying policy when entering risky regions (e.g., escalating to human, sandboxing tools).
    • Dependencies/assumptions: Requires robust trajectory features and low-latency hooks; side-effect analysis and adversarial testing (e.g., jailbreaks).
  • Representation compression and on-device inference
    • Sectors: edge computing, mobile AI
    • What: Replace or augment intermediate representations with EP codes for low-bit, interpretable inference layers where linear decodability suffices.
    • Dependencies/assumptions: Reconstruction is lossy; performance on end tasks must be validated; likely model- and layer-specific.
  • Multimodal EP (vision, audio, robotics policies)
    • Sectors: multimodal AI, robotics
    • What: Extend exemplar partitioning to non-text modalities (alternative metrics, magnitude-aware geometry) to build unified, interpretable feature partitions.
    • Dependencies/assumptions: Geometry and calibration choices will differ (e.g., Mahalanobis or hyperbolic distances); requires large-scale empirical validation.
  • Automated red teaming and jailbreak detection
    • Sectors: safety, security
    • What: Use EP to define “risky” regions and search for prompts that reliably route activations into those regions; flag or block when distances spike or when entering specific neighborhoods.
    • Dependencies/assumptions: Generalization to adversarial settings; threshold stability across updates; integration with other detectors.
  • EP-informed curriculum learning and data marketplaces
    • Sectors: data vendors, training ops
    • What: Buy or generate data specifically to densify sparse, high-distance, or low-coherence regions; track ROI via saturation and OOD before/after curves.
    • Dependencies/assumptions: Requires shared metrics and interoperable EP protocols across parties.
  • Cross-model feature alignment and transfer
    • Sectors: model distillation, federated learning
    • What: Align EP regions across different architectures or sizes to transfer behavior controls or probes (e.g., refusal routing) with minimal additional training.
    • Dependencies/assumptions: Matching quality varies; build corpora and geometry must be standardized; seed instability mitigated via stability filtering.
  • Code intelligence and developer tooling
    • Sectors: software engineering
    • What: EP-based concept detectors for code patterns or smells, routing code assistance by detected region (e.g., refactoring vs. synthesis vs. test generation).
    • Dependencies/assumptions: Reported alignment with code concepts is weaker than text; requires focused EP builds on code distributions and layer tuning.
  • Safety-tuned, magnitude-aware EP variants
    • Sectors: safety, high-stakes domains (healthcare, finance)
    • What: Incorporate magnitude, Mahalanobis, or gradient-aware geometry into EP to capture features dependent on activation norm or rare-but-critical directions.
    • Dependencies/assumptions: Methodological research and calibration protocols; rigorous domain-specific validation.
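
The cross-model alignment item above relies on pairing exemplars from two dictionaries one-to-one. A minimal sketch of that pairing follows, under stated assumptions: the toy exemplars, the helper names (`match_exemplars`, `split_by_match`), and the 0.8 cosine cut-off are all hypothetical, and exhaustive search over permutations stands in for a Hungarian solver. On tiny inputs it returns the same optimal assignment; real dictionaries would need a proper solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def match_exemplars(src, dst):
    """Optimal one-to-one assignment maximising total cosine similarity.

    Exhaustive search: only feasible for a handful of regions, and it
    requires len(src) <= len(dst). Returns (src_index, dst_index, cosine).
    """
    sim = [[cosine(a, b) for b in dst] for a in src]
    best = max(
        permutations(range(len(dst)), len(src)),
        key=lambda perm: sum(sim[i][j] for i, j in enumerate(perm)),
    )
    return [(i, j, sim[i][j]) for i, j in enumerate(best)]


def split_by_match(pairs, min_cosine=0.8):
    """Separate well-matched pairs from weakly matched ones.

    The cut-off is an illustrative assumption, not a value from the paper.
    """
    preserved = [(i, j) for i, j, s in pairs if s >= min_cosine]
    weak = [(i, j) for i, j, s in pairs if s < min_cosine]
    return preserved, weak


# Toy exemplars standing in for two checkpoints' dictionaries (hypothetical).
base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tuned = [[0.0, 0.9, 0.1], [1.0, 0.05, 0.0], [0.6, 0.6, 0.5]]

pairs = match_exemplars(base, tuned)
preserved, weak = split_by_match(pairs)
print("preserved:", preserved)
print("weakly matched:", weak)
```

Pairs above the cut-off are candidates for directions that survive between checkpoints; weakly matched pairs flag regions that one dictionary has and the other lacks a close counterpart for. Because matching quality varies, any transferred probe or control should be re-validated on the target model.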

Notes on Assumptions and Dependencies (applies broadly)

  • Geometry choice matters: centered-cosine was used; magnitude-aware or alternative metrics could change results.
  • Resolution parameter p and layer choice are critical and task-dependent; practical deployments should sweep and validate.
  • Seed instability exists; use the provided stability statistic $D_i$ to filter/weight regions.
  • Access to model activations is required; some hosted APIs may preclude this without vendor support.
  • Reported benchmarks focus on Gemma-2-2B and specific corpora (Pile); generalization to larger or different models requires testing.
  • Causal interventions can have side effects; deploy only with strong safety evaluations and rollback plans.

Glossary

  • Activation manifold: the typical set of activation vectors a model produces, viewed as a learned geometric surface in representation space; "late-layer processing appears to pull heterogeneous inputs back toward the typical activation manifold."
  • Activation space: the high-dimensional vector space of model activations where features/regions are analyzed; "EP regions and Gemma Scope SAE features decompose activation space differently"
  • Activation steering: modifying model behavior by adding or subtracting learned activation directions; "activation steering and refusal-direction work use activation differences or learned directions to control behaviour"
  • AxBench: a benchmark for evaluating latent concept detection and steering in LLMs; "AxBench gives a direct external test of the question this section asks"
  • AUROC: Area Under the Receiver Operating Characteristic curve, a performance metric for binary classification; "mean AUROC 0.881"
  • Centred unit sphere: the space of mean-subtracted, L2-normalized activations used with cosine distance; "a Voronoi partition of the centred unit sphere under cosine distance"
  • Centroid: the average (mean) point of a cluster used for distance comparisons in clustering; "averaging causes centroids to shift over time as their cluster absorbs new members."
  • Contrastive-cosine proxy: a selection heuristic that prefers regions maximizing the difference in cosine similarity between positives and contrasts; "selects by a contrastive-cosine proxy"
  • Cosine distance: a distance metric based on one minus cosine similarity, suitable for direction-only comparisons; "cosine distance is the natural metric on the resulting unit sphere."
  • Decoder direction: the vector that a single latent feature of a linear dictionary contributes to the reconstructed activation; "each feature is a learned decoder direction optimised for sparse reconstruction."
  • Distance-to-cover: the nearest-exemplar distance returned by EP that scores how well an activation is covered by the dictionary; "native nearest-exemplar assignment is 1-sparse and buys identity, retrieval, and a native distance-to-cover"
  • Effective sample size (concentrated): a stability statistic combining member count and coherence to estimate reliability of a region; "acts as a concentrated effective sample size."
  • Exemplar: a real observed activation used as a fixed anchor for a region in EP; "Thus we cluster on fixed exemplars."
  • Exemplar Partitioning (EP): an unsupervised method that partitions activation space into exemplar-anchored regions; "We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries"
  • Feature dictionary: a collection of reusable, addressable features or regions derived from activations; "interpretable feature dictionaries"
  • Hungarian matching: an algorithm for optimally pairing elements across sets (e.g., regions across checkpoints) to compare them; "Hungarian matching finds low median matched cosines"
  • JumpReLU: a sparse autoencoder variant whose activation function zeroes any pre-activation below a learned per-feature threshold, improving the reconstruction-sparsity trade-off; "TopK, JumpReLU, and related variants improve the reconstruction-sparsity tradeoff"
  • k-sparse: a representation where at most k features are active per input; "SAEs are k-sparse with k typically in the tens to low hundreds"
  • kNN-LM: a nearest-neighbor language model that augments next-token prediction by retrieving from a datastore of stored context representations; "kNN-LM has shown that stored activation exemplars can be useful for language modelling"
  • Latent concept detection: identifying unlabeled, human-meaningful concepts from activations using unsupervised dictionaries; "AxBench latent concept detection at Gemma-2-2B-it L20"
  • Leader-clustering: an online clustering method that creates new clusters when points exceed a threshold from existing leaders; "Leader-clustering requires a distance threshold"
  • Linear probe: a simple linear classifier trained on activations to test if a label is linearly decodable; "Linear probes test whether a specified label is decodable from activations"
  • Linear separability: the property that classes/features can be separated by linear hyperplanes; "SAEs commit to linear separability"
  • Logit-lens: a technique for decoding hidden activations into vocabulary logits to interpret intermediate representations; "its exemplar's logit-lens [nostalgebraist, 2020] decode"
  • Mahalanobis distance: a distance metric that accounts for covariance structure, often used for distribution-shift detection; "Mahalanobis, energy-based, and nearest-neighbour methods"
  • Member coherence: a measure of how tightly aligned a region’s members are around its representative direction; "the member coherence"
  • Monosemanticity: the property that individual features correspond to single, human-interpretable concepts; "Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet"
  • Nearest-exemplar distance: the minimum distance from an activation to any region exemplar, used as an OOD signal; "Nearest-exemplar distance provides a free out-of-distribution signal at inference."
  • One-hot probe: a probe trained on a one-hot encoding of each activation's assigned region, so exactly one indicator is active per input; "EP one-hot probes retain $\sim$97% of raw-activation probe accuracy at $\ell_0 = 1$"
  • Out-of-distribution (OOD): inputs that differ from the training distribution, often detected via distance metrics; "for downstream tasks like OOD detection (§7)"
  • PCA (Principal Component Analysis): a dimensionality-reduction technique used here to visualize partitions; "a PCA-projected 3D rendering"
  • Percentile-calibrated threshold: a clustering radius picked as a chosen percentile of pairwise distances in calibration data; "The distance threshold $\theta_p$ is set to the p-th percentile of these distances"
  • Reconstruction objective: the loss used to train autoencoders to reproduce inputs from latent representations; "learning a linear basis under a reconstruction objective"
  • Saturation window: the number of batches with no new regions after which EP construction stops; "saturation window W (default W = 1)"
  • Sparse autoencoder (SAE): an autoencoder trained to produce sparse latent features for interpretability; "Sparse autoencoders (SAEs) - currently the dominant approach for feature discovery"
  • Superposition: multiple features encoded in overlapping directions, making them entangled in representations; "recovers superposition-style multi-region overlap"
  • TopK: a sparsity mechanism that keeps only the highest-k activations, used in some SAE variants; "TopK, JumpReLU, and related variants improve the reconstruction-sparsity tradeoff"
  • Vector quantisation: discretizing vectors by assigning them to a finite codebook of representatives; "Vector quantisation ... differs in learning representatives and fixing codebook size in advance."
  • Voronoi partition: a division of space into regions where each point is closest to a specific center (exemplar); "An EP dictionary is a Voronoi partition of activation space"
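
Several glossary entries (centred unit sphere, cosine distance, percentile-calibrated threshold, leader-clustering, nearest-exemplar distance) fit together into one construction. The sketch below is a minimal illustration of how they might combine, not the paper's implementation: the function names, the random-pair sampling used for calibration, the synthetic Gaussian data, and the choice `p=10` are all assumptions made for the example. The saturation-window stopping rule is omitted.

```python
import math
import random


def fit_mean(vectors):
    """Per-dimension mean of the build data, reused later at inference."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def to_sphere(v, mean):
    """Subtract the mean and scale to unit L2 norm (centred unit sphere)."""
    c = [x - m for x, m in zip(v, mean)]
    norm = math.sqrt(sum(x * x for x in c)) or 1.0
    return [x / norm for x in c]


def cosine_distance(a, b):
    """One minus cosine similarity; inputs are assumed to be unit-norm."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def percentile_threshold(points, p, n_pairs=2000, seed=0):
    """p-th percentile of cosine distances over randomly sampled pairs.

    Sampling random pairs is an assumption of this sketch.
    """
    rng = random.Random(seed)
    dists = sorted(
        cosine_distance(*rng.sample(points, 2)) for _ in range(n_pairs)
    )
    return dists[min(len(dists) - 1, int(p / 100 * len(dists)))]


def nearest_exemplar(x, exemplars):
    """Return (region index, distance) for the closest exemplar."""
    return min(
        ((i, cosine_distance(x, e)) for i, e in enumerate(exemplars)),
        key=lambda t: t[1],
    )


def leader_cluster(stream, threshold):
    """Add a new exemplar whenever a point is farther than `threshold`
    from every existing exemplar. Exemplars are observed points and
    never move, unlike centroids."""
    exemplars = []
    for x in stream:
        if not exemplars or nearest_exemplar(x, exemplars)[1] > threshold:
            exemplars.append(x)
    return exemplars


# Synthetic stand-in for streamed activations (hypothetical data).
rng = random.Random(0)
raw = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(500)]
mean = fit_mean(raw)
points = [to_sphere(v, mean) for v in raw]

theta = percentile_threshold(points, p=10)
exemplars = leader_cluster(points, theta)

# Nearest-exemplar distance for a new activation doubles as an OOD score.
query = to_sphere([5.0] * 16, mean)
region, dist = nearest_exemplar(query, exemplars)
print(len(exemplars), round(theta, 3), region, round(dist, 3))
```

Two properties follow directly from the construction: every build point lies within the threshold of its nearest exemplar, and any two exemplars are farther apart than the threshold. Dictionary size is an output of the run rather than an input, which is why the choice of p needs sweeping per task.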
