Exemplar Partitioning (EP) Overview
- Exemplar Partitioning (EP) is an unsupervised method that partitions high-dimensional data into hard, anchor-defined Voronoi regions via leader clustering.
- The technique achieves competitive interpretability compared to sparse autoencoders while reducing computational cost and enabling flexible, nonparametric clustering.
- EP supports mechanistic analysis of deep neural activations by facilitating cross-checkpoint comparisons and causal interventions for model interpretability.
Exemplar Partitioning (EP) denotes a class of unsupervised methods that construct hard, non-parametric partitions of high-dimensional data by selecting a subset of observed data points, called exemplars, to serve as anchors for “Voronoi” regions. Each data point is assigned to its nearest exemplar according to a task-appropriate geometry or similarity. EP is notably leveraged for interpretable feature discovery in deep neural activations, including mechanistic analysis of LLMs, and for exemplar-based clustering with flexible, nonparametric priors on cluster structure. EP produces directly comparable, anchor-based dictionaries for analysis and intervention, requires no gradient optimization, and, when applied to model activations, achieves comparable interpretability and causal utility to sparse autoencoders at orders-of-magnitude lower computational cost (Rumbelow, 14 May 2026, Tarlow et al., 2012).
1. Mathematical Construction and Algorithmic Specification
EP partitions the input space through sequential, distance-thresholded leader clustering. Consider a stream of activation vectors $x_1, x_2, \dots \in \mathbb{R}^d$, where each $x_t$ denotes the (possibly normalized) activations at a particular model layer and token position. EP proceeds as follows:
- Centering & Normalization:
The data mean $\mu$ is computed on a calibration set. Each activation is centered and mapped to the unit sphere, $\hat{x}_t = (x_t - \mu) / \lVert x_t - \mu \rVert_2$, so that the dictionary is formed over normalized directions.
- Leader Clustering:
For each normalized activation $\hat{x}_t$ streamed in, compute its minimum Euclidean distance $d_t = \min_{e \in E} \lVert \hat{x}_t - e \rVert_2$ from the current exemplar set $E$. If $d_t > \tau$ (where $\tau$ is a distance threshold), add $\hat{x}_t$ to $E$ as a new exemplar; otherwise assign $\hat{x}_t$ to the closest existing exemplar. The first activation always becomes an exemplar, since $E$ starts empty.
- Voronoi Dictionary: Each exemplar $e_k$ defines a region $R_k = \{\hat{x} : \lVert \hat{x} - e_k \rVert_2 \le \lVert \hat{x} - e_j \rVert_2 \ \text{for all } j\}$; the collection $\{R_k\}_{k=1}^{K}$ partitions the unit sphere into encoder-determined cells.
- Threshold Calibration: The cluster threshold $\tau$ is set as the $p$-th percentile of pairwise distances over a set of calibration activations. Reporting the percentile $p$ rather than the raw distance normalizes cluster resolution across models, layers, and datasets.
Emergent dictionary size is determined by activation geometry at fixed $\tau$, with the process terminating at batch “saturation”, that is, when a full batch yields no new exemplars.
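The construction above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function names (`center_and_normalize`, `percentile_threshold`, `leader_cluster`) are hypothetical, the percentile uses the nearest-rank convention, a point exactly at the threshold is treated as joining an existing region, and the toy data are made up.

```python
import math
from itertools import combinations


def center_and_normalize(xs, mu):
    """Subtract the calibration mean, then scale each vector to unit L2 norm.

    Assumes no activation coincides exactly with the mean (zero norm).
    """
    unit = []
    for x in xs:
        v = [a - m for a, m in zip(x, mu)]
        norm = math.sqrt(sum(a * a for a in v))
        unit.append([a / norm for a in v])
    return unit


def percentile_threshold(points, p):
    """Return the p-th percentile of pairwise Euclidean distances.

    Uses the nearest-rank convention, which is an assumption of this sketch.
    """
    dists = sorted(math.dist(a, b) for a, b in combinations(points, 2))
    rank = max(1, math.ceil(p * len(dists) / 100))
    return dists[rank - 1]


def leader_cluster(points, tau):
    """One pass: a point farther than tau from every exemplar becomes an exemplar."""
    exemplars, assignments = [], []
    for x in points:
        if exemplars:
            dists = [math.dist(x, e) for e in exemplars]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] <= tau:
                assignments.append(k)  # joins the nearest existing region
                continue
        exemplars.append(x)  # x itself anchors a new region
        assignments.append(len(exemplars) - 1)
    return exemplars, assignments


# Toy 2-D "activations" forming three directional groups (made-up numbers).
calib = [[2.0, 0.1], [1.9, -0.1], [0.1, 2.0], [-0.1, 2.1], [-2.0, 0.0], [-2.1, 0.1]]
mu = [sum(col) / len(calib) for col in zip(*calib)]
unit = center_and_normalize(calib, mu)
exemplars, assignments = leader_cluster(unit, tau=0.5)
print(len(exemplars), assignments)  # 3 [0, 0, 1, 1, 2, 2]

# Data-driven alternative to the hand-picked 0.5 above.
tau_20 = percentile_threshold(unit, p=20)
```

Because every exemplar is one of the input points, the returned dictionary can be inspected by looking up the tokens or inputs that produced each anchor. A production version would likely vectorize the distance computation; the list-based form is kept here only for readability.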
2. Probabilistic and Prior-Driven Extensions
EP is generalized by coupling with nonparametric priors on partitions, notably Dirichlet process (DP) or Pitman–Yor priors, as a framework for flexible exemplar-based clustering (Tarlow et al., 2012):
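- Consider $N$ data points $x_1, \dots, x_N$. Each cluster is defined by its exemplar, and each point is assigned to exactly one exemplar; let $c_i \in \{1, \dots, N\}$ denote the index of the exemplar chosen by point $i$.
- The generative model is:
$$p(x, c) \propto p(\pi_c) \, \prod_{i=1}^{N} \mathbb{1}[c_{c_i} = c_i] \, \prod_{i=1}^{N} p(x_i \mid x_{c_i})$$
where $p(\pi_c)$ is the prior on the partition $\pi_c$ induced by the assignments (e.g., DP), the indicator $\mathbb{1}[c_{c_i} = c_i]$ enforces one exemplar per non-empty cluster, and $p(x_i \mid x_{c_i})$ specifies emission from exemplars.
- The DP prior allows the number of clusters to be determined adaptively,
$$p(\pi) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \, \alpha^{K} \prod_{k=1}^{K} (n_k - 1)!$$
with $\alpha$ the concentration parameter, $K$ clusters and cluster sizes $n_1, \dots, n_K$.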
- MAP inference is solved via max-product belief propagation on a structured factor graph over the assignment variables. Time cost scales with the number of message-passing rounds; the exact time and space bounds are given by Tarlow et al. (2012).
- Flexible partition priors $p(\pi)$ encode different cluster-size behaviors, e.g., Pitman–Yor or power-law.
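To make the three ingredients of the model concrete (partition prior, exemplar-consistency constraint, emission from exemplars), the sketch below scores a candidate assignment. It is an illustration only: the function names are hypothetical, the DP prior is written in its standard Chinese-restaurant-process form, and the unnormalized Gaussian emission with scale `sigma` is an assumption rather than the likelihood used by Tarlow et al. (2012). It evaluates the objective for a given assignment and does not perform the max-product inference itself.

```python
import math
from collections import Counter


def dp_log_prior(labels, alpha):
    """Log-probability of a partition under a Dirichlet process prior.

    log p = log G(alpha) - log G(alpha + N) + K log(alpha) + sum_k log((n_k - 1)!)
    """
    sizes = list(Counter(labels).values())
    n, k = len(labels), len(sizes)
    log_p = math.lgamma(alpha) - math.lgamma(alpha + n) + k * math.log(alpha)
    return log_p + sum(math.lgamma(s) for s in sizes)  # lgamma(s) = log((s-1)!)


def ep_log_joint(points, exemplar_of, alpha, sigma=1.0):
    """Unnormalized log joint: DP prior + consistency check + Gaussian emission.

    exemplar_of[i] is the index of the point that serves as exemplar for point i.
    The Gaussian emission is an assumption of this sketch.
    """
    # Consistency: every chosen exemplar must be its own exemplar.
    if any(exemplar_of[c] != c for c in exemplar_of):
        return -math.inf
    log_emit = sum(
        -math.dist(x, points[c]) ** 2 / (2 * sigma ** 2)
        for x, c in zip(points, exemplar_of)
    )
    return dp_log_prior(exemplar_of, alpha) + log_emit


points = [[0.0], [0.1], [5.0]]
two_clusters = ep_log_joint(points, [0, 0, 2], alpha=1.0)  # {0, 1} and {2}
one_cluster = ep_log_joint(points, [0, 0, 0], alpha=1.0)   # everything under point 0
print(two_clusters > one_cluster)  # True: the distant point is better off alone
```

Raising `alpha` shifts prior mass toward partitions with more clusters, which is the lever the prior-driven formulation uses in place of a fixed cluster count.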
3. Mechanistic Interpretability in Model Activations
EP directly supports mechanistic interpretability in deep models by constructing feature dictionaries aligned to observed activation geometry (Rumbelow, 14 May 2026):
- Exemplar Anchoring:
Each region anchor is a true observed activation. Thus, dictionaries constructed from the same stream are directly comparable across layers, training checkpoints, or model variants.
- Cross-Checkpoint Comparison:
Matching exemplars across model checkpoints (e.g., base vs. instruction-tuned) with the Hungarian algorithm on cosine similarity reveals which activation directions persist across fine-tuning. For example, in Gemma-2-2B, only a small fraction of matched exemplars retained high cosine similarity across checkpoints, implying that most activation geometry is re-anchored by fine-tuning.
- Intervention Experiments:
Projecting activations off an exemplar associated with a specific behavior (such as refusal in instruction-tuned LLMs) can causally suppress that behavior: for example, a baseline refusal rate of 0.98 drops to 0.02 upon ablation of the corresponding region anchor.
- Quantitative Feature Alignment:
EP regions exhibit partial overlap with sparse autoencoder (SAE) features: a share of EP regions match an SAE feature above a fixed similarity threshold, whereas a smaller share of SAE features match an EP region at the same threshold. SAE coverage is higher at finer percentile granularity.
- One-Hot Probe Accuracy & AUROC:
Encoding activations into EP one-hot sparse codes preserves much of the linear probe accuracy obtained with raw activations. For latent concept detection (AxBench), EP achieves a mean AUROC that exceeds that of a standard SAE and closely approaches that of the label-supervised SAE-A.
- Out-of-Distribution Signal:
The nearest-exemplar distance serves as a free measure of distributional shift; activations on random or under-represented inputs display significantly greater mean distance to nearest anchor than in-distribution samples.
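Three of the operations listed above (one-hot encoding, projecting an activation off an exemplar, and using nearest-exemplar distance as a shift signal) are small enough to sketch directly. The helper names are hypothetical, the exemplars are assumed to be unit-norm, and the projection shown is the plain orthogonal projection; whether the source renormalizes or re-adds the mean after ablation is not specified here.

```python
import math


def nearest_exemplar(x, exemplars):
    """Return (index, distance) of the exemplar closest to x in Euclidean distance."""
    dists = [math.dist(x, e) for e in exemplars]
    k = min(range(len(dists)), key=dists.__getitem__)
    return k, dists[k]


def one_hot(x, exemplars):
    """Encode x as a one-hot code over the Voronoi regions."""
    k, _ = nearest_exemplar(x, exemplars)
    return [1 if j == k else 0 for j in range(len(exemplars))]


def project_off(x, e):
    """Remove from x its component along the unit-norm exemplar e."""
    coeff = sum(a * b for a, b in zip(x, e))
    return [a - coeff * b for a, b in zip(x, e)]


exemplars = [[1.0, 0.0], [0.0, 1.0]]  # toy unit-norm anchors
x = [0.8, 0.6]                        # in-distribution activation
x_far = [-1.0, 0.0]                   # direction no exemplar covers well

print(one_hot(x, exemplars))          # [1, 0]
print(project_off(x, exemplars[0]))   # [0.0, 0.6]

# Larger nearest-exemplar distance flags the poorly covered input.
_, d_in = nearest_exemplar(x, exemplars)
_, d_far = nearest_exemplar(x_far, exemplars)
print(d_far > d_in)                   # True
```

The distance signal comes for free because `nearest_exemplar` is already computed during encoding; no separate detector is trained.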
4. Comparative Analysis: EP vs Sparse Autoencoders and Other Methods
EP and sparse autoencoders (SAEs) impose fundamentally distinct geometric constraints on learned representations:
| Feature | Exemplar Partitioning (EP) | Sparse Autoencoders (SAE) |
|---|---|---|
| Partition geometry | Hard Voronoi cells (unit sphere, L2) | Linear coding, soft selection |
| Anchor type | Observed activations | Learned weights |
| Dictionary size | Emergent, dictated by threshold and data geometry | Prespecified |
| Compute requirements | Single stream, zero backward passes | Millions to billions of tokens, backpropagation, many gradient steps |
| Inter-dictionary match | Direct comparison across layers, models, or checkpoints | No inherent cross-model alignment |
| Shared coverage | A share of EP regions match an SAE feature above a fixed similarity threshold | SAE features more fragmented, less likely to coincide with EP at coarse resolution |
| Interpretability | Region cause directly intervenable (projection off exemplar collapses associated features) | Demands indirect or aggregate intervention |
EP efficiently yields dictionaries with no learned parameters, whereas SAEs require resource-intensive gradient optimization and a fixed basis size. This suggests EP is orders of magnitude more token-efficient for unsupervised feature generation at comparable interpretability (Rumbelow, 14 May 2026).
5. Computational and Statistical Properties
- Streaming and Online Construction:
EP dictionaries are constructed in a single forward pass, suitable for streaming and online adaptations.
- Emergent Resolution and Stopping Condition:
Dictionary growth halts after a batch produces no new exemplars; the final size reflects the intrinsic density and dispersion of the activation space at the resolution set by the distance threshold $\tau$.
- Prior-Driven Clustering:
By incorporating flexible priors (DP, Pitman–Yor, etc.) over partitions, one can bias cluster numbers and size profiles appropriate to the task (Tarlow et al., 2012). The DP concentration parameter $\alpha$ controls the expected number of regions, and the prior can avoid pathologies of vanilla affinity propagation (such as poor modeling of heterogeneous cluster-size distributions).
- Computational Complexity:
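For prior-based EP with affinity propagation and max-product message passing, the main costs are the per-iteration message-passing time, multiplied by the number of iterations, and the space needed to store the messages (Tarlow et al., 2012).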
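The streaming construction and its stopping rule can be illustrated with a short loop. This is a sketch under assumptions: `build_until_saturation` is a hypothetical name, inputs are taken to be already centered and normalized, and "saturation" is read as the first batch that contributes no new exemplar.

```python
import math


def build_until_saturation(batches, tau):
    """Grow an exemplar dictionary batch by batch; stop after a batch adds nothing.

    `batches` may be any iterable (including a generator), so data can be streamed.
    Returns the exemplars and the number of batches actually consumed.
    """
    exemplars = []
    batches_seen = 0
    for batch in batches:
        batches_seen += 1
        added = 0
        for x in batch:
            # all() over an empty list is True, so the first point always anchors.
            if all(math.dist(x, e) > tau for e in exemplars):
                exemplars.append(x)
                added += 1
        if added == 0:
            break  # saturation: this batch fell entirely inside existing regions
    return exemplars, batches_seen


stream = [
    [[1.0, 0.0], [0.0, 1.0]],   # adds two exemplars
    [[-1.0, 0.0]],              # adds one exemplar
    [[0.0, 1.0], [1.0, 0.0]],   # adds none, so construction stops here
    [[0.0, -1.0]],              # never reached
]
exemplars, batches_seen = build_until_saturation(stream, tau=0.5)
print(len(exemplars), batches_seen)  # 3 3
```

The last batch in the toy stream shows a limitation of the rule: a direction that would have become an exemplar is never seen once an earlier batch saturates, so batch size and ordering affect the final dictionary.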
6. Practical Considerations and Example Use Cases
- Activation Geometry and Model Analysis:
EP supports activation-space analysis of LLMs by providing direct, cross-comparable, and interpretable region dictionaries for feature tracing, intervention, and model-family studies (Rumbelow, 14 May 2026).
- Cluster-Size Control and Priors:
In applications where knowledge of the cluster-size profile is available or desired, prior-based EP can be controlled by the DP concentration parameter $\alpha$ (or other hyperparameters), tuned via likelihood or empirical Bayes.
- Image Segmentation:
When applied to image superpixel graphs, DP-EP yields segmentations that reflect true underlying structure (e.g., avoids oversegmentation seen in unregularized methods).
- Resource Efficiency:
As an unsupervised dictionary discovery method, EP has become especially attractive where compute and data limitations preclude large-scale gradient optimization, while offering high accuracy and direct interpretability with minimal cost.
7. Impact and Future Directions
EP bridges the gap between interpretable dictionary learning and scalable, resource-efficient partitioning of high-dimensional model activations and general data. Its anchoring in observed activations makes cross-layer, cross-model, and cross-checkpoint comparisons tractable. Prior-based formulations provide powerful flexibility in shaping the solution space and statistical properties of the clusters, enabling a wide range of applications beyond interpretability, including unsupervised representation learning, anomaly detection, and clustering with domain-informed structure (Rumbelow, 14 May 2026, Tarlow et al., 2012).
Continued research seeks to further improve the scalability of inference in prior-driven EP, expand the interpretability and intervention toolkit enabled by region-anchored dictionaries, and formalize the conditions under which Voronoi/cone and linear/subspace-based features yield convergent or divergent decompositions of latent space.