Exemplar Partitioning (EP) Overview

Updated 20 May 2026

Exemplar Partitioning (EP) is an unsupervised method that partitions high-dimensional data into hard, anchor-defined Voronoi regions via leader clustering.
The technique achieves competitive interpretability compared to sparse autoencoders while reducing computational cost and enabling flexible, nonparametric clustering.
EP supports mechanistic analysis of deep neural activations by facilitating cross-checkpoint comparisons and causal interventions for model interpretability.

Exemplar Partitioning (EP) denotes a class of unsupervised methods that construct hard, non-parametric partitions of high-dimensional data by selecting a subset of observed data points, called exemplars, to serve as anchors for “Voronoi” regions. Each data point is assigned to its nearest exemplar according to a task-appropriate geometry or similarity. EP is notably leveraged for interpretable feature discovery in deep neural activations, including mechanistic analysis of LLMs, and for exemplar-based clustering with flexible, nonparametric priors on cluster structure. EP produces directly comparable, anchor-based dictionaries for analysis and intervention, requires no gradient optimization, and, when applied to model activations, achieves comparable interpretability and causal utility to sparse autoencoders at orders-of-magnitude lower computational cost (Rumbelow, 14 May 2026, Tarlow et al., 2012).

1. Mathematical Construction and Algorithmic Specification

EP partitions the input space $\mathbb{R}^d$ through sequential, distance-thresholded leader clustering. Consider a stream of activation vectors $a \in \mathbb{R}^d$, where each $a$ denotes the (possibly normalized) activations at a particular model layer and token position. EP proceeds as follows:

Centering & Normalization:

The data mean $\mu \in \mathbb{R}^d$ is computed on a calibration set. Each activation is mapped to the unit sphere: $\phi(a) = \frac{a - \mu}{\|a - \mu\|_2}$ so that the dictionary is formed over normalized directions.
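
As a minimal sketch of the map $\phi$ above, using only the Python standard library: the helper names `mean_vector` and `phi` and the toy calibration set are illustrative assumptions, not part of the source method's code.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def phi(a, mu):
    """Center `a` by `mu`, then rescale the result to unit L2 norm."""
    centered = [x - m for x, m in zip(a, mu)]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        # An activation equal to the mean has no direction on the sphere.
        raise ValueError("activation equals the mean; direction undefined")
    return [c / norm for c in centered]

# Toy calibration set (hypothetical values).
calibration = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
mu = mean_vector(calibration)   # mean is (2, 3)
u = phi([5.0, 7.0], mu)         # centered vector (3, 4) scaled to length 1
```

The zero-norm guard is a design choice of this sketch; the source does not say how an activation that coincides with the mean is handled.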

Leader Clustering:

For each normalized activation $u = \phi(a)$ streamed in, compute its minimum Euclidean distance to the current exemplar set $E$. If $\min_{e \in E} \|u-e\|_2 > \tau$ (where $\tau$ is a distance threshold), add $u$ to $E$ as a new exemplar; otherwise assign $u$ to the closest existing exemplar $e \in E$. In concise form: $E \leftarrow E \cup \{u\}$ if $\min_{e \in E} \|u-e\|_2 > \tau$, else $u \mapsto \arg\min_{e \in E} \|u-e\|_2$.
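
The leader-clustering rule described above can be sketched as follows. This is an illustrative single-pass version under stated assumptions: inputs are already normalized, the first vector always becomes an exemplar (the minimum over an empty set is treated as infinite), and the names `l2` and `leader_cluster` are hypothetical.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def leader_cluster(stream, tau):
    """Single pass over `stream`; returns (exemplars, assignments).

    assignments[i] is the index of the exemplar that stream[i] was
    assigned to at the time it was processed.
    """
    exemplars = []
    assignments = []
    for u in stream:
        dists = [l2(u, e) for e in exemplars]
        if not dists or min(dists) > tau:
            # Farther than tau from every exemplar: u becomes a new anchor.
            exemplars.append(u)
            assignments.append(len(exemplars) - 1)
        else:
            assignments.append(dists.index(min(dists)))
    return exemplars, assignments

stream = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
exemplars, assignments = leader_cluster(stream, tau=0.7)
```

In this toy stream the first two vectors become exemplars and the last two fall within the threshold of an existing exemplar. The result depends on stream order, which is a general property of leader clustering.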

Voronoi Dictionary: Each exemplar $e_i \in E$ defines a region $R_i = \{u : \|u - e_i\|_2 \le \|u - e_j\|_2 \text{ for all } e_j \in E\}$; the collection $\{R_i\}$ partitions the unit sphere into encoder-determined cells.
Threshold Calibration: The cluster threshold $\tau$ is set as the $p$-th percentile of pairwise distances over a sample of calibration activations. Reporting the percentile $p$ rather than the raw distance normalizes cluster resolution across models, layers, and datasets.
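
A rough sketch of both steps, assuming a nearest-rank percentile convention (the source does not say which percentile interpolation is used) and hypothetical helper names `calibrate_tau` and `assign_region`:

```python
import math
from itertools import combinations

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def calibrate_tau(calibration, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of all pairwise distances."""
    dists = sorted(l2(u, v) for u, v in combinations(calibration, 2))
    rank = max(1, math.ceil(p / 100.0 * len(dists)))
    return dists[rank - 1]

def assign_region(u, exemplars):
    """Index of the Voronoi cell (nearest exemplar) that contains u."""
    dists = [l2(u, e) for e in exemplars]
    return dists.index(min(dists))

# Four unit vectors: four adjacent pairs at distance sqrt(2), two opposite pairs at 2.
calibration = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
tau_median = calibrate_tau(calibration, p=50)
tau_max = calibrate_tau(calibration, p=100)
cell = assign_region([0.6, 0.8], [[1.0, 0.0], [0.0, 1.0]])
```

Computing all pairwise distances is quadratic in the calibration sample size, so in practice the sample would need to be kept small or subsampled; that is an observation about this sketch, not a statement about the source implementation.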

Emergent dictionary size is determined by activation geometry at a fixed threshold, with the process terminating at batch “saturation”, when a full batch yields no new exemplars.
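
The saturation stopping rule can be illustrated with a short sketch. The batching scheme, the function name `grow_until_saturated`, and the toy data are assumptions made for illustration only.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def grow_until_saturated(batches, tau):
    """Grow the exemplar set batch by batch; stop after the first batch
    that contributes no new exemplar. Returns (exemplars, batches_used)."""
    exemplars = []
    batches_used = 0
    for batch in batches:
        batches_used += 1
        added = 0
        for u in batch:
            if not exemplars or min(l2(u, e) for e in exemplars) > tau:
                exemplars.append(u)
                added += 1
        if added == 0:
            break  # saturation: a full batch yielded no new exemplars
    return exemplars, batches_used

batches = [
    [[1.0, 0.0], [0.0, 1.0]],    # both become exemplars
    [[0.6, 0.8], [-1.0, 0.0]],   # one new exemplar
    [[0.8, 0.6], [0.0, 1.0]],    # nothing new, so the loop stops here
    [[0.0, -1.0]],               # never processed
]
exemplars, used = grow_until_saturated(batches, tau=0.7)
```

Note that the unprocessed fourth batch would have added an exemplar, which shows that saturation on one batch is a heuristic stopping signal rather than a guarantee of full coverage.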

2. Probabilistic and Prior-Driven Extensions

EP is generalized by coupling with nonparametric priors on partitions, notably Dirichlet process (DP) or Pitman–Yor priors, as a framework for flexible exemplar-based clustering (Tarlow et al., 2012):

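Consider $N$ data points $x_1, \dots, x_N$. Each cluster is defined by its exemplar, and each point is assigned to exactly one exemplar. Write $c$ for the partition of the points into clusters and $e$ for the choice of exemplars.
The generative model is:

$P(x, c, e) = P(c)\, P(e \mid c)\, P(x \mid c, e)$

where $P(c)$ is the partition prior (e.g., DP), $P(e \mid c)$ enforces one exemplar per non-empty cluster, and $P(x \mid c, e)$ specifies emission from exemplars.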

The DP prior allows the number of clusters to be determined adaptively,

$P(c) \propto \alpha^{K} \prod_{k=1}^{K} (n_k - 1)!$

with $c$ the partition, $\alpha$ the concentration parameter, $K$ the number of clusters, and $n_1, \dots, n_K$ the cluster sizes.
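
For intuition, the sketch below evaluates the standard Chinese-restaurant-process form of the DP partition probability, $\frac{\alpha^{K}\,\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{k} (n_k - 1)!$ with $N = \sum_k n_k$, in log space. This is the textbook expression and is assumed, not confirmed, to match the source's notation; the function name `crp_log_prob` is hypothetical.

```python
import math

def crp_log_prob(cluster_sizes, alpha):
    """Log-probability of one partition with the given cluster sizes under
    a Dirichlet-process (Chinese restaurant process) prior."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    log_p = k * math.log(alpha) + math.lgamma(alpha) - math.lgamma(alpha + n)
    # lgamma(n_k) equals log((n_k - 1)!)
    log_p += sum(math.lgamma(size) for size in cluster_sizes)
    return log_p

# Sanity check on 3 points: one partition of sizes (3), three of sizes
# (2, 1), and one of sizes (1, 1, 1); their probabilities should sum to 1.
alpha = 1.0
total = (math.exp(crp_log_prob([3], alpha))
         + 3 * math.exp(crp_log_prob([2, 1], alpha))
         + math.exp(crp_log_prob([1, 1, 1], alpha)))
```

Larger values of `alpha` shift probability toward partitions with more clusters, which is the sense in which the prior lets the number of clusters adapt.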

MAP inference is solved via max-product belief propagation on a structured factor graph over the assignment variables, with time cost accumulating over the rounds of message passing (Tarlow et al., 2012).
Flexible priors on the partition encode different cluster-size behaviors, e.g., Pitman–Yor or power-law.

3. Mechanistic Interpretability in Model Activations

EP directly supports mechanistic interpretability in deep models by constructing feature dictionaries aligned to observed activation geometry (Rumbelow, 14 May 2026):

Exemplar Anchoring:

Each region anchor is a true observed activation. Thus, dictionaries constructed from the same stream are directly comparable across layers, training checkpoints, or model variants.

Cross-Checkpoint Comparison:

Matching exemplars across model checkpoints (e.g., base vs. instruction-tuned) with the Hungarian algorithm on cosine similarity reveals which activation directions persist through fine-tuning. For example, in Gemma-2-2B, only a small fraction of matched exemplar pairs retained high cosine similarity across checkpoints, implying that most activation geometry is re-anchored by fine-tuning.
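
A small sketch of one-to-one exemplar matching on cosine similarity. To stay dependency-free it uses exhaustive search over permutations, which finds the same optimum as the Hungarian algorithm but scales factorially, so it is only a stand-in for tiny, equal-sized dictionaries; a real pipeline would use a Hungarian solver such as SciPy's `linear_sum_assignment`. The function names, toy exemplars, and the 0.9 survival cutoff are all hypothetical.

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def match_exemplars(base, tuned):
    """One-to-one matching that maximizes total cosine similarity.
    Returns a list of (base_index, tuned_index, similarity)."""
    sim = [[cosine(b, t) for t in tuned] for b in base]
    best = max(
        permutations(range(len(tuned))),
        key=lambda perm: sum(sim[i][j] for i, j in enumerate(perm)),
    )
    return [(i, j, sim[i][j]) for i, j in enumerate(best)]

base = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tuned = [[0.0, 1.0], [-0.6, 0.8], [0.8, 0.6]]
matches = match_exemplars(base, tuned)
# Keep only pairs whose similarity clears the (hypothetical) cutoff.
survivors = [(i, j) for i, j, s in matches if s >= 0.9]
```

In this toy case every base exemplar receives a partner, but only one pair clears the cutoff, mirroring the idea that a match existing does not mean a direction survived.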

Intervention Experiments:

Projecting activations off an exemplar associated with a specific behavior (such as refusal in instruction-tuned LLMs) can causally suppress that behavior (e.g., a baseline refusal of 0.98 drops to 0.02 upon ablation of the corresponding region anchor, a difference of 0.96).
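
The projection step can be sketched as removing the component of a vector along the exemplar direction, $a' = a - (a \cdot \hat{e})\,\hat{e}$. Whether the source applies this to raw or centered activations is not stated here, so the sketch simply operates on whatever vector it is given; `project_off` is a hypothetical name.

```python
import math

def project_off(a, e):
    """Return `a` with its component along direction `e` removed."""
    norm = math.sqrt(sum(x * x for x in e))
    if norm == 0.0:
        raise ValueError("exemplar direction must be non-zero")
    e_hat = [x / norm for x in e]
    coeff = sum(x * y for x, y in zip(a, e_hat))
    return [x - coeff * y for x, y in zip(a, e_hat)]

a = [3.0, 4.0, 1.0]
e = [0.0, 2.0, 0.0]           # exemplar direction (need not be unit length)
ablated = project_off(a, e)   # the second coordinate is zeroed out
residual = sum(x * y for x, y in zip(ablated, e))  # remaining overlap with e
```

After the projection the ablated vector is orthogonal to the exemplar, while components in other directions are left unchanged.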

Quantitative Feature Alignment:

EP regions exhibit partial overlap with sparse autoencoder (SAE) features, and the overlap is asymmetric. A share of EP regions match an SAE feature at a fixed similarity threshold, whereas only a smaller share of SAE features match an EP region at the same threshold; coverage is higher at finer percentile granularity.

One-Hot Probe Accuracy & AUROC:

Encoding activations into EP one-hot sparse codes preserves much of the linear probe accuracy obtained with raw activations. For latent concept detection (AxBench), EP achieves a mean AUROC exceeding that of a standard SAE and closely approaching that of the label-supervised SAE-A.
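
As an illustration of the one-hot encoding, the sketch below maps a normalized activation to a code with a single active entry at the index of its nearest exemplar. The function name `one_hot_code` and the toy dictionary are assumptions.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def one_hot_code(u, exemplars):
    """One-hot vector marking the Voronoi region that contains u."""
    dists = [l2(u, e) for e in exemplars]
    nearest = dists.index(min(dists))
    return [1 if i == nearest else 0 for i in range(len(exemplars))]

exemplars = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
code = one_hot_code([0.6, 0.8], exemplars)
```

A linear probe trained on such codes reduces to one learned score per region, since exactly one input entry is non-zero for each activation.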

Out-of-Distribution Signal:

The nearest-exemplar distance serves as a free measure of distributional shift; activations on random or under-represented inputs display significantly greater mean distance to nearest anchor than in-distribution samples.
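
A minimal sketch of the shift signal: the mean nearest-exemplar distance over a batch of normalized activations. The helper names and toy batches are hypothetical, and no particular decision threshold is implied.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def nearest_exemplar_distance(u, exemplars):
    """Distance from u to its closest exemplar."""
    return min(l2(u, e) for e in exemplars)

def mean_shift_score(batch, exemplars):
    """Average nearest-exemplar distance over a batch."""
    return sum(nearest_exemplar_distance(u, exemplars) for u in batch) / len(batch)

exemplars = [[1.0, 0.0], [0.0, 1.0]]
in_dist = [[0.8, 0.6], [0.6, 0.8]]       # close to the anchors
shifted = [[-1.0, 0.0], [0.0, -1.0]]     # far from every anchor
in_score = mean_shift_score(in_dist, exemplars)
ood_score = mean_shift_score(shifted, exemplars)
```

The score comes at no extra cost because the nearest-exemplar distance is already computed when assigning each activation to a region.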

4. Comparative Analysis: EP vs Sparse Autoencoders and Other Methods

EP and sparse autoencoders (SAEs) impose fundamentally distinct geometric constraints on learned representations:

Feature	Exemplar Partitioning (EP)	Sparse Autoencoders (SAE)
Partition geometry	Hard Voronoi cells (unit sphere, L2)	Linear coding, soft selection
Anchor type	Observed activations	Learned weights
Dictionary size	Emergent, dictated by threshold and data geometry	Prespecified before training
Compute requirements	Single stream of activations, zero backward passes	Millions–billions of tokens, backpropagation, many gradient steps
Inter-dictionary match	Direct comparison across layers, models, or checkpoints	No inherent cross-model alignment
Shared coverage	A share of EP regions match an SAE feature at a fixed similarity threshold	SAE features more fragmented, less likely to coincide with EP at coarse resolution
Interpretability	Direct intervention on a region (projection off its exemplar collapses associated features)	Demands indirect or aggregate intervention

EP yields dictionaries with no learned parameters and a size that emerges from the data, in contrast to the SAE's resource-intensive requirement for explicit gradient optimization and a fixed basis size. This suggests EP is substantially more token-efficient for unsupervised feature generation at comparable interpretability (Rumbelow, 14 May 2026).

5. Computational and Statistical Properties

Streaming and Online Construction:

EP dictionaries are constructed in a single forward pass, suitable for streaming and online adaptations.

Emergent Resolution and Stopping Condition:

Dictionary growth halts after a batch produces no new exemplars; the resulting size reflects the intrinsic density and dispersion of activation space, as parametrized by the distance threshold $\tau$.

Prior-Driven Clustering:

By incorporating flexible priors (DP, Pitman–Yor, etc.) over partitions, one can bias cluster numbers and size profiles appropriately for the task (Tarlow et al., 2012). The concentration parameter $\alpha$ of the DP controls the expected number of regions, and the prior can avoid pathologies of vanilla affinity propagation (such as poor modeling of heterogeneous cluster-size distributions).

Computational Complexity:

For prior-based EP with affinity propagation and max-product message passing, the main computational costs are the time spent across message-passing iterations and the space needed to store the messages.

6. Practical Considerations and Example Use Cases

Activation Geometry and Model Analysis:

EP is applied to activation-space analysis of LLMs, where it supports direct, cross-comparable, and interpretable region dictionaries for feature tracing, intervention, and model-family studies (Rumbelow, 14 May 2026).

Cluster-Size Control and Priors:

In applications where knowledge of the cluster-size profile is available or desired, prior-based EP can be controlled by the concentration parameter $\alpha$ (or other hyperparameters), tuned via likelihood or empirical Bayes.

Image Segmentation:

When applied to image superpixel graphs, DP-EP yields segmentations that reflect true underlying structure (e.g., avoids oversegmentation seen in unregularized methods).

Resource Efficiency:

As an unsupervised dictionary discovery method, EP has become especially attractive where compute and data limitations preclude large-scale gradient optimization, while offering high accuracy and direct interpretability with minimal cost.

7. Impact and Future Directions

EP bridges the gap between interpretable dictionary learning and scalable, resource-efficient partitioning of high-dimensional model activations and general data. Its anchoring in observed activations makes cross-layer, cross-model, and cross-checkpoint comparisons tractable. Prior-based formulations provide powerful flexibility in shaping the solution space and statistical properties of the clusters, enabling a wide range of applications beyond interpretability, including unsupervised representation learning, anomaly detection, and clustering with domain-informed structure (Rumbelow, 14 May 2026, Tarlow et al., 2012).

Continued research seeks to further improve the scalability of inference in prior-driven EP, expand the interpretability and intervention toolkit enabled by region-anchored dictionaries, and formalize the conditions under which Voronoi/cone and linear/subspace-based features yield convergent or divergent decompositions of latent space.

References (2)

Exemplar Partitioning for Mechanistic Interpretability (2026)

Flexible Priors for Exemplar-based Clustering (2012)
