
Exemplar Partitioning (EP) Overview

Updated 20 May 2026
  • Exemplar Partitioning (EP) is an unsupervised method that partitions high-dimensional data into hard, anchor-defined Voronoi regions via leader clustering.
  • The technique achieves competitive interpretability compared to sparse autoencoders while reducing computational cost and enabling flexible, nonparametric clustering.
  • EP supports mechanistic analysis of deep neural activations by facilitating cross-checkpoint comparisons and causal interventions for model interpretability.

Exemplar Partitioning (EP) denotes a class of unsupervised methods that construct hard, non-parametric partitions of high-dimensional data by selecting a subset of observed data points, called exemplars, to serve as anchors for “Voronoi” regions. Each data point is assigned to its nearest exemplar according to a task-appropriate geometry or similarity. EP is notably leveraged for interpretable feature discovery in deep neural activations, including mechanistic analysis of LLMs, and for exemplar-based clustering with flexible, nonparametric priors on cluster structure. EP produces directly comparable, anchor-based dictionaries for analysis and intervention, requires no gradient optimization, and, when applied to model activations, achieves comparable interpretability and causal utility to sparse autoencoders at orders-of-magnitude lower computational cost (Rumbelow, 14 May 2026, Tarlow et al., 2012).

1. Mathematical Construction and Algorithmic Specification

EP partitions the input space $\mathbb{R}^d$ through sequential, distance-thresholded leader clustering. Consider a stream of activation vectors $a \in \mathbb{R}^d$, where each $a$ denotes the (possibly normalized) activations at a particular model layer and token position. EP proceeds as follows:

  1. Centering & Normalization:

The data mean $\mu \in \mathbb{R}^d$ is computed on a calibration set. Each activation is mapped to the unit sphere, $\phi(a) = \frac{a - \mu}{\|a - \mu\|_2}$, so that the dictionary is formed over normalized directions.

  2. Leader Clustering:

For each normalized activation $u = \phi(a)$ streamed in, compute its minimum Euclidean distance to the current exemplar set $E$. If $\min_{e \in E} \|u - e\|_2 > \tau$ (where $\tau$ is a distance threshold), add $u$ to $E$ as a new exemplar; otherwise assign $u$ to its closest existing exemplar $e^* = \arg\min_{e \in E} \|u - e\|_2$.

  3. Voronoi Dictionary: Each exemplar $e_i \in E$ defines a region $R_i = \{u : \|u - e_i\|_2 \le \|u - e_j\|_2 \ \text{for all } e_j \in E\}$; the collection $\{R_i\}$ partitions the unit sphere into encoder-determined cells.
  4. Threshold Calibration: The cluster threshold $\tau$ is set as the $p$-th percentile of pairwise distances over a sample of calibration activations. Reporting the percentile $p$ normalizes cluster resolution across models, layers, and datasets.
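
As an illustration of the centering and calibration steps above, the following is a minimal sketch in plain Python. The function names `center_and_normalize` and `calibrate_threshold` are hypothetical, and the nearest-rank percentile convention is an assumption, since the percentile definition is not specified here. A practical implementation would likely use an array library and subsample pairs rather than enumerate all of them.

```python
import math
from itertools import combinations


def center_and_normalize(a, mu):
    """Map an activation to the unit sphere: (a - mu) / ||a - mu||_2."""
    diff = [ai - mi for ai, mi in zip(a, mu)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("activation equals the mean; direction is undefined")
    return [d / norm for d in diff]


def calibrate_threshold(calibration, p):
    """Return (mu, tau) from a list of calibration activations.

    mu is the calibration mean; tau is the p-th percentile (0 < p <= 100)
    of pairwise Euclidean distances between the normalized activations.
    Assumption: nearest-rank percentile, chosen only for illustration.
    """
    n, d = len(calibration), len(calibration[0])
    mu = [sum(a[j] for a in calibration) / n for j in range(d)]
    units = [center_and_normalize(a, mu) for a in calibration]
    dists = sorted(math.dist(u, v) for u, v in combinations(units, 2))
    rank = max(1, math.ceil(p / 100.0 * len(dists)))  # 1-indexed nearest rank
    return mu, dists[rank - 1]


# Toy calibration set: four axis-aligned points in two dimensions.
mu, tau = calibrate_threshold(
    [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]], p=50
)
```

Lower percentiles give a smaller threshold and therefore a finer partition with more exemplars.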

Emergent dictionary size is determined by activation geometry at a fixed threshold setting, with the process terminating at batch “saturation”, that is, when a full batch yields no new exemplars.
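
The leader-clustering loop and its saturation stopping rule can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the name `leader_cluster` is hypothetical, inputs are assumed to be already centered and normalized, and representing the stream as a list of batches is an assumption made for simplicity.

```python
import math


def leader_cluster(batches, tau):
    """Grow an exemplar set from batches of unit-norm vectors.

    A vector farther than tau from every current exemplar becomes a new
    exemplar; otherwise it is assigned to its nearest exemplar. Growth
    stops at saturation: the first batch that adds no new exemplar.
    """
    exemplars = []    # the exemplar set E, in order of discovery
    assignments = []  # nearest-exemplar index for each vector processed
    for batch in batches:
        added = 0
        for u in batch:
            # Linear scan for the nearest exemplar (inf if E is empty).
            nearest, best = -1, math.inf
            for i, e in enumerate(exemplars):
                dist = math.dist(u, e)
                if dist < best:
                    nearest, best = i, dist
            if best > tau:
                exemplars.append(u)
                assignments.append(len(exemplars) - 1)
                added += 1
            else:
                assignments.append(nearest)
        if added == 0:
            break  # saturated: a full batch produced no new exemplar
    return exemplars, assignments


batches = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0]],    # adds nothing, so clustering stops here
    [[-1.0, 0.0]],   # never processed
]
exemplars, assignments = leader_cluster(batches, tau=1.0)
```

Because the result depends on the order in which vectors arrive, two runs over differently ordered streams can yield different exemplar sets at the same threshold.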

2. Probabilistic and Prior-Driven Extensions

EP is generalized by coupling with nonparametric priors on partitions, notably Dirichlet process (DP) or Pitman–Yor priors, as a framework for flexible exemplar-based clustering (Tarlow et al., 2012):

  • Let there be $N$ data points $x_1, \dots, x_N$. Each cluster is defined by its exemplar, and each point is assigned to one exemplar; write $c_i \in \{1, \dots, N\}$ for the index of the exemplar chosen by point $i$.
  • The generative model is:

$$P(x, c) \propto P(\pi(c)) \, \prod_{k=1}^{N} \delta_k(c) \, \prod_{i=1}^{N} p(x_i \mid x_{c_i})$$

where $P(\pi(c))$ is the partition prior (e.g., DP) on the partition $\pi(c)$ induced by the assignments, $\delta_k(c)$ enforces one exemplar per non-empty cluster, and $p(x_i \mid x_{c_i})$ specifies emission from exemplars.

  • The DP prior allows the number of clusters to be determined adaptively,

$$P(\pi) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \, \alpha^{K} \prod_{k=1}^{K} \Gamma(n_k)$$

with $\alpha$ the concentration parameter, $K$ clusters and cluster sizes $n_1, \dots, n_K$.

  • MAP inference is solved via max-product belief propagation on a structured factor graph over the assignment variables, with time and space costs that Tarlow et al. (2012) characterize in terms of the number of data points and the number of rounds of message passing.
  • Flexible priors $P(\pi)$ encode different cluster-size behaviors, e.g., Pitman–Yor or power-law.
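
To make the role of the DP prior concrete, the sketch below evaluates the standard Dirichlet process (Chinese restaurant process) log-probability of a partition from its cluster sizes, $\log P(\pi) = K \log \alpha + \sum_k \log \Gamma(n_k) - [\log \Gamma(\alpha + N) - \log \Gamma(\alpha)]$. The function name `dp_log_partition_prior` is hypothetical, and the code covers only the prior term, not the message-passing inference of Tarlow et al. (2012).

```python
import math


def dp_log_partition_prior(cluster_sizes, alpha):
    """Log-probability of a partition under a DP / CRP prior.

    P = alpha^K * prod_k (n_k - 1)! / prod_{i=1..N} (alpha + i - 1),
    where K is the number of clusters and N the number of points.
    """
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    log_p = k * math.log(alpha)
    # lgamma(n_k) equals log((n_k - 1)!)
    log_p += sum(math.lgamma(n_k) for n_k in cluster_sizes)
    # lgamma(alpha + n) - lgamma(alpha) is the log of the rising factorial
    log_p -= math.lgamma(alpha + n) - math.lgamma(alpha)
    return log_p


# With three points there is 1 partition of sizes [3], 3 of sizes [2, 1],
# and 1 of sizes [1, 1, 1]; their probabilities should sum to one.
alpha = 0.7
total = (
    math.exp(dp_log_partition_prior([3], alpha))
    + 3 * math.exp(dp_log_partition_prior([2, 1], alpha))
    + math.exp(dp_log_partition_prior([1, 1, 1], alpha))
)
```

Larger values of $\alpha$ shift probability toward partitions with more, smaller clusters.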

3. Mechanistic Interpretability in Model Activations

EP directly supports mechanistic interpretability in deep models by constructing feature dictionaries aligned to observed activation geometry (Rumbelow, 14 May 2026):

  • Exemplar Anchoring:

Each region anchor is a true observed activation. Thus, dictionaries constructed from the same stream are directly comparable across layers, training checkpoints, or model variants.

  • Cross-Checkpoint Comparison:

Matching exemplars across model checkpoints (e.g., base vs. instruction-tuned) via the Hungarian algorithm on cosine similarity reveals which activation directions persist across fine-tuning. For example, in Gemma-2-2B, only a small fraction of exemplars retained a high-cosine match across checkpoints, implying that most activation geometry is re-anchored by fine-tuning.

  • Intervention Experiments:

Projecting activations off an exemplar associated with a specific behavior (such as refusal in instruction-tuned LLMs) can causally suppress that behavior (e.g., a baseline refusal rate of 0.98 drops to 0.02 upon ablation of the corresponding region anchor, a difference of 0.96).

  • Quantitative Feature Alignment:

EP regions exhibit partial overlap with sparse autoencoder (SAE) features: a share of EP regions match an SAE feature at a fixed similarity threshold, whereas only a limited share of SAE features match an EP region at the same threshold, with higher coverage at finer percentile granularity.

  • One-Hot Probe Accuracy & AUROC:

Encoding activations into EP one-hot sparse codes preserves much of the linear probe accuracy obtained with raw activations. For latent concept detection (AxBench), EP achieves a mean AUROC that exceeds that of a standard SAE and closely approaches that of the label-supervised SAE-A.

  • Out-of-Distribution Signal:

The nearest-exemplar distance serves as a free measure of distributional shift; activations on random or under-represented inputs display significantly greater mean distance to nearest anchor than in-distribution samples.
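
The matching, ablation, and distance-based operations described in the bullets above can be sketched in a few lines of plain Python. All function names here are hypothetical. Two assumptions are worth stating: the exhaustive permutation search in `match_exemplars` stands in for the Hungarian algorithm and is feasible only for very small, equal-sized dictionaries, and `project_off` is taken to be an ordinary orthogonal projection applied in the same centered coordinates as the exemplars.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def match_exemplars(dict_a, dict_b):
    """One-to-one matching that maximizes total cosine similarity.

    Exhaustive search over permutations, used here in place of the
    Hungarian algorithm; only practical for tiny dictionaries.
    """
    sims = [[cosine(a, b) for b in dict_b] for a in dict_a]
    best = max(
        permutations(range(len(dict_b))),
        key=lambda perm: sum(sims[i][j] for i, j in enumerate(perm)),
    )
    return [(i, j, sims[i][j]) for i, j in enumerate(best)]


def project_off(a, e):
    """Remove from activation a its component along exemplar direction e."""
    scale = sum(x * y for x, y in zip(a, e)) / sum(y * y for y in e)
    return [x - scale * y for x, y in zip(a, e)]


def nearest_exemplar_distance(u, exemplars):
    """Distance to the closest anchor; larger values suggest shift."""
    return min(math.dist(u, e) for e in exemplars)


base = [[1.0, 0.0], [0.0, 1.0]]
tuned = [[0.0, 1.0], [1.0, 0.0]]
matches = match_exemplars(base, tuned)       # pairs exemplar 0 with 1, 1 with 0
ablated = project_off([3.0, 4.0], base[0])   # component along base[0] removed
shift = nearest_exemplar_distance([-1.0, 0.0], base)
```

For dictionaries of realistic size, a polynomial-time assignment solver would replace the permutation search.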

4. Comparative Analysis: EP vs Sparse Autoencoders and Other Methods

EP and sparse autoencoders (SAEs) impose fundamentally distinct geometric constraints on learned representations:

| Feature | Exemplar Partitioning (EP) | Sparse Autoencoders (SAE) |
|---|---|---|
| Partition geometry | Hard Voronoi cells (unit sphere, L2) | Linear coding, soft selection |
| Anchor type | Observed activations | Learned weights |
| Dictionary size | Emergent, dictated by threshold and data geometry | Prespecified |
| Compute requirements | Single-stream, zero backward passes | Millions–billions of tokens, backpropagation, many gradient steps |
| Inter-dictionary match | Direct comparison across layers, models, or checkpoints | No inherent cross-model alignment |
| Shared coverage | A share of EP regions match an SAE feature at a fixed similarity threshold | SAE features more fragmented, less likely to coincide with EP at coarse resolution |
| Interpretability | Region cause directly intervenable (projection off exemplar collapses associated features) | Demands indirect or aggregate intervention |

EP yields dictionaries with no learned parameters, whereas SAEs require resource-intensive gradient optimization and a fixed basis size. This suggests EP is orders of magnitude more token-efficient for unsupervised feature generation at comparable interpretability (Rumbelow, 14 May 2026).

5. Computational and Statistical Properties

  • Streaming and Online Construction:

EP dictionaries are constructed in a single forward pass, suitable for streaming and online adaptations.

  • Emergent Resolution and Stopping Condition:

Dictionary growth halts after a batch produces no new exemplars; size reflects the intrinsic density and dispersion of activation space as parametrized via the threshold $\tau$.

  • Prior-Driven Clustering:

By incorporating flexible priors (DP, Pitman–Yor, etc.) over partitions, one can bias cluster numbers and size profiles appropriate to the task (Tarlow et al., 2012). The concentration parameter $\alpha$ in the DP controls the expected number of regions, and the prior can avoid pathologies of vanilla affinity propagation (such as poor modeling of heterogeneous cluster-size distributions).

  • Computational Complexity:

For prior-based EP with affinity propagation and max-product message passing, the main computational costs are time, which scales with the number of message-passing iterations, and memory (Tarlow et al., 2012).
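
How the DP concentration parameter influences the number of regions can be illustrated with the standard Chinese restaurant process identity $\mathbb{E}[K] = \sum_{i=1}^{N} \frac{\alpha}{\alpha + i - 1}$. The helper name `expected_num_clusters` below is hypothetical, and the figure it returns describes the prior alone, not the clusters found after MAP inference on data.

```python
def expected_num_clusters(alpha, n):
    """Expected number of clusters for n points under a DP / CRP prior.

    Point i (1-indexed) opens a new cluster with probability
    alpha / (alpha + i - 1); the expectation is the sum of these terms.
    """
    return sum(alpha / (alpha + i) for i in range(n))


# The expected count grows with alpha for a fixed number of points.
small = expected_num_clusters(0.5, 1000)
large = expected_num_clusters(5.0, 1000)
```

For large $N$ the expectation grows roughly like $\alpha \log N$, which is why $\alpha$ acts as a soft control on dictionary size in the prior-based formulation.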

6. Practical Considerations and Example Use Cases

  • Activation Geometry and Model Analysis:

EP is now foundational in activation-space analysis of LLMs, supporting direct, cross-comparable, and interpretable region dictionaries for feature tracing, intervention, and model-family studies (Rumbelow, 14 May 2026).

  • Cluster-Size Control and Priors:

In applications where knowledge of the cluster-size profile is available or desired, prior-based EP can be controlled by the concentration parameter $\alpha$ (or other hyperparameters), tuned via likelihood or empirical Bayes.

  • Image Segmentation:

When applied to image superpixel graphs, DP-EP yields segmentations that reflect the underlying structure (e.g., it avoids the oversegmentation seen in unregularized methods).

  • Resource Efficiency:

As an unsupervised dictionary discovery method, EP has become especially attractive where compute and data limitations preclude large-scale gradient optimization, while offering high accuracy and direct interpretability with minimal cost.

7. Impact and Future Directions

EP bridges the gap between interpretable dictionary learning and scalable, resource-efficient partitioning of high-dimensional model activations and general data. Its anchoring in observed activations makes cross-layer, cross-model, and cross-checkpoint comparisons tractable. Prior-based formulations provide powerful flexibility in shaping the solution space and statistical properties of the clusters, enabling a wide range of applications beyond interpretability, including unsupervised representation learning, anomaly detection, and clustering with domain-informed structure (Rumbelow, 14 May 2026, Tarlow et al., 2012).

Continued research seeks to further improve the scalability of inference in prior-driven EP, expand the interpretability and intervention toolkit enabled by region-anchored dictionaries, and formalize the conditions under which Voronoi/cone and linear/subspace-based features yield convergent or divergent decompositions of latent space.
