CE-Bench: Contrastive Evaluation Benchmark

Updated 7 September 2025
  • CE-Bench is a benchmark that evaluates the interpretability of sparse autoencoders using paired contrastive story datasets.
  • It employs deterministic contrastive and independence scores to measure semantic differences without relying on external LLMs.
  • Empirical tests across 36 SAE architectures show strong alignment with established benchmarks, highlighting its robustness and reproducibility.

CE-Bench is a contrastive evaluation benchmark designed to provide a deterministic, reproducible, and LLM-free framework for measuring the interpretability of sparse autoencoders (SAEs) trained to probe internal representations of LLMs. It leverages a curated dataset of contrastive story pairs and introduces quantitative measures to assess whether neurons in an SAE reliably encode human-interpretable features. CE-Bench addresses prior limitations associated with simulation-based and LLM-based evaluation approaches, aiming to standardize comparisons and stimulate methodological advances in the interpretability of model internals (Gulko et al., 31 Aug 2025).

1. Motivation and Context

Interpretability in the context of neural LLMs involves uncovering features or activations in the model’s internal representations that can be mapped to human-understandable concepts. Sparse autoencoders have emerged as a central tool for this purpose, enabling the extraction of disentangled features from high-dimensional activations. Existing interpretability evaluation methods frequently rely on simulated explanations or automatic scoring via external LLMs. These methods are inherently non-deterministic, vulnerable to model drift, and subject to biases from the evaluating LLM, thereby complicating reproducibility and benchmarking.

CE-Bench is motivated by the need for a lightweight, automated, and fully controllable metric for SAE interpretability. By dispensing with the requirement for an external LLM in the evaluation loop, CE-Bench ensures reproducibility and objective assessment, directly leveraging structured contrastive inputs.

2. Dataset Construction and Design

The dataset underlying CE-Bench comprises 5,000 contrastive story pairs covering 1,000 distinct subjects. Each subject is selected from a filtered set of WikiData entities, focusing on widely recognizable concepts to ensure semantic clarity. For every subject, two stories are generated:

  • One story presents a high-intensity interpretation of the subject.
  • The other story presents its conceptual opposite.

Story generation uses GPT‑4.1 to construct semantically clear contrasts based on carefully engineered prompts, followed by human validation for quality control. This approach ensures that each pair encapsulates a concrete semantic axis while spanning a broad range of topics.
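To make the pairing concrete, one hypothetical record is sketched below; the field names and story snippets are illustrative placeholders, not the released dataset's actual schema:

```python
# Hypothetical structure of a single CE-Bench contrastive pair.
# Consult the dataset released on Hugging Face for the real schema.
example_pair = {
    "subject": "courage",  # drawn from a filtered set of WikiData entities
    "story_high_intensity": "The firefighter ran back into the collapsing "
                            "building to pull the last child out...",
    "story_opposite": "He lingered at the door, unwilling to risk even a "
                      "single step toward the smoke...",
}
```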

3. Evaluation Methodology

Contrastive Evaluation Pipeline

The core of the benchmark is a deterministic contrastive pipeline, encompassing the following steps:

  1. Each story in a pair is independently encoded through a frozen LLM and a pretrained SAE to yield per-token neuron activations.
  2. For each story, activations are averaged across tokens, resulting in two mean activation vectors, $V_1$ and $V_2$.
  3. The contrastive score is the maximal coordinate of the element-wise absolute difference:

$$C = |V_1 - V_2|, \qquad \text{Contrastive Score} = \max(C)$$

This reflects the neuron exhibiting the most pronounced activation change between contrasting contexts, hypothesized to correspond to encoded semantic differences.

  4. The independence score evaluates neuronal specificity (a sketch of both score computations follows this list):
    • The element-wise sum $I_1 = V_1 + V_2$ is computed for each pair.
    • The global average $I_{\text{avg}}$ of these sums is computed over all pairs.
    • The maximal coordinate of $|I_1 - I_{\text{avg}}|$ is taken as the independence score, reflecting how far the pair's neuron activations depart from the dataset-wide baseline.
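A minimal sketch of the two per-pair scores is given below, assuming PyTorch tensors of per-token SAE activations; extraction of activations from the frozen LLM and SAE is omitted, and the function name and tensor shapes are illustrative:

```python
import torch

def ce_bench_pair_scores(acts_story_1: torch.Tensor,
                         acts_story_2: torch.Tensor,
                         dataset_avg_sum: torch.Tensor):
    """Sketch of the per-pair CE-Bench scores.

    acts_story_1, acts_story_2: (n_tokens, n_latents) SAE activations
        for the two contrastive stories of one pair.
    dataset_avg_sum: (n_latents,) average of V_1 + V_2 over all pairs,
        computed in a prior pass over the dataset.
    """
    v1 = acts_story_1.mean(dim=0)   # mean activation vector V_1
    v2 = acts_story_2.mean(dim=0)   # mean activation vector V_2

    # Contrastive score: largest coordinate of |V_1 - V_2|
    contrastive = (v1 - v2).abs().max().item()

    # Independence score: largest coordinate of |I_1 - I_avg|, with I_1 = V_1 + V_2
    independence = ((v1 + v2) - dataset_avg_sum).abs().max().item()

    return contrastive, independence
```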

Aggregation Strategies

CE-Bench proposes two principal approaches to synthesizing these scores into a final interpretability metric:

  • Proxy Learning: A supervised regression model is trained with three inputs — contrastive score, independence score, and SAE sparsity value — to predict interpretability labels from SAE-Bench.
  • Simple Averaging: A deterministic, unsupervised score obtained by averaging the contrastive and independence scores, optionally penalized by the model's sparsity for greater discriminative power (both strategies are sketched below).
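A minimal sketch of the two aggregation strategies follows; the linear regressor stands in for whatever regression model CE-Bench actually uses, and the sparsity-penalty form with weight `alpha` is an illustrative assumption rather than the paper's exact formulation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def simple_average_score(contrastive, independence, sparsity=None, alpha=1.0):
    """Unsupervised aggregation: average the two scores and optionally
    subtract a sparsity penalty (penalty form and alpha are assumptions)."""
    score = 0.5 * (np.asarray(contrastive) + np.asarray(independence))
    if sparsity is not None:
        score = score - alpha * np.asarray(sparsity)
    return score

def proxy_learning_scores(contrastive, independence, sparsity, sae_bench_labels):
    """Supervised aggregation: regress SAE-Bench interpretability labels on
    the three features (a plain linear model is used here for illustration)."""
    X = np.stack([contrastive, independence, sparsity], axis=1)
    reg = LinearRegression().fit(X, sae_bench_labels)
    return reg.predict(X)
```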

4. Experimental Protocols and Results

CE-Bench evaluation spans 36 pretrained SAE architectures and thoroughly probes the impacts of design choices and hyperparameters:

  • Architectural Variants: Standard, top‑k, p-anneal, and JumpReLU families are evaluated. Top‑k and p-anneal models consistently achieve higher interpretability scores.
  • Latent Space Width: Increasing the latent dimensionality from 4k to 65k reliably improves both contrastive and independence scores due to enhanced disentanglement, with corresponding reductions in sparsity.
  • Transformer Layer Depth and Type: Evaluation across attention, MLP, and residual-stream sub-layers shows minimal variation across types; however, interpretability notably increases at deeper transformer layers (beyond the 10th).
  • Ranking Consistency (CRPR Metric): The Correct Ranking Pair Ratio measures agreement between CE-Bench and SAE-Bench interpretability orderings (a sketch of the computation follows this list). Proxy learning achieves 75.98%, simple averaging reaches 70.12%, and an unsupervised sparsity-penalized variant attains 74.82%.
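Read as a pairwise ordering-agreement ratio between the two benchmarks' scores, the CRPR can be sketched as follows; the tie handling below is an assumption:

```python
from itertools import combinations

def correct_ranking_pair_ratio(ce_bench_scores, sae_bench_scores):
    """Fraction of SAE pairs that CE-Bench orders the same way as SAE-Bench.
    Pairs in which either benchmark assigns equal scores count as
    disagreements here; the paper may treat ties differently."""
    pairs = list(combinations(range(len(ce_bench_scores)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (ce_bench_scores[i] - ce_bench_scores[j])
           * (sae_bench_scores[i] - sae_bench_scores[j]) > 0
    )
    return agree / len(pairs) if pairs else 0.0
```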

Empirical findings confirm that CE-Bench's unsupervised rankings are robust and strongly concordant with LLM-mediated labeling, despite the absence of external annotators. This suggests that the contrastive and independence metrics capture core interpretability properties that are also detected by more laborious, simulation-based methods.

5. Implementation and Open Resources

CE-Bench is fully open-sourced under the MIT License, with code repositories linked directly in the associated publication. Accompanying resources include:

  • The 5,000-pair contrastive story dataset, publicly released via the Hugging Face platform.
  • Detailed instructions for evaluating arbitrary sparse autoencoder models and reproducing the benchmark pipeline.
  • Train/test splits for supervised regression and prompt templates used for story generation.

This facilitates transparent benchmarking, cross-laboratory replication, and extension to new interpretability paradigms.

6. Limitations and Prospective Directions

CE-Bench’s principal advantages are its reproducibility, LLM-independence, and robust correlation with established interpretability benchmarks. Nevertheless, the aggregation method based on simple averaging—especially when including sparsity corrections—remains an area for refinement. Expanding the diversity of semantic contexts and representation types within the dataset stands as a logical extension.

Future development directions include the integration of additional unsupervised aggregation strategies, broadening semantic coverage, and application to interpretability evaluation beyond neuron-level features (e.g., probing groupings of features or compositional internal structures).

7. Significance within Interpretability Research

CE-Bench provides a principled, systematic, and efficient pathway for the quantitative evaluation of interpretability in sparse autoencoders trained on LLM internals. By removing LLM annotator dependence, it sets a reproducibility standard and facilitates direct comparisons across research groups and model configurations. Its methodology consequently informs future benchmark design for probing, feature discovery, and efficient architecture selection in the pursuit of human-understandable neural representations.
