CE-Bench: Contrastive Evaluation Benchmark

Updated 7 September 2025
  • CE-Bench is a benchmark that evaluates the interpretability of sparse autoencoders using paired contrastive story datasets.
  • It employs deterministic contrastive and independence scores to measure semantic differences without relying on external LLMs.
  • Empirical tests across 36 SAE architectures show strong alignment with established benchmarks, highlighting its robustness and reproducibility.

CE-Bench is a contrastive evaluation benchmark designed to provide a deterministic, reproducible, and LLM-free framework for measuring the interpretability of sparse autoencoders (SAEs) trained to probe internal representations of LLMs. It leverages a curated dataset of contrastive story pairs and introduces quantitative measures to assess whether neurons in an SAE reliably encode human-interpretable features. CE-Bench addresses prior limitations associated with simulation-based and LLM-based evaluation approaches, aiming to standardize comparisons and stimulate methodological advances in the interpretability of model internals (Gulko et al., 31 Aug 2025).

1. Motivation and Context

Interpretability in the context of neural LLMs involves uncovering features or activations in the model’s internal representations that can be mapped to human-understandable concepts. Sparse autoencoders have emerged as a central tool for this purpose, enabling the extraction of disentangled features from high-dimensional activations. Existing interpretability evaluation methods frequently rely on simulated explanations or automatic scoring via external LLMs. These methods are inherently non-deterministic, vulnerable to model drift, and subject to biases from the evaluating LLM, thereby complicating reproducibility and benchmarking.

CE-Bench is motivated by the need for a lightweight, automated, and fully controllable metric for SAE interpretability. By dispensing with the requirement for an external LLM in the evaluation loop, CE-Bench ensures reproducibility and objective assessment, directly leveraging structured contrastive inputs.

2. Dataset Construction and Design

The dataset underlying CE-Bench comprises 5,000 contrastive story pairs covering 1,000 distinct subjects. Each subject is selected from a filtered set of WikiData entities, focusing on widely recognizable concepts to ensure semantic clarity. For every subject, two stories are generated:

  • One story presents a high-intensity interpretation of the subject.
  • The other story presents its conceptual opposite.

Story generation uses GPT‑4.1 to construct semantically clear contrasts based on carefully engineered prompts, followed by human validation for quality control. This approach ensures that each pair encapsulates a concrete semantic axis while spanning a broad range of topics.
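To make the pairing concrete, one hypothetical record is sketched below; the field names and story snippets are illustrative placeholders, not the released dataset's actual schema:

```python
# Hypothetical structure of a single CE-Bench contrastive pair.
# Consult the dataset released on Hugging Face for the real schema.
example_pair = {
    "subject": "courage",  # drawn from a filtered set of WikiData entities
    "story_high_intensity": "The firefighter ran back into the collapsing "
                            "building to pull the last child out...",
    "story_opposite": "He lingered at the door, unwilling to risk even a "
                      "single step toward the smoke...",
}
```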

3. Evaluation Methodology

Contrastive Evaluation Pipeline

The core of the benchmark is a deterministic contrastive pipeline, encompassing the following steps:

  1. Each story in a pair is independently encoded through a frozen LLM and a pretrained SAE to yield per-token neuron activations.
  2. For each story, activations are averaged across tokens, resulting in two mean activation vectors, $V_1$ and $V_2$.
  3. The contrastive score is the maximal coordinate of the element-wise absolute difference:

$$C = |V_1 - V_2|, \qquad \text{Contrastive Score} = \max(C)$$

This reflects the neuron exhibiting the most pronounced activation change between contrasting contexts, hypothesized to correspond to encoded semantic differences.

  4. The independence score evaluates neuronal specificity (a sketch of both score computations follows this list):
    • The element-wise sum $I_1 = V_1 + V_2$ is computed for each pair.
    • The global average $I_{\text{avg}}$ of these sums is computed over all pairs.
    • The maximal coordinate of $|I_1 - I_{\text{avg}}|$ is taken as the independence score, reflecting how far the pair's neuron activations depart from the dataset-wide baseline.
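A minimal sketch of the two per-pair scores is given below, assuming PyTorch tensors of per-token SAE activations; extraction of activations from the frozen LLM and SAE is omitted, and the function name and tensor shapes are illustrative:

```python
import torch

def ce_bench_pair_scores(acts_story_1: torch.Tensor,
                         acts_story_2: torch.Tensor,
                         dataset_avg_sum: torch.Tensor):
    """Sketch of the per-pair CE-Bench scores.

    acts_story_1, acts_story_2: (n_tokens, n_latents) SAE activations
        for the two contrastive stories of one pair.
    dataset_avg_sum: (n_latents,) average of V_1 + V_2 over all pairs,
        computed in a prior pass over the dataset.
    """
    v1 = acts_story_1.mean(dim=0)   # mean activation vector V_1
    v2 = acts_story_2.mean(dim=0)   # mean activation vector V_2

    # Contrastive score: largest coordinate of |V_1 - V_2|
    contrastive = (v1 - v2).abs().max().item()

    # Independence score: largest coordinate of |I_1 - I_avg|, with I_1 = V_1 + V_2
    independence = ((v1 + v2) - dataset_avg_sum).abs().max().item()

    return contrastive, independence
```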

Aggregation Strategies

CE-Bench proposes two principal approaches to synthesizing these scores into a final interpretability metric:

  • Proxy Learning: A supervised regression model is trained with three inputs — contrastive score, independence score, and SAE sparsity value — to predict interpretability labels from SAE-Bench.
  • Simple Averaging: A deterministic, unsupervised score obtained by averaging the contrastive and independence scores, optionally penalized by the model's sparsity for greater discriminative power (both strategies are sketched below).
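A minimal sketch of the two aggregation strategies follows; the linear regressor stands in for whatever regression model CE-Bench actually uses, and the sparsity-penalty form with weight `alpha` is an illustrative assumption rather than the paper's exact formulation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def simple_average_score(contrastive, independence, sparsity=None, alpha=1.0):
    """Unsupervised aggregation: average the two scores and optionally
    subtract a sparsity penalty (penalty form and alpha are assumptions)."""
    score = 0.5 * (np.asarray(contrastive) + np.asarray(independence))
    if sparsity is not None:
        score = score - alpha * np.asarray(sparsity)
    return score

def proxy_learning_scores(contrastive, independence, sparsity, sae_bench_labels):
    """Supervised aggregation: regress SAE-Bench interpretability labels on
    the three features (a plain linear model is used here for illustration)."""
    X = np.stack([contrastive, independence, sparsity], axis=1)
    reg = LinearRegression().fit(X, sae_bench_labels)
    return reg.predict(X)
```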

4. Experimental Protocols and Results

CE-Bench evaluation spans 36 pretrained SAE architectures and thoroughly probes the impacts of design choices and hyperparameters:

  • Architectural Variants: Standard, top‑k, p-anneal, and JumpReLU families are evaluated. Top‑k and p-anneal models consistently achieve higher interpretability scores.
  • Latent Space Width: Increasing the latent dimensionality from 4k to 65k reliably improves both contrastive and independence scores due to enhanced disentanglement, with corresponding reductions in sparsity.
  • Transformer Layer Depth and Type: Evaluation across attention, MLP, and residual-stream sub-layers shows minimal variation across types; however, interpretability notably increases at deeper transformer layers (beyond the 10th).
  • Ranking Consistency (CRPR Metric): The Correct Ranking Pair Ratio measures agreement between CE-Bench and SAE-Bench interpretability orderings (a sketch of the computation follows this list). Proxy learning achieves 75.98%, simple averaging reaches 70.12%, and an unsupervised sparsity-penalized variant attains 74.82%.
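Read as a pairwise ordering-agreement ratio between the two benchmarks' scores, the CRPR can be sketched as follows; the tie handling below is an assumption:

```python
from itertools import combinations

def correct_ranking_pair_ratio(ce_bench_scores, sae_bench_scores):
    """Fraction of SAE pairs that CE-Bench orders the same way as SAE-Bench.
    Pairs in which either benchmark assigns equal scores count as
    disagreements here; the paper may treat ties differently."""
    pairs = list(combinations(range(len(ce_bench_scores)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (ce_bench_scores[i] - ce_bench_scores[j])
           * (sae_bench_scores[i] - sae_bench_scores[j]) > 0
    )
    return agree / len(pairs) if pairs else 0.0
```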

Empirical findings confirm that CE-Bench's unsupervised rankings are robust and strongly concordant with LLM-mediated labeling, despite the absence of external annotators. This suggests that the contrastive and independence metrics capture core interpretability properties that are also detected by more laborious, simulation-based methods.

5. Implementation and Open Resources

CE-Bench is fully open-sourced under the MIT License, with code repositories linked directly in the associated publication. Accompanying resources include:

  • The 5,000-pair contrastive story dataset, publicly released via the Hugging Face platform.
  • Detailed instructions for evaluating arbitrary sparse autoencoder models and reproducing the benchmark pipeline.
  • Train/test splits for supervised regression and prompt templates used for story generation.

This facilitates transparent benchmarking, cross-laboratory replication, and extension to new interpretability paradigms.

6. Limitations and Prospective Directions

CE-Bench’s principal advantages are its reproducibility, LLM-independence, and robust correlation with established interpretability benchmarks. Nevertheless, the aggregation method based on simple averaging—especially when including sparsity corrections—remains an area for refinement. Expanding the diversity of semantic contexts and representation types within the dataset stands as a logical extension.

Future development directions include the integration of additional unsupervised aggregation strategies, broadening semantic coverage, and application to interpretability evaluation beyond neuron-level features (e.g., probing groupings of features or compositional internal structures).

7. Significance within Interpretability Research

CE-Bench provides a principled, systematic, and efficient pathway for the quantitative evaluation of interpretability in sparse autoencoders trained on LLM internals. By removing LLM annotator dependence, it sets a reproducibility standard and facilitates direct comparisons across research groups and model configurations. Its methodology consequently informs future benchmark design for probing, feature discovery, and efficient architecture selection in the pursuit of human-understandable neural representations.
