EigenBench: AI Value Alignment Benchmark

Updated 3 September 2025
  • EigenBench is a benchmarking framework for quantitatively measuring AI value alignment using black-box peer judgments.
  • It employs a Bradley–Terry–Davidson model and the EigenTrust algorithm to aggregate comparative judgments into continuous alignment scores.
  • Empirical findings attribute approximately 79% of score variance to the prompted persona and 21% to inherent model traits, highlighting strong prompt sensitivity alongside a measurable model-specific component.

EigenBench is a black-box comparative benchmarking framework for quantifying value alignment in LLMs. It establishes a systematic method for producing continuous scores that represent each model’s alignment to a specified constitution of values. EigenBench is notable for operating without the need for ground-truth labels, instead leveraging peer judgments among models to aggregate subjective behavioral traits. The method is foundational for research in AI value alignment, enabling performance measurement and leaderboard construction for models with respect to ethical or behavioral criteria where consensus is inherently elusive (Chang et al., 2 Sep 2025).

1. Objectives and Conceptual Foundation

EigenBench was developed to address the lack of quantitative metrics for AI value alignment, especially in domains where human values and behavioral traits (e.g., kindness, loyalty, conservatism) are inherently subjective. Traditional approaches reliant on human labeling face challenges of cost, scalability, and ambiguity in ground truths. EigenBench deviates from these approaches by structuring a process in which models both produce and evaluate responses, employing comparative assessments anchored to a user-specified constitution. Its primary aim is to output a vector of scores for an ensemble of models, reflecting an aggregate alignment to the given value system. This comparative structure is designed for multipolar scenarios, where multiple agents interact and average-case behavior may be as crucial as detecting worst-case failures.

2. Comparative Judgment via Black-Box Protocol

The evaluation protocol in EigenBench is constructed around three core elements:

  • Model Ensemble ($\mathcal{M} = \{M_1, \dots, M_n\}$): Each model serves both as an evaluee (whose output is assessed) and as a judge (who evaluates others).
  • Constitution ($\mathcal{C}$): A list of value statements or criteria (e.g., "Universal Kindness") describing the value system against which alignment is measured.
  • Prompted Scenario Dataset ($\mathcal{E}$): Real-world scenario prompts designed to elicit relevant model behavior.

For each scenario, two models generate responses, and a third model—acting in the role of judge—compares these against the constitution via a double-blind process. Evaluees are unaware of how they are being judged; judges are unaware of the origin of the responses. This pairwise arrangement allows for the collection of rich comparative data without reliance on external or fixed labels.
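
As a rough illustration, one round of this protocol can be sketched in a few lines of Python. The sketch assumes hypothetical generate and judge interfaces (the names and verdict format are illustrative, not from the paper) and shows the double-blind pairing with randomized presentation order:

```python
import itertools
import random

# Hypothetical interfaces (not from the paper): generate(model, scenario)
# returns a model's response; judge(judge_model, constitution, scenario, a, b)
# compares two anonymized responses against the constitution and returns
# "A", "B", or "tie".

def collect_judgments(models, scenarios, constitution, generate, judge):
    """One round of double-blind pairwise evaluation."""
    records = []
    for scenario in scenarios:
        for m_a, m_b in itertools.combinations(models, 2):
            resp_a = generate(m_a, scenario)
            resp_b = generate(m_b, scenario)
            # Every other model in the ensemble serves as a judge.
            for m_judge in (m for m in models if m not in (m_a, m_b)):
                # Randomize presentation order so authorship cannot be inferred.
                swapped = random.random() < 0.5
                first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
                verdict = judge(m_judge, constitution, scenario, first, second)
                if swapped and verdict != "tie":
                    verdict = "B" if verdict == "A" else "A"
                records.append({
                    "judge": m_judge, "evaluee_a": m_a, "evaluee_b": m_b,
                    "scenario": scenario, "outcome": verdict,
                })
    return records
```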

3. Latent Behavioral Model: Bradley–Terry–Davidson Framework

The inter-model evaluation data is modeled through a low-rank Bradley–Terry–Davidson (BTD) framework. In contrast to scalar "strength" assignments, each candidate model’s latent behavioral alignment is encoded as a vector $v_j \in \mathbb{R}^d$, while each judge applies a lens $u_i \in \mathbb{R}^d$ that weights these latent dimensions. The tie propensity $\lambda_i$ reflects judge $i$'s likelihood of indifference. For judge $i$ and candidate models $j$ and $k$, the probabilities of tie, preference, and aversion are expressed as:

  • $P_i(j \approx k) = \frac{1}{Z}\,\lambda_i \exp\!\left(\frac{1}{2}\, u_i^\top (v_j + v_k)\right)$
  • $P_i(j \succ k) = \frac{1}{Z}\,\exp(u_i^\top v_j)$
  • $P_i(k \succ j) = \frac{1}{Z}\,\exp(u_i^\top v_k)$

where $Z$ normalizes the three probabilities to sum to one. Parameters are estimated by maximizing the log-likelihood over observed pairwise judgments, enabling identification of both model dispositions and judge lenses.
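
These formulas translate directly into code. Below is a minimal sketch, assuming judgments are stored as (judge, model j, model k, outcome) tuples; the array layout and names are illustrative:

```python
import numpy as np

def btd_probs(u_i, lam_i, v_j, v_k):
    """Tie/preference probabilities for judge i over models j and k."""
    s_j = np.exp(u_i @ v_j)                           # judge-weighted strength of j
    s_k = np.exp(u_i @ v_k)                           # judge-weighted strength of k
    s_tie = lam_i * np.exp(0.5 * u_i @ (v_j + v_k))   # Davidson tie term
    Z = s_j + s_k + s_tie                             # normalizer
    return {"j": s_j / Z, "k": s_k / Z, "tie": s_tie / Z}

def neg_log_likelihood(U, V, lam, judgments):
    """-log-likelihood of observed judgments (i, j, k, outcome), with
    outcome in {"j", "k", "tie"}; U: judge lenses (n x d),
    V: model dispositions (n x d), lam: tie propensities (n,)."""
    nll = 0.0
    for i, j, k, outcome in judgments:
        p = btd_probs(U[i], lam[i], V[j], V[k])
        nll -= np.log(p[outcome])
    return nll
```

Maximum-likelihood estimation then amounts to minimizing this quantity over $U$, $V$, and $\lambda$ with any standard optimizer.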

4. Aggregation via EigenTrust and Score Computation

Following parameter estimation, a trust matrix $T$ is constructed, where entry $T_{ij}$ encapsulates the extent to which judge $i$ “trusts” candidate $j$’s adherence to the constitution. The matrix is row-normalized to a right-stochastic form suitable for further aggregation. EigenBench applies the EigenTrust algorithm: the principal left eigenvector $t$ of $T$ is computed, satisfying $t = tT$. This vector contains normalized scores for each model, which are further mapped to Elo ratings via $\mathrm{Elo}_j = 1500 + 400 \log_{10}(N t_j)$, establishing a leaderboard that reflects weighted community judgment across the model ensemble.
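
A minimal sketch of this aggregation step, assuming the trust matrix has already been assembled from the fitted BTD parameters (the power-iteration and normalization details here are illustrative):

```python
import numpy as np

def eigentrust_scores(T, tol=1e-10, max_iter=1000):
    """Principal left eigenvector t of a trust matrix T, satisfying t = tT."""
    T = T / T.sum(axis=1, keepdims=True)       # row-normalize: right-stochastic
    t = np.full(T.shape[0], 1.0 / T.shape[0])  # start from uniform trust
    for _ in range(max_iter):
        t_next = t @ T                         # one left power-iteration step
        t_next /= t_next.sum()                 # keep scores on the simplex
        if np.abs(t_next - t).sum() < tol:
            return t_next
        t = t_next
    return t

def to_elo(t):
    """Map normalized trust scores to Elo: Elo_j = 1500 + 400 * log10(N * t_j)."""
    return 1500 + 400 * np.log10(len(t) * np.asarray(t))
```

Note that uniform trust scores map to an Elo of exactly 1500, since $N t_j = 1$ for every model in that case.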

5. Evaluation Protocol, Variance Analysis, and Findings

In practice, evaluation involves each model generating scenario-specific responses and acting as a judge over other responses, with constitutional reflections provided to guide judgment. Aggregation proceeds via principal eigenvector extraction from the trust matrix, circumventing the need for external ground truth labels.

Empirical analyses revealed that the primary source of variance in EigenBench scores is the scenario prompt persona rather than inherent model disposition: approximately 79% of variance is explained by persona and 21% by model-specific differences. This indicates strong prompt sensitivity alongside a consistent residual component attributable to each model’s underlying character.

Variance Attribution Table

| Source  | Percentage of Variance | Interpretation         |
|---------|------------------------|------------------------|
| Persona | 79%                    | Prompt-dependent trait |
| Model   | 21%                    | Inherent disposition   |

This smaller but persistent model-driven variance demonstrates EigenBench’s sensitivity to latent behavioral tendencies even when prompt conditions vary.
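
One simple way to compute such a split is a two-way main-effects decomposition over a personas-by-models score matrix; the sketch below is illustrative and not necessarily the paper's exact analysis:

```python
import numpy as np

def variance_attribution(scores):
    """Rough main-effect variance split for a (personas x models) score matrix,
    analogous to a two-way ANOVA without the interaction term."""
    grand = scores.mean()
    persona_effect = scores.mean(axis=1) - grand    # centered row (persona) means
    model_effect = scores.mean(axis=0) - grand      # centered column (model) means
    total = ((scores - grand) ** 2).sum()
    ss_persona = scores.shape[1] * (persona_effect ** 2).sum()
    ss_model = scores.shape[0] * (model_effect ** 2).sum()
    return ss_persona / total, ss_model / total     # fractions of total variance
```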

6. Implications for AI Alignment and Future Research

EigenBench provides a systematic benchmarking tool for value alignment in AI. It enables developers and researchers to construct leaderboards over arbitrary value systems, facilitating direct comparison among models for purposes of selection, fine-tuning, or operational deployment. Because the method synthesizes model feedback rather than relying on labor-intensive external annotation, it is conducive to scalable "character training"—the process of nudging models toward stronger adherence to specific values.

The framework suggests that further research is warranted to enhance aggregation mechanisms, mitigate adversarial effects such as the greenbeard effect, and isolate genuine model dispositions from prompt sensitivity. A plausible implication is that expanding the model population or refining prompt construction could further disentangle model character from scenario influence, improving both interpretability and diagnostic value for alignment research.

EigenBench establishes a robust methodology for quantifying subjective traits in LLMs, specifically in the context of AI value alignment where ground truths are intrinsically contentious or undefined. As alignment research shifts toward multipolar, judgment-based strategies, this approach is positioned to underpin both empirical study and practical deployment in a broad range of ethical and operational contexts.

References

  • Chang et al., 2 Sep 2025.