ValueEval'24 / ValuesML Overview
- ValueEval'24 / ValuesML is a comprehensive set of frameworks, datasets, and paradigms aimed at evaluating large language models' alignment with human values and stylistic preferences.
- It employs diverse value taxonomies, rigorous metrics and benchmarks such as ValueDCG and EVALUESTEER, and calibrated protocols to assess multi-label value detection and reward-model steerability.
- Empirical findings underscore the trade-offs between detection accuracy and interpretability, highlighting the need for robust calibration, ensembling, and bias auditing.
ValueEval’24 and ValuesML collectively denote a set of evaluation frameworks, datasets, and modeling paradigms aimed at probing, benchmarking, and advancing the understanding and steerability of LLMs and related systems with respect to human values, preferences, and stylistic attributes. These efforts span the automated detection of values in text, the assessment of model value comprehension (“know what”/“know why”), the steerability of reward models towards user value/style profiles, and the development of calibration, transferability, and robustness for value-related tasks. The following sections chronologically and thematically organize the major methodologies, datasets, evaluation metrics, findings, and future recommendations that define the ValueEval’24 / ValuesML landscape.
1. Datasets and Value Taxonomies
The foundation of ValueEval’24 and ValuesML rests on large, annotated corpora encoding a variety of value taxonomies, value expressions, and linguistic contexts. Central resources include:
- ValueEval’23 / Touché’23-ValueEval: Contains 9,324 premise–conclusion–stance arguments drawn from six culturally and stylistically heterogeneous sources: IBM-ArgQ, CoFE, GDI, Zhihu, Nahj al-Balagha, and NYT. Each argument is annotated on 54 Level-1 values (binary), grouped into 20 Level-2 categories following the Schwartz (1994) framework and its contemporary expansions. Annotation relied on three crowdworkers per argument and MACE for label fusion (Mirzakhmedova et al., 2023).
- ValueEval’24 / ValuesML: Scales to 74,231 English sentences, primarily for sentence-level value detection with Schwartz values. Label spaces include 19 basic values (e.g., Self-direction: thought, Security: societal) and 8 higher-order (HO) categories derived via logical ORs over the corresponding basic values. Additionally, a binary “Presence” label marks the existence of any value (Yeste et al., 31 Jan 2026).
- ValEval Dataset (CLAVE framework): Comprises 13,000+ (text, value, label) tuples, supporting three value systems: Social Risk, Schwartz Basic Values (10 classic types), and Moral Foundations (5 dimensions). Extensive three-way annotation (majority vote; IAA 85–88%) and coverage across domains, perturbation, and generalization splits (Yao et al., 2024).
These corpora enable robust evaluation across registers, domains, topics, and linguistic/cultural backgrounds by capturing both fine-grained and abstracted value signals.
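As an illustration of the ValueEval’24 label structure (basic values, higher-order categories formed by logical ORs, and a binary Presence flag), the following sketch derives HO and Presence labels from a multi-hot vector of basic-value annotations. The grouping shown is an illustrative subset, not the official 19-value/8-HO mapping:

```python
import numpy as np

# Illustrative (not official) grouping of basic Schwartz values into
# higher-order (HO) categories; ValueEval'24 uses 19 basic values and 8 HO labels.
HO_GROUPS = {
    "Openness to change": ["Self-direction: thought", "Self-direction: action", "Stimulation"],
    "Self-enhancement":   ["Achievement", "Power: dominance", "Power: resources"],
    "Conservation":       ["Security: personal", "Security: societal", "Tradition"],
    "Self-transcendence": ["Benevolence: caring", "Universalism: concern", "Universalism: nature"],
}
BASIC_VALUES = [v for group in HO_GROUPS.values() for v in group]

def derive_labels(basic_hot: np.ndarray) -> dict:
    """Derive HO labels (logical OR over member values) and a Presence flag
    from a multi-hot vector indexed by BASIC_VALUES."""
    idx = {v: i for i, v in enumerate(BASIC_VALUES)}
    ho = {name: int(any(basic_hot[idx[v]] for v in members))
          for name, members in HO_GROUPS.items()}
    presence = int(basic_hot.any())  # any value at all present in the sentence
    return {"ho": ho, "presence": presence}

# A sentence annotated with "Stimulation" and "Tradition" activates two HO categories:
hot = np.zeros(len(BASIC_VALUES), dtype=int)
hot[BASIC_VALUES.index("Stimulation")] = 1
hot[BASIC_VALUES.index("Tradition")] = 1
labels = derive_labels(hot)
```

Because HO labels are deterministic ORs over basic values, a system can predict only at the basic level and derive the HO layer for free; the hard-gated architectures discussed below instead enforce this structure at inference time.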
2. Task Formulations and Formal Metrics
Multiple related but distinct task formulations underpin ValueEval’24 / ValuesML efforts:
- Value detection: Multi-label classification over value taxonomies, either at argument/stance (Touché’23) or sentence level (ValueEval’24), with precision, recall, F₁, accuracy, and macro-F₁ as central metrics (Mirzakhmedova et al., 2023, Yeste et al., 31 Jan 2026).
- Comprehensive value understanding: Quantified by ValueDCG, which measures both the model’s ability to pick the correct value label ("discriminator"—“know what”) and to generate a rationale aligned with expert annotation ("critique"—“know why”). The absolute mean gap between the two scores (the ValueDCG metric) quantifies holistic value understanding: the smaller the gap, the more the two capacities develop in tandem (Zhang et al., 2023).
- Reward model steerability: Assessed via EVALUESTEER, quantifying whether RMs or LLM-judges rank candidate responses in alignment with a user’s composite value and stylistic profile. Pairwise accuracy over controlled preference pairs is the main metric, with context- and oracle-conditioned protocols (Ghate et al., 7 Oct 2025).
- Reference-free value labeling: As in CLAVE, assign "adhere", "oppose", or "unrelated" for a system response given a value definition and context. Measured by accuracy, macro-F1, and expected calibration error (ECE) (Yao et al., 2024).
These formal definitions support clear, reproducible evaluation and benchmarking for each sub-task.
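Two of the central metrics above, macro-F₁ for multi-label detection and expected calibration error (ECE), can be implemented directly from their standard definitions. This is a minimal sketch, not any shared task's official scorer:

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Macro-F1 for multi-label prediction: per-label F1, averaged uniformly.
    y_true, y_pred: (n_examples, n_labels) binary arrays."""
    f1s = []
    for j in range(y_true.shape[1]):
        t, p = y_true[:, j], y_pred[:, j]
        tp = int(((t == 1) & (p == 1)).sum())
        fp = int(((t == 0) & (p == 1)).sum())
        fn = int(((t == 1) & (p == 0)).sum())
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # empty label counts as 0
    return float(np.mean(f1s))

def ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Expected calibration error: per-bin |accuracy - mean confidence|,
    weighted by bin occupancy."""
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = len(confidences)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / total * abs(correct[mask].mean() - confidences[mask].mean())
    return float(err)
```

Macro-averaging gives rare value labels the same weight as frequent ones, which is why the label-wise calibration discussed in Section 3 pays off on imbalanced labels.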
3. Model Architectures, Baselines, and Calibration
A diverse range of model architectures and strategies is benchmarked in ValueEval’24 / ValuesML:
- Supervised encoders: E.g., DeBERTa-Base fine-tuned with binary cross-entropy for value multi-labeling; BERT-based multilabel classifiers with example-level concatenation (Mirzakhmedova et al., 2023, Yeste et al., 31 Jan 2026).
- Instruction-tuned LLMs: Llama-3.1 8B, Gemma 2 9B, and others evaluated zero-shot and few-shot, with outputs parsed into multi-hot vectors (Yeste et al., 31 Jan 2026).
- Hierarchical and hard-gated architectures: Multi-phase pipelines enforcing HO category structure via masking or gating at inference (HO→values, Presence→HO→values). Empirical findings indicate that hard constraints often reduce Macro-F₁ via error compounding and recall suppression (Yeste et al., 31 Jan 2026).
- Ensembling: Soft-voting and cross-family ensemble approaches (transformer + instruction-tuned LLM) yield modest but reliable gains in Macro-F₁, up to +0.02 absolute depending on slice (Yeste et al., 31 Jan 2026).
- Calibration: Label-wise threshold tuning (per-label search) outperforms global thresholds and unlocks up to +0.05 Macro-F₁, especially for imbalanced or ambiguous labels (Yeste et al., 31 Jan 2026).
- CLAVE dual-model framework: Large LLMs extract “value concepts” from sparse human annotation, enabling calibration and transfer of a smaller, fine-tuned recognizer that maintains accuracy and robustness on minimal data (Yao et al., 2024).
Notably, smaller LLMs (on the order of 10B parameters) trail supervised transformers in single-model settings but contribute complementary error patterns when ensembled.
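The label-wise threshold calibration and soft-voting ensembling described above can be sketched as follows. This is a minimal illustration; the actual search grids and ensemble configurations used in the cited systems may differ:

```python
import numpy as np

def tune_thresholds(probs: np.ndarray, y_true: np.ndarray,
                    grid=np.linspace(0.05, 0.95, 19)) -> np.ndarray:
    """Label-wise threshold search: for each label, pick the decision
    threshold that maximizes that label's F1 on a validation set.
    probs, y_true: (n_examples, n_labels) arrays."""
    n_labels = probs.shape[1]
    best = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            pred = probs[:, j] >= t
            tp = int((pred & (y_true[:, j] == 1)).sum())
            fp = int((pred & (y_true[:, j] == 0)).sum())
            fn = int((~pred & (y_true[:, j] == 1)).sum())
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom else 0.0
            if f1 > best_f1:
                best_f1, best[j] = f1, t
    return best

def soft_vote(prob_list: list, thresholds: np.ndarray) -> np.ndarray:
    """Cross-family soft-voting ensemble: average member probabilities
    (e.g., a transformer and an instruction-tuned LLM), then apply the
    tuned per-label thresholds."""
    return (np.mean(prob_list, axis=0) >= thresholds).astype(int)
```

A global 0.5 cutoff systematically under-predicts rare labels whose calibrated probabilities rarely exceed 0.5; per-label tuning recovers them, which is consistent with the reported gains on imbalanced labels.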
4. Advanced Evaluation Frameworks and Steerability
Two recent innovations have extended the evaluation regime well beyond raw multi-label accuracy:
- ValueDCG (Zhang et al., 2023): For each model and question, the Discriminator (D) score is the semantic similarity between the model’s answer and the correct baseline value answer, while the Critique (C) score is the similarity between the model’s explanation and the expert rationale. The ValueDCG metric is their average absolute difference over the N evaluation questions:

  $$\mathrm{ValueDCG} = \frac{1}{N}\sum_{i=1}^{N}\left|\,D_i - C_i\,\right|$$
“Know what” scales with model size, but “know why” shows only minimal scaling. Persistent gaps suggest plausible rationalization without deep value alignment.
- EVALUESTEER (Ghate et al., 7 Oct 2025): Systematically generates 165,888 preference pairs by crossing value dimensions (Inglehart–Welzel map) and stylistic dimensions (verbosity, difficulty, confidence, warmth). Steerability is quantified as the pairwise accuracy of RMs given user profiles $U$, over the set of controlled preference pairs $\mathcal{P}$:

  $$\mathrm{Acc}(U) = \frac{1}{|\mathcal{P}|}\sum_{(r_a,\,r_b)\in\mathcal{P}} \mathbb{1}\!\left[\hat{r} = r^{*}_{U}\right]$$

  where $\hat{r}$ is the model-chosen candidate and $r^{*}_{U}$ is the ground-truth candidate aligned with $U$. Even the best systems (GPT-4.1-Mini with chain-of-thought and a "prefer values" instruction) attain steerability well below that of an oracle given only the relevant profile information, revealing major limitations in context utilization, value–style disentanglement, and implicit bias.
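Both evaluation protocols in this section reduce to simple computations over precomputed scores. A minimal sketch, with field and function names that are illustrative rather than the papers' actual interfaces:

```python
from dataclasses import dataclass
import numpy as np

def value_dcg(d_scores, c_scores) -> float:
    """ValueDCG: mean absolute gap between per-question Discriminator
    ('know what') and Critique ('know why') similarity scores.
    A smaller gap means the two capacities track each other."""
    d = np.asarray(d_scores, dtype=float)
    c = np.asarray(c_scores, dtype=float)
    return float(np.mean(np.abs(d - c)))

@dataclass
class PreferencePair:
    """One EVALUESTEER-style controlled pair: the index (0 or 1) of the
    candidate the reward model ranked higher, and the index of the
    candidate aligned with user profile U (illustrative field names)."""
    model_choice: int
    profile_aligned: int

def steerability_accuracy(pairs) -> float:
    """Pairwise accuracy: fraction of pairs where the model's preferred
    candidate matches the profile-aligned ground truth."""
    return sum(p.model_choice == p.profile_aligned for p in pairs) / len(pairs)

# 'Know what' similarities of 0.9/0.8 vs 'know why' of 0.5/0.6 leave a gap of 0.3:
gap = value_dcg([0.9, 0.8], [0.5, 0.6])

# The RM agrees with the profile-aligned choice on 3 of 4 pairs:
pairs = [PreferencePair(0, 0), PreferencePair(1, 0),
         PreferencePair(1, 1), PreferencePair(0, 0)]
acc = steerability_accuracy(pairs)
```

Note that a high Discriminator score with a low Critique score inflates the gap just as the reverse does; ValueDCG penalizes any asymmetry between labeling and justification.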
5. Empirical Findings, Biases, and Error Analysis
Core empirical discoveries and systematic trends include:
- Scaling effects: Larger models improve “know what” (value discrimination) more than “know why” (rationale quality). ValueDCG shrinks as model size grows but remains non-zero. Context priming with value induction lowers ValueDCG for all models, emphasizing strong sensitivity to explicit cues (Zhang et al., 2023).
- Dataset size and diversity: Enlarging and diversifying (Touché’23 vs. Webis-ArgValues-22) exposes the limits of trivial baselines (1-Baseline F₁ drops from 0.16→0.13) but enables more effective supervised learning (BERT macro-F₁ rises from 0.25→0.26 at Level 1, 0.34→0.44 at Level 2) (Mirzakhmedova et al., 2023).
- Effectiveness of hierarchy: While HO categories in the Schwartz framework are learnable at the sentence level, enforcing them via hard gating or presence cascades suppresses end-task recall and offers no reliable Macro-F₁ advantage. Calibration and ensembling remain more effective under compute constraints (Yeste et al., 31 Jan 2026).
- Stylistic and value biases: LLM-based RMs show systematic biases towards secular-rational, self-expression, verbose, high-confidence, and formal/cold style options (verbosity bias, style-over-substance). When values and style conflict, systems default to style preference by up to 33 percentage points (Ghate et al., 7 Oct 2025).
- Adaptability and overfitting: CLAVE’s concept abstraction layer allows accurate, robust alignment to new value definitions from only a small number of annotated samples per value, outperforming purely fine-tuned baselines, especially out-of-distribution (OOD) or when training data is scarce (Yao et al., 2024).
6. Practical Recommendations and Future Directions
Synthesis across recent studies suggests:
- Prefer robust calibration and ensembling over hard constraints: Label-wise threshold optimization and lightweight model ensembling yield the most consistent gains for value detection at sentence or argument level (Yeste et al., 31 Jan 2026).
- Integrate value-grounded reasoning in RM training: Application of chain-of-thought and explicit value balancing in multi-objective architectures (e.g., Mixture-of-Experts on value/style) is recommended (Ghate et al., 7 Oct 2025).
- Broaden cultural-ethical axes and dataset coverage: Extend beyond Schwartz/Inglehart to include underrepresented ontologies (Indigenous, non-Western) and native language corpora (Mirzakhmedova et al., 2023, Ghate et al., 7 Oct 2025).
- Develop transparent, interpretable systems: Emphasize alignment between predictions and rationale explanations (e.g., via attention rationales or integrated gradients), enabling justification as well as accurate labeling (Mirzakhmedova et al., 2023, Zhang et al., 2023).
- Exploit value concept abstraction in evaluation and transfer: Future ValuesML explorations may harness concept-driven transfer for explainability, domain adaptation, and cross-value-system calibration (Yao et al., 2024).
- Audit and mitigate model biases: Routine analysis for implicit cultural, value, and style biases is necessary as model deployment widens (Ghate et al., 7 Oct 2025).
7. Limitations, Open Challenges, and Perspective
The current ValueEval’24 / ValuesML toolkits represent a substantial advance in systematic, large-scale, and fine-grained value evaluation for LLMs. However, persistent challenges remain:
- End-task accuracy on steerability and value detection is well below the oracle or inter-annotator ceiling, even for the largest models and best calibrated ensembles.
- Hard-coded hierarchies, while attractive theoretically, risk compounding errors and suppressing minority value expressions in practical settings.
- LLMs continue to exhibit context- and prompt-driven behavior, with value “understanding” largely contingent on explicit cues or training distribution properties rather than internalized semantic knowledge.
- The gap between plausible rationalization and genuine value alignment (as quantified by ValueDCG) signals risk for applications demanding high safety, normative compliance, or pluralistic adaptability.
Ongoing research aims to bridge these gaps through hierarchical soft conditioning, cross-cultural dataset enrichment, mechanistic interpretability, and improved concept-based calibration frameworks. As the ValuesML field matures, ValueEval’24 benchmarks, datasets, and analytic methodologies will constitute the reference standard for quantifying, auditing, and advancing human value alignment in LLMs (Mirzakhmedova et al., 2023, Zhang et al., 2023, Yao et al., 2024, Ghate et al., 7 Oct 2025, Yeste et al., 31 Jan 2026).