ValuesML: Evaluating Human Values in ML
- ValuesML is an evaluation framework, dataset suite, and methodological paradigm that rigorously assesses human values in ML models using diverse social science theories and adaptive benchmarks.
- It employs fine-grained taxonomies and value grounding protocols—drawing from Schwartz theory and moral foundations—to enable both text and multimodal value detection.
- The framework integrates closed-loop generative benchmarking, label calibration, and retrieval-augmented methods to improve model value alignment across diverse communicative contexts.
ValuesML is an evaluation framework, dataset suite, and methodological paradigm for the rigorous, pluralistic, and adaptive assessment of human values in machine learning models, most prominently large language models (LLMs) and multimodal LLMs (MLLMs). Drawing on social science value theories, closed-loop generative benchmarking, and culture-aware metrics, it provides both practical tools (ValueEval’24/ValuesML, ValueGround) and a conceptual blueprint (Value Compass, full-stack ValuesML). The aim is to investigate not only whether models "know" about human values, but also whether they can recognize, reason about, and conform to those values as expressed across diverse communicative contexts.
1. Taxonomies and Value Grounding
ValuesML is grounded in a fine-grained ontology of basic and higher-order value systems. The canonical instantiation draws from the Schwartz theory of basic human values and its 19-value refinement:
- Basic Schwartz values (19-way taxonomy): Self-direction (thought/action), Stimulation, Hedonism, Achievement, Power (dominance/resources), Face, Security (personal/societal), Tradition, Conformity (rules/interpersonal), Humility, Benevolence (caring/dependability), Universalism (concern/nature/tolerance) (Yeste et al., 31 Jan 2026, Yeste et al., 21 May 2026).
- Higher-order (HO) values: Aggregations such as Growth, Self-Protection, Social Focus, Personal Focus, Openness, Conservation, Self-Transcendence, Self-Enhancement, derived via group-ORs from basic values (Yeste et al., 31 Jan 2026).
- Other frameworks: Moral Foundations Theory (Care/Harm, Fairness, etc.), LLM-unique value factors (Competence, Character, Integrity), and safety-related value taxonomies (e.g., SALAD-Bench) (Yao et al., 13 Jan 2025).
Operationalization involves concise formal definitions for each value, coupled with prompt templates or decision rubrics designed to elicit or test value-conforming and value-violating behaviors in target models (Yao et al., 13 Jan 2025). Annotation protocols (e.g., ValueEval) instruct annotators to label both explicit and implicit value signalling, collapsing "attained" and "constrained" indications into a binary "value present" flag (Yeste et al., 21 May 2026).
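The group-OR derivation of higher-order labels from basic-value labels can be sketched as follows. This is a minimal illustration, not the reference implementation: the `HO_GROUPS` mapping covers only a few values and its membership is an assumption made for the example, as are the function and variable names.

```python
from typing import Dict, List

# Illustrative (assumed) grouping of a few basic values under higher-order
# (HO) values; the exact membership used by ValuesML may differ.
HO_GROUPS: Dict[str, List[str]] = {
    "Openness": ["Self-direction: thought", "Self-direction: action", "Stimulation"],
    "Self-Enhancement": ["Achievement", "Power: dominance", "Power: resources"],
    "Conservation": ["Security: personal", "Security: societal", "Tradition"],
    "Self-Transcendence": ["Benevolence: caring", "Universalism: concern"],
}


def higher_order_labels(basic: Dict[str, int]) -> Dict[str, int]:
    """Group-OR: an HO value is present if any of its member values is present."""
    return {
        ho: int(any(basic.get(value, 0) == 1 for value in members))
        for ho, members in HO_GROUPS.items()
    }


# Binary "value present" flags for one sentence (unlisted values count as 0).
sentence_labels = {"Stimulation": 1, "Tradition": 0, "Benevolence: caring": 1}
ho = higher_order_labels(sentence_labels)
print(ho)
```

Because the aggregation is a plain logical OR, an HO label is never absent when one of its member values is present, so HO labels are always at least as frequent as any single member.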
2. Dataset Design and Task Construction
ValuesML datasets are structured to probe values at scale in both text and multimodal settings. Key instantiations include:
- Sentence-level multi-label value detection (ValueEval’24/ValuesML):
  - 74,231 sentences (44.8k train, 14.9k dev, 14.6k test) annotated for up to 19 Schwartz values.
  - Sources: interview transcripts, online arguments, essays, news, etc. (Yeste et al., 31 Jan 2026).
  - Each sentence is represented as a text paired with a 19-dimensional binary label vector, $(x, \mathbf{y})$ with $\mathbf{y} \in \{0,1\}^{19}$; HO labels are derived from it via group-OR.
- Value grounding in political documents:
  - Each sentence is contextualized at the sentence, window, and document level; the setup supports retrieval-augmented inputs with external moral knowledge (Yeste et al., 21 May 2026).
- Visual and cross-modal value grounding (ValueGround):
  - Each instance comprises a country, a question derived from the World Values Survey (WVS) and collapsed to two binary endpoints, and a paired visualization of the value contrast.
  - The model must select the image aligning with the country's empirically derived value tendency (Wang et al., 7 Apr 2026).
  - "Option-image alignment" and "text-only" regimes test knowledge without visual bottlenecks.
Datasets are constructed via multi-agent pipelines: planners specify scene attributes, specialized agents generate images, editors produce minimal contrasts, and critic agents/human raters enforce semantic alignment and absence of shortcut cues (Wang et al., 7 Apr 2026). Item pools are periodically regenerated and filtered to ensure entropy, novelty, and robustness against contamination (Yao et al., 13 Jan 2025).
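One way to picture the entropy and novelty filtering of a regenerated item pool is the sketch below. It assumes, purely for illustration, that each candidate item comes with binary conform/violate outcomes from a panel of models, that "entropy" means the Shannon entropy of those outcomes, and that novelty is checked by exact match on normalized text. The function names and the `min_entropy` cutoff are hypothetical.

```python
import math
from typing import Dict, List, Set


def outcome_entropy(outcomes: List[int]) -> float:
    """Shannon entropy (bits) of binary outcomes across models; expects a non-empty list."""
    p = sum(outcomes) / len(outcomes)
    if p in (0.0, 1.0):
        return 0.0  # every model agrees, so the item does not discriminate
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def filter_items(
    candidates: Dict[str, List[int]], seen: Set[str], min_entropy: float = 0.5
) -> List[str]:
    """Keep items that are unseen (after normalization) and sufficiently discriminative."""
    kept = []
    for text, outcomes in candidates.items():
        key = " ".join(text.lower().split())  # lowercase and collapse whitespace
        if key in seen:
            continue  # already in an earlier pool: possible contamination
        if outcome_entropy(outcomes) >= min_entropy:
            kept.append(text)
    return kept


candidates = {
    "Item A": [1, 1, 1, 1],  # all models conform: zero entropy
    "Item B": [1, 0, 1, 0],  # maximally discriminative
    "Item C": [1, 1, 1, 0],
    "Item D": [1, 0, 0, 0],  # discriminative but already seen
}
kept = filter_items(candidates, seen={"item d"})
print(kept)
```

A production pipeline would likely use a fuzzier novelty test (for example, embedding similarity) than the exact-match check used here.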
3. Evaluation Methodologies and Metrics
ValuesML emphasizes closed-loop, generative, and pluralistic evaluation protocols:
- Generative Evolving Evaluation: At each epoch, the item-generator produces novel, discriminative prompts or visual scenarios tailored to the current model landscape. Discriminative entropy is used to guide regeneration, with item difficulty adjusted via computerized adaptive testing to maintain informative score distributions (Yao et al., 13 Jan 2025).
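- Scoring Functions: For each value $v_i$ and model $M$, a conformity score $s_i(M)$ is computed from the model's responses by an LLM-based recognizer, which is decomposed into two stages:
  - Concept extraction: extract the key value concepts.
  - Classification: a fine-tuned classifier then yields the conformity judgment.
- Pluralistic Alignment: Overall alignment is a weighted sum over value dimensions,
  $$A(M) = \sum_i w_i \, s_i(M),$$
  with strategy-dependent or population-informed weight vectors $\mathbf{w} = (w_1, \dots, w_n)$ (Yao et al., 13 Jan 2025).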
- Classification Metrics (ValueEval):
  - Macro-F₁ over the 19 basic values; micro-F₁ for aggregate precision/recall (Yeste et al., 21 May 2026).
  - Per-label F₁, with analysis by label frequency and confusion.
  - For visual tasks, per-country and per-model accuracy on the main (visual), text-only, and alignment conditions (Wang et al., 7 Apr 2026).
- Calibration and Aggregation:
  - Label-wise threshold tuning is critical for handling imbalanced labels (+0.03–0.05 Macro-F₁ relative to an untuned default threshold) (Yeste et al., 31 Jan 2026).
  - Small soft-voting ensembles of calibrated models yield further consistent gains (+0.01–0.02 Macro-F₁) (Yeste et al., 31 Jan 2026).
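The calibration and aggregation steps above can be illustrated with a small, dependency-free sketch: soft voting averages per-label probabilities across models, and a grid search then picks one threshold per label to maximize F₁. The toy probabilities, the 0.1-step grid, and all function names are assumptions for this example. The thresholds are tuned and scored on the same toy data for brevity, whereas a real setup would tune on the dev split and report on the test split.

```python
from typing import List, Sequence


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for one label; 0.0 when there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binarize(probs: Sequence[float], threshold: float) -> List[int]:
    return [int(p >= threshold) for p in probs]


def tune_threshold(y_true: Sequence[int], probs: Sequence[float], grid: Sequence[float]) -> float:
    """Return the first grid threshold that maximizes F1 for a single label."""
    return max(grid, key=lambda t: f1(y_true, binarize(probs, t)))


def macro_f1(labels, probs, thresholds) -> float:
    """Unweighted mean of per-label F1 scores."""
    scores = [f1(y, binarize(p, t)) for y, p, t in zip(labels, probs, thresholds)]
    return sum(scores) / len(scores)


def soft_vote(model_probs: List[List[float]]) -> List[float]:
    """Average the per-example probabilities of several models for one label."""
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]


# Toy data: one list per label (label 0 is frequent, label 1 is rare).
labels = [[1, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 0]]
model_a = [[0.9, 0.7, 0.2, 0.4, 0.6, 0.1], [0.1, 0.05, 0.4, 0.2, 0.1, 0.15]]
model_b = [[0.8, 0.6, 0.3, 0.3, 0.7, 0.2], [0.2, 0.05, 0.3, 0.1, 0.1, 0.05]]

ensemble = [soft_vote([a, b]) for a, b in zip(model_a, model_b)]
grid = [i / 10 for i in range(1, 10)]
tuned = [tune_threshold(y, p, grid) for y, p in zip(labels, ensemble)]

macro_default = macro_f1(labels, ensemble, [0.5, 0.5])
macro_tuned = macro_f1(labels, ensemble, tuned)
print(tuned, macro_default, macro_tuned)
```

In this toy case the rare label never reaches a probability of 0.5, so a shared 0.5 cutoff scores zero F₁ on it; a lower per-label threshold recovers it, which is the mechanism behind the reported gains on imbalanced labels.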
4. Model Families, Architectures, and Strategies
ValuesML evaluations span encoder-based, decoder-only, and retrieval-augmented architectures:
- Supervised encoders (DeBERTa-v3-base/large): Trained in multi-label mode with fine-tuned thresholds (Yeste et al., 21 May 2026, Yeste et al., 31 Jan 2026).
- Small LLMs (≤10B parameters): Instruction-tuned; they lag behind encoders when used alone but contribute error diversity in hybrid ensembles (Yeste et al., 31 Jan 2026).
- Zero-shot instruction LLMs (12–123B): Direct prompt-based inference; scaling is non-monotonic, and performance saturates (Yeste et al., 21 May 2026).
- Retrieval-Augmented Generation (RAG): Early fusion of retrieved moral knowledge consistently yields gains in all model families and context regimes; late and cross-attention fusions offer no additional benefit (Yeste et al., 21 May 2026).
- Contextual Variation: Document-level context helps supervised encoders (Macro-F₁ +0.038–0.048) but can harm zero-shot LLMs; window-level results are mixed (Yeste et al., 21 May 2026).
- Hierarchical Gating and HO Pipelines: Hard masking via HO categories (HO→values, Presence→HO→values) sharply reduces recall and is not beneficial, despite strong slice-level learnability for HO pairs (Macro-F₁ ≈ 0.58 for best) (Yeste et al., 31 Jan 2026).
- Multimodal MLLMs: Visual value alignment in ValueGround exposes a consistent drop from text-only (72.8%) to visual grounding (65.8%), with strong models (e.g., Gemini 3 Flash) most robust (Wang et al., 7 Apr 2026).
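The gap between the text-only and visual conditions can be measured per model with two simple quantities: accuracy in each condition, and the share of items whose prediction flips between conditions. The sketch below assumes that a "reversal" is any item where the text-only and visual predictions differ; the source's exact definition may be narrower, and the toy predictions are invented for illustration.

```python
from typing import List


def accuracy(preds: List[str], gold: List[str]) -> float:
    """Fraction of items where the predicted option matches the reference."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def reversal_rate(text_preds: List[str], visual_preds: List[str]) -> float:
    """Fraction of items whose predicted option changes between conditions (assumed definition)."""
    return sum(t != v for t, v in zip(text_preds, visual_preds)) / len(text_preds)


# Toy predictions for four items with binary endpoints "A" and "B".
gold = ["A", "B", "A", "A"]
text_only = ["A", "B", "A", "A"]
visual = ["A", "A", "A", "B"]

text_acc = accuracy(text_only, gold)
visual_acc = accuracy(visual, gold)
flips = reversal_rate(text_only, visual)
print(text_acc, visual_acc, flips)
```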
A plausible implication is that value-sensitive NLP must balance the benefits of contextual expansion, retrieval, and ensembling, matching model architecture to both task format and label statistics.
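Early fusion, as used in the retrieval-augmented setting, amounts to placing retrieved knowledge directly in the model input alongside the sentence and its context. The sketch below is a toy version under several assumptions: the retriever ranks passages by word overlap rather than with a learned retriever, the knowledge passages are illustrative paraphrases of value definitions, and the `[SEP]` separator and all function names are hypothetical.

```python
from typing import List

# Illustrative knowledge passages (assumed); a real system would retrieve
# from a curated moral-knowledge store.
KNOWLEDGE = [
    "Tradition: maintaining cultural, family, or religious customs.",
    "Benevolence: caring means devotion to the welfare of in-group members.",
    "Security: societal means safety and stability in the wider society.",
]


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def early_fusion_input(
    sentence: str, context: List[str], corpus: List[str], k: int = 1, sep: str = " [SEP] "
) -> str:
    """Concatenate retrieved passages, context sentences, and the target sentence."""
    passages = retrieve(sentence, corpus, k)
    return sep.join(passages + context + [sentence])


sentence = "We must protect the welfare of our members."
context = ["The union met on Monday."]
fused = early_fusion_input(sentence, context, KNOWLEDGE)
print(fused)
```

Passing an empty `context` list gives the sentence-level regime, while a longer list approximates the window or document regimes; the classifier itself is unchanged because fusion happens entirely in the input string.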
5. Empirical Findings and Best Practices
Recent studies reveal nuanced effects across context, retrieval, hierarchy, and architecture:
- Context and Retrieval: Document context aids encoders, most visibly for intricate or contextually situated values (e.g., Hedonism +0.10); retrieval-augmented input helps most for less contextually confounded values (e.g., Benevolence: caring +0.064) (Yeste et al., 21 May 2026).
- Label Calibration and Ensembles: Per-label threshold tuning is the dominant compute-frugal lever; ensemble selection further boosts performance, particularly among hard HO slices (Yeste et al., 31 Jan 2026).
- Model Scale: Larger parameter counts do not guarantee improved value detection; careful ablation is needed to validate effectiveness (Yeste et al., 21 May 2026).
- Visual Value Grounding: MLLMs experience substantial cross-modal prediction reversals (e.g., up to 36.9% in Claude Haiku), often failing to integrate text-conditioned priors and visual evidence (Wang et al., 7 Apr 2026).
- Cultural and Domain Adaptivity: Pluralistic metric weighting via cultural survey data or user specification ensures that scores are interpretable and valid across heterogeneous populations (Yao et al., 13 Jan 2025, Wang et al., 7 Apr 2026).
Best practices emphasize periodic refreshing of item generators/recognizers, routine calibration, adversarial item generation to counteract memorization, culturally informed metric weighting, and explainable scoring via rationale extraction (Yao et al., 13 Jan 2025).
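Culturally informed metric weighting can be illustrated as a weighted average of per-value conformity scores, where the weight vector changes with the population or strategy of interest. In this sketch the scores, the weight profiles, and the choice to normalize weights so they sum to one are all assumptions made for the example.

```python
from typing import Dict


def pluralistic_alignment(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-value scores; weights are normalized to sum to 1 (assumed)."""
    total = sum(weights.values())
    return sum(weights[value] * scores[value] for value in weights) / total


# Hypothetical per-value conformity scores for one model.
scores = {"Security": 0.9, "Self-direction": 0.6, "Universalism": 0.8}

# Two hypothetical weight profiles: uniform, and one emphasizing Security.
uniform = {"Security": 1.0, "Self-direction": 1.0, "Universalism": 1.0}
population = {"Security": 0.5, "Self-direction": 0.2, "Universalism": 0.3}

uniform_score = pluralistic_alignment(scores, uniform)
population_score = pluralistic_alignment(scores, population)
print(round(uniform_score, 3), round(population_score, 3))
```

The same model receives different overall scores under different profiles, so the weight vector needs to be reported alongside any aggregate number.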
6. Limitations and Open Challenges
ValuesML faces several practical and principled challenges:
| Challenge | Manifestation | Mitigations |
|---|---|---|
| Value ambiguity & overlap | Fine-grained labels are easily confusable (e.g., Humility, Conformity: interpersonal) (Yeste et al., 21 May 2026) | Knowledge retrieval, expert guidelines |
| Data contamination | Static benchmarks lose validity as LLMs ingest test sets (Yao et al., 13 Jan 2025) | Generative item pools, item novelty checks |
| Cultural specificity | Value salience and meaning shift across groups (Wang et al., 7 Apr 2026) | Pluralistic weighting, local item seeds |
| Cross-modal concept drift | MLLMs fail to preserve text-based value alignment in vision (Wang et al., 7 Apr 2026) | Multimodal fine-tuning, improved image-text alignment |
| Calibration for rare labels | Poorly calibrated thresholds for low-prevalence values depress F₁ (Yeste et al., 31 Jan 2026) | Label-wise tuning, rebalancing, joint loss |
| Hierarchical gating errors | HO masking reduces recall, propagates errors (Yeste et al., 31 Jan 2026) | Soft conditioning, auxiliary HO objectives |
A plausible implication is that future work should prioritize adaptive, generative, and explanation-rich protocols that can robustly accommodate changing model capabilities, data distributions, and cultural interpretation regimes.
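The recall loss from hard hierarchical gating, and the intuition behind soft conditioning, can be seen in a two-function sketch. The hard gate follows the masking idea described in the table. The soft gate is only one hypothetical form of soft conditioning, chosen for illustration: it scales the value probability by a blend of 1 and the HO probability, controlled by an assumed `alpha` parameter.

```python
def hard_gate(p_value: float, p_ho: float, ho_threshold: float = 0.5) -> float:
    """Hard masking: zero out the value probability when the HO parent is predicted absent."""
    return p_value if p_ho >= ho_threshold else 0.0


def soft_gate(p_value: float, p_ho: float, alpha: float = 0.5) -> float:
    """Hypothetical soft conditioning: down-weight rather than mask.

    alpha = 1.0 ignores the HO prediction; alpha = 0.0 multiplies by p_ho.
    """
    return p_value * (alpha + (1 - alpha) * p_ho)


# A confident value prediction whose HO parent falls just below the gate.
p_value, p_ho = 0.8, 0.45

hard = hard_gate(p_value, p_ho)
soft = soft_gate(p_value, p_ho)
print(hard, round(soft, 3))
```

With a 0.5 decision threshold, the hard gate turns this example into a miss because of a borderline HO error, while the soft variant still predicts the value as present.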
7. Extensions and Future Directions
Emerging research directions for ValuesML include multimodal extension, richer relational value structures, and finer granularity:
- Multimodal and Cross-modal Value Assessment: Extending the testbed beyond text and static binary endpoints to support multi-way, relational, and photographic scenarios (Wang et al., 7 Apr 2026); minimizing shortcut heuristics in visual contrasts.
- Dynamic and Domain-Specific Value Sets: Incorporation of new value taxonomies (e.g., Hofstede dimensions, sectoral codes for professions), domain-specific recognizers; adaptation to new societal priorities (Yao et al., 13 Jan 2025).
- Crowd- and Machine-in-the-Loop Evaluation: Periodic human meta-evaluation to recalibrate scoring functions, maintain alignment with evolving human judgments (Yao et al., 13 Jan 2025).
- Explainability: Systematic extraction and reporting of highlighted rationales and concept-level explanations for all model decisions to enable error analysis and bias detection (Yao et al., 13 Jan 2025).
- Hierarchical and Soft Structural Conditioning: Advancing from hard-gated hierarchies to joint learning and soft attention-over-hierarchy approaches for rare or compositional values (Yeste et al., 31 Jan 2026).
- Longitudinal Tracking: Monitoring drift in value alignment and response under model, dataset, and cultural change (Yao et al., 13 Jan 2025).
Together, these directions position ValuesML as a foundational paradigm for the future of value-sensitive, culture-aware, and adaptive machine learning evaluation.