CLAVE Framework: Evaluating LLM Value Alignment
- CLAVE is an adaptive evaluation framework for assessing value alignment in LLMs using a dual-model pipeline and minimal annotation.
- It employs a large LLM as a concept extractor and a fine-tuned small LLM as a value recognizer to ensure robust generalization across scenarios.
- Empirical results on the ValEval dataset demonstrate enhanced performance under textual perturbation and out-of-distribution scenarios compared to baseline methods.
CLAVE is an adaptive evaluation framework designed to assess the value alignment of LLMs through a dual-model pipeline. It explicitly addresses the two primary challenges of open-ended value evaluation: adaptability to evolving human value definitions with minimal annotation and robust generalization across diverse scenarios and textual expressions. CLAVE leverages a large LLM as a Concept Extractor and a smaller, fine-tuned open-source LLM as a Value Recognizer, integrating concept calibration and minimal supervision to maximize flexibility and robustness. The framework is benchmarked on the ValEval dataset, which spans three major value systems and over 13,000 labeled examples, demonstrating empirical advantages over both prompt-based evaluators and direct fine-tuning approaches (Yao et al., 2024).
1. System Architecture and Pipeline
CLAVE implements a two-stage evaluation architecture, consisting of:
- Concept Extractor (Large LLM): Typically a closed-source model (e.g., GPT-4 or ChatGPT), responsible for extracting high-level, scenario-agnostic value concepts from responses given a prompt template. The template specifies the evaluation instruction, value definition, scenario, and response. The extractor generates a set of concepts characterizing behaviors or implications indicating the target value.
- Concept Pool and Mapping: Concepts are aggregated from human-annotated training samples (100 per value type). Samples are embedded and clustered (K-Means) to organize similar behaviors, and the extracted concepts are then distilled into a deduplicated pool by hierarchical clustering. At inference, each new concept is mapped to its closest pool entry if their semantic similarity exceeds a threshold; otherwise, the extracted concept is used unmodified. This supports concept re-use and model calibration.
- Value Recognizer (Small LLM): An open-source decoder-only LLM (e.g., Llama-2-7B, Mistral-7B), fine-tuned via LoRA, receives prompts incorporating value definitions and mapped concepts and outputs a label decision (adhere, oppose, or unrelated). Fine-tuning minimizes the negative log-likelihood of the label tokens.
The operational flow per sample is:
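- Concept Extraction: the large LLM reads the value definition, scenario, and response and returns a set of concepts.
- Concept Mapping: each extracted concept is compared with the concept pool and replaced by its closest entry when the similarity exceeds the threshold.
- Value Recognition: the small LLM receives the value definition and the mapped concepts and outputs the label.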
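The per-sample flow can be sketched as one small function. This is a minimal illustration rather than the authors' code: `Sample`, `evaluate`, and the `toy_*` functions are hypothetical names, and the stubs stand in for the large and small LLMs so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List

LABELS = ("adhere", "oppose", "unrelated")


@dataclass
class Sample:
    scenario: str
    response: str
    value_definition: str


def evaluate(
    sample: Sample,
    extract: Callable[[Sample], List[str]],
    map_concept: Callable[[str], str],
    recognize: Callable[[Sample, List[str]], str],
) -> str:
    """Run extraction, mapping, and recognition for one sample."""
    concepts = extract(sample)                    # large LLM in the real system
    mapped = [map_concept(c) for c in concepts]   # calibration against the pool
    label = recognize(sample, mapped)             # fine-tuned small LLM
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_extract(sample: Sample) -> List[str]:
    return ["declines to help with a harmful request"]


def toy_map(concept: str) -> str:
    pool = {"declines to help with a harmful request": "refusing harmful assistance"}
    return pool.get(concept, concept)  # unmapped concepts pass through unchanged


def toy_recognize(sample: Sample, concepts: List[str]) -> str:
    return "adhere" if "refusing harmful assistance" in concepts else "unrelated"


sample = Sample(
    scenario="A user asks how to pick a neighbor's lock.",
    response="I can't help with that, but a locksmith can assist if you are locked out.",
    value_definition="Avoid facilitating property crime.",
)
result = evaluate(sample, toy_extract, toy_map, toy_recognize)
print(result)
```

Passing the three stages as callables keeps the large and small models interchangeable, in line with the framework being agnostic to the specific model pairing.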
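Concept Pool Construction Algorithm
- Embed the annotated training samples and group them with K-Means so that similar behaviors fall into the same cluster.
- Pass the clustered samples to the Concept Extractor in batches and collect the concepts it returns.
- Deduplicate the collected concepts by hierarchical clustering to obtain the final pool.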
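A rough, dependency-free sketch of pool construction follows. It is an illustration under several assumptions, not the paper's algorithm: the tiny K-Means with deterministic initialization, the single-linkage merge rule, the `min_sim` value, the choice to batch within clusters, and all function names and toy vectors are hypothetical. In practice a sentence-embedding model and a library clustering implementation would replace the toy parts.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def kmeans(points, k, iters=10):
    """Tiny K-Means; the first k points serve as initial centers (assumption)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def deduplicate(concepts, concept_vec, min_sim):
    """Agglomerative merging (single linkage); keep the first concept of each cluster."""
    clusters = [[c] for c in dict.fromkeys(concepts)]  # drop exact duplicates first

    def link(a, b):
        return max(cosine(concept_vec(x), concept_vec(y)) for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[i], clusters[j]) < min_sim:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return [cluster[0] for cluster in clusters]


def build_pool(samples, sample_vecs, k, extract_batch, concept_vec, min_sim=0.9, batch_size=4):
    assign = kmeans(sample_vecs, k)
    raw = []
    for j in range(k):
        cluster = [s for s, a in zip(samples, assign) if a == j]
        for start in range(0, len(cluster), batch_size):
            raw.extend(extract_batch(cluster[start:start + batch_size]))
    return deduplicate(raw, concept_vec, min_sim)


# Toy data: six samples, a stubbed extractor, and hand-made 2-D "embeddings".
samples = ["s0", "s1", "s2", "s3", "s4", "s5"]
sample_vecs = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
CONCEPT_OF = {
    "s0": "refuses harmful request", "s2": "declines harmful request",
    "s4": "refuses harmful request", "s1": "shares private data",
    "s3": "discloses private data", "s5": "shares private data",
}
CONCEPT_VECS = {
    "refuses harmful request": (1.0, 0.0), "declines harmful request": (0.99, 0.1),
    "shares private data": (0.0, 1.0), "discloses private data": (0.1, 0.99),
}
pool = build_pool(
    samples, sample_vecs, k=2,
    extract_batch=lambda batch: [CONCEPT_OF[s] for s in batch],
    concept_vec=CONCEPT_VECS.__getitem__,
)
print(pool)
```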
2. Concept Extraction, Calibration, and Training
The extraction prompt, as defined in the paper’s appendix, instructs the extractor LLM to enumerate “essential, generic” concepts per sample. This approach isolates generalizable indicators of values, supporting robustness to surface variation.
Calibration is performed at inference by mapping each extracted concept to the closest entry in the pool if their semantic similarity exceeds the mapping threshold. This restricts the concepts fed to the Value Recognizer to those aligned with the human-labeled training data, which reduces the distribution shift the recognizer sees at inference.
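The mapping rule can be illustrated with a short sketch. Cosine similarity over embedding vectors is an assumption here (the similarity measure is not specified in this summary), and `map_concept`, the toy pool, and the threshold of 0.8 are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def map_concept(text: str, vec: Vector, pool: List[Tuple[str, Vector]], threshold: float) -> str:
    """Return the closest pool concept if it is similar enough, else the original text."""
    best_text, best_vec = max(pool, key=lambda entry: cosine(vec, entry[1]))
    return best_text if cosine(vec, best_vec) > threshold else text


# Toy pool with hand-made 2-D vectors standing in for real embeddings.
pool = [
    ("refuses harmful request", (1.0, 0.0)),
    ("shares private data", (0.0, 1.0)),
]
near = map_concept("declines harmful request", (0.99, 0.1), pool, threshold=0.8)
far = map_concept("offers emotional support", (0.7, 0.7), pool, threshold=0.8)
print(near)
print(far)
```

The first concept is close to a pool entry and is replaced by it; the second is not similar enough to any entry and passes through unchanged.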
Fine-tuning of the small model employs LoRA with default rank, batch size 8, bf16 precision, standard weight decay, and the learning rate specified in the paper. Training continues until convergence on the development set with no additional explicit regularization.
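Written out, the label-token objective might look as follows. The notation is assumed for illustration: $x$ is the recognizer prompt (value definition, mapped concepts, and the text under evaluation), $y_1, \dots, y_T$ are the tokens of the gold label, $\theta$ denotes the LoRA parameters, and $\mathcal{D}$ is the training set. Averaging over examples is likewise an assumption.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{|\mathcal{D}|}
    \sum_{(x,\,y) \in \mathcal{D}}
    \sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid x,\, y_{<t}\right)
```

Only the label tokens contribute to the loss; the prompt tokens act purely as conditioning context.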
3. ValEval Dataset
ValEval, developed for the CLAVE evaluation, consists of over 13,000 (text, value, label) examples covering three value frameworks:
- Social Risk Categories: 14 hazard classes sourced from BeaverTails.
- Schwartz Basic Human Values: 10 core motivational dimensions.
- Moral Foundations: 5 primary moral axes.
Each system is split into original training (100 samples/value x labels), original test, perturbation test (systematic textual edits), and out-of-distribution (OOD) generalization (new data source). Labels (adhere/oppose/unrelated) are balanced, with three expert annotators and majority aggregation; inter-annotator agreement is reported at 80–88%.
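Majority aggregation over three annotators can be sketched as below. The handling of three-way splits (returning `None`) and the pairwise agreement statistic are assumptions for illustration; the exact agreement measure behind the reported 80–88% is not stated here.

```python
from collections import Counter
from itertools import combinations


def majority_label(votes):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def pairwise_agreement(items):
    """Share of annotator pairs that agree, pooled over all items (one possible measure)."""
    pairs = [a == b for votes in items for a, b in combinations(votes, 2)]
    return sum(pairs) / len(pairs)


annotations = [
    ("adhere", "adhere", "oppose"),
    ("unrelated", "unrelated", "unrelated"),
    ("adhere", "oppose", "unrelated"),  # three-way split: no majority
]
gold = [majority_label(votes) for votes in annotations]
rate = pairwise_agreement(annotations)
print(gold, round(rate, 3))
```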
| Value System | #train | #test_orig | #test_pert | #test_gen |
|---|---|---|---|---|
| Social Risks | 2,800 | 1,000 | 668 | 370 |
| Schwartz Theory | 2,463 | 1,000 | 603 | 399 |
| Moral Foundations | 1,500 | 1,000 | 300 | 1,000 |
4. Empirical Evaluation and Findings
Performance is quantified primarily by accuracy across original, perturbation, and OOD test splits. Benchmarked baselines include prompt-based LLM evaluators (various prompting paradigms using GPT-4), fine-tuned LMs (GPT-2-Large, Phi-3, Llama-2-7B, Mistral-7B), and human annotators.
Summary accuracy (%) from Table 2 of the paper, listed as original/perturbation/generalization:
| Approach | Social Risks (orig/pert/gen) | Schwartz (orig/pert/gen) | Moral Foundations (orig/pert/gen) |
|---|---|---|---|
| Prompt-based… | 84.9/86.9/91.1 | 55.5/81.5/82.4 | 56.2/93.0/47.5 |
| Tuning-based Mistral | 88.6/76.5/53.5 | 76.3/70.9/76.2 | 56.1/93.7/48.0 |
| CLAVE–Llama | 85.0/82.1/83.7 | 69.9/82.1/83.7 | 56.8/93.7/53.8 |
| CLAVE–Mistral | 88.4/84.0/88.7 | 75.3/75.1/82.5 | 57.4/88.7/49.3 |
Key findings:
- Prompt-based LLMs attain high in-domain accuracy but generalize poorly to unfamiliar value theories (adaptability problem).
- Directly fine-tuned models achieve strong original-set accuracy but degrade under textual perturbation or OOD (generalizability problem).
- CLAVE is reported to offer the best overall balance, adapting across value systems while largely retaining accuracy under perturbation and scenario shift.
Additional results indicate better data efficiency when only a small number of training samples per value is available, and component ablations show significant gains when the large extractor and the small recognizer are used together. Concept similarity distributions are stable across perturbation and generalization splits, supporting robust abstraction.
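As a worked example of the generalizability gap, the snippet below computes accuracy drops relative to the original test split for two rows of the table above, using the Social Risks column only. The function and variable names are illustrative, and one column does not capture the full picture across value systems.

```python
# Social Risks accuracies (orig, pert, gen) copied from the summary table above.
TABLE2_SOCIAL_RISKS = {
    "Tuning-based Mistral": (88.6, 76.5, 53.5),
    "CLAVE-Mistral": (88.4, 84.0, 88.7),
}


def drops(orig, pert, gen):
    """Accuracy lost relative to the original split (negative means a gain)."""
    return round(orig - pert, 1), round(orig - gen, 1)


gaps = {name: drops(*scores) for name, scores in TABLE2_SOCIAL_RISKS.items()}
for name, (pert_drop, gen_drop) in gaps.items():
    print(f"{name}: perturbation drop {pert_drop:+.1f}, OOD drop {gen_drop:+.1f}")
```

On this column the directly fine-tuned model loses about 35 points on the OOD split, while the CLAVE variant stays within half a point of its original-split accuracy.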
5. Methodological Limitations and Open Issues
CLAVE’s success hinges on the ability of the large LLM to propose generic concepts that accurately encode value dimensions; deficiencies in extraction directly impact recognizer reliability. Construction of the concept pool introduces pre-processing latency, requiring iterative clustering and deduplication, and concept mapping may be sensitive to the choice of similarity threshold.
Rare failure modes include edge cases where concept mapping fails to properly account for semantic drift or outlier values, indicating a potential need for dynamic, active expansion of the pool. The framework is agnostic to pairing choices for large/small models, and implications for cross-lingual and cross-cultural value transfer remain to be substantiated. Interpretability is enhanced by exposing intermediate concept reasoning, but deeper study of human-model alignment is pending (Yao et al., 2024).
6. Implementation Guidelines and Practical Use
Recommended deployment for a custom value system involves the following steps:
- Collect 50–100 annotated (scenario, response, value, label) samples per value.
- Construct the concept pool via embedding, K-Means clustering, and batched concept extraction (4 samples per large LLM prompt).
- Deduplicate pool entries at the concept text level by hierarchical clustering.
- Fine-tune the small model using LoRA (batch size 8, bf16, learning rate as specified in the paper).
- Set the mapping threshold by tuning it on a development set.
- At inference, extract and map concepts for each new sample, then run the value recognizer.
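Threshold tuning on a development set can be sketched as a simple sweep. Everything here is a hypothetical setup: each dev item records the best pool similarity of its concept, the recognizer's prediction with the mapped concept, its prediction with the raw concept, and the gold label. A real run would call the recognizer instead of reading cached predictions.

```python
from typing import List, Tuple

# (best pool similarity, prediction if mapped, prediction if left raw, gold label)
DevItem = Tuple[float, str, str, str]


def accuracy_at(threshold: float, dev: List[DevItem]) -> float:
    correct = 0
    for sim, mapped_pred, raw_pred, gold in dev:
        pred = mapped_pred if sim > threshold else raw_pred
        correct += pred == gold
    return correct / len(dev)


def tune_threshold(dev: List[DevItem], candidates: List[float]) -> Tuple[float, float]:
    """Return the (threshold, accuracy) pair with the highest dev accuracy."""
    return max(((t, accuracy_at(t, dev)) for t in candidates), key=lambda pair: pair[1])


dev_set = [
    (0.95, "adhere", "unrelated", "adhere"),
    (0.85, "oppose", "unrelated", "oppose"),
    (0.60, "adhere", "oppose", "oppose"),
    (0.40, "adhere", "unrelated", "unrelated"),
]
best_threshold, best_accuracy = tune_threshold(dev_set, [0.3, 0.5, 0.7, 0.9])
print(best_threshold, best_accuracy)
```

A threshold that is too low forces poor matches onto pool entries, while one that is too high discards useful matches, so the sweep looks for the value in between that maximizes dev accuracy.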
Practitioner recommendations include:
- Prefer the most cost-effective large LLM available; empirical results are robust to using ChatGPT vs. GPT-4.
- Prioritize sampling diversity at the concept level rather than solely at the text level to maximize coverage.
- Regularly monitor concept embedding distributions to detect drift in novel domains.
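One simple way to monitor drift, offered as an assumption rather than a method from the paper, is to track how often new concepts fail to map to any pool entry and compare that rate with a reference set. The function names, the threshold, and the tolerance of 0.15 are all hypothetical.

```python
from typing import Sequence


def unmapped_rate(best_sims: Sequence[float], threshold: float) -> float:
    """Fraction of concepts whose best pool similarity does not exceed the threshold."""
    return sum(s <= threshold for s in best_sims) / len(best_sims)


def drift_alert(
    reference_sims: Sequence[float],
    new_sims: Sequence[float],
    threshold: float,
    tolerance: float = 0.15,
) -> bool:
    """Flag drift when the unmapped rate rises by more than `tolerance`."""
    increase = unmapped_rate(new_sims, threshold) - unmapped_rate(reference_sims, threshold)
    return increase > tolerance


reference = [0.90, 0.85, 0.80, 0.60]   # best similarities on in-domain data
incoming = [0.90, 0.50, 0.40, 0.65]    # best similarities in a new domain
alert = drift_alert(reference, incoming, threshold=0.7)
print(alert)
```

A sustained alert would suggest that the pool no longer covers the concepts seen in the new domain and may need to be extended.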
7. Prospects and Extensions
CLAVE's architecture enables both interpretability (by surfacing intermediate concepts) and targeted adaptation across value systems with minimal annotation. Open research avenues include transparent studies of human–model alignment at the concept level, systematic evaluation of multi-model or multi-tier architectures, active or user-driven expansion of the concept pool, and cultural or multilingual extensions. Integration of dynamic pool management and further benchmarking across additional value ontologies are proposed future directions (Yao et al., 2024).