
ConflictScope: Automated LLM Value Conflict Evaluation

Updated 2 October 2025
  • ConflictScope is an automated pipeline that constructs dilemmas by pitting pairs of values against each other to empirically recover LLM value rankings.
  • It systematically generates, deduplicates, and filters scenarios using LLM-based prompt templating and cosine similarity to ensure realistic and discriminative dilemmas.
  • Empirical findings reveal that evaluation formats significantly influence observed LLM value priorities, with explicit system prompts moderately steering model behavior.

ConflictScope is an automated pipeline for evaluating how LLMs prioritize competing values when presented with explicit dilemmas. Designed in response to the scarcity of explicit value conflicts in standard alignment benchmarks, ConflictScope systematically constructs dilemmas between pairs of values from a user-defined set, prompts models with these scenarios, and elicits preferences to recover an empirical value ranking. Its dual-mode evaluation—in both controlled multiple-choice settings and more interactive, open-ended prompts—yields insight into how LLMs trade off between values under realistic conditions and investigates the degree to which explicit value prioritizations (articulated in system prompts) can steer LLM decisions under conflict (Liu et al., 29 Sep 2025).

1. Motivation and Overview

Traditional alignment datasets rarely expose LLMs to explicit value conflicts, leaving open questions about how these models arbitrate among competing priorities when deployed in scenarios where values such as helpfulness, harmlessness, and user autonomy must be traded off. ConflictScope addresses this by automating the generation of dilemmas, each presenting a binary choice between two values. Given a value set (e.g., HHH—helpfulness, honesty, harmlessness; “Personal-Protective”; or ModelSpec), ConflictScope generates, filters, and deduplicates scenarios, then evaluates LLM choices and free-text responses against these conflicts.

This pipeline enables quantification of value prioritization across models, interaction formats, and prompting strategies, highlighting discrepancies between “expressed” and “revealed” value preferences.

2. Scenario Generation and Filtering Pipeline

ConflictScope employs a top-down, value-guided scenario synthesis:

  • For each unordered pair of values in the set, a strong LLM (e.g., Claude 3.5 Sonnet) is prompted to generate numerous high-level summaries of value conflicts, with templated prompts varying the severity of potential harm/benefit, the context, and the user background.
  • Each summary is expanded into a detailed scenario including:
    • User persona and context
    • Two mutually exclusive actions, one supporting each value
  • Automatic deduplication uses cosine similarity over sentence embeddings with a 0.8 threshold to filter out near-duplicates.
  • Filtering employs a separate LLM judge (GPT-4.1) to verify six properties: realism, specificity, feasibility, non-impossibility, explicit value-guidedness, and the presence of a genuine dilemma.
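The deduplication step described above can be sketched as a greedy filter over scenario embeddings: a new scenario is kept only if its cosine similarity to every previously kept scenario stays below the 0.8 threshold. This is a minimal sketch (the function name and embedding source are assumptions, not the paper's code); in practice the embeddings would come from a sentence-embedding model.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.8):
    """Greedy near-duplicate filter: keep a scenario only if its cosine
    similarity to every previously kept scenario is below the threshold.
    `embeddings` is a list of sentence-embedding vectors, one per scenario."""
    kept = []  # indices of retained scenarios
    for i, emb in enumerate(embeddings):
        v = emb / np.linalg.norm(emb)
        if all(
            float(v @ (embeddings[j] / np.linalg.norm(embeddings[j]))) < threshold
            for j in kept
        ):
            kept.append(i)
    return kept
```

Applied to a batch of generated scenarios, this returns the indices of those surviving deduplication; order-dependence means earlier scenarios are favored over later near-duplicates.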

Ablation studies in the paper show that the full pipeline (with prompt templating and filtering) produces scenarios that are more challenging, as reflected in lower inter-model agreement, compared to less rigorous generation or filtering.

3. Evaluation Methodology

Two core formats are used to probe model preferences:

  • Multiple-Choice (MCQ):
    • Each scenario presents two concrete actions, and the model is forced to select one.
    • This format exposes the “expressed preference” under an explicit trade-off.
  • Open-Ended:
    • The model is presented with a scenario as a natural-language user prompt (written by another LLM).
    • Its free-text response is judged (again by an LLM) to determine which action (and hence which value) it actually advanced—capturing the model’s “revealed preference.”
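Both formats ultimately reduce to the same record: for each dilemma pitting two values against each other, note which value the model's choice (MCQ) or judged free-text response (open-ended) advanced. A minimal sketch of tallying these records into pairwise win counts (the tuple format here is a hypothetical representation, not the paper's data schema):

```python
from collections import Counter

def tally_wins(results):
    """results: iterable of (value_a, value_b, chosen) tuples, where `chosen`
    is the value the model's answer advanced in that dilemma (per the MCQ
    selection or the LLM judge's verdict on the free-text response).
    Returns win counts over ordered value pairs, ready for ranking aggregation."""
    wins = Counter()
    for value_a, value_b, chosen in results:
        loser = value_b if chosen == value_a else value_a
        wins[(chosen, loser)] += 1
    return wins
```

The same tally function serves both evaluation modes, which is what makes "expressed" and "revealed" rankings directly comparable.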

Pairwise preferences extracted across dilemmas are aggregated into a ranking using the Bradley–Terry model:

$$P(V_i > V_j) = \frac{\exp(\theta_i)}{\exp(\theta_i) + \exp(\theta_j)}$$

where $\theta_i$ is a latent parameter representing the "priority" of value $V_i$. Repeated pairwise judgements across scenarios yield maximum likelihood rankings for each model and evaluation format.
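Fitting the latent priorities from pairwise win counts can be done with the classic minorization–maximization update for the Bradley–Terry model (Zermelo's iterative algorithm). This is a sketch of that standard fitting procedure, not the paper's implementation; `wins[(i, j)]` counts dilemmas in which value i beat value j.

```python
import math

def fit_bradley_terry(wins, values, iters=200):
    """Maximum-likelihood Bradley-Terry fit via the MM update:
    p_i <- W_i / sum_j n_ij / (p_i + p_j), where W_i is i's total wins
    and n_ij is the number of comparisons between i and j.
    Returns theta_i = log p_i (defined up to a common additive constant)."""
    p = {v: 1.0 for v in values}  # p_i = exp(theta_i), up to scale
    for _ in range(iters):
        new_p = {}
        for i in values:
            w_i = sum(wins.get((i, j), 0) for j in values if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in values if j != i
            )
            new_p[i] = w_i / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {v: x / total for v, x in new_p.items()}  # normalize for stability
    return {v: math.log(x) for v, x in p.items()}
```

Sorting values by their fitted $\theta$ gives the empirical ranking; the fitted model also predicts the win probability for any pair via the logistic formula above.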

4. Empirical Findings and Value Prioritization

ConflictScope uncovers substantive discrepancies in model prioritization depending on evaluation format:

  • Under MCQ, models tend to select “protective” values (e.g., harmlessness, compliance), congruent with RLHF objectives that emphasize risk and safe deployment.
  • In open-ended settings simulating deployment, models systematically shift toward “personal” values such as user autonomy or helpfulness, even at the expense of protective values. This reflects a more liberal, user-driven stance under looser prompting conditions.
  • Including explicit value orderings and conflict-resolution guidelines in the model’s system prompt nudges value rankings closer to a target ordering, improving value-ranking agreement by 14% on average.
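Quantifying how closely a recovered ranking matches a target ordering requires an agreement score. The paper's exact metric is not reproduced here; one plausible instantiation, sketched below, is the fraction of value pairs ordered the same way in both rankings (a rescaled Kendall tau):

```python
from itertools import combinations

def ranking_agreement(recovered, target):
    """Fraction of value pairs ordered identically in both rankings.
    1.0 means identical orderings; 0.0 means fully reversed.
    (An assumed metric for illustration, not necessarily the paper's.)"""
    pos_r = {v: i for i, v in enumerate(recovered)}
    pos_t = {v: i for i, v in enumerate(target)}
    pairs = list(combinations(target, 2))
    agree = sum(
        1 for a, b in pairs
        if (pos_r[a] < pos_r[b]) == (pos_t[a] < pos_t[b])
    )
    return agree / len(pairs)
```

Comparing this score with and without an explicit value ordering in the system prompt is one way to measure the steering effect reported above.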

These findings reveal that alignment evaluated via MCQ formats may systematically misrepresent deployed behavior, and that system prompting can partially—but not fully—steer model priorities under real dilemma conditions.

5. Technical Contributions and Methodological Insights

ConflictScope establishes a robust, automated methodology:

Component           | Function                         | Details
------------------- | -------------------------------- | ----------------------------------------------------
Scenario Generation | Create value-focused dilemmas    | LLM-based prompt templates, deduplication, filtering
Evaluation Modes    | Elicit model preferences         | MCQ (expressed); open-ended (revealed)
Aggregation         | Compute empirical value ranking  | Bradley–Terry model over pairwise win/loss
  • The top-down, value pair–guided construction of dilemmas ensures that every conflict is traceable to specific principles, avoiding accidental or ambiguous dilemmas.
  • Deduplication and fine-grained LLM-based scenario filtering produce higher-quality and more discriminative dilemmas, as validated by decreased inter-model agreement.
  • The methodology supports extensibility: new value sets can be substituted, and the pipeline re-applied for different alignment audits.

6. Implications for Alignment and Deployment

ConflictScope demonstrates that:

  • The format of evaluation (MCQ vs. open-ended) materially impacts observed LLM value priorities. This suggests that benchmark scores reported in MCQ settings may systematically overestimate alignment with certain values under actual interaction conditions.
  • Explicit value orderings in system prompts can “steer” but not fully determine model behavior under conflict, indicating moderate, actionable leverage for post-training value alignment.
  • The dual-mode evaluation architecture quantifies both “expressed” and “revealed” preference gaps, which is critical for safe LLM deployment in ethically sensitive or high-stakes domains.

These observations point to the need for richer, value-conflict–focused benchmarks in both research and practical evaluation of LLM alignment.

7. Outlook and Research Directions

ConflictScope establishes a foundation for future work in LLM value alignment:

  • Systematic and automated scenario generation for new value sets or culturally grounded value codifications.
  • Analysis of steerability and plasticity of value rankings with respect to various post-training interventions, including instruction tuning and advanced prompt engineering.
  • Empirical audits of model behavior over time—tracking internal and external factors (model version, training updates) affecting value prioritization.
  • Deployment-oriented simulations that bridge the gap between “expressed” and “revealed” preferences to yield reliable alignment guarantees in real-world settings.

By foregrounding value conflict in evaluation, ConflictScope provides essential tooling and methodology for moving beyond surface-level alignment and into the domains where LLMs make consequential tradeoffs among core human principles.

References

  • Liu et al., “ConflictScope: Automated LLM Value Conflict Evaluation” (29 Sep 2025).