
ConflictScope: Automated LLM Value Conflict Evaluation

Updated 2 October 2025
  • ConflictScope is an automated pipeline that constructs dilemmas by pitting pairs of values against each other to empirically recover LLM value rankings.
  • It systematically generates, deduplicates, and filters scenarios using LLM-based prompt templating and cosine similarity to ensure realistic and discriminative dilemmas.
  • Empirical findings reveal that evaluation formats significantly influence observed LLM value priorities, with explicit system prompts moderately steering model behavior.

ConflictScope is an automated pipeline for evaluating how LLMs prioritize competing values when presented with explicit dilemmas. Designed in response to the scarcity of explicit value conflicts in standard alignment benchmarks, ConflictScope systematically constructs dilemmas between pairs of values from a user-defined set, prompts models with these scenarios, and elicits preferences to recover an empirical value ranking. Its dual-mode evaluation—in both controlled multiple-choice settings and more interactive, open-ended prompts—yields insight into how LLMs trade off between values under realistic conditions and investigates the degree to which explicit value prioritizations (articulated in system prompts) can steer LLM decisions under conflict (Liu et al., 29 Sep 2025).

1. Motivation and Overview

Traditional alignment datasets rarely expose LLMs to explicit value conflicts, leaving open questions about how these models arbitrate among competing priorities when deployed in scenarios where values such as helpfulness, harmlessness, and user autonomy must be traded off. ConflictScope addresses this by automating the generation of dilemmas, each presenting a binary choice between two values. Given a value set (e.g., HHH—helpfulness, honesty, harmlessness; “Personal-Protective”; or ModelSpec), ConflictScope generates, filters, and deduplicates scenarios, then evaluates LLM choices and free-text responses against these conflicts.

This pipeline enables quantification of value prioritization across models, interaction formats, and prompting strategies, highlighting discrepancies between “expressed” and “revealed” value preferences.

2. Scenario Generation and Filtering Pipeline

ConflictScope employs a top-down, value-guided scenario synthesis:

  • For each unordered pair of values in the set, a strong LLM (e.g., Claude 3.5 Sonnet) is prompted to generate numerous high-level summaries of value conflicts, with templated prompts varying the severity of potential harm/benefit, the context, and the user background.
  • Each summary is expanded into a detailed scenario including:
    • User persona and context
    • Two mutually exclusive actions, one supporting each value
  • Automatic deduplication uses cosine similarity over sentence embeddings with a 0.8 threshold to filter out near-duplicates.
  • Filtering employs a separate LLM judge (GPT-4.1) to verify six properties: realism, specificity, feasibility, non-impossibility, explicit value-guidedness, and the presence of a genuine dilemma.
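The deduplication step described above can be sketched as a greedy filter over scenario embeddings: a new scenario is kept only if its cosine similarity to every previously kept scenario stays below the 0.8 threshold. This is a minimal sketch (the function name and embedding source are assumptions, not the paper's code); in practice the embeddings would come from a sentence-embedding model.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.8):
    """Greedy near-duplicate filter: keep a scenario only if its cosine
    similarity to every previously kept scenario is below the threshold.
    `embeddings` is a list of sentence-embedding vectors, one per scenario."""
    kept = []  # indices of retained scenarios
    for i, emb in enumerate(embeddings):
        v = emb / np.linalg.norm(emb)
        if all(
            float(v @ (embeddings[j] / np.linalg.norm(embeddings[j]))) < threshold
            for j in kept
        ):
            kept.append(i)
    return kept
```

Applied to a batch of generated scenarios, this returns the indices of those surviving deduplication; order-dependence means earlier scenarios are favored over later near-duplicates.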

Ablation studies in the paper show that the full pipeline (with prompt templating and filtering) produces scenarios that are more challenging, as reflected in lower inter-model agreement, compared to less rigorous generation or filtering.

3. Evaluation Methodology

Two core formats are used to probe model preferences:

  • Multiple-Choice (MCQ):
    • Each scenario presents two concrete actions, and the model is forced to select one.
    • This format exposes the “expressed preference” under an explicit trade-off.
  • Open-Ended:
    • The model is presented with a scenario as a natural-language user prompt (written by another LLM).
    • Its free-text response is judged (again by an LLM) to determine which action (and hence which value) it actually advanced—capturing the model’s “revealed preference.”
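Both formats ultimately reduce to the same record: for each dilemma pitting two values against each other, note which value the model's choice (MCQ) or judged free-text response (open-ended) advanced. A minimal sketch of tallying these records into pairwise win counts (the tuple format here is a hypothetical representation, not the paper's data schema):

```python
from collections import Counter

def tally_wins(results):
    """results: iterable of (value_a, value_b, chosen) tuples, where `chosen`
    is the value the model's answer advanced in that dilemma (per the MCQ
    selection or the LLM judge's verdict on the free-text response).
    Returns win counts over ordered value pairs, ready for ranking aggregation."""
    wins = Counter()
    for value_a, value_b, chosen in results:
        loser = value_b if chosen == value_a else value_a
        wins[(chosen, loser)] += 1
    return wins
```

The same tally function serves both evaluation modes, which is what makes "expressed" and "revealed" rankings directly comparable.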

Pairwise preferences extracted across dilemmas are aggregated into a ranking using the Bradley–Terry model:

$$P(V_i > V_j) = \frac{\exp(\theta_i)}{\exp(\theta_i) + \exp(\theta_j)}$$

where $\theta_i$ is a latent parameter representing the "priority" of value $V_i$. Repeated pairwise judgements across scenarios yield maximum likelihood rankings for each model and evaluation format.
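Fitting the latent priorities from pairwise win counts can be done with the classic minorization–maximization update for the Bradley–Terry model (Zermelo's iterative algorithm). This is a sketch of that standard fitting procedure, not the paper's implementation; `wins[(i, j)]` counts dilemmas in which value i beat value j.

```python
import math

def fit_bradley_terry(wins, values, iters=200):
    """Maximum-likelihood Bradley-Terry fit via the MM update:
    p_i <- W_i / sum_j n_ij / (p_i + p_j), where W_i is i's total wins
    and n_ij is the number of comparisons between i and j.
    Returns theta_i = log p_i (defined up to a common additive constant)."""
    p = {v: 1.0 for v in values}  # p_i = exp(theta_i), up to scale
    for _ in range(iters):
        new_p = {}
        for i in values:
            w_i = sum(wins.get((i, j), 0) for j in values if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in values if j != i
            )
            new_p[i] = w_i / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {v: x / total for v, x in new_p.items()}  # normalize for stability
    return {v: math.log(x) for v, x in p.items()}
```

Sorting values by their fitted $\theta$ gives the empirical ranking; the fitted model also predicts the win probability for any pair via the logistic formula above.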

4. Empirical Findings and Value Prioritization

ConflictScope uncovers substantive discrepancies in model prioritization depending on evaluation format:

  • Under MCQ, models tend to select “protective” values (e.g., harmlessness, compliance), congruent with RLHF objectives that emphasize risk and safe deployment.
  • In open-ended settings simulating deployment, models systematically shift toward “personal” values such as user autonomy or helpfulness, even at the expense of protective values. This reflects a more liberal, user-driven stance under looser prompting conditions.
  • Including explicit value orderings and conflict-resolution guidelines in the model’s system prompt nudges value rankings closer to a target ordering, improving value-ranking agreement by 14% on average.
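Quantifying how closely a recovered ranking matches a target ordering requires an agreement score. The paper's exact metric is not reproduced here; one plausible instantiation, sketched below, is the fraction of value pairs ordered the same way in both rankings (a rescaled Kendall tau):

```python
from itertools import combinations

def ranking_agreement(recovered, target):
    """Fraction of value pairs ordered identically in both rankings.
    1.0 means identical orderings; 0.0 means fully reversed.
    (An assumed metric for illustration, not necessarily the paper's.)"""
    pos_r = {v: i for i, v in enumerate(recovered)}
    pos_t = {v: i for i, v in enumerate(target)}
    pairs = list(combinations(target, 2))
    agree = sum(
        1 for a, b in pairs
        if (pos_r[a] < pos_r[b]) == (pos_t[a] < pos_t[b])
    )
    return agree / len(pairs)
```

Comparing this score with and without an explicit value ordering in the system prompt is one way to measure the steering effect reported above.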

These findings reveal that alignment evaluated via MCQ formats may systematically misrepresent deployed behavior, and that system prompting can partially—but not fully—steer model priorities under real dilemma conditions.

5. Technical Contributions and Methodological Insights

ConflictScope establishes a robust, automated methodology:

Component           | Function                         | Details
------------------- | -------------------------------- | ----------------------------------------------------
Scenario Generation | Create value-focused dilemmas    | LLM-based prompt templates, deduplication, filtering
Evaluation Modes    | Elicit model preferences         | MCQ (expressed); open-ended (revealed)
Aggregation         | Compute empirical value ranking  | Bradley–Terry model over pairwise win/loss
  • The top-down, value pair–guided construction of dilemmas ensures that every conflict is traceable to specific principles, avoiding accidental or ambiguous dilemmas.
  • Deduplication and fine-grained LLM-based scenario filtering produce higher-quality and more discriminative dilemmas, as validated by decreased inter-model agreement.
  • The methodology supports extensibility: new value sets can be substituted, and the pipeline re-applied for different alignment audits.

6. Implications for Alignment and Deployment

ConflictScope demonstrates that:

  • The format of evaluation (MCQ vs. open-ended) materially impacts observed LLM value priorities. This suggests that benchmark scores reported in MCQ settings may systematically overestimate alignment with certain values under actual interaction conditions.
  • Explicit value orderings in system prompts can “steer” but not fully determine model behavior under conflict, indicating moderate, actionable leverage for post-training value alignment.
  • The dual-mode evaluation architecture quantifies both “expressed” and “revealed” preference gaps, which is critical for safe LLM deployment in ethically sensitive or high-stakes domains.

These observations point to the need for richer, value-conflict–focused benchmarks in both research and practical evaluation of LLM alignment.

7. Outlook and Research Directions

ConflictScope establishes a foundation for future work in LLM value alignment:

  • Systematic and automated scenario generation for new value sets or culturally grounded value codifications.
  • Analysis of steerability and plasticity of value rankings with respect to various post-training interventions, including instruction tuning and advanced prompt engineering.
  • Empirical audits of model behavior over time—tracking internal and external factors (model version, training updates) affecting value prioritization.
  • Deployment-oriented simulations that bridge the gap between “expressed” and “revealed” preferences to yield reliable alignment guarantees in real-world settings.

By foregrounding value conflict in evaluation, ConflictScope provides essential tooling and methodology for moving beyond surface-level alignment and into the domains where LLMs make consequential tradeoffs among core human principles.

References

  • Liu et al., “ConflictScope: Automated LLM Value Conflict Evaluation” (29 Sep 2025).