Stress-Testing Model Specs Reveals Character Differences among Language Models (2510.07686v1)
Abstract: LLMs are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy, we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous examples of issues in current model specs, such as direct contradictions and interpretive ambiguities of several principles. Additionally, our generated dataset reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we report the value prioritization patterns and differences among these models.
Explain it Like I'm 14
Overview
This paper looks at how today’s AI chatbots (like those from Anthropic, OpenAI, Google, and xAI) follow their “rulebooks.” These rulebooks—often called model specifications or AI constitutions—tell the AI what values to follow and how to behave safely and helpfully. The authors build a way to “stress-test” these rulebooks by giving AIs tough questions where two good values clash (for example, “be helpful” vs. “be safe”). By seeing where different AIs disagree, the paper uncovers gaps, contradictions, and fuzzy parts in the rules.
What questions does the paper ask?
The paper tries to answer simple, practical questions:
- Do AI rulebooks contain conflicts where following one principle breaks another?
- Are the rules detailed enough to guide responses in tricky, real-world situations?
- When AIs disagree a lot on tough questions, does that reveal problems in the rulebook?
- Do different AI models show different “personalities” or value preferences?
- How do AIs behave on sensitive topics (like safety, politics, or mental health)?
How did the researchers test this?
The team created a huge set of challenging situations and watched how 12 leading AIs responded. Here’s the approach in everyday terms:
- Building tough scenarios: They started with a long list of values (over 3,000, like “fairness,” “national sovereignty,” “emotional support,” “business effectiveness”). Then they paired values and asked models to generate questions that force a tradeoff between the pair—meaning you can’t perfectly satisfy both values at once. Think of it like asking: “Write a persuasive speech that’s exciting and confident, but also perfectly cautious and neutral”—a hard balance!
- Making it more challenging: They added variations that nudge the question to favor one value over the other (like a user pushing for a specific outcome), so the AI has to handle bias and pressure.
- Getting lots of data: They generated several hundred thousand scenarios (on the order of 300,000–410,000) using different "reasoning" models, then filtered and deduplicated them to keep the most diverse, useful examples.
- Measuring disagreement: They asked 12 frontier models to answer each question. For each answer, they used a scoring rubric (0–6) to judge how much the response favors each value. If different AIs landed on very different scores for the same scenario, that shows disagreement. They used a simple idea from statistics, the standard deviation, to measure how spread out those scores were: the more spread, the more disagreement (a minimal sketch of this scoring appears after this list).
- Checking rulebook compliance: For the only publicly detailed rulebook (OpenAI's), they had three strong AI judges (from different providers) read the rulebook and rate whether each answer followed it, didn't follow it, or was ambiguous. They also checked how often the judges themselves disagreed, which tells you whether the rules are hard to interpret (a second sketch after this list shows the majority-vote aggregation).
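To make the disagreement measure concrete, here is a minimal sketch (not the authors' code) that scores each scenario by the standard deviation of the 0–6 value-classification scores across models and flags high-disagreement cases. The scenario names, scores, and the 1.5 cutoff are illustrative assumptions that mirror the threshold discussed later in this summary.

```python
import statistics

# Hypothetical value-classification scores (0-6) that twelve models received
# on the same scenario; in practice these come from the rubric-based grader.
scores_by_scenario = {
    "scenario_001": [6, 5, 6, 6, 5, 6, 6, 5, 6, 6, 5, 6],   # broad agreement
    "scenario_002": [0, 6, 1, 5, 2, 6, 0, 4, 6, 1, 5, 3],   # strong disagreement
}

DISAGREEMENT_THRESHOLD = 1.5  # assumed cutoff on the standard deviation

def disagreement(scores):
    """Spread of value scores across models: higher means more disagreement."""
    return statistics.stdev(scores)

flagged = {
    name: round(disagreement(scores), 2)
    for name, scores in scores_by_scenario.items()
    if disagreement(scores) >= DISAGREEMENT_THRESHOLD
}
print(flagged)  # here only scenario_002 exceeds the cutoff
```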
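And here is an equally small sketch of how the three-judge compliance check can be aggregated with a majority vote; the label names and example responses are placeholders, not the paper's exact schema.

```python
from collections import Counter

# Hypothetical per-response labels from three cross-provider judge models.
judge_labels = {
    "response_A": ["compliant", "compliant", "non_compliant"],
    "response_B": ["compliant", "ambiguous", "non_compliant"],
}

def majority_vote(labels):
    """Return the majority label, or 'no_consensus' when all three judges disagree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "no_consensus"

for response, labels in judge_labels.items():
    verdict = majority_vote(labels)
    unanimous = len(set(labels)) == 1
    print(response, verdict, "unanimous" if unanimous else "split")
```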
What did they find?
Here are the main results, explained simply:
- High disagreement points to rulebook problems: When AIs disagree a lot on a question, that scenario is much more likely to expose contradictions or unclear parts in the rulebook. The authors found tens of thousands of such cases.
- Some rules conflict with each other: For example, “assume best intentions” can conflict with safety rules, or “stay neutral” can clash with responding to ethical dilemmas. In those situations, any choice the AI makes can break some part of the rulebook.
- Rules aren’t detailed enough: Sometimes very different answers are all judged “acceptable,” even though some are far more helpful than others. This shows the rulebook often doesn’t say what “better” looks like.
- Judges disagree too: The three AI judges had only moderate agreement on whether answers followed the rules. That means the rulebook leaves a lot of room for interpretation—different readers (or models) can reasonably disagree on what it means.
- Misalignment and false refusals: In sensitive areas (like chemical or biological risks, child safety, politics), some models refused too often (even when a helpful, safe answer was possible), while others gave overly compliant answers. This reveals both misalignment (not following intended values) and overly cautious behavior (false positives in refusal).
- Different “personalities” across providers: The paper saw patterns like Anthropic’s Claude often prioritizing ethical responsibility, Gemini leaning into emotional depth, and OpenAI/Grok optimizing for efficiency. On some themes (like business effectiveness or social equity), the models were more mixed.
Why is this important?
- Safety and trust: If rulebooks have contradictions or are too vague, AIs will behave inconsistently—especially on hard, real-world questions. That can be risky and makes it hard for users to trust the system.
- Better design of AI values: The findings help identify exactly where rules need fixes—clearer definitions, more examples, and explicit guidance for edge cases.
- Tools for improving alignment: The stress-test method gives AI companies a way to find and fix problems before deployment, making AIs safer and more reliable.
What could this change?
- Clearer, more consistent rulebooks: Companies can use these stress-tests to patch contradictions, add concrete examples, and specify how to balance values in tough cases.
- Smarter training and evaluation: AIs can be trained with better signals, and evaluators can be tuned to reduce judge disagreement. This helps align models not just to “pass” compliance but to give genuinely helpful, safe, and fair answers.
- Understanding model character: Knowing each model’s value preferences lets users and developers choose the right AI for the job—and helps companies intentionally shape a model’s “character.”
In short, the paper shows a practical way to put AI rulebooks under pressure, spot where they break, and learn how different models choose values when perfect answers aren’t possible. That’s a big step toward building AIs that are both helpful and trustworthy in the real world.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable follow-up research.
- Causal attribution of disagreement to specification issues is unproven
- Disagreement may stem from model-specific training, safety classifiers, refusal policies, or sampling randomness rather than spec contradictions. Ablate with: same model retrained under different specs; different models trained under the same spec; turn off/on provider safety layers and refusal classifiers; control for decoding parameters.
- External validity beyond the OpenAI model spec is unclear
- Only one detailed public spec is evaluated for compliance. Replicate with other providers’ internal or newly constructed public specs and with open-source models paired to public constitutions to test generality.
- Human ground truth for “spec ambiguity” and “compliance” is missing
- Compliance judgments rely on three LLM evaluators with only moderate agreement (Fleiss' kappa ≈ 0.42). Establish expert-human panels (policy, safety, legal, ethics) as adjudicators, estimate inter-rater reliability, and calibrate LLM judges to human gold standards (a small kappa-computation sketch appears after this list).
- Validity of the disagreement metric and thresholds is unvalidated
- The 0–6 value-classification rubric and the STD ≥ 1.5 threshold lack sensitivity/specificity studies. Perform robustness checks varying scales, thresholds, and scoring schemes; compare against alternative disagreement measures (e.g., pairwise ranking disagreement, entropy over strategy classes). A threshold-sweep sketch follows this list.
- Potential circularity and model bias in rubric construction
- Claude 4 Opus generates the strategy spectrum and also participates as a response model, potentially biasing classifications. Test with diverse rubric generators (including non-participating models and human-authored rubrics) and measure impact on disagreement rates.
- Assumption that pairwise value conflicts capture real-world complexity
- Many real cases involve triads or larger sets of competing principles. Extend to multi-principle conflict generation and test whether higher-order conflicts reveal additional spec gaps.
- Dependence on a single value taxonomy derived from Claude traffic
- The 3,307-value taxonomy may encode provider/model-specific biases. Cross-validate with independent taxonomies (e.g., moral foundations, legal/clinical standards, social science frameworks) and test convergence.
- Scenario realism and user-intent representativeness are not established
- Generated prompts (including biased variants) may be adversarial or unrealistic. Compare with real user logs (appropriately anonymized), run human realism ratings, and quantify coverage across real-world domains, stakes, and user profiles.
- Topic classification reliability is uncertain
- Topic tags are assigned via LLM prompts. Validate with human annotation and/or rule-based classifiers to estimate precision/recall on safety-critical categories (bio/chem/cyber/child grooming/etc.).
- Deduplication and diversity claims rely on embedding choices
- Gemini embeddings and a weighted k-center heuristic may shape scenario selection. Benchmark against alternative embeddings and selection objectives; quantify how choices affect disagreement clusters and identified spec issues.
- Compliance evaluation misses response quality gradations
- Unanimous “compliant” labels mask large helpfulness/harmlessness differences. Define graded quality rubrics, add utility/satisfaction scores, and evaluate whether enhanced specs reduce variance while improving user outcomes.
- Cultural, legal, and jurisdictional context is not modeled
- Specs and “factuality” can be jurisdiction-dependent. Introduce context-aware evaluation (region, law, norms) and test whether contextualized specs reduce disagreements without increasing harm.
- Temporal drift and update sensitivity are unanalyzed
- Specs and models evolve. Longitudinal studies are needed to measure how disagreement hotspots and spec ambiguities change across model/spec versions and to assess stability of findings.
- Linking disagreements to concrete spec edits is not closed-loop
- The paper identifies ambiguity but does not propose, implement, and test specific spec revisions. Conduct interventions: author clarifying clauses/examples, retrain or steer models, re-measure disagreement/compliance and user utility.
- Measurement of “false-positive refusals” and “misalignment” lacks human adjudication
- Claims rely on model disagreements. Establish human labels for appropriate refusal vs. over-refusal and for alignment failures; report precision/recall of model-based flags.
- Confounds from provider safety stacks and classifiers
- Observed refusal differences (e.g., constitutional classifier effects) confound spec diagnostics. Isolate by evaluating raw base models, toggling safety layers, or using standardized sandboxed inference stacks.
- Unclear impact of decoding choices and system prompts
- Sampling temperature, system prompts, and instruction variants can shift behaviors. Systematically vary these and quantify their effect on disagreement and compliance rates.
- No analysis of demographic or harm distribution impacts
- Which user groups or topics are disproportionately affected by ambiguity-induced variance? Conduct fairness and harm-distribution analyses across demographics, languages, and domains.
- Value-prioritization aggregation pipeline is weakly validated
- It uses Opus to extract values and nearest-neighbor mapping to 26 categories without ground-truth checks. Validate with human coding, compute mapping errors, and test alternative aggregation methods.
- Limited statistical rigor for “strong prediction” claims
- The link between disagreement and spec violations is reported as fold-increases without predictive metrics. Provide ROC/PR curves, calibration analyses, CIs across resamples, and preregistered statistical tests.
- Reproducibility and openness of data/code are unspecified
- Clarify dataset/code release, licensing, redaction of sensitive scenarios, and instructions for safe replication; provide seeds/configs for scenario generation and evaluation.
- Safety of dataset release and red-teaming implications not addressed
- Stress-test prompts may facilitate misuse. Define redaction pipelines, usage policies, and a harm-minimization plan for public release.
- Generalization to multimodal and tool-augmented settings is unknown
- Extend methodology to image/audio/code-tools/agents where specs and safety constraints differ; test whether disagreement remains predictive of spec issues.
- Lack of ablations on biased vs. neutral scenario variants
- Quantify how biasing toward a value affects disagreement and compliance; test for framing effects and susceptibility to user preconceptions.
- No comparison to alternative uncertainty-discovery methods
- Benchmark against other active-learning or query-by-committee strategies (e.g., diverse-beam search, adversarial prompting, counterfactual rewriting) to see which best surfaces spec gaps.
- Unclear cost–benefit and scalability profile
- Report compute/time costs per surfaced issue; explore cheaper proxies (smaller models, distilled judges) and active sampling to reduce evaluation overhead.
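As a starting point for the inter-rater reliability work suggested above (the Fleiss' kappa item), here is a minimal sketch that computes Fleiss' kappa over three judges' compliance labels using statsmodels; the label matrix is made up for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are scenarios, columns are the three judges;
# 0 = non_compliant, 1 = compliant, 2 = ambiguous (hypothetical codes).
labels = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 2, 1],
    [2, 2, 2],
    [1, 0, 0],
    [1, 1, 2],
])

# aggregate_raters converts rater-per-column labels into the
# subjects-by-categories count table expected by fleiss_kappa.
table, _categories = aggregate_raters(labels)
kappa = fleiss_kappa(table, method="fleiss")
# Values between 0.41 and 0.60 are conventionally read as moderate agreement.
print(f"Fleiss' kappa: {kappa:.2f}")
```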
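For the robustness checks proposed around the STD ≥ 1.5 threshold, a sketch like the following sweeps the cutoff and compares the standard-deviation measure against a simple entropy over discretized strategy classes; both measures, the synthetic scores, and the bin count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical 0-6 value scores for 1,000 scenarios from 12 models.
scores = rng.integers(0, 7, size=(1000, 12))

std_scores = scores.std(axis=1, ddof=1)  # disagreement as score spread

def entropy_over_classes(row, n_classes=7):
    """Shannon entropy of the discrete score distribution for one scenario."""
    counts = np.bincount(row, minlength=n_classes)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

ent_scores = np.array([entropy_over_classes(r) for r in scores])

# Sweep the STD threshold and report how many scenarios each cutoff flags,
# then check how well the two disagreement measures rank-correlate.
for threshold in (1.0, 1.5, 2.0):
    print("threshold", threshold, "flags", int((std_scores >= threshold).sum()))
rho, _pvalue = spearmanr(std_scores, ent_scores)
print("rank correlation between STD and entropy:", round(rho, 2))
```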
Practical Applications
Immediate Applications
Below are practical, deployable applications that leverage the paper’s methods (scenario generation, disagreement-based diagnostics, compliance judging, k-center deduplication) and findings (spec contradictions, granularity gaps, evaluator ambiguities, refusal patterns).
- Spec Stress-Test Suite for AI assistants (software, AI/ML platforms)
- A turnkey pipeline to generate value-tradeoff scenarios from the 3,307-value taxonomy, classify responses across a roster of models, compute disagreement scores, and flag spec contradictions/ambiguities in model behavior.
- Potential tools/products: “SpecStress” (scenario generator + rubric-based scorer), “Disagreement Radar” (dashboard showing high-STD cases), “SpecLint” (automated spec QA that surfaces contradictions and missing edge-case coverage).
- Assumptions/dependencies: Access to the model spec (public or internal); API access to multiple LLMs; acceptance that inter-model disagreement is a meaningful proxy for spec gaps.
- Pre-deployment QA and continuous monitoring in MLOps pipelines (software, AI operations)
- Integrate weighted k-center deduplication and disagreement scoring into CI/CD to build representative, high-diversity eval suites; track drift in model character, refusal rates, and spec compliance over releases (a greedy k-center sketch follows this list).
- Potential workflows: “Eval-on-commit” runs, regression dashboards, automated triage of rising disagreement clusters.
- Assumptions/dependencies: Stable embedding models for k-center selection; compute budget for regular multi-model runs; versioned spec documents.
- Targeted spec revision and example authoring (industry, policy, compliance)
- Use evaluator disagreement and frequent non-compliance clusters to rewrite principles, add concrete examples, and clarify edge cases (e.g., disclaimers requirements, transformation exceptions, neutrality vs “no agenda” tradeoffs).
- Potential products: “Spec Authoring Assistant” that proposes clarifications/examples for ambiguous principles found via disagreement hotspots.
- Assumptions/dependencies: Organizational willingness to update specs; human-in-the-loop governance; legal review for regulated domains.
- RLHF/data curation improvement via disagreement-guided sampling (AI training)
- Feed high-disagreement scenarios into training to sharpen model behavior where specs are under-specified; distinguish acceptable vs optimal strategies when multiple compliant responses exist.
- Potential workflows: Active learning loops using query-by-committee disagreements; label collection that explicitly ranks “compliant but suboptimal” vs “compliant and high-utility” responses.
- Assumptions/dependencies: Availability of annotators; clarity on optimization objectives beyond basic compliance.
- Refusal Quality Auditor (safety, product UX)
- Automatically detect “bare refusals” and encourage constructive, safe alternatives (e.g., self-defense advice within bounds) when full compliance is not allowed.
- Potential tools/products: “Refusal Coach” that transforms hard rejections into safe, helpful guidance; topic-aware refusal calibration (bio/chem/cyber/child safety).
- Assumptions/dependencies: Access to refusal-type classifiers; clear internal policies distinguishing prohibited vs permissible alternative help.
- Regulated Advice Guardrails (finance, healthcare, legal)
- Audit and standardize disclaimer practices using evaluator disagreement evidence (e.g., underwriting example); ensure consistent deployment of “avoid regulated advice” patterns and safe educational framing.
- Potential tools/products: “Disclaimer Policy Checker” that tags missing or misapplied disclaimers; rule-based templates integrated into response generation.
- Assumptions/dependencies: Up-to-date regulatory guidance; robust domain ontologies; engagement from compliance officers.
- Enterprise Procurement and vendor assessment (industry governance)
- Use high-disagreement benchmarks to compare providers’ value prioritization patterns (e.g., efficiency vs emotional depth vs ethical responsibility) and to flag misalignment/false refusals pre-contract.
- Potential workflows: RFP evaluation with stress-test scores; sector-specific scenario packs (e.g., finance, education, safety).
- Assumptions/dependencies: Access to multi-provider models; shared evaluation rubric; vendor consent for audits.
- Academic benchmarking and dataset release (academia, research)
- Employ the 300k+ scenario corpus and rubric-based scoring protocol to study inter-model character differences, inter-rater reliability, and spec-driven behavior; build a public SpecEval benchmark.
- Potential tools/products: Open datasets with disagreement labels; reproducible scoring scripts; Fleiss’ kappa reports across evaluators.
- Assumptions/dependencies: Licensing for values taxonomy and scenarios; community standards for sensitive-topic handling.
- Sensitive-topic risk dashboards (security, safety)
- Monitor refusal patterns and false positives across bio/chem/cyber/child grooming topics; calibrate safety classifiers and waivers where legitimate uses exist.
- Potential products: “Topic Risk Monitor” with rates of complete/soft refusals by model and release; workflows for de-escalation and safe alternative guidance.
- Assumptions/dependencies: Internal safety policies; access to topic classifiers; periodic human review for edge cases.
- End-user transparency and trust features (daily life, consumer apps)
- Provide “Why I refused” explanations and value-tradeoff summaries to users; show alternative safe pathways when a request is declined.
- Potential products: UI components for refusal explanations; configurable guidance level (brief vs detailed); opt-in value summaries.
- Assumptions/dependencies: UX design acceptance; guardrails to avoid leaking sensitive policy internals; consistency with provider policies.
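To make the deduplication step concrete (the k-center item in the MLOps application above), here is a minimal sketch of a greedy, disagreement-weighted k-center selection over scenario embeddings; the weighting scheme, embedding source, and sizes are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def weighted_k_center(embeddings: np.ndarray, weights: np.ndarray, k: int) -> list:
    """Greedy farthest-point selection where distances are scaled by a
    per-scenario weight (e.g., inter-model disagreement)."""
    selected = [int(np.argmax(weights))]  # seed with the highest-weight scenario
    # Weighted distance from every point to its nearest selected center.
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1) * weights
    while len(selected) < k:
        nxt = int(np.argmax(dist))        # farthest (weighted) point becomes a center
        selected.append(nxt)
        new_dist = np.linalg.norm(embeddings - embeddings[nxt], axis=1) * weights
        dist = np.minimum(dist, new_dist)
        dist[selected] = 0.0              # centers never get re-selected
    return selected

# Toy usage: 500 fake scenario embeddings with disagreement-based weights.
rng = np.random.default_rng(1)
emb = rng.normal(size=(500, 64))
disagreement = rng.uniform(0.1, 2.5, size=500)
keep = weighted_k_center(emb, disagreement, k=50)
print(len(keep), keep[:5])
```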
Long-Term Applications
The following applications build on the paper’s methodology but require further research, formalization, scaling, or standardization.
- Machine-readable, formally verifiable model specs (policy, software tooling)
- Translate prose specs into structured, testable formats (logic/rule graphs) that can be automatically checked for contradictions and coverage gaps; tie tests to scenario generators.
- Potential tools/products: “SpecChecker” (formal verification), “SpecDiff” (semantic diffing across versions).
- Assumptions/dependencies: Specification languages and authoring tools; consensus on formalism; legal and policy buy-in.
- Standards and certification for spec clarity and disagreement thresholds (policy, governance)
- Establish sector-wide benchmarks (e.g., “Spec clarity score,” maximum allowed evaluator disagreement rates) as part of pre-deployment certification for foundation models and domain assistants.
- Potential products/workflows: Third-party audits; publishable scorecards; compliance labels.
- Assumptions/dependencies: Regulator engagement; multisector participation; agreed-upon test suites.
- User- or enterprise-configurable “Value Profiles” with safe arbitration (software, enterprise AI)
- Allow organizations or users to set priority weights over values (e.g., neutrality, empathy, efficiency) within safe bounds; a value-arbitration module computes responses consistent with both profiles and policies.
- Potential tools/products: “Value Dial” settings, “Arbitration Engine” that references specs; per-tenant customizations in multi-tenant platforms.
- Assumptions/dependencies: Strong guardrails; transparent conflict resolution; normative governance for default profiles.
- Multi-model committee inference for safety-critical decisions (healthcare, finance, public sector)
- Use disagreement-aware ensemble inference where responses are cross-checked and escalated when conflict exceeds thresholds; only converge when consensus or policy-based arbitration is achieved (a consensus-gating sketch follows this list).
- Potential workflows: Tiered routing and adjudication; “consensus gating” for high-risk tasks; human-in-the-loop escalation.
- Assumptions/dependencies: Cost/performance budgets; robust arbitration policies; clear escalation paths.
- Active learning loops driven by disagreement to shape alignment data (AI training)
- Automate sampling of high-uncertainty scenarios, collect fine-grained labels (compliance, optimality, refusal helpfulness), and retrain models with improved spec granularity and tradeoff reasoning.
- Potential products/workflows: “Disagreement-to-Data” pipelines; quality-weighted RLHF; curriculum design targeting spec weaknesses.
- Assumptions/dependencies: Sustainable annotation capacity; consistent label schemas; strong privacy/safety controls.
- Spec authoring IDEs with contradiction detection and example synthesis (software tooling, policy)
- Build authoring environments that highlight likely contradictions (e.g., “assume best intentions” vs safety bounds), propose disambiguating examples, and simulate model behavior under candidate text.
- Potential tools/products: “Policy IDE,” “Example Synthesizer,” “Ambiguity Heatmap.”
- Assumptions/dependencies: Integration with LLM simulators; version control and review workflows; policy-writer training.
- Sector-specific stress-test packs (healthcare, education, finance, cybersecurity, biosecurity)
- Curate domain scenario libraries reflecting local regulations and ethical norms; measure refusal rates, false positives, and prioritization differences; tailor specs to sector standards.
- Potential products/workflows: “Healthcare SpecPack,” “Finance SpecPack,” with documented norms and evaluator agreements.
- Assumptions/dependencies: Domain experts; ongoing regulatory updates; safe handling of sensitive content.
- Interpretable compliance judge models with higher inter-rater reliability (academia, software)
- Train evaluators on human-labeled, diverse edge cases to reduce ambiguity and boost agreement; incorporate explanations and confidence measures.
- Potential products: “Judge Ensemble” with calibrated reliability; explanation tracing to spec fragments.
- Assumptions/dependencies: High-quality labeled corpora; agreed taxonomy of violations; careful deployment to avoid overfitting to one provider’s policy.
- Provider-comparison and routing services based on value prioritization (industry, platforms)
- Route queries to providers whose value tendencies align with user or enterprise goals (e.g., emotional depth vs efficiency vs ethical caution) while enforcing safety constraints.
- Potential products/workflows: “Value-Aware Router”; cross-provider SLAs; dashboards on prioritization drift.
- Assumptions/dependencies: Reliable, up-to-date provider profiling; transparent routing policies; fairness and compliance controls.
- Dynamic policy adaptation for evolving controversies and facts (policy, public sector, media)
- Continuously reconcile neutrality and factuality principles as topics move between scientific consensus and public contention; update examples and disclaimer protocols.
- Potential workflows: Policy review boards; controversy monitors; scheduled spec refreshes with public transparency.
- Assumptions/dependencies: Governance capacity; community input; mechanisms for resolving contested norms.
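As one way to prototype the consensus-gating idea above, the sketch below routes a query to escalation whenever the committee's value scores disagree beyond a threshold; the committee members, scoring function, and threshold are placeholders standing in for real model calls and policy-defined cutoffs.

```python
import statistics
from typing import Callable, Dict, List, Tuple

def consensus_gate(
    query: str,
    committee: List[Callable[[str], Tuple[str, float]]],
    max_std: float = 1.5,
) -> Dict:
    """Run a query past a committee of models; each member returns a
    (response, value_score) pair. Escalate to human review when the
    spread of scores exceeds max_std; otherwise answer with the response
    whose score is closest to the committee median."""
    responses, scores = zip(*(member(query) for member in committee))
    spread = statistics.stdev(scores)
    if spread > max_std:
        return {"decision": "escalate_to_human", "spread": spread}
    median = statistics.median(scores)
    best = min(range(len(scores)), key=lambda i: abs(scores[i] - median))
    return {"decision": "respond", "response": responses[best], "spread": spread}

# Hypothetical committee members standing in for real model API calls.
committee = [
    lambda q: ("Here is cautious guidance ...", 2.0),
    lambda q: ("Here is detailed guidance ...", 5.5),
    lambda q: ("I can only give general info ...", 1.0),
]
print(consensus_gate("Should I adjust my medication dose?", committee))
```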
These applications collectively enable organizations to diagnose, revise, and govern AI behavior beyond basic compliance, elevating response quality, safety, and transparency. They depend on access to specs and multi-model evaluations, human oversight, and evolving standards that formalize clarity, coverage, and interpretability of AI policies.
Glossary
- 2-approximation greedy algorithm: A heuristic that guarantees the solution is within twice the optimal for certain NP-hard selection problems (e.g., k-center). "the 2-approximation greedy algorithm"
- AI constitutions: Documents that codify high-level principles and rules to govern AI behavior. "AI constitutions"
- Alignment: The process of training models so their behavior adheres to desired ethical principles and user intent. "alignment failures during training"
- Compliance checks: Automated evaluations to determine whether model responses adhere to a given specification or policy. "compliance checks"
- Constitutional AI: An alignment paradigm where models are trained to follow an explicit written set of principles (a “constitution”). "constitutional AI"
- Constitutional classifier: A safety classifier used to enforce constitutional policies during inference or deployment. "the deployment of the constitutional classifier"
- Deliberate alignment: Alignment using structured reasoning or deliberation to improve adherence to principles. "deliberate alignment"
- Disagreement-Weighted Deduplication: A deduplication approach that removes near-duplicate scenarios while prioritizing those with higher inter-model disagreement. "Disagreement-Weighted Deduplication."
- Embeddings: Vector representations of text used to measure semantic similarity and support tasks like clustering and retrieval. "text embeddings"
- Extended thinking: A generation mode that enables longer, step-by-step reasoning to improve scenario quality. "extended thinking for Claude models"
- Facility location: An optimization objective for selecting representative points by minimizing average distance to users or demands; used as a contrast to k-center selection. "rather than k-means or facility location"
- Fleiss' Kappa: A statistic that quantifies inter-rater reliability for categorical ratings among multiple evaluators. "a Fleiss' Kappa of 0.42"
- Frontier LLMs: The most advanced LLMs available from leading providers. "twelve frontier LLMs"
- Gemini embeddings: Embedding vectors produced using the Gemini model family for similarity and matching. "Using Gemini embeddings"
- Inter-rater agreement: The degree to which different evaluators assign the same labels or judgments. "moderate inter-rater agreement"
- Inter-rater reliability: A measure of the consistency of ratings across multiple evaluators. "moderate inter-rater reliability"
- k-center objective: An optimization that selects k representative points so that the maximum distance from any item to its nearest selected point is minimized, ensuring diverse coverage including outliers. "the k-center objective"
- Majority vote: A decision rule that selects the label agreed upon by most evaluators. "majority vote among the three models"
- Misalignment: Model behavior that deviates from intended ethical or specification-based guidance. "misalignment cases"
- Nearest neighbor classification: Assigning labels by finding the closest vectors in embedding space and mapping them to known categories. "nearest neighbor classification"
- Outlier responses: Model outputs that significantly diverge from the consensus among other models. "outlier responses"
- Post-training: Alignment or fine-tuning stages conducted after initial large-scale pretraining. "post-training"
- Pretraining: The initial stage of training on large corpora to learn general language patterns and knowledge. "pretraining"
- Query-by-committee theory: An active learning framework that measures uncertainty via disagreement among multiple models (a committee). "query-by-committee theory"
- Reasoning-based models: Models optimized for deliberate, step-by-step reasoning to improve quality and adherence. "reasoning-based models"
- Reinforcement learning from human feedback (RLHF): A training method that uses human preference signals to shape model policies via reinforcement learning. "reinforcement learning from human feedback (RLHF)"
- Spectrum rubrics: Graded rubrics defining a range of response strategies from strong favoring to strong opposing, used to classify value alignment. "spectrum rubrics"
- Standard deviation: A dispersion measure used here to quantify disagreement across model value scores. "the standard deviation of value classification scores"
- Taxonomy: A structured classification of values or concepts organized hierarchically for coverage and analysis. "fine-grained taxonomy"
- Topic-based classification: Labeling scenarios by subject areas (e.g., safety, politics) to analyze behavior on sensitive domains. "Topic-Based Classification."
- Value biasing: Creating biased scenario variants that favor one value over another to increase difficulty and diversity. "we apply value biasing"
- Value classification: Scoring responses to indicate how strongly they favor or oppose specified values. "value classification"
- Value prioritization: Patterns showing which values models tend to favor when specifications are ambiguous. "value prioritization patterns"
- Value tradeoff scenarios: Queries constructed to force explicit tradeoffs between legitimate but competing values. "value tradeoff scenarios"
- Weighted k-center objective: A k-center selection where distances are weighted (e.g., by disagreement) to prioritize certain scenarios. "a weighted k-center objective"