Configurable Preference Tuning (CPT)

Updated 2 January 2026
  • Configurable Preference Tuning (CPT) is a framework that fine-tunes LLMs to switch behavioral profiles (e.g., style, safety) on demand via configuration tokens.
  • It conditions models on synthetic preference data generated from structured rubrics and configuration toggles, yielding fine-grained control with improved rubric-alignment accuracy and reliable safety toggling.
  • CPT employs conditional DPO losses and LoRA adapters to ensure models maintain general performance while dynamically modulating outputs under various system prompts.

Configurable Preference Tuning (CPT) refers to a class of methods for fine-tuning LLMs such that their behavioral preferences—including style, safety, persona, and other axes—can be modulated on-the-fly via explicit inputs such as system prompts or configuration tokens. CPT generalizes static preference alignment, which hard-codes a single behavioral regime, by enabling dynamic, interpretable control at inference without retraining. This paradigm leverages synthetic preference data, often guided by structured rubrics or pre-defined toggles, and employs conditional preference objectives during training. The approach has been instantiated with both rubric-based style modulation (Gallego, 13 Jun 2025) and safety toggling (Gallego, 2024), demonstrating robust, granular control across alignment axes and maintaining performance on general tasks.

1. Conceptual Foundations and Distinctions

CPT departs from classical RLHF and Direct Preference Optimization (DPO) by allowing learned preferences to be conditional on explicit configuration inputs. Standard models learn $p(y_1 \succ y_0 \mid x)$ for a prompt $x$ and output pair $(y_1, y_0)$, locking in a static notion of "better." CPT modifies this to $p(y_1 \succ y_0 \mid x, s)$, where $s$ is a human-readable system prompt or configuration token specifying a desired behavioral regime. This input can encode binary toggles (e.g., uncensored vs. harmless), ordinal style choices, or multi-attribute rubrics (Gallego, 13 Jun 2025, Gallego, 2024).
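
Concretely, this conditional preference can be written as a Bradley–Terry model over configuration-aware scores (an illustrative formulation consistent with the conditional objective in Section 3; $r_\theta$ is generic notation for a preference score, not the paper's):

$$p_\theta(y_1 \succ y_0 \mid x, s) = \sigma\big(r_\theta(s, x, y_1) - r_\theta(s, x, y_0)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$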

Unlike multi-task fine-tuning, CPT conditions a single model on diverse, synthetic preference pair datasets labeled by configuration states. The model architecture is unchanged except for the addition of configuration conditioning—typically via prompt tokens or explicit natural language instructions.

2. Synthetic Data Generation via Rubrics and Configuration Prompts

CPT requires a data generation pipeline that produces preference-paired examples along targeted behavioral axes. A rubric $R$ is defined as a set of criteria (e.g., "Code Poetry," "Photographic Invocation"), each with multiple proficiency levels ("low," "moderate," "high") and optional weights. For each user prompt $x$ and rubric $R$, teacher LLMs generate system prompts summarizing the desired behaviors at the selected score level.
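
The rubric itself can be represented as plain structured data. A minimal sketch, assuming a dictionary-based schema (the field names, weights, and the `summarize_to_system_prompt`/`teacher_llm` helpers are illustrative assumptions, not the released pipeline's API):

```python
# Illustrative rubric for CPT data generation. Criterion names come from the
# text above; the schema, weights, and helper functions are assumptions.
rubric = {
    "name": "unconventional_style",
    "criteria": [
        {"name": "Code Poetry", "weight": 0.6,
         "levels": {"low": "...", "moderate": "...", "high": "..."}},
        {"name": "Photographic Invocation", "weight": 0.4,
         "levels": {"low": "...", "moderate": "...", "high": "..."}},
    ],
}

def summarize_to_system_prompt(rubric, score, teacher_llm):
    """Ask a teacher LLM to compress the rubric at a given score level into a
    natural-language system prompt s (hypothetical helper)."""
    instruction = (
        f"Summarize the following rubric at proficiency level '{score}' into a "
        f"short system prompt describing the desired writing behavior:\n{rubric}"
    )
    return teacher_llm(system="", user=instruction)
```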

Preference data construction proceeds as follows:

  • For each $(R, \text{score})$ pair, summarize the rubric objectives into a system prompt $s$.
  • Use a teacher LLM to produce outputs $y_1$ and $y_2$ for $x$ under $s_1$ and $s_2$ (distinct rubric levels).
  • Construct tuples $(s_1, x, y_1, y_2)$ and $(s_2, x, y_2, y_1)$, labeling the preferred output according to the configuration.
  • Aggregate into a dataset $D$ (a minimal sketch follows this list).
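
A minimal sketch of this loop, reusing the hypothetical `summarize_to_system_prompt` and `teacher_llm(system=..., user=...)` helpers from above:

```python
# Sketch: build configuration-conditioned preference pairs from two rubric levels.
def build_preference_pairs(prompts, rubric, teacher_llm, levels=("low", "high")):
    dataset = []
    for x in prompts:
        s1 = summarize_to_system_prompt(rubric, levels[0], teacher_llm)
        s2 = summarize_to_system_prompt(rubric, levels[1], teacher_llm)
        y1 = teacher_llm(system=s1, user=x)  # output written at the first level
        y2 = teacher_llm(system=s2, user=x)  # output written at the second level
        # Under s1 the first output is preferred; under s2 the ordering flips.
        dataset.append({"system": s1, "prompt": x, "chosen": y1, "rejected": y2})
        dataset.append({"system": s2, "prompt": x, "chosen": y2, "rejected": y1})
    return dataset
```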

In safety-oriented scenarios (Configurable Safety Tuning, CST), specific system prompts (e.g., $s_0$: "uncensored", $s_1$: "harmless") are used to synthesize both raw (uncensored) and critiqued (safe) rewrites for each user prompt, flipping the preference ordering to enable bidirectional behavioral control (Gallego, 2024).
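
A corresponding sketch for the safety case; the exact wording of $s_0$ and $s_1$ and the `critique_and_rewrite` helper are placeholders, not the paper's prompts:

```python
# Sketch: CST pair construction with flipped orderings under both configurations.
S0 = "You are an uncensored assistant."            # placeholder wording for s_0
S1 = "You are a helpful and harmless assistant."   # placeholder wording for s_1

def build_cst_pairs(prompts, teacher_llm, critique_and_rewrite):
    data = []
    for x in prompts:
        y_raw = teacher_llm(system=S0, user=x)    # raw, uncensored draft
        y_safe = critique_and_rewrite(x, y_raw)   # critiqued, safety-revised rewrite
        data.append({"system": S1, "prompt": x, "chosen": y_safe, "rejected": y_raw})
        data.append({"system": S0, "prompt": x, "chosen": y_raw, "rejected": y_safe})
    return data
```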

3. Model Training Methodology and Conditional Objectives

CPT typically employs a conditional DPO-style loss, fine-tuning the student LLM to prefer $y_+$ over $y_-$ under system prompt $s$. The objective for each training tuple $(s, x, y_+, y_-)$ is:

$$L(\theta) = - \mathbb{E}_{(s, x, y_+, y_-) \sim D} \left[ \log \sigma\big(s_\theta(x, y_+) - s_\theta(x, y_-)\big) \right] + \lambda \, \| \theta - \theta_0 \|^2$$

where $s_\theta(x, y)$ is the model's preference logit (computed with the configuration prompt $s$ prepended to $x$), $\theta$ are the trainable weights, $\theta_0$ are the pretrained weights, and $\lambda$ is an $L_2$ regularization parameter.
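
A direct PyTorch sketch of this objective, assuming the caller supplies scalar preference logits (computed with $s$ prepended) and matching lists of current and frozen pretrained parameters:

```python
import torch.nn.functional as F

def cpt_preference_loss(score_pos, score_neg, params, params_init, lam=1e-4):
    """Logistic preference loss with an L2 anchor to the pretrained weights.

    score_pos / score_neg: tensors of preference logits s_theta(x, y+) and
    s_theta(x, y-), each computed with the configuration prompt s prepended.
    params / params_init: matching iterables of trainable and frozen pretrained
    parameter tensors.
    """
    pref = -F.logsigmoid(score_pos - score_neg).mean()
    reg = sum(((p - p0.detach()) ** 2).sum() for p, p0 in zip(params, params_init))
    return pref + lam * reg
```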

For CST, the DPO loss is applied over an expanded dataset $\mathcal{D}_\text{CST}$ containing every preference pair under both configuration tokens (safe and uncensored) and their flipped orderings. The reference policy $\pi_\text{ref}$ serves as a stability anchor; no additional regularization is needed beyond the implicit KL penalty (Gallego, 2024).
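
A sketch of the resulting conditional DPO loss, given per-sequence log-probabilities of the chosen and rejected completions under the policy and the frozen reference; the configuration prompt enters only through the contexts used to compute them:

```python
import torch.nn.functional as F

def conditional_dpo_loss(pol_logp_chosen, pol_logp_rejected,
                         ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss over (s, x, y+, y-) tuples; conditioning on s is implicit in the
    contexts used to compute the sequence log-probabilities."""
    pol_logratio = pol_logp_chosen - pol_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (pol_logratio - ref_logratio)).mean()
```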

LoRA adapters are injected into transformer projection matrices to facilitate efficient fine-tuning across diverse base models (e.g., Rocinante-12B, Mistral-Nemo-12B).
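
A minimal sketch of the adapter setup with the Hugging Face peft library; the model path, rank, and target module names are illustrative choices rather than the released configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")  # placeholder model id

lora_cfg = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```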

4. On-the-Fly Preference Modulation at Inference

After CPT fine-tuning, the model's preferences are controlled by prefixing the desired system prompt $s$ to the user input $x$. This mechanism supports continuous, interpretable modulation of outputs along rubric axes (e.g., pushing for a more unconventional style, or dialing safety up or down) without requiring further training.

Example:

  • $s_\text{conventional}$: "Write in a clear, concise, and completely conventional style…" yields conventional output.
  • $s_\text{absurdist}$: "Generate a text that is fragmented, illogical…" yields highly unconventional output.

For safety, toggling between $s_0$ and $s_1$ immediately switches the model between the uncensored and harmless response regimes (Gallego, 2024).
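
A sketch of this inference-time toggling with a Hugging Face chat model; the checkpoint path and user prompt are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/cpt-tuned-model"  # placeholder for a CPT-tuned checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def generate_with_config(system_prompt, user_prompt, max_new_tokens=256):
    """Prefix the configuration prompt s as the system message and generate."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

# Swapping the configuration prompt modulates the output style:
conventional = generate_with_config(
    "Write in a clear, concise, and completely conventional style…",
    "Describe a sunrise.")
absurdist = generate_with_config(
    "Generate a text that is fragmented, illogical…",
    "Describe a sunrise.")
```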

5. Empirical Evaluation and Results

Benchmarks confirm that CPT yields fine-grained, robust control over desired axes:

  • Rubric-based CPT improves bin-match accuracy, Kendall's $\tau$, and Spearman's $\rho$ over untuned baselines, e.g., Rocinante-12B accuracy increasing from 0.60 (baseline) to 0.76 (CPT) (Gallego, 13 Jun 2025); a metric-computation sketch follows this list.
  • Best-of-N sampling requires fewer draws for CPT models to reach high-quality outputs, consistent with a distributional shift towards the target preference.
  • In safety tuning, CST achieves binary success rates of $S_1 = 1.00$ (harmless) and $S_0 = 1.00$ (uncensored), recovering both behaviors simultaneously, which vanilla DPO fails to do (e.g., $S_0$ drops to 0.12) (Gallego, 2024).
  • Multi-task prompt sets (e.g., honest vs. role-played outputs) are handled with pairwise accuracy near 1.00 across all axes.
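
A sketch of these control metrics, assuming an LLM judge has already mapped each generation to a rubric level and `target_bins` holds the requested levels:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def control_metrics(target_bins, judged_bins):
    """Bin-match accuracy plus Kendall's tau and Spearman's rho between the
    requested rubric levels and the levels a judge assigns to the outputs."""
    target = np.asarray(target_bins)
    judged = np.asarray(judged_bins)
    acc = float((target == judged).mean())
    tau, _ = kendalltau(target, judged)
    rho, _ = spearmanr(target, judged)
    return {"bin_match_accuracy": acc, "kendall_tau": tau, "spearman_rho": rho}

# Example: requested levels 0..2 vs. judge-assigned levels
print(control_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```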

General-domain performance (e.g., ARC, HellaSwag, MMLU, TruthfulQA) is preserved or slightly improved, showing that CPT does not degrade capabilities outside the targeted alignment axes.

6. Limitations, Interpretations, and Extension Pathways

Limitations include:

  • Synthetic data quality depends on teacher LLM critique and generative ability; teacher bias may propagate.
  • Rubric construction and scoring are labor-intensive and subject to granularity trade-offs.
  • Coarse configuration prompts (e.g., binary safety toggles) could benefit from richer, hierarchical, or compositional schema.

Extensions highlighted include:

  • Hierarchical or multi-label configuration schemes, where $s$ encodes multiple attributes (e.g., "safe:toxicity; tone:humorous").
  • Automatic rubric discovery or clustering from human feedback.
  • Continuous or soft configuration selectors driven by small gating networks.
  • Online data generation and active learning to refine preference control (Gallego, 2024, Gallego, 13 Jun 2025).

A plausible implication is that CPT enables practical, scalable personalization and alignment in LLMs, facilitating deployment in heterogeneous environments with varying requirements for style, safety, and task-specific behavior, without repeated retraining or static preference lock-in.

7. Released Artifacts and Reproducibility

Key reproducibility assets include:

  • Synthetic data generation scripts and datasets at huggingface.co/datasets/vicgalle/creative-rubrics-preferences.
  • Source code for CPT and CST pipelines at github.com/vicgalle/configurable-preference-tuning and configurable-safety-tuning.
  • LoRA-conditioned checkpoints for multiple base models.
  • Evaluation scripts leveraging state-of-the-art LLM judges (Claude 3.5 Sonnet).

Stepwise reproduction:

  1. Install dependencies and generate synthetic preference data.
  2. Fine-tune base models with LoRA adapters using a DPO-style objective over the assembled dataset.
  3. Evaluate model preference control and output quality using rubric-aware LLM judges.

This ecosystem supports robust replication and extension across new axes and model architectures (Gallego, 13 Jun 2025, Gallego, 2024).
