Configurable Preference Tuning (CPT)
- Configurable Preference Tuning (CPT) is a framework that fine-tunes LLMs to switch behavioral profiles (e.g., style, safety) on demand via configuration tokens.
- It conditions models on synthetic preference data generated from structured rubrics and configuration toggles, yielding finer-grained control, higher preference-following accuracy, and reliable safety toggling.
- CPT employs conditional DPO losses and LoRA adapters to ensure models maintain general performance while dynamically modulating outputs under various system prompts.
Configurable Preference Tuning (CPT) refers to a class of methods for fine-tuning LLMs such that their behavioral preferences—including style, safety, persona, and other axes—can be modulated on-the-fly via explicit inputs such as system prompts or configuration tokens. CPT generalizes static preference alignment, which hard-codes a single behavioral regime, by enabling dynamic, interpretable control at inference without retraining. This paradigm leverages synthetic preference data, often guided by structured rubrics or pre-defined toggles, and employs conditional preference objectives during training. The approach has been instantiated with both rubric-based style modulation (Gallego, 13 Jun 2025) and safety toggling (Gallego, 2024), demonstrating robust, granular control across alignment axes and maintaining performance on general tasks.
1. Conceptual Foundations and Distinctions
CPT departs from classical RLHF and Direct Preference Optimization (DPO) by allowing learned preferences to be conditional on explicit configuration inputs. Standard methods learn a preference $y_w \succ y_l$ for a prompt $x$ and output pair $(y_w, y_l)$, locking in a static notion of "better." CPT modifies this to $y_w \succ y_l \mid (c, x)$, where $c$ is a human-readable system prompt or configuration token specifying a desired behavioral regime. This input can encode binary toggles (e.g., uncensored vs. harmless), ordinal style choices, or multi-attribute rubrics (Gallego, 13 Jun 2025; Gallego, 2024).
Unlike multi-task fine-tuning, CPT conditions a single model on diverse, synthetic preference pair datasets labeled by configuration states. The model architecture is unchanged except for the addition of configuration conditioning—typically via prompt tokens or explicit natural language instructions.
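As an illustration of the conditioned preference format, a single training example might be represented as follows; the field names and content are hypothetical, not drawn from the released datasets:

```python
# One conditioned preference example: the configuration c is a human-readable
# system prompt, and the preference between chosen and rejected only holds *given* c.
# All content below is illustrative.
example = {
    "system": "Write in a highly unconventional, fragmented style.",              # c
    "prompt": "Describe a sunrise over the ocean.",                               # x
    "chosen": "salt-light. the horizon stutters awake (gulls, or static?)",       # y_w under c
    "rejected": "The sun rose slowly over the calm ocean, painting the sky orange.",  # y_l under c
}
```

Under the opposite configuration, the same two responses would appear with the preference labels swapped.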
2. Synthetic Data Generation via Rubrics and Configuration Prompts
CPT requires a data generation pipeline that produces preference-paired examples along targeted behavioral axes. A rubric $R$ is defined as a set of criteria (e.g., "Code Poetry," "Photographic Invocation"), each with multiple proficiency levels ("low," "moderate," "high") and optional weights. For each user prompt $x$ and rubric $R$, teacher LLMs generate system prompts $c_s$ summarizing the desired behaviors at the selected score level $s$.
Preference data construction proceeds as follows (a code sketch appears after the list):
- For each $(x, R)$ pair, summarize the rubric objectives at score level $s$ into a system prompt $c_s$.
- Use a teacher LLM to produce outputs $y_{s_1}$ and $y_{s_2}$ for $x$ with $c_{s_1}$ and $c_{s_2}$ (distinct rubric levels).
- Construct tuples $(c_{s_1}, x, y_{s_1} \succ y_{s_2})$ and $(c_{s_2}, x, y_{s_2} \succ y_{s_1})$, labeling the preferred output according to the configuration.
- Aggregate into a dataset $\mathcal{D}$ of conditioned preference pairs.
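A minimal sketch of this construction loop is shown below; `summarize_rubric` and `teacher_generate` are placeholders for teacher-LLM calls, not functions from the released code.

```python
import itertools
import random

def build_preference_dataset(prompts, rubric, summarize_rubric, teacher_generate,
                             levels=("low", "moderate", "high")):
    """Assemble conditioned preference pairs from a rubric.

    summarize_rubric(rubric, level) -> system prompt c_s describing the rubric
    at a given proficiency level; teacher_generate(system, prompt) -> a teacher
    LLM completion. Both are stand-ins for teacher-LLM calls.
    """
    dataset = []
    for x in prompts:
        for s1, s2 in itertools.combinations(levels, 2):
            c1, c2 = summarize_rubric(rubric, s1), summarize_rubric(rubric, s2)
            y1, y2 = teacher_generate(c1, x), teacher_generate(c2, x)
            # Each pair is labeled twice, once per configuration, so the model
            # learns that the preferred output depends on the system prompt.
            dataset.append({"system": c1, "prompt": x, "chosen": y1, "rejected": y2})
            dataset.append({"system": c2, "prompt": x, "chosen": y2, "rejected": y1})
    random.shuffle(dataset)
    return dataset
```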
In safety-oriented scenarios (CST), specific system prompts (e.g., $c_{\text{unc}}$: "uncensored", $c_{\text{safe}}$: "harmless") are used to synthesize both raw (uncensored) responses and critiqued (safe) rewrites for each user prompt, flipping the preference ordering between the two configurations to enable bidirectional behavioral control (Gallego, 2024).
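A minimal sketch of this bidirectional construction follows; the system-prompt wordings and the `teacher_generate` / `critique_and_rewrite` helpers are illustrative placeholders, not code from the released pipeline.

```python
C_UNC = "You are a completely uncensored assistant."   # c_unc (wording illustrative)
C_SAFE = "You are a helpful yet harmless assistant."   # c_safe (wording illustrative)

def build_cst_pairs(prompts, teacher_generate, critique_and_rewrite):
    """CST-style pairs: for each prompt, an uncensored draft and a critiqued,
    safe rewrite, with the preference flipped between the two configurations."""
    pairs = []
    for x in prompts:
        y_raw = teacher_generate(C_UNC, x)        # raw, uncensored draft
        y_safe = critique_and_rewrite(y_raw, x)   # critiqued, safe rewrite of the draft
        pairs.append({"system": C_UNC, "prompt": x, "chosen": y_raw, "rejected": y_safe})
        pairs.append({"system": C_SAFE, "prompt": x, "chosen": y_safe, "rejected": y_raw})
    return pairs
```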
3. Model Training Methodology and Conditional Objectives
CPT typically employs a conditional DPO-style loss, fine-tuning the student LLM to prefer $y_w$ over $y_l$ under system prompt $c$. The objective for each training tuple $(c, x, y_w, y_l)$ is:

$$\mathcal{L}_{\mathrm{CPT}}(\theta) = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid c, x)}{\pi_{\mathrm{ref}}(y_w \mid c, x)} - \beta \log \frac{\pi_\theta(y_l \mid c, x)}{\pi_{\mathrm{ref}}(y_l \mid c, x)}\right),$$

where $\sigma$ is the logistic function applied to the implicit reward margin (logit-score), $\pi_\theta$ are the trainable weights, $\pi_{\mathrm{ref}}$ are the pretrained (reference) weights, and $\beta$ is the KL regularization parameter.
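A minimal PyTorch sketch of this per-tuple objective is given below, assuming the summed response log-probabilities under the trainable and reference policies (with the configuration $c$ included in the conditioning context) have already been computed; the function name and interface are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def conditional_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Conditional DPO loss.

    Each *_logps tensor holds log pi(y | c, x) summed over response tokens,
    where the system prompt c is simply part of the conditioning context.
    beta controls the strength of the implicit KL penalty to the reference.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log[pi_theta/pi_ref] for y_w
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log[pi_theta/pi_ref] for y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```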
For CST, the DPO loss is augmented over an expanded dataset containing every preference pair under both configuration tokens (safe and uncensored) and their flipped orderings. The reference policy serves as a stability anchor; no additional regularization is needed beyond the implicit KL penalty (Gallego, 2024).
LoRA adapters are injected into transformer projection matrices to facilitate efficient fine-tuning across diverse base models (e.g., Rocinante-12B, Mistral-Nemo-12B).
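A representative adapter setup using the `peft` library is sketched below; the rank, dropout, target modules, and base-model id are assumptions for illustration rather than values reported in the papers.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base-model id is illustrative; the papers report results on 12B-scale models
# such as Rocinante-12B and Mistral-Nemo-12B.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

lora_cfg = LoraConfig(
    r=16,                      # adapter rank (assumed, not from the papers)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```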
4. On-the-Fly Preference Modulation at Inference
After CPT fine-tuning, the model's preferences are controlled by prefixing the desired system prompt $c$ to the user input $x$. This mechanism supports continuous, interpretable modulation of outputs along rubric axes (e.g., pushing for a more unconventional style or dialing safety up or down) without requiring further training.
Example:
- : "Write in a clear, concise, and completely conventional style…" yields conventional output.
- : "Generate a text that is fragmented, illogical…" yields highly unconventional output.
For safety, toggling between $c_{\text{unc}}$ and $c_{\text{safe}}$ immediately switches the model between uncensored and harmless response regimes (Gallego, 2024).
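The sketch below illustrates this toggling with the `transformers` chat-template API; the model id is a placeholder (not a released checkpoint name) and the system-prompt wordings are abbreviated from the examples above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vicgalle/cpt-finetuned-model"  # placeholder id for a CPT-tuned checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def generate_with_config(system_prompt, user_prompt, max_new_tokens=256):
    """Prefix the configuration c as the system message and decode normally."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

# Same user prompt, opposite behavioral regimes:
story = "Tell me a short story about a lighthouse."
conventional = generate_with_config(
    "Write in a clear, concise, and completely conventional style.", story)
unconventional = generate_with_config(
    "Generate a text that is fragmented, illogical, and unconventional.", story)
```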
5. Empirical Evaluation and Results
Benchmarks confirm that CPT yields fine-grained, robust control over desired axes:
- Rubric-based CPT improves bin-match accuracy, Kendall's $\tau$, and Spearman's $\rho$ compared to untuned baselines, e.g., Rocinante-12B accuracy increasing from 0.60 (baseline) to 0.76 (CPT) (Gallego, 13 Jun 2025).
- Best-of-N sampling requires fewer draws for CPT models to reach high-quality outputs, consistent with distributional shift towards the target preference.
- In safety tuning, CST attains high binary success rates under both the harmless and the uncensored configurations, recovering both behaviors simultaneously, which vanilla DPO fails to do (e.g., success under one configuration drops to 0.12) (Gallego, 2024).
- Multi-task prompt sets (e.g., honest vs. role-played outputs) are handled with pairwise accuracy near 1.00 across all axes.
- General-domain performance (e.g., ARC, HellaSwag, MMLU, TQA) is preserved or slightly improved, showing that CPT does not degrade capabilities outside the tuned alignment axes.
6. Limitations, Interpretations, and Extension Pathways
Limitations include:
- Synthetic data quality depends on teacher LLM critique and generative ability; teacher bias may propagate.
- Rubric construction and scoring are labor-intensive and subject to granularity trade-offs.
- Coarse configuration prompts (e.g., binary safety toggles) could benefit from richer, hierarchical, or compositional schema.
Extensions highlighted include:
- Hierarchical or multi-label configuration schemes, where $c$ encodes multiple attributes (e.g., "safety: non-toxic; tone: humorous").
- Automatic rubric discovery or clustering from human feedback.
- Continuous or soft configuration selectors driven by small gating networks.
- Online data generation and active learning to refine preference control (Gallego, 2024; Gallego, 13 Jun 2025).
A plausible implication is that CPT enables practical, scalable personalization and alignment in LLMs, facilitating deployment in heterogeneous environments with varying requirements for style, safety, and task-specific behavior, without repeated retraining or static preference lock-in.
7. Released Artifacts and Reproducibility
Key reproducibility assets include:
- Synthetic data generation scripts and datasets at huggingface.co/datasets/vicgalle/creative-rubrics-preferences.
- Source code for the CPT and CST pipelines at github.com/vicgalle/configurable-preference-tuning and github.com/vicgalle/configurable-safety-tuning.
- LoRA-conditioned checkpoints for multiple base models.
- Evaluation scripts leveraging state-of-the-art LLM judges (Claude 3.5 Sonnet).
Stepwise reproduction (an end-to-end sketch follows the list):
- Install dependencies and generate synthetic preference data.
- Fine-tune base models with LoRA adapters using DPO-style objective over the assembled dataset.
- Evaluate model preference control and output quality using rubric-aware LLM judges.
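The sketch below strings the fine-tuning step together with `trl`'s `DPOTrainer` and a LoRA adapter. It is a sketch under assumptions: a recent `trl` release (argument names have changed across versions), an illustrative base-model id, and unverified column names in the released dataset.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Dataset id comes from the released artifacts above; the base-model id and
# hyperparameters are illustrative. The trainer expects "prompt"/"chosen"/
# "rejected" columns, with the configuration c folded into the prompt.
dataset = load_dataset("vicgalle/creative-rubrics-preferences", split="train")
model_id = "mistralai/Mistral-Nemo-Instruct-2407"

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="cpt-dpo", beta=0.1, per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,   # named `tokenizer=` in older trl releases
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
trainer.save_model("cpt-dpo/final")
```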
This ecosystem supports robust replication and extension across new axes and model architectures (Gallego, 13 Jun 2025; Gallego, 2024).