Self-Preference in LLMs

Updated 25 August 2025
  • Self-preference in LLMs is the model’s inherent ability to assess and rank its own outputs using internal reward signals, characterized through probing, DPO, and self-annotation techniques.
  • Techniques such as RLKF, active preference learning, and self-improvement frameworks (e.g., ScPO, SGPO) leverage internal feedback for enhanced alignment and factual accuracy.
  • Key challenges include detecting self-bias, resolving contradictory preferences, and ensuring robustness through hybrid evaluation and noise-aware methodologies.

Self-preference in LLMs denotes a model’s ability to generate, recognize, and optimize responses according to its own internal criteria, implicitly or explicitly reflecting its own knowledge state, internal reward, or emergent alignment signatures. This phenomenon spans mechanisms ranging from intrinsic knowledge-state awareness and preference formation to iterative self-improvement via internally generated feedback. Self-preference also encompasses both beneficial forms—such as more robust alignment and improved factuality through self-assessment—and problematic forms, such as bias in LLM-as-judge evaluation settings or the persistence of model-internal limitations. The research literature addresses self-preference through probing, optimization, dataset construction, multi-objective trade-offs, and meta-personalization, yielding a dual focus on optimization performance and diagnostic methodology.

1. Forms and Detection of Self-Preference

Self-preference manifests as the model’s intrinsic prioritization or ranking of candidate outputs, which can be assessed via a variety of mechanisms:

  • Knowledge-State Probing and Self-Awareness: Internal linear probes trained on hidden states, MLP outputs, or attention outputs can achieve >85% accuracy in determining whether the model “knows” the correct answer for knowledge-intensive questions. This suggests that LLMs encode their epistemic state in internal representations (Liang et al., 27 Jan 2024); a minimal probe sketch follows this list.
  • Implicit Preference Model in DPO: In Direct Preference Optimization (DPO), the reward assigned to a response is entirely determined by the model’s own output distribution, yielding an implicit internal reward function (Muldrew et al., 12 Feb 2024). The preference for one output over another can thus be seen as a form of model self-evaluation.
  • Self-Annotation and Contradictory Preferences: Self-annotation methods rely on having the model judge or label its own outputs, as in constructing preference graphs and preference orderings (Zhang et al., 13 Jun 2024). However, such self-generated orderings are prone to cycles, indicating the presence of internal contradictions—e.g., the model prefers A over B, B over C, but C over A.
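
In its simplest form, the knowledge-state probing above reduces to fitting a linear classifier on stored activations. The sketch below assumes that hidden states and correctness labels have already been collected; the arrays here are random placeholders rather than real model activations.

```python
# Minimal sketch of a knowledge-state probe: a linear classifier over stored
# hidden activations predicts whether the model "knows" the answer to each
# question. The arrays below are random placeholders standing in for real
# activations and correctness labels collected beforehand.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))   # (n_questions, d_model) placeholder
knows_answer = rng.integers(0, 2, size=1000)   # 1 if the model answered correctly

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, knows_answer, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```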

Self-preference is thus empirically detectable via:

  • Probing internal activations for knowledge or preference signals,
  • Comparing logit-based or probability-based output preferability,
  • Analyzing the structure (e.g., acyclicity) of self-annotated preference graphs (see the cycle-check sketch after this list),
  • Quantitative metrics of bias when the model acts as its own judge (Wataoka et al., 29 Oct 2024, Chen et al., 4 Apr 2025).
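
As a concrete illustration of the acyclicity check mentioned above, the sketch below treats self-annotated pairwise judgments as edges of a directed graph and reports any contradictory cycle. The three example judgments are hypothetical.

```python
# Sketch of the acyclicity check on a self-annotated preference graph:
# nodes are candidate responses, and a directed edge (a, b) records that the
# model preferred a over b. Any directed cycle is an internal contradiction.
import networkx as nx

# Hypothetical self-judgments: the model prefers A over B, B over C, but C over A.
preferences = [("A", "B"), ("B", "C"), ("C", "A")]

graph = nx.DiGraph(preferences)
if nx.is_directed_acyclic_graph(graph):
    print("self-preferences are globally consistent")
else:
    print("contradictory cycle:", nx.find_cycle(graph))
```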

2. Self-Preference for Alignment and Self-Improvement

Recent research exploits self-preference as a signal for alignment and ongoing self-improvement:

  • RLKF (Reinforcement Learning from Knowledge Feedback): Models use self-awareness—inferred via probing and consistency checks—as the reward signal for reinforcement learning. The RLKF framework labels output factuality using an automated annotation tool (DreamCatcher), which scores outputs based on internal consistency, similarity to ground truth, and self-evaluated knowledge (Liang et al., 27 Jan 2024). Rewards derived from internal self-preference are then used to optimize model behavior, encouraging honest knowledge expression or explicit uncertainty.
  • Active Preference Learning: By leveraging the model’s uncertainty (predictive entropy) and the certainty of its implicit self-preference (via DPO reward margin), active learning identifies pairs where the model is both highly uncertain about the completion and confidently wrong in its preference ranking—targeting correction to scenarios of maximal self-preference misalignment (Muldrew et al., 12 Feb 2024); an acquisition-scoring sketch follows this list.
  • Spread Preference Annotation (SPA): Iteratively expands a small human-labeled seed using logit-based self-preference; the LLM generates responses, directly annotates preferences from logits, and performs noise-aware preference learning. This demonstrates that strong alignment can be achieved from minimal external annotation, as the model “spreads” prior intuition through self-judgment, with self-refinement and logit decoupling to mitigate noise (Kim et al., 6 Jun 2024).
  • Self-Improvement Algorithms (ScPO, SGPO, SIPO): Methods such as Self-Consistency Preference Optimization (ScPO) and Self-Generated Preference Optimization (SGPO) construct training pairs using internally consistent or improved outputs. ScPO relies on sampling and voting (the majority or highest-frequency answer is chosen as the positive sample), training the model to prefer consistently generated outputs (Prasad et al., 6 Nov 2024). SGPO unifies the policy and “improver” so that preference data is on-policy: the model refines its output incrementally, referencing high-quality supervised responses (Lee et al., 27 Jul 2025).
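
The acquisition rule behind active preference learning can be sketched with the DPO implicit reward, r(x, y) = β·(log π_θ(y|x) − log π_ref(y|x)). The code below uses placeholder log-probabilities and a simple product of entropy and reward margin as the acquisition score; the exact scoring used in the cited work may differ.

```python
# Sketch of an acquisition score combining predictive entropy with the DPO
# implicit reward margin. All log-probabilities and entropies below are
# placeholders, and the scoring rule is an illustrative simplification.
import torch

BETA = 0.1  # DPO temperature (assumed)

def implicit_reward(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return BETA * (logp_policy - logp_ref)

def acquisition_score(entropy: torch.Tensor,
                      reward_a: torch.Tensor,
                      reward_b: torch.Tensor) -> torch.Tensor:
    """Favor prompts where the model is uncertain about the completion yet
    assigns a large margin to its own preference between the two candidates."""
    margin = (reward_a - reward_b).abs()
    return entropy * margin

# Placeholder statistics for three candidate pairs.
entropy = torch.tensor([2.1, 0.4, 1.7])
reward_a = implicit_reward(torch.tensor([-12.0, -8.0, -10.0]),
                           torch.tensor([-11.0, -9.0, -12.0]))
reward_b = implicit_reward(torch.tensor([-14.0, -7.5, -15.0]),
                           torch.tensor([-11.5, -8.0, -13.0]))

scores = acquisition_score(entropy, reward_a, reward_b)
print("query the pair at index:", int(scores.argmax()))
```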

Key optimization strategies involve:

  • Iteratively generating synthetic preference data using the current model’s outputs,
  • Using self-judgment (e.g., logits, reward predictions, or consistency) to select preferences (see the consistency-voting sketch after this list),
  • Filtering or weighting pairs based on estimated certainty, e.g., through MC dropout or information gain (Wang et al., 17 Sep 2024),
  • Combining with noise-aware or label-smoothing approaches to dampen erroneous self-preference signals.
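
A minimal sketch of consistency-based voting for constructing ScPO-style preference pairs is shown below. The helper `build_consistency_pair`, the vote-margin weighting, and the sampled answers are illustrative simplifications standing in for the model’s own generations.

```python
# Sketch of ScPO-style pair construction via self-consistency voting: sample
# several answers, take the most frequent answer as "chosen" and a minority
# answer as "rejected". The vote-margin weight and the sampled answers are
# illustrative placeholders for the model's own generations.
from collections import Counter

def build_consistency_pair(samples: list[str]):
    counts = Counter(samples)
    (chosen, chosen_votes), *rest = counts.most_common()
    if not rest:
        return None                          # unanimous samples yield no pair
    rejected, rejected_votes = rest[-1]      # least consistent answer
    weight = (chosen_votes - rejected_votes) / len(samples)
    return {"chosen": chosen, "rejected": rejected, "weight": weight}

samples = ["42", "42", "41", "42", "17", "42", "41", "42"]
print(build_consistency_pair(samples))
```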

3. Bias, Contradiction, and Diagnosis of Self-Preference

LLMs acting as automated evaluators (LLM-as-a-judge) often display a self-preference bias: the tendency to overvalue their own responses relative to those produced by others (Wataoka et al., 29 Oct 2024, Chen et al., 4 Apr 2025).

  • Self-Preference Bias and Quantification: Bias is quantified using Equal Opportunity or Demographic Parity-style metrics that capture the difference in the probability that the LLM favors its own answer when humans also prefer it versus when humans prefer an alternative (Wataoka et al., 29 Oct 2024); a sketch of this computation follows this list. Experiments across LLM benchmarks reveal that GPT-4, for example, exhibits a strong self-preference bias in pairwise evaluation scenarios, while some models demonstrate negative bias, undervaluing their own outputs.
  • Perplexity as a Causal Factor: The analytical origin of this bias is linked to log perplexity of outputs—LLMs favor texts (regardless of authorship) that are more predictable under their own distributions. Lower perplexity correlates strongly with higher evaluated quality per the LLM, even if human labeling disagrees.
  • Legitimate vs. Harmful Self-Preference: Recent work differentiates between legitimate self-preference (the model prefers its own output and its answer is objectively correct) and harmful self-preference (preferring its own answer even when it is objectively wrong) (Chen et al., 4 Apr 2025). Legitimate preference dominates in strong models, but harmful self-preference persists—especially when a strong model is confidently wrong on difficult examples. Mitigation via chain-of-thought scaling at evaluation time is effective.
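
One way to operationalize the probability-difference metric described above is sketched below: it compares how often the judge selects its own answer when human annotators agree with it versus when they prefer the alternative. The record format and example values are hypothetical, and the precise definitions in the cited papers may differ.

```python
# Hypothetical records of pairwise evaluations where the judge's own answer is
# one of the two candidates; the exact metric definitions may differ from the
# cited papers.
def self_preference_rates(records):
    """Each record: {'judge_prefers_own': bool, 'human_prefers_own': bool}."""
    agree = [r for r in records if r["human_prefers_own"]]
    disagree = [r for r in records if not r["human_prefers_own"]]
    p_agree = sum(r["judge_prefers_own"] for r in agree) / max(len(agree), 1)
    p_disagree = sum(r["judge_prefers_own"] for r in disagree) / max(len(disagree), 1)
    # An unbiased judge tracks human preference, so a high value of
    # p_own_given_human_other signals self-preference bias.
    return {"p_own_given_human_own": p_agree,
            "p_own_given_human_other": p_disagree}

records = [
    {"judge_prefers_own": True,  "human_prefers_own": True},
    {"judge_prefers_own": True,  "human_prefers_own": False},
    {"judge_prefers_own": True,  "human_prefers_own": False},
    {"judge_prefers_own": False, "human_prefers_own": True},
]
print(self_preference_rates(records))
```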

For diagnosis and mitigation:

  • Ensemble or hybrid judge models reduce individual self-preference artifacts,
  • Re-weighting judgments by perplexity helps offset overvaluation of familiar outputs (an illustrative correction is sketched after this list),
  • Ensuring that evaluation protocols control for order bias and for objective correctness mitigates spurious self-alignment (Chen et al., 4 Apr 2025).
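
The perplexity re-weighting idea can be illustrated with a simple correction: boost a candidate’s raw judge score in proportion to its log-perplexity under the judge, so that text that is merely more predictable to the judge is not automatically rated higher. The coefficient, the functional form, and the example log-probabilities are all assumptions for illustration.

```python
# Illustrative correction for perplexity-driven overvaluation; the coefficient
# and functional form are assumptions rather than a published method.
def log_perplexity(token_logps: list[float]) -> float:
    return -sum(token_logps) / len(token_logps)

def reweighted_score(judge_score: float, token_logps: list[float],
                     alpha: float = 0.5) -> float:
    # Higher log-perplexity (less familiar text) receives a larger boost.
    return judge_score + alpha * log_perplexity(token_logps)

own_answer_logps = [-0.4, -0.3, -0.5, -0.2]    # familiar (low-perplexity) text
other_answer_logps = [-1.6, -1.2, -1.9, -1.4]  # less familiar text

print(reweighted_score(8.0, own_answer_logps),
      reweighted_score(8.0, other_answer_logps))
```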

4. Internal Consistency and Self-Preference Optimization

  • Internal Reward Model Consistency: Self-rewarding systems depend on the model’s ability to provide stable, consistent internal signals. However, different instantiations (e.g., generative reward models vs. implicit DPO reward models) within the same LLM can be highly inconsistent, with disagreement on up to 50% of preference pairs (Zhou et al., 13 Feb 2025). This undermines the reliability of self-generated preference data and, consequently, alignment progress.
  • Self-Consistent Internal Rewards (SCIR): SCIR addresses this by enforcing consistency across multiple internal reward models using symmetric KL-divergence and entropy regularization. Only preference pairs for which all rewards agree are used for preference optimization, increasing the reliability of signals and alignment performance (Zhou et al., 13 Feb 2025); a consistency-loss sketch follows this list.
  • Contradictions in Preference Graphs: Self-annotated preference graphs constructed from pairwise comparisons often contain cycles. The ContraSolver algorithm identifies and removes contradictory (cycle-creating) edges by prioritizing high-confidence, non-contradictory pairs, ensuring the final preference graph is acyclic and globally consistent. This produces measurable improvements on harm-free, instruction-following, sentiment, and summarization tasks (Zhang et al., 13 Jun 2024).
  • Multi-Objective Alignment and Pareto Self-Preference: In multi-objective alignment (e.g., where safety and helpfulness may compete), self-preference can create optimization conflicts. Self-improving DPO (SIPO) frameworks autonomously generate Pareto-optimal responses via self-refinement and filtering, ensuring that chosen outputs perform well across all objectives. This substantially improves the empirical Pareto front in multi-objective alignment settings (Li et al., 20 Feb 2025).
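
The SCIR-style consistency idea can be sketched as follows: each internal reward model is read as a probability that response A beats response B, disagreement is penalized with a symmetric KL term, an entropy regularizer is added (its weight and placement here are assumptions), and only pairs on which the reward models agree are retained for preference optimization. All probabilities below are placeholders.

```python
# Sketch of a SCIR-style consistency objective over two internal reward
# models (a generative judge and the implicit DPO reward). Probabilities,
# the entropy weighting, and the agreement filter are illustrative.
import torch
import torch.nn.functional as F

def symmetric_kl(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between Bernoulli preference distributions p and q."""
    p2 = torch.stack([p, 1 - p], dim=-1)
    q2 = torch.stack([q, 1 - q], dim=-1)
    return (F.kl_div(q2.log(), p2, reduction="none").sum(-1)
            + F.kl_div(p2.log(), q2, reduction="none").sum(-1))

def consistency_loss(p_gen: torch.Tensor, p_dpo: torch.Tensor, lam: float = 0.1):
    entropy = -(p_gen * p_gen.log() + (1 - p_gen) * (1 - p_gen).log())
    return symmetric_kl(p_gen, p_dpo).mean() + lam * entropy.mean()

# Preference probabilities from a generative judge and the implicit DPO reward.
p_gen = torch.tensor([0.90, 0.40, 0.80])
p_dpo = torch.tensor([0.85, 0.70, 0.75])

agree = (p_gen > 0.5) == (p_dpo > 0.5)   # retain only agreeing pairs
print(consistency_loss(p_gen, p_dpo).item(), agree.tolist())
```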

5. Methodological and Practical Implications

  • Preference Degree Awareness: Traditional DPO methods treat preference as binary. Self-supervised Preference Optimization (SPO) enriches the preference signal by creating variants with graded loss of key content and tasking the model with recognizing “degrees” of preference. This allows LLMs to internalize a more nuanced preference landscape, which empirically improves win rates and robustness on summarization and dialogue benchmarks (Li et al., 26 Sep 2024).
  • Self-Preference as a Mechanism in Personalization: In personalized and pluralistic preference alignment, LLMs can model user-level self-preference by optimizing, at training or inference time, with respect to per-user latent reward functions. Techniques include user-specific parameters, steerable conditional models, prompt-based adaptation, and value-guided decoding (Xie et al., 9 Apr 2025). Public benchmarks such as PersonalLLM provide multi-dimensional scoring, user embedding, and retrieval-based meta-learning platforms for explicit user-level self-preference adaptation (Zollo et al., 30 Sep 2024).
  • Reward Stability and Data Quality in Non-Text Modalities: In video-LLMs, self-preference methods such as Lean Preference Optimization (LeanPO) address reward instability (likelihood displacement) by reformulating the reward as the average policy likelihood and using a self-reflective, trustworthiness-injected pipeline for generating preference data. Combining dynamic label smoothing with log-likelihood differences mitigates overfitting to noisy data and improves preference alignment in multimodal settings (Wang et al., 5 Jun 2025); a sketch of these two mechanics follows this list.
  • Scalability and Minimization of Human Supervision: Across frameworks, the consistent finding is that robust alignment can be achieved by bootstrapping a small amount of high-quality human-labeled data, then amplifying and refining the model’s alignment using its own self-preferential signals—provided that noise, inconsistency, and bias are controlled with appropriate refinement, decoupling, or consistency mechanisms (Kim et al., 6 Jun 2024, Lee et al., 27 Jul 2025).
  • Self-Preference Bias as a Diagnostic and an Optimization Target: The presence of self-preference bias, legitimate or harmful, can serve as a diagnostic for both overfitting to internal heuristics and as a target for harnessing beneficial self-assessment in iterative self-improvement. Disentangling the two is critical for both safety and performance (Wataoka et al., 29 Oct 2024, Chen et al., 4 Apr 2025).
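
The two LeanPO mechanics named above, a reward based on the average (length-normalized) policy log-likelihood and label smoothing applied to the reward gap, can be sketched as below. Token log-probabilities, masks, beta, and the smoothing value are placeholders, and the exact LeanPO formulation, including its dynamic smoothing schedule, may differ.

```python
# Sketch of (i) an average per-token log-likelihood reward and (ii) a
# label-smoothed Bradley-Terry style preference loss on the reward gap.
# All values below are placeholders.
import torch
import torch.nn.functional as F

def avg_loglik_reward(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-token log-likelihood of a response under the policy."""
    return (token_logps * mask).sum(-1) / mask.sum(-1)

def smoothed_preference_loss(r_chosen, r_rejected, beta=2.0, eps=0.1):
    """Preference loss on the reward gap with label smoothing eps."""
    logits = beta * (r_chosen - r_rejected)
    return -((1 - eps) * F.logsigmoid(logits) + eps * F.logsigmoid(-logits)).mean()

# Placeholder per-token log-probs and padding masks for chosen / rejected responses.
logp_chosen = torch.tensor([[-1.2, -0.8, -0.5, 0.0]])
logp_rejected = torch.tensor([[-2.0, -1.5, -1.9, -1.1]])
mask_chosen = torch.tensor([[1.0, 1.0, 1.0, 0.0]])
mask_rejected = torch.tensor([[1.0, 1.0, 1.0, 1.0]])

loss = smoothed_preference_loss(avg_loglik_reward(logp_chosen, mask_chosen),
                                avg_loglik_reward(logp_rejected, mask_rejected))
print(loss.item())
```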

6. Limitations and Future Directions

While substantial progress has been made in leveraging self-preference for alignment and self-improvement, open problems remain:

  • Further reduction of harmful self-preference bias, especially as LLMs are used as automated evaluators or “referees” of other models,
  • Mitigating the tendency for self-preference to amplify style, distributional, or regional biases—particularly in media and political content generation (Kennedy et al., 20 Mar 2025),
  • Extending robustness of self-preference mechanisms to unseen domains or in settings with sparse human feedback,
  • Scaling self-improvement loops (SGPO/ScPO) without reintroducing instability or amplifying low-level failure modes,
  • Developing even more modular and compositional user modeling frameworks, with control at both parameter and decoding levels, to enable flexible, real-time adaptation to heterogeneous or evolving user preferences (Xie et al., 9 Apr 2025).

Self-preference as both an introspective optimization signal and a potential source of bias remains a central concern for developing future LLMs that are robustly aligned, explainable, and trustworthy across a broad array of domains and personalization requirements.
