
Self-Consistency Preference Optimization (ScPO)

Updated 15 March 2026
  • Self-Consistency Preference Optimization (ScPO) is a framework that leverages vote-based consistency signals from multiple chain-of-thought samples to guide unsupervised and semi-supervised training.
  • It ranks model-generated solutions by their frequency of consistent answers, enabling preference pair generation without the need for extensive gold annotations.
  • Empirical results on GSM8K, MATH, and ZebraLogic demonstrate that ScPO improves model accuracy significantly over reward-model-based optimization methods with minimal extra complexity.

Self-Consistency Preference Optimization (ScPO) is a self-alignment framework for LLMs that leverages self-consistency at the training stage rather than solely at inference time. The core idea is to prefer solutions that are most consistent across multiple samples for a given reasoning task and to use these internal consistency signals to guide preference-based finetuning in the absence of gold annotation. This approach enables fully unsupervised or semi-supervised training for multi-step reasoning and logic problems, and has demonstrated substantial gains over reward-model-based optimization, closing much of the gap with supervised preference training on tasks such as GSM8K, MATH, and ZebraLogic (Prasad et al., 2024).

1. Motivation and Theoretical Rationale

Self-alignment for reasoning tasks is challenging because models struggle to judge the correctness of their own outputs reliably, especially on multi-step problems. Inference-time self-consistency, in which multiple chain-of-thought (CoT) samples are generated per prompt and the most common answer is selected, has empirically boosted accuracy (Prasad et al., 2024). The key hypothesis of ScPO is that this inference-time signal can be transferred to training: the model's own vote-based consistency on each query serves as a preference indicator between its responses, replacing noisy external reward models or expensive human labels.

Objectives of ScPO include:

  1. Bootstrapping high-quality training data from unlabeled reasoning problems, including synthetic problems generated by the model itself.
  2. Forming preference pairs by ranking model-sampled solutions to a problem according to the frequency of their answer.
  3. Optimizing a preference-based loss weighted by the degree of intra-sample agreement, directly increasing the model’s likelihood of producing self-consistent outputs.
  4. Leveraging gold-labeled (supervised) data only when available, but not as a prerequisite.

2. Formalism and Optimization Framework

Given an unlabeled reasoning question $x$ and a current policy $M_t$, ScPO samples $k$ completions $\{y_1,\dots,y_k\}$ using CoT prompting. For each solution $y$, the final answer is extracted as $\mathrm{ans}(y)$. The self-consistency vote for each $y$ is defined as the number of completions that produce the same final answer:

$$\mathcal{V}(y) = \sum_{m=1}^{k} \mathbf{1}\big(\mathrm{ans}(y_m) = \mathrm{ans}(y)\big)$$

Preference pairs are constructed by taking $y^+ = \arg\max_{y} \mathcal{V}(y)$ (the most consistent answer) and $y^- = \arg\min_{y} \mathcal{V}(y)$ (the least consistent answer), and only including $(x, y^+, y^-)$ as a training instance if $\mathcal{V}(y^+) \ge \tau$, a minimum consistency threshold. Each pair is weighted by

$$w(x) = \frac{\mathcal{V}(y^+) - \mathcal{V}(y^-)}{k}$$
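The vote, pair-selection, and weighting steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `extract_answer` and `build_preference_pair` are hypothetical, the answer extractor assumes each completion ends with an "Answer:" marker, the threshold is assumed to apply to the top vote count, and the weight is assumed to be the vote margin divided by $k$. Skipping unanimous sample sets is an extra design choice made here, since they provide no contrasting rejected answer.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Optional


def extract_answer(completion: str) -> str:
    """Toy ans(y): take the text after the last 'Answer:' marker (assumed format)."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def build_preference_pair(
    completions: list[str],
    tau: int,
    ans: Callable[[str], str] = extract_answer,
) -> Optional[tuple[str, str, float]]:
    """Return (y_plus, y_minus, weight), or None if the query is filtered out."""
    k = len(completions)
    if k == 0:
        return None

    answers = [ans(y) for y in completions]
    counts = Counter(answers)             # votes per distinct final answer
    votes = [counts[a] for a in answers]  # V(y) for each sampled completion

    i_plus = max(range(k), key=votes.__getitem__)   # most consistent sample
    i_minus = min(range(k), key=votes.__getitem__)  # least consistent sample

    # Filter: the top answer must reach the threshold, and the rejected
    # sample must disagree with the chosen one (assumption, see lead-in).
    if votes[i_plus] < tau or answers[i_plus] == answers[i_minus]:
        return None

    weight = (votes[i_plus] - votes[i_minus]) / k  # assumed normalized vote margin
    return completions[i_plus], completions[i_minus], weight
```

For example, with four samples of which three end in "Answer: 4" and one in "Answer: 5", a threshold of `tau=2` keeps the query and assigns weight (3 - 1) / 4 = 0.5, while `tau=4` filters it out.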

The ScPO loss builds upon Direct Preference Optimization (DPO) by optimizing:

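$$\mathcal{L}_{\mathrm{ScPO}}(y^+, y^- \mid x) = -\,w(x)\,\log \sigma\!\left(\beta \log \frac{M_\theta(y^+ \mid x)}{M_t(y^+ \mid x)} - \beta \log \frac{M_\theta(y^- \mid x)}{M_t(y^- \mid x)}\right) - \frac{\alpha\, w(x)}{|y^+|} \log M_\theta(y^+ \mid x)$$

where $\sigma$ is the sigmoid, $\alpha$ and $\beta$ are hyperparameters, $|y^+|$ is the token length of $y^+$, and $M_\theta$ is the model being trained with $M_t$ as the reference. When supervised labels are available, $y^+$ is a gold solution and $w(x) = 1$.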
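To make the structure of the objective concrete, the sketch below evaluates a per-pair loss from sequence log-probabilities. It assumes the loss is a DPO-style log-sigmoid term plus a length-normalized negative log-likelihood term on the chosen solution, both scaled by the pair weight $w(x)$. The function names are hypothetical, plain floats stand in for tensors, and no values for $\alpha$ or $\beta$ are implied.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def scpo_pair_loss(
    logp_plus: float,       # log-prob of y+ under the model being trained
    logp_minus: float,      # log-prob of y- under the model being trained
    ref_logp_plus: float,   # log-prob of y+ under the frozen reference M_t
    ref_logp_minus: float,  # log-prob of y- under the frozen reference M_t
    len_plus: int,          # token length of y+
    weight: float,          # w(x); 1.0 for gold-labeled pairs
    alpha: float,
    beta: float,
) -> float:
    """Weighted preference loss for one (y+, y-) pair (assumed form, see lead-in)."""
    # DPO term: reward the policy for widening the chosen-vs-rejected
    # log-ratio gap relative to the reference model.
    margin = beta * ((logp_plus - ref_logp_plus) - (logp_minus - ref_logp_minus))
    dpo_term = -log_sigmoid(margin)
    # NLL term: length-normalized likelihood of the chosen solution.
    nll_term = -alpha * logp_plus / len_plus
    return weight * (dpo_term + nll_term)
```

A pair with weight 0 contributes nothing to the objective, and a pair with a large vote margin contributes almost as much as a gold-labeled pair, which is how the consistency signal modulates training.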

The following table summarizes essential notation:

| Symbol | Description |
|---|---|
| $x$ | Unlabeled reasoning problem |
| $M_t$ | Model at iteration $t$ |
| $\{y_1,\dots,y_k\}$ | CoT-sampled solutions |
| $\mathcal{V}(y)$ | Vote count for answer in sample set |
| $y^+$, $y^-$ | Most-consistent, least-consistent solutions |
| $w(x)$ | Preference pair weight ($w(x) \in [0, 1]$) |
| $\tau$ | Consistency threshold for filtering |

3. Algorithmic Procedure

The canonical ScPO procedure is an iterative self-bootstrapping loop over $T$ iterations:

  1. For each training round, augment the query set by generating new problems using few-shot prompting, discarding queries for which no answer reaches the consistency threshold, i.e. $\max_y \mathcal{V}(y) < \tau$.
  2. For each problem, sample $k$ solutions, compute vote counts, and form weighted preference pairs as above.
  3. Aggregate these into a preference dataset for the round.
  4. Train a new model copy $M_{t+1}$ on the ScPO loss using these weighted pairs.
  5. Replace $M_t$ with $M_{t+1}$ and repeat.

Empirically, two iterations are sufficient for convergence, and further iterations yield diminishing returns.
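The five steps above can be sketched as a short driver loop. This is an illustrative skeleton under stated assumptions, not the published algorithm: all four callables (`generate_queries`, `sample_solutions`, `make_pair`, `train`) are hypothetical placeholders supplied by the caller, and each round is assumed to use the seed queries plus freshly generated ones.

```python
from __future__ import annotations

from typing import Any, Callable, Iterable

Pair = tuple  # (x, y_plus, y_minus, weight)


def scpo_loop(
    model: Any,
    seed_queries: Iterable[str],
    num_iterations: int,
    generate_queries: Callable[[Any, list], list],  # (model, queries) -> new queries
    sample_solutions: Callable[[Any, str], list],   # (model, query) -> k completions
    make_pair: Callable[[list], Any],               # completions -> (y+, y-, w) or None
    train: Callable[[Any, list], Any],              # (model, pairs) -> next model
) -> Any:
    """Run the iterative self-bootstrapping loop and return the final model."""
    seed = list(seed_queries)
    for _ in range(num_iterations):
        # Step 1: augment the query set with model-generated problems.
        queries = seed + list(generate_queries(model, seed))

        # Steps 2-3: sample, vote, and aggregate weighted preference pairs.
        pairs: list[Pair] = []
        for x in queries:
            result = make_pair(sample_solutions(model, x))
            if result is None:  # query failed the consistency filter
                continue
            y_plus, y_minus, weight = result
            pairs.append((x, y_plus, y_minus, weight))

        # Steps 4-5: train the next model on the pairs and swap it in.
        model = train(model, pairs)
    return model
```

Passing the components as callables keeps the loop independent of any particular sampling backend or trainer, so the same skeleton works for unsupervised and semi-supervised variants.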

4. Empirical Results and Benchmarks

Experiments span math and logic domains, with principal evaluation on GSM8K (math word problems), MATH (complex math questions), and ZebraLogic (logic grid puzzles) (Prasad et al., 2024). Training uses Llama-3 (8B) as the base, with larger models as comparative baselines. The protocol includes both purely unsupervised training (using only model-generated preference pairs) and a semi-supervised variant (using available labeled data).

Key results are summarized in the tables below:

GSM8K Zero-Shot Exact-Match Accuracy (%)

| Method | Train data (K) | Greedy | SC 8-way |
|---|---|---|---|
| Seed | – | 41.17 | 51.80 |
| IRPO | seed 4.4 + gen | 50.11 | 61.25 |
| ScPO | seed 1.4 + gen 5.1 | 63.91 | 71.11 |
| IRPO | seed 5.7 + gen – | 64.29 | 72.56 |
| ScPO | seed 5.7 + gen 4.5 | 66.64 | 74.75 |

MATH Zero-Shot Exact-Match Accuracy (%)

| Method | Train data (K) | Greedy | SC 8-way |
|---|---|---|---|
| Seed | – | 14.46 | 18.20 |
| IRPO | seed 6.5 + gen – | 18.08 | 22.64 |
| ScPO | seed 1.2 + gen 2.5 | 19.72 | 24.58 |
| IRPO | seed 3.0 + gen – | 20.32 | 26.88 |
| ScPO | seed 3.0 + gen 2.2 | 20.48 | 26.92 |

ZebraLogic Logic Grid Puzzles: Puzzle and Cell Accuracy (%)

| Model | Train data (K): seed + gen | Puzzle ↑ | Cell ↑ |
|---|---|---|---|
| Llama-3 70B | – | 17.2 | 42.9 |
| Gemma-2 27B | – | 16.3 | 41.2 |
| Claude-3 Haiku | – | 14.3 | 37.9 |
| Seed (Llama-3 8B) | – | 11.6 | 39.1 |
| IRPO | seed 1.0 | 11.3 | 42.1 |
| ScPO | seed 0.4 + gen 2.2 | 18.1 | 45.2 |

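Statistical significance was not explicitly reported. Comparing each ScPO row with the IRPO row that precedes it, the greedy-accuracy margins are 13.80 and 2.35 pp on GSM8K, 1.64 and 0.16 pp on MATH, and 6.8 pp in puzzle accuracy on ZebraLogic. The larger margins are unlikely to reflect random variation, but the smallest ones may fall within it.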

5. Analysis, Advantages, and Limitations

Quantitative ablations reveal that weighting the loss by the degree of consistency yields a 1–2 pp accuracy improvement over unweighted variants. The consistency threshold $\tau$ governs the precision-recall trade-off in preference generation: a higher $\tau$ keeps fewer pairs but admits only those backed by stronger agreement. The best-performing setting of $\tau$ is reported in Prasad et al. (2024).

Theoretical insights:

  • No formal convergence guarantees, but empirical evidence shows saturation after two rounds.
  • Consistency (vote share) is strongly correlated with ground truth accuracy (Somers’ D ≈ 0.8 for GSM8K, 0.68 for MATH, 0.92 for ZebraLogic), which justifies self-consistency as a proxy for correctness.
  • ScPO acts as a distillation of the empirical “consistency distribution” into the model’s base prediction distribution, boosting accuracy and pseudo-likelihood of correct outputs.
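As an illustration of the correlation statistic cited above, the sketch below computes Somers' D between per-sample vote share and a 0/1 correctness label. It assumes the asymmetric form D(Y|X) with vote share as the predictor X and correctness as the outcome Y; the source does not state which direction was used, and the function name `somers_d` and the toy data in the usage note are hypothetical.

```python
from itertools import combinations


def somers_d(x: list, y: list) -> float:
    """Somers' D(Y|X): (concordant - discordant) / (pairs not tied on x)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on the predictor are excluded
        untied_x += 1
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # higher vote share, higher correctness
        elif direction < 0:
            discordant += 1   # higher vote share, lower correctness
        # direction == 0: tied on y only; counts in the denominator

    if untied_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_x
```

With vote shares `[0.25, 0.5, 0.75, 1.0]` and correctness labels `[0, 0, 1, 1]`, four of the six pairs are concordant and none are discordant, giving D = 4/6 ≈ 0.67. Values near 1 indicate that higher vote share almost always accompanies correct answers.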

Key limitations:

  • Requires the seed model to initially exhibit non-trivial self-consistency; on extremely difficult or under-specified tasks, bootstrapping may cover only a minority of samples.
  • Currently designed for single-answer reasoning; adaptation to open-ended or generative tasks remains nontrivial.
  • No formal or theoretical convergence proof; empirically, performance gains saturate after two or three rounds.

6. Practical Recommendations for Implementation

ScPO can be used with any LLM capable of chain-of-thought generation; instruction tuning is helpful but not mandatory. For effective deployment:

  • Sample $k$ solutions per prompt (16 for broad output spaces) to estimate consistency, using one temperature and top-$p$ setting for chosen solutions and a separate temperature to diversify rejected ones.
  • Set an initial threshold $\tau$, raising it in later rounds as model consistency improves.
  • Optimize with fixed values of the loss hyperparameters $\alpha$ and $\beta$; hyperparameter tuning is recommended if validation data are available.
  • Two full ScPO iterations are typically sufficient; a third pass can be applied on held-out queries if data privacy allows.
  • Computational requirements are similar to ordinary preference-based finetuning; ScPO does not increase inference-time complexity and is compatible with concurrent inference-time self-consistency.
  • If any gold solutions are available, include them as labeled preference pairs with $w(x) = 1$ for additional gain.

ScPO enables robust, annotation-free finetuning for multi-step reasoning and logic tasks, successfully translating the inference-time self-consistency signal into a direct and effective training signal (Prasad et al., 2024).
