Self-Consistency Preference Optimization (ScPO)
- Self-Consistency Preference Optimization (ScPO) is a framework that leverages vote-based consistency signals from multiple chain-of-thought samples to guide unsupervised and semi-supervised training.
- It ranks model-generated solutions by their frequency of consistent answers, enabling preference pair generation without the need for extensive gold annotations.
- Empirical results on GSM8K, MATH, and ZebraLogic demonstrate that ScPO improves model accuracy significantly over reward-model-based optimization methods with minimal extra complexity.
Self-Consistency Preference Optimization (ScPO) is a self-alignment framework for LLMs that leverages self-consistency at the training stage rather than solely at inference time. The core idea is to prefer solutions that are most consistent across multiple samples for a given reasoning task and to use these internal consistency signals to guide preference-based finetuning in the absence of gold annotation. This approach enables fully unsupervised or semi-supervised training for multi-step reasoning and logic problems, and has demonstrated substantial gains over reward-model-based optimization, closing much of the gap with supervised preference training on tasks such as GSM8K, MATH, and ZebraLogic (Prasad et al., 2024).
1. Motivation and Theoretical Rationale
Self-alignment for reasoning tasks is inherently challenging because models struggle to judge the correctness of their own outputs reliably, especially on multi-step problems. Inference-time self-consistency, in which multiple chain-of-thought (CoT) samples are generated per prompt and the most common answer is selected, has empirically boosted accuracy (Prasad et al., 2024). The key hypothesis of ScPO is that this inference-time signal can be transferred to training: the model's own vote-based consistency on each query serves as a preference indicator between responses, replacing noisy external reward models or expensive human labels.
Objectives of ScPO include:
- Bootstrapping high-quality training data from unlabeled reasoning problems, including synthetic problems generated by the model itself.
- Forming preference pairs by ranking model-sampled solutions to a problem according to the frequency of their answer.
- Optimizing a preference-based loss weighted by the degree of intra-sample agreement, directly increasing the model’s likelihood of producing self-consistent outputs.
- Leveraging gold-labeled (supervised) data only when available, but not as a prerequisite.
2. Formalism and Optimization Framework
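Given an unlabeled reasoning question $x$ and a current policy $M_t$, ScPO samples $k$ completions $y_1, \dots, y_k \sim M_t(\cdot \mid x)$ using CoT prompting. For each solution $y$, the final answer is extracted as $\mathrm{ans}(y)$. The self-consistency vote for each $y$ is defined as the number of completions that produce the same final answer:

$$V(y) = \sum_{m=1}^{k} \mathbb{1}\left[\mathrm{ans}(y_m) = \mathrm{ans}(y)\right]$$

Preference pairs are constructed by taking $y^+ = \arg\max_{y} V(y)$ (the most consistent answer) and $y^- = \arg\min_{y} V(y)$ (the least consistent answer), and only including $(x, y^+, y^-)$ as a training instance if $V(y^+) \ge \tau$, a minimum consistency threshold. Each pair is weighted by

$$w(x) = \frac{V(y^+) - V(y^-)}{k}$$

The ScPO loss builds upon Direct Preference Optimization (DPO) by optimizing:

$$\mathcal{L}_{\text{ScPO}}(y^+, y^- \mid x) = -\,w(x)\,\log \sigma\!\left(\beta \log \frac{M_\theta(y^+ \mid x)}{M_t(y^+ \mid x)} - \beta \log \frac{M_\theta(y^- \mid x)}{M_t(y^- \mid x)}\right) - \frac{\alpha\, w(x)}{|y^+|} \log M_\theta(y^+ \mid x)$$

where $\sigma$ is the sigmoid, $\alpha$ and $\beta$ are hyperparameters, and $|y^+|$ is the token length of $y^+$. When supervised labels are available, $y^+$ is a gold solution and $w(x) = 1$.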
The following table summarizes essential notation:
| Symbol | Description |
|---|---|
| $x$ | Unlabeled reasoning problem |
| $M_t$ | Model at iteration $t$ |
| $y_1, \dots, y_k$ | CoT-sampled solutions |
| $V(y)$ | Vote count for answer in sample set |
| $y^+$, $y^-$ | Most-consistent, least-consistent solutions |
| $w(x)$ | Preference pair weight ($(V(y^+) - V(y^-))/k$) |
| $\tau$ | Consistency threshold for filtering |
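The ScPO loss for a single preference pair can be evaluated once sequence log-probabilities are available: a weighted DPO-style log-sigmoid term on the policy-versus-reference margin, plus a length-normalized negative log-likelihood term on the chosen solution. The following is a minimal sketch in plain Python, not the authors' implementation. The names `scpo_loss` and `log_sigmoid`, the argument names, and the default values of `alpha` and `beta` are illustrative assumptions and are not settings taken from the source.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def scpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
              weight, len_pos, alpha=1.0, beta=0.5):
    """Weighted DPO term plus length-normalized NLL term for one pair.

    logp_*     : summed token log-probs under the model being trained
    ref_logp_* : the same quantities under the frozen reference model M_t
    weight     : pair weight w(x) in [0, 1]
    len_pos    : token length of the chosen solution
    alpha/beta : illustrative defaults, NOT values from the source
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    dpo_term = -weight * log_sigmoid(margin)
    nll_term = -alpha * weight * logp_pos / len_pos
    return dpo_term + nll_term


# Toy numbers: the trained model still equals the reference, so the margin is 0
# and the DPO term reduces to weight * log(2).
loss = scpo_loss(logp_pos=-10.0, logp_neg=-12.0,
                 ref_logp_pos=-10.0, ref_logp_neg=-12.0,
                 weight=0.5, len_pos=5)
print(round(loss, 4))  # 1.3466
```

In a real training loop the log-probabilities would come from the model and be differentiable tensors; scalars are used here only to keep the arithmetic visible.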
3. Algorithmic Procedure
The canonical ScPO procedure is an iterative self-bootstrapping loop over $T$ iterations:
- For each training round, augment the query set by generating new problems using few-shot prompting, discarding queries whose top vote count falls below the consistency threshold ($V(y^+) < \tau$).
- For each problem, sample $k$ solutions, compute vote counts, and form weighted preference pairs as above.
- Aggregate these into a preference dataset for the round.
- Train a new model copy $M_{t+1}$ on the ScPO loss using these weighted pairs.
- Replace $M_t$ with $M_{t+1}$ and repeat.
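The vote-counting and pair-forming step can be sketched in a few lines. This is a hedged illustration, not the reference implementation: `build_pair` and `PreferencePair` are hypothetical names, ties are broken by taking the first sample with the extreme vote count, and unanimous sample sets are skipped on the assumption that a zero-weight pair carries no training signal.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class PreferencePair(NamedTuple):
    chosen: str
    rejected: str
    weight: float


def build_pair(solutions: Sequence[str], answers: Sequence[str],
               tau: int) -> Optional[PreferencePair]:
    """Form one weighted pair from k sampled solutions, or None if filtered."""
    k = len(solutions)
    if k == 0:
        return None
    counts = Counter(answers)
    votes = [counts[a] for a in answers]          # vote count for every sample
    i_pos = max(range(k), key=votes.__getitem__)  # first most-voted sample
    i_neg = min(range(k), key=votes.__getitem__)  # first least-voted sample
    # Filter low-consistency queries; also skip unanimous ones (assumption:
    # a pair whose weight would be zero is not useful for training).
    if votes[i_pos] < tau or votes[i_pos] == votes[i_neg]:
        return None
    weight = (votes[i_pos] - votes[i_neg]) / k
    return PreferencePair(solutions[i_pos], solutions[i_neg], weight)


# Toy example with k = 8 samples: six agree on "18", two are singletons.
answers = ["18", "18", "18", "18", "18", "20", "18", "7"]
solutions = [f"reasoning {i} ... answer {a}" for i, a in enumerate(answers)]
pair = build_pair(solutions, answers, tau=4)
print(pair.weight)  # (6 - 1) / 8 = 0.625
```

With `tau=7` the same query would be discarded, since the top answer has only six votes.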
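Formally, round $t$ maps the current model $M_t$ to $M_{t+1}$ by minimizing the ScPO loss over the preference pairs built from $M_t$'s own samples. Empirically, two iterations are sufficient for convergence, and further iterations yield diminishing returns.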
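The control flow of the loop can be written as a small driver that takes each stage as a callable. This is only a structural sketch under stated assumptions: `run_scpo`, `generate_queries`, `sample_and_pair`, and `train` are hypothetical names, the stages are stubs rather than real model calls, and regenerating the augmented queries from the seed set in every round is a design choice made here for illustration.

```python
from typing import Any, Callable, List, Optional, Tuple


def run_scpo(model: Any,
             seed_queries: List[str],
             generate_queries: Callable[[Any, List[str]], List[str]],
             sample_and_pair: Callable[[Any, str], Optional[Any]],
             train: Callable[[Any, List[Tuple[str, Any]]], Any],
             num_rounds: int = 2) -> Any:
    """Run num_rounds rounds of: augment, sample, pair, filter, train."""
    for _ in range(num_rounds):
        queries = seed_queries + generate_queries(model, seed_queries)
        dataset = []
        for x in queries:
            pair = sample_and_pair(model, x)  # None means the query was filtered
            if pair is not None:
                dataset.append((x, pair))
        model = train(model, dataset)         # the new model replaces the old one
    return model


# Stub stages: the "model" is just a round counter.
dataset_sizes = []

def fake_generate(model, queries):
    return [f"generated-by-M{model}"]

def fake_sample_and_pair(model, x):
    return None if x == "q2" else ("y_plus", "y_minus", 0.5)

def fake_train(model, dataset):
    dataset_sizes.append(len(dataset))
    return model + 1

final_model = run_scpo(0, ["q1", "q2"], fake_generate,
                       fake_sample_and_pair, fake_train)
print(final_model, dataset_sizes)  # 2 [2, 2]
```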
4. Empirical Results and Benchmarks
Experiments span math and logic domains, with principal evaluation on GSM8K (math word problems), MATH (complex math questions), and ZebraLogic (logic grid puzzles) (Prasad et al., 2024). Training uses Llama-3 (8B) as the base, with larger models as comparative baselines. The protocol includes both purely unsupervised training (using only model-generated preference pairs) and a semi-supervised variant (using available labeled data).
Key results are summarized in the tables below:
GSM8K Zero-Shot Exact-Match Accuracy (%)
| Method | Train data (K) | Greedy | SC 8-way |
|---|---|---|---|
| Seed $M_0$ | – | 41.17 | 51.80 |
| IRPO$_{\text{RM}}$ | seed 4.4 + gen – | 50.11 | 61.25 |
| ScPO$_{\text{Unsup.}}$ | seed 1.4 + gen 5.1 | 63.91 | 71.11 |
| IRPO$_{\text{Gold}}$ | seed 5.7 + gen – | 64.29 | 72.56 |
| ScPO$_{\text{Semi-Sup.}}$ | seed 5.7 + gen 4.5 | 66.64 | 74.75 |
MATH Zero-Shot Exact-Match Accuracy (%)
| Method | Train data (K) | Greedy | SC 8-way |
|---|---|---|---|
| Seed $M_0$ | – | 14.46 | 18.20 |
| IRPO$_{\text{RM}}$ | seed 6.5 + gen – | 18.08 | 22.64 |
| ScPO$_{\text{Unsup.}}$ | seed 1.2 + gen 2.5 | 19.72 | 24.58 |
| IRPO$_{\text{Gold}}$ | seed 3.0 + gen – | 20.32 | 26.88 |
| ScPO$_{\text{Semi-Sup.}}$ | seed 3.0 + gen 2.2 | 20.48 | 26.92 |
ZebraLogic Puzzle and Cell Accuracy (%)
| Model | Train data (K) | Puzzle acc. ↑ | Cell acc. ↑ |
|---|---|---|---|
| Llama-3 70B | – | 17.2 | 42.9 |
| Gemma-2 27B | – | 16.3 | 41.2 |
| Claude-3 Haiku | – | 14.3 | 37.9 |
| Llama-3 8B (seed $M_0$) | – | 11.6 | 39.1 |
| IRPO$_{\text{RM}}$ | seed 1.0 | 11.3 | 42.1 |
| ScPO$_{\text{Unsup.}}$ | seed 0.4 + gen 2.2 | 18.1 | 45.2 |
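Statistical significance was not explicitly reported. In the tables above, the greedy-accuracy margin of each ScPO row over the IRPO row immediately preceding it ranges from 0.16 pp (MATH, 20.48 vs. 20.32) to 13.80 pp (GSM8K, 63.91 vs. 50.11); on ZebraLogic, ScPO exceeds IRPO by 6.8 pp in puzzle accuracy (18.1 vs. 11.3). The largest margins are unlikely to be explained by random variation, whereas the smallest may not be distinguishable from run-to-run noise.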
5. Analysis, Advantages, and Limitations
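Quantitative ablations reveal that weighting the loss by the degree of consistency yields a 1–2 pp accuracy improvement over unweighted variants. The consistency threshold $\tau$ governs the precision-recall trade-off in preference generation: a higher threshold keeps fewer but more reliable pairs, while a lower one keeps more pairs at the cost of noisier labels. The best-performing setting is reported in Prasad et al. (2024).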
Theoretical insights:
- No formal convergence guarantees, but empirical evidence shows saturation after two rounds.
- Consistency (vote share) is strongly correlated with ground truth accuracy (Somers’ D ≈ 0.8 for GSM8K, 0.68 for MATH, 0.92 for ZebraLogic), which justifies self-consistency as a proxy for correctness.
- ScPO acts as a distillation of the empirical “consistency distribution” into the model’s base prediction distribution, boosting accuracy and pseudo-likelihood of correct outputs.
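Somers' D, the rank-correlation statistic quoted above, can be computed directly from pairs of observations. The sketch below is a plain-Python illustration on made-up data, not the evaluation code behind the reported figures. It assumes vote share is treated as the independent variable and correctness as the dependent one; `somers_d` is a hypothetical helper name.

```python
from itertools import combinations
from typing import Sequence


def somers_d(x: Sequence[float], y: Sequence[float]) -> float:
    """Somers' D of y with respect to x.

    D = (concordant - discordant) / (number of pairs not tied on x).
    """
    concordant = discordant = untied_on_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x do not enter the denominator
        untied_on_x += 1
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    if untied_on_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_on_x


# Made-up example: vote share of the top answer and whether it was correct.
vote_share = [0.25, 0.5, 0.75, 1.0]
correct = [0, 0, 1, 1]
print(round(somers_d(vote_share, correct), 3))  # 4 concordant of 6 pairs -> 0.667
```

A value near 1 means that a higher vote share almost always accompanies a correct answer, which is the property that makes vote counts usable as a stand-in for labels.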
Key limitations:
- Requires the seed model to initially exhibit non-trivial self-consistency; on extremely difficult or under-specified tasks, bootstrapping may cover only a minority of samples.
- Currently designed for single-answer reasoning; adaptation to open-ended or generative tasks remains nontrivial.
- No formal or theoretical convergence proof; empirically, performance gains saturate after two or three rounds.
6. Practical Recommendations for Implementation
ScPO can be used with any LLM capable of chain-of-thought generation; instruction tuning is helpful but not mandatory. For effective deployment:
- Use $k = 8$ samples per prompt (16 for broad output spaces) to estimate consistency, sampling chosen solutions at a fixed temperature with top-$p$ truncation and using a separate temperature setting to diversify rejected ones.
- Set the threshold $\tau$ to a lower value initially, raising it in later rounds as model consistency improves.
- Optimize with fixed values of the loss hyperparameters $\alpha$ and $\beta$; hyperparameter tuning is recommended if validation data are available.
- Two full ScPO iterations are typically sufficient; a third pass can be applied on held-out queries if data privacy allows.
- Computational requirements are similar to ordinary preference-based finetuning; ScPO does not increase inference-time complexity and is compatible with concurrent inference-time self-consistency.
- If any gold solutions are available, include them as labeled preference pairs with weight $w(x) = 1$ for additional gain.
ScPO enables robust, annotation-free finetuning for multi-step reasoning and logic tasks, successfully translating the inference-time self-consistency signal into a direct and effective training signal (Prasad et al., 2024).