Self-Consistency Preference Optimization (2411.04109v2)

Published 6 Nov 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.

Summary

  • The paper introduces Self-Consistency Preference Optimization, a method that leverages internal consistency to rank responses without relying on human-annotated data.
  • ScPO employs an iterative training process with a weighted loss function that guides models to improve performance on benchmarks like GSM8K, MATH, and ZebraLogic.
  • The approach reduces dependency on costly supervision, enabling smaller models to achieve competitive results in complex reasoning tasks.

A Critical Examination of Self-Consistency Preference Optimization

The paper under discussion presents and evaluates a technique termed Self-Consistency Preference Optimization (ScPO), which aims to improve the unsupervised training of LLMs on complex reasoning tasks. Unlike traditional methods that rely on human-annotated data or reward models, ScPO takes self-consistency, an inference-time procedure that selects the most frequent answer among multiple sampled solutions to a query, and applies it during training to improve model alignment without labeled data.
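For reference, the inference-time self-consistency procedure that ScPO builds on can be sketched in a few lines. The sample_solutions callable below is a hypothetical stand-in for sampling reasoning chains from a model and extracting final answers; it is not part of the paper's code.

```python
from collections import Counter

def self_consistency_answer(sample_solutions, prompt, n_samples=16):
    # `sample_solutions` is a hypothetical callable that draws `n_samples`
    # reasoning chains for `prompt` and returns (reasoning, final_answer) pairs.
    samples = sample_solutions(prompt, n=n_samples, temperature=0.8)
    votes = Counter(answer for _, answer in samples)
    answer, count = votes.most_common(1)[0]
    # Return the majority answer and its vote share, a rough confidence proxy.
    return answer, count / n_samples
```

ScPO's contribution is to reuse these vote counts as a training signal rather than only as an inference-time decision rule.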

Overview of Methodology

ScPO proceeds in three main steps (steps 2 and 3 are sketched in code after the list):

  1. Data Generation and Query Selection: The model first generates new problem queries from an unlabeled dataset. These queries are then filtered using self-consistency, retaining only those for which a significant proportion of sampled responses converge on a single answer.
  2. Preference Pair Construction: ScPO builds preference pairs from the sampled responses, ranking them by consistency: the most consistent response to a query is labeled as chosen, while less consistent responses are labeled as rejected. This ranking avoids predefined labels by relying on a metric derived from the self-consistency vote counts.
  3. Iterative Training with a Weighted Loss Function: The model is iteratively refined using a weighted preference optimization loss. This weighted loss augments the learning process by assigning higher weights to instances displaying clear, consistent preferences, thus guiding the model towards more reliable outputs.
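A minimal sketch of steps 2 and 3 follows, assuming vote counts produced by the self-consistency procedure sketched earlier. The DPO-style form of the loss and the weighting by the normalized vote margin follow the paper's general description, but the exact formulation, names, and hyperparameters here are illustrative rather than the authors' implementation.

```python
import torch.nn.functional as F

def build_preference_pair(samples):
    # `samples` is a list of (response_text, votes) pairs, one per sampled
    # solution, where `votes` counts how many of the samples share that
    # solution's final answer.
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    # Normalized vote margin in [0, 1]; larger means a clearer preference.
    margin = (chosen[1] - rejected[1]) / len(samples)
    return chosen[0], rejected[0], margin

def weighted_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             margin, beta=0.1):
    # Log-probabilities are summed over response tokens under the current
    # policy and a frozen reference model. `margin` weights each pair so that
    # examples with a clearer consistency gap contribute more to the update.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -(margin * F.logsigmoid(logits)).mean()
```

In the iterative setting, each training round regenerates problems and samples with the current model, rebuilds preference pairs, and optimizes this loss before repeating.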

Experimental Analysis

The paper provides comprehensive experimental evaluations across several reasoning datasets, including GSM8K, MATH, and ZebraLogic. Notably, ScPO achieved significant performance gains over models trained with established unsupervised methods, such as IRPO driven by a reward model.

  • GSM8K and MATH: Through iterative refinement, ScPO showed substantial improvements in zero-shot accuracy, nearly closing the gap with supervised models trained on gold answers or human-annotated preferences.
  • ZebraLogic: ScPO notably outperformed larger models like Llama-3 70B and Gemma-2 27B on logical reasoning tasks. This illustrates its capability to enable smaller models to compete effectively against more computationally intensive counterparts.

The implications of ScPO are manifold: it potentially reduces dependence on costly human annotation, enhances model consistency, and scales efficiently across varied reasoning challenges.

Implications and Future Directions

This paper represents a meaningful advance in the self-training of LLMs, particularly by leveraging consistency as a training signal. Such approaches may democratize access to robust AI models by making it feasible to train on less-curated data rather than requiring gold-standard annotations.

The theoretical underpinnings of ScPO raise intriguing questions about model introspection and the notion of 'confidence' derived from self-consistency votes. While the approach is promising, further exploration is needed to assess how well ScPO scales to broader AI tasks, particularly those outside reasoning, such as natural language understanding or generation. Future research might explore integrating ScPO with hybrid learning schemes that combine consistency and reward signals, potentially shaping learning trajectories more effectively.

In essence, ScPO, as presented in this work, marks a strategic stride toward enhancing the self-sufficiency of LLMs in solving complex problems autonomously. It also underscores the value of harnessing internal model signals for robust learning, paving the way for future innovations in unsupervised and semi-supervised training paradigms.
