Iterative Reasoning Preference Optimization: An Overview
The paper "Iterative Reasoning Preference Optimization" by Pang et al. presents an approach to enhance the reasoning capabilities of LLMs by iteratively optimizing preferences between generated Chain-of-Thought (CoT) candidates. This method is aimed at overcoming the limitations of current iterative preference optimization techniques on reasoning tasks. The primary contribution lies in introducing a negative log-likelihood (NLL) term in conjunction with Direct Preference Optimization (DPO) loss, which has shown to be crucial for performance improvements.
Key Concepts and Methodology
The paper builds on the observation that, while preference optimization has proven beneficial for general instruction-tuning tasks, its gains on reasoning tasks are typically modest. The authors propose an iterative approach that optimizes preferences between generated CoT solutions that reach the correct final answer and those that do not. The process can be summarized in the following core steps:
- Initialization: Begin with a base LLM that is typically pre-trained or instruction-tuned.
- Sampling and Preference Pair Construction: For each training input, generate multiple CoT reasoning traces and final answers with the current model. Construct preference pairs in which the winning response has a correct final answer and the losing one does not (see the sketch after this list).
- Training with DPO+NLL: Train a new model iteration using a modified DPO loss that adds an NLL term over the winning responses. This combination proves essential to the iterative gains (the combined objective is sketched later, after the results).
- Iteration: Using the newly trained model, repeat the process of generating new data and retraining, allowing performance to improve progressively.
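To make the sampling and pair-construction step concrete, here is a minimal Python sketch of one data-generation iteration. The `generate` and `final_answer` callables, the dataclass, and all parameter names are illustrative assumptions rather than the authors' code; the only fixed idea is pairing correct-answer CoTs against incorrect-answer CoTs for the same prompt.

```python
# Minimal sketch of per-iteration data generation in an Iterative-RPO-style loop.
# `generate(model, prompt, k)` -> list of k sampled CoT strings, and
# `final_answer(cot)` -> extracted answer string, are hypothetical helpers.
import random
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # CoT whose final answer matches the gold label (winner)
    rejected: str  # CoT whose final answer is wrong (loser)

def build_preference_pairs(model, dataset, generate, final_answer,
                           k=8, pairs_per_prompt=4, seed=0):
    """Sample k CoT candidates per prompt and pair correct vs. incorrect ones."""
    rng = random.Random(seed)
    pairs = []
    for prompt, gold in dataset:                 # dataset yields (question, gold answer)
        candidates = generate(model, prompt, k)  # k sampled CoT + answer strings
        correct = [c for c in candidates if final_answer(c) == gold]
        wrong = [c for c in candidates if final_answer(c) != gold]
        if not correct or not wrong:             # need both to form a preference pair
            continue
        for _ in range(pairs_per_prompt):        # random winner/loser pairings
            pairs.append(PreferencePair(prompt, rng.choice(correct), rng.choice(wrong)))
    return pairs
```

In each iteration, pairs built this way would feed the DPO+NLL training step, and the resulting model would then generate the data for the next round.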
Empirical Results
The efficacy of the proposed method, termed Iterative Reasoning Preference Optimization (Iterative RPO), is demonstrated with Llama-2-70B-Chat on three reasoning tasks: GSM8K, MATH, and ARC-Challenge. The improvements across iterations are substantial:
- GSM8K: Accuracy improved from 55.6% (zero-shot CoT) to 81.6% after four iterations of Iterative RPO. Majority voting over 32 samples (sketched after this list) further increased accuracy to 88.7%.
- MATH: From an initial accuracy of 12.5% (4-shot) to 20.8% after three iterations.
- ARC-Challenge: Enhanced from 77.8% to 86.7% over three iterations, with majority voting yielding 87.9%.
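For context, majority voting (self-consistency) simply takes the most frequent final answer across the sampled CoT candidates. The snippet below is a generic sketch of that procedure, not the authors' evaluation code, and reuses the hypothetical `final_answer` helper from the earlier sketch.

```python
# Generic majority-voting (self-consistency) over sampled CoT candidates.
from collections import Counter

def majority_vote(candidates, final_answer):
    """Return the most frequent extracted final answer among the candidates."""
    answers = [final_answer(c) for c in candidates]
    return Counter(answers).most_common(1)[0][0]
```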
The results are particularly compelling because Iterative RPO consistently outperforms several baselines, including zero-shot CoT, standard DPO, and Self-Taught Reasoner (STaR). The NLL term in the objective keeps the log probability of the chosen (correct) sequences from collapsing while the probability of rejected sequences decreases, a dynamic the authors illustrate with training log-probability curves.
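To illustrate the combined objective, the sketch below adds a length-normalized NLL term on the chosen sequence to a standard DPO loss, operating on pre-computed per-sequence log probabilities. The function and argument names, and the exact normalization, are assumptions for illustration rather than the paper's implementation.

```python
# DPO loss augmented with an NLL term on the chosen (correct-answer) sequence.
# All inputs are 1-D tensors over the batch; log-probs are summed over response tokens.
import torch.nn.functional as F

def dpo_plus_nll_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      chosen_lengths, beta=0.1, alpha=1.0):
    # Standard DPO term: push the chosen log-ratio above the rejected log-ratio.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))
    # Additional NLL term on the chosen sequence, normalized by its token length.
    nll_term = -policy_chosen_logps / chosen_lengths
    return (dpo_term + alpha * nll_term).mean()
```

The alpha coefficient controls how strongly the model is also trained to imitate the winning solutions; with alpha set to zero this reduces to plain DPO.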
Implications and Future Directions
The implications of this research are twofold:
- Practical Impact: The proposed Iterative RPO method offers a straightforward recipe for enhancing reasoning in LLMs without human feedback or additional data beyond the gold answers already present in the training set.
- Theoretical Insights: Adding the NLL term to the DPO objective is a simple modification to preference optimization, and the paper's ablations indicate it is necessary for strong gains on reasoning tasks.
Future developments may delve into expanding this method to more diverse datasets and exploring its applicability to other complex domains. Additionally, further research could optimize the iterative process, potentially integrating more sophisticated pairing mechanisms or additional fine-tuning stages to push the boundaries of reasoning capabilities in LLMs.
In summary, Iterative Reasoning Preference Optimization offers a substantial advance in LLM performance on reasoning tasks, an important step toward more robust and accurate AI systems. The approach not only promises practical gains but also opens new avenues for refining iterative learning methodologies in the field of artificial intelligence.