An Examination of Chain of Preference Optimization for Chain-of-Thought Reasoning in LLMs
The paper "Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs" presents a refined approach, Chain of Preference Optimization (CPO), for improving the reasoning performance of LLMs. Traditional chain-of-thought (CoT) methods have demonstrated the ability to enhance problem-solving by constructing linear reasoning paths. However, CoT's single-path structure can lead to suboptimal reasoning outcomes. The tree-of-thought (ToT) approach expands on CoT by exploring multiple reasoning paths using a branching structure, albeit at a significant computational cost.
The CPO method addresses the inference latency imposed by ToT while retaining the benefits of its thorough exploration. CPO reuses the intermediate reasoning steps generated during ToT's search to gather step-level preference data, forming a dataset of preferred and dispreferred thoughts for model training: thoughts on the paths ToT ultimately selects are treated as preferred, and the sibling thoughts it prunes as dispreferred. This is accomplished without external annotations or an additional reward model, relying only on the preferences already implicit in ToT's tree search.
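For intuition, the following is a minimal sketch (not the paper's code) of how such per-step preference pairs could be harvested from a tree search. The search here is greedy and single-path for brevity, whereas ToT typically keeps several branches alive per step, and the functions generate_thoughts and score_thought are hypothetical placeholders for the LLM's proposal and evaluation prompts.

```python
# Minimal sketch of per-step preference-pair collection (not the paper's code).
# `generate_thoughts` and `score_thought` are hypothetical placeholders for the
# LLM proposal and evaluation calls; the search is greedy for brevity.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    context: str       # prompt plus the thoughts accepted so far
    preferred: str     # the thought the search keeps at this step
    dispreferred: str  # a sibling thought the search prunes


def generate_thoughts(context: str, k: int = 3) -> list[str]:
    """Placeholder: ask the LLM for k candidate next thoughts."""
    return [f"{context} -> thought {i}" for i in range(k)]


def score_thought(thought: str) -> float:
    """Placeholder: rate a candidate thought (a real evaluator would query the model)."""
    return -float(len(thought))  # dummy score


def collect_preference_pairs(prompt: str, depth: int = 3, k: int = 3) -> list[PreferencePair]:
    """Run a (greedy) tree search, recording preferred/dispreferred pairs at each step."""
    pairs: list[PreferencePair] = []
    context = prompt
    for _ in range(depth):
        candidates = generate_thoughts(context, k)
        ranked = sorted(candidates, key=score_thought, reverse=True)
        best, pruned = ranked[0], ranked[1:]
        pairs.extend(PreferencePair(context, best, worse) for worse in pruned)
        context = best  # continue the search along the selected thought
    return pairs


if __name__ == "__main__":
    for pair in collect_preference_pairs("Q: ...", depth=2, k=2):
        print(repr(pair.preferred), ">", repr(pair.dispreferred))
```

The key point the sketch illustrates is that each recorded pair compares two candidate thoughts under the same context, rather than two complete reasoning chains.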
Key Methodological Insights
Central to CPO is its use of Direct Preference Optimization (DPO). By collecting preference data at each step of the reasoning process, CPO trains the LLM to favor the preferred thought at every step, aligning the model's outputs with the more deliberate reasoning paths identified during ToT's exploration. The method therefore performs localized, per-step preference learning rather than optimizing entire reasoning paths in one go, which mitigates the gradient-cancellation issues that can arise when contrasting long sequences.
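Concretely, the training objective can be read as the standard DPO loss applied at the level of individual thoughts. In the sketch below (our notation, not the paper's), x denotes the prompt together with the accepted preceding thoughts, s^w and s^l the preferred and dispreferred candidate thoughts at that step, pi_theta the policy being trained, pi_ref the frozen reference model, sigma the logistic function, and beta a scaling hyperparameter:

```latex
\mathcal{L}(\theta) \;=\;
  -\,\mathbb{E}_{(x,\,s^{w},\,s^{l})}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_{\theta}(s^{w}\mid x)}{\pi_{\mathrm{ref}}(s^{w}\mid x)}
      \;-\;
      \beta \log \frac{\pi_{\theta}(s^{l}\mid x)}{\pi_{\mathrm{ref}}(s^{l}\mid x)}
    \right)
  \right]
```

Because each pair contrasts two single thoughts under a shared context rather than two entire chains, the contrastive signal stays localized to the step where the candidates actually differ.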
Experimental Framework
The experimental validation of CPO spans a broad range of reasoning tasks: Question Answering (QA), Fact Verification, and Arithmetic Reasoning. Using models such as LLaMA and Mistral, the authors report an average accuracy improvement of up to 4.3% over base models. Notably, CPO matches or surpasses the performance of ToT while requiring significantly less inference time, demonstrating both its efficiency and its effectiveness.
Implications and Future Directions
The introduction of CPO carries significant implications for both theory and practice. Theoretically, CPO demonstrates an effective preference-alignment mechanism over intermediate reasoning steps, advancing the understanding of preference-driven model training. Practically, the method conserves computational resources at inference time, addressing a latency concern central to real-world applications of LLMs.
Future research might extend CPO's principles to alternative reasoning structures, such as graph-of-thought models, to further improve reasoning-path selection across varied problem domains. Additionally, exploring CPO in other modalities, such as vision-language models, could clarify how broadly the approach transfers.
In conclusion, the Chain of Preference Optimization method provides a compelling approach to advancing LLM reasoning capabilities. By balancing computational efficiency with reasoning depth, CPO sets the stage for more capable and efficient applications of LLMs to complex reasoning tasks.