Training Chain-of-Thought via Latent-Variable Inference (2312.02179v1)

Published 28 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.

LLMs solve reasoning tasks more accurately when prompted to work through a problem step by step, a technique known as "chain-of-thought" (CoT) prompting. Performance on a specific task can also be improved by supervised fine-tuning on a labeled training set of correct answers. Combining the two naively requires supervision of the rationales as well as the answers, and writing detailed rationales by hand is costly and labor-intensive.

This paper introduces a fine-tuning strategy named TRICE (Tuning Rationales with Independence-Chain Expectation-maximization), which instead maximizes the marginal log-likelihood of generating a correct answer under CoT prompting, without requiring hand-crafted rationales. TRICE treats the rationale as a latent variable: given a question, the LLM defines a joint probability distribution over rationales and answers, and the training objective approximately averages over all possible rationales. By maximizing this marginal likelihood, TRICE bootstraps rationales that lead to correct answers during learning.
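
The latent-variable objective described above can be written out explicitly. The notation here is an assumption made for illustration and is not taken from this summary: x is the question, y the correct answer, z a rationale, and θ the tunable parameters. The second line is the standard identity for the gradient of a marginal log-likelihood, and it shows why sampling rationales from the posterior conditioned on the correct answer is the central difficulty.

```latex
\begin{align}
p_\theta(y \mid x) &= \sum_{z} p_\theta(z, y \mid x), \\
\nabla_\theta \log p_\theta(y \mid x)
  &= \sum_{z} \frac{p_\theta(z, y \mid x)}{p_\theta(y \mid x)}\,
     \nabla_\theta \log p_\theta(z, y \mid x)
   = \mathbb{E}_{p_\theta(z \mid x, y)}\!\left[\nabla_\theta \log p_\theta(z, y \mid x)\right].
\end{align}
```

The sum over z ranges over every possible rationale, so it cannot be evaluated exactly; a Monte Carlo estimate of the expectation needs samples from the posterior over rationales given the correct answer.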

The marginal sums over an infeasibly large set of possible rationales, so the core challenge is sampling from the posterior over rationales conditioned on the correct answer. TRICE addresses it with a Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm, together with a novel control-variate technique that drives the variance of the gradient estimates toward zero as the model improves. This approach allows TRICE to learn from both correct and incorrect rationales, which helps stabilize training and handle difficult examples.
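
The following is a minimal toy sketch of the MCMC-EM idea, not the paper's implementation. It assumes a single question with a small, fixed set of candidate "rationales", each of which deterministically leads to either a correct or an incorrect answer, and it replaces the LLM with a softmax over one logit per rationale. The names `softmax`, `marginal_correct_prob`, and `mcmc_em` are hypothetical. Under these assumptions, an independence sampler that proposes from the current model accepts a proposal exactly when it reaches the correct answer, and the last accepted rationale is kept in a persistent memory. The control-variate technique is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_correct_prob(logits, correct):
    """Marginal probability of the right answer: mass on correct rationales."""
    probs = softmax(logits)
    return sum(p for p, ok in zip(probs, correct) if ok)


def mcmc_em(correct, steps=2000, lr=0.1, seed=0):
    """Toy MCMC-EM loop (assumed setup; one question, categorical 'model')."""
    rng = random.Random(seed)
    k = len(correct)
    logits = [0.0] * k
    memory = None  # persistent chain state: last accepted correct rationale
    for _ in range(steps):
        probs = softmax(logits)
        # E-step (one independence-sampler move): propose from the model.
        proposal = rng.choices(range(k), weights=probs)[0]
        # With the model as proposal and a 0/1 likelihood, the
        # Metropolis-Hastings test reduces to "is the answer correct?".
        if correct[proposal]:
            memory = proposal
        if memory is None:
            continue  # no correct rationale found yet, nothing to learn from
        # M-step: one gradient-ascent step on log p(memory).
        for i in range(k):
            grad = (1.0 if i == memory else 0.0) - probs[i]
            logits[i] += lr * grad
    return logits


if __name__ == "__main__":
    correct = [False, False, True, False, True]
    print("before:", marginal_correct_prob([0.0] * 5, correct))
    print("after: ", marginal_correct_prob(mcmc_em(correct), correct))
```

In this toy setting the marginal probability of a correct answer starts at 0.4 under uniform logits and rises as training proceeds. A real implementation would keep one memory entry per training question and update LLM parameters rather than a table of logits.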

In the paper's empirical evaluation on GSM8K and the tasks in BIG-Bench Hard, TRICE typically improves accuracy on held-out examples more than the self-taught reasoner (STaR), which fine-tunes on generated rationales that lead to correct answers, and more than prompt-tuning with or without CoT. The paper also derives the algorithm in detail and analyzes how it operates, including the variance-reduction technique used to make training more efficient.

In summary, TRICE improves the reasoning capabilities of LLMs by learning to generate rationales that lead to correct answers, without the need for hand-crafted rationales. This could benefit applications that require LLMs to perform complex reasoning.