- The paper introduces Branch-Solve-Merge (BSM), a framework that decomposes complex tasks into parallel sub-tasks and synthesizes the results to improve LLM evaluation and generation.
- On evaluation, BSM improves human-LLM agreement by up to 26% and reduces position and length biases by up to 50% compared to standard prompting methods.
- On constrained text generation, it raises constraint satisfaction by 12%, enabling models like LLaMA-2-chat to achieve performance comparable to GPT-4.
Evaluating and Enhancing LLMs with Branch-Solve-Merge
As LLMs are deployed on complex, multi-faceted language generation tasks, evaluating their capabilities effectively has become a central challenge. The paper "Branch-Solve-Merge Improves LLM Evaluation and Generation" presents the Branch-Solve-Merge (BSM) framework, which addresses limitations of current LLMs, such as inconsistent problem-solving and weak task decomposition, by introducing a structured evaluation and generation program that enhances these models' performance.
Problem and Proposed Methodology
LLMs like GPT-4 and LLaMA-2-chat, although potent, often fall short on tasks that demand coherent planning and the satisfaction of elaborate constraints. This limitation stems from their generic design, which does not effectively handle task-specific subtleties without human intervention. BSM addresses this by integrating a three-module system: Branch, Solve, and Merge (a minimal code sketch follows the list below).
- Branch Module: This component designs a plan by decomposing complex user tasks into smaller, parallel sub-tasks. Such decomposition enables the independent handling of different facets of a single task.
- Solve Module: Each sub-task, as identified by the Branch module, is independently tackled within the Solve module. This facilitates localized problem-solving, ensuring that each aspect of the task receives focused attention.
- Merge Module: Once the sub-tasks are resolved, the Merge module synthesizes the independently generated solutions into a coherent, comprehensive answer that reflects the multifaceted nature of the original task.
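To make the control flow concrete, here is a minimal sketch of the BSM pipeline. It assumes a generic `complete(prompt)` function wrapping any LLM call; the prompt templates are illustrative placeholders, not the paper's exact wording.

```python
# Minimal sketch of the Branch-Solve-Merge control flow.
# `complete` stands in for any LLM completion call (e.g., an API client).
from typing import Callable, List

def branch(complete: Callable[[str], str], task: str) -> List[str]:
    """Branch module: ask the LLM to decompose the task into parallel sub-tasks."""
    plan = complete(
        "Decompose the following task into independent sub-tasks, one per line:\n"
        + task
    )
    return [line.strip("- ").strip() for line in plan.splitlines() if line.strip()]

def solve(complete: Callable[[str], str], task: str, sub_task: str) -> str:
    """Solve module: address one sub-task in isolation."""
    return complete(f"Task: {task}\nFocus only on this sub-task: {sub_task}\nAnswer:")

def merge(complete: Callable[[str], str], task: str, solutions: List[str]) -> str:
    """Merge module: fuse the sub-task solutions into a single final answer."""
    joined = "\n\n".join(solutions)
    return complete(
        f"Task: {task}\nCombine these partial solutions into one coherent answer:\n{joined}"
    )

def branch_solve_merge(complete: Callable[[str], str], task: str) -> str:
    sub_tasks = branch(complete, task)
    solutions = [solve(complete, task, s) for s in sub_tasks]
    return merge(complete, task, solutions)
```

Because each Solve call sees only its own sub-task, the calls are independent and can run in parallel, which is the source of the localized attention described above.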
Key Experiments and Findings
The BSM framework demonstrates substantial improvements across two key tasks: LLM response evaluation and constrained text generation.
- LLM Evaluation: BSM significantly improves the accuracy of LLM-based assessments compared to zero-shot prompting and self-consistency baselines. It raises human-LLM agreement by up to 26% while reducing position and length biases by up to 50%. These gains enable LLaMA-2-chat to perform comparably to, or even surpass, GPT-4 across various domains. BSM fosters a more systematic evaluation process by generating nuanced evaluation criteria tailored to each question (see the sketch after this list).
- Constrained Text Generation: When applied to generating coherent stories under specified constraints, BSM improves story coherence and raises constraint satisfaction by 12% over baseline models. It does so by letting the model draft specific sections of the narrative independently before fusing them into a coherent whole (a generation sketch also follows below).
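As an illustration of the evaluation setup, the following sketch adapts the pipeline above to pairwise response judging: Branch generates question-specific criteria, Solve issues one verdict per criterion, and Merge aggregates them. The prompts and the `complete` helper are assumptions carried over from the earlier sketch, not the paper's exact templates.

```python
def evaluate_pair(complete, question: str, answer_a: str, answer_b: str) -> str:
    # Branch: generate evaluation criteria tailored to this question.
    raw = complete(
        "List the criteria most relevant to judging answers to this question, "
        f"one per line:\n{question}"
    )
    criteria = [c.strip("- ").strip() for c in raw.splitlines() if c.strip()]
    # Solve: judge the answer pair on each criterion independently.
    verdicts = [
        complete(
            f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
            f"Considering only '{c}', reply with 'A', 'B', or 'tie'."
        )
        for c in criteria
    ]
    # Merge: aggregate the per-criterion verdicts into a final judgment.
    summary = "\n".join(f"{c}: {v}" for c, v in zip(criteria, verdicts))
    return complete(
        f"Here are per-criterion verdicts comparing two answers:\n{summary}\n"
        "Give the overall verdict: 'A', 'B', or 'tie'."
    )
```

One plausible reading of the reported bias reductions is that each verdict weighs a single criterion, so superficial cues like answer length or position carry less weight in the final merged judgment.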
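For the story generation task, a similarly hedged sketch: Branch partitions the required concepts into groups (the paper has the LLM propose the grouping and per-group topics; a simple split stands in for that here), Solve drafts one story section per group, and Merge fuses the sections.

```python
from typing import List

def generate_story(complete, concepts: List[str]) -> str:
    # Branch: partition the required concepts into two groups.
    # (Simplification: in the paper, the LLM proposes this grouping itself.)
    mid = len(concepts) // 2
    groups = [concepts[:mid], concepts[mid:]]
    # Solve: draft one story section per concept group.
    sections = [
        complete(
            "Write part of a short story that naturally uses all of these "
            "concepts: " + ", ".join(group)
        )
        for group in groups
    ]
    # Merge: fuse the sections into a single coherent story.
    return complete(
        "Combine these story parts into one coherent short story, preserving "
        "every concept they mention:\n\n" + "\n\n".join(sections)
    )
```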
Implications and Future Prospects
The ramifications of the BSM approach are manifold. Theoretically, BSM delineates a pathway for leveraging LLMs in more complex, domain-specific scenarios without extensive human-designed intervention. Practically, it enables the creation of more reliable LLM evaluators by integrating decomposition-based techniques within existing frameworks. The enhancement of weaker, open-source models to performance levels approaching proprietary models like GPT-4 opens avenues for broader deployment in resource-constrained settings.
From a speculative perspective, future research might explore recursive applications of BSM, introducing iterative branching to solve increasingly granular task components. The modular nature of BSM also suggests potential adaptability across a wide spectrum of tasks beyond those tested, inviting explorations into its application in different domains and with varied LLM architectures. Furthermore, integrating few-shot or fine-tuned module demonstrations might offer additional robustness and efficiency, paving the way for an era where model-agnostic programs can universally enhance model performance.
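To make the recursive idea concrete, one speculative extension (not an implementation from the paper) reuses the `branch` and `merge` helpers from the first sketch and re-branches each sub-task until a depth limit is hit:

```python
def recursive_bsm(complete, task: str, depth: int = 2) -> str:
    # Base case: solve the (now fine-grained) task directly.
    if depth == 0:
        return complete(f"Task: {task}\nAnswer:")
    # Recursive case: branch, solve each sub-task recursively, then merge.
    sub_tasks = branch(complete, task)
    solutions = [recursive_bsm(complete, s, depth - 1) for s in sub_tasks]
    return merge(complete, task, solutions)
```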
In summary, BSM presents a compelling strategy for addressing the diverse demands of LLM evaluation and constrained generation, alleviating current deficiencies through structured decomposition and planning. Its promising improvements and adaptability hint at a foundational role in the evolution of LLM applications.