- The paper introduces SPaR, a novel framework combining self-play with tree-search refinement to generate focused preference pairs for training instruction-following LLMs.
- Models trained with SPaR exhibit improved performance, with LLaMA3-8B notably surpassing GPT-4-Turbo on the IFEval benchmark.
- SPaR demonstrates scalability and transferability, enhancing instruction-following in both smaller and larger LLMs and paving the way for more robust AI systems capable of handling complex instructions.
Overview of SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in LLMs
The paper "SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in LLMs" introduces a framework for enhancing the instruction-following capabilities of large language models (LLMs). The proposed technique, SPaR, combines a self-play strategy with tree-search refinement to produce valid, comparable preference pairs that isolate the differences that actually matter for following an instruction. This addresses a limitation of existing methods, whose sampled preference pairs are often dominated by content variations irrelevant to the core task.
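The self-play loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `actor`, `refiner`, `judge`, and `tree_search` interfaces are hypothetical stand-ins for LLM calls.

```python
# Minimal sketch of one SPaR-style self-play round. All interfaces here
# (generate, is_correct, tree_search) are hypothetical stand-ins for LLM
# calls, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    instruction: str
    rejected: str  # the actor's flawed original response
    chosen: str    # the refined response that satisfies the instruction

def spar_iteration(instructions, actor, refiner, judge, tree_search):
    """Collect focused preference pairs from one round of self-play."""
    pairs = []
    for instruction in instructions:
        response = actor.generate(instruction)
        # Responses that already satisfy the instruction yield no pair.
        if judge.is_correct(instruction, response):
            continue
        # Tree search explores candidate refinements of the flawed response.
        refined = tree_search(instruction, response, refiner, judge)
        if refined is not None:
            # Chosen and rejected differ mainly in instruction adherence,
            # so the pair highlights the key difference rather than
            # irrelevant surface variation.
            pairs.append(PreferencePair(instruction,
                                        rejected=response,
                                        chosen=refined))
    return pairs
```

Because the chosen response is a refinement of the rejected one rather than an independent sample, the pair stays focused on the instruction-relevant difference, which is the core idea the paper emphasizes.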
Key Contributions
- SPaR Framework: The SPaR framework integrates a self-play mechanism where an LLM employs a tree-search strategy to iteratively refine its responses to instructions. This systematic refinement process reduces unnecessary variations in model outputs, enabling the creation of focused preference pairs that aid in teaching LLMs the nuances of accurate instruction-following.
- Improved Performance: The authors present experimental results demonstrating that models trained with SPaR outperform existing models. Notably, a LLaMA3-8B model trained with SPaR surpasses GPT-4-Turbo on the IFEval benchmark. This underscores the framework's effectiveness in enhancing instruction adherence without compromising general model capabilities.
- Scalability and Transferability: SPaR has shown promising scalability and transferability. It significantly improves instruction-following performance not only in smaller models such as LLaMA3-8B but also in larger models like LLaMA3-70B, indicating its applicability across different scales of LLMs.
- Tree-Search Refinement: The framework employs a structured tree-search that guides LLMs in exploring various paths of response refinement. This mechanism ensures a high rate of successful refinements by allowing the models to critically evaluate and self-correct their outputs iteratively.
- Open-Source Resources: The paper announces that the code and dataset utilized in SPaR development are publicly available, promoting transparency and enabling further research replication and exploration by the community.
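The tree-search refinement mechanism can be illustrated with a simple breadth-limited search. This is a sketch under assumed `refiner.refine` and `judge` interfaces; the paper's actual search strategy, scoring, and prompts may differ.

```python
# Illustrative breadth-limited tree search over refinements. The refiner and
# judge interfaces are assumed for the sketch; the paper's exact search and
# scoring may differ.
from collections import deque

def tree_search_refine(instruction, response, refiner, judge,
                       width=3, max_depth=2):
    """Explore candidate refinements level by level; return the first one
    the judge accepts, or None if the search budget is exhausted."""
    frontier = deque([(response, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        # The judge critiques the current response to guide refinement.
        critique = judge.critique(instruction, current)
        for _ in range(width):
            candidate = refiner.refine(instruction, current, critique)
            if judge.is_correct(instruction, candidate):
                return candidate
            # Imperfect candidates become nodes for further refinement.
            frontier.append((candidate, depth + 1))
    return None
```

Bounding the search by `width` and `max_depth` caps the number of model calls per instruction while still letting the model self-correct over several refinement steps.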
Implications
The SPaR framework could have significant implications for both theoretical and applied AI. Theoretically, it provides insights into enhancing LLM instruction-following without relying heavily on large manually curated datasets. From an application standpoint, improved instruction-following can pave the way for more robust AI systems capable of handling complex user instructions with multiple constraints, which is critical in fields such as autonomous systems and interactive user-facing applications.
Future Directions
Future developments may involve integrating this framework with other alignment techniques to further refine LLM outputs in various scenarios. Moreover, extensive investigations into combining SPaR with external feedback methods, such as human-in-the-loop systems, could potentially result in more aligned and reliable AI systems. Additionally, the exploration of SPaR’s utility across different LLMs and modalities could uncover new avenues for its application, making it a versatile tool in the AI research landscape.
In conclusion, SPaR highlights the potential of self-play methods to strengthen LLM instruction adherence, offering a valuable contribution to the field of AI alignment and instruction-following.