Toward Language Agent Fine-Tuning
The paper "Toward Language Agent Fine-tuning" addresses an important and often overlooked area of research: fine-tuning language models (LMs) to function as language agents. Language agents, which use LMs to interact with external environments through sequential reasoning and actions, offer promising applications but are currently limited by their typical reliance on few-shot prompting techniques.
Key Contributions
The primary contributions of this paper can be summarized as follows:
- Fine-tuning for Language Agents: The authors advocate for fine-tuning LMs to create more robust, efficient, and effective language agents. Empirical results demonstrate significant improvements, particularly when using smaller, more economical LMs.
- Experimental Validation: Various base LMs such as Llama-2, CodeLlama, and GPT-3.5 were fine-tuned and evaluated across several question-answering (QA) tasks, including HotpotQA, Bamboogle, StrategyQA, and MMLU.
- Contextual Flexibility: The proposed fine-tuning approach incorporates trajectories from multiple prompting methods (e.g., Chain-of-Thought, Reflexion), which enhances performance and adaptability.
- Scalability and Efficiency: Fine-tuned LMs achieve substantially lower inference times and costs than few-shot prompted LMs.
- Robustness and Generalization: Fine-tuned LMs exhibit improved robustness against noisy tool outputs and generalize better to new tasks compared to their prompted counterparts.
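The core idea above is supervised fine-tuning on agent trajectories: each training example serializes a question, the interleaved reasoning and tool-use steps, and the final answer. A minimal sketch of how such a trajectory might be flattened into a prompt/completion pair (the field names and step format are illustrative, not taken from the paper's released code):

```python
# Illustrative sketch: serialize an agent trajectory (thought/action/observation
# steps) into a single supervised fine-tuning example. Formats are assumptions.

def trajectory_to_example(question, steps, answer):
    """Flatten a list of {'thought', 'action', 'observation'} dicts into
    a prompt/completion pair suitable for fine-tuning."""
    lines = [f"Thought {i}: {s['thought']}\n"
             f"Action {i}: {s['action']}\n"
             f"Observation {i}: {s['observation']}"
             for i, s in enumerate(steps, start=1)]
    lines.append(f"Final Answer: {answer}")
    return {
        "prompt": f"Question: {question}",
        "completion": "\n".join(lines),
    }

example = trajectory_to_example(
    "Who wrote The Hobbit?",
    [{"thought": "I should search for the book's author.",
      "action": "search[The Hobbit]",
      "observation": "The Hobbit is a novel by J. R. R. Tolkien."}],
    "J. R. R. Tolkien",
)
```

Because the reasoning traces are part of the completion, the fine-tuned model learns to produce thoughts and actions itself rather than imitating in-context exemplars.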
Performance Improvements
The empirical results obtained are compelling:
- Fine-tuning the Llama-2-7B LM with 500 agent trajectories generated by GPT-4 resulted in a 77% increase in HotpotQA performance.
- GPT-3.5, once fine-tuned, saw performance improvements of up to 25% over its few-shot prompted baseline.
- Robustness was assessed by evaluating models under noisy environment conditions; fine-tuned models proved more resilient than their prompted counterparts.
- Fine-tuning reduced inference times significantly. For example, fine-tuned GPT-3.5 showed a 70% reduction in inference time compared to its prompted version.
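Much of the efficiency gain comes from dropping the long few-shot exemplars that a prompted agent must carry on every request; a back-of-the-envelope illustration (all token counts below are made up for the sake of the arithmetic, not figures from the paper):

```python
# Illustrative prompt-token arithmetic: a few-shot agent prompt repeats several
# full exemplar trajectories on every call, while a fine-tuned model needs only
# the question itself. Numbers are invented for illustration.
FEWSHOT_EXEMPLAR_TOKENS = 6 * 400   # e.g., six exemplar trajectories, ~400 tokens each
QUESTION_TOKENS = 60

fewshot_prompt = FEWSHOT_EXEMPLAR_TOKENS + QUESTION_TOKENS
finetuned_prompt = QUESTION_TOKENS

savings = 1 - finetuned_prompt / fewshot_prompt
print(f"prompt tokens: {fewshot_prompt} -> {finetuned_prompt} ({savings:.0%} fewer)")
```

Shorter prompts mean less prefill work per call, which is one plausible mechanism behind the reported latency and cost reductions.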
Methodology
The methodology of this paper stands out in its systematic approach:
- Data Generation: The fine-tuning data is generated from multiple sources and tasks, utilizing advanced LMs like GPT-4 to create agentic trajectories, which include both reasoning traces and actions.
- Diverse Methods Integration: The authors employed several prompting methods, such as Chain-of-Thought (CoT) and Reflexion, to generate diverse agentic behavior.
- Systematic Evaluation: The experiments were meticulously designed to evaluate the effects of different LMs, tasks, and methods on a variety of QA datasets.
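The data-generation step described above is essentially distillation with filtering: roll out a strong teacher model on training questions, then keep only the trajectories that reach the correct answer. A hedged sketch of that filtering step (the record format is an assumption; `trajectory` would hold the serialized reasoning-and-action text produced by the teacher):

```python
# Illustrative sketch of answer-based filtering for teacher-generated
# trajectories. The record schema here is hypothetical.

def normalize(ans: str) -> str:
    """Crude answer normalization for exact-match comparison."""
    return ans.strip().lower()

def filter_trajectories(records):
    """Keep only trajectories whose final prediction matches the gold answer.

    records: iterable of dicts with 'trajectory', 'prediction', 'gold' keys.
    Returns a list of {'text': ...} training examples.
    """
    return [{"text": r["trajectory"]}
            for r in records
            if normalize(r["prediction"]) == normalize(r["gold"])]

records = [
    {"trajectory": "Question: ...\nFinal Answer: Paris",
     "prediction": "Paris", "gold": "paris"},     # correct -> kept
    {"trajectory": "Question: ...\nFinal Answer: Rome",
     "prediction": "Rome", "gold": "Paris"},      # wrong -> dropped
]
kept = filter_trajectories(records)
```

Filtering on final-answer correctness is a cheap proxy for trajectory quality; it does not guarantee that every intermediate reasoning step is sound.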
Implications and Future Directions
The implications of this research are broad and significant:
- Enhanced LM Utility: By transitioning from few-shot prompting to fine-tuning, the effectiveness and applicability of LMs in real-world scenarios can be greatly enhanced.
- Cost and Efficiency: Fine-tuning smaller LMs offers a cost-effective alternative to deploying larger, off-the-shelf models, making such technology accessible and scalable.
- Robustness and Resilience: Fine-tuned models are shown to be more robust to noisy inputs, which is crucial for deploying language agents in uncontrolled environments.
- Generalization Potential: The ability of fine-tuned models to generalize better to new tasks is promising for the development of multi-task agents.
Speculating on future developments, it is likely that incorporating more complex and dynamic models of environment interaction will be an area of growth. Furthermore, advancements in parameter-efficient fine-tuning methods could make the process even more accessible. Another avenue for exploration is to extend this fine-tuning approach to more diverse and complex tasks beyond QA, such as interactive dialogue systems or multi-agent environments.
Conclusion
This paper provides a rigorous and comprehensive study of the benefits of fine-tuning LMs for language agent applications. It lays a strong foundation for future work in this area and provides practical insights and methodologies that can be adapted for various AI applications. The results show clear, quantifiable improvements in performance, efficiency, robustness, and generalization, making a strong case for adopting fine-tuning in the development of language agents.