
Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks (2305.14201v1)

Published 23 May 2023 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained LLMs, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.

Overview of "Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks"

This essay provides a summary and an in-depth analysis of the paper titled "Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks." This paper introduces Goat, an enhanced version of the LLaMA model, fine-tuned specifically to excel in arithmetic tasks. The authors demonstrate the superior performance of Goat over GPT-4 in various arithmetic operations, including addition, subtraction, multiplication, and division, particularly on large numbers.

Background

LLMs like GPT-4 have shown unprecedented capabilities across numerous NLP tasks. However, despite their prowess in text generation and comprehension, these models often struggle with elementary arithmetic operations, particularly those involving large numbers. This research aims to bridge this gap by fine-tuning LLaMA, an open-source LLM, to elevate its arithmetic capabilities.

Model and Dataset

The core of the research revolves around Goat, a fine-tuned version of LLaMA-7B. The fine-tuning process leverages a synthetically generated dataset containing approximately one million arithmetic problems. The dataset spans a diverse set of arithmetic tasks, ensuring a balanced representation of different arithmetic operations and complexities. A significant highlight is that Goat-7B can achieve near-perfect accuracy in zero-shot settings for large-number addition and subtraction, an accomplishment attributed to LLaMA's consistent tokenization of numbers.
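
The paper releases a Python script for dataset generation; the snippet below is only a minimal illustrative sketch of how such synthetic instruction/answer pairs could be produced, not the authors' released script (the function names, field names, and sampling scheme are assumptions).

```python
import json
import random

def make_addition_example(max_digits=16):
    # Sample two operands with up to max_digits digits and pair the
    # instruction with its exact answer (hypothetical example format).
    a = random.randint(0, 10 ** random.randint(1, max_digits) - 1)
    b = random.randint(0, 10 ** random.randint(1, max_digits) - 1)
    return {"instruction": f"{a} + {b}", "output": str(a + b)}

def build_dataset(n_examples, path="synthetic_addition.jsonl"):
    # Write one JSON object per line, a common format for instruction tuning.
    with open(path, "w") as f:
        for _ in range(n_examples):
            f.write(json.dumps(make_addition_example()) + "\n")

if __name__ == "__main__":
    build_dataset(1000)  # small run for illustration; the paper uses ~1M examples
```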

Methodology

The authors fine-tune LLaMA with instruction-based supervised learning on the synthetically generated dataset; per the abstract, Goat-7B can be trained with LoRA on a single 24GB VRAM GPU.
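
As a rough illustration of how such LoRA-based supervised fine-tuning fits on a single 24GB GPU, the sketch below configures a LoRA adapter on a causal language model using the Hugging Face transformers and peft libraries; the checkpoint identifier and all hyperparameters are assumptions for illustration, not the paper's reported settings.

```python
# Sketch of a LoRA fine-tuning setup; assumes the Hugging Face `transformers`
# and `peft` packages are installed. Checkpoint name and hyperparameters are
# illustrative, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "huggyllama/llama-7b"  # assumed LLaMA-7B checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the adapter
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```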

Arithmetic Learnability Framework

The paper presents a novel framework for categorizing arithmetic tasks based on their learnability with LLMs:

  • Learnable Tasks: Tasks such as large-number addition and subtraction, as well as n-digit by 1-digit multiplication and division, where the model can achieve high accuracy through direct supervised learning.
  • Unlearnable Tasks: Tasks such as multi-digit multiplication and division, which the model struggles to learn due to their inherent complexity.

To address the unlearnable tasks, the authors propose a decomposition strategy (a code sketch follows this list):

  • Multiplication: Breaking down multi-digit multiplication into a series of learnable sub-tasks using arithmetic principles such as the distributive law.
  • Division: Decomposing division into iterative subtraction tasks, akin to the long division method taught in elementary schools.
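
A minimal sketch of these two decompositions is shown below; the textual step format is an illustration of the idea rather than the exact chain-of-thought format used in the paper's training data.

```python
def decompose_multiplication(a: int, b: int) -> str:
    # Split one factor by place value (distributive law), so the product
    # reduces to n-digit-by-1-digit multiplications followed by additions.
    digits = str(b)
    terms, partials = [], []
    for i, d in enumerate(digits):
        if d == "0":
            continue
        place = 10 ** (len(digits) - 1 - i)
        terms.append(f"{a} * {int(d) * place}")
        partials.append(str(a * int(d) * place))
    return f"{a} * {b} = {' + '.join(terms)} = {' + '.join(partials)} = {a * b}"

def decompose_division(a: int, b: int) -> str:
    # Express the result through the quotient-remainder identity, the form
    # reached after long-division-style iterative subtraction.
    q, r = divmod(a, b)
    return f"{a} = {b} * {q} + {r}, so {a} / {b} = {q} remainder {r}"

print(decompose_multiplication(397, 4429))
# ... = 1588000 + 158800 + 7940 + 3573 = 1758313
print(decompose_division(1758313, 397))
# 1758313 = 397 * 4429 + 0, so 1758313 / 397 = 4429 remainder 0
```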

Experimental Evaluation

The efficacy of Goat is evaluated using the BIG-bench arithmetic sub-task and additional tests on larger-scale arithmetic problems. The numerical results showcased in the paper indicate Goat's strong performance, often surpassing GPT-4.

Findings

  1. Exact String Match & Digit Match: Goat consistently achieves high accuracy rates in exact string matching and digit matching (both metrics are sketched in code after this list). For instance, Goat reports near-perfect accuracy on 16-digit by 16-digit addition, a task that poses a significant challenge for GPT-4.
  2. Comprehensive Task Analysis: Using a series of detailed experiments, the paper validates that the proposed decomposition method significantly enhances the model's ability to perform complex tasks such as multi-digit multiplication and division.
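
The snippet below sketches the two accuracy notions named above; the exact-match check is straightforward, while the digit-level comparison is one plausible reading of a digit match metric rather than the paper's precise definition.

```python
def exact_string_match(pred: str, target: str) -> bool:
    # Correct only if the generated answer string equals the reference exactly.
    return pred.strip() == target.strip()

def digit_match(pred: str, target: str) -> float:
    # Fraction of reference characters reproduced at the same position
    # (an assumed, simplified reading of digit-level accuracy).
    pred, target = pred.strip(), target.strip()
    if not target:
        return 0.0
    return sum(p == t for p, t in zip(pred, target)) / len(target)

print(exact_string_match("1758313", "1758313"))  # True
print(digit_match("1758310", "1758313"))         # 6 of 7 digits correct, ~0.857
```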

Implications and Future Directions

The findings from this research have several implications:

  1. Model Architecture and Tokenization: The paper underscores the importance of consistent tokenization in enhancing arithmetic performance in LLMs.
  2. Instruction Tuning: The success of direct supervised fine-tuning for arithmetic tasks suggests potential expansions into other domains requiring precise, structured outputs.
  3. Generalization vs. Memorization: The research provides evidence that fine-tuned models like Goat can generalize patterns beyond mere memorization, a critical aspect for the practical application of LLMs in computational tasks.

Speculative Directions

Looking forward, the insights from this paper open new avenues for research in AI:

  • Enhanced Arithmetic Reasoning: Further refinement in decomposition strategies can facilitate better generalization of arithmetic reasoning across various numerical ranges.
  • Integration with Other LLMs: The end-to-end instruction tuning methodology demonstrated here could be adapted and applied to other advanced LLMs to boost their performance in arithmetic tasks.
  • Cross-Domain Applications: The principles from this research could be extended to enhance LLM capabilities in domains like scientific computation, financial analysis, and educational technology, where precise numerical reasoning is paramount.

Conclusion

The paper "Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks" makes a significant contribution by addressing the arithmetic deficiencies in LLMs. The fine-tuned Goat model showcases an impressive ability to handle complex arithmetic tasks, thereby setting a benchmark for future research in enhancing LLM arithmetic reasoning. This advancement not only broadens the application scope of LLMs but also provides a structured approach to tackling their inherent limitations in numerical operations.

Authors (2)
  1. Tiedong Liu
  2. Bryan Kian Hsiang Low
Citations (71)