Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning (2402.04833v2)

Published 7 Feb 2024 in cs.CL

Abstract: There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what are they? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with longest responses -- that intuitively contain more learnable information and are harder to overfit -- from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the Open LLM benchmarks that test factual knowledge. We demonstrate this for several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses. Overall, our findings suggest that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning. We provide our code at https://github.com/tml-epfl/long-is-more-for-alignment.

Introduction

Instruction fine-tuning (IFT) of LLMs shapes models to follow human directives, improving their conversational abilities and task performance. While methods such as LIMA and AlpaGasus carefully curate high-quality examples to guide this process, this paper challenges the notion that IFT requires complex example-selection mechanisms.

Baseline Methodology

Prior work has emphasized the selection of high-quality IFT examples, but this paper argues that selecting examples by response length, a straightforward and cost-effective criterion, can not only rival but outstrip more nuanced strategies. The authors extract the 1,000 instruction-response pairs with the longest responses from standard datasets such as Alpaca-52k and Evol-Instruct-70k and show that models fine-tuned on these selections consistently beat sophisticated methods such as LIMA and AlpaGasus in head-to-head evaluations. The findings remain robust across LLM judges, including GPT-4 and PaLM-2.
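
The selection criterion itself is easy to reproduce. Below is a minimal sketch in Python; the Hugging Face dataset name and the whitespace-based length proxy are illustrative assumptions, not necessarily the paper's exact data source or tokenization.

```python
from datasets import load_dataset

# Load Alpaca-52k (the "tatsu-lab/alpaca" hub name is an assumption for
# illustration; the paper also uses Evol-Instruct-70k).
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Rank examples by response length. Whitespace token counts are a simple
# proxy here; any consistent length measure serves the same purpose.
def response_length(example):
    return len(example["output"].split())

# Keep the 1,000 examples with the longest responses as the IFT subset.
longest_1k = sorted(dataset, key=response_length, reverse=True)[:1000]
```

The resulting `longest_1k` subset can then be used directly as the fine-tuning set, with no scoring model or manual curation involved.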

Comprehensive Evaluation

A rigorous assessment across multiple datasets and LLMs (Llama-2-7B, Llama-2-13B, and Mistral-7B-v0.1) confirms the efficacy of the simple baseline: fine-tuning on the 1,000 longest responses often yields significantly better performance than more complex selection methods. Furthermore, a lightweight refinement of the longest instructions, produced through an introspection-like process, further improves the fine-tuned models, yielding competitive results on MT-Bench and the second-highest-ranked Llama-2-7B-based model on AlpacaEval 2.0. The head-to-head comparisons behind these results rely on an LLM judge; a sketch of that protocol follows.
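
The head-to-head evaluations ask a judge model to compare two responses to the same instruction. The sketch below illustrates one way to do this with the OpenAI API; the prompt wording and the `judge` helper are illustrative assumptions, not the paper's exact evaluation template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(instruction: str, answer_a: str, answer_b: str) -> str:
    # Ask the judge model to pick the better answer. The prompt below is
    # an illustrative assumption, not the paper's exact template.
    prompt = (
        "Which response follows the instruction better? "
        "Reply with 'A', 'B', or 'tie'.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {answer_a}\n\n"
        f"Response B: {answer_b}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```

In practice, such comparisons are typically run in both response orders to control for the judge's position bias.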

Implications and Analysis

The findings also suggest that fine-tuning on lengthy responses may be beneficial beyond alignment alone. On the Open LLM benchmarks of factual knowledge, the fine-tuned models generally maintained or improved accuracy, indicating that IFT need not degrade factuality when the training data is sensibly selected. These results point to an intricate relationship between the characteristics of the IFT dataset and the resulting model's abilities.

In essence, the paper challenges prior assumptions about IFT dataset construction, advocating simple heuristics such as response length as a default baseline. These results may prompt a re-evaluation of data-selection practices in the development of future LLMs.

Authors (4)
  1. Hao Zhao (139 papers)
  2. Maksym Andriushchenko (33 papers)
  3. Francesco Croce (34 papers)
  4. Nicolas Flammarion (63 papers)
Citations (30)