Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (2305.09246v1)

Published 16 May 2023 in cs.AI and cs.CL

Abstract: Instruction tuning for LLMs has gained attention from researchers due to its ability to unlock the potential of LLMs in following instructions. While instruction tuning offers advantages for facilitating the adaptation of LLMs to downstream tasks as a fine-tuning approach, training models with tens of millions or even billions of parameters on large amounts of data results in unaffordable computational costs. To address this, we focus on reducing the data used in LLM instruction tuning to decrease training costs and improve data efficiency, dubbed Low Training Data Instruction Tuning (LTD Instruction Tuning). Specifically, this paper conducts a preliminary exploration into reducing the data used in LLM training and identifies several observations regarding task specialization for LLM training, such as the optimization of performance for a specific task, the number of instruction types required for instruction tuning, and the amount of data required for task-specific models. The results suggest that task-specific models can be trained using less than 0.5% of the original dataset, with a 2% improvement in performance over those trained on full task-related data.
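
A minimal sketch of the LTD idea, assuming a Hugging Face stack (transformers and datasets). The base model, the instruction dataset, and the uniform random 0.5% sampling below are illustrative placeholders, not the paper's actual task-specific selection procedure:

```python
# Sketch of Low Training Data (LTD) instruction tuning: keep only a tiny
# (~0.5%) subset of task-related instruction examples and fine-tune on it.
# Placeholder model/dataset; random sampling stands in for the paper's
# task-specific selection.
import random

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2"                  # placeholder base LLM
DATASET_NAME = "tatsu-lab/alpaca"    # placeholder instruction dataset
SUBSET_FRACTION = 0.005              # "maybe only 0.5% is needed"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

raw = load_dataset(DATASET_NAME, split="train")

# Naive selection: uniform random sample of ~0.5% of the examples.
indices = random.sample(range(len(raw)),
                        k=max(1, int(len(raw) * SUBSET_FRACTION)))
subset = raw.select(indices)

def to_features(example):
    # Concatenate instruction and response into one causal-LM sequence.
    text = f"Instruction: {example['instruction']}\nResponse: {example['output']}"
    tokens = tokenizer(text, truncation=True, max_length=512,
                       padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = subset.map(to_features, remove_columns=subset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ltd-instruction-tuned",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=10,
    ),
    train_dataset=tokenized,
)
trainer.train()
```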

Authors (8)
  1. Hao Chen (1005 papers)
  2. Yiming Zhang (128 papers)
  3. Qi Zhang (784 papers)
  4. Hantao Yang (2 papers)
  5. Xiaomeng Hu (12 papers)
  6. Xuetao Ma (9 papers)
  7. Yifan Yanggong (3 papers)
  8. Junbo Zhao (86 papers)
Citations (38)