
One-Shot Learning as Instruction Data Prospector for Large Language Models (2312.10302v4)

Published 16 Dec 2023 in cs.CL and cs.AI

Abstract: Contemporary practices in instruction tuning often hinge on enlarging data scaling without a clear strategy for ensuring data quality, inadvertently introducing noise that may compromise model performance. To address this challenge, we introduce \textsc{Nuggets}, a novel and efficient methodology that leverages one-shot learning to discern and select high-quality instruction data from extensive datasets. \textsc{Nuggets} assesses the potential of individual instruction examples to act as effective one-shot learning instances, thereby identifying those that can significantly improve performance across diverse tasks. \textsc{Nuggets} utilizes a scoring system based on the impact of candidate examples on the perplexity of a diverse anchor set, facilitating the selection of the most advantageous data for instruction tuning. Through comprehensive evaluations on two benchmarks, including MT-Bench and Alpaca-Eval, we show that instruction tuning with the top 1\% of examples curated by \textsc{Nuggets} substantially outperforms conventional methods employing the entire dataset.
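The abstract describes scoring each candidate example by whether using it as a one-shot demonstration makes a fixed anchor set more predictable than the zero-shot baseline. A minimal sketch of that idea is below; this is my reading of the abstract, not the authors' code, and `toy_loglik` is a hypothetical word-overlap stand-in for a real language model's log-likelihood.

```python
# Sketch of the Nuggets "golden score" idea (interpretation of the abstract,
# not the paper's implementation): score a candidate instruction example by
# how often prepending it as a one-shot demonstration raises the likelihood
# of a diverse anchor set, relative to zero-shot.

def toy_loglik(context: str, target: str) -> float:
    """Hypothetical stand-in for a language model's log-likelihood of
    `target` given `context`. Uses a crude word-overlap proxy so the sketch
    runs without a model; a real implementation would query an LLM."""
    ctx_words = set(context.lower().split())
    tgt_words = target.lower().split()
    if not tgt_words:
        return 0.0
    return sum(w in ctx_words for w in tgt_words) / len(tgt_words)

def golden_score(candidate: tuple[str, str],
                 anchors: list[tuple[str, str]]) -> float:
    """Fraction of anchor (instruction, answer) pairs whose answer becomes
    more likely when the candidate is prepended as a one-shot example."""
    cand_text = f"{candidate[0]}\n{candidate[1]}\n"
    wins = 0
    for instruction, answer in anchors:
        zero_shot = toy_loglik(instruction, answer)
        one_shot = toy_loglik(cand_text + instruction, answer)
        if one_shot > zero_shot:
            wins += 1
    return wins / len(anchors)

# Tiny illustrative anchor set and two candidates.
anchors = [
    ("translate hello to french", "bonjour hello"),
    ("sum two and three", "five"),
]
relevant = ("say hello in french", "bonjour")
irrelevant = ("name a color", "blue")

print(golden_score(relevant, anchors))    # helps on the translation anchor
print(golden_score(irrelevant, anchors))  # helps on no anchor
```

Ranking all candidates by this score and keeping the top slice (the paper reports strong results with the top 1%) yields the curated instruction-tuning subset.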

Authors (12)
  1. Yunshui Li (18 papers)
  2. Binyuan Hui (57 papers)
  3. Xiaobo Xia (43 papers)
  4. Min Yang (239 papers)
  5. Lei Zhang (1689 papers)
  6. Shuzheng Si (20 papers)
  7. Junhao Liu (60 papers)
  8. Tongliang Liu (251 papers)
  9. Fei Huang (408 papers)
  10. Yongbin Li (128 papers)
  11. Jiaxi Yang (31 papers)
  12. Ling-Hao Chen (13 papers)
Citations (22)