MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning (2403.06914v2)
Abstract: LLMs have demonstrated impressive in-context learning (ICL) capabilities, where an LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). Nevertheless, the inclusion of demonstrations leads to a quadratic increase in the computational overhead of the self-attention mechanism. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise the LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where an LLM learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit knowledge distillation to enhance alignment between MEND and the LLM, achieving both efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process comprising meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions using decoder-only (GPT-2) and encoder-decoder (T5) models attest to MEND's prowess. It not only matches but often outperforms vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of LLMs.
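The abstract describes two ingredients: a distillation module that compresses demonstration tokens into a handful of vectors consumed by a frozen LLM, and a knowledge-distillation objective that aligns the LLM's predictions conditioned on the distilled vectors with its predictions conditioned on the full demonstrations. The sketch below illustrates that general idea only; it is not the authors' implementation, and the `DemoDistiller` module, the choice of a single cross-attention layer, the number of distillation vectors `k`, and the example prompts are all illustrative assumptions.

```python
# Minimal sketch of demonstration distillation with a KD alignment loss.
# Assumptions: a frozen GPT-2 as the LLM, a hypothetical cross-attention
# "distiller" producing k soft vectors, and a KL term matching the LLM's
# query-position distribution with vs. without the full demonstrations.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
llm = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
for p in llm.parameters():          # the LLM stays frozen; only the distiller trains
    p.requires_grad_(False)

class DemoDistiller(torch.nn.Module):
    """Compress demonstration embeddings into k distillation vectors (hypothetical)."""
    def __init__(self, hidden: int, k: int = 8):
        super().__init__()
        self.queries = torch.nn.Parameter(torch.randn(k, hidden) * 0.02)
        self.attn = torch.nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

    def forward(self, demo_embeds: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(demo_embeds.size(0), -1, -1)
        out, _ = self.attn(q, demo_embeds, demo_embeds)  # cross-attend to demonstrations
        return out                                       # (batch, k, hidden)

distiller = DemoDistiller(llm.config.n_embd).to(device)
optim = torch.optim.AdamW(distiller.parameters(), lr=1e-4)

demos = "Review: great movie! Sentiment: positive\nReview: boring plot. Sentiment: negative\n"
query = "Review: I loved it. Sentiment:"
demo_ids = tokenizer(demos, return_tensors="pt").input_ids.to(device)
query_ids = tokenizer(query, return_tensors="pt").input_ids.to(device)
embed = llm.get_input_embeddings()

# Teacher pass: full demonstrations + query (quadratic in demonstration length).
with torch.no_grad():
    teacher_logits = llm(torch.cat([demo_ids, query_ids], dim=1)).logits
    teacher_logits = teacher_logits[:, -query_ids.size(1):, :]

# Student pass: k distilled vectors + query (demonstrations replaced by soft prompts).
distilled = distiller(embed(demo_ids))                   # (1, k, hidden)
student_inputs = torch.cat([distilled, embed(query_ids)], dim=1)
student_logits = llm(inputs_embeds=student_inputs).logits[:, -query_ids.size(1):, :]

# Knowledge-distillation loss: align the student's distribution with the teacher's.
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
kd_loss.backward()
optim.step()
```

At inference time, only the `k` distilled vectors plus the test input are fed to the LLM, which is what yields the reduction in self-attention cost relative to passing the full demonstrations.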
Authors: Yichuan Li, Xiyao Ma, Sixing Lu, Kyumin Lee, Xiaohu Liu, Chenlei Guo