Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty (2308.02019v2)

Published 3 Aug 2023 in cs.CL

Abstract: We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of LLMs. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
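
The distillation setup described in the abstract can be illustrated with a short sketch: the student is trained on a loss that combines the usual next-token cross-entropy with a KL-divergence term against the averaged soft predictions of the teacher ensemble (GPT-2 and a small LLaMA in the paper). The PyTorch code below is a hedged illustration only; the loss weighting alpha, the temperature, and the padding handling are assumptions, not the paper's reported recipe.

```python
# Illustrative sketch of ensemble knowledge distillation for causal LMs.
# The 50/50 loss weighting and the temperature are assumptions for this
# example, not the paper's reported hyperparameters.
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    """Hard-label cross-entropy plus a soft KL term against the averaged
    distribution of the teacher ensemble (e.g. GPT-2 + a small LLaMA)."""
    vocab = student_logits.size(-1)

    # Standard next-token cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         labels.view(-1), ignore_index=-100)

    # Average the teachers' temperature-softened output distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student's softened log-probs and the
    # ensemble average, scaled by T^2 as is conventional in distillation.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab),
                  teacher_probs.view(-1, vocab),
                  reduction="batchmean") * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl
```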

Authors (2)
  1. Inar Timiryasov (23 papers)
  2. Jean-Loup Tastet (11 papers)
Citations (28)
