Towards the Law of Capacity Gap in Distilling Language Models (2311.07052v3)

Published 13 Nov 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM into a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the "curse" of capacity gap, suggests that a larger teacher does not necessarily result in a superior student compared to one distilled from a smaller teacher. In other words, there is likely an optimal teacher yielding the best student along the scaling course of the teacher. However, the curse of capacity gap cannot be tackled without notable compute overhead, as indicated in previous studies. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as it is an impossible triangle to distill an expected student from an optimal teacher with small compute overhead. Fortunately, the impossible triangle can be made possible provided an induced "law" of capacity gap. In this paper, we take the spirit of scaling laws and reveal that the optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales. The law later guides us to distil a 3B student LM (termed MiniMA) from LLaMA2-7B. MiniMA is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.

Insights on the Law of Capacity Gap in Distilling LLMs

The paper, entitled "Towards the Law of Capacity Gap in Distilling LLMs," addresses a pertinent challenge in language model (LM) distillation: the capacity gap between teacher and student models. It introduces a concept called the "Law of Capacity Gap," proposing that there exists an optimal capacity gap for effective distillation, one that is surprisingly consistent across models of varying scales and architectures. The research builds on previous findings that the benefits of increasing teacher LM size do not always carry over to the distilled student, a phenomenon known as the "curse of capacity gap."

Key Contributions

  1. Identifying the Law of Capacity Gap: The authors posit that, despite the challenges imposed by the capacity gap, there exists a consistently optimal gap that maximizes student model performance across numerous scales and architecture variations. This optimal capacity gap becomes a strategic guideline for deciding the teacher model size relative to the desired student model scale (see the sketch after this list).
  2. Empirical Validation: The hypothesis is substantiated through empirical evaluations with models such as GPT-2 and Pythia on OpenWebText corpus data. The experiments reveal that scaling the teacher improves the distilled student only up to a certain point, indicating the law's applicability across different model configurations.
  3. Development of MiniMA: One significant outcome of applying the Law of Capacity Gap is the creation of a 3 billion parameter student model named MiniMA, distilled from a specially adapted LLaMA2-7B teacher model. Utilizing the identified optimal capacity gap has allowed MiniMA to outperform benchmarked 3B models, establishing a new compute-performance Pareto frontier on various evaluation benchmarks.
  4. MiniChat and Instruction Following: Further fine-tuning MiniMA into an instruction-tuned model, termed MiniChat, demonstrated superior performance against 3B competitors on GPT-4-assisted evaluations, showcasing its capacity to compete with some 7B-scale chat models.
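
To make the law concrete, the following is a minimal sketch, in Python/PyTorch, of how it might be applied: a hypothetical linear rule for choosing the teacher scale from a target student scale, paired with a standard logit-distillation loss. The function names, the slope value, and the loss formulation are illustrative assumptions, not code or coefficients reported in the paper.

```python
import torch
import torch.nn.functional as F

def optimal_teacher_scale(student_params: float, slope: float = 2.5) -> float:
    """Hypothetical reading of the linear law: the best teacher scale grows
    linearly with the student scale. The slope of 2.5 is a placeholder, not
    a coefficient reported in the paper."""
    return slope * student_params

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """Generic token-level logit distillation (forward KL), shown only as a
    standard formulation of teacher-to-student transfer, not necessarily the
    paper's exact objective."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), summed over the vocabulary and averaged over the
    # batch dimension; the customary t^2 factor keeps the gradient scale stable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t ** 2)

# With the placeholder slope, a 3B-parameter student maps to roughly a 7.5B
# teacher, loosely matching the 3B-from-LLaMA2-7B setup used for MiniMA.
print(f"{optimal_teacher_scale(3e9):.2e}")
```

Under this reading, selecting the teacher reduces to evaluating a linear rule rather than sweeping over candidate teacher scales, which is where the reduction in compute overhead would come from.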

Implications and Future Directions

The introduction of the Law of Capacity Gap has notable implications for both the theoretical understanding and practical implementation of model distillation. By presenting a consistent guideline for the optimal capacity gap, this research can potentially streamline the distillation process across various AI applications, reducing computational overhead and resource demands.

Practically, this finding could facilitate more economical model training pipelines, enhancing scaling efficiency without sacrificing model performance. Theoretically, it prompts further inquiry into the nature of knowledge transfer and capacity alignment in neural networks, possibly influencing future architectural decisions and distillation methodologies.

Speculatively, as AI systems increasingly integrate into diverse sectors, the ability to efficiently and effectively distill LMs could accelerate their deployment in resource-constrained environments, democratizing access to powerful language technologies. Additionally, exploration into dynamic capacity gap modeling, responding to evolving task requirements, could usher in breakthroughs in adaptive AI systems.

In conclusion, this paper advances our understanding of capacity dynamics in model distillation, laying the groundwork for more effective transfer of knowledge between varying model scales and paving the way for next-generation AI architectures tailored for optimized distillation processes.

Authors (4)
  1. Chen Zhang (403 papers)
  2. Dawei Song (62 papers)
  3. Zheyu Ye (12 papers)
  4. Yan Gao (157 papers)
Citations (17)