- The paper establishes that large language models consistently store about 2 bits of knowledge per parameter across a wide range of training settings.
- It derives scaling laws from controlled, knowledge-only datasets, showing that each knowledge piece must be seen roughly 1000 times during training for the model to reach this capacity.
- The study finds minimal impact from model architecture and int8 quantization, but reduced capacity under int4 quantization and under the sparsity of MoE models.
Exploring the Knowledge Capacity of LLMs
Introduction
Recent advancements in LLMs have prompted a reevaluation of the fundamental principles underlying their development and training. This work explores the quantifiable relationship between the size of an LLM and its knowledge capacity, framed as the number of knowledge bits a model can store. Through a comprehensive analysis across multiple controlled datasets, we establish that, strikingly, LLMs consistently store about 2 bits of knowledge per parameter, even when the model's parameters are quantized to int8. This finding is significant because it quantifies the efficiency of transformer models as knowledge stores and shows how factors such as model architecture, training duration, quantization, and data quality influence this capacity.
Knowledge Storage in LLMs
For our analysis, we define a piece of knowledge as a (name, attribute, value) tuple drawn from synthetic, knowledge-only datasets. These datasets, free of irrelevant information, make the scaling laws cleanly computable and enable a direct comparison between a model's size in parameters and its storage capacity in bits. This approach to measuring knowledge capacity provides a more accurate and principled method for evaluating and comparing LLMs.
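To make the setup concrete, here is a minimal sketch of how such (name, attribute, value) tuples and their information content could be generated; the attribute names, vocabulary sizes, and helper functions are illustrative assumptions, not the paper's actual dataset construction.

```python
import math
import random

# Hypothetical attribute vocabularies; the real synthetic datasets are far richer.
ATTRIBUTE_VALUES = {
    "birth_city": [f"city_{i}" for i in range(200)],
    "birth_year": list(range(1900, 2100)),
    "employer": [f"company_{i}" for i in range(300)],
}

def sample_person(name: str) -> list[tuple[str, str, str]]:
    """Sample one synthetic individual as (name, attribute, value) tuples."""
    return [(name, attr, str(random.choice(vals)))
            for attr, vals in ATTRIBUTE_VALUES.items()]

def bits_per_person() -> float:
    """Information content of one individual's attributes: each value is
    drawn uniformly at random, so it carries log2(#choices) bits."""
    return sum(math.log2(len(vals)) for vals in ATTRIBUTE_VALUES.values())

print(sample_person("person_0"))
print(f"~{bits_per_person():.1f} bits of knowledge per individual")
```

Because every value is sampled uniformly and independently, the total knowledge in the dataset is simply the per-person bit count times the number of individuals, which is what makes bits-per-parameter a well-defined quantity.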
Findings and Implications
Baseline Scaling Laws
Our findings substantiate a baseline scaling law: GPT2 variants trained with standard AdamW consistently reach a peak capacity ratio of at least 2 bits per parameter across diverse settings, given ample training. At this ratio, a sufficiently trained 7B-parameter model can store roughly 14B bits of knowledge, potentially more than English Wikipedia and textbooks combined.
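The headline arithmetic is straightforward; the sketch below only restates it (the bit measurement in the paper is entropy-based, and the numbers here are illustrative, not reproduced results).

```python
def capacity_ratio(stored_bits: float, num_params: float) -> float:
    """Capacity ratio = knowledge bits the model stores / trainable parameters."""
    return stored_bits / num_params

# At ~2 bits/param, a 7B-parameter model could in principle hold ~14B bits.
num_params = 7e9
stored_bits = 2.0 * num_params
print(capacity_ratio(stored_bits, num_params))                 # -> 2.0
print(f"{stored_bits / 8 / 2**30:.2f} GiB of pure knowledge")  # ~1.63 GiB
```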
Training Duration and Model Capacity
An exploration of training duration reveals that reaching peak capacity depends on how often each knowledge piece is exposed during training. Specifically, each piece must be seen roughly 1000 times for the model to reach the 2 bit/param capacity, highlighting the importance of sufficient training for maximizing knowledge storage.
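Translated into a training budget, the exposure requirement can be sketched as below; the dataset size and the assumption of uniform passes over the data are hypothetical, for illustration only.

```python
def training_tokens_needed(dataset_tokens: float, exposures: int = 1000) -> float:
    """Total training tokens required so each knowledge piece is seen
    `exposures` times, assuming uniform passes over the dataset."""
    return dataset_tokens * exposures

# Hypothetical knowledge-dense corpus of 10B tokens.
print(f"{training_tokens_needed(10e9):.2e} training tokens for 1000 exposures")
```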
Architecture's Influence
Examined across GPT2, LLaMA, Mistral, and variants with scaled-down MLP layers, the architecture of an LLM is observed to have a negligible effect on its knowledge capacity when training is sufficient. However, the gated MLP used in LLaMA and Mistral reduces capacity in less extensively trained regimes, underscoring the subtle influence of architectural choices.
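For reference, the architectural difference at issue is the feed-forward block. The sketch below contrasts a GPT2-style MLP with a LLaMA/Mistral-style gated MLP; layer sizes are arbitrary and the modules are simplified (no normalization or residual connections).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardMLP(nn.Module):
    """GPT2-style feed-forward block: Linear -> GELU -> Linear."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))

class GatedMLP(nn.Module):
    """LLaMA/Mistral-style gated feed-forward block (SwiGLU): the hidden
    activation is elementwise-multiplied by a learned gate."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)
print(StandardMLP(512, 2048)(x).shape, GatedMLP(512, 1408)(x).shape)
```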
Quantization Effects
Quantizing a trained model to int8 shows no adverse effect on its knowledge capacity, even for models already at the 2 bit/param limit, meaning a quarter of every 8-bit weight's storage budget is effectively devoted to knowledge. Conversely, quantization to int4 causes a notable reduction in capacity, outlining both the potential and the limits of aggressive quantization strategies.
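The storage accounting behind this contrast is simple enough to spell out; the numbers below only restate the 2 bit/param figure and are not independent measurements.

```python
def knowledge_fraction(knowledge_bits_per_param: float,
                       storage_bits_per_param: int) -> float:
    """Fraction of raw parameter storage occupied by recoverable knowledge."""
    return knowledge_bits_per_param / storage_bits_per_param

# At ~2 bits of knowledge per parameter, int8 weights still leave that
# knowledge intact (25% of raw storage), whereas int4 would require packing
# knowledge into 50% of raw storage, which the experiments suggest fails.
print(knowledge_fraction(2.0, 8))  # 0.25
print(knowledge_fraction(2.0, 4))  # 0.50
```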
Sparsity and MoE
Examined through the lens of Mixture-of-Experts (MoE) models, the results show that despite their sparsity, with only a fraction of parameters active per token, MoE models suffer only a modest reduction in knowledge capacity, shedding light on the balance between efficiency and capacity in sparse architectures.
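The parameter accounting below illustrates what sparsity means here: total parameters, which set the capacity budget, versus parameters active per token, which set inference cost. The expert counts and sizes are illustrative, not the paper's configuration.

```python
def moe_param_counts(num_experts: int, active_experts: int,
                     expert_params: float, shared_params: float) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts for a sketch MoE layer."""
    total = shared_params + num_experts * expert_params
    active = shared_params + active_experts * expert_params
    return total, active

# Illustrative: 32 experts, 2 routed per token.
total, active = moe_param_counts(num_experts=32, active_experts=2,
                                 expert_params=50e6, shared_params=100e6)
print(f"total={total:.2e}, active per token={active:.2e}, "
      f"active fraction={active / total:.1%}")
```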
The Role of Data Quality
Pertinent to real-world pretraining, the research explores the impact of data quality on model capacity. Notably, the presence of 'junk' data substantially hinders the learning of useful knowledge, though the damage is largely mitigated by strategies such as prepending special tokens to the high-quality data, highlighting the critical influence of data quality on model performance.
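A minimal sketch of that mitigation strategy follows; the token string, the tagging function, and the example documents are all hypothetical, standing in for whatever source identifier a real pipeline would prepend.

```python
def tag_useful_data(documents: list[str], useful_flags: list[bool],
                    token: str = "<useful_domain>") -> list[str]:
    """Prepend a special token to documents from high-quality sources,
    leaving 'junk' documents untouched, so the model can tell them apart."""
    return [f"{token} {doc}" if flag else doc
            for doc, flag in zip(documents, useful_flags)]

docs = ["Anya was born in 1992 in Princeton.", "lorem ipsum click here buy now"]
print(tag_useful_data(docs, [True, False]))
```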
Concluding Thoughts
This comprehensive study of the knowledge capacity of LLMs offers fundamental insights into the strengths and limitations of current LLM architectures and training methodologies. By providing a quantifiable metric of knowledge storage and identifying the factors that influence it, the findings lay a principled foundation for future research and development. As we venture further into AI and language modeling, this work sets the stage for more informed and strategic advances toward realizing the full potential of large-scale LLMs.