
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws (2404.05405v1)

Published 8 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Scaling laws describe the relationship between the size of LLMs and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that LLMs can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include: * The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train. * Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. LLMs can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.

Citations (31)

Summary

  • The paper establishes that large language models consistently store approximately 2 bits of knowledge per parameter across a wide range of training conditions.
  • It uses controlled, knowledge-only datasets to derive scaling laws, revealing that roughly 1000 exposures per knowledge piece are needed to reach full capacity.
  • The study finds minimal impact from model architecture and int8 quantization, but reduced capacity under int4 quantization and under the sparsity of MoE models.

Exploring the Knowledge Capacity of LLMs

Introduction

Recent advancements in LLMs have prompted a reevaluation of the fundamental principles underlying their development and training. This work explores the quantifiable relationship between the size of an LLM and its knowledge capacity, framed as the number of knowledge bits the model can store. Through a comprehensive analysis involving multiple controlled datasets, we establish that, surprisingly, LLMs consistently store approximately 2 bits of knowledge per parameter, even when the model's parameters are quantized to int8. This finding quantifies how efficiently transformer models store knowledge and shows how factors such as model architecture, training duration, quantization, and data quality influence this capacity.

Knowledge Storage in LLMs

For our analysis, we define a piece of knowledge as a (name, attribute, value) tuple drawn from synthetic, knowledge-only datasets. Because these datasets contain no irrelevant text, the number of knowledge bits they carry can be computed directly, enabling a clean comparison between a model's size in parameters and its storage capacity in bits. This approach to measuring knowledge capacity provides a more accurate and principled method for evaluating and comparing LLMs.
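
As a concrete illustration, the sketch below generates a tiny synthetic knowledge-only corpus of (name, attribute, value) tuples; the attribute pools, sentence templates, and bit-count comment are illustrative assumptions, not the paper's exact dataset generator.

```python
import random

# Illustrative attribute pools; the paper's synthetic datasets use biography-style
# attributes, but these exact fields and value sets are assumptions.
ATTRIBUTES = {
    "birth_city": ["Princeton", "Cambridge", "Palo Alto", "Austin"],
    "university": ["MIT", "Stanford", "CMU", "UT Austin"],
    "major": ["biology", "physics", "computer science", "history"],
}

def make_person(idx: int) -> set:
    """Sample one synthetic individual as a set of (name, attribute, value) tuples."""
    name = f"Person_{idx:06d}"  # unique synthetic name
    return {(name, attr, random.choice(values)) for attr, values in ATTRIBUTES.items()}

def tuple_to_sentence(name: str, attr: str, value: str) -> str:
    """Render a tuple as a simple training sentence (templates are illustrative)."""
    templates = {
        "birth_city": f"{name} was born in {value}.",
        "university": f"{name} studied at {value}.",
        "major": f"{name} majored in {value}.",
    }
    return templates[attr]

random.seed(0)
corpus = [tuple_to_sentence(*t) for i in range(3) for t in sorted(make_person(i))]
print("\n".join(corpus))
# With 4 equally likely values per attribute, each tuple carries log2(4) = 2 bits,
# so the dataset's total knowledge content in bits is known by construction.
```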

Findings and Implications

Baseline Scaling Laws

Our findings substantiate a baseline scaling law: GPT2 variants trained with standard AdamW consistently exhibit a peak capacity ratio of at least 2 bits per parameter across diverse settings after ample training. A sufficiently trained 7B-parameter model can therefore store roughly 14B bits of knowledge, potentially surpassing the combined knowledge of English Wikipedia and textbooks.
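
A quick back-of-the-envelope check of this scaling law; the 2 bit/param figure comes from the paper, and the conversion to gigabytes is plain arithmetic.

```python
def knowledge_capacity_bits(n_params: float, bits_per_param: float = 2.0) -> float:
    """Estimated knowledge capacity under the ~2 bit/param scaling law."""
    return n_params * bits_per_param

for n in (1e9, 7e9, 70e9):
    bits = knowledge_capacity_bits(n)
    print(f"{n / 1e9:>4.0f}B params -> {bits / 1e9:.0f}B knowledge bits "
          f"(~{bits / 8 / 1e9:.2f} GB of pure knowledge)")
# 7B params -> 14B bits, i.e. about 1.75 GB of distilled factual knowledge.
```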

Training Duration and Model Capacity

Exploring training duration reveals that reaching full capacity depends on how often each knowledge piece is exposed during training. Specifically, roughly 1000 exposures per knowledge piece are required to reach the 2 bit/param capacity, highlighting the importance of sufficient training for maximizing knowledge storage.
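
To make the exposure requirement concrete, the sketch below converts "1000 exposures per knowledge piece" into a rough training-token budget; the tokens-per-piece figure is an assumption for illustration, not a number from the paper.

```python
def tokens_for_full_capacity(n_knowledge_pieces: int,
                             tokens_per_piece: int = 25,   # assumed average length
                             exposures: int = 1000) -> int:
    """Rough training-token budget if every knowledge piece must be seen ~1000 times."""
    return n_knowledge_pieces * tokens_per_piece * exposures

# Example: a synthetic corpus describing 10M knowledge pieces.
budget = tokens_for_full_capacity(10_000_000)
print(f"~{budget / 1e9:.0f}B training tokens for 1000 exposures")  # ~250B tokens
```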

Architecture's Influence

Examined across GPT2, LLaMA, Mistral, and variants with scaled-down MLP layers, model architecture is observed to have a negligible effect on knowledge capacity when training is sufficient. However, the GatedMLP used in LLaMA and Mistral reduces capacity in less extensively trained regimes, underscoring the subtle influence of architectural choices.
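
For reference, a minimal PyTorch sketch contrasting a GPT-2-style feed-forward block with the gated (SwiGLU-style) block used by LLaMA/Mistral; the dimensions are illustrative, not the paper's configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardMLP(nn.Module):
    """GPT-2 style feed-forward block: up-projection, nonlinearity, down-projection."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class GatedMLP(nn.Module):
    """LLaMA/Mistral style gated feed-forward (SwiGLU): the up-projection is
    multiplied elementwise by a learned gate, which the paper notes is less
    stable and harder to train under short training budgets."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)                 # (batch, seq, d_model)
print(StandardMLP(512, 2048)(x).shape)      # torch.Size([2, 16, 512])
print(GatedMLP(512, 1408)(x).shape)         # smaller hidden dim keeps param count similar
```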

Quantization Effects

Quantizing trained models to int8 has no adverse effect on knowledge capacity, meaning models pack roughly 2 bits of knowledge into every 8 bits of parameter storage, a notable fraction of the theoretical maximum. Quantizing to int4, by contrast, markedly reduces capacity, outlining both the benefits and the limits of quantization strategies.
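
A small arithmetic sketch of why the int8 result is notable: the knowledge bits per parameter stay at roughly 2 while the storage bits per parameter shrink, so knowledge density per stored bit rises. The exact capacity loss at int4 is not reproduced here.

```python
def knowledge_per_storage_bit(knowledge_bits_per_param: float,
                              storage_bits_per_param: int) -> float:
    """Knowledge bits stored per bit of parameter storage."""
    return knowledge_bits_per_param / storage_bits_per_param

for storage in (32, 16, 8):
    ratio = knowledge_per_storage_bit(2.0, storage)
    print(f"{storage:>2}-bit weights: {ratio:.3f} knowledge bits per storage bit")
# 8-bit weights hold 1 knowledge bit per 4 stored bits; int4 does not retain the
# 2 bit/param figure, so its density is not simply 2/4 (exact loss not shown here).
```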

Sparsity and MoE

Examined through the lens of Mixture-of-Experts (MoE) models, the paper shows that despite their sparsity (only a fraction of the parameters is active per token), MoE models suffer only a modest reduction in knowledge capacity relative to their total parameter count, shedding light on the balance between efficiency and performance in sparse architectures.
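
To make the sparsity concrete, here is a minimal top-1 routed MoE feed-forward layer together with a count of total versus per-token active parameters; the sizes and routing scheme are illustrative, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-1 routed mixture-of-experts feed-forward layer. Only one
    expert's parameters are used per token, so the active parameters are a
    small fraction of the total -- the sparsity the paper studies."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.max(dim=-1)     # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out

moe = TinyMoE(d_model=64, d_hidden=256, n_experts=32)
_ = moe(torch.randn(10, 64))   # only the routed expert runs for each token
total = sum(p.numel() for p in moe.parameters())
active = sum(p.numel() for p in moe.experts[0].parameters()) + \
         sum(p.numel() for p in moe.router.parameters())
print(f"total params: {total:,}  active per token: {active:,}")
```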

The Role of Data Quality

Pertinent to real-world applications, the research also examines the impact of data quality on capacity. Notably, the presence of 'junk' data substantially hinders the learning of useful knowledge, though the loss can be largely mitigated by prepending training data with a token identifying its source domain (e.g., wikipedia.org), which lets the model identify and prioritize knowledge-rich domains.
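
A minimal sketch of the mitigation described in the abstract, prepending each training document with its source domain; the delimiter format and the example corpus are assumptions for illustration.

```python
def prepend_domain(doc_text: str, domain: str) -> str:
    """Prefix a training document with its source domain, as the paper suggests
    (e.g., wikipedia.org); the exact delimiter format here is an assumption."""
    return f"<<{domain}>> {doc_text}"

mixed_corpus = [
    ("wikipedia.org", "Washington D.C. is the capital of the USA."),
    ("randomblog.example", "lol idk some capital somewhere probably"),  # 'junk' data
]
training_texts = [prepend_domain(text, domain) for domain, text in mixed_corpus]
for t in training_texts:
    print(t)
```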

Concluding Thoughts

This comprehensive study of the knowledge capacity of LLMs offers fundamental insights into the capabilities and limitations of current LLM architectures and training methodologies. By providing a quantifiable metric of knowledge storage and elucidating the factors that influence it, the findings offer a principled foundation for future research and development in the field of LLMs. As the exploration of AI and language modeling continues, this work sets the stage for more informed and strategic advances toward realizing the full potential of large-scale LLMs.
