The Information of Large Language Model Geometry (2402.03471v1)

Published 1 Feb 2024 in cs.LG, cs.AI, cs.CL, cs.IT, and math.IT

Abstract: This paper investigates the information encoded in the embeddings of LLMs. We conduct simulations to analyze the representation entropy and discover a power-law relationship with model size. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.
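The abstract points to two concrete computations: a representation entropy of token embeddings that grows as a power law in model size, and sparse (Lasso) regression of the last token's representation on its context tokens. The Python sketch below is a minimal illustration of both, not the authors' exact procedure: it assumes a von Neumann-style entropy of the normalized Gram matrix of embeddings (in the spirit of the matrix-entropy work the paper cites), uses randomly generated stand-in hidden states, hypothetical model sizes and entropy values for the power-law fit, and an arbitrary Lasso penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical embeddings: rows are token representations from one LLM layer.
rng = np.random.default_rng(0)
H = rng.normal(size=(512, 768))  # (num_tokens, hidden_dim), stand-in for real hidden states

def matrix_entropy(H: np.ndarray) -> float:
    """Von Neumann-style entropy of the normalized Gram matrix of token embeddings."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)   # unit-normalize each token vector
    K = Hn @ Hn.T / Hn.shape[0]                         # normalized Gram matrix (trace = 1)
    eig = np.linalg.eigvalsh(K)
    eig = eig[eig > 1e-12]                              # drop numerically zero eigenvalues
    return float(-(eig * np.log(eig)).sum())

print(f"representation entropy of stand-in embeddings: {matrix_entropy(H):.3f}")

# Power-law check: fit log(entropy) against log(model size) across several models.
model_sizes = np.array([70e6, 160e6, 410e6, 1.4e9])    # hypothetical parameter counts
entropies = np.array([3.1, 3.6, 4.0, 4.5])             # hypothetical measured entropies
slope, intercept = np.polyfit(np.log(model_sizes), np.log(entropies), 1)
print(f"fitted power-law exponent ≈ {slope:.3f}")

# Token selection: regress the last token's embedding on the context tokens' embeddings
# with an L1 penalty; nonzero coefficients mark "selected" context positions.
context, last = H[:-1], H[-1]                           # (n-1, d) context, (d,) last token
lasso = Lasso(alpha=0.05, fit_intercept=False).fit(context.T, last)
selected = np.nonzero(lasso.coef_)[0]
print("context positions selected by Lasso:", selected)
```

In this reading, the nonzero Lasso coefficients play the role of the "meaningful" context tokens that the paper compares against attention weights; the ridge-regression connection mentioned in the abstract would replace the L1 penalty with an L2 one.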
