Compression Represents Intelligence Linearly (2404.09937v2)

Published 15 Apr 2024 in cs.CL, cs.AI, cs.IT, cs.LG, and math.IT

Abstract: There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of LLMs: the development of more advanced LLMs is essentially enhancing compression which facilitates intelligence. Despite such appealing discussions, little empirical evidence is present for the interplay between compression and intelligence. In this work, we examine their relationship in the context of LLMs, treating LLMs as data compressors. Given the abstract concept of "intelligence", we adopt the average downstream benchmark scores as a surrogate, specifically targeting intelligence related to knowledge and commonsense, coding, and mathematical reasoning. Across 12 benchmarks, our study brings together 31 public LLMs that originate from diverse organizations. Remarkably, we find that LLMs' intelligence -- reflected by average benchmark scores -- almost linearly correlates with their ability to compress external text corpora. These results provide concrete evidence supporting the belief that superior compression indicates greater intelligence. Furthermore, our findings suggest that compression efficiency, as an unsupervised metric derived from raw text corpora, serves as a reliable evaluation measure that is linearly associated with the model capabilities. We open-source our compression datasets as well as our data collection pipelines to facilitate future researchers to assess compression properly.

Authors (4)
  1. Yuzhen Huang
  2. Jinghan Zhang
  3. Zifei Shan
  4. Junxian He

Summary

Exploring the Correlation Between Compression and Intelligence in LLMs

Introduction

The correlation between compression capability and perceived intelligence in LLMs has long been a topic of theoretical discussion in the AI community. Drawing on insights from compression theory, this paper empirically investigates that correlation, positing that an LLM's ability to compress external text corpora can serve as an indicator of its intelligence. Intelligence, for the purposes of this paper, is operationalized as performance across a range of downstream tasks covering knowledge and commonsense, coding, and mathematical reasoning. The study examines 31 public LLMs across 12 benchmarks to test these theoretical claims.

Background

The equivalence between language modeling and compression stems from the premise that efficient prediction models can be converted into efficient lossless compressors, and vice versa. The paper succinctly outlines the foundational theory underpinning this relationship, focusing on the source coding theorem and on arithmetic coding as a practical scheme for lossless data compression. It extends this theory to LLMs, highlighting their potential to serve as general-purpose compressors, provided they can minimize the average code length required to represent data.
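To make this connection concrete, here is a minimal sketch using standard notation not introduced in the summary itself, with $p_\theta$ denoting the LLM's next-token distribution: arithmetic coding lets a model compress a token sequence into essentially its negative log-likelihood in bits, so the compression metric reduces to an average cross-entropy per character,

\[
\mathrm{BPC} \;=\; \frac{1}{C} \sum_{i=1}^{N} -\log_2 p_\theta\!\left(x_i \mid x_{<i}\right),
\]

where $x_1, \dots, x_N$ are the tokens of a corpus containing $C$ characters. Arithmetic coding can encode the sequence in at most $\sum_i -\log_2 p_\theta(x_i \mid x_{<i}) + 2$ bits, so a model that assigns higher likelihood to the corpus compresses it into fewer bits, and minimizing the average code length is equivalent to minimizing the cross-entropy loss.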

Methodology

The paper takes a careful approach to validating the theoretical compression-intelligence correlation in LLMs. An extensive set of models spanning different sizes, architectures, and originating organizations is assessed. Intelligence is measured through performance on downstream tasks selected to cover areas central to current AI applications: knowledge and commonsense, coding, and mathematical reasoning. Compression efficiency is quantified with the bits-per-character (BPC) metric, computed with context windows matched to those used in the benchmark evaluations. The diversity of the assessed models and the matched context windows are crucial for drawing generalizable conclusions.
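As an illustration of how such a metric could be computed in practice, the sketch below estimates BPC for a Hugging Face causal language model. The model name, windowing scheme, and context length here are illustrative assumptions for the sketch, not the authors' exact pipeline.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_character(text: str, model_name: str = "gpt2", context_len: int = 1024) -> float:
    """Estimate BPC: sum -log2 p(token | preceding tokens) over the text,
    then divide by the number of characters."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_bits = 0.0
    stride = context_len - 1  # each window predicts context_len - 1 new tokens
    for start in range(0, len(ids) - 1, stride):
        window = ids[start : start + context_len].unsqueeze(0)
        if window.shape[1] < 2:
            break
        with torch.no_grad():
            logits = model(window[:, :-1]).logits              # [1, T-1, vocab]
        log_probs = torch.log_softmax(logits, dim=-1)
        target_lp = log_probs.gather(-1, window[:, 1:].unsqueeze(-1)).squeeze(-1)
        total_bits += (-target_lp.sum() / math.log(2)).item()  # nats -> bits
    return total_bits / len(text)
```

Note that each window is scored independently, so tokens near a window boundary see a shorter context; this keeps the sketch simple at a small cost in fidelity.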

Results

The paper identifies a near-linear correlation between LLMs' compression efficiency and their performance on downstream tasks, with a Pearson correlation coefficient consistently around -0.95 across the different intelligence domains. The correlation holds across models and benchmarks, establishing a robust link that transcends differences in model size, architecture, and training data. Remarkably, the pattern persists even at the level of individual benchmarks, suggesting that compression efficiency can predict performance with considerable accuracy.
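To illustrate how such linearity can be quantified, a short sketch follows; the BPC values and scores below are made-up placeholders, not numbers from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical (BPC, average benchmark score) pairs for a handful of models.
bpc = np.array([0.62, 0.58, 0.55, 0.53, 0.50])
scores = np.array([38.0, 45.0, 52.0, 57.0, 63.0])

r, p_value = pearsonr(bpc, scores)            # a strong linear inverse trend gives r near -1
slope, intercept = np.polyfit(bpc, scores, deg=1)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
print(f"Linear fit: score ~ {slope:.1f} * BPC + {intercept:.1f}")
```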

Discussion

The findings offer compelling empirical evidence for the long-held belief that a model's ability to compress data is strongly correlated with its performance on tasks that require intelligence. This not only reinforces the theoretical frameworks that position compression as central to intelligent behavior but also carries practical implications for the evaluation of LLMs. The identification of compression efficiency as a potential unsupervised metric for estimating LLM performance is promising, particularly given the challenges of benchmark overfitting and the contamination of evaluation datasets.

Future Directions

While the paper provides substantial evidence for the correlation between compression and intelligence, it also opens several avenues for future research. These include how the correlation behaves in fine-tuned models, how the choice of compression corpus affects the observed relationship, and the minimum corpus size needed for reliable BPC computation. The paper also invites further investigation into tasks requiring cross-domain abilities, suggesting that compression across diverse datasets might offer a more holistic view of a model's intelligence.

In conclusion, this paper substantiates the theoretical premise that superior compression signifies greater intelligence in LLMs, advocating for compression efficiency as a viable metric for LLM evaluation. By empirically establishing this correlation across a wide array of models and benchmarks, the paper lays a foundation for both theoretical and practical advances in understanding and assessing the intelligence of LLMs.
