Compression Represents Intelligence Linearly (2404.09937v2)

Published 15 Apr 2024 in cs.CL, cs.AI, cs.IT, cs.LG, and math.IT

Abstract: There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of LLMs: the development of more advanced LLMs is essentially enhancing compression which facilitates intelligence. Despite such appealing discussions, little empirical evidence is present for the interplay between compression and intelligence. In this work, we examine their relationship in the context of LLMs, treating LLMs as data compressors. Given the abstract concept of "intelligence", we adopt the average downstream benchmark scores as a surrogate, specifically targeting intelligence related to knowledge and commonsense, coding, and mathematical reasoning. Across 12 benchmarks, our study brings together 31 public LLMs that originate from diverse organizations. Remarkably, we find that LLMs' intelligence -- reflected by average benchmark scores -- almost linearly correlates with their ability to compress external text corpora. These results provide concrete evidence supporting the belief that superior compression indicates greater intelligence. Furthermore, our findings suggest that compression efficiency, as an unsupervised metric derived from raw text corpora, serves as a reliable evaluation measure that is linearly associated with the model capabilities. We open-source our compression datasets as well as our data collection pipelines to facilitate future researchers to assess compression properly.

Authors (4)
  1. Yuzhen Huang
  2. Jinghan Zhang
  3. Zifei Shan
  4. Junxian He

Summary

Exploring the Correlation Between Compression and Intelligence in LLMs

Introduction

The correlation between compression capability and perceived intelligence in LLMs has long been a topic of theoretical discussion in the AI community. Drawing on insights from compression theory, this paper empirically investigates that correlation, positing that an LLM's ability to compress external text corpora can serve as an indicator of its intelligence. Intelligence, for the purposes of this paper, is operationalized as performance across a range of downstream tasks covering knowledge and commonsense, coding, and mathematical reasoning. The study examines 31 public LLMs across 12 benchmarks to test these theoretical claims.

Background

The equivalence between language modeling and compression stems from the premise that efficient prediction models can be converted into efficient lossless compressors, and vice versa. The paper succinctly outlines the foundational theory underpinning this relationship, focusing on the source coding theorem and on arithmetic coding as a practical scheme for lossless data compression. It extends this theory to LLMs, highlighting their potential to serve as general-purpose compressors, provided they can minimize the average code length required to represent data.
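To make this connection concrete, here is a minimal sketch using standard notation not introduced in the summary itself, with $p_\theta$ denoting the LLM's next-token distribution: arithmetic coding lets a model compress a token sequence into essentially its negative log-likelihood in bits, so the compression metric reduces to an average cross-entropy per character,

\[
\mathrm{BPC} \;=\; \frac{1}{C} \sum_{i=1}^{N} -\log_2 p_\theta\!\left(x_i \mid x_{<i}\right),
\]

where $x_1, \dots, x_N$ are the tokens of a corpus containing $C$ characters. Arithmetic coding can encode the sequence in at most $\sum_i -\log_2 p_\theta(x_i \mid x_{<i}) + 2$ bits, so a model that assigns higher likelihood to the corpus compresses it into fewer bits, and minimizing the average code length is equivalent to minimizing the cross-entropy loss.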

Methodology

The paper takes a careful approach to validating the theoretical compression-intelligence correlation in LLMs. An extensive set of models spanning different sizes, architectures, and originating organizations is assessed. Intelligence is measured through performance on downstream tasks selected to cover areas central to current AI applications: knowledge and commonsense, coding, and mathematical reasoning. Compression efficiency is quantified with the bits-per-character (BPC) metric, computed with context windows matched to those used in the benchmark evaluations. The diversity of the assessed models and the matched context windows are crucial for drawing generalizable conclusions.
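As an illustration of how such a metric could be computed in practice, the sketch below estimates BPC for a Hugging Face causal language model. The model name, windowing scheme, and context length here are illustrative assumptions for the sketch, not the authors' exact pipeline.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_character(text: str, model_name: str = "gpt2", context_len: int = 1024) -> float:
    """Estimate BPC: sum -log2 p(token | preceding tokens) over the text,
    then divide by the number of characters."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_bits = 0.0
    stride = context_len - 1  # each window predicts context_len - 1 new tokens
    for start in range(0, len(ids) - 1, stride):
        window = ids[start : start + context_len].unsqueeze(0)
        if window.shape[1] < 2:
            break
        with torch.no_grad():
            logits = model(window[:, :-1]).logits              # [1, T-1, vocab]
        log_probs = torch.log_softmax(logits, dim=-1)
        target_lp = log_probs.gather(-1, window[:, 1:].unsqueeze(-1)).squeeze(-1)
        total_bits += (-target_lp.sum() / math.log(2)).item()  # nats -> bits
    return total_bits / len(text)
```

Note that each window is scored independently, so tokens near a window boundary see a shorter context; this keeps the sketch simple at a small cost in fidelity.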

Results

The paper identifies a near-linear correlation between LLMs' compression efficiency and their performance on downstream tasks, with a Pearson correlation coefficient consistently around -0.95 across the different intelligence domains. The correlation holds across models and benchmarks, establishing a robust link that transcends differences in model size, architecture, and training data. Remarkably, the pattern persists even at the level of individual benchmarks, suggesting that compression efficiency can predict performance with considerable accuracy.
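To illustrate how such linearity can be quantified, a short sketch follows; the BPC values and scores below are made-up placeholders, not numbers from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical (BPC, average benchmark score) pairs for a handful of models.
bpc = np.array([0.62, 0.58, 0.55, 0.53, 0.50])
scores = np.array([38.0, 45.0, 52.0, 57.0, 63.0])

r, p_value = pearsonr(bpc, scores)            # a strong linear inverse trend gives r near -1
slope, intercept = np.polyfit(bpc, scores, deg=1)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
print(f"Linear fit: score ~ {slope:.1f} * BPC + {intercept:.1f}")
```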

Discussion

The findings offer compelling empirical evidence for the long-held belief that a model's ability to compress data is strongly correlated with its performance on tasks that require intelligence. This not only reinforces the theoretical frameworks that position compression as central to intelligent behavior but also carries practical implications for the evaluation of LLMs. The identification of compression efficiency as a potential unsupervised metric for estimating LLM performance is promising, particularly given the challenges of benchmark overfitting and the contamination of evaluation datasets.

Future Directions

While the paper provides substantial evidence for the correlation between compression and intelligence, it also opens several avenues for future research. These include how the correlation behaves in fine-tuned models, how the choice of compression corpus affects the observed relationship, and the minimum corpus size needed for reliable BPC computation. The paper also invites further investigation into tasks requiring cross-domain abilities, suggesting that compression across diverse datasets might offer a more holistic view of a model's intelligence.

In conclusion, this paper substantiates the theoretical premise that superior compression signifies greater intelligence in LLMs, advocating for compression efficiency as a viable metric for LLM evaluation. By empirically establishing this correlation across a wide array of models and benchmarks, the paper lays a foundation for both theoretical and practical advances in understanding and assessing the intelligence of LLMs.
