Language Modeling Is Compression

Published 19 Sep 2023 in cs.LG, cs.AI, cs.CL, cs.IT, and math.IT | (2309.10668v2)

Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these LLMs exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that LLMs are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.


Summary

  • The paper demonstrates that large language models can act as strong lossless compressors, since maximizing log-likelihood is equivalent to minimizing coded message length.
  • It reveals through empirical analysis that models like Chinchilla 70B compress datasets such as ImageNet patches and LibriSpeech better than traditional compressors.
  • It outlines the trade-offs in scaling and tokenization, emphasizing the balance between model size, vocabulary, and predictive complexity in achieving optimal compression.

Introduction

The paper "Language Modeling Is Compression" (2309.10668) postulates a conceptual paradigm wherein the essence of LLMs can be viewed through the prism of data compression. The authors assert that LLMs, due to their predictive capabilities, serve as proficient compressors. This equivalence between prediction and compression provides deeper insights into scaling laws, tokenization practices, and the broader implications of in-context learning.

Theoretical Underpinnings

Rooted in information theory, the framework rests on Shannon's source coding theorem: the optimal expected code length for a message equals its negative log-likelihood under the model (in bits). Maximizing log-likelihood (predictive capacity) is therefore equivalent to minimizing coded message length (compression efficiency). The methodology makes this correspondence operational by coupling a predictive model with arithmetic coding, which encodes a sequence in roughly as many bits as the model's negative log-likelihood assigns to it.
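
A minimal sketch of this coupling is given below, using exact Fraction arithmetic rather than the renormalized integer arithmetic of practical coders; the predict callback stands in for the language model's next-token distribution and is an illustrative assumption, not the paper's implementation.

    from fractions import Fraction
    import math

    def arithmetic_encode(symbols, predict):
        """Encode a sequence with interval (arithmetic) coding driven by a model.

        `predict(prefix)` must return the model's distribution over the next
        symbol as a dict of Fractions summing to 1. The code length is about
        -log2 P(sequence) bits, so a better predictor compresses better.
        """
        low, width = Fraction(0), Fraction(1)
        for i, sym in enumerate(symbols):
            cum = Fraction(0)
            for s, p in predict(symbols[:i]).items():
                if s == sym:
                    # Narrow the interval to the observed symbol's sub-interval.
                    low += width * cum
                    width *= p
                    break
                cum += p
        # Emit enough bits to pin down a dyadic point inside [low, low + width).
        n_bits = math.ceil(-math.log2(width)) + 1
        z = math.ceil(low * 2 ** n_bits)
        return format(z, f"0{n_bits}b")

    # Toy "model": a fixed distribution over two symbols, purely illustrative.
    toy = lambda prefix: {"a": Fraction(3, 4), "b": Fraction(1, 4)}
    print(arithmetic_encode("aaab", toy))  # -log2(27/256) ≈ 3.2 bits -> a 5-bit code

Decoding runs the same model in lockstep to invert the interval narrowing, which is why the predictor must be available, unchanged, on both ends of the channel.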

Empirical Analysis

The empirical assessment reveals that LLMs surpass traditional domain-specific compressors on various datasets. Notably, the Chinchilla 70B model compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their original sizes, outperforming PNG and FLAC. These findings underscore the versatility of LLMs as general-purpose compressors, transcending their text-based training.

Scaling Laws and Compression

The paper revisits scaling laws, showing that compression performance improves with model and dataset scale only up to a point. Because the compressor must be counted as part of the code, the model's parameters are added to the compressed output; for a fixed dataset there is therefore an optimal model size beyond which further scaling worsens the adjusted compression rate, reaffirming that model size relative to dataset size dictates performance.
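
A sketch of this accounting follows; the two-bytes-per-parameter cost (16-bit floats) and the example sizes are assumptions for illustration, not figures from the paper.

    def adjusted_compression_rate(compressed_bytes, raw_bytes, num_params,
                                  bytes_per_param=2):
        """Compression rate that charges the model's own size to the output.

        bytes_per_param=2 assumes parameters shipped as 16-bit floats. Under this
        accounting a larger model only helps while its better predictions save
        more bytes than its extra parameters cost.
        """
        return (compressed_bytes + num_params * bytes_per_param) / raw_bytes

    # Hypothetical numbers: a 10 GB corpus compressed to 2 GB of arithmetic code.
    print(adjusted_compression_rate(2e9, 10e9, num_params=1e9))   # 0.4 for a 1B model
    print(adjusted_compression_rate(2e9, 10e9, num_params=70e9))  # 14.2 for a 70B model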

Tokenization and Compression

Tokenization acts as a pre-compression step that determines how much information a model can fit within its context window. Larger token vocabularies shorten sequences, packing more raw data into each context, but they also make every next-token prediction harder, so the balance between vocabulary size and model capacity becomes pivotal.
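
To keep comparisons fair across tokenizers, coding cost can be normalized by the raw byte length of the data rather than by token count; a minimal sketch with made-up probabilities and byte counts (not numbers from the paper):

    import math

    def bits_per_byte(token_probs, num_raw_bytes):
        """Coding cost of a sequence, normalized by its raw length in bytes.

        `token_probs` are the model's probabilities for each observed token; the
        sum of -log2 p is the arithmetic-coded message length in bits, making the
        ratio comparable across tokenizers that split the same bytes differently.
        """
        total_bits = sum(-math.log2(p) for p in token_probs)
        return total_bits / num_raw_bytes

    # Coarse tokenization: 3 tokens covering 12 bytes, each harder to predict.
    print(bits_per_byte([0.25, 0.5, 0.125], num_raw_bytes=12))  # 0.5 bits/byte
    # Byte-level tokenization of the same 12 bytes: more tokens, each easier.
    print(bits_per_byte([0.5] * 12, num_raw_bytes=12))          # 1.0 bits/byte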

Practical Implications and Future Directions

Because any compressor induces a predictive distribution over next symbols, the authors show that an off-the-shelf compressor such as gzip can serve as a conditional generative model via autoregressive sampling: the compressed lengths of candidate continuations determine their relative probabilities. This generative use, however, suffers from error accumulation over long sampling sequences.
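
A greedy sketch of the idea follows; the paper samples from the compressor's induced distribution, so picking the single cheapest byte here is a simplification, and the seed string is arbitrary.

    import gzip

    def generate_with_gzip(context: bytes, n_steps: int = 8) -> bytes:
        """Extend `context` byte by byte using gzip's implicit conditional model.

        A shorter compressed length for context + b corresponds to a higher
        implicit probability of b, so each step appends the cheapest byte.
        """
        out = bytearray(context)
        for _ in range(n_steps):
            best = min(range(256),
                       key=lambda b: len(gzip.compress(bytes(out) + bytes([b]))))
            out.append(best)
        return bytes(out)

    # Sample quality is limited and errors compound over long rollouts,
    # mirroring the limitation noted in the paper.
    print(generate_with_gzip(b"the quick brown fox jumps over the " * 4))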

The study advocates integrating compression metrics into LLM evaluation, broadening assessment beyond conventional log-loss or accuracy benchmarks. Future work will likely explore improving in-context learning and balancing parameter count against compression capability more efficiently.

Conclusion

This research establishes a compelling link between language modeling and data compression, expanding our understanding of LLM capabilities beyond traditional predictive settings. The compression-centric view yields novel insights into model scaling, tokenization, and generative applications, and paves the way for optimizing LLMs across data modalities.
