Language Modeling Is Compression (2309.10668v2)

Published 19 Sep 2023 in cs.LG, cs.AI, cs.CL, cs.IT, and math.IT

Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these LLMs exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that LLMs are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

Authors (12)
  1. Anian Ruoss
  2. Paul-Ambroise Duquenne
  3. Elliot Catt
  4. Tim Genewein
  5. Christopher Mattern
  6. Jordi Grau-Moya
  7. Li Kevin Wenliang
  8. Matthew Aitchison
  9. Laurent Orseau
  10. Marcus Hutter
  11. Joel Veness
  12. Grégoire Delétang
Citations (94)

Summary

Language Modeling Is Compression

The paper "LLMing is Compression" authored by Grégoire Delétang et al., presents a thorough investigation into the inherent linkage between predictive models and lossless data compression. The paper builds on the foundational concepts of information theory, particularly Shannon's source coding theorem, and explores how the principles of prediction and compression intersect through the lens of modern machine learning.

Core Contributions

  1. Empirical Validation of Compression Capabilities: The authors empirically evaluate the compression performance of LLMs, specifically foundation models like Chinchilla 70B, across different data modalities: text, images, and audio. Although trained primarily on text, these models outperform specialized compression algorithms on image and audio data. For instance, Chinchilla 70B compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors such as PNG and FLAC.
  2. Theoretical Insights on Scaling: The paper revisits scaling laws through the lens of compression and demonstrates that beyond a certain model size, the adjusted compression rate deteriorates because the parameter count itself must be paid for. This insight underscores that better compression (and hence generalization) is not merely a matter of increasing model size; the model must also be matched to the size of the dataset being compressed.
  3. Arithmetic Coding Application: By coupling a model's next-token predictions with arithmetic coding, the paper turns any predictive model into a lossless compressor. The resulting code length is essentially the model's cumulative log-loss, which can further be adjusted to account for the size of the model's parameters; a minimal sketch of this coupling appears after this list.
  4. Generative Capabilities: The research validates the hypothesis that compressors can function as generative models. The paper provides qualitative evidence that models like Chinchilla can autoregressively generate coherent data across modalities, and, conversely, that any compressor, such as gzip, can be used to build a conditional generative model; a second sketch after this list illustrates that inversion.
  5. Tokenization as Pre-compression: Another significant contribution is the analysis of tokenization as a lossless pre-compression step. The paper shows that simpler (smaller-vocabulary) tokenizations often yield better compression rates for Transformers, whereas larger vocabularies mainly serve to pack more information into a fixed-length context window.
  6. In-Context Learning: The paper highlights that foundation models benefit significantly from in-context learning capabilities. This approach, characterized by rapid adaptation within short contexts, allows these models to achieve competitive compression rates without extensive retraining on different data types.
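To make the prediction-to-compression direction concrete, here is a minimal, self-contained Python sketch, not the authors' implementation: a next-symbol predictor supplies conditional probabilities, an arithmetic coder narrows an interval by those probabilities at every step, and the final interval width determines the code length, which is essentially the model's cumulative log-loss. The adaptive byte-frequency model below is a deliberately simple stand-in for an LLM such as Chinchilla.

```python
from fractions import Fraction
from math import ceil, log2


class AdaptiveByteModel:
    """Laplace-smoothed byte frequencies; any next-symbol predictor would do."""

    def __init__(self):
        self.counts = [1] * 256  # one pseudo-count per possible byte value

    def cdf_interval(self, symbol: int):
        # Cumulative-probability interval assigned to `symbol` under the model.
        total = sum(self.counts)
        below = sum(self.counts[:symbol])
        return Fraction(below, total), Fraction(below + self.counts[symbol], total)

    def update(self, symbol: int) -> None:
        self.counts[symbol] += 1


def arithmetic_code_length_bits(data: bytes) -> int:
    """Bits an arithmetic coder would need for `data` under the adaptive model."""
    model = AdaptiveByteModel()
    low, high = Fraction(0), Fraction(1)
    for byte in data:
        sym_low, sym_high = model.cdf_interval(byte)
        width = high - low
        low, high = low + width * sym_low, low + width * sym_high
        model.update(byte)
    # Identifying a point inside the final interval takes about -log2(width) bits.
    return ceil(-log2(high - low)) + 1


if __name__ == "__main__":
    text = b"abracadabra abracadabra abracadabra"
    bits = arithmetic_code_length_bits(text)
    raw_bits = 8 * len(text)
    print(f"raw: {raw_bits} bits, coded: {bits} bits ({bits / raw_bits:.1%} of raw size)")
```

Swapping the stand-in model for an LLM's next-token distribution (and adding the bit-emission and decoding halves of the coder) is what turns a language model into the general-purpose compressor evaluated in the paper.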
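Running the equivalence in the other direction, any compressor induces a conditional "likelihood" over continuations: continuations that compress well together with the context are the ones the compressor implicitly predicts. The sketch below (an illustration under simplifying assumptions, scoring a few multi-byte candidates rather than every next token, and not the paper's exact procedure) uses gzip's compressed length for this; gzip's byte-aligned output makes the scores coarse, in line with the paper's observation that classical compressors are far weaker generative models than LLMs.

```python
import gzip


def continuation_scores(context: bytes, candidates: list) -> list:
    """Rank candidate continuations: fewer extra compressed bytes => more plausible."""
    base = len(gzip.compress(context))
    scored = [(len(gzip.compress(context + cand)) - base, cand) for cand in candidates]
    return sorted(scored)


if __name__ == "__main__":
    context = b"the quick brown fox jumps over the lazy dog. " * 4 + b"the quick brown "
    candidates = [b"fox jumps", b"dog barks", b"lazy dog.", b"qzxj vwpk"]
    for extra_bytes, cand in continuation_scores(context, candidates):
        print(f"{cand!r}: +{extra_bytes} compressed bytes")
```

Replacing the handful of candidates with scores over all possible next tokens, and sampling according to the induced probabilities, is the idea behind the conditional generative model described in the paper.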

Strong Numerical Results and Claims

  • Cross-Modality Compression: Chinchilla 70B compresses ImageNet patches to 43.4% and LibriSpeech to 16.4%, outperforming PNG (58.5%) and FLAC (30.3%), respectively. The paper robustly positions foundation models as general-purpose compressors.
  • Scaling Laws in Compression: The paper reframes scaling laws, showing that effective compression is constrained jointly by model size and dataset size; the adjusted compression rate below makes this trade-off explicit. This nuanced understanding is critical for designing future models optimized for specific data regimes.
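As a rough formalization (the notation here is ours, introduced for illustration rather than taken verbatim from the paper), the arithmetic-coded size of a dataset $\mathcal{D}$ under a model with parameters $\theta$ is approximately the cumulative log-loss, and the adjusted rate charges the parameter description length against that saving:

$$
\ell_\theta(\mathcal{D}) \;\approx\; \sum_{i} -\log_2 p_\theta(x_i \mid x_{<i}), \qquad
\text{raw rate} \;=\; \frac{\ell_\theta(\mathcal{D})}{|\mathcal{D}|}, \qquad
\text{adjusted rate} \;=\; \frac{\ell_\theta(\mathcal{D}) + |\theta|}{|\mathcal{D}|},
$$

where $|\mathcal{D}|$ is the raw size of the data in bits and $|\theta|$ is the number of bits needed to store the model. Scaling the model up keeps lowering $\ell_\theta(\mathcal{D})$, but once $|\theta|$ grows faster than that saving, the adjusted rate deteriorates, which is exactly the turnaround described above.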

Implications and Future Directions

Practical Implications:

  • Versatility in Compression: The demonstrated ability of LLMs to effectively compress data across various modalities underscores their potential utility in diverse real-world applications where data storage and transmission efficiency are paramount.
  • Optimized Model Deployment: Understanding the trade-offs between model size and dataset size can guide the development of more efficient models tailored to specific applications, reducing computational overhead without sacrificing performance.

Theoretical Implications:

  • Reframing Generalization: Viewing generalization through the compression lens provides a unified framework that may help reconcile different perspectives in machine learning and information theory.
  • Tokenization Strategies: Insights into tokenization as pre-compression can inform better design choices for model training, potentially leading to innovations in sequence modeling and natural language processing.

Future Directions:

  • Extended Data Modalities: Further research could explore additional modalities such as video and time-series data, assessing whether the compression effectiveness demonstrated for foundation models carries over.
  • Model Compression Techniques: Investigating methods to reduce model parameter sizes without compromising performance could make these models more practical for large-scale deployment.
  • Online vs. Offline Compression: Developing and comparing algorithms for both online (prequential) and offline compression settings could provide a richer understanding of how foundation models operate in dynamic contexts.

In conclusion, the paper presents a robust framework linking language modeling to compression, backed by empirical evidence and theoretical insights. These findings have significant implications for the design and application of machine learning models across various domains.
