Language Modeling Is Compression (2309.10668v2)

Published 19 Sep 2023 in cs.LG, cs.AI, cs.CL, cs.IT, and math.IT

Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these LLMs exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that LLMs are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

Authors (12)
  1. Anian Ruoss
  2. Paul-Ambroise Duquenne
  3. Elliot Catt
  4. Tim Genewein
  5. Christopher Mattern
  6. Jordi Grau-Moya
  7. Li Kevin Wenliang
  8. Matthew Aitchison
  9. Laurent Orseau
  10. Marcus Hutter
  11. Joel Veness
  12. Grégoire Delétang
Citations (94)

Summary

Language Modeling Is Compression

The paper "LLMing is Compression" authored by Grégoire Delétang et al., presents a thorough investigation into the inherent linkage between predictive models and lossless data compression. The paper builds on the foundational concepts of information theory, particularly Shannon's source coding theorem, and explores how the principles of prediction and compression intersect through the lens of modern machine learning.

Core Contributions

  1. Empirical Validation of Compression Capabilities: The authors empirically evaluate the compression performance of LLMs, specifically foundation models like Chinchilla 70B, across different data modalities: text, images, and audio. Although trained primarily on text, these models outperform specialized compression algorithms on image and audio data. For instance, Chinchilla 70B compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors such as PNG and FLAC.
  2. Theoretical Insights on Scaling: The paper revisits scaling laws through the lens of compression and demonstrates that beyond a certain model size, the adjusted compression rate deteriorates because the parameter count itself must be paid for. This insight underscores that better compression (and hence generalization) is not merely a matter of increasing model size; the model must also be matched to the size of the dataset being compressed.
  3. Arithmetic Coding Application: By coupling a model's next-token predictions with arithmetic coding, the paper turns any predictive model into a lossless compressor. The resulting code length is essentially the model's cumulative log-loss, which can further be adjusted to account for the size of the model's parameters; a minimal sketch of this coupling appears after this list.
  4. Generative Capabilities: The research validates the hypothesis that compressors can function as generative models. The paper provides qualitative evidence that models like Chinchilla can autoregressively generate coherent data across modalities, and, conversely, that any compressor, such as gzip, can be used to build a conditional generative model; a second sketch after this list illustrates that inversion.
  5. Tokenization as Pre-compression: Another significant contribution is the analysis of tokenization as a lossless pre-compression step. The paper shows that simpler (smaller-vocabulary) tokenizations often yield better compression rates for Transformers, whereas larger vocabularies mainly serve to pack more information into a fixed-length context window.
  6. In-Context Learning: The paper highlights that foundation models benefit significantly from in-context learning capabilities. This approach, characterized by rapid adaptation within short contexts, allows these models to achieve competitive compression rates without extensive retraining on different data types.
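To make the prediction-to-compression direction concrete, here is a minimal, self-contained Python sketch, not the authors' implementation: a next-symbol predictor supplies conditional probabilities, an arithmetic coder narrows an interval by those probabilities at every step, and the final interval width determines the code length, which is essentially the model's cumulative log-loss. The adaptive byte-frequency model below is a deliberately simple stand-in for an LLM such as Chinchilla.

```python
from fractions import Fraction
from math import ceil, log2


class AdaptiveByteModel:
    """Laplace-smoothed byte frequencies; any next-symbol predictor would do."""

    def __init__(self):
        self.counts = [1] * 256  # one pseudo-count per possible byte value

    def cdf_interval(self, symbol: int):
        # Cumulative-probability interval assigned to `symbol` under the model.
        total = sum(self.counts)
        below = sum(self.counts[:symbol])
        return Fraction(below, total), Fraction(below + self.counts[symbol], total)

    def update(self, symbol: int) -> None:
        self.counts[symbol] += 1


def arithmetic_code_length_bits(data: bytes) -> int:
    """Bits an arithmetic coder would need for `data` under the adaptive model."""
    model = AdaptiveByteModel()
    low, high = Fraction(0), Fraction(1)
    for byte in data:
        sym_low, sym_high = model.cdf_interval(byte)
        width = high - low
        low, high = low + width * sym_low, low + width * sym_high
        model.update(byte)
    # Identifying a point inside the final interval takes about -log2(width) bits.
    return ceil(-log2(high - low)) + 1


if __name__ == "__main__":
    text = b"abracadabra abracadabra abracadabra"
    bits = arithmetic_code_length_bits(text)
    raw_bits = 8 * len(text)
    print(f"raw: {raw_bits} bits, coded: {bits} bits ({bits / raw_bits:.1%} of raw size)")
```

Swapping the stand-in model for an LLM's next-token distribution (and adding the bit-emission and decoding halves of the coder) is what turns a language model into the general-purpose compressor evaluated in the paper.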
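Running the equivalence in the other direction, any compressor induces a conditional "likelihood" over continuations: continuations that compress well together with the context are the ones the compressor implicitly predicts. The sketch below (an illustration under simplifying assumptions, scoring a few multi-byte candidates rather than every next token, and not the paper's exact procedure) uses gzip's compressed length for this; gzip's byte-aligned output makes the scores coarse, in line with the paper's observation that classical compressors are far weaker generative models than LLMs.

```python
import gzip


def continuation_scores(context: bytes, candidates: list) -> list:
    """Rank candidate continuations: fewer extra compressed bytes => more plausible."""
    base = len(gzip.compress(context))
    scored = [(len(gzip.compress(context + cand)) - base, cand) for cand in candidates]
    return sorted(scored)


if __name__ == "__main__":
    context = b"the quick brown fox jumps over the lazy dog. " * 4 + b"the quick brown "
    candidates = [b"fox jumps", b"dog barks", b"lazy dog.", b"qzxj vwpk"]
    for extra_bytes, cand in continuation_scores(context, candidates):
        print(f"{cand!r}: +{extra_bytes} compressed bytes")
```

Replacing the handful of candidates with scores over all possible next tokens, and sampling according to the induced probabilities, is the idea behind the conditional generative model described in the paper.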

Strong Numerical Results and Claims

  • Cross-Modality Compression: Chinchilla 70B compresses ImageNet patches to 43.4% and LibriSpeech to 16.4%, outperforming PNG (58.5%) and FLAC (30.3%), respectively. The paper robustly positions foundation models as general-purpose compressors.
  • Scaling Laws in Compression: The paper reframes scaling laws, showing that effective compression is constrained jointly by model size and dataset size; the adjusted compression rate below makes this trade-off explicit. This nuanced understanding is critical for designing future models optimized for specific data regimes.
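As a rough formalization (the notation here is ours, introduced for illustration rather than taken verbatim from the paper), the arithmetic-coded size of a dataset $\mathcal{D}$ under a model with parameters $\theta$ is approximately the cumulative log-loss, and the adjusted rate charges the parameter description length against that saving:

$$
\ell_\theta(\mathcal{D}) \;\approx\; \sum_{i} -\log_2 p_\theta(x_i \mid x_{<i}), \qquad
\text{raw rate} \;=\; \frac{\ell_\theta(\mathcal{D})}{|\mathcal{D}|}, \qquad
\text{adjusted rate} \;=\; \frac{\ell_\theta(\mathcal{D}) + |\theta|}{|\mathcal{D}|},
$$

where $|\mathcal{D}|$ is the raw size of the data in bits and $|\theta|$ is the number of bits needed to store the model. Scaling the model up keeps lowering $\ell_\theta(\mathcal{D})$, but once $|\theta|$ grows faster than that saving, the adjusted rate deteriorates, which is exactly the turnaround described above.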

Implications and Future Directions

Practical Implications:

  • Versatility in Compression: The demonstrated ability of LLMs to effectively compress data across various modalities underscores their potential utility in diverse real-world applications where data storage and transmission efficiency are paramount.
  • Optimized Model Deployment: Understanding the trade-offs between model size and dataset size can guide the development of more efficient models tailored to specific applications, reducing computational overhead without sacrificing performance.

Theoretical Implications:

  • Reframing Generalization: Viewing generalization through the compression lens provides a unified framework that may help reconcile different perspectives in machine learning and information theory.
  • Tokenization Strategies: Insights into tokenization as pre-compression can inform better design choices for model training, potentially leading to innovations in sequence modeling and natural language processing.

Future Directions:

  • Extended Data Modalities: Further research could explore additional modalities such as video and time-series data, assessing whether the compression effectiveness demonstrated for foundation models carries over.
  • Model Compression Techniques: Investigating methods to reduce model parameter sizes without compromising performance could make these models more practical for large-scale deployment.
  • Online vs. Offline Compression: Developing and comparing algorithms for both online (prequential) and offline compression settings could provide a richer understanding of how foundation models operate in dynamic contexts.

In conclusion, the paper presents a robust framework linking language modeling to compression, backed by empirical evidence and theoretical insights. These findings have significant implications for the design and application of machine learning models across various domains.
