Language Modeling Is Compression

Published 19 Sep 2023 in cs.LG, cs.AI, cs.CL, cs.IT, and math.IT | (2309.10668v2)

Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these LLMs exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that LLMs are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.


Summary

  • The paper demonstrates that large language models can act as strong lossless compressors, since maximizing log-likelihood is equivalent to minimizing coded message length.
  • It reveals through empirical analysis that models like Chinchilla 70B compress datasets such as ImageNet patches and LibriSpeech better than traditional compressors.
  • It outlines the trade-offs in scaling and tokenization, emphasizing the balance between model size, vocabulary, and predictive complexity in achieving optimal compression.

Introduction

The paper "Language Modeling Is Compression" (2309.10668) postulates a conceptual paradigm wherein the essence of LLMs can be viewed through the prism of data compression. The authors assert that LLMs, due to their predictive capabilities, serve as proficient compressors. This equivalence between prediction and compression provides deeper insights into scaling laws, tokenization practices, and the broader implications of in-context learning.

Theoretical Underpinnings

Rooted in information theory, the framework rests on Shannon's source coding theorem: the optimal expected code length for a message equals its negative log-likelihood under the model (in bits). Maximizing log-likelihood (predictive capacity) is therefore equivalent to minimizing coded message length (compression efficiency). The methodology makes this correspondence operational by coupling a predictive model with arithmetic coding, which encodes a sequence in roughly as many bits as the model's negative log-likelihood assigns to it.
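
A minimal sketch of this coupling is given below, using exact Fraction arithmetic rather than the renormalized integer arithmetic of practical coders; the predict callback stands in for the language model's next-token distribution and is an illustrative assumption, not the paper's implementation.

    from fractions import Fraction
    import math

    def arithmetic_encode(symbols, predict):
        """Encode a sequence with interval (arithmetic) coding driven by a model.

        `predict(prefix)` must return the model's distribution over the next
        symbol as a dict of Fractions summing to 1. The code length is about
        -log2 P(sequence) bits, so a better predictor compresses better.
        """
        low, width = Fraction(0), Fraction(1)
        for i, sym in enumerate(symbols):
            cum = Fraction(0)
            for s, p in predict(symbols[:i]).items():
                if s == sym:
                    # Narrow the interval to the observed symbol's sub-interval.
                    low += width * cum
                    width *= p
                    break
                cum += p
        # Emit enough bits to pin down a dyadic point inside [low, low + width).
        n_bits = math.ceil(-math.log2(width)) + 1
        z = math.ceil(low * 2 ** n_bits)
        return format(z, f"0{n_bits}b")

    # Toy "model": a fixed distribution over two symbols, purely illustrative.
    toy = lambda prefix: {"a": Fraction(3, 4), "b": Fraction(1, 4)}
    print(arithmetic_encode("aaab", toy))  # -log2(27/256) ≈ 3.2 bits -> a 5-bit code

Decoding runs the same model in lockstep to invert the interval narrowing, which is why the predictor must be available, unchanged, on both ends of the channel.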

Empirical Analysis

The empirical assessment reveals that LLMs surpass traditional domain-specific compressors on various datasets. Notably, the Chinchilla 70B model compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their original sizes, outperforming PNG and FLAC. These findings underscore the versatility of LLMs as general-purpose compressors, transcending their text-based training.

Scaling Laws and Compression

The paper revisits scaling laws, showing that compression performance improves with model and dataset scale only up to a point. Because the compressor must be counted as part of the code, the model's parameters are added to the compressed output; for a fixed dataset there is therefore an optimal model size beyond which further scaling worsens the adjusted compression rate, reaffirming that model size relative to dataset size dictates performance.
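
A sketch of this accounting follows; the two-bytes-per-parameter cost (16-bit floats) and the example sizes are assumptions for illustration, not figures from the paper.

    def adjusted_compression_rate(compressed_bytes, raw_bytes, num_params,
                                  bytes_per_param=2):
        """Compression rate that charges the model's own size to the output.

        bytes_per_param=2 assumes parameters shipped as 16-bit floats. Under this
        accounting a larger model only helps while its better predictions save
        more bytes than its extra parameters cost.
        """
        return (compressed_bytes + num_params * bytes_per_param) / raw_bytes

    # Hypothetical numbers: a 10 GB corpus compressed to 2 GB of arithmetic code.
    print(adjusted_compression_rate(2e9, 10e9, num_params=1e9))   # 0.4 for a 1B model
    print(adjusted_compression_rate(2e9, 10e9, num_params=70e9))  # 14.2 for a 70B model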

Tokenization and Compression

Tokenization acts as a pre-compression step that determines how much information a model can fit within its context window. Larger token vocabularies shorten sequences, packing more raw data into each context, but they also make every next-token prediction harder, so the balance between vocabulary size and model capacity becomes pivotal.
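
To keep comparisons fair across tokenizers, coding cost can be normalized by the raw byte length of the data rather than by token count; a minimal sketch with made-up probabilities and byte counts (not numbers from the paper):

    import math

    def bits_per_byte(token_probs, num_raw_bytes):
        """Coding cost of a sequence, normalized by its raw length in bytes.

        `token_probs` are the model's probabilities for each observed token; the
        sum of -log2 p is the arithmetic-coded message length in bits, making the
        ratio comparable across tokenizers that split the same bytes differently.
        """
        total_bits = sum(-math.log2(p) for p in token_probs)
        return total_bits / num_raw_bytes

    # Coarse tokenization: 3 tokens covering 12 bytes, each harder to predict.
    print(bits_per_byte([0.25, 0.5, 0.125], num_raw_bytes=12))  # 0.5 bits/byte
    # Byte-level tokenization of the same 12 bytes: more tokens, each easier.
    print(bits_per_byte([0.5] * 12, num_raw_bytes=12))          # 1.0 bits/byte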

Practical Implications and Future Directions

Because any compressor induces a predictive distribution over next symbols, the authors show that an off-the-shelf compressor such as gzip can serve as a conditional generative model via autoregressive sampling: the compressed lengths of candidate continuations determine their relative probabilities. This generative use, however, suffers from error accumulation over long sampling sequences.
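
A greedy sketch of the idea follows; the paper samples from the compressor's induced distribution, so picking the single cheapest byte here is a simplification, and the seed string is arbitrary.

    import gzip

    def generate_with_gzip(context: bytes, n_steps: int = 8) -> bytes:
        """Extend `context` byte by byte using gzip's implicit conditional model.

        A shorter compressed length for context + b corresponds to a higher
        implicit probability of b, so each step appends the cheapest byte.
        """
        out = bytearray(context)
        for _ in range(n_steps):
            best = min(range(256),
                       key=lambda b: len(gzip.compress(bytes(out) + bytes([b]))))
            out.append(best)
        return bytes(out)

    # Sample quality is limited and errors compound over long rollouts,
    # mirroring the limitation noted in the paper.
    print(generate_with_gzip(b"the quick brown fox jumps over the " * 4))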

The study advocates integrating compression metrics into LLM evaluation, broadening assessment beyond conventional log-loss or accuracy benchmarks. Future work will likely explore improving in-context learning and balancing parameter count against compression capability more efficiently.

Conclusion

This research establishes a compelling link between language modeling and data compression, expanding our understanding of LLM capabilities beyond traditional predictive settings. The compression-centric view yields novel insights into model scaling, tokenization, and generative applications, and paves the way for optimizing LLMs across data modalities.
