Language Modeling Is Compression (2309.10668v2)
Abstract: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these LLMs exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that LLMs are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
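The abstract rests on the classical equivalence between prediction and lossless compression: an arithmetic coder driven by a predictive model p spends roughly -log2 p(x_t | x_<t) bits on each symbol, so the compressed size of a sequence is essentially the model's cumulative log-loss. As a minimal sketch (not the paper's code), the snippet below estimates this ideal code length using a toy adaptive byte-level model standing in for an LLM; the helper names `adaptive_byte_model` and `compressed_size_bits` are illustrative assumptions, not from the paper.

```python
import math
from collections import Counter

def adaptive_byte_model():
    """Return a predictor p(next byte | history) based on Laplace-smoothed byte counts."""
    counts = Counter()
    total = 0

    def predict(byte):
        nonlocal total
        # Probability of `byte` under add-one smoothing over the 256 byte values,
        # then update the counts so later predictions condition on the history.
        p = (counts[byte] + 1) / (total + 256)
        counts[byte] += 1
        total += 1
        return p

    return predict

def compressed_size_bits(data: bytes) -> float:
    """Ideal arithmetic-coding cost in bits: sum of -log2 p(x_t | x_<t)."""
    predict = adaptive_byte_model()
    return sum(-math.log2(predict(b)) for b in data)

if __name__ == "__main__":
    text = b"compression is prediction, prediction is compression. " * 20
    bits = compressed_size_bits(text)
    raw_bits = 8 * len(text)
    print(f"raw size:         {raw_bits} bits")
    print(f"ideal code size:  {bits:.1f} bits")
    print(f"compression rate: {bits / raw_bits:.1%} of raw size")
```

Swapping the toy predictor for an LLM's next-token distribution (and pairing it with an actual arithmetic coder) is, in spirit, how compression rates like the 43.4% and 16.4% figures above are obtained.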
- Accelerated deep lossless image coding with unified parallelized GPU coding architecture. In PCS, 2022.
- Fabrice Bellard. Lossless data compression with neural networks. Technical report, Amarisoft, 2019.
- Fabrice Bellard. NNCP v2: Lossless data compression with transformer. Technical report, Amarisoft, 2021.
- The description length of deep learning models. In NeurIPS, 2018.
- Rishi Bommasani et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021.
- Thomas Boutell. PNG (portable network graphics) specification version 1.0. RFC, 1997.
- Language models are few-shot learners. In NeurIPS, 2020.
- Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv:2303.12712, 2023.
- Scaling transformer to 1M tokens and beyond with RMT. arXiv:2304.11062, 2023.
- A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282, 2017.
- Data compression using adaptive coding and partial string matching. IEEE Trans. Commun., 1984.
- Josh Coalson. Free lossless audio codec, 2008. URL https://xiph.org/flac.
- David Cox. Syntactically informed text compression with recurrent neural networks. arXiv:1608.02893, 2016.
- Neural networks and the Chomsky hierarchy. In ICLR, 2023.
- Peter Deutsch. GZIP file format specification version 4.3. RFC, 1996.
- Jarek Duda. Asymmetric numeral systems. arXiv:0902.0271, 2009.
- Text categorization using compression models. In Data Compression Conference, 2000.
- Memory-based meta-learning on non-stationary distributions. arXiv:2302.03067, 2023.
- DeepZip: Lossless data compression using recurrent neural networks. In DCC, 2019.
- DZip: Improved general-purpose lossless compression based on novel neural network modeling. In DCC, 2020.
- LongT5: Efficient text-to-text transformer for long sequences. In NAACL-HLT (Findings), 2022.
- Training compute-optimal large language models. arXiv:2203.15556, 2022.
- Integer discrete flows and lossless compression. In NeurIPS, 2019.
- Analysis of arithmetic coding for data compression. In Data Compression Conference, 1991.
- David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 1952.
- Marcus Hutter. Universal Artificial Intelligence - Sequential Decisions Based on Algorithmic Probability. Springer, 2005.
- Marcus Hutter. 500’000€ prize for compressing human knowledge, 2006. URL http://prize.hutter1.net.
- Few-shot non-parametric learning with deep latent variable model. In NeurIPS, 2022.
- "low-resource" text classification: A parameter-free classification method with compressors. In ACL (Findings), 2023.
- Scaling laws for neural language models. arXiv:2001.08361, 2020.
- Bit-swap: Recursive bits-back coding for lossless compression with hierarchical latent variables. In ICML, 2019.
- Byron Knoll. CMIX, 2014. URL http://www.byronknoll.com/cmix.html.
- A machine learning perspective on predictive coding with PAQ8. In DCC, 2012.
- Andrei N. Kolmogorov. On tables of random numbers. Theoretical Computer Science, 1998.
- Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In ACL (1), 2018.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP (Demonstration), 2018.
- In-context reinforcement learning with algorithm distillation. In ICLR, 2023.
- Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications, 4th Edition. Springer, 2019.
- DecMac: A deep context model for high efficiency arithmetic coding. In ICAIIC, 2019.
- David J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
- Matthew V. Mahoney. Fast text compression with neural networks. In FLAIRS, 2000.
- TRACE: A fast transformer-based general-purpose lossless compressor. In WWW, 2022.
- Practical full resolution learned lossless image compression. In CVPR, 2019.
- Learning better lossless compression using lossy compression. In CVPR, 2020.
- Tomas Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
- Exploring generalization in deep learning. In NIPS, 2017.
- Shaking the foundations: delusions in sequence models for interaction and control. arXiv:2110.10819, 2021.
- Librispeech: An ASR corpus based on public domain audio books. In ICASSP, 2015.
- Richard C. Pasco. Source coding algorithms for fast data compression (Ph.D. thesis abstract). IEEE Trans. Inf. Theory, 1977.
- Igor Pavlov. 7z Format, 2019. URL http://www.7-zip.org/7z.html.
- BPE-dropout: Simple and effective subword regularization. In ACL, 2020.
- Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
- Jack W. Rae et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv:2112.11446, 2021.
- A philosophical treatise of universal induction. Entropy, 2011.
- LC-FDNet: Learned lossless image compression with frequency decomposition network. In CVPR, 2022.
- Jorma Rissanen. Generalized kraft inequality and arithmetic coding. IBM J. Res. Dev., 1976.
- Randomized positional encodings boost length generalization of transformers. In ACL (2), 2023.
- ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 2015.
- Deep-learning-based lossless image coding. IEEE Trans. Circuits Syst. Video Technol., 2020.
- CNN-based prediction for lossless coding of photographic images. In PCS, 2018.
- Sequential neural text compression. IEEE Trans. Neural Networks, 1996.
- Neural machine translation of rare words with subword units. In ACL (1), 2016.
- Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 1948.
- Ray J. Solomonoff. A formal theory of inductive inference. Part I. Inf. Control., 1964a.
- Ray J. Solomonoff. A formal theory of inductive inference. Part II. Inf. Control., 1964b.
- Compression of generative pre-trained language models via quantization. In ACL (1), 2022.
- Using Compression-Based Language Models for Text Categorization, pp. 141–165. Springer Netherlands, 2003.
- LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023.
- Practical lossless compression with latent variables using bits back coding. In ICLR (Poster), 2019.
- LLMZip: Lossless text compression using large language models. arXiv:2306.04050, 2023.
- Aäron van den Oord and Benjamin Schrauwen. The Student-t mixture as a natural image patch prior with application to image compression. J. Mach. Learn. Res., 2014.
- Attention is all you need. In NIPS, 2017.
- Compress and control. In AAAI, 2015.
- Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
- Terry A. Welch. A technique for high-performance data compression. Computer, 1984.
- The context-tree weighting method: basic properties. IEEE Trans. Inf. Theory, 1995.
- Arithmetic coding for data compression. Commun. ACM, 1987.
- Big bird: Transformers for longer sequences. In NeurIPS, 2020.
- Anian Ruoss
- Paul-Ambroise Duquenne
- Elliot Catt
- Tim Genewein
- Christopher Mattern
- Jordi Grau-Moya
- Li Kevin Wenliang
- Matthew Aitchison
- Laurent Orseau
- Marcus Hutter
- Joel Veness
- Grégoire Delétang