Overview of Training LLMs on Neurally Compressed Text
Introduction
The rapid advancement of LLMs has created demand for training techniques that can handle vast datasets without proportional increases in computational resources. One approach is to train LLMs on neurally compressed text. This paper explores training LLMs directly over text compressed with neural methods, focusing on the central obstacle: strongly compressed text becomes opaque and therefore hard to learn from. The authors propose Equal-Info Windows, a method that segments text into blocks that each compress to the same bit length, making learning over neurally compressed data feasible.
Compression Techniques and Learning Challenges
Standard subword tokenizers already provide a mild form of compression: by packing more text into each token, they let models process more text per training step. The natural follow-up question is whether compressing the text further, with a stronger neural compressor, would let LLMs train more efficiently by learning from a denser representation of the same data. This question forms the basis of the paper's exploration.
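To make the baseline concrete, a tokenizer's compression can be measured as UTF-8 bytes covered per token. The snippet below is a minimal illustration written for this overview, not code from the paper; the whitespace tokenizer is only a stand-in for a real subword tokenizer such as SentencePiece.

```python
def bytes_per_token(text: str, tokenize) -> float:
    """Compression rate of a tokenizer, measured in UTF-8 bytes per token.
    A byte-level model sits at 1 byte/token; subword tokenizers cover
    several bytes per token, so each training step sees more text."""
    tokens = tokenize(text)
    return len(text.encode("utf-8")) / max(len(tokens), 1)

# Whitespace splitting as a crude stand-in for a subword tokenizer.
print(bytes_per_token("standard subword tokenizers already compress the input", str.split))
```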
Directly applying Arithmetic Coding (AC), a near-optimal method for text compression, produced data that LLMs struggled to learn from: the encoding procedure is complex to invert, and the resulting bitstream is nearly uniform, leaving little structure for the model to exploit. This motivated the Equal-Info Windows technique, which segments the text before compression so that each segment compresses to the same number of bits, preserving a degree of learnability.
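For intuition about why AC output is so hard to model, the toy coder below (an illustration written for this overview, not the paper's coder, which is driven by a smaller language model) shows the interval-narrowing mechanics: the emitted bit count tracks the information content of the text, and the bits themselves carry almost no surface structure.

```python
from fractions import Fraction
import math

def arithmetic_encode(text: str, probs: dict) -> str:
    """Toy arithmetic coder over a static distribution `probs` (symbol ->
    Fraction probability, summing to 1). Each symbol narrows the interval
    [low, high); the output is the shortest binary fraction inside the
    final interval. A real decoder would also need the message length."""
    cum, running = {}, Fraction(0)
    for s in sorted(probs):                      # cumulative distribution, fixed symbol order
        cum[s] = running
        running += probs[s]
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        low, high = low + width * cum[ch], low + width * (cum[ch] + probs[ch])
    k = 1                                        # smallest k with a multiple of 2**-k in [low, high)
    while math.ceil(low * 2**k) * Fraction(1, 2**k) >= high:
        k += 1
    return format(math.ceil(low * 2**k), "b").zfill(k)

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
bits = arithmetic_encode("abacab", probs)
print(bits, f"({len(bits)} bits; the string carries 9 bits of information)")
```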
Equal-Info Windows: Enhancing Learnability
The Equal-Info Windows method segments the corpus text into blocks before applying AC, with each block designed to compress to the same number of bits; a minimal sketch of this windowing step appears after the list below. By construction, each post-compression token carries a stable amount of information, which substantially aids learning. The paper's findings show that this method:
- Is effective across a range of settings; within each window, learning is strongest for tokens near the start and diminishes for tokens near the end.
- Yields models that outperform byte-level baselines on perplexity benchmarks at a fixed computation budget.
- Still falls short of subword baselines, largely because the mapping between compressed tokens and text is unstable.
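The windowing step itself can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: a static unigram model supplies per-character bit costs as a stand-in for the neural model that drives the arithmetic coder, and the AC output length is approximated by the summed information content of the characters. In the full method, each window would then be compressed independently and padded out to the fixed bit budget, so every window maps to the same number of tokens.

```python
import math
from collections import Counter

def unigram_bit_costs(corpus: str) -> dict:
    """Per-character information content -log2 p(c) under a unigram model
    fit on `corpus`. An arithmetic coder's output length is close to the
    summed information content of the symbols it encodes, so these costs
    approximate per-character compressed size."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {c: -math.log2(n / total) for c, n in counts.items()}

def equal_info_windows(text: str, bit_costs: dict, budget_bits: float = 16.0):
    """Greedily split `text` into windows whose approximate compressed size
    stays within `budget_bits`. In the full method each window would be
    compressed on its own (compressor reset at the boundary) and padded
    to exactly `budget_bits`."""
    windows, current, used = [], [], 0.0
    for ch in text:
        cost = bit_costs.get(ch, budget_bits)    # unseen characters close the window
        if current and used + cost > budget_bits:
            windows.append("".join(current))
            current, used = [], 0.0
        current.append(ch)
        used += cost
    if current:
        windows.append("".join(current))
    return windows

corpus = "a longer sample of text used to fit the toy unigram model for this quick illustration of windows"
costs = unigram_bit_costs(corpus)
print(equal_info_windows("segment this text into equal info windows", costs))
```

Fixing the bit budget per window is what keeps the mapping between compressed tokens and text tractable: each fixed-size block of tokens corresponds to a short, self-contained span of text.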
Implications and Future Directions
Training LLMs over neurally compressed text has significant implications for model efficiency in both training and inference. Models that can process highly compressed text handle longer sequences and more data within the same computational budget. Practical deployment, however, hinges on resolving the instability of token-to-text mappings and the near-uniformity of compressed data.
Further research could refine neural compression techniques so that the compressed data retains more structure, improving predictability without giving up much compression rate. Compression schemes that yield a more stable mapping between tokens and text would also be valuable.
Conclusion
This paper introduces a viable pathway towards training LLMs on neurally compressed text, highlighting the potential for substantial gains in training efficiency. The Equal-Info Windows method, in particular, represents a promising approach to maintaining learnability in the face of strong compression. Future work in this domain holds the potential to significantly reduce the computational load associated with LLM training, making it an exciting area for ongoing investigation.