MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (2305.07185v2)

Published 12 May 2023 in cs.LG

Abstract: Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding -- unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.

MEGABYTE is a novel multiscale decoder architecture designed to efficiently process and predict sequences of millions of bytes, addressing the limitations of traditional autoregressive transformers which scale poorly to very long sequences. The core idea is to decompose the long sequence into smaller, fixed-size patches and employ a hierarchical approach with a "Global" model operating on patches and a "Local" model operating on bytes within patches. This enables tokenization-free sequence modeling at scale across various modalities.

Architecture Overview

MEGABYTE consists of three main components:

  1. Patch Embedder: This component takes the input byte sequence and transforms it into a sequence of patch representations. Each byte is initially embedded using a standard lookup table with positional embeddings. These byte embeddings are then grouped and concatenated into patches of a fixed size $P$. To maintain the autoregressive property, padding is applied at the beginning of the patch sequence.
  2. Global Model: This is a large decoder-only Transformer that operates on the sequence of patch embeddings produced by the Patch Embedder. It uses masked self-attention to capture dependencies between different patches (long-range context). The output of the Global model for a given patch represents a contextualized representation of that patch based on all preceding patches.
  3. Local Model: This is a smaller autoregressive model (also a decoder-only Transformer) that operates within each patch. It takes the contextualized patch representation from the Global model and combines it with embeddings of the preceding bytes within the current patch. The Local model then predicts the next byte within that patch. Crucially, copies of the Local model can process different patches in parallel during training.

The prediction process is autoregressive at two levels: the Global model processes patches sequentially, and the Local model processes bytes sequentially within a patch, conditioned on the Global model's output and previous bytes in the patch.
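
The end-to-end data flow can be made concrete with a short sketch. The following PyTorch code is a minimal illustration under assumed sizes (patch size 8, hypothetical dimensions, and `nn.TransformerEncoder` with a causal mask standing in for the decoder-only Global and Local stacks); it is not the authors' implementation, and the causal input offsets discussed under padding below are omitted for brevity.

```python
import torch
import torch.nn as nn

def causal_mask(n, device=None):
    # Additive attention mask: position i may not attend to positions > i.
    return torch.triu(torch.full((n, n), float("-inf"), device=device), diagonal=1)

class MegabyteSketch(nn.Module):
    """Two-level decoder sketch: a Global transformer over patch embeddings and a
    smaller Local transformer over bytes within each patch. Sizes are illustrative."""
    def __init__(self, vocab=256, P=8, d_global=512, d_local=128, max_len=8192):
        super().__init__()
        self.P = P
        self.byte_emb = nn.Embedding(vocab, d_global // P)      # per-byte embedding
        self.pos_emb = nn.Embedding(max_len, d_global // P)     # positional embedding
        self.global_model = nn.TransformerEncoder(              # stand-in for a decoder-only stack
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), num_layers=6)
        self.to_local = nn.Linear(d_global // P, d_local)       # project per-byte slices of patch output
        self.local_byte_emb = nn.Embedding(vocab, d_local)
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_local, vocab)                   # next-byte logits

    def forward(self, bytes_in):                                # bytes_in: (B, T), T divisible by P
        B, T = bytes_in.shape
        K, P = T // self.P, self.P
        dev = bytes_in.device
        h = self.byte_emb(bytes_in) + self.pos_emb(torch.arange(T, device=dev))
        global_in = h.reshape(B, K, -1)                         # concatenate P byte embeddings per patch
        global_out = self.global_model(global_in, mask=causal_mask(K, dev))   # (B, K, P*d)
        g = self.to_local(global_out.reshape(B, K, P, -1))      # contextualized per-byte slices
        # Local model input: Global output for the patch plus embeddings of the patch's bytes.
        # (The one-patch / one-byte offsets used in the paper are omitted here for brevity.)
        local_in = (g + self.local_byte_emb(bytes_in).reshape(B, K, P, -1)).reshape(B * K, P, -1)
        local_out = self.local_model(local_in, mask=causal_mask(P, dev))      # patches processed in parallel
        return self.head(local_out).reshape(B, T, -1)           # per-byte logits over 256 values
```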

Key Implementation Advantages

MEGABYTE offers several practical advantages over standard transformers for long sequences:

  • Sub-quadratic Self-Attention: By processing sequences in patches, the quadratic self-attention cost ($O(T^2)$ for sequence length $T$) is significantly reduced. The Global model attends over $K = T/P$ patch representations of dimension $P \cdot D_G$, costing $O((T/P)^2)$ in sequence length, while the Local model runs over $T/P$ sequences of length $P$ with dimension $D_L$, costing $O((T/P) \cdot P^2) = O(TP)$. The total attention cost is $O((T/P)^2 + TP)$, which is sub-quadratic for $1 < P < T$; with the optimal patch size $P = T^{1/3}$, both terms scale as $O(T^{4/3})$. The paper's updated analysis, which also accounts for model dimensions, yields $O(T^{3/2})$ or $O(T^{4/3})$ depending on parameter configurations, still significantly better than $O(T^2)$. A back-of-the-envelope comparison of these costs follows this list.
  • Per-Patch Feedforward Layers: The paper notes that feedforward networks are often the dominant cost in large transformers. MEGABYTE places its largest feedforward layers in the Global model, which processes $T/P$ patches rather than $T$ individual tokens. This allows feedforward layers that are $P$ times larger for the same computational cost as a per-token feedforward layer in a standard transformer, enabling larger model capacity for the same FLOPs budget.
  • Parallelism in Decoding: During training, copies of the Local model process all patches in parallel. During generation, bytes are still produced serially, but the large Global model runs only once per patch while the much smaller Local model performs the byte-by-byte steps within each patch, so far fewer serial passes through the large model are needed and overall generation is faster. The paper shows a 1.3B parameter MEGABYTE model generating 8192 bytes 40% faster than a 350M parameter transformer [(Yu et al., 2023 ), Table 6].
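
As a rough sanity check of these scaling claims, the snippet below compares the attention-cost terms for a million-byte sequence. The numbers are illustrative operation counts that ignore model dimensions and constant factors; they are not measurements from the paper.

```python
# Back-of-the-envelope attention cost comparison (unit-less operation counts).
T = 1_000_000            # sequence length in bytes
P = round(T ** (1 / 3))  # patch size near the T^(1/3) optimum -> 100

standard = T ** 2                    # vanilla transformer: O(T^2)
global_cost = (T // P) ** 2          # Global model: O((T/P)^2) over T/P patches
local_cost = (T // P) * P ** 2       # Local model: T/P patches, each O(P^2)
megabyte = global_cost + local_cost  # O((T/P)^2 + T*P) ~ O(T^(4/3))

print(f"standard : {standard:.1e}")              # 1.0e+12
print(f"megabyte : {megabyte:.1e}")              # 2.0e+08
print(f"reduction: {standard / megabyte:.0f}x")  # 5000x

# Per-patch feedforward: because the Global model's FFN runs once per patch rather
# than once per byte, its hidden size can be roughly P times larger at the same
# FLOPs per byte as a standard per-token FFN.
```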

Implementation Considerations and Variations

  • Padding: Careful padding (size $P$ for the Global input, size 1 for the Local input) is essential to maintain the autoregressive property and prevent information leakage from future tokens/patches.
  • Hyperparameters: Key hyperparameters include the patch size $P$, the dimensions of the Global model ($D_G$) and Local model ($D_L$), and the number of layers in each. The paper finds that a larger Global model relative to the Local model is generally optimal [(Yu et al., 2023 ), Table 10] and that performance is robust to the exact patch size within a range [(Yu et al., 2023 ), Table 9].
  • Convolutional Patch Encoder: An optional extension to make the patch embedding more translation-invariant by using causal convolutional layers before chunking into patches.
  • Cross-Patch Attention: The Local model can optionally attend to a few tokens from the previous patch to increase its context slightly with minimal overhead.
  • Strided Inference: To mitigate the empirical observation that prediction quality decreases towards the end of a patch, strided inference performs two forward passes with inputs offset by $P/2$ and combines the predictions from the first half of each patch from the two passes. This improves results at double the inference cost [(Yu et al., 2023 ), Table 8].
  • Image Data Handling: For 2D data like images, a "patch scan" method is proposed where the image is divided into 2D patches ($p \times p$ pixels), and then a raster scan is applied across these patches and within each patch [(Yu et al., 2023 ), Appendix D.1]. This performs better than a simple raster scan of the whole image [(Yu et al., 2023 ), Table 13]; a small sketch of this reordering follows the list.
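
The following snippet is a rough illustration of the patch-scan ordering, assuming a single-channel byte image whose sides are divisible by the patch side $p$; it is not the paper's code.

```python
import numpy as np

def patch_scan(image, p):
    """Reorder an (H, W) byte image so each p x p patch is emitted contiguously,
    with patches themselves visited in raster order (assumes p divides H and W)."""
    H, W = image.shape
    patches = image.reshape(H // p, p, W // p, p)  # split rows and columns into patches
    patches = patches.transpose(0, 2, 1, 3)        # (patch_row, patch_col, within-patch row, within-patch col)
    return patches.reshape(-1)

img = np.arange(16, dtype=np.uint8).reshape(4, 4)  # toy 4x4 "image"
print(patch_scan(img, 2))
# [ 0  1  4  5  2  3  6  7  8  9 12 13 10 11 14 15]
```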

The paper provides pseudocode [(Yu et al., 2023 ), Listing 1] illustrating the core data flow through the prepare_input, Global model, and Local model stages.
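
As a rough approximation of that prepare_input step (not the paper's listing; the released code pads with trainable embeddings and the exact offsets may differ), the Global input is the byte sequence shifted right by one whole patch, and each patch's Local input is shifted right by one byte:

```python
import numpy as np

def prepare_input(byte_ids, P, pad_id=0):
    """Offset inputs so every prediction sees only past information (illustrative)."""
    T = len(byte_ids)
    assert T % P == 0
    # Global input: shift right by a full patch, then split into (num_patches, P),
    # so the Global output for patch k depends only on patches < k.
    global_in = np.concatenate([np.full(P, pad_id), byte_ids[:-P]]).reshape(-1, P)
    # Local input: within each patch, shift right by one byte, so the Local model
    # predicts byte t of a patch from its earlier bytes (plus the Global output).
    patches = byte_ids.reshape(-1, P)
    local_in = np.concatenate([np.full((len(patches), 1), pad_id), patches[:, :-1]], axis=1)
    return global_in, local_in

g_in, l_in = prepare_input(np.arange(1, 9), P=4)
print(g_in)  # [[0 0 0 0]
             #  [1 2 3 4]]
print(l_in)  # [[0 1 2 3]
             #  [0 5 6 7]]
```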

Applications and Performance

MEGABYTE was evaluated across diverse modalities:

  • Language Modeling: Trained on datasets like PG-19, Stories, Books, arXiv, and Code, MEGABYTE consistently outperforms standard byte-level transformers and PerceiverAR under compute-controlled settings [(Yu et al., 2023 ), Table 2]. On a larger scale on PG-19, MEGABYTE achieves competitive results with state-of-the-art subword models while operating purely at the byte level [(Yu et al., 2023 ), Table 3]. This demonstrates the potential for tokenization-free language models.
  • Image Modeling: Evaluated on ImageNet for density estimation at various resolutions, including up to 640x640 pixels (over 1.2 million bytes per image). MEGABYTE matches state-of-the-art on ImageNet 64x64 using half the compute [(Yu et al., 2023 ), Table 4] and significantly outperforms baselines in compute-controlled settings on higher resolutions [(Yu et al., 2023 ), Table 5].
  • Audio Modeling: Applied to raw audio files by directly modeling bytes (256 values). MEGABYTE achieved lower bits-per-byte scores than PerceiverAR and vanilla transformers, demonstrating its effectiveness on sequential, byte-level audio data [(Yu et al., 2023 ), Section 7].

Practical Takeaways for Implementation

  • The two-level architecture with a larger Global model and smaller Local model is crucial for efficiency and performance.
  • Careful data preparation, including padding and reshaping into patches, is necessary.
  • The Patch Embedder can be a simple linear embedding layer per byte followed by reshaping, although convolutional layers can provide benefits, especially for image data.
  • For images, using a patch scan order is recommended over a simple raster scan.
  • While basic inference is efficient, techniques like strided inference can improve generation quality at the cost of speed.
  • MEGABYTE models can be significantly larger than standard transformers for the same training compute budget due to the per-patch feedforward layers, enabling higher capacity.

The paper establishes MEGABYTE as a viable and performant architecture for modeling long sequences directly at the byte level, offering improvements in efficiency and scalability compared to standard transformers and other long-context methods, while also providing a compelling alternative to complex tokenization schemes.

Authors (6)
  1. Lili Yu (28 papers)
  2. Colin Flaherty (1 paper)
  3. Armen Aghajanyan (31 papers)
  4. Luke Zettlemoyer (225 papers)
  5. Mike Lewis (78 papers)
  6. Dániel Simig (2 papers)
Citations (71)