- The paper introduces a multi-stage architecture that decomposes byte sequences into global and local autoregressive decoders for million-length modeling.
- The paper details efficiency gains using gradient checkpointing and stage parallelism, achieving near-linear token generation time at million-byte contexts and improved bits-per-byte (BPB) on benchmarks.
- The paper demonstrates cross-modal applicability by successfully transferring text-pretrained MBLMs to tasks like visual question answering without custom encoders.
Multiscale Byte LLMs: Hierarchical Approaches for Million-length Causal Sequence Modeling
The Multiscale Byte LLM (MBLM) represents a significant advance in the design of byte-level LLMs capable of modeling sequences at million-length scales. MBLM generalizes and extends the principles of hierarchical patch-based modeling, previously exemplified by MegaByte, and proposes an architecture that is both model-agnostic and modality-agnostic. The methodology is distinguished by its ability to efficiently process and autoregressively model byte sequences of up to 5 million bytes on a single GPU, achieved by decomposing the modeling problem into a hierarchy of global and local autoregressive decoders.
Architectural Framework
The core MBLM algorithm consists of an arbitrary number of stacked autoregressive decoder models, each designated as a "stage", that operate on increasingly localized representations of the input bytestream. At each stage, the input byte sequence is segmented into fixed-size patches, and each stage's decoder (a Transformer, Mamba, or any other causal sequence model) models dependencies at progressively finer granularity:
- Global Stages: Capture long-range dependencies between larger patches across the sequence, projecting refined context into successive stages.
- Local Stage: Operates on the finest granularity, predicting individual bytes within each patch autoregressively.
Context is passed between stages by combining the output representations of a prior stage with the input embeddings of the subsequent stage, offset so that no information leaks to future tokens within a patch. This enables effective hierarchical modeling while maintaining strict causality.
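As a concrete illustration, the following PyTorch sketch shows a minimal two-stage variant of this idea. It is not the paper's implementation: `TwoStageByteDecoder`, the patch projection, and the stand-in `global_decoder`/`local_decoder` modules (any causal sequence models with matching input and output shapes) are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class TwoStageByteDecoder(nn.Module):
    """Minimal sketch of a global/local byte hierarchy (not the official MBLM code)."""

    def __init__(self, d_model: int, patch_size: int,
                 global_decoder: nn.Module, local_decoder: nn.Module):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(256, d_model)                 # one embedding per byte value
        self.patch_proj = nn.Linear(patch_size * d_model, d_model)   # bytes of a patch -> one global token
        self.global_decoder = global_decoder                         # causal model over patch tokens
        self.local_decoder = local_decoder                           # causal model over bytes in a patch
        self.lm_head = nn.Linear(d_model, 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len), with seq_len divisible by patch_size
        b, n = byte_ids.shape
        p, d = self.patch_size, self.byte_embed.embedding_dim
        x = self.byte_embed(byte_ids)                                # (b, n, d)
        # Global stage: one token per patch, modeled causally across patches.
        patch_tokens = self.patch_proj(x.view(b, n // p, p * d))     # (b, n/p, d)
        global_out = self.global_decoder(patch_tokens)
        # Shift right by one patch so patch k only receives context from patches < k.
        global_ctx = torch.cat(
            [torch.zeros_like(global_out[:, :1]), global_out[:, :-1]], dim=1)
        # Local stage: causal over the bytes inside each patch, conditioned on the
        # strictly prior global context broadcast to every byte of the patch.
        local_in = x + global_ctx.repeat_interleave(p, dim=1)
        local_out = self.local_decoder(local_in.view(b * (n // p), p, d))
        return self.lm_head(local_out).view(b, n, 256)               # next-byte logits
```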
Implementation and Scaling Considerations
The hierarchical design of MBLM is parameterized by the number of stages, patch sizes, and decoder choices per stage. By decoupling model depth and patch resolution, MBLM enables trade-offs between hardware memory constraints and computational throughput:
- Gradient Checkpointing: Inner-stage batches can be selectively checkpointed and recomputed during backpropagation, reducing memory requirements at the cost of additional computation (see the sketch after this list).
- Stage Parallelism: The architecture allows selective activation of parallel or sequential computation across stages, balancing memory and compute for arbitrarily long bytestreams.
- Model-agnostic Design: Any autoregressive sequence model, provided it maintains input-output shape consistency and causal masking, can be employed as a stage.
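For instance, inner-stage checkpointing might look like the following sketch, which recomputes local-stage activations chunk by chunk during the backward pass; `local_stage_checkpointed` and `chunk_patches` are illustrative names rather than the released package's API.

```python
import torch
from torch.utils.checkpoint import checkpoint

def local_stage_checkpointed(local_decoder, local_in: torch.Tensor,
                             chunk_patches: int) -> torch.Tensor:
    # local_in: (num_patches, patch_size, d_model), one row per patch.
    outputs = []
    for start in range(0, local_in.size(0), chunk_patches):
        chunk = local_in[start:start + chunk_patches]
        # Activations inside the local decoder are recomputed on backward,
        # trading extra compute for a much smaller memory footprint.
        outputs.append(checkpoint(local_decoder, chunk, use_reentrant=False))
    return torch.cat(outputs, dim=0)
```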
Empirically, this yields the ability to train models with up to a 5 million byte context window on a single NVIDIA A100 80GB GPU at full model precision. Notably, the hierarchy only pays off beyond the limits of naive (single-stage) modeling; when the sequence fits in memory, standard Transformers outperform their hierarchical analogues because they incur no input-compression overhead.
Unimodal Language Modeling: On PG19, MBLMs, especially hybrids employing Mamba at the global stage and a Transformer at the local stage, achieve superior bits-per-byte (BPB) metrics compared to homogeneous MegaByte and single-stage baselines. The architecture supports context extrapolation to million-length sequences without significant degradation in word-level perplexity, although analysis reveals that for typical language-modeling data, context beyond 4K bytes offers diminishing returns.
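For reference, bits-per-byte is simply the mean next-byte cross-entropy converted from nats to bits; a minimal computation (not necessarily the paper's evaluation script) looks like this:

```python
import math
import torch
import torch.nn.functional as F

def bits_per_byte(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (batch, seq_len, 256) next-byte predictions; targets: (batch, seq_len) byte values.
    nats = F.cross_entropy(logits.reshape(-1, 256), targets.reshape(-1))
    return nats.item() / math.log(2)   # convert mean nats per byte to bits per byte
```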
Generation Efficiency: Hybrid Mamba-Transformer hierarchies achieve near-linear token generation time at million-scale contexts, in contrast to the quadratic complexity bottlenecks that limit non-hierarchical Transformers. However, due to patching and cross-stage context dependencies, some theoretical inference efficiencies of recurrent models (as in pure Mamba) are not fully realized in hierarchical settings.
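A back-of-envelope comparison, under the simplifying assumption of a linear-time global stage and quadratic attention only within patches, illustrates why the hierarchy avoids the flat-Transformer bottleneck:

```python
# Rough work-unit comparison (an assumption-laden sketch, not the paper's analysis):
# a single quadratic-attention decoder over n bytes versus a two-stage hierarchy
# with patch size `patch`, a linear-time (Mamba-like) global stage, and local
# attention restricted to each patch.
def attention_work_units(n: int, patch: int) -> dict:
    return {
        "flat_transformer": n * n,                              # O(n^2) pairwise interactions
        "hierarchical": (n // patch)                            # linear global stage
                        + (n // patch) * patch * patch,         # quadratic attention per patch
    }

print(attention_work_units(n=1_000_000, patch=1024))
# flat: about 1e12 units; hierarchical: roughly 1e9, hence near-linear behavior in practice.
```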
Multimodal and Visual Question Answering (VQA): MBLM is applied, without modification or custom encoders, to the CLEVR VQA task by serializing both text and raw RGB image bytes into a single bytestream. Using only a language-modeling head and pure next-token prediction, the model matches the performance of dedicated CNN-LSTM baselines with classification heads, even under this minimal-preprocessing setup. Notably, discretized or JPEG-compressed image representations are advantageous: they reduce input entropy, which improves performance on classification-like attributes.
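A hypothetical serialization of one CLEVR-style example shows how little machinery the modality-agnostic interface requires; the exact field order, separators, and use of JPEG here are assumptions, not the paper's specification:

```python
import io
from PIL import Image

def serialize_example(question: str, answer: str, image: Image.Image,
                      jpeg: bool = True) -> bytes:
    """Concatenate question, image, and answer into one bytestream for next-byte prediction."""
    if jpeg:
        buf = io.BytesIO()
        image.convert("RGB").save(buf, format="JPEG")   # compressed pixels lower input entropy
        image_bytes = buf.getvalue()
    else:
        image_bytes = image.convert("RGB").tobytes()    # raw RGB bytes
    # The concatenated bytes are modeled purely autoregressively; no custom encoder is used.
    return question.encode("utf-8") + image_bytes + answer.encode("utf-8")
```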
Fine-tuning and Transfer: Fine-tuning text-pretrained MBLMs for VQA tasks yields positive transfer, contradicting prior findings of negative transfer when moving from text-only pretraining to byte-level vision.
Theoretical and Practical Implications
Theoretical Impact:
- The evidence that hierarchical, model-agnostic byte-level architectures can combine arbitrary decoders (e.g., SSMs, Transformers) and scale effectively to million-length contexts establishes a compelling direction for omnimodal foundation models. Operating in a tokenization-free regime sidesteps the biases and complexity of tokenizer design and facilitates seamless cross-modal modeling.
- The finding that much of an extremely long context is effectively ignored in next-token prediction, even when it is available, highlights modeling and data limitations and suggests a need for benchmark tasks that genuinely require broad context utilization.
Practical Impact:
- MBLM can be readily adapted for domains with diverse binary modalities (e.g., document understanding, software/code, multimedia analysis) due to its strict modality-agnostic interface.
- The implementation is open-sourced and packaged for reproducibility and downstream extension.
- Real-world deployment can leverage MBLM for tasks that require long-term context (summarization, retrieval, multimodal QA), with scaling achievable via built-in support for distributed and parallel computation.
Performance Considerations:
- The hybridization of Mamba and Transformer decoders is empirically validated: use Mamba at the global stage for efficient long-range modeling and a Transformer at the local stage for intra-patch efficiency, especially when patches are short and the SSM backward pass is the computational bottleneck (an illustrative configuration follows this list).
- Scaling to tens of millions of bytes is readily feasible with further memory optimizations (tensor parallelism, model sharding).
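An illustrative stage configuration for the recommended hybrid layout might look as follows; every field name here is hypothetical and does not reflect the released package's API:

```python
# Hypothetical configuration sketch: Mamba at the global stage for long-range
# modeling, a small Transformer at the local stage for short patches, and
# gradient checkpointing enabled on the inner stage.
hybrid_mblm_config = {
    "stages": [
        {"role": "global", "decoder": "mamba",       "d_model": 1024},
        {"role": "local",  "decoder": "transformer", "d_model": 512,
         "n_layers": 8, "n_heads": 8},
    ],
    "patch_size": 16,                # bytes per patch handled by the local stage
    "vocab_size": 256,               # raw byte vocabulary
    "gradient_checkpointing": True,  # trade compute for memory on the inner stage
}
```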
Future Research Directions
- Scalability: Extending MBLMs to billion-scale parameters and integrating parallelism at both tensor and model levels. Design explorations could include automated patch-size selection, multi-resolution context gating, and dynamic chunking for improved efficiency.
- Cross-modal Foundation Models: In-depth evaluation on "needle in a haystack" and sustained-attention tasks that require reasoning over extremely large and heterogeneous bytestreams.
- Inference Optimization: Development of caching and incremental decoding strategies tailored to hierarchical patch structures to realize theoretical inference benefits of component models.
MBLM provides a robust, extensible foundation for the next generation of long-context, modality-agnostic sequence models, bridging advances in hierarchical architectures and state space models with practical scalability and utility across unimodal and multimodal domains.