- The paper introduces H-Net with dynamic chunking that learns content- and context-dependent segmentation, eliminating fixed, manual tokenization.
- It employs a recursive, U-Net-inspired hierarchy with state space models and auxiliary losses to effectively compress and process raw sequence data.
- Empirical results demonstrate that H-Net delivers robust, efficient, and interpretable performance, outperforming traditional tokenized Transformers on diverse modalities.
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling: An Expert Overview
The paper "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" (2507.07955) introduces H-Net, a hierarchical sequence model that eliminates the need for fixed, handcrafted tokenization in LLMs. Instead, H-Net learns content- and context-dependent segmentation strategies directly from data, enabling fully end-to-end modeling from raw bytes. This work addresses longstanding limitations of tokenization, such as poor robustness, lack of interpretability, and suboptimal performance on languages or modalities with ambiguous or weak tokenization cues.
Core Contributions
H-Net is built on several key innovations:
- Dynamic Chunking (DC): A differentiable, data-dependent mechanism for segmenting input sequences into variable-length chunks, learned jointly with the model.
- Hierarchical Architecture: A recursive, U-Net-inspired structure with encoder, main, and decoder networks, allowing multi-stage abstraction and compression.
- State Space Models (SSMs) in Encoders/Decoders: Use of Mamba-2 layers in outer networks for efficient and effective processing of fine-grained data.
- Auxiliary Losses and Optimization Techniques: Introduction of a ratio loss to control compression, learning rate modulation across hierarchy levels, and architectural normalization for stable training.
- Autoregressive Consistency: Careful design to preserve causality and enable efficient autoregressive inference, including dynamic compute allocation per token.
Implementation Details
Model Structure
H-Net consists of S hierarchical stages. Each stage comprises:
- Encoder (E^s): Processes the input at stage s using SSMs (Mamba-2 layers), producing representations x^s.
- Dynamic Chunking Layer: Predicts chunk boundaries via a routing module (cosine similarity between adjacent vectors) and compresses the sequence by selecting boundary-marked vectors.
- Main Network (M): Operates on compressed representations, typically implemented as a Transformer or hybrid (Transformer + SSM) stack.
- Dechunking Layer: Smoothly reconstructs the original sequence length using an exponential moving average (EMA) smoothing module and upsampler, facilitating gradient flow through discrete boundary decisions.
- Decoder (D^s): Combines coarse representations from the main network with fine-grained residuals from the encoder.
The architecture is recursive: the main network at one stage can itself be an H-Net, enabling multi-level abstraction.
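The stage wiring is easiest to see as a module skeleton. Below is a minimal, hypothetical PyTorch sketch of a single stage (class and argument names are illustrative, not the paper's reference implementation); EMA smoothing and the straight-through estimator are deferred to the chunking sketch further below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HNetStage(nn.Module):
    """One H-Net stage: encoder -> dynamic chunking -> main network -> dechunking -> decoder.

    Skeleton only: in the paper the encoder/decoder are Mamba-2 (SSM) stacks and `main`
    is a Transformer/hybrid stack or, recursively, another HNetStage.
    """

    def __init__(self, d_model: int, encoder: nn.Module, main: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.main, self.decoder = encoder, main, decoder
        self.q_proj = nn.Linear(d_model, d_model, bias=False)   # routing projections
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)                       # the paper uses RMSNorm here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, seq_len, d_model); batch size 1 keeps the variable-length chunking simple.
        z = self.encoder(x)

        # Routing: boundary probability from dissimilarity of adjacent positions.
        cos = F.cosine_similarity(self.q_proj(z)[:, 1:], self.k_proj(z)[:, :-1], dim=-1)
        p = torch.cat([torch.ones_like(cos[:, :1]), 0.5 * (1.0 - cos)], dim=1)
        b = p >= 0.5                                             # hard boundary indicator b_t

        # Downsample: keep only boundary positions, run the (heavier) main network on them.
        idx = b[0].nonzero(as_tuple=True)[0]
        y_chunked = self.main(z[:, idx])

        # Dechunk: broadcast each chunk vector to the positions it covers
        # (the real model smooths this with an EMA and a straight-through estimator).
        chunk_id = torch.cumsum(b[0].long(), dim=0) - 1
        y_full = y_chunked[:, chunk_id]

        # Decoder fuses the coarse output with the fine-grained encoder residual.
        return self.decoder(self.norm(y_full) + z)
```

The recursion described above corresponds to passing another stage as `main`, e.g. `HNetStage(d, enc0, HNetStage(d, enc1, transformer_stack, dec1), dec0)` for a two-stage model.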
Dynamic Chunking Mechanism
- Routing Module: For each position t, computes a boundary probability p_t from the cosine similarity between projected representations at positions t and t-1. High dissimilarity signals a likely boundary.
- Downsampler: Selects the vectors where the boundary indicator b_t = 1 to form the compressed sequence.
- Smoothing Module: During decompression, applies an EMA over chunk representations to interpolate between them, ensuring differentiability and stable training.
- Upsampler: Expands compressed vectors back to the original sequence length, weighted by boundary confidence and using a straight-through estimator (STE) so that gradients flow through the discrete boundary decisions (a sketch of the full mechanism follows this list).
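A minimal sketch of the chunking and dechunking path, assuming a simple form of the smoothing module (an EMA over chunk vectors weighted by their boundary probabilities) and of the STE; function names and the exact weighting are illustrative rather than taken from the paper:

```python
import torch
import torch.nn.functional as F


def boundary_probs(z: torch.Tensor, q_proj, k_proj) -> torch.Tensor:
    """Routing module: p_t = 0.5 * (1 - cos(q_t, k_{t-1})); the first position is always a boundary."""
    cos = F.cosine_similarity(q_proj(z)[:, 1:], k_proj(z)[:, :-1], dim=-1)   # (B, L-1)
    p = 0.5 * (1.0 - cos)
    return torch.cat([torch.ones_like(p[:, :1]), p], dim=1)                  # (B, L)


def chunk(z: torch.Tensor, p: torch.Tensor):
    """Downsampler: keep the positions where the hard indicator b_t = 1 (batch size 1 shown)."""
    b = p >= 0.5
    idx = b[0].nonzero(as_tuple=True)[0]
    return z[:, idx], p[:, idx], b       # compressed vectors, their probabilities, boundary mask


def dechunk(z_c: torch.Tensor, p_c: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Smoothing + upsampling back to the original sequence length.

    The EMA interpolates between consecutive chunk vectors so that small changes in p change
    the output smoothly; the straight-through estimator (STE) makes the forward pass use the
    hard 0/1 confidence while gradients flow through the soft probability.
    """
    # EMA over chunks: z_bar_k = p_k * z_k + (1 - p_k) * z_bar_{k-1}
    smoothed, prev = [], torch.zeros_like(z_c[:, 0])
    for k in range(z_c.shape[1]):
        prev = p_c[:, k:k+1] * z_c[:, k] + (1.0 - p_c[:, k:k+1]) * prev
        smoothed.append(prev)
    smoothed = torch.stack(smoothed, dim=1)                                   # (B, n_chunks, D)

    # STE confidence weight: forward value round(p), backward gradient d/dp.
    c = p_c + (torch.round(p_c) - p_c).detach()

    # Upsample: each position re-uses the (smoothed, confidence-weighted) vector of its chunk.
    chunk_id = torch.cumsum(b[0].long(), dim=0) - 1
    return (c.unsqueeze(-1) * smoothed)[:, chunk_id]
```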
Training and Optimization
- Ratio Loss: Guides the model toward a target compression ratio, preventing degenerate solutions (e.g., all or no boundaries); a sketch follows this list.
- Learning Rate Modulation: Adjusts learning rates per stage based on sequence length and hidden dimension, accelerating convergence in outer (higher-resolution) stages.
- Normalization: RMSNorm layers at the end of each network component balance residual and processed features, which is critical for effective information propagation across the hierarchy.
- Autoregressive Training: All components are designed to maintain causality, with boundary decisions and smoothing depending only on current and past information.
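A minimal sketch of a ratio-style auxiliary loss follows; it uses a simple quadratic penalty that drives the mean boundary probability toward 1/N for a target compression ratio N. The paper's exact formulation differs, so treat this purely as an illustration of the idea:

```python
import torch


def ratio_loss(p: torch.Tensor, N: float = 6.0) -> torch.Tensor:
    """Auxiliary ratio loss (illustrative): keep roughly one boundary every N positions.

    p: boundary probabilities of shape (B, L). Without a term like this the router can
    collapse to a degenerate solution: marking every position as a boundary (no
    compression) or marking none at all (nothing reaches the main network).
    """
    return (p.mean() - 1.0 / N) ** 2
```

In training, such a term would be added to the language-modeling objective with a small weight, e.g. `loss = lm_loss + alpha * ratio_loss(p)`.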
Inference
H-Net supports efficient autoregressive generation. At each step, the model dynamically decides whether to route the current position through the main network, enabling adaptive compute allocation. This is loosely analogous to speculative decoding, where a lightweight draft model runs on every token and the heavier model is invoked only intermittently; here, the outer encoder and decoder process every byte while the main network is invoked only at chunk boundaries.
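The sketch below illustrates this control flow with hypothetical step functions (`encoder_step`, `main_step`, `decoder_step`, and `is_boundary` are placeholders, not the paper's API): every new byte passes through the lightweight outer networks, while the main network is stepped only when a chunk boundary fires.

```python
import torch


@torch.no_grad()
def generate(encoder_step, main_step, decoder_step, is_boundary, prompt, max_new: int = 64):
    """Adaptive-compute autoregressive decoding (schematic, hypothetical interfaces)."""
    seq = list(prompt)
    coarse_state = None                        # latest output of the main network
    for _ in range(max_new):
        h = encoder_step(seq)                  # cheap fine-grained state for the current byte
        if coarse_state is None or is_boundary(h):
            coarse_state = main_step(h)        # heavy compute only at chunk boundaries
        logits = decoder_step(h, coarse_state) # decoder always runs, fusing fine and coarse
        seq.append(int(torch.argmax(logits)))
    return seq
```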
Empirical Results
Language Modeling
- Performance: H-Net with a single stage of dynamic chunking matches or exceeds the performance of strong BPE-tokenized Transformers at equivalent compute and data budgets. Two-stage H-Nets further improve scaling, outperforming tokenized Transformers of twice the size.
- Robustness: Byte-level H-Nets demonstrate significantly higher robustness to textual perturbations (e.g., case changes, whitespace removal) compared to tokenized baselines.
- Interpretability: Visualizations show that H-Net learns semantically meaningful chunk boundaries (e.g., word or phrase boundaries) without explicit supervision.
- Multilingual and Multimodal: H-Net's advantages are amplified in languages without clear tokenization cues (e.g., Chinese), code, and DNA sequences, achieving up to 4x data efficiency improvements over baselines.
Ablations
- Smoothing Module: Essential for stable training and effective compression; removing it leads to unstable chunking and degraded performance.
- Routing Module: Cosine similarity-based routing outperforms direct probability prediction, yielding more interpretable and semantically aligned boundaries.
- Encoder/Decoder Architecture: SSMs (Mamba-2) in encoders/decoders are superior to Transformers, even at coarser resolutions, due to their inductive bias for compression.
- Main Network: Hybrid architectures (interleaving SSMs and Transformers) in the main network can further improve scaling, consistent with findings in isotropic models.
Comparison to Mixture-of-Experts (MoE)
H-Net's dynamic chunking provides performance gains beyond what can be achieved by generic sparsity (e.g., MoE), as chunking is semantically informed and jointly optimized with the model.
Practical Implications
Deployment Considerations
- Training Efficiency: H-Net introduces additional complexity in training due to dynamic sequence lengths and variable compute per token. Specialized kernels (e.g., FlashAttention-2, Mamba-2) and careful memory management are required.
- Inference: Dynamic compute allocation per token can complicate batched inference, as different tokens may require different processing paths. However, this also enables efficient use of resources by focusing compute on semantically important regions.
- Scalability: The architecture is validated up to 1.3B parameter models; further scaling may require additional engineering for stability and efficiency.
Applications
- Tokenizer-Free Foundation Models: H-Net enables the development of models that operate directly on raw data, reducing pre-processing and increasing robustness across languages and modalities.
- Multilingual and Multimodal Modeling: Particularly advantageous for languages without clear tokenization heuristics and for non-textual sequences (e.g., code, DNA).
- Adaptive Compute: The dynamic chunking mechanism allows models to allocate more compute to complex or information-rich regions, potentially improving reasoning and efficiency.
Theoretical and Future Directions
- Deeper Hierarchies: The recursive nature of H-Net allows for arbitrarily deep hierarchies, potentially enabling models to discover higher-order abstractions (e.g., sentences, paragraphs) directly from data.
- Integration with Other Sparsity Methods: H-Net's dynamic chunking is orthogonal to MoE and other conditional computation techniques, suggesting potential for further efficiency gains.
- Scaling Laws: Formal analysis of scaling behavior with respect to model size, data, and hierarchy depth remains an open area.
- Distillation and Transfer: The paper demonstrates that H-Net can be distilled from pretrained tokenized models, offering a practical path for transitioning existing models to tokenizer-free architectures.
Strong Numerical Results and Claims
- Byte-level H-Net with one stage matches or exceeds BPE-tokenized Transformer performance at equivalent compute/data budgets.
- Two-stage H-Net matches the performance of a tokenized Transformer of twice its size.
- On languages/modalities with weak tokenization cues (e.g., Chinese, code, DNA), H-Net achieves up to 4x improvement in data efficiency.
- H-Net models exhibit significantly higher robustness to adversarial text perturbations compared to tokenized baselines.
Implications and Outlook
H-Net represents a significant step toward fully end-to-end, tokenizer-free sequence modeling. By learning segmentation strategies directly from data, it overcomes the limitations of fixed tokenization, improves robustness, and generalizes across languages and modalities. The dynamic chunking mechanism, combined with hierarchical processing and SSM-based encoders/decoders, provides a flexible and efficient framework for future foundation models. Further research into deeper hierarchies, integration with other sparsity techniques, and large-scale deployment will be critical for realizing the full potential of this approach in practical AI systems.