Dynamic Chunking in H-Net: Hierarchical Modeling
- Dynamic chunking is a data-driven method that learns hierarchical, context-sensitive segmentation boundaries from raw bytes or speech without relying on fixed tokenizers.
- It employs neural routers, latent variable inference, and bidirectional recurrent units to adaptively segment inputs, aligning discovered boundaries with linguistic structures.
- The approach improves tokenization-free modeling by enhancing data efficiency and robustness, yielding better compression and lower error rates in morphologically-rich languages and streaming ASR.
Dynamic chunking in H-Net encompasses an end-to-end, data-driven mechanism for learning hierarchical, content- and context-sensitive segmentation boundaries within sequences, enabling scalable and linguistically-aligned modeling without recourse to external tokenizers. This approach subsumes both byte-level and streaming modalities, and is central to recent advances in language modeling, speech recognition, and structured sequence modeling in morphologically-rich languages and under ambiguous tokenization. The architecture, training procedures, and boundary induction methods combine latent variable inference, neural routers, and hierarchical compression, yielding models that discover segments aligned with linguistic structure, enhance data efficiency, and maintain competitive computational costs (Zakershahrak et al., 7 Aug 2025, Hwang et al., 10 Jul 2025, Wang et al., 12 Nov 2025).
1. Architectural Foundations of Dynamic Chunking in H-Net
Dynamic chunking is embedded within the H-Net family of hierarchical models, comprising multilevel encoders, router-controlled chunking layers, and context-mixing main networks. The architecture accepts raw byte sequences or frame-level speech inputs and processes them through stacked levels, each introducing data-dependent chunk boundaries.
At each level $\ell$ ($\ell = 1, \dots, L$ for H-Net++), byte or frame embeddings $z^{(\ell)}_t$ are routed through bidirectional recurrent units (BiGRU or Mamba-2 SSM), which produce hidden states $h^{(\ell)}_t$. A boundary predictor then computes gates $g^{(\ell)}_t \in \{0, 1\}$, where $g^{(\ell)}_t = 1$ marks chunk boundaries. Contiguous spans between boundaries are grouped, and their embeddings aggregated (typically via mean pooling) into higher-level representations $z^{(\ell+1)}$. The resulting sequence length decreases with each level ($T_{\ell+1} < T_{\ell}$), yielding increasingly abstract and compressed features. For streaming applications (e.g., Tibetan ASR), chunk widths and strides are adaptively set by a gating network taking as input the encoder state and a global control vector (Wang et al., 12 Nov 2025).
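A minimal PyTorch sketch of a single chunking level may make the route-gate-pool step concrete; the module layout, the hard 0.5 threshold, and the batch-of-one framing are illustrative assumptions rather than details from the papers.

```python
import torch
import torch.nn as nn

class ChunkingLevel(nn.Module):
    """One level: contextualize, predict boundaries, mean-pool chunks (sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        # Two directions of d_model // 2 concatenate back to d_model.
        self.bigru = nn.GRU(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.boundary = nn.Linear(d_model, 1)  # computes w^T h_t + b

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (1, T, d_model) level-l representations
        h, _ = self.bigru(z)
        gates = torch.sigmoid(self.boundary(h)).squeeze(-1) > 0.5  # hard gates (inference-time)
        chunk_id = gates.long().cumsum(dim=1)[0]        # (T,) chunk index per position
        n_chunks = int(chunk_id.max()) + 1
        pooled = torch.zeros(1, n_chunks, h.size(-1))
        pooled.index_add_(1, chunk_id, h)               # sum hidden states within each chunk
        counts = torch.bincount(chunk_id, minlength=n_chunks).clamp(min=1)
        return pooled / counts[None, :, None]           # mean-pooled z^(l+1), n_chunks <= T
```

During training, the hard threshold would be replaced by the straight-through Gumbel sampling described in Section 2 so that gradients reach the boundary predictor.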
A context-mixer (lightweight Transformer or multihead attention) operates on final-level chunk embeddings to propagate cross-segment information. This step enables non-local, hierarchical contextualization prior to decoding or further processing. In H-Net++, document-level consistency is further captured by introducing global latent variables drawn from an amortized variational hyper-prior (Zakershahrak et al., 7 Aug 2025).
2. Latent Variable Segmentation and Router Mechanisms
Chunk boundaries are induced via latent discrete variables governed by neural routers. In H-Net++, the segmentation model employs a per-position latent gate $g^{(\ell)}_t \in \{0, 1\}$, parameterized by a BiGRU and MLP, where the boundary posterior is:

$$\pi^{(\ell)}_t = q\big(g^{(\ell)}_t = 1 \mid h^{(\ell)}_t\big) = \sigma\big(w_\ell^\top h^{(\ell)}_t + b_\ell\big).$$
Sampling is performed via straight-through Gumbel-Softmax, ensuring differentiability for end-to-end optimization (with annealed temperature). The prior over boundaries is a Bernoulli distribution with a fixed boundary rate.
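A compact sketch of the sampler, assuming the standard binary-concrete (Gumbel-sigmoid) relaxation; H-Net++'s exact relaxation and temperature schedule may differ.

```python
import torch

def straight_through_gumbel(pi: torch.Tensor, tau: float) -> torch.Tensor:
    """Hard gates g_t in {0,1} on the forward pass, soft gradients on the backward pass."""
    pi = pi.clamp(1e-6, 1 - 1e-6)
    logits = torch.log(pi) - torch.log1p(-pi)      # logit of the boundary posterior
    u = torch.rand_like(pi)
    noise = torch.log(u) - torch.log1p(-u)         # logistic noise (difference of Gumbels)
    soft = torch.sigmoid((logits + noise) / tau)   # binary-concrete relaxation
    hard = (soft > 0.5).float()
    return hard - soft.detach() + soft             # straight-through estimator
```

Annealing `tau` toward zero over training sharpens the relaxation so sampled gates approach exact Bernoulli draws.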
In the original H-Net formulation, boundary scoring is operationalized as a cosine-similarity test between adjacent positions:

$$p_t = \frac{1}{2}\left(1 - \frac{q_t^\top k_{t-1}}{\lVert q_t \rVert \, \lVert k_{t-1} \rVert}\right), \qquad q_t = W_q \hat{x}_t, \quad k_t = W_k \hat{x}_t,$$

where $W_q$ and $W_k$ are linear projections of the encoder embeddings: a position that differs sharply from its predecessor receives a high boundary score. This approach generalizes across text, code, and biological sequences (Hwang et al., 10 Jul 2025).
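The scoring rule fits in a few lines; the PyTorch framing and tensor shapes here are our assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_probs(x: torch.Tensor, W_q: torch.Tensor, W_k: torch.Tensor) -> torch.Tensor:
    # x: (T, d) encoder embeddings; W_q, W_k: (d, d) projection matrices
    q, k = x @ W_q, x @ W_k
    cos = F.cosine_similarity(q[1:], k[:-1], dim=-1)  # each position vs. its predecessor
    p = 0.5 * (1.0 - cos)                             # dissimilar neighbors -> high score
    return torch.cat([torch.ones(1), p])              # the first position always opens a chunk
```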
3. Hierarchical Compression and Decompression
The dynamic chunking scheme enables multi-level hierarchical compression, with each stage reducing resolution according to learned boundaries. At stage $s$, the downsampler emits chunked representations $\hat{z}^{(s)}$ and associated boundary scores $p^{(s)}$, while the upsampler restores full resolution via weighted interpolation and a straight-through estimator, followed by residual addition (Hwang et al., 10 Jul 2025). The hierarchical process discovers structure at multiple scales, aligning with morphemes, words, phrases, or motifs, depending on the modality.
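A deliberately simplified sketch of the upsampling path under these definitions; the actual smoothing and confidence mechanics in (Hwang et al., 10 Jul 2025) are more involved, and all names here are ours.

```python
import torch

def upsample(chunk_emb: torch.Tensor, chunk_id: torch.Tensor,
             scores: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    # chunk_emb: (C, d) stage-s chunk vectors; chunk_id: (T,) chunk index per position
    # scores: (C,) boundary confidence per chunk; residual: (T, d) pre-chunking states
    full = chunk_emb[chunk_id]                   # broadcast each chunk over its span
    w = scores[chunk_id].unsqueeze(-1)           # per-position confidence weight
    w_ste = torch.ones_like(w) + w - w.detach()  # forward multiplier is exactly 1;
                                                 # gradients still reach the router scores
    return w_ste * full + residual               # weighted interpolation + residual addition
```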
Pseudocode for H-Net++ is as follows:
```
for step in 1..500K:
    L_seq ← sample_sequence_length(step)
    x[1:L_seq] ← sample_bytes_from_corpus(L_seq)
    e[1:L_seq] ← embed_bytes_with_ZWNJ(x)
    z^(1)_t ← e_t
    for ℓ in 1..L:
        h^(ℓ) ← BiGRU^ℓ(z^(ℓ))
        π^(ℓ)_t ← sigmoid(w^ℓᵀ h^(ℓ)_t + b^ℓ)
        g^(ℓ) ← straight_through_gumbel(π^(ℓ), τ(step))
        chunks ← segment_by_gates(g^(ℓ))
        z^(ℓ+1) ← mean_pool(h^(ℓ), chunks)
    A ← LayerNorm(MHA(z^(L)) + z^(L))        # context mixer: attention sublayer
    Z_star ← LayerNorm(FFN(A) + A)           # context mixer: feed-forward sublayer
    μ, σ ← MLP_φ(mean_over_time(Z_star))
    ξ^(i) ~ Normal(μ, diag(σ²))              # i = 1, 2: global latent variables
    logits ← Decoder(Z_star, ξ^(1), ξ^(2))
    LM_loss ← negative_log_likelihood(logits, x shifted)
    # additional losses: KL, morph alignment, auxiliary
    backpropagate_and_update(L_total)
```
4. Training Objectives and Curriculum Learning
Dynamic chunking models optimize composite losses that balance autoregressive language modeling, latent regularization, boundary alignment, and auxiliary criteria:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}} + \lambda_{\text{morph}} \mathcal{L}_{\text{morph}} + \lambda_{\text{aux}} \mathcal{L}_{\text{aux}},$$

where $\mathcal{L}_{\text{morph}}$ enforces alignment with rule-based morpheme boundaries (where available) and $\mathcal{L}_{\text{aux}}$ encourages load balancing and chunk-length regularity (Zakershahrak et al., 7 Aug 2025).
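As a hedged sketch, the Bernoulli KL regularizer and the weighted sum might be assembled as below; the weight values and the prior rate are unspecified hyperparameters, not figures from the papers.

```python
import torch

def bernoulli_kl(pi: torch.Tensor, prior: float) -> torch.Tensor:
    # KL( Bernoulli(pi_t) || Bernoulli(prior) ), averaged over positions
    pi = pi.clamp(1e-6, 1 - 1e-6)
    return (pi * torch.log(pi / prior)
            + (1 - pi) * torch.log((1 - pi) / (1 - prior))).mean()

def total_loss(lm: torch.Tensor, pi: torch.Tensor, morph: torch.Tensor, aux: torch.Tensor,
               prior: float = 0.2, lam_kl: float = 1.0,
               lam_morph: float = 1.0, lam_aux: float = 1.0) -> torch.Tensor:
    # Composite objective: LM likelihood plus latent, alignment, and load terms.
    return lm + lam_kl * bernoulli_kl(pi, prior) + lam_morph * morph + lam_aux * aux
```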
Training proceeds via a staged curriculum: initial fixed-length batches, then gradually longer and more heterogeneously sampled sequence lengths, up to full-range inputs. Optimization uses AdamW with stage-specific hyperparameters and learning-rate schedules; ratio-loss and KL regularizers are weighted for stable scaling (Hwang et al., 10 Jul 2025, Zakershahrak et al., 7 Aug 2025).
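A toy version of the length curriculum, matching the `sample_sequence_length` call in the pseudocode above; the stage boundaries and length ranges are illustrative placeholders, not the papers' actual schedule.

```python
import random

STAGES = [  # (first_step, min_len, max_len) -- illustrative values only
    (0,       512,   512),   # warm-up: fixed-length batches
    (100_000, 512,  2048),   # middle: heterogeneous lengths
    (300_000, 512,  8192),   # late: full-range inputs
]

def sample_sequence_length(step: int) -> int:
    # Pick the latest stage whose starting step has been reached.
    _, lo, hi = max(s for s in STAGES if s[0] <= step)
    return random.randint(lo, hi)
```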
5. Handling of Orthographic, Acoustic, and Modal Artifacts
Dynamic chunking in H-Net++ is specialized for orthographic artifacts, notably the Persian ZWNJ (U+200C). The system introduces a dedicated embedding offset when this byte appears:

$$e_t = \mathrm{Embed}(x_t) + \mathbb{1}[x_t = \text{ZWNJ}] \cdot \delta_{\text{ZWNJ}}.$$

This design allows the router to learn ZWNJ-specific segmentation, preserving linguistic distinctions that would otherwise be lost to generic byte-level modeling. The approach has yielded substantial robustness to corrupted or non-standard input forms (53% improvement in robustness to ZWNJ corruption; 73.8% F1 on gold morphological boundaries) (Zakershahrak et al., 7 Aug 2025).
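A sketch of the ZWNJ-aware byte embedding implied by this mechanism; the offset parameterization and the precomputed mask are our assumptions.

```python
import torch
import torch.nn as nn

# U+200C encodes to the three UTF-8 bytes 0xE2 0x80 0x8C; the mask below
# is assumed to mark these positions during preprocessing.
ZWNJ_BYTES = "\u200c".encode("utf-8")

class ZWNJByteEmbedding(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)                # one row per byte value
        self.zwnj_offset = nn.Parameter(torch.zeros(d_model))  # learned ZWNJ-specific offset

    def forward(self, byte_ids: torch.Tensor, is_zwnj: torch.Tensor) -> torch.Tensor:
        # byte_ids: (T,) byte values; is_zwnj: (T,) float mask, 1.0 on ZWNJ bytes
        return self.embed(byte_ids) + is_zwnj.unsqueeze(-1) * self.zwnj_offset
```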
For audio streams, chunk widths and strides are determined by the network state, enabling adaptation to variable speaking rates. Context-aware controllers fuse and project encoder summaries, outputting adaptive segmentations with look-left context for cross-chunk attention; the best-performing look-left width in frames is reported in (Wang et al., 12 Nov 2025).
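A hedged sketch of such a controller: a small gating network maps the current encoder summary and a global control vector to a chunk width chosen from a discrete set. The class name, width set, and unbatched framing are assumptions, not the architecture from (Wang et al., 12 Nov 2025).

```python
import torch
import torch.nn as nn

class ChunkController(nn.Module):
    def __init__(self, d_enc: int, d_ctl: int, widths=(8, 16, 32, 64)):
        super().__init__()
        self.widths = torch.tensor(widths)  # candidate chunk widths, in frames
        self.gate = nn.Sequential(
            nn.Linear(d_enc + d_ctl, d_enc), nn.Tanh(),
            nn.Linear(d_enc, len(widths)),
        )

    def forward(self, enc_summary: torch.Tensor, control: torch.Tensor) -> int:
        # enc_summary: (d_enc,) local encoder state; control: (d_ctl,) global vector
        logits = self.gate(torch.cat([enc_summary, control], dim=-1))
        return int(self.widths[logits.argmax(dim=-1)])  # width for the next window
```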
6. Empirical Outcomes and Practical Impact
Across modalities, dynamic chunking in H-Net architectures has demonstrated:
- Tokenization-free modeling efficiency: Byte-level sequence models with dynamic chunking outperform strong BPE-based Transformers of equal or greater size, especially in morphologically-rich languages, code, and genomics. H-Net++ achieves 0.159 BPB reduction (12% better compression) and 5.4pp boost on ParsGLUE (Zakershahrak et al., 7 Aug 2025). In English and code, hierarchy-enhanced chunking yields up to 4x improvement in data efficiency (Hwang et al., 10 Jul 2025).
- Alignment with linguistic structure: Without explicit supervision, boundary induction aligns with morpheme, word, or motif boundaries, outperforming heuristic tokenization in robustness and semantics.
- Streaming ASR improvements: In Tibetan ASR, context-aware dynamic chunking reduces word error rate by 48.15% over fixed-chunk methods and delivers lower latency, with near-global decoding accuracy (Wang et al., 12 Nov 2025).
- Robustness to perturbations: Dynamic chunking models retain segmentation behaviors under adversarial and corrupted input, resisting errors that degrade fixed-token systems (Hwang et al., 10 Jul 2025, Zakershahrak et al., 7 Aug 2025).
7. Limitations, Interpretations, and Broader Implications
Dynamic chunking in H-Net circumvents the need for handcrafted tokenization and exposes underlying hierarchical units that reflect true linguistic or structural boundaries, a property verified across text, speech, code, and biological data. The use of latent gates and neural routers for boundary selection introduces additional regularization burdens and hyperparameter dependencies (e.g., Gumbel temperature, KL weight). Performance scales with the number of chunking stages and the nature of the input modality; multi-stage models consistently outperform single-stage designs on non-English and structurally-opaque sequences.
A plausible implication is that dynamic chunking architectures generalize across domains, offering principled alternatives to pre- and post-processing pipelines by learning segmentation from first principles. However, the full integration of external linguistic resources (e.g., lexicons, rule-based supervision) remains an active research direction for further bolstering alignment and interpretability (Wang et al., 12 Nov 2025). Empirical scaling suggests continued gains with increased depth and data.
References:
- H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages (Zakershahrak et al., 7 Aug 2025)
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling (Hwang et al., 10 Jul 2025)
- Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition (Wang et al., 12 Nov 2025)