Zonkey: Diffusion-Based Hierarchical LM
- Zonkey is a hierarchical, fully differentiable diffusion-based language model that replaces rigid tokenization with adaptive, learned segmentation for variable-length text representation.
- It utilizes probabilistic attention and a hierarchical denoising diffusion process to transform raw character inputs into coherent document-level outputs.
- Empirical results demonstrate emergent linguistic boundaries, robustness to noise, and a scalable framework for future advancements in language modeling.
Zonkey is a hierarchical, fully differentiable diffusion-based LLM designed to address the limitations of conventional LLMs stemming from rigid, non-differentiable tokenization schemes. Zonkey integrates adaptive segmentation, probabilistic attention, hierarchical compression, and latent denoising, enabling end-to-end training from raw character sequences up to document-level abstractions. The system is characterized by a learnable pipeline that forgoes discrete token boundaries, instead utilizing dynamically emergent linguistic splits and continuous representations at every stage of text synthesis and comprehension (Rozental, 29 Jan 2026).
1. Hierarchical Diffusion Framework
Zonkey arranges its architecture into a stack of hierarchical levels, with each level responsible for progressively abstracting text structure:
- Character Embeddings (Level 0): Raw input documents are embedded at the character level into vector sequences.
- Overlapping Segments: The Segment Splitter generates overlapping segments, bounded by a per-level maximum length, at each level.
- Latent Compression: Each segment is compressed by prepending learnable CLS vectors and encoding the result with a Transformer encoder that uses Probabilistic Attention. This yields a continuous latent vector.
- Diffusion/Noising: Latents are noised according to a variance schedule, mixing the scaled clean signal with isotropic Gaussian noise.
- Denoising: The Denoising Diffusion Mixed Model (DDMM) restores the latents, enabling both small (“clean”) and large (“dirty”) step denoising.
- Stitcher: Overlapping outputs are merged into a global sequence, ready for the next hierarchical level or generation output.
- Inference: Text synthesis starts from Gaussian noise at the highest level, followed by iterative denoising and hierarchical unrolling to reconstruct raw characters.
This organization allows Zonkey to learn and reconstruct complex textual structures through latent signal manipulation and reconstruction, rather than rigid token-based generation.
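The noising step in the pipeline above follows the standard forward-diffusion mixing rule; a minimal numpy sketch (the schedule value `alpha_bar` is illustrative, not taken from the paper):

```python
import numpy as np

def noise_latent(z0, alpha_bar, rng):
    """Forward diffusion: mix the scaled clean latent with isotropic
    Gaussian noise. alpha_bar in [0, 1] controls how much signal survives."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))        # batch of clean segment latents
z_half = noise_latent(z0, 0.5, rng)     # half signal, half noise
z_clean = noise_latent(z0, 1.0, rng)    # alpha_bar = 1 leaves z0 intact
```

Denoising inverts this process step by step; the DDMM described in section 4 mixes small and large inversion steps.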
2. Differentiable Tokenization via the Segment Splitter
The Segment Splitter replaces traditional discrete tokenizers (e.g., Byte Pair Encoding, WordPiece) with a learned, probabilistic decision process. For each input position, it predicts a beginning-of-sequence (BOS) probability. Segments are not hard-split; instead, for each candidate segment, the model computes an existence probability for every downstream token, yielding soft, overlapping segments. Training losses (reconstruction, masked language modeling, offset regression) are weighted by each token's normalized existence share across the overlapping segments, enabling fully gradient-based optimization of boundary placement. Empirically, the Splitter yields word-like splits at spaces (level 0) and sentence-like splits at periods (level 1) without explicit boundary supervision.
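As a concrete, hypothetical reading of the soft-splitting rule: suppose `p_bos[i]` is the learned BOS probability at position `i`, a token exists in the segment starting at `s` with the probability that no new boundary fires after `s`, and per-position loss weights are the normalized existence share across overlapping segments. The specific survival-product form below is an assumption for illustration:

```python
import numpy as np

# p_bos[i]: learned probability that position i begins a new segment
# (values here are invented for illustration).
p_bos = np.array([1.0, 0.1, 0.8, 0.2, 0.9])

def existence(p_bos, s):
    """Existence probability of each downstream token in the segment
    starting at s: the segment fires with prob p_bos[s], and token j
    survives while no new boundary occurs in (s, j]."""
    n = len(p_bos)
    exist = np.zeros(n)
    exist[s] = p_bos[s]
    survive = 1.0
    for j in range(s + 1, n):
        survive *= 1.0 - p_bos[j]
        exist[j] = p_bos[s] * survive
    return exist

# One candidate segment per start position; loss weights are each
# segment's normalized share of a token's total existence mass.
E = np.stack([existence(p_bos, s) for s in range(len(p_bos))])
share = E / np.clip(E.sum(axis=0), 1e-9, None)
```

Because every quantity is a product of probabilities rather than a hard cut, gradients flow through the boundary decisions, which is the point of the Splitter.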
| Classic Tokenizer | Zonkey Segment Splitter | Emergent Splitting |
|---|---|---|
| Non-differentiable, fixed | Differentiable, adaptive | Linguistic (spaces, periods) |
3. Probabilistic Attention and Soft-Length Modulation
To facilitate unbounded, variable, and overlapping segment lengths, each position receives an existence probability. Zonkey’s self-attention modulates the weight each query assigns to a key by that key’s existence probability, down-weighting contributions from tokens unlikely to exist without breaking backpropagation. This supports flexible segment boundaries and dynamic content length; in bidirectional encoders, only future tokens receive attention modulation.
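One simple way to realize this modulation (an assumption, not the paper's exact formula) is to add `log p_exist` to the attention logits, which is equivalent to multiplying each key's softmax weight by its existence probability and renormalizing:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prob_attention(q, k, v, p_exist, eps=1e-9):
    """Scaled dot-product attention with keys down-weighted by their
    existence probabilities; fully differentiable in p_exist."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(p_exist + eps)
    return softmax(logits) @ v

q = np.zeros((1, 2))
k = np.zeros((2, 2))
v = np.array([[1.0, 0.0], [0.0, 1.0]])
out = prob_attention(q, k, v, p_exist=np.array([1.0, 1e-12]))
# the second key barely exists, so the output collapses onto v[0]
```

A key with existence probability near zero contributes almost nothing, yet its gradient path remains intact.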
Variable-length outputs are realized by halting decoding where the cumulative existence probability falls below a threshold, removing the need for explicit EOS tokens.
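The halting rule can be sketched as follows, under the assumption that "cumulative existence" means a running product of per-position existence probabilities (the paper's exact definition is not reproduced here):

```python
import numpy as np

def soft_length(p_exist, tau=0.5):
    """Decode until the cumulative existence probability drops below
    the threshold tau; no explicit EOS token is required."""
    cum = np.cumprod(p_exist)
    below = np.nonzero(cum < tau)[0]
    return int(below[0]) if below.size else len(p_exist)

# existence stays high for three positions, then collapses
length = soft_length(np.array([0.99, 0.98, 0.9, 0.5, 0.1]))
```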
4. Hierarchical Compression and Denoising Diffusion Mixed Model (DDMM)
Each level’s Compressor processes segments up to a per-level maximum length (with effective length governed by existence probabilities), prepends learnable CLS tokens, and outputs hidden states; the CLS outputs are concatenated and normalized into the segment latent. Noise is added to this latent and removed via the DDMM, which fuses DDPM stability with DDIM speed. Training leverages mixed-step objectives: a clean latent is heavily noised, partially denoised to an intermediate step, noised again, and denoised once more, with the loss minimizing the cosine error between the final prediction and its projection onto the clean latent.
This encourages the model to interpolate between cautious, incremental denoising and aggressive, large-step correction.
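A hedged numpy sketch of the cosine objective: the final prediction is compared against its projection onto the clean latent, so only directional disagreement is penalized (the projection form is an interpretation of the description above, not the paper's exact loss):

```python
import numpy as np

def cosine_error(pred, target, eps=1e-9):
    """1 - cos(pred, proj(pred onto target)); near zero when pred points
    along the clean latent, regardless of scale."""
    proj = (pred @ target) / max(target @ target, eps) * target
    num = pred @ proj
    den = np.linalg.norm(pred) * np.linalg.norm(proj) + eps
    return 1.0 - num / den

z_clean = np.array([1.0, 2.0, 3.0])
assert cosine_error(2.0 * z_clean, z_clean) < 1e-6   # scale-invariant
```

Because the error depends only on direction, small "clean" steps and large "dirty" steps can share one objective.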
5. Overlap Invariance via the Stitcher
Post-denoising, overlapping segments are resolved by the Stitcher module in three differentiable phases:
- Soft Offset Inference: For each pair of overlapping segments, determine continuous offsets via drops in the cumulative existence trace and cosine similarity of segment boundaries.
- Weighted Accumulation: Global sequence reconstruction combines all segments’ contributions to each position, weighted by existence probabilities.
- Learned Refinement: Cross-attention transformer smooths the overlapped output, enforcing coherence and resolving local inconsistencies.
Auxiliary losses (MSE on offsets, cosine similarity on overlaps) improve split prediction and overlap resolution, facilitating sharp yet border-aware segment boundaries.
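The weighted-accumulation phase can be sketched as follows (offsets are taken as given here, whereas the Stitcher infers them; names and shapes are illustrative):

```python
import numpy as np

def stitch(segments, offsets, weights, total_len, dim):
    """Merge overlapping segment outputs into one global sequence,
    weighting each contribution by its per-position existence prob."""
    acc = np.zeros((total_len, dim))
    norm = np.zeros(total_len)
    for seg, off, w in zip(segments, offsets, weights):
        for j in range(seg.shape[0]):
            acc[off + j] += w[j] * seg[j]
            norm[off + j] += w[j]
    return acc / np.clip(norm, 1e-9, None)[:, None]

seg_a = np.ones((3, 2))          # covers positions 0..2
seg_b = 3.0 * np.ones((3, 2))    # covers positions 2..4
out = stitch([seg_a, seg_b], offsets=[0, 2],
             weights=[np.ones(3), np.ones(3)], total_len=5, dim=2)
# position 2 is covered by both segments and averages their values
```

The learned cross-attention refinement (the third phase above) would then smooth this accumulated output.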
6. Training Procedures and Empirical Observations
Zonkey was trained end to end on the full English Wikipedia corpus (~2B tokens) with modest compute: a single GPU, batch size 16, and the AdamW optimizer, with per-level sequence lengths and numbers of compression (CLS) vectors fixed as hyperparameters. The diffusion schedule’s noise-level parameter was annealed from 1 to 0.01 over the step sequence. Losses comprised level-weighted reconstruction (clean/dirty), masked LM, segment collapse, split regularization, and stitch alignment terms.
Key findings:
- Emergent Boundaries: Word-level splits at spaces (L0), sentence-level splits at periods (L1), despite lack of boundary annotation.
- Robustness to Noise: Superior to static BPE baselines for handling noisy or adversarial text.
- Generative Hierarchy: Coherent, variable-length sentences generated ab initio from Gaussian noise; sequence length adapted organically, no explicit EOS tokens.
- Qualitative Alignment: Hierarchical segmentations and reconstructions align closely with natural linguistic units.
7. Limitations and Research Directions
The current implementation covers two hierarchy levels (character, sentence) and requires only modest computational resources and data. Scaling to higher abstractions (paragraph, document) is anticipated to demand increased compute, substantially more data, and possibly improved algorithms for stitching and efficient diffusion.
Proposed avenues for future work include:
- Hierarchy Expansion: Increasing model depth to cover paragraphs, chapters, and assessing long-form coherence quantitatively.
- Latent Retrieval/Memory: Integrating retrieval/memory modules for improved factual grounding.
- Diffusion Optimization: Fine-tuning DDMM mixed-step scheduling to balance sample efficiency and stability.
- Domain Adaptation Assessment: Applying adaptive tokenization to noisy or low-resource domains (e.g., medical transcripts).
These results suggest that fully gradient-based language modeling, in which both tokenization and generative modeling are learned in concert, is achievable, and that robust, adaptive segmentation may yield superior performance in specialized or noisy contexts. By replacing discrete, brittle model components with differentiable, data-driven alternatives, Zonkey provides a technical foundation for scalable, expressive, and end-to-end trainable LLMs (Rozental, 29 Jan 2026).