
Zonkey: Diffusion-Based Hierarchical LM

Updated 31 January 2026
  • Zonkey is a hierarchical, fully differentiable diffusion-based language model that replaces rigid tokenization with adaptive, learned segmentation for variable-length text representation.
  • It utilizes probabilistic attention and a hierarchical denoising diffusion process to transform raw character inputs into coherent document-level outputs.
  • Empirical results demonstrate emergent linguistic boundaries, robustness to noise, and a scalable framework for future advancements in language modeling.

Zonkey is a hierarchical, fully differentiable diffusion-based LLM designed to address the limitations of conventional LLMs stemming from rigid, non-differentiable tokenization schemes. Zonkey integrates adaptive segmentation, probabilistic attention, hierarchical compression, and latent denoising, enabling end-to-end training from raw character sequences up to document-level abstractions. The system is characterized by a learnable pipeline that forgoes discrete token boundaries, instead utilizing dynamically emergent linguistic splits and continuous representations at every stage of text synthesis and comprehension (Rozental, 29 Jan 2026).

1. Hierarchical Diffusion Framework

Zonkey arranges its architecture into a stack of hierarchical levels $l = 0, 1, \dots, L$, with each level responsible for progressively abstracting text structure:

  • Character Embeddings (Level 0): Raw input documents are embedded at the character level into vector sequences $\{x_i^{(0)}\}_{i=1}^{L}$.
  • Overlapping Segments: The Segment Splitter generates $m_l$ overlapping segments (maximum length $S_l$) at each level $l$.
  • Latent Compression: Each segment $s$ is compressed by prepending $N_l$ learnable CLS vectors and processing the result with a Transformer encoder that uses Probabilistic Attention. This yields a continuous latent $\mathbf{c}^{(l)}_s \in \mathbb{R}^{N_l d_l}$.
  • Diffusion/Noising: Latents are noised according to a schedule $\alpha_t$, combining scaled signal and isotropic Gaussian noise:

$$\tilde{\mathbf{c}}^{(l)}_{s,t} = \sqrt{\alpha_t}\, \mathbf{c}^{(l)}_s + \sqrt{1 - \alpha_t}\, \boldsymbol\epsilon, \qquad \boldsymbol\epsilon \sim \mathcal{N}(0, I)$$

  • Denoising: The Denoising Diffusion Mixed Model (DDMM) restores the latents, enabling both small (“clean”) and large (“dirty”) step denoising.
  • Stitcher: Overlapping outputs are merged into a global sequence, ready for the next hierarchical level or generation output.
  • Inference: Text synthesis starts from $\mathbf{z}_T \sim \mathcal{N}(0, I)$ at the highest level, followed by iterative denoising and hierarchical unrolling to reconstruct raw characters.

This organization allows Zonkey to learn and reconstruct complex textual structures through latent signal manipulation and reconstruction, rather than rigid token-based generation.
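The forward noising step in the pipeline above can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's implementation: `noise_latent` and the list-of-floats latent representation are assumptions for readability.

```python
import math
import random

def noise_latent(c, alpha_t, rng=random.Random(0)):
    """Forward noising of one segment latent (illustrative sketch):
    sqrt(alpha_t) * c + sqrt(1 - alpha_t) * eps, with eps ~ N(0, I)."""
    return [math.sqrt(alpha_t) * ci + math.sqrt(1.0 - alpha_t) * rng.gauss(0.0, 1.0)
            for ci in c]
```

At $\alpha_t = 1$ the latent passes through unchanged; as $\alpha_t \to 0$ it approaches pure Gaussian noise, which is the starting point for inference.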

2. Differentiable Tokenization via the Segment Splitter

The Segment Splitter replaces traditional discrete tokenizers (e.g., Byte Pair Encoding, WordPiece) with a learned, probabilistic decision process. For each position $i$, it predicts a beginning-of-sequence (BOS) probability:

$$p_{\mathrm{BOS},i} = \sigma(w^\top h_i + b), \quad \text{where} \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Segments are not hard-split; instead, for each segment $s$ starting at position $i$, the model computes an existence probability for every downstream position $j$:

$$p_{\mathrm{exist},j}^{(s)} = \prod_{k=i+1}^{j} (1 - p_{\mathrm{BOS},k})$$

This process results in soft, overlapping segments, with training losses (reconstruction, masked language modeling, offset regression) weighted by a normalized existence share across overlapping segments:

$$\alpha_j^{(s)} = \frac{p_{\mathrm{exist},j}^{(s)}}{\sum_{s' \ni j} p_{\mathrm{exist},j}^{(s')}}$$

This enables fully gradient-based learning to optimize boundary placement. Empirically, the Splitter yields word-like splits at spaces (level 0) and sentence-like splits at periods (level 1) without explicit boundary supervision.

| Classic Tokenizer | Zonkey Segment Splitter | Emergent Splitting |
|---|---|---|
| Non-differentiable, fixed | Differentiable, adaptive | Linguistic (spaces, periods) |
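The existence-probability and share formulas above can be made concrete with a small sketch. Function names (`existence_probs`, `existence_shares`) are illustrative; the actual model computes these quantities with tensors inside the network.

```python
def existence_probs(p_bos, start, max_len):
    """Soft existence of positions in a segment beginning at `start`:
    p_exist(j) = prod over k in (start, j] of (1 - p_bos[k])."""
    probs, p = [], 1.0
    for j in range(start, min(start + max_len, len(p_bos))):
        if j > start:
            p *= (1.0 - p_bos[j])
        probs.append(p)
    return probs

def existence_shares(p_bos, starts, max_len):
    """Normalized share alpha_j^(s) of each position across overlapping
    segments: each segment's existence prob divided by the total at j."""
    seg = {s: existence_probs(p_bos, s, max_len) for s in starts}
    totals = {}
    for s, probs in seg.items():
        for off, p in enumerate(probs):
            totals[s + off] = totals.get(s + off, 0.0) + p
    return {s: [p / totals[s + off] for off, p in enumerate(probs)]
            for s, probs in seg.items()}
```

Because each position's shares across the segments that cover it sum to one, the per-position training losses are distributed smoothly over overlapping segments, keeping the whole pipeline differentiable.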

3. Probabilistic Attention and Soft-Length Modulation

To accommodate variable-length and overlapping segments, each position $k$ receives an existence probability $p_k$. Zonkey’s self-attention mechanism computes, for a query at position $q$ and a key at position $k$:

$$s_{qk} = \frac{Q_q^\top K_k}{\sqrt{d}}, \qquad s'_{qk} = s_{qk} + \begin{cases} \log\left(\frac{p_k}{p_q}\right), & k > q \\ 0, & k \leq q \end{cases}$$

Attention weights become:

$$A_{qk} = \frac{\exp(s'_{qk})}{\sum_{k'} \exp(s'_{qk'})}$$

This enables the model to down-weight contributions from tokens unlikely to exist without breaking backpropagation, supporting flexible segment boundaries and dynamic content length. Only future tokens (in bidirectional encoders) receive attention modulation.
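The logit modulation can be sketched for a single query row as follows. This is an illustrative sketch assuming plain Python floats; the real model applies the same bias inside batched tensor attention.

```python
import math

def prob_attention_weights(scores, p_exist, query_idx):
    """Add log(p_k / p_q) to the logits of future keys (k > q),
    then apply a numerically stable softmax over the row."""
    q = query_idx
    logits = [s + (math.log(p_exist[k] / p_exist[q]) if k > q else 0.0)
              for k, s in enumerate(scores)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

A future key with near-zero existence probability receives a large negative bias and thus a near-zero attention weight, while the operation stays smooth in $p_k$, so gradients flow into the Segment Splitter.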

Variable-length outputs are realized by halting decoding where the cumulative existence probability falls below a threshold $\varepsilon$, removing the need for explicit EOS tokens:

$$P(\text{position } k \text{ exists} \mid \text{all prior exist}) = p_k = \prod_{i=1}^{k} (1 - p_{\mathrm{BOS},i})$$
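The halting rule amounts to a short loop; `decoded_length` and the threshold value below are illustrative assumptions.

```python
def decoded_length(p_bos, eps=0.05):
    """Emit positions until the cumulative existence probability
    prod(1 - p_bos[i]) drops below the threshold eps."""
    p = 1.0
    for k, pb in enumerate(p_bos):
        p *= (1.0 - pb)
        if p < eps:
            return k  # positions before k are emitted
    return len(p_bos)
```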

4. Hierarchical Compression and Denoising Diffusion Mixed Model (DDMM)

Each level’s Compressor processes segments of maximum length $S_l$ (with effective length governed by existence probabilities), prepends $N_l$ CLS tokens, and outputs hidden states $\mathbf{H} \in \mathbb{R}^{(S_l + N_l) \times d_l}$. The first $N_l$ outputs are concatenated and normalized:

$$\mathbf{c}^{(l)} = \bigl[ H_1; \dots; H_{N_l} \bigr] \big/ \left\| [H_1; \dots; H_{N_l}] \right\|$$

Noise is added to this latent and removed via the DDMM, which fuses DDPM stability with DDIM speed. Training leverages mixed-step objectives: a clean latent $\mathbf{p}_1$ is heavily noised to $\mathbf{p}_2$, partially denoised to $\mathbf{p}_3$, then noised and denoised again to $\mathbf{p}_4$, minimizing the cosine error between $\mathbf{p}_4$ and its projection onto the segment $[\mathbf{p}_1, \mathbf{p}_2]$:

$$\Pi_{[1,2]}(\mathbf{p}_4) = \frac{(\mathbf{p}_4 - \mathbf{p}_1) \cdot (\mathbf{p}_2 - \mathbf{p}_1)}{\|\mathbf{p}_2 - \mathbf{p}_1\|^2}\,(\mathbf{p}_2 - \mathbf{p}_1) + \mathbf{p}_1$$

$$\mathcal{L}_{\mathrm{mixed}} = 1 - \frac{\langle \mathbf{p}_4, \Pi_{[1,2]}(\mathbf{p}_4) \rangle}{\|\mathbf{p}_4\| \, \|\Pi_{[1,2]}(\mathbf{p}_4)\|}$$

This encourages the model to interpolate between cautious, incremental denoising and aggressive, large-step correction.
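The projection and cosine objective can be verified with a small sketch (function names are illustrative; the model computes this over batched latent vectors):

```python
import math

def project_onto_line(p4, p1, p2):
    """Pi_[1,2](p4): orthogonal projection of p4 onto the line through p1, p2."""
    d = [b - a for a, b in zip(p1, p2)]
    t = sum((x - a) * di for x, a, di in zip(p4, p1, d)) / sum(di * di for di in d)
    return [a + t * di for a, di in zip(p1, d)]

def mixed_loss(p4, p1, p2):
    """L_mixed = 1 - cosine(p4, Pi_[1,2](p4))."""
    proj = project_onto_line(p4, p1, p2)
    dot = sum(x * y for x, y in zip(p4, proj))
    norm = math.sqrt(sum(x * x for x in p4)) * math.sqrt(sum(y * y for y in proj))
    return 1.0 - dot / norm
```

The loss vanishes exactly when the re-denoised latent $\mathbf{p}_4$ lands back on the clean-to-noisy line, which is what rewards both cautious small steps and aggressive large ones.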

5. Overlap Invariance via the Stitcher

Post-denoising, overlapping segments are resolved by the Stitcher module in three differentiable phases:

  1. Soft Offset Inference: For consecutive segments $s, s+1$, determine continuous offsets $\delta_{s \to s+1}$ via the drop in the cumulative existence trace and the cosine similarity of segment boundaries.
  2. Weighted Accumulation: Global sequence reconstruction combines all segments’ contributions to each position, weighted by existence probabilities:

$$\hat{X}^{(l)}_i = \frac{\sum_s p_{\mathrm{exist},\, i - \delta_s}^{(s)} \, \hat{X}^{(l)}_{s,\, i - \delta_s}}{\sum_s p_{\mathrm{exist},\, i - \delta_s}^{(s)}}$$

  3. Learned Refinement: A cross-attention Transformer smooths the overlapped output, enforcing coherence and resolving local inconsistencies.

Auxiliary losses (MSE on offsets, cosine similarity on overlaps) improve split prediction and overlap resolution, facilitating sharp yet border-aware segment boundaries.
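The weighted-accumulation phase is an overlap-add with existence weights. A minimal sketch with scalar positions (the real module operates on latent vectors and soft, continuous offsets):

```python
def stitch(segments, offsets, exist):
    """Weighted overlap-add: each global position is the average of all
    overlapping segments' values there, weighted by existence probability."""
    num, den = {}, {}
    for seg, off, p in zip(segments, offsets, exist):
        for j, (v, pj) in enumerate(zip(seg, p)):
            i = off + j
            num[i] = num.get(i, 0.0) + pj * v
            den[i] = den.get(i, 0.0) + pj
    return [num[i] / den[i] for i in sorted(num)]
```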

6. Training Procedures and Empirical Observations

Zonkey was trained end to end on the full English Wikipedia corpus (~2B tokens) with modest compute: a single GPU, batch size 16, and the AdamW optimizer at a $10^{-4}$ learning rate, with sequence lengths $[32, 32]$ and compression vector counts $N = [4, 4]$. The diffusion schedule comprised $T = 100$ steps ($\alpha_t$ decaying from 1 to 0.01). Losses comprised level-weighted reconstruction (clean/dirty), masked LM, segment collapse, split regularization, and stitch alignment.
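The stated schedule only fixes the endpoints and step count; a linear interpolation is one plausible realization (the interpolation shape here is an assumption, not given in the source):

```python
def alpha_schedule(T=100, start=1.0, end=0.01):
    """One plausible alpha_t schedule: linear decay from `start` to `end`
    over T steps. The source states only the endpoints and T=100."""
    return [start + (end - start) * t / (T - 1) for t in range(T)]
```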

Key findings:

  • Emergent Boundaries: Word-level splits at spaces (L0), sentence-level splits at periods (L1), despite lack of boundary annotation.
  • Robustness to Noise: Superior to static BPE baselines for handling noisy or adversarial text.
  • Generative Hierarchy: Coherent, variable-length sentences generated ab initio from Gaussian noise; sequence length adapted organically, no explicit EOS tokens.
  • Qualitative Alignment: Hierarchical segmentations and reconstructions align closely with natural linguistic units.

7. Limitations and Research Directions

Current implementation covers two hierarchy levels (character, sentence) and requires modest computational resources and data. Scaling to higher abstractions (paragraph, document) is anticipated to demand increased compute, substantial data, and possibly enhanced algorithms for stitching and efficient diffusion.

Proposed avenues for future work include:

  • Hierarchy Expansion: Increasing model depth to cover paragraphs, chapters, and assessing long-form coherence quantitatively.
  • Latent Retrieval/Memory: Integrating retrieval/memory modules for improved factual grounding.
  • Diffusion Optimization: Fine-tuning DDMM mixed-step scheduling to balance sample efficiency and stability.
  • Domain Adaptation Assessment: Applying adaptive tokenization to noisy or low-resource domains (e.g., medical transcripts).

This suggests that fully gradient-based language modeling—where both tokenization and generative modeling are learned in concert—is achievable, and that robust, adaptive segmentation potentially yields superior performance in specialized or noisy contexts. By replacing discrete, brittle model components with differentiable, data-driven alternatives, Zonkey provides a technical foundation for scalable, expressive, and end-to-end trainable LLMs (Rozental, 29 Jan 2026).
