Zonkey: Diffusion-Based Hierarchical LM
- Zonkey is a hierarchical, fully differentiable diffusion-based language model that replaces rigid tokenization with adaptive, learned segmentation for variable-length text representation.
- It utilizes probabilistic attention and a hierarchical denoising diffusion process to transform raw character inputs into coherent document-level outputs.
- Empirical results demonstrate emergent linguistic boundaries, robustness to noise, and a scalable framework for future advancements in language modeling.
Zonkey is a hierarchical, fully differentiable diffusion-based LLM designed to address the limitations of conventional LLMs stemming from rigid, non-differentiable tokenization schemes. Zonkey integrates adaptive segmentation, probabilistic attention, hierarchical compression, and latent denoising, enabling end-to-end training from raw character sequences up to document-level abstractions. The system is characterized by a learnable pipeline that forgoes discrete token boundaries, instead utilizing dynamically emergent linguistic splits and continuous representations at every stage of text synthesis and comprehension (Rozental, 29 Jan 2026).
1. Hierarchical Diffusion Framework
Zonkey arranges its architecture into a stack of hierarchical levels, with each level responsible for progressively abstracting text structure:
- Character Embeddings (Level 0): Raw input documents are embedded at the character level into vector sequences.
- Overlapping Segments: The Segment Splitter generates overlapping segments, bounded by a per-level maximum length, at each level.
- Latent Compression: Each segment is compressed by prepending learnable CLS vectors and encoding the result with a Transformer encoder that uses Probabilistic Attention. This yields a continuous latent vector.
- Diffusion/Noising: Latents are noised according to a variance schedule, mixing the scaled clean signal with isotropic Gaussian noise.
- Denoising: The Denoising Diffusion Mixed Model (DDMM) restores the latents, enabling both small (“clean”) and large (“dirty”) step denoising.
- Stitcher: Overlapping outputs are merged into a global sequence, ready for the next hierarchical level or generation output.
- Inference: Text synthesis starts from Gaussian noise at the highest level, followed by iterative denoising and hierarchical unrolling to reconstruct raw characters.
This organization allows Zonkey to learn and reconstruct complex textual structures through latent signal manipulation and reconstruction, rather than rigid token-based generation.
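The noising step in the pipeline above follows the standard forward-diffusion mixing rule; a minimal numpy sketch (the schedule value `alpha_bar` is illustrative, not taken from the paper):

```python
import numpy as np

def noise_latent(z0, alpha_bar, rng):
    """Forward diffusion: mix the scaled clean latent with isotropic
    Gaussian noise. alpha_bar in [0, 1] controls how much signal survives."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8))        # batch of clean segment latents
z_half = noise_latent(z0, 0.5, rng)     # half signal, half noise
z_clean = noise_latent(z0, 1.0, rng)    # alpha_bar = 1 leaves z0 intact
```

Denoising inverts this process step by step; the DDMM described in section 4 mixes small and large inversion steps.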
2. Differentiable Tokenization via the Segment Splitter
The Segment Splitter replaces traditional discrete tokenizers (e.g., Byte Pair Encoding, WordPiece) with a learned, probabilistic decision process. For each input position, it predicts a beginning-of-sequence (BOS) probability. Segments are not hard-split; instead, for each candidate segment, the model computes an existence probability for every downstream token, yielding soft, overlapping segments. Training losses (reconstruction, masked language modeling, offset regression) are weighted by each token's normalized existence share across the overlapping segments, enabling fully gradient-based optimization of boundary placement. Empirically, the Splitter yields word-like splits at spaces (level 0) and sentence-like splits at periods (level 1) without explicit boundary supervision.
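As a concrete, hypothetical reading of the soft-splitting rule: suppose `p_bos[i]` is the learned BOS probability at position `i`, a token exists in the segment starting at `s` with the probability that no new boundary fires after `s`, and per-position loss weights are the normalized existence share across overlapping segments. The specific survival-product form below is an assumption for illustration:

```python
import numpy as np

# p_bos[i]: learned probability that position i begins a new segment
# (values here are invented for illustration).
p_bos = np.array([1.0, 0.1, 0.8, 0.2, 0.9])

def existence(p_bos, s):
    """Existence probability of each downstream token in the segment
    starting at s: the segment fires with prob p_bos[s], and token j
    survives while no new boundary occurs in (s, j]."""
    n = len(p_bos)
    exist = np.zeros(n)
    exist[s] = p_bos[s]
    survive = 1.0
    for j in range(s + 1, n):
        survive *= 1.0 - p_bos[j]
        exist[j] = p_bos[s] * survive
    return exist

# One candidate segment per start position; loss weights are each
# segment's normalized share of a token's total existence mass.
E = np.stack([existence(p_bos, s) for s in range(len(p_bos))])
share = E / np.clip(E.sum(axis=0), 1e-9, None)
```

Because every quantity is a product of probabilities rather than a hard cut, gradients flow through the boundary decisions, which is the point of the Splitter.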
| Classic Tokenizer | Zonkey Segment Splitter | Emergent Splitting |
|---|---|---|
| Non-differentiable, fixed | Differentiable, adaptive | Linguistic (spaces, periods) |
3. Probabilistic Attention and Soft-Length Modulation
To facilitate unbounded, variable, and overlapping segment lengths, each position receives an existence probability. Zonkey’s self-attention modulates the weight each query assigns to a key by that key’s existence probability, down-weighting contributions from tokens unlikely to exist without breaking backpropagation. This supports flexible segment boundaries and dynamic content length; in bidirectional encoders, only future tokens receive attention modulation.
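One simple way to realize this modulation (an assumption, not the paper's exact formula) is to add `log p_exist` to the attention logits, which is equivalent to multiplying each key's softmax weight by its existence probability and renormalizing:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prob_attention(q, k, v, p_exist, eps=1e-9):
    """Scaled dot-product attention with keys down-weighted by their
    existence probabilities; fully differentiable in p_exist."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(p_exist + eps)
    return softmax(logits) @ v

q = np.zeros((1, 2))
k = np.zeros((2, 2))
v = np.array([[1.0, 0.0], [0.0, 1.0]])
out = prob_attention(q, k, v, p_exist=np.array([1.0, 1e-12]))
# the second key barely exists, so the output collapses onto v[0]
```

A key with existence probability near zero contributes almost nothing, yet its gradient path remains intact.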
Variable-length outputs are realized by halting decoding where the cumulative existence probability falls below a threshold, removing the need for explicit EOS tokens.
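The halting rule can be sketched as follows, under the assumption that "cumulative existence" means a running product of per-position existence probabilities (the paper's exact definition is not reproduced here):

```python
import numpy as np

def soft_length(p_exist, tau=0.5):
    """Decode until the cumulative existence probability drops below
    the threshold tau; no explicit EOS token is required."""
    cum = np.cumprod(p_exist)
    below = np.nonzero(cum < tau)[0]
    return int(below[0]) if below.size else len(p_exist)

# existence stays high for three positions, then collapses
length = soft_length(np.array([0.99, 0.98, 0.9, 0.5, 0.1]))
```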
4. Hierarchical Compression and Denoising Diffusion Mixed Model (DDMM)
Each level’s Compressor processes segments up to a per-level maximum length (with effective length governed by existence probabilities), prepends learnable CLS tokens, and outputs hidden states; the CLS outputs are concatenated and normalized into the segment latent. Noise is added to this latent and removed via the DDMM, which fuses DDPM stability with DDIM speed. Training leverages mixed-step objectives: a clean latent is heavily noised, partially denoised to an intermediate step, noised again, and denoised once more, with the loss minimizing the cosine error between the final prediction and its projection onto the clean latent.
This encourages the model to interpolate between cautious, incremental denoising and aggressive, large-step correction.
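A hedged numpy sketch of the cosine objective: the final prediction is compared against its projection onto the clean latent, so only directional disagreement is penalized (the projection form is an interpretation of the description above, not the paper's exact loss):

```python
import numpy as np

def cosine_error(pred, target, eps=1e-9):
    """1 - cos(pred, proj(pred onto target)); near zero when pred points
    along the clean latent, regardless of scale."""
    proj = (pred @ target) / max(target @ target, eps) * target
    num = pred @ proj
    den = np.linalg.norm(pred) * np.linalg.norm(proj) + eps
    return 1.0 - num / den

z_clean = np.array([1.0, 2.0, 3.0])
assert cosine_error(2.0 * z_clean, z_clean) < 1e-6   # scale-invariant
```

Because the error depends only on direction, small "clean" steps and large "dirty" steps can share one objective.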
5. Overlap Invariance via the Stitcher
Post-denoising, overlapping segments are resolved by the Stitcher module in three differentiable phases:
- Soft Offset Inference: For each pair of overlapping segments, determine continuous offsets via drops in the cumulative existence trace and cosine similarity of segment boundaries.
- Weighted Accumulation: Global sequence reconstruction combines all segments’ contributions to each position, weighted by existence probabilities.
- Learned Refinement: Cross-attention transformer smooths the overlapped output, enforcing coherence and resolving local inconsistencies.
Auxiliary losses (MSE on offsets, cosine similarity on overlaps) improve split prediction and overlap resolution, facilitating sharp yet border-aware segment boundaries.
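The weighted-accumulation phase can be sketched as follows (offsets are taken as given here, whereas the Stitcher infers them; names and shapes are illustrative):

```python
import numpy as np

def stitch(segments, offsets, weights, total_len, dim):
    """Merge overlapping segment outputs into one global sequence,
    weighting each contribution by its per-position existence prob."""
    acc = np.zeros((total_len, dim))
    norm = np.zeros(total_len)
    for seg, off, w in zip(segments, offsets, weights):
        for j in range(seg.shape[0]):
            acc[off + j] += w[j] * seg[j]
            norm[off + j] += w[j]
    return acc / np.clip(norm, 1e-9, None)[:, None]

seg_a = np.ones((3, 2))          # covers positions 0..2
seg_b = 3.0 * np.ones((3, 2))    # covers positions 2..4
out = stitch([seg_a, seg_b], offsets=[0, 2],
             weights=[np.ones(3), np.ones(3)], total_len=5, dim=2)
# position 2 is covered by both segments and averages their values
```

The learned cross-attention refinement (the third phase above) would then smooth this accumulated output.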
6. Training Procedures and Empirical Observations
Zonkey was trained end to end on the full English Wikipedia corpus (~2B tokens) with modest compute: a single GPU, batch size 16, and the AdamW optimizer, with per-level sequence lengths and numbers of compression (CLS) vectors fixed as hyperparameters. The diffusion schedule’s noise-level parameter was annealed from 1 to 0.01 over the step sequence. Losses comprised level-weighted reconstruction (clean/dirty), masked LM, segment collapse, split regularization, and stitch alignment terms.
Key findings:
- Emergent Boundaries: Word-level splits at spaces (L0), sentence-level splits at periods (L1), despite lack of boundary annotation.
- Robustness to Noise: Superior to static BPE baselines for handling noisy or adversarial text.
- Generative Hierarchy: Coherent, variable-length sentences generated ab initio from Gaussian noise; sequence length adapted organically, no explicit EOS tokens.
- Qualitative Alignment: Hierarchical segmentations and reconstructions align closely with natural linguistic units.
7. Limitations and Research Directions
The current implementation covers two hierarchy levels (character, sentence) and requires only modest computational resources and data. Scaling to higher abstractions (paragraph, document) is anticipated to demand increased compute, substantially more data, and possibly improved algorithms for stitching and efficient diffusion.
Proposed avenues for future work include:
- Hierarchy Expansion: Increasing model depth to cover paragraphs, chapters, and assessing long-form coherence quantitatively.
- Latent Retrieval/Memory: Integrating retrieval/memory modules for improved factual grounding.
- Diffusion Optimization: Fine-tuning DDMM mixed-step scheduling to balance sample efficiency and stability.
- Domain Adaptation Assessment: Applying adaptive tokenization to noisy or low-resource domains (e.g., medical transcripts).
These results suggest that fully gradient-based language modeling, in which both tokenization and generative modeling are learned in concert, is achievable, and that robust, adaptive segmentation may yield superior performance in specialized or noisy contexts. By replacing discrete, brittle model components with differentiable, data-driven alternatives, Zonkey provides a technical foundation for scalable, expressive, and end-to-end trainable LLMs (Rozental, 29 Jan 2026).