Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention
Abstract: LLMs have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.
Explain it Like I'm 14
Overview
This paper introduces Zonkey, a new kind of LLM that learns to understand and generate text without relying on fixed rules for chopping text into “tokens” (like words or subwords). Instead, it learns everything end-to-end—from raw characters up to sentence-like meanings—using a system that can be trained directly. Zonkey combines a smart way of splitting text, a method to compress information, and a diffusion process that can generate new text that varies in length naturally.
Key Questions the Paper Tries to Answer
- Can we build an LLM that learns how to split text into meaningful pieces (like words and sentences) by itself, rather than using fixed rules?
- Can we keep everything “differentiable,” meaning all parts can be trained together using gradients, which helps the model adapt to new or messy data?
- Can we use diffusion (a method often used in image generation) to make text from scratch, even though text is usually made of discrete symbols?
- Can a model handle variable-length text gracefully without special “end-of-sequence” markers?
How Zonkey Works (In Everyday Terms)
1) Learning Where to Cut: The Segment Splitter
Imagine reading a long paragraph and deciding where words and sentences begin. Most models use fixed rules to split text, like always breaking at spaces. Zonkey’s Segment Splitter learns where to start segments using probabilities. It doesn’t hard-cut; it softly decides, “This spot is likely the start of something.” Because these decisions are soft (probabilistic), the model can be trained end-to-end, improving its splits as it learns.
A key idea is “existence probabilities.” Think of walking along a path with a flashlight that slowly dims. As you move forward, the chance that later positions are still part of the segment gets smaller. This lets Zonkey treat sequences as potentially very long and gently fade out parts that don’t matter, instead of abruptly stopping at a special end token.
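To make the dimming-flashlight idea concrete, here is a minimal sketch, assuming a simple linear BOS head and toy dimensions (neither taken from the paper), of how soft start-of-segment probabilities yield a decaying existence probability via a running product of (1 − p_BOS):

```python
import torch

def bos_and_existence(char_states, bos_head):
    """Toy sketch: per-position BOS probabilities and the decaying
    'existence' probability of the segment starting at position 0.

    char_states: (seq_len, d_model) contextual character representations
    bos_head:    any module mapping d_model -> 1 logit per position
    """
    # Probability that each position begins a new segment (soft decision).
    p_bos = torch.sigmoid(bos_head(char_states)).squeeze(-1)   # (seq_len,)

    # Position k still belongs to the current segment only if no new segment
    # started at positions 1..k, so existence decays as a running product.
    survive = torch.cat([torch.ones(1), 1.0 - p_bos[1:]])      # position 0 always exists
    p_exist = torch.cumprod(survive, dim=0)                    # monotonically decreasing
    return p_bos, p_exist

# Illustrative usage with random features and a linear BOS head.
d_model = 32
states = torch.randn(10, d_model)
head = torch.nn.Linear(d_model, 1)
p_bos, p_exist = bos_and_existence(states, head)
print(p_exist)  # fades toward 0 instead of hitting a hard end-of-sequence token
```

Because every step here is a differentiable operation (sigmoid, product), gradients can flow back into whatever network produces the BOS logits.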
2) Paying Attention Smartly: Probabilistic Attention
“Attention” helps a model decide which parts of the text matter most for understanding the current part. Zonkey’s Probabilistic Attention reduces attention to parts of the sequence that are less likely to exist (those dimmer flashlight areas). This acts like a soft mask: the model naturally focuses on more reliable positions, but stays trainable because there’s no hard cutoff.
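The glossary at the end of this page quotes the underlying mechanism: raw attention scores are shifted by the log-ratio of existence probabilities, log(p_k / p_q). The single-head sketch below is our own reconstruction of that idea, not the released implementation; the tensor shapes and clamping constant are assumptions.

```python
import torch
import torch.nn.functional as F

def probabilistic_attention(Q, K, V, p_exist, eps=1e-8):
    """Minimal single-head sketch of Probabilistic Attention.

    Q, K, V: (n, d) query/key/value matrices
    p_exist: (n,) existence probability of each position
    Keys that are unlikely to exist receive large negative score adjustments
    (a soft mask), while gradients still flow into p_exist.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / d**0.5                          # raw scores s_qk
    log_p = torch.log(p_exist.clamp_min(eps))
    scores = scores + log_p[None, :] - log_p[:, None]  # + log(p_k / p_q)
    weights = F.softmax(scores, dim=-1)
    return weights @ V

# If p_exist drops sharply from 1 to ~0, this behaves like ordinary hard
# masking of the tail; in between, attention fades out gradually.
n, d = 6, 16
Q, K, V = (torch.randn(n, d) for _ in range(3))
p_exist = torch.tensor([1.0, 1.0, 0.9, 0.5, 0.1, 0.01])
out = probabilistic_attention(Q, K, V, p_exist)
```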
3) Summarizing Segments: The Compressor
Each segment (a chunk of characters) is turned into a small set of summary vectors—like making a short “note” that captures the meaning of that chunk. These summary vectors represent higher-level ideas (word-like or sentence-like). The model uses masked-language-style training (guessing missing parts) so these summaries capture meaning, not just surface detail.
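A rough picture of the summarization step is below. The layer sizes, the number of CLS slots, and the use of a stock TransformerEncoder (standard attention in place of the paper's Probabilistic Attention) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyCompressor(nn.Module):
    """Summarize a segment into a handful of compression vectors by prepending
    learnable CLS slots and reading them back out after encoding."""

    def __init__(self, d_model=64, num_compression_vectors=4, nhead=4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(num_compression_vectors, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, segment):                          # segment: (batch, seq_len, d_model)
        b = segment.shape[0]
        cls = self.cls.unsqueeze(0).expand(b, -1, -1)    # (batch, k, d_model)
        x = torch.cat([cls, segment], dim=1)             # CLS slots come first
        x = self.encoder(x)
        return x[:, : self.cls.shape[0]]                 # keep only the summary vectors

# The k summary vectors stand in for the whole chunk at the next level up.
comp = ToyCompressor()
summaries = comp(torch.randn(2, 17, 64))                 # -> (2, 4, 64)
```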
4) Generating and Reconstructing: DDMM (Denoising Diffusion Mixed Model)
Diffusion is like starting with noisy “static” and cleaning it step-by-step into clear output. DDMM blends two styles:
- Careful small steps when the model isn’t sure (safer, more stable).
- Bold jumps when the model is confident (faster, more direct).
This balance helps the model turn summary vectors back into detailed text that reads well, and it supports variable-length outputs without fixed ends.
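Since this page only summarizes DDMM qualitatively, the following is a speculative sketch of one mixed step: a cautious, noise-re-adding step when the denoiser is unsure and a direct jump when it is confident. The confidence gate, threshold, and step size are invented for illustration; only the square-root mixing rule echoes the variance-preserving formula quoted in the glossary.

```python
import torch

def ddmm_step(x_t, t, denoiser, confidence_threshold=0.8, small_step=0.05):
    """Speculative sketch of one mixed denoising step (not the paper's exact rule).

    x_t:      current noisy latent, shape (k, d)
    t:        noise level in [0, 1], where 1 means pure noise
    denoiser: model predicting (x0_hat, confidence) from (x_t, t)
    """
    x0_hat, confidence = denoiser(x_t, t)
    if confidence.mean() > confidence_threshold:
        # Confident: take a bold, deterministic jump toward the estimate (DDIM-like).
        return x0_hat, 0.0
    # Unsure: take a small stochastic step, re-adding a bit of noise (DDPM-like).
    t_next = max(t - small_step, 0.0)
    noise = torch.randn_like(x_t)
    x_next = (1 - t_next) ** 0.5 * x0_hat + t_next ** 0.5 * noise
    return x_next, t_next

# Toy denoiser: no learning, just echoes its input with a fixed confidence.
def toy_denoiser(x_t, t):
    return x_t, torch.tensor(0.5)

x, t = torch.randn(4, 64), 1.0
while t > 0.0:
    x, t = ddmm_step(x, t, toy_denoiser)
```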
5) Sewing Pieces Back Together: The Stitcher
Because Zonkey creates overlapping segments, the Stitcher “sews” them into a full, smooth document. It gently blends overlaps so the final text is coherent. This stitching is also differentiable, so training signals can flow backward and improve earlier components (like the Splitter).
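One way to picture the differentiable blending is a weighted average of overlapping segment representations, with weights derived from existence probabilities and normalized per position (the "existence shares" defined in the glossary). The sketch below uses made-up offsets and plain averaging; the actual Stitcher additionally refines overlaps with constrained cross-attention.

```python
import torch

def stitch(segments, offsets, p_exists, doc_len, d_model):
    """Blend overlapping segment representations into one document-length sequence.

    segments: list of (seg_len_i, d_model) tensors
    offsets:  start position of each segment in the document
    p_exists: list of (seg_len_i,) existence probabilities per segment
    Each document position becomes a weighted average of every segment covering it,
    with weights ('existence shares') normalized to sum to 1 per position.
    """
    acc = torch.zeros(doc_len, d_model)
    weight = torch.zeros(doc_len, 1)
    for seg, off, p in zip(segments, offsets, p_exists):
        end = off + seg.shape[0]
        acc[off:end] += p.unsqueeze(-1) * seg
        weight[off:end] += p.unsqueeze(-1)
    return acc / weight.clamp_min(1e-8)   # normalize so shares sum to 1 across overlaps

# Two segments overlapping on positions 3..5 of an 8-position document (toy offsets).
d = 16
segs = [torch.randn(6, d), torch.randn(5, d)]
ps = [torch.linspace(1.0, 0.2, 6), torch.linspace(1.0, 0.2, 5)]
doc = stitch(segs, offsets=[0, 3], p_exists=ps, doc_len=8, d_model=d)
```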
Main Findings and Why They Matter
- Emergent boundaries: Without being explicitly told where words or sentences begin, the model learned to place segment starts at natural spots (like spaces for words, periods for sentences).
- Coherent text from noise: Trained on Wikipedia (even on a single GPU), Zonkey can generate sentences that look and feel coherent starting from random noise.
- Variable-length generation: By using existence probabilities, the model doesn’t need special end tokens and can produce text that naturally stops when it should.
- Fully trainable pipeline: Because the tokenizer is differentiable, the entire system can be trained together, which can help it adapt to different domains or noisy data better than fixed-rule tokenization.
- Promising compared to other learnable tokenizers: Qualitatively, Zonkey’s adaptive splitting and hierarchical approach align better with meaningful structure than methods that rely on simple heuristics like equal entropy per chunk.
What This Could Mean Going Forward
- Better adaptability: Models that learn their own “tokens” can handle different kinds of texts—like technical jargon, slang, or noisy inputs—more flexibly.
- Scalability for long texts: Generating at higher levels (sentences and paragraphs) in parallel could make long-form text generation faster and more efficient than token-by-token methods.
- Fewer rigid rules: Moving away from fixed tokenizers may lead to LLMs that are easier to train end-to-end and that naturally discover useful linguistic units.
- Stronger domain performance: Fully gradient-based systems can fine-tune their segmentation and representations for specific tasks or fields, potentially improving accuracy and robustness.
In short, Zonkey is a step toward LLMs that learn from raw text in a more human-like, flexible way—deciding for themselves where words and sentences begin, compressing meaning, and generating coherent, variable-length text through a balanced diffusion process.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of concrete gaps and unresolved questions that future work could address:
- Lack of quantitative evaluation: No standardized metrics (e.g., negative log-likelihood, perplexity proxies, compression ratio, segmentation F1 vs whitespace/punctuation, MAUVE, human eval protocols) are reported; a rigorous benchmark suite is missing to compare Zonkey against autoregressive LMs, ByT5/CANINE/BLT, and diffusion LMs.
- Inference speed and compute comparisons: Claims of favorable parallelism are not substantiated with wall-clock benchmarks against strong baselines across sequence lengths and hierarchy depths; cost breakdowns per module (Splitter/Compressor/DDMM/Stitcher) are absent.
- Scalability to long documents: Although sequences are described as “theoretically infinite,” training truncates segments to max_seq_len; the practical memory/time complexity and quality scaling for multi-page documents remain untested.
- Cross-level error propagation: How denoising or segmentation errors compound across levels (e.g., character→word→sentence) is not quantified; mitigation strategies (e.g., iterative refinement, cross-level consistency losses) are not explored.
- Train–test mismatch in splitting: Training uses stochastic BOS sampling while inference uses a hard threshold (p_BOS > 0.5); the impact of this discrepancy on calibration, stability, and quality is unmeasured.
- Gradient adequacy through non-differentiable decisions: The approach relies on indirect gradients via Probabilistic Attention rather than explicit relaxations (e.g., straight-through, Gumbel-Softmax, REINFORCE with baselines); it is unclear whether this provides sufficient and unbiased credit assignment to BOS predictors.
- Sensitivity to BOS threshold and calibration: No analysis of how thresholding p_BOS or existence-probability calibration affects length control, truncation errors, or over-/under-segmentation across domains and languages.
- Segment length regularization trade-offs: The penalties for “short” and “long” segments are described, but their ablation, tuning methodology, and effects on compression quality and reconstruction are not presented.
- Existence probability design and independence assumptions: p_k is defined as P(k exists | all prior exist); whether this factorization is expressive enough (vs. learned non-Markovian termination) or leads to pathological decays is untested.
- Numerical stability of cumulative products: Existence probabilities and shares rely on long cumulative products of (1 − p_BOS); underflow, gradient saturation, and the need for log-space tricks or clamping are not analyzed (a minimal log-space sketch appears after this list).
- Probabilistic Attention theory and alternatives: The log(p_k/p_q) score adjustment is proposed without formal guarantees (e.g., on optimization stability, variance, or softmax calibration) or comparisons to soft masking baselines and head-wise variants.
- Effect on attention interpretability and head diversity: It is unknown whether existence scaling reduces head redundancy or causes collapse; no head-level analyses are provided.
- MLM objective design at latent level: The contrastive MLM over latent vectors (rather than token vocabulary) is novel but under-specified; its efficacy vs. standard reconstruction/regression or InfoNCE variants lacks ablation.
- Collapse-prevention loss side effects: Enforcing near-zero cosine similarity across documents may suppress beneficial semantic clustering (e.g., paraphrases); trade-offs are not studied.
- DDMM specifics and guarantees: The mixed-step objective is described qualitatively; formalization of its relation to DDPM/DDIM, convergence properties, and sensitivity to noise schedules or step counts T is missing.
- Denoiser–compressor loop stability: The repeated decompress–denoise–recompress cycle could drift; no diagnostics on fixed points, cycle consistency, or failure modes are reported.
- Termination/length control: Truncation by p_exist < ε is heuristic; calibration, length controllability (e.g., target lengths), and failure cases (premature cutoff or runaway tails) are not evaluated.
- Stitcher ground-truth offsets: Training uses “ground-truth offsets,” but it is unclear how these are derived when splits are stochastic; robustness in ambiguous or repetitive overlaps is not assessed.
- Stitcher failure modes: Risks of content bleeding, duplication, or inconsistent overlap reconciliation are not quantified; no comparisons to simple overlap-averaging or constrained decoding are provided.
- Overlap invariance guarantees: The extent to which stitched representations are invariant to the choice of overlap and BOS density lacks formal tests or proofs.
- Resource and reproducibility details: Model sizes, parameter counts per module, optimizer settings, schedules, and training budgets are insufficiently specified for faithful reproduction and scaling studies.
- Domain robustness claims: Assertions about better handling of noise and domain shift are qualitative; evaluations on OCR/text-noise benchmarks, specialized domains (code, biomedical), or adversarial perturbations are absent.
- Multilingual and non-segmented scripts: The emergence of meaningful boundaries is only discussed for English; performance and segmentation behavior in languages without whitespace (e.g., Chinese, Japanese), morphologically rich languages, or mixed-script corpora are untested.
- Code-switching and non-linguistic text: How splitting and compression behave on code, math, tables, or emoji-rich content is unknown.
- Downstream utility: Whether Zonkey’s latent representations benefit classification, retrieval, or QA (zero/few-shot) is not evaluated, limiting practical impact assessment.
- Comparison to learned tokenizers: No direct head-to-head with Unigram/SentencePiece, WordPiece, BLT’s entropy objectives, or other differentiable segmenters on segmentation quality and downstream performance.
- Hyperparameter robustness: Sensitivity of quality to num_compression_vectors, max_seq_len, ε thresholds, and loss weights (w_l) is not explored.
- Autoregressive decoder capacity: The “lightweight” 1–2 layer downward transformer may bottleneck decompression; ablations with deeper decoders are missing.
- Alignment between latent and surface text: The final mapping from continuous reconstructions to characters (level-0 token loss) is not fully specified for generation; decoding fidelity and error types are not analyzed.
- Safety and controllability: No experiments on toxicity, instruction following, or guided generation (e.g., classifier-free guidance in latent space) are provided.
- Data leakage and memorization risks: Given end-to-end training from characters, memorization and privacy concerns are not evaluated.
- Theoretical connection to likelihood: There is no derivation connecting the training objectives to a tractable likelihood or ELBO, complicating principled model comparison.
- Memory footprint with overlaps: Overlapping segments increase compute/memory; efficiency strategies (e.g., dynamic overlap, sparse attention) and their accuracy trade-offs are not examined.
- Curriculum over hierarchy depth: The schedule for advancing levels, freezing/unfreezing modules, and its impact on stability and final quality lacks detail and ablations.
- Robustness to repetition and long dependencies: How the model handles repeated phrases, coreference, and discourse-level coherence across stitched segments is unmeasured.
- Failure case taxonomy: The paper does not catalog typical errors (e.g., truncation artifacts, overlap inconsistencies, semantic drift) or propose diagnostics to detect and mitigate them.
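As a side note on the cumulative-products item above, the usual numerical remedy is to accumulate log(1 − p_BOS) rather than multiplying probabilities directly. The sketch below shows that generic pattern (it is not something the paper specifies); the resulting log-probabilities can feed a log(p_k/p_q)-style score adjustment without ever exponentiating.

```python
import torch

def log_existence(p_bos, eps=1e-6):
    """Cumulative existence probabilities kept in log-space.

    torch.cumprod(1 - p_bos) can underflow to exactly 0 on long sequences,
    zeroing gradients; a cumulative sum of log(1 - p_bos) stays finite and can
    be consumed directly by log-ratio score adjustments.
    """
    log_survive = torch.log1p(-p_bos.clamp_max(1 - eps))  # log(1 - p_bos), safe near 1
    return torch.cumsum(log_survive, dim=0)

log_p = log_existence(torch.full((50_000,), 0.01))
print(log_p[-1])             # about -502, perfectly finite
print(torch.exp(log_p[-1]))  # exponentiating here underflows to 0 in float32
```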
Practical Applications
Immediate Applications
The following list outlines practical uses that can be deployed now, leveraging the released source code and the sentence-level capabilities demonstrated with Wikipedia-scale training.
- Adaptive tokenizer discovery for niche corpora
  - Sectors: healthcare, legal, finance, software localization
  - Tool/Product/Workflow: train the Segment Splitter on domain-specific text to learn context-aware splits (e.g., medical abbreviations, legal citations) and export the learned segmentation as a plug-in preprocessor for existing Transformer pipelines
  - Assumptions/Dependencies: requires domain-corpus fine-tuning; integration layer to map learned segments to downstream models; quality assessment versus BPE is necessary; current model stability demonstrated at sentence-level
- Noisy text preprocessing for OCR/ASR/social media
  - Sectors: software, education, media
  - Tool/Product/Workflow: use probabilistic BOS and existence probabilities to detect word/sentence boundaries and normalize noisy inputs (typos, missing punctuation); apply denoising to improve readability of transcripts and scans before downstream tasks
  - Assumptions/Dependencies: needs fine-tuning on noisy-domain data; output quality depends on training distribution similarity; human-in-the-loop recommended
- Sentence-level infilling for editorial workflows
  - Sectors: media, content platforms, education
  - Tool/Product/Workflow: plug Zonkey’s decompression/denoising loop into an editor to fill missing sentences or refine partial drafts with variable-length generation and soft truncation
  - Assumptions/Dependencies: current capabilities are sentence-level; performance on long-form content is limited; editorial review required
- Hierarchical embeddings for retrieval and clustering
  - Sectors: search/retrieval, data engineering, enterprise knowledge management
  - Tool/Product/Workflow: use the Compressor to produce word-/sentence-like vectors for indexing, clustering, deduplication, and semantic search; combine with in-batch contrastive objectives to improve separation
  - Assumptions/Dependencies: embeddings reflect training corpus; evaluation and calibration needed for specific domains; consider privacy/bias implications in latent spaces
- Long-context chunk stitching library
  - Sectors: software infrastructure, LLM tooling
  - Tool/Product/Workflow: integrate the Segment Stitcher as a differentiable reassembly module in long-context pipelines to reduce boundary artifacts across overlapping chunks
  - Assumptions/Dependencies: requires adaptation to non-Zonkey encoders/decoders; depends on having existence probabilities or analogous confidence scores
- Research prototyping platform for differentiable tokenization and probabilistic attention
  - Sectors: academia, applied research labs
  - Tool/Product/Workflow: reproduce experiments; benchmark against byte/character-level baselines; test curriculum strategies and diffusion schedules; study emergent boundary phenomena
  - Assumptions/Dependencies: GPU access; custom evaluation metrics (non-token-based) needed; scaling will affect conclusions
- Code-mixed and multilingual preprocessing
  - Sectors: localization, translation, multilingual NLP
  - Tool/Product/Workflow: apply the Splitter to learn splits that adapt to mixed scripts and languages without fixed vocabularies; feed hierarchical vectors into multilingual encoders
  - Assumptions/Dependencies: training on multilingual/code-mixed corpora is required; compatibility with downstream multilingual models must be validated
- Dataset diagnostics via existence shares and uncertainty hotspots
  - Sectors: data engineering, dataset curation, ML ops
  - Tool/Product/Workflow: use existence shares and reconstruction losses to flag high-entropy or anomalous regions (e.g., passwords, garbled text) for curation or redaction
  - Assumptions/Dependencies: heuristics require calibration; not a substitute for formal PII detection; depends on training loss instrumentation
- Teaching modules for ML courses
  - Sectors: academia, online education
  - Tool/Product/Workflow: course labs on differentiable tokenization, soft/probabilistic masking, hierarchical diffusion, overlap stitching
  - Assumptions/Dependencies: students need basic PyTorch/JAX skills; small-scale datasets sufficient for demonstration
Long-Term Applications
The following use cases require further research, scaling, or engineering (e.g., deeper hierarchies, larger training runs, standardized evaluation, and robust safety controls).
- Fully end-to-end differentiable LLMs (no fixed tokenizer)
  - Sectors: software, cloud AI platforms
  - Tool/Product/Workflow: production frameworks where segmentation, compression, denoising, and generation are trained jointly from raw characters/bytes
  - Assumptions/Dependencies: substantial compute for multi-level hierarchies; rigorous benchmarks to demonstrate parity with SOTA; robust safety and evaluation protocols
- Domain-adapted enterprise models with on-the-fly tokenization
  - Sectors: healthcare, legal, finance, customer support
  - Tool/Product/Workflow: fine-tune the Splitter+Compressor on proprietary corpora to capture domain-specific boundaries and semantics; deploy as internal LLMs for drafting, analysis, and QA
  - Assumptions/Dependencies: data governance and privacy; controllability and auditability of learned splits; risk management for generative outputs
- Parallel long-form generation with hierarchical diffusion
  - Sectors: media, technical documentation, education
  - Tool/Product/Workflow: generate paragraphs and documents by processing compression vectors in parallel at higher levels (paragraph/document) for faster long-form synthesis
  - Assumptions/Dependencies: deeper hierarchical training; improved denoiser stability; human oversight and editorial standards
- Streaming event/time-series modeling using Probabilistic Attention
  - Sectors: energy (sensor data), finance (tick data), cybersecurity (logs), IoT
  - Tool/Product/Workflow: adapt existence probabilities to irregular, potentially infinite sequences; soft truncation for streams; anomaly detection with hierarchical representations
  - Assumptions/Dependencies: extension beyond text to numerical/event data; task-specific decoders; evaluation on streaming benchmarks
- On-device robust keyboards and assistants for noisy text
  - Sectors: mobile, accessibility
  - Tool/Product/Workflow: small hierarchical models that handle typos, spacing, and punctuation; variable-length suggestions; code-mixed input support
  - Assumptions/Dependencies: compression to edge-friendly sizes; distillation from larger Zonkey-style models; privacy-preserving on-device inference
- Content moderation and anti-obfuscation analysis
  - Sectors: policy, social platforms, safety
  - Tool/Product/Workflow: adaptive tokenization to handle obfuscated or adversarial text; hierarchical embeddings for intent detection in noisy or stylized content
  - Assumptions/Dependencies: careful fairness and bias checks; adversarial robustness evaluations; human review pipelines
- Semantic compression and archival
  - Sectors: cloud storage, enterprise records management
  - Tool/Product/Workflow: store documents as hierarchical latent codes with reversible reconstruction and semantic indexes; speed up search and reduce storage
  - Assumptions/Dependencies: fidelity guarantees; compression ratios versus cost trade-offs; legal and compliance considerations for lossy reconstruction
- Hierarchy-aware RAG (retrieval-augmented generation)
  - Sectors: enterprise search, knowledge assistants
  - Tool/Product/Workflow: use learned splits to produce semantically aligned chunks; stitch retrieved segments with existence-aware blending; improve context coherence in generation
  - Assumptions/Dependencies: integration with vector databases; end-to-end evaluation; compatibility with existing RAG stacks
- Byte-level program synthesis and code modeling
  - Sectors: developer tools, cybersecurity
  - Tool/Product/Workflow: model code and binaries without fixed token vocabularies; detect patterns in obfuscated or minified code; variable-length generation for code templates
  - Assumptions/Dependencies: large-scale code corpora; domain-specific evaluation (tests, compiles); safety controls to avoid insecure code patterns
- Clinical text normalization and EHR note processing
  - Sectors: healthcare
  - Tool/Product/Workflow: adaptive segmentation for medical jargon; variable-length denoising to clean notes and align sections; downstream support for coding and summarization
  - Assumptions/Dependencies: HIPAA-compliant workflows; clinician-in-the-loop validation; training on healthcare-specific datasets
Glossary
- atanh transformation: A mathematical transformation using the inverse hyperbolic tangent to stabilize or reshape similarity scores in contrastive objectives. "We use an atanh transformation and in-batch negatives, similarly to how we compute our MLM compressor loss."
- autoregressive decompression: A decoding process that generates outputs sequentially, each step conditioned on previously generated elements, used here for expanding compressed representations. "In causal decoders (used for autoregressive decompression), the scaling aligns with the masking:"
- Beginning-of-sequence (BOS): A marker or decision indicating the start of a new segment or token, here learned probabilistically by the tokenizer. "learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits"
- Bidirectional encoders: Transformer encoders that attend to both past and future positions in a sequence to build contextual representations. "In bidirectional encoders, the mechanism is applied to all position pairs without causality constraints"
- Byte Pair Encoding (BPE): A common subword tokenization algorithm that merges frequent byte or character pairs to form a fixed vocabulary. "fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE)"
- Causal decoders: Decoder architectures that enforce future masking to preserve causality, generating tokens left-to-right. "In causal decoders (used for autoregressive decompression), the scaling aligns with the masking:"
- CLS vectors: Learnable classification/summarization vectors prepended to sequences to aggregate information, akin to BERT’s [CLS] token. "we prepend (a hyper-parameter, e.g., 4) learnable classification (CLS) vectors to the input."
- Contrastive cosine similarity: A training objective that pulls similar representations together and pushes dissimilar ones apart using cosine similarity and negative samples. "Contrastive cosine similarity (with in-batch negatives and atanh transformation) between denoised segments and ground-truth inputs."
- Cross-attention: An attention mechanism where one sequence attends to another, used here to refine overlaps between segments. "via constrained cross-attention."
- denoising autoencoder: A model trained to reconstruct clean inputs from noised versions, improving robustness and representation quality. "training precise recovery akin to a denoising autoencoder~\citep{vincent2008extracting}."
- Denoising Diffusion Implicit Models (DDIM): A class of diffusion models enabling faster, more deterministic sampling by skipping steps. "Denoising Diffusion Implicit Models (DDIM; \cite{song2021denoising})."
- Denoising Diffusion Mixed Model (DDMM): The paper’s hybrid diffusion approach that balances small stochastic steps with larger deterministic leaps for efficient denoising. "with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space."
- Denoising Diffusion Probabilistic Models (DDPM): Diffusion models that iteratively denoise data with small stochastic steps, yielding high-fidelity samples. "Denoising Diffusion Probabilistic Models (DDPM; \cite{ho2020denoising})"
- Diffusion-based generation: Generating data by reversing a noise-adding process, iteratively denoising from noise to signal. "Diffusion-based generation, while powerful for images, struggles with text due to discrete tokens"
- energy-based variants: Generative modeling approaches that use energy functions to define data distributions or preferences. "as well as energy-based variants \citep{nvidia2025edlm}."
- existence probabilities: Position-wise probabilities indicating whether sequence positions “exist,” enabling soft variable-length handling. "with each position assigned an existence probability $p_k$"
- Existence shares: Normalized weights that distribute loss across overlapping segments so each original position contributes uniformly. "Existence shares: For each position in the original sequence, a per-segment weight tensor where the position appears, normalized such that shares sum to 1 across all overlapping segments containing it."
- Gaussian perturbations: Adding Gaussian noise to simulate diffusion steps or augment data for robustness. "Dirty: after large accumulated noise (simulating multi-step diffusion via Gaussian perturbations~\citep{song2021denoising}) and denoising"
- Gumbel-Softmax: A differentiable approximation for sampling from categorical distributions, often used to relax discrete choices. "Note that alternatives like Gumbel-Softmax~\citep{jang2017categorical} would not suffice here, as the hard choice"
- hard masking: Applying masks that fully block attention to certain positions, typically for padding or causality. "a generalization of traditional hard masking: if cumulative existence probabilities drop sharply from 1 to 0 (or ), it yields equivalent results to conventional masking."
- Hierarchical Diffusion LLMs (HDLM): Diffusion-based LMs that incorporate hierarchy, e.g., across semantic scales. "Notably, Hierarchical Diffusion LLMs (HDLM) \citep{zhu2025hdlm} introduce semantic scale prediction in a discrete diffusion framework, enabling hierarchical generation."
- hierarchical invariance: A training/property goal ensuring consistent representations across overlapping segments and levels. "The Stitcher enforces hierarchical invariance by ensuring that overlapping regions produce nearly identical representations after denoising and refinement, thereby stabilizing multi-level abstraction."
- in-batch negatives: Using other samples within the same batch as negative examples for contrastive learning. "We use an atanh transformation and in-batch negatives, similarly to how we compute our MLM compressor loss."
- latent space: A continuous representation space where inputs are embedded and manipulated, often for generation or compression. "denoising in latent space"
- Masked Language Modeling (MLM): A pretraining objective where masked inputs are predicted from context to learn rich representations. "Our model primarily employs Probabilistic Attention in encoders for hierarchical compression, diffusion denoising, and Masked Language Modeling (MLM) tasks,"
- mode collapse: A failure mode in generative models where outputs lack diversity, collapsing to few modes. "heighten risks of mode collapse and oversmoothing in unanchored latents"
- noise-conditioning vector: A vector appended or prepended to condition the model on the current noise level during diffusion. "A noise-conditioning vector $\mathbf{v}_t = (1 - t) \cdot \mathbf{v}_{\text{clean}} + t \cdot \mathbf{v}_{\text{noisy}}$ is prepended"
- noise schedule: A predefined plan for the magnitude of noise across diffusion steps, often decreasing over time. "Over diffusion steps with a linear noise schedule decreasing from to , the following operations are performed:"
- non-autoregressive sampling: Generating multiple tokens in parallel without conditioning on previously generated outputs. "Diffusion models have shown promise for text generation by enabling non-autoregressive sampling and fine-grained control."
- out-of-vocabulary (OOV): Tokens not present in a model’s fixed vocabulary, often causing representation issues. "which can lead to out-of-vocabulary (OOV) issues"
- Probabilistic Attention: An attention mechanism that scales scores using position-wise existence probabilities to softly model variable sequence lengths. "Probabilistic Attention modulates the raw attention scores $s_{qk} = \frac{\mathbf{Q}_q^\top \mathbf{K}_k}{\sqrt{d}}$ by incorporating existence ratios:"
- projection space: A lower-dimensional space used to compute similarities more efficiently or robustly. "Pairwise similarities between consecutive segments are computed in a reduced-dimensional projection space ($d_{\text{model}}/4$)."
- Segment Splitter: A differentiable tokenizer that learns where to start segments by predicting BOS probabilities. "The Segment Splitter serves as Zonkey's hierarchical tokenizer,"
- Segment Stitcher: A differentiable module that reassembles overlapping segments into coherent sequences while enforcing consistency. "The Segment Stitcher serves as the symmetric counterpart to the Segment Splitter in Zonkey."
- soft-masked attention: Attention that uses continuous weights instead of hard masks, enabling gradient flow through mask decisions. "differentiable soft-masked attention \citep{athar2022differentiable} modulates contributions with continuous probabilities"
- soft truncation: Gradually reducing the influence of low-probability positions instead of cutting sequences with hard EOS tokens. "enabling soft truncation during inference when $p_{\text{exist}} < \epsilon$"
- variance-preserving: A noise injection scheme that maintains the overall variance (norm) of representations during diffusion. "Noise injects post-reshaping via variance-preserving: $\tilde{\mathbf{c}} = \sqrt{t} \cdot \boldsymbol{\epsilon} + \sqrt{1 - t} \cdot \mathbf{c}$,"
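As a worked example of the rule quoted above, mixing with weights $\sqrt{t}$ and $\sqrt{1-t}$ keeps the variance of (approximately) unit-variance latents constant across noise levels; the snippet below is a generic illustration, not the paper's code.

```python
import torch

def add_noise_variance_preserving(c, t):
    """Mix clean latents c with Gaussian noise at level t in [0, 1] so that,
    for unit-variance c, the result keeps unit variance: t + (1 - t) = 1."""
    eps = torch.randn_like(c)
    return t**0.5 * eps + (1 - t)**0.5 * c

c = torch.randn(4, 64)                     # toy "clean" compression vectors
noisy = add_noise_variance_preserving(c, t=0.3)
print(noisy.var().item())                  # stays near 1.0 regardless of t
```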