- The paper introduces Scratchpad Patching, a mechanism that decouples compute from patch size to directly mitigate patch lag in patch-based byte-level models.
- It employs adaptive, transient scratchpad states triggered by local entropy to refine intra-patch predictions without increasing persistent KV cache usage.
- Empirically, SP models with 16-byte patches match or closely approach the byte-level baseline while using a 16× smaller KV cache and 3–4× less inference compute, and they recover quality on code generation and multilingual tasks.
Authoritative Summary: Scratchpad Patching for Byte-Level LLMs
Motivation and Problem Characterization
Tokenization-based language modeling introduces inherent brittleness and restrictiveness by mapping raw text to fixed subword units, hindering adaptability and propagating artifacts such as prompt sensitivity and token glitches. Recent progress in tokenizer-free, byte-level modeling circumvents these pitfalls, but scaling such models is computationally prohibitive due to long sequence lengths. Patch-based byte-level models address efficiency by aggregating contiguous bytes into patches, enabling the main model trunk to operate over a shortened sequence. However, increasing patch size induces "patch lag": predictions for bytes early in a patch rely on stale context from the previous patch, causing a trade-off between compute/memory efficiency and modeling quality. The paper identifies patch lag as a fundamental, structural flaw in patch-based architectures.
Scratchpad Patching: Architecture and Mechanism
The proposed Scratchpad Patching (SP) is a generic mechanism that decouples compute allocation from patch size, directly mitigating patch lag. Rather than committing a single patch representation only at patch boundaries, SP introduces transient scratchpad states inside patches; each scratchpad aggregates all bytes seen up to its position and passes the result through the trunk network. Critically, these scratchpads are not persisted in the KV cache, so they do not increase the memory footprint or the sequence length seen by subsequent patches. Byte-level predictions within a patch can thus condition on fresher, locally refined patch representations. SP can be integrated with any patchification scheme (fixed-size, delimiter-based, entropy-based, learned). The most effective triggering policy is based on next-byte prediction entropy: a scratchpad fires at each byte position where local entropy exceeds a threshold, concentrating computation on information-dense regions of the byte stream.
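The entropy-triggered policy described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes fixed-size patches whose state is committed at the last byte of each patch, and it assumes per-byte entropies are already available from some small byte-level predictor. The names entropy_bits and plan_trunk_calls are hypothetical.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def plan_trunk_calls(entropies, patch_size, threshold):
    """Split byte positions into committed patch states and transient scratchpads.

    Assumption: a patch is committed at the last byte of every fixed-size patch.
    Any other position fires a scratchpad only if its entropy exceeds `threshold`.
    """
    committed, scratchpads = [], []
    for pos, h in enumerate(entropies):
        if (pos + 1) % patch_size == 0:
            committed.append(pos)      # persisted in the KV cache
        elif h > threshold:
            scratchpads.append(pos)    # used for prediction, then discarded
    return committed, scratchpads

# Toy entropy trace for 8 bytes (made-up values, in bits).
entropies = [0.2, 3.1, 0.4, 0.1, 2.7, 0.3, 0.2, 0.5]
committed, scratchpads = plan_trunk_calls(entropies, patch_size=4, threshold=2.0)
print(committed)    # [3, 7]
print(scratchpads)  # [1, 4]
```

In this toy trace only two intra-patch positions are uncertain enough to trigger an extra trunk pass, while the number of committed states stays at one per patch.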
Empirical Results and Numerical Highlights
Quality-Efficiency Frontier
SP consistently shifts the Pareto frontier on BPB (Bits-Per-Byte), task accuracy, and code generation quality versus sequence reduction factor and compute. At fixed sequence reduction (e.g., patches of 16 bytes), SP models match or closely approach the byte-level baseline in downstream accuracy, with a 16× smaller KV cache and 3–4× less inference compute. Performance degradation at large patch sizes is substantially recovered via SP, notably on fixed-size and delimiter-based patchifiers. The gains do not come simply from injecting more compute: under FLOPs-matched comparisons, the task performance gains are preserved for most patchifiers.
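The 16× cache figure follows from counting persisted trunk positions, which the following back-of-the-envelope sketch makes explicit. It assumes fixed-size patches, counts positions rather than bytes of memory, and ignores any local encoder or decoder state. The function trunk_footprint and the scratchpad count of 1000 are hypothetical, and the 3–4× compute figure is an empirical result from the paper that this arithmetic does not derive.

```python
def trunk_footprint(num_bytes, patch_size, num_scratchpads=0):
    """Count persisted trunk positions and total trunk passes for a byte sequence.

    Scratchpads add trunk passes but, being transient, add nothing to the cache.
    """
    committed = -(-num_bytes // patch_size)  # ceiling division
    return {
        "kv_cache_positions": committed,
        "trunk_calls": committed + num_scratchpads,
    }

byte_level = trunk_footprint(8192, patch_size=1)
with_sp = trunk_footprint(8192, patch_size=16, num_scratchpads=1000)

cache_reduction = byte_level["kv_cache_positions"] / with_sp["kv_cache_positions"]
print(cache_reduction)          # 16.0
print(with_sp["trunk_calls"])   # 1512
```

The cache size depends only on the patch size, while the number of trunk passes moves independently with the scratchpad budget, which is the decoupling the section describes.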
Robustness and Flexibility
SP-augmented models demonstrate robust performance across multilingual benchmarks (FLORES-200), narrowing the gap to byte-level models and outperforming tokenizer-based models, which suffer from script-specific biases. Inference-time compute can be adjusted post hoc (by varying scratchpad frequency and patch size) without retraining, exposing a smooth, monotonic trade-off curve in BPB and downstream task scores. In contrast, non-SP models degrade precipitously under such mismatched configurations.
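One way to picture the inference-time knob is to sweep the entropy threshold over a fixed entropy trace and count trunk passes. The sketch below assumes fixed-size patches and a synthetic, uniformly random entropy trace, so it shows only how compute responds to the threshold. It says nothing about BPB or task scores, which have to be measured on a real model. The function trunk_calls is hypothetical.

```python
import random

def trunk_calls(entropies, patch_size, threshold):
    """Count trunk passes: one per patch boundary plus one per triggered scratchpad."""
    calls = 0
    for pos, h in enumerate(entropies):
        at_boundary = (pos + 1) % patch_size == 0
        if at_boundary or h > threshold:
            calls += 1
    return calls

# Synthetic entropy trace in [0, 8) bits; a stand-in for a real byte-level predictor.
rng = random.Random(0)
entropies = [rng.uniform(0.0, 8.0) for _ in range(1024)]

# Lowering the threshold fires more scratchpads and therefore spends more compute.
sweep = {t: trunk_calls(entropies, 16, t) for t in (8.0, 4.0, 2.0, 0.0)}
for threshold, calls in sweep.items():
    print(threshold, calls)
```

At the highest threshold no scratchpad fires and the count reduces to one pass per patch. As the threshold drops, the count grows toward one pass per byte.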
For code generation (MBPP, HumanEval), SP models provide substantial increases in pass@1 rate with preserved or improved compute and cache efficiency compared to both byte-level and tokenizer-based baselines. For natural language understanding tasks (ARC, BoolQ, HellaSwag, etc.), SP efficiently recovers quality lost in aggressive patch regimes and enables simple patchifiers to be competitive with sophisticated learned boundary schemes.
Theoretical and Practical Implications
SP exposes the bottleneck of patch-based modeling as compute allocation rather than boundary placement or patch size. By decoupling these axes, SP repositions efficiency-oriented byte-level modeling as a practical alternative to token-based approaches, especially in multilingual and code settings where tokenization struggles. The adaptive, content-aware compute allocation can be regarded as a form of selective, transient recurrence at the patch level, akin to looped or pondered computation in universal transformer variants. SP's approach is orthogonal to improvements in patchification and opens up avenues for integrating richer update rules (e.g., more complex recurrence, hierarchical scratchpad scheduling) and for hierarchical multi-stage patching, which has not been systematically explored yet.
From a practical perspective, SP delivers a single-knob control of compute versus quality at inference, enabling flexible deployment in resource-constrained environments. The decoupling of scratchpad computation from persistent cache states further improves scalability in long-context modeling.
Interaction with Patchifiers and Ablation Insights
SP integrates with all patchifier families, but the interaction with learned boundary predictors (e.g., H-Net) can introduce redundancy when scratchpad triggers are spatially coupled with patch boundaries. Entropy-based and delimiter-based patching avoid this via threshold separation or coincidence suppression. Ablations confirm that entropy-based triggering is the most effective policy: dense, per-byte updates waste compute and can degrade performance in highly predictable domains.
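The summary does not spell out how threshold separation or coincidence suppression work, so the following is one plausible reading rather than the paper's procedure. Under that assumption, threshold separation uses a higher entropy threshold for patch boundaries than for scratchpads, and coincidence suppression drops any scratchpad trigger that lands on a patch boundary. Both function names and all threshold values are hypothetical.

```python
def split_by_thresholds(entropies, patch_threshold, scratchpad_threshold):
    """Assumed threshold separation: high entropy opens a boundary,
    moderate entropy fires a scratchpad, so the two never coincide."""
    if scratchpad_threshold >= patch_threshold:
        raise ValueError("scratchpad threshold must be below the patch threshold")
    boundaries, scratchpads = [], []
    for pos, h in enumerate(entropies):
        if h > patch_threshold:
            boundaries.append(pos)
        elif h > scratchpad_threshold:
            scratchpads.append(pos)
    return boundaries, scratchpads

def suppress_coincident(triggers, boundaries):
    """Assumed coincidence suppression: drop triggers that sit on a boundary."""
    boundary_set = set(boundaries)
    return [pos for pos in triggers if pos not in boundary_set]

entropies = [0.5, 2.5, 4.2, 1.0, 3.0, 5.1]  # made-up values, in bits
boundaries, scratchpads = split_by_thresholds(entropies, 4.0, 2.0)
print(boundaries)   # [2, 5]
print(scratchpads)  # [1, 4]
print(suppress_coincident([1, 2, 4, 5], boundaries))  # [1, 4]
```

Either rule keeps a scratchpad from duplicating work that a committed patch state already performs at the same position.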
Conclusions
Scratchpad Patching provides a principled solution to the structural inefficiencies of patch-based byte-level LLMs, enabling finer-grained compute allocation without increasing persistent sequence length or KV-cache usage. It consistently improves the quality-efficiency trade-off, offers robust inference-time flexibility, and makes patch-based architectures competitive with or superior to token-based baselines in domains where tokenization is suboptimal. The method is widely applicable and establishes compute allocation as the fundamental axis for efficient byte-level modeling. Future work includes extending SP to hierarchical patching, recurrent mechanisms, and explicit training-time compute reduction strategies.