Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Published 10 May 2026 in cs.CL and cs.LG | (2605.09630v1)

Abstract: Tokenizer-free LLMs eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.


Authors (6)

Summary

The paper introduces Scratchpad Patching, a mechanism that decouples compute from patch size to directly mitigate patch lag in patch-based byte-level models.
It employs adaptive, transient scratchpad states triggered by local entropy to refine intra-patch predictions without increasing persistent KV cache usage.
Empirical results show that at 16 bytes per patch, SP-augmented models match or closely approach the byte-level baseline with a 16× smaller KV cache over patches and 3–4× less inference compute, recovering quality on code generation and multilingual tasks.

Authoritative Summary: Scratchpad Patching for Byte-Level LLMs

Motivation and Problem Characterization

Tokenization-based language modeling introduces brittleness and rigidity by mapping raw text to fixed subword units, which hinders adaptability and propagates artifacts such as prompt sensitivity and token glitches. Recent progress in tokenizer-free, byte-level modeling circumvents these pitfalls, but scaling such models is computationally prohibitive because byte sequences are long. Patch-based byte-level models address efficiency by aggregating contiguous bytes into patches, so that the main model trunk operates over a shortened sequence. However, increasing patch size induces "patch lag": until a patch is fully observed, predictions for bytes within it must rely on a stale representation from the previous patch to preserve causality, and this lag widens as patches grow. The result is a trade-off between compute/memory efficiency and modeling quality. The paper identifies patch lag as a fundamental, structural flaw in patch-based architectures.
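To make the growth of patch lag concrete, the following minimal sketch counts, for fixed-size patches, how many already-observed bytes are not yet reflected in any committed patch representation when a given byte is predicted. The function names and this particular way of measuring lag are illustrative assumptions, not definitions taken from the paper.

```python
def patch_lag(position: int, patch_size: int) -> int:
    """Bytes seen in the current, still-open patch when predicting the byte
    at `position` (0-indexed). These bytes are missing from the patch-level
    context, which was last refreshed at the previous patch boundary.
    Assumes fixed-size patches; this measure is an illustrative assumption."""
    return position % patch_size


def mean_patch_lag(patch_size: int) -> float:
    """Average lag over one full patch, i.e. (patch_size - 1) / 2."""
    return sum(patch_lag(i, patch_size) for i in range(patch_size)) / patch_size


if __name__ == "__main__":
    for size in (2, 4, 8, 16):
        print(f"patch size {size:2d}: mean lag {mean_patch_lag(size):.1f} bytes")
```

Under this simplified measure the average lag grows linearly with patch size, which matches the qualitative claim that larger patches leave predictions with staler context.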

Scratchpad Patching: Architecture and Mechanism

The proposed Scratchpad Patching (SP) is a generic mechanism that decouples compute allocation from patch size, directly mitigating patch lag. Rather than committing a single patch representation at patch boundaries, SP introduces transient scratchpad states inside patches, which aggregate all bytes seen up to that position and pass them through the trunk network. Critically, these scratchpads are not persisted in the KV cache, so they do not increase memory footprint or sequence length for subsequent patches. Byte-level predictions within a patch can thus condition on fresher, locally-refined patch representations. SP can be integrated with any patchification scheme (fixed-size, delimiter-based, entropy-based, learned). The most effective triggering policy is next-byte prediction entropy: scratchpads are fired adaptively at byte positions where local entropy exceeds a threshold, thereby concentrating computation on information-dense regions of the byte stream.
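The entropy-based trigger described above can be sketched as follows. This is a minimal illustration that assumes next-byte distributions are already available as lists of probabilities; `entropy_bits`, `scratchpad_positions`, and the threshold value are hypothetical names and settings, and the paper's actual implementation may differ.

```python
import math
from typing import List, Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy of one next-byte distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def scratchpad_positions(
    next_byte_dists: Sequence[Sequence[float]], threshold_bits: float
) -> List[int]:
    """Indices where the next-byte entropy exceeds the threshold.
    A scratchpad would be fired at each returned position (assumed policy)."""
    return [
        i
        for i, dist in enumerate(next_byte_dists)
        if entropy_bits(dist) > threshold_bits
    ]


if __name__ == "__main__":
    dists = [
        [1.0],                      # fully predictable byte
        [0.5, 0.5],                 # 1 bit of uncertainty
        [0.25, 0.25, 0.25, 0.25],   # 2 bits of uncertainty
        [0.9, 0.1],                 # mostly predictable
    ]
    print(scratchpad_positions(dists, threshold_bits=0.9))
```

In this toy input only the two high-uncertainty positions fire, so predictable spans receive no extra trunk computation.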

Empirical Results and Numerical Highlights

Quality-Efficiency Frontier

SP consistently shifts the Pareto frontier on BPB (Bits-Per-Byte), task accuracy, and code generation quality versus sequence reduction factor and compute. At fixed sequence reduction (e.g., patches of 16 bytes), SP models match or closely approach the byte-level baseline in downstream accuracy, with a 16× smaller KV cache over patches and 3–4× less inference compute. Performance degradation at large patch sizes is substantially recovered via SP, notably on fixed-size and delimiter-based patchifiers. SP is not simply injecting more compute: under FLOPs-matched comparisons, task performance gains are preserved for most patchifiers.
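The 16× cache figure follows from the sequence reduction itself, which a short arithmetic sketch makes explicit. It assumes one persistent trunk cache entry per committed patch and counts only the cache over patches, not any byte-level modules; the function names and the 8192-byte context are illustrative assumptions.

```python
import math


def patch_cache_entries(num_bytes: int, patch_size: int) -> int:
    """Persistent trunk cache entries, assuming one entry per committed patch.
    Scratchpads are transient and therefore add nothing to this count."""
    return math.ceil(num_bytes / patch_size)


def cache_reduction(num_bytes: int, patch_size: int) -> float:
    """Ratio of a per-byte cache (one entry per byte) to the patch cache."""
    return num_bytes / patch_cache_entries(num_bytes, patch_size)


if __name__ == "__main__":
    context_bytes = 8192  # hypothetical context length
    print(patch_cache_entries(context_bytes, 16))  # entries at 16 bytes per patch
    print(cache_reduction(context_bytes, 16))      # reduction vs. one entry per byte
```

The 3–4× compute saving is smaller than the 16× cache saving, which is consistent with scratchpads consuming trunk compute without adding persistent cache entries.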

Robustness and Flexibility

SP-augmented models demonstrate robust performance across multilingual benchmarks (FLORES-200), narrowing the gap to byte-level models and outperforming tokenizer-based models, which suffer script-specific biases. Inference-time compute can be flexibly adjusted post-hoc (by varying scratchpad frequency and patch size) without retraining, exposing a smooth, monotonic trade-off curve in BPB and downstream task scores. In contrast, non-SP models degrade precipitously under such mismatched configurations.
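The post-hoc compute adjustment can be pictured as sweeping the trigger threshold over a fixed trace of next-byte entropies. The sketch below assumes scratchpads fire wherever entropy exceeds the threshold; `trigger_rate` and the example values are hypothetical and only illustrate that a higher threshold never fires more scratchpads.

```python
from typing import Sequence


def trigger_rate(entropies_bits: Sequence[float], threshold_bits: float) -> float:
    """Fraction of byte positions that would fire a scratchpad."""
    if not entropies_bits:
        return 0.0
    fired = sum(1 for h in entropies_bits if h > threshold_bits)
    return fired / len(entropies_bits)


if __name__ == "__main__":
    trace = [0.1, 0.5, 1.2, 2.0, 3.1]  # made-up per-byte entropies, in bits
    for threshold in (0.0, 1.0, 2.5):
        rate = trigger_rate(trace, threshold)
        print(f"threshold {threshold:.1f} bits -> {rate:.0%} of positions fire")
```

Because the trigger rate is non-increasing in the threshold, a deployment could in principle lower extra trunk compute by raising the threshold, without retraining.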

Task Performance

For code generation (MBPP, HumanEval), SP models provide substantial increases in pass@1 rate with preserved or improved compute and cache efficiency compared to both byte-level and tokenizer-based baselines. For natural language understanding tasks (ARC, BoolQ, HellaSwag, etc.), SP efficiently recovers quality lost in aggressive patch regimes and enables simple patchifiers to be competitive with sophisticated learned boundary schemes.

Theoretical and Practical Implications

SP exposes the bottleneck of patch-based modeling as compute allocation rather than boundary placement or patch size. By decoupling these axes, SP repositions efficiency-oriented byte-level modeling as a practical alternative to token-based approaches, especially in multilingual and code settings where tokenization struggles. The adaptive, content-aware compute allocation can be regarded as a form of selective, transient recurrence at the patch level, akin to looped or pondered computation in universal transformer variants. SP's approach is orthogonal to improvements in patchification and opens up avenues for integrating richer update rules (e.g., more complex recurrence, hierarchical scratchpad scheduling) and for hierarchical multi-stage patching, which has not been systematically explored yet.

From a practical perspective, SP delivers single-knob control of compute versus quality at inference, enabling flexible deployment in resource-constrained environments. Decoupling scratchpad computation from persistent cache state further improves scalability in long-context modeling.

Interaction with Patchifiers and Ablation Insights

SP integrates with all patchifier families, but the interaction with learned boundary predictors (e.g., H-Net) can introduce redundancy when scratchpad triggers are spatially coupled with patch boundaries. Entropy-based and delimiter-based patching avoid this via threshold separation or coincidence suppression. Ablations indicate that entropy-based triggering is the most effective of the policies compared: dense, per-byte updates waste compute and can degrade performance in highly predictable domains.
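One plausible reading of coincidence suppression is to drop any scratchpad trigger that lands on a position where a patch boundary already commits a fresh representation. The sketch below encodes that reading; the summary does not spell out the rule, so `suppress_coincident` and its behavior are assumptions for illustration.

```python
from typing import Iterable, List


def suppress_coincident(
    trigger_positions: Iterable[int], boundary_positions: Iterable[int]
) -> List[int]:
    """Keep only triggers that do not coincide with a patch boundary.
    A boundary already refreshes patch-level context, so a scratchpad at the
    same position would be redundant (assumed interpretation)."""
    boundaries = set(boundary_positions)
    return [p for p in trigger_positions if p not in boundaries]


if __name__ == "__main__":
    triggers = [3, 7, 9, 15]    # hypothetical entropy-triggered positions
    boundaries = [7, 15, 23]    # hypothetical patch boundary positions
    print(suppress_coincident(triggers, boundaries))
```

With a learned boundary predictor, triggers and boundaries may both track high-uncertainty positions, in which case a filter of this kind would remove many triggers.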

Conclusions

Scratchpad Patching provides a principled solution to the structural inefficiencies of patch-based byte-level LLMs, enabling finer-grained compute allocation without increasing persistent sequence length or KV-cache usage. It consistently improves the quality-efficiency trade-off, offers robust inference-time flexibility, and makes patch-based architectures competitive with or superior to token-based baselines in domains where tokenization is suboptimal. The method is widely applicable and establishes compute allocation as the fundamental axis for efficient byte-level modeling. Future work includes extending SP to hierarchical patching, recurrent mechanisms, and explicit training-time compute reduction strategies.
