Persistent Scratchpad Mechanisms

Updated 2 April 2026

A persistent scratchpad is a mutable memory buffer that retains and updates intermediate computation steps, enabling stateful processing across diverse tasks.
In sequence-to-sequence and transformer models, keeping intermediate results in the scratchpad improves metrics such as BLEU and ROUGE.
Applications span robotics, interactive notebooks, and hardware accelerators, where persistent state supports efficient, context-aware computation and OOD generalization.

A persistent scratchpad is an architectural or algorithmic mechanism for maintaining an explicit, mutable memory buffer over the course of computation, enabling models or systems to repeatedly read from and write to intermediate results as they process input or generate output. Persistent scratchpad concepts span multiple subfields, including sequence-to-sequence modeling, LLMs, vision-language-action (VLA) frameworks, interactive computational notebooks, hardware accelerators, and GPU-based stencil computation. Across these contexts, the defining property is that the scratchpad (memory, buffer, or auxiliary state) is not ephemeral but persists and evolves throughout the task, supporting more coherent, stateful, and context-aware computation.

1. Foundational Principles and Formalizations

The persistent scratchpad mechanism typically manifests as a sequence or memory buffer, written $M_t$, $S_t$, or $H^i$ depending on the work, that is updated as the computation unfolds. For neural sequence models, the scratchpad either augments existing state (e.g., encoder outputs, LM context) or serves as an external, read-write memory. In "Keeping Notes: Conditional Natural Language Generation with a Scratchpad Mechanism," the scratchpad is implemented by allowing the decoder to write back into the (multi-layer) encoder hidden states after every output token, with each encoder cell updated as $h_{t}^{i+1} = \alpha_{t}^{i} h_{t}^{i} + (1-\alpha_{t}^{i}) u^{i}$, where $t$ indexes the encoder cell, $i$ indexes the decoder step, $u^{i}$ is the update vector, and $\alpha_{t}^{i}$ is the per-cell mixing weight. The persistence arises because these updated states are read and further mutated at all future steps (Benmalek et al., 2019).
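The update rule above can be illustrated with a minimal sketch in plain Python. The function names, the use of a sigmoid to produce the mixing weight, and the `gate_logits` input are assumptions made for illustration. How the logits and the update vector are computed from model states is left out here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def write_back(cells, update, gate_logits):
    """Apply h_t <- alpha_t * h_t + (1 - alpha_t) * u to every encoder cell t.

    cells:       list of encoder cell vectors (lists of floats)
    update:      the shared update vector u for this decoder step
    gate_logits: one hypothetical pre-sigmoid score per cell
    """
    new_cells = []
    for h_t, logit in zip(cells, gate_logits):
        alpha = sigmoid(logit)  # alpha near 1 keeps the cell, near 0 overwrites it
        new_cells.append([alpha * h + (1.0 - alpha) * u for h, u in zip(h_t, update)])
    return new_cells

cells = [[1.0, 0.0], [0.0, 1.0]]
update = [0.5, 0.5]
gate_logits = [0.0, 20.0]  # cell 0 is mixed half and half, cell 1 is almost untouched
cells = write_back(cells, update, gate_logits)
```

Calling `write_back` once per emitted token, always on the cells returned by the previous call, is what makes the memory persistent in this sketch.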

In transformer-based LLMs, as in "Show Your Work," the scratchpad is a token buffer $M_t$ that accumulates intermediate computation steps and is included as part of the model’s context window for subsequent decoding. This formulation requires no architectural modification—the persistence arises solely from the autoregressive expansion of the buffer and repeated re-attention over its growing content (Nye et al., 2021).

For reinforcement learning and robotics settings, as shown in "Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks," the scratchpad $S_t$ is a collection of natural language tokens that record partial plans, spatial states, or subgoal completions. Updates to $S_t$ are triggered by special tokens in the model's output (e.g., <done>), and the updated scratchpad is fed back as part of the model's input at the next step, enabling persistent tracking of both spatial and temporal task state (Haresh et al., 24 Feb 2026).
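A toy sketch of this trigger-driven loop is given below. It is not the authors' implementation: the plan, the `toy_policy` function, and the note format are hypothetical stand-ins, and the "observation" is simply a string that names a subgoal once it has been achieved.

```python
DONE = "<done>"  # trigger token, as in the example above

# Hypothetical two-step plan used only for this illustration.
PLAN = ["pick red block", "place red block on tray"]

def toy_policy(observation, scratchpad):
    """Stand-in for a VLA: reads the scratchpad to find the current subgoal."""
    completed = len(scratchpad)
    if completed >= len(PLAN):
        return ["noop"]
    subgoal = PLAN[completed]
    tokens = ["act:", subgoal]
    if observation == subgoal:  # the observation signals that the subgoal was achieved
        tokens.append(DONE)
    return tokens

def run_episode(policy, observations):
    scratchpad = []  # persists across steps and is passed back in as input
    for obs in observations:
        tokens = policy(obs, scratchpad)
        if DONE in tokens:
            scratchpad.append(f"done: {PLAN[len(scratchpad)]}")
    return scratchpad

notes = run_episode(
    toy_policy,
    ["nothing yet", "pick red block", "nothing yet", "place red block on tray"],
)
```

The policy itself holds no state between calls. Everything it knows about past progress comes from the scratchpad it receives.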

Formally, in the context of reasoning tasks and efficient learning, the "inductive scratchpad" paradigm enforces that state $s_i$ at each computation step is computed from $(Q, s_{i-1})$ by a single learned transition function $g$ , with careful masking and reindexing to guarantee inductive invariance and OOD generalization (Abbe et al., 2024).

2. Persistent Scratchpad in Neural Sequence Modeling

The persistent scratchpad mechanism originated in seq2seq architectures for conditional sequence generation. The "Scratchpad Mechanism" alters the canonical attentional decoder as follows: after emitting each token, the decoder computes a global update vector $u^{i}$ and a write-attention weight $\alpha_{t}^{i}$ for each encoder cell. The write-back step mutates the encoder states, such that at each future timestep, attention is applied over these evolving representations. This allows the model to "keep soft notes" of what has been generated, guiding future output toward fluency and coherence.

Experimental results across machine translation, question generation, and summarization demonstrate quantitative improvements, including higher BLEU and ROUGE scores and human-judged fluency. Notably, state-of-the-art or comparable metrics are achieved in IWSLT machine translation and CNN/DailyMail summarization benchmarks, with the scratchpad reducing required training time and eliminating the need for auxiliary coverage penalties (Benmalek et al., 2019).

Crucially, the scratchpad is realized without introducing explicit memory modules: the encoder serves as persistent, differentiable memory, with write/overwrite mixtures computed directly from model states at each step. This design can generalize to transformer architectures by treating any encoder layer’s outputs as scratch cells.

3. Persistent Scratchpad for Stepwise Computation in Transformers

Large pre-trained LLMs have limited innate capacity for multi-step computation without intermediate supervision. The "Show Your Work" approach establishes that explicitly emitting intermediate steps into a persistent scratchpad buffer ($M_t$) enables such models to solve long-horizon tasks. The scratchpad consists of a sequence of delimiters, natural language working, numerical steps, or program states, embedded as part of LM context (Nye et al., 2021).

$$M_t = (m_1, m_2, \ldots, m_t), \qquad m_t \sim p_\theta(\,\cdot \mid x, M_{t-1})$$

where $x$ is the fixed prompt and each $m_t$ is sampled autoregressively. The transformer re-attends to all prior scratchpad content and the fixed prompt at every generation step. No architecture change is required, beyond increasing the context window to accommodate scratchpad and answer.
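The growth of the buffer can be sketched with a toy generation loop. The `next_token` function below is a deterministic stand-in for a language model, written only for illustration. It emits the running total of a list of numbers, one step per call, and receives the prompt plus the whole buffer each time.

```python
END = "<end>"

def next_token(prompt, buffer):
    """Toy stand-in for an LM: sees the fixed prompt and the full scratchpad so far."""
    k = len(buffer)
    if k < len(prompt):
        previous_total = buffer[-1] if buffer else 0
        return previous_total + prompt[k]  # write the next intermediate result
    return END

def generate(prompt, max_steps=32):
    buffer = []  # the scratchpad: it only ever grows during generation
    for _ in range(max_steps):
        token = next_token(prompt, buffer)
        if token == END:
            break
        buffer.append(token)
    return buffer

trace = generate([3, 4, 5])
answer = trace[-1]
```

Here `trace` holds the intermediate totals and `answer` is read off the last scratchpad entry, mirroring the idea that the final output is conditioned on the recorded working.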

Key empirical results include:

Long addition: direct prediction yields <10% accuracy even at 1B parameters, while persistent scratchpad improves to >90% for in-distribution (≤8 digits) and ~80%/60% accuracy out-of-distribution (9/10 digits) for models >100M params.
Polynomial evaluation and Python program execution: per-term and traced intermediate step supervision using a scratchpad vastly improves accuracy over direct mapping, especially in few-shot and OOD settings.

Memory growth remains linear in the number of computational steps, and context window size poses limits. Suggested strategies include periodic summarization or learned truncation to compress the scratchpad as it grows.
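One simple way to picture the summarization strategy is a fixed-rule compressor that folds the oldest entries into a single summary entry once a budget is exceeded. This is a hand-written rule, not the learned truncation mentioned above, and both the `compress` helper and the `summarize` callback are hypothetical.

```python
def compress(buffer, budget, summarize):
    """Return a buffer of at most `budget` entries (budget >= 1).

    If the buffer is too long, the oldest entries are replaced by one
    summary entry and the most recent `budget - 1` entries are kept verbatim.
    """
    if len(buffer) <= budget:
        return list(buffer)
    keep = budget - 1
    split = len(buffer) - keep
    head, tail = buffer[:split], buffer[split:]
    return [summarize(head)] + tail

steps = ["s1", "s2", "s3", "s4", "s5"]
short = compress(steps, budget=3, summarize=lambda xs: f"<summary of {len(xs)} steps>")
```

In a real system the `summarize` callback could itself be a model call. What the summary must preserve is task dependent and is not addressed by this sketch.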

4. Inductive Scratchpad and Efficient Reasoning

Persistent scratchpads are not universally sufficient for OOD generalization or learning high "globality degree" distributions. In "How Far Can Transformers Reason?", three paradigms are analyzed:

Agnostic scratchpads (unsupervised intermediate memory) fail to overcome global dependency barriers.
Educated scratchpads (supervised, gold intermediate steps) can solve training-length tasks but fail to extrapolate length.
Inductive (persistent) scratchpads, in which a single step function $g$ is learned and applied repeatedly with masking/reindexing, permit effective OOD generalization (e.g., 6× length for addition), by constraining each step to only attend to the fixed input and previous state.

The masking ensures that at each induction step, the attention scope remains small and invariant, matching the statistical structure of tasks such as parity or DFS-based cycle detection. Empirical results show persistent/inductive scratchpad paradigms achieve perfect generalization far beyond training length on algorithmic tasks, while non-inductive paradigms collapse or overfit (Abbe et al., 2024).
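The inductive structure can be sketched on parity, with a hand-written step function standing in for the learned one. The state layout (a position and a running parity bit) is an assumption chosen for illustration, and the attention masking and reindexing used in the actual method are not modeled. The sketch only shows that each step depends on the question and the previous state, never on earlier history.

```python
def g(question, state):
    """One induction step: uses only the question Q and the previous state."""
    position, parity = state
    return (position + 1, parity ^ question[position])

def run(question):
    state = (0, 0)  # initial state: nothing read yet, parity 0
    while state[0] < len(question):
        state = g(question, state)  # earlier states are discarded
    return state[1]

short_result = run([1, 0, 1, 1])
long_result = run([1] * 101)
```

Because the same `g` is applied at every step, nothing in the procedure depends on the input length, which is the property the inductive paradigm is designed to enforce.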

5. Applications in Robotics, Notebook Authoring, and Hardware

Vision–Language–Action

In robotic control, persistent scratchpads realized as language token buffers confer both spatial and temporal memory to VLAs. For non-recurrent models, the scratchpad enables substantial gains (e.g., ~54% vs. 5% baseline on five memory-dependent manipulation tasks in ClevrSkills-Mem). Real-world pick-place tasks see 65% success with scratchpad, compared to 0% baseline for stateless VLA (Haresh et al., 24 Feb 2026). The approach injects plans, subgoal progress, and spatial states as persistent "notes to self" into the VLA inference loop with minimal architectural overhead.

Computational Notebooks

In interactive data analysis, persistent scratchpads address the tension between exploratory programming and notebook clarity. The Tidynote system realizes a persistent scratchpad as a sidebar where cells can be offloaded, edited, and eventually moved back to the main narrative. Scratchpad cells maintain execution state via a linear, fork-based execution protocol, guaranteeing notebook reproducibility and eliminating hidden state (Huang et al., 26 Feb 2026). Studies show that notebooks authored with a persistent scratchpad are significantly clearer, support diverse clarity strategies, and eliminate non-reproducible workflows.
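The source does not spell out the execution protocol, so the following is only one possible reading of "fork-based": a scratchpad cell runs against a copy of the main notebook state, so its side effects cannot leak back as hidden state. The function names and the use of `exec` with a deep-copied namespace are assumptions for illustration, not a description of Tidynote's implementation.

```python
import copy

def run_main_cell(source, namespace):
    """Run a main-narrative cell; its effects persist in the shared namespace."""
    exec(source, namespace)

def run_scratch_cell(source, main_namespace):
    """Run a scratchpad cell in a forked copy of the main state."""
    forked = copy.deepcopy(
        {k: v for k, v in main_namespace.items() if k != "__builtins__"}
    )
    exec(source, forked)
    forked.pop("__builtins__", None)  # drop the entry that exec adds
    return forked

main = {}
run_main_cell("data = [3, 1, 2]", main)
scratch = run_scratch_cell("data.sort(); top = data[-1]", main)
```

After this runs, `scratch` contains the sorted list and `top`, while `main["data"]` is unchanged. Re-running the main cells in order therefore gives the same result whether or not the scratch cell was ever executed.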

Hardware Accelerators and Temporal Blocking

Persistent scratchpads on hardware appear as high-density, low-leakage on-chip memory, such as STT-MRAM replacing SRAM in systolic-array DNN accelerators. These support storage of activations, weights, and error gradients during training. When co-optimized with write-energy tuning and bit-level heterogeneity, such persistent scratchpads deliver up to 22× improvement in system-level energy while maintaining training accuracy (Roy et al., 2023). In GPU stencil computations, persistent scratchpad memory is exploited for deep temporal blocking, in which a tile is buffered for $T$ time steps entirely in shared memory before being written back. This yields high arithmetic intensity and simplified code, with performance within 2–5% of state-of-the-art auto-generated schemes (Zhang et al., 2023).
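The idea of temporal blocking can be illustrated on the CPU with a 1D three-point averaging stencil. This is a conceptual sketch in plain Python, not GPU code and not the cited scheme: the local list stands in for shared memory, and the stencil, the fixed boundary values, and the tile size are assumptions chosen for illustration. Each tile is loaded once with a halo of width `T`, advanced `T` steps locally, and only its interior is written back.

```python
def step(u):
    """One sweep of a 3-point average; the two endpoints of `u` are held fixed."""
    if len(u) < 3:
        return list(u)
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]

def naive(u, T):
    """Reference: T full sweeps over the whole array."""
    for _ in range(T):
        u = step(u)
    return u

def temporally_blocked(u, T, tile):
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo, hi = max(0, start - T), min(n, end + T)  # halo of width T, clipped at the edges
        local = u[lo:hi]                              # one load into the "scratchpad"
        for _ in range(T):
            local = step(local)                       # T steps without touching `u`
        out[start:end] = local[start - lo:end - lo]   # write back the valid interior only
    return out

grid = [float((i * i) % 7) for i in range(20)]
reference = naive(grid, 3)
blocked = temporally_blocked(grid, 3, tile=4)
```

The halo cells become stale by one position per step, which is why a halo of width `T` is enough to keep the tile interior correct after `T` steps. The price is redundant computation in the halos, traded for fewer loads and stores of the full array.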

6. Limitations, Scalability, and Future Directions

Across domains, the key limitation of persistent scratchpads is growth in memory, context size, or execution overhead. Transformer-based models are bounded by context window, leading to the need for memory summarization, truncation, or hybrid hierarchical scratchpad schemes (Nye et al., 2021; Abbe et al., 2024). In VLA and notebook settings, scratchpad content must be periodically pruned or clustered, and enforcement of strict execution linearity may restrict exploration (Huang et al., 26 Feb 2026; Haresh et al., 24 Feb 2026).

Emerging directions include integration of retrieval-based or summarization modules, dynamic scratchpad management via RL, hybrid memory architectures on hardware, and generalization to multimodal or distributed agent settings. For sequence models, inductive scratchpad techniques will remain central to scaling reasoning and multi-step computation beyond rigid training regimens. For hardware, advances in non-volatile memory will further expand possible scratchpad architectures.

7. Summary Table of Persistent Scratchpad Implementations

Domain	Scratchpad Type	Functionality / Results
Seq2seq NLG (Benmalek et al., 2019)	Encoder hidden states	Fluency, coverage, SOTA BLEU/ROUGE; soft notes for decoder guidance
Transformers for computation (Nye et al., 2021)	Token buffer in context	>90% accuracy on long addition, reasoning OOD, no arch. changes
Inductive reasoning (Abbe et al., 2024)	Autoregressive state (masked)	OOD (up to 6×) length generalization, globality barrier broken
VLA/robotics (Haresh et al., 24 Feb 2026)	Language token log	+49–66 pp gain on long-horizon tasks; spatial/temporal state memory
Notebooks (Huang et al., 26 Feb 2026)	Interactive sidebar buffer	3–4× improved clarity; reproducibility; efficient workflow
DNN hardware (Roy et al., 2023)	Persistent STT-MRAM	15–22× energy reduction, negligible accuracy loss
GPU stencil (Zhang et al., 2023)	On-chip shared memory	150× arithmetic intensity, SOTA performance, simple implementation

The persistent scratchpad paradigm, across these domains, reflects a fundamental principle: by externalizing and maintaining mutable, structured intermediates, complex, memory-dependent tasks become tractable for stateless, sequential, or hardware-constrained systems.