Scratchpad Methodologies in HPC & ML
- Scratchpad methodologies are explicit memory management techniques that improve performance by localizing data and reducing off-chip traffic in high-performance computing and ML systems.
- They enhance energy efficiency and throughput through hybrid architectures, sectorized designs, and power-gating strategies, resulting in significant speedups and reduced leakage.
- Integrated compiler and algorithmic frameworks optimize scratchpad allocation and multi-step reasoning, enabling advanced applications like neural network training and sequential reasoning.
Scratchpad methodologies encompass a diverse set of design patterns, architectures, and computational techniques underlying the use of scratchpad memory or external working space in high-performance computing, neural network systems, and reasoning-focused machine learning models. From both hardware and algorithmic perspectives, scratchpad memory is characterized by explicit data management, local placement, and direct manipulation, offering enhanced control over data layout, task synchronization, and intermediate state tracking compared to conventional cache-based or unified memory architectures. In algorithmic frameworks—especially in sequence modeling and reasoning—scratchpads facilitate intermediate step recording, dynamic memory, and “work exhibition” that enable robust multi-step computation and improved generalization.
1. Scratchpad Memory: Properties and Design Criteria
Scratchpad memories are fast, explicitly managed local storage modules, usually implemented as distributed SRAM, non-volatile memory (STT-MRAM), or custom memory blocks. Unlike hardware-managed caches, scratchpads require direct software control of data movement (e.g., explicit load/store or allocation commands). Key design criteria include:
- Latency and Bandwidth: Scratchpad memories offer single-cycle access and high aggregate bandwidth, supporting concurrent multithreaded operations and localized data exchange, as in GPU SMs (Jatala et al., 2016), custom accelerators (Marchisio et al., 2020), and many-core clusters (Cavalcante et al., 2021).
- Capacity and Scalability: Recent trends show a >40× increase in scratchpad memory capacity on GPGPUs over the last decade, enabling “deep temporal blocking” for multidimensional stencil computations (Zhang et al., 2023) and larger tile sizes in clusters (Cavalcante et al., 2021).
- Explicit Management: Placement, allocation, and release of scratchpad variables are handled via software and/or compiler, exemplified by the relssp instruction and corresponding control-flow analysis in GPU kernels (Jatala et al., 2016), or dynamic mapping in neural accelerator drivers (Bansal et al., 2019).
Scratchpad architectures are frequently optimized to enable:
- Data reuse;
- Minimization of off-chip (DRAM) traffic;
- Increased thread-level or core-level parallelism.
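To make the explicit-management contrast with caches concrete, here is a minimal sketch of scratchpad-style staging using Numba's CUDA bindings: shared-memory tiles are allocated, filled, synchronized, and reused entirely under software control. The tiled-matmul framing, the TILE constant, and the assumption that matrix dimensions are multiples of TILE are illustrative, not drawn from the cited papers.

```python
import numpy as np
from numba import cuda, float32

TILE = 16  # compile-time tile edge; matrix dims assumed multiples of TILE

@cuda.jit
def matmul_tiled(A, B, C):
    # Explicitly managed scratchpad: shared-memory tiles staged by software.
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    x, y = cuda.grid(2)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    acc = 0.0
    for step in range(A.shape[1] // TILE):
        # Explicit placement: each thread stages one element per tile, so a
        # tile is read from DRAM once and reused TILE times (data reuse).
        sA[ty, tx] = A[y, step * TILE + tx]
        sB[ty, tx] = B[step * TILE + ty, x]
        cuda.syncthreads()  # all loads visible before any compute
        for k in range(TILE):
            acc += sA[ty, k] * sB[k, tx]
        cuda.syncthreads()  # tile fully consumed before it is overwritten
    C[y, x] = acc

n = 256
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.zeros((n, n), dtype=np.float32)
matmul_tiled[(n // TILE, n // TILE), (TILE, TILE)](A, B, C)
```

The two barriers delimit the lifetime of each staged tile; the programmer, not the hardware, decides when the region may be overwritten, which is precisely the explicit-management property that distinguishes scratchpads from caches.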
2. Methodological Innovations and Memory Hierarchies
Recent work advances scratchpad methodologies through hybrid memory hierarchies, allocation frameworks, and new mechanisms for communication and energy optimization:
- Hybrid and Sectorized Designs: Partitioning large scratchpad blocks into banks and sectors enables differential power gating (HY-PG) and compositional allocation across data, weights, and accumulators. The DESCNet methodology demonstrates application-driven sizing: a shared multi-port memory is dimensioned for the worst-case combined footprint of data, weights, and accumulators across the execution schedule, whereas each memory in a separated design is dimensioned only for its own per-type maximum (Marchisio et al., 2020); a sizing sketch follows this list.
- Energy and Area Trade-offs: More compact memory technologies such as STT-MRAM provide 3–4× higher density and significantly lower leakage energy compared to SRAM (Roy et al., 2023). These trade-offs are quantified using device–system co-simulation frameworks incorporating Landau–Lifshitz–Gilbert physics for write error rates and empirical design space exploration (Marchisio et al., 2020, Roy et al., 2023).
- Power-Gating and Utilization Matching: Application-specific gating at the sector/bank level uses power-management units (PMUs) to switch off unused scratchpad regions during computation phases, virtually eliminating leakage overhead for memory-heavy inference (Marchisio et al., 2020, Cavalcante et al., 2021). Wakeup latencies are negligible relative to computation cycle durations.
- Communication Paradigms: Sidebar architectures implement dedicated outside-address-space scratchpad buffers between CPU and accelerator, engineered for low-latency intermediate data exchange via specialized instructions (sbLD/sbST), synchronization flags, and well-defined data placement conventions (Bansal et al., 2019). Sidebar-based systems demonstrate near-monolithic performance and energy characteristics and are extensible to accelerator-to-accelerator linking and streaming modes.
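Read concretely, the sizing and utilization-matching ideas above amount to simple max/sum reductions over per-phase footprints. The sketch below computes a shared-memory size, separated per-type sizes, and the number of banks that could be power-gated in each phase; the phase footprints, bank granularity, and function names are illustrative assumptions, not DESCNet's actual tooling.

```python
# Toy per-phase footprints (bytes) for a three-phase inference schedule.
PHASES = [
    {"data": 64_000, "wt": 128_000, "acc": 16_000},
    {"data": 96_000, "wt":  32_000, "acc": 48_000},
    {"data": 16_000, "wt":   8_000, "acc":  4_000},
]
BANK = 32_000  # power-gateable bank granularity (assumed)

def shared_size(phases):
    # Shared multi-port memory: worst-case combined footprint over phases.
    return max(sum(p.values()) for p in phases)

def separated_sizes(phases):
    # Separated design: each memory sized to its own per-type maximum.
    return {k: max(p[k] for p in phases) for k in phases[0]}

def ceil_banks(nbytes):
    return -(-nbytes // BANK)  # ceiling division into bank units

def gateable_banks(phases, total_banks):
    # Utilization matching: banks unused in a phase are candidates for gating.
    return [total_banks - ceil_banks(sum(p.values())) for p in phases]

total_banks = ceil_banks(shared_size(PHASES))
print(shared_size(PHASES))       # 208000
print(separated_sizes(PHASES))   # {'data': 96000, 'wt': 128000, 'acc': 48000}
print(gateable_banks(PHASES, total_banks))  # [0, 1, 6]
```

Note that the separated design totals 272 kB against 208 kB for the shared one here: exactly the kind of trade-off the DESCNet design-space exploration quantifies.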
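The sidebar handshake itself can be emulated in host code: a small dedicated buffer standing outside the "main" data structures, plus a ready flag. Here sbST/sbLD are modeled as ordinary writes and reads, and the flag as a threading event; this is a conceptual sketch of the convention, not the hardware mechanism.

```python
import threading

sidebar = bytearray(64)    # dedicated intermediate-exchange buffer
ready = threading.Event()  # synchronization flag

def cpu_producer():
    sidebar[:4] = (42).to_bytes(4, "little")  # "sbST": stage a datum
    ready.set()                               # raise the flag

def accel_consumer(out):
    ready.wait()                              # block until the flag is up
    out.append(int.from_bytes(sidebar[:4], "little"))  # "sbLD": consume

result = []
t = threading.Thread(target=accel_consumer, args=(result,))
t.start(); cpu_producer(); t.join()
print(result)  # [42]
```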
3. Compiler and Algorithmic Support for Scratchpad Management
Compiler frameworks play a pivotal role in scratchpad methodologies for both hardware and sequence models:
- Control-Flow and Lifetime Analysis: In GPU kernels, compiler analysis determines the minimum access ranges for variables eligible for shared scratchpad allocation, thus minimizing contention windows. Critical edge splitting and backward data-flow analysis facilitate optimal relssp placement for early release of shared regions (Jatala et al., 2016); a simplified last-use sketch follows this list.
- Integer Linear Programming (ILP) Deployment: SMART employs an ILP-based compiler to allocate and prefetch convolutional-layer memory objects, scheduling them across heterogeneous (SHIFT and RANDOM) arrays to maximize throughput subject to latency and area constraints (Zokaee et al., 2021); an illustrative ILP appears after this list.
- Algorithmic “Scratchpad” for Multi-Step Reasoning: In sequence models and transformers, scratchpads are external token spaces for intermediate computation, allowing supervised stepwise output. This mechanism was shown to substantially improve arithmetic and program-execution accuracy by “showing your work” (Nye et al., 2021). Efficient learning is enabled when scratchpads follow an inductive structure, in which each intermediate state is computed recursively from the input and the preceding state, overcoming nonconstant-locality (globality) barriers in complex reasoning (Abbe et al., 10 Jun 2024).
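A drastically simplified version of the lifetime analysis in the first bullet: over a straight-line access trace, the release of each shared-scratchpad variable can be hoisted to just after its final access, which is where a relssp-style instruction would be placed. Real compilers derive this via backward data-flow over the control-flow graph with critical-edge splitting; the trace format and names below are illustrative.

```python
# One (opcode, variable) pair per scratchpad access, in program order.
trace = [
    ("ld",  "tileA"), ("ld",  "tileB"),
    ("mac", "tileA"), ("mac", "tileB"),
    ("st",  "tileA"),
]

def release_points(trace):
    # Record the final access index of each variable; releasing immediately
    # after it minimizes the window in which the shared region is held.
    last_use = {}
    for idx, (_, var) in enumerate(trace):
        last_use[var] = idx
    return {v: i + 1 for v, i in last_use.items()}

print(release_points(trace))  # {'tileA': 5, 'tileB': 4}
```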
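And a hedged sketch of an ILP in the spirit of SMART's allocator, written with the PuLP solver: binary variables place each layer's memory objects on a SHIFT or RANDOM array, minimizing modeled access cost under capacity constraints. All object sizes, latencies, and identifiers are invented for illustration; the actual SMART formulation also models prefetch scheduling.

```python
import pulp

objs = {"conv1_w": 64, "conv1_act": 32, "conv2_w": 128}  # sizes (assumed)
lat = {"SHIFT": 1, "RANDOM": 3}      # per-unit access cost (assumed)
cap = {"SHIFT": 128, "RANDOM": 160}  # array capacities (assumed)

prob = pulp.LpProblem("smart_alloc", pulp.LpMinimize)
x = {(o, a): pulp.LpVariable(f"x_{o}_{a}", cat="Binary")
     for o in objs for a in cap}
# Objective: total modeled access cost of the chosen placement.
prob += pulp.lpSum(objs[o] * lat[a] * x[o, a] for o in objs for a in cap)
for o in objs:  # each object is placed on exactly one array
    prob += pulp.lpSum(x[o, a] for a in cap) == 1
for a in cap:   # placements must respect array capacity
    prob += pulp.lpSum(objs[o] * x[o, a] for o in objs) <= cap[a]
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({o: a for o in objs for a in cap if x[o, a].value() == 1})
# e.g. {'conv1_w': 'RANDOM', 'conv1_act': 'RANDOM', 'conv2_w': 'SHIFT'}
```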
4. Application Domains: Visualization, HPC Kernels, ML Accelerators
Scratchpad methodologies drive performance, interpretability, and usability across application classes:
- High-Dimensional Data Visualization: SketchPadN-D supports instantaneous generation and editing of high-dimensional data directly within the visualization interface. The WYDIWYGS paradigm links 2D sketching to genuine N-D distributions via tools such as parallel coordinates for PDF curves and N-D polygons for shape sculpting. Data editing, cleaning, and cluster generation leverage direct manipulation of marginals, correlations, and multidimensional relationships (Wang et al., 2013).
- Sparse and Stencil Computation: The SMASH kernel implements a row-wise product approach for SpGEMM, merging intermediate results on-chip using atomic instructions and hash-based scratchpad tables, achieving a 9.4× observed speedup over inner/outer-product competitors (Shivdikar, 2021). For stencil operations, “deep temporal blocking” exploits increased scratchpad capacities by fully loading domain tiles and performing multiple time iterations entirely in shared memory before synchronizing with global DRAM (Zhang et al., 2023). Both techniques are sketched after this list.
- Capsule and CNN Accelerators: DESCNet and SMART systems utilize scratchpad hierarchies with bank/sector partitioning, pipelined random-access arrays, and compiler-assisted prefetch across sequential and random memory needs. Inference throughput improvements of up to 3.9× and energy reductions of up to 86% over SHIFT-only and state-of-the-art designs are reported (Zokaee et al., 2021, Marchisio et al., 2020).
- Neural Network Training: STT-MRAM scratchpads enable high-capacity, low-leakage storage for DNN training, using heterogeneous write protocols for floating-point bits. System-level energy savings reach 15–22× over SRAM, with negligible accuracy loss when errors are confined to least-significant bits (Roy et al., 2023).
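To ground the hash-based merging described in the sparse-computation bullet, the sketch below runs a row-wise product SpGEMM with a per-row hash table playing the role of the on-chip scratchpad: partial products are merged locally, and only the finished row is flushed. The dict-of-dicts sparse format is an illustrative simplification, not SMASH's actual data layout.

```python
from collections import defaultdict

def spgemm_rowwise(A, B):
    C = {}
    for i, a_row in A.items():
        scratch = defaultdict(float)       # per-row hash-table scratchpad
        for k, a_ik in a_row.items():      # nonzeros of row i of A
            for j, b_kj in B.get(k, {}).items():
                scratch[j] += a_ik * b_kj  # merge partial products on-chip
        C[i] = dict(scratch)               # flush the merged row once
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm_rowwise(A, B))  # {0: {1: 16.0}, 1: {0: 15.0}}
```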
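Deep temporal blocking distills to: read a tile plus a halo of width T once, advance T time steps locally, and write back only the tile interior. The host-side NumPy sketch below does this for a 3-point stencil with fixed endpoints; on a GPU the scratch array would live in shared memory, which is the whole point of the enlarged capacities noted above.

```python
import numpy as np

def stencil_step(u):
    # One 3-point stencil update on the interior of u.
    return 0.5 * u[1:-1] + 0.25 * (u[:-2] + u[2:])

def temporally_blocked(u, tile, T):
    out = u.copy()
    for start in range(0, len(u), tile):
        lo, hi = max(0, start - T), min(len(u), start + tile + T)
        scratch = u[lo:hi].copy()  # one DRAM read per tile (with halo)
        for _ in range(T):         # T time steps with no DRAM traffic
            scratch[1:-1] = stencil_step(scratch)
        off = start - lo
        out[start:start + tile] = scratch[off:off + tile]  # one write-back
    return out

u = np.linspace(0.0, 1.0, 32)
print(np.allclose(temporally_blocked(u, tile=8, T=2),
                  temporally_blocked(u, tile=32, T=2)))  # True: tiling exact
```

A halo of width T suffices because each stencil step invalidates at most one more cell inward from a stale tile edge, so after T steps the tile interior is still exact.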
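The heterogeneous write protocol for floating-point words can be caricatured as follows: strong (here, error-free) pulses protect the sign, exponent, and high mantissa bits, while cheap, error-prone pulses write the LSBs. The error rate and the 16-bit protection boundary below are invented for illustration, not outputs of the LLG-based co-simulation.

```python
import random
import struct

def write_fp32(value, p_err_lsb=1e-2, protected_msbs=16, rng=random.random):
    # Round-trip a float through its 32-bit pattern, flipping unprotected
    # low-order bits with probability p_err_lsb to mimic cheap writes.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    out = 0
    for i in range(32):  # bit 31 = sign; bits 30..23 = exponent
        b = (bits >> i) & 1
        p = 0.0 if i >= 32 - protected_msbs else p_err_lsb
        out |= (b ^ (rng() < p)) << i
    return struct.unpack(">f", struct.pack(">I", out))[0]

print(write_fp32(3.14159))  # ~3.14159, perturbed only in low mantissa bits
```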
5. Inductive Scratchpad and Globality Barriers in Sequential Reasoning
Transformers trained for multi-step reasoning face inherent learnability limits dictated by the “globality barrier,” a formalization of target locality in token sequences (Abbe et al., 10 Jun 2024):
- High globality tasks (e.g., compositional syllogism, graph connectivity) lack reliable correlation in small token subsets, impeding efficient learning by vanilla Transformers.
- Scratchpad techniques—especially inductive scratchpads—reduce effective locality by structuring intermediate representations into recursively defined states, facilitating training and enabling length generalization for arithmetic and logical tasks far beyond seen input lengths (a minimal sketch follows this list).
- Educated and agnostic scratchpads are distinguished by the presence or absence of auxiliary supervision; only the inductive variant robustly breaks the globality bottleneck and supports strong out-of-distribution generalization.
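A plain-Python caricature of the inductive scratchpad, for running parity: a stand-in step function plays the role of a model that conditions only on the question and the most recent state, never on the full generation history, so the state stays constant-size and the loop length-generalizes trivially. The task, state encoding, and names are illustrative.

```python
def step(question, state):
    # "Model" update: next recursive state from question + previous state.
    i, acc = state
    return (i + 1, acc ^ question[i])

def run_scratchpad(question):
    state = (0, 0)   # initial state: position, running parity
    trace = [state]  # emitted scratchpad tokens
    while state[0] < len(question):
        state = step(question, state)
        trace.append(state)
    return state[1], trace

ans, trace = run_scratchpad([1, 0, 1, 1])
print(ans)    # 1
print(trace)  # [(0, 0), (1, 1), (2, 1), (3, 0), (4, 1)]
```

Because each emitted state depends only on constant-size context, the effective locality of the target is constant per step, which is how the inductive variant sidesteps the globality barrier.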
6. Impact, Misconceptions, and Prospects
Scratchpad methodologies challenge the assumption that cache-based and implicitly managed memory hierarchies are optimal for all workloads. The explicit and programmable data control afforded by scratchpads:
- Yields substantial area, energy, and performance benefits in modern ML and HPC architectures.
- Underpins interpretability and robust generalization in sequential models and reasoning systems.
- Is a key enabler in visualization and scientific computing, supporting direct data sculpting, multidimensional editing, and tractable exploratory analysis.
Misconceptions may arise regarding the maintenance burden and overhead of explicit management relative to hardware-managed caches. However, experimental results consistently demonstrate that, in regimes where access patterns are predictable and sufficient on-chip capacity can be provisioned, scratchpad approaches are both feasible and advantageous.
The continued evolution includes integration of emerging memory technologies (MRAM, SFQ), heterogeneous compiler-directed allocation, inductive learning strategies for algorithmic reasoning, and streaming accelerator communication. These developments position scratchpad methodologies as central to the scalability, adaptability, and efficiency of next-generation computational and reasoning systems.