ScaleSim Simulation Frameworks
- ScaleSim names several distinct simulation frameworks spanning DNN accelerator design, LLM-based multi-agent systems, astronomical instrument pipelines, and individual-level health data.
- The cycle-accurate SCALE-Sim variant provides detailed microarchitectural insight by simulating dense matrix workloads and exploring dataflow strategies such as output-, weight-, and input-stationary mappings.
- Its multi-domain applications, from proactive prefetching in LLM multi-agent serving to optical-distortion modeling in astrophysics, drive optimized design and system analysis.
ScaleSim refers to several distinct simulation frameworks across computer architecture, AI accelerator design, large-scale system modeling, and scientific instrumentation. This entry focuses on the principal incarnations documented in peer-reviewed and preprint literature: (1) SCALE-Sim, a cycle-accurate systolic-array accelerator simulator for DNNs, particularly CNNs and transformers; (2) a memory management and scheduling system for LLM-based multi-agent simulations; (3) a Python-based instrument simulator for astrophysical data pipelines; (4) generalized individual-level health data simulation; and (5) its use as a parallel architecture exploration tool. Each instantiation operates within a specialized domain, sharing an underlying emphasis on high-performance simulation at significant scale for research and design optimization.
1. Cycle-Accurate DNN Accelerator Simulation: SCALE-Sim
SCALE-Sim is an open-source, cycle-accurate simulator for systolic-array deep learning accelerators, designed to offer microarchitectural and system-level insights unobtainable from high-level performance models or analytical estimates. Its core capability is the explicit modeling of dense matrix-multiplication workloads typical of DNNs, supporting parameter sweeps across array dimensions, dataflows, buffer sizes, and off-chip bandwidths. The simulator implements three canonical dataflows: output-stationary (OS), weight-stationary (WS), and input-stationary (IS), emulating diverse hardware-mapping strategies such as those in the Google TPU and Eyeriss architectures (Samajdar et al., 2018).
The user supplies a layer-wise workload description (filter sizes, channel counts, feature-map dimensions, etc.) and a hardware configuration (array geometry, on-chip buffer sizes, dataflow selection, DRAM bandwidth). SCALE-Sim generates cycle-level traces of all SRAM and DRAM accesses, layered compute/memory utilization statistics, and energy estimates under user-supplied or literature-derived cost models. The computation is simulated under the assumption of no PE stalls for data, idealized double-buffering, and perfect overlap between computation and on-chip memory access, thereby providing a best-case lower-bound on system performance. The operational model is modular, with output granularity suitable for post-processing in external DRAM simulators or power models.
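The workload/hardware split described above can be illustrated with a small Python sketch. The field names below are hypothetical mirrors of SCALE-Sim's topology-CSV and architecture-config inputs (the real tool reads these from files); the helper derives the output feature-map extent the way a front end would before trace generation.

```python
# Hypothetical mirror of a SCALE-Sim-style hardware config and one
# workload (topology) row; real SCALE-Sim reads these from .cfg/.csv files.
hw_config = {
    "array_rows": 32, "array_cols": 32,        # PE array geometry
    "ifmap_sram_kb": 64, "filter_sram_kb": 64, "ofmap_sram_kb": 64,
    "dataflow": "ws",                          # os / ws / is
    "dram_bandwidth_words_per_cycle": 10,
}

conv_layer = {
    "ifmap_h": 56, "ifmap_w": 56, "channels": 64,
    "filter_h": 3, "filter_w": 3, "num_filters": 128, "stride": 1,
}

def ofmap_dims(layer):
    """Output feature-map extent for a valid (no-padding) convolution."""
    e_h = (layer["ifmap_h"] - layer["filter_h"]) // layer["stride"] + 1
    e_w = (layer["ifmap_w"] - layer["filter_w"]) // layer["stride"] + 1
    return e_h, e_w
```

From such a layer descriptor the simulator then derives cycle-level SRAM/DRAM traces; the descriptor itself carries only shape information.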
2. Memory Management for Large-Scale Multi-Agent Simulations
In the context of LLM-based multi-agent simulations, ScaleSim denotes a memory-efficient LLM serving framework that exploits sparsity in agent activation and leverages a unified "invocation distance" abstraction to drive priority-based memory management (Pan et al., 29 Jan 2026). The invocation distance $d_i$ quantifies the predicted relative time until agent $i$ will next require GPU-resident memory for an LLM request. This abstraction enables ScaleSim to delineate prefetch candidates and eviction victims by maintaining agents with the smallest $d_i$ (highest urgency) in GPU memory and offloading those with large $d_i$.
ScaleSim's two principal mechanisms are:
- Proactive Prefetching: Agents with small $d_i$ are preloaded to the GPU before their requests arrive, overlapping memory transfer with concurrent computation.
- Future-Reuse-Aware Eviction: Instead of least-recently-used policies, eviction selects the resident agent with the largest $d_i$, minimizing the probability of reloading it soon.
A modular interface (AgentMemoryModule) supports heterogeneous agent memory objects (e.g., LoRA adapters, prefix trees), with hooks for handling LLM requests, eviction, prefetch dispatch, and asynchronous load management. When agent states are shared, the shared object's urgency is set to the minimum $d_i$ among referencing agents. This approach significantly reduces load overhead (up to 70% versus baselines), and achieves end-to-end speedup of up to 1.74× compared to SGLang with LRU, over a spectrum of agent counts and benchmark types.
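The hook structure described above can be sketched as a small abstract interface. The method names below are hypothetical (the paper names the AgentMemoryModule but its exact signatures are not reproduced here); the helper encodes the shared-state rule that urgency follows the most urgent referencing agent.

```python
from abc import ABC, abstractmethod

class AgentMemoryModule(ABC):
    """Hypothetical sketch of the pluggable per-agent memory interface."""

    @abstractmethod
    def on_request(self, agent_id: int) -> None:
        """Called when an LLM request arrives for this agent."""

    @abstractmethod
    def evict(self, agent_id: int) -> None:
        """Offload this agent's state (e.g. a LoRA adapter) from GPU memory."""

    @abstractmethod
    def prefetch(self, agent_id: int) -> None:
        """Asynchronously load this agent's state ahead of its request."""

def shared_urgency(distances: dict[int, float], sharers: list[int]) -> float:
    """Shared state inherits the minimum invocation distance (i.e. the
    maximum urgency) among all agents referencing it."""
    return min(distances[a] for a in sharers)
```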
3. Integration with System-Level and Instrument Simulators
SCALE-Sim is also invoked as a component within hybrid simulation pipelines. For example, in energy analysis of LLM inference, SCALE-Sim is used to evaluate the operational intensity and per-cycle microarchitecture utilization of custom accelerator arrays (Atmer et al., 26 Dec 2025). In such workflows, SCALE-Sim reports per-cycle MAC operations, PE utilization, and SRAM access statistics, which are fused with latency traces from external tools (e.g., LLMCompass) and SRAM energy models (e.g., OpenRAM). This enables architectural trade-off exploration across buffer sizes, operating frequencies, and external bandwidth constraints.
Separately, in astronomical instrumentation, a Python-based "ScaleSim" implements end-to-end modeling of spectrograph data, convolving input astrophysical scenes through optical distortions, detector response, and calibration pipelines. This module integrates physical-optics simulations, polynomial distortion modeling, and spectral extraction routines, supporting full experimental design prior to hardware first light (Briesemeister et al., 2020). Output includes detector-frame simulations, calibration products, and subsequent data-reduction pipeline handoff.
4. Mathematical Models and Dataflow Strategies
Systolic DNN Acceleration
The simulator implements the detailed 2D wavefront propagation of data through a configurable PE mesh, for both convolutional (CONV) and matrix-multiplication (GEMM) layers. The total MAC count for a typical convolutional layer (square feature maps and filters) is $\mathrm{MACs} = M \cdot C \cdot E^2 \cdot R^2$, where $M$ is the number of output channels, $C$ the number of input channels, $E$ the output spatial extent, and $R$ the filter size.
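This count is a direct product of the layer's shape parameters; a minimal sketch, assuming $M$ output channels, $C$ input channels, output spatial extent $E$, and filter size $R$ (square feature maps and filters):

```python
def conv_macs(M: int, C: int, E: int, R: int) -> int:
    """Total multiply-accumulates for a square convolutional layer:
    M output channels, C input channels, E output spatial extent,
    R filter size (feature maps and filters both assumed square)."""
    return M * C * E * E * R * R
```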
The simulator computes, per block mapped onto the array, the fill-and-drain runtime $\tau = 2 S_R + S_C + T - 2$ cycles, where $S_R \times S_C$ is the PE-array geometry and $T$ is the temporal extent of the mapped block, and aggregates the results for total cycles. Energy modeling proceeds by counting each operation and data-movement event, applying user-defined energy constants for MACs, SRAM accesses, and DRAM accesses.
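The per-block runtime and the event-count energy model can be sketched as follows; the $2S_R + S_C + T - 2$ fill-drain form follows the original SCALE-Sim paper, while the per-event energy constants below are purely illustrative placeholders for user-supplied values.

```python
def block_cycles(rows: int, cols: int, t: int) -> int:
    """Fill-and-drain cycle count for one block mapped onto a
    rows x cols systolic array with temporal extent t."""
    return 2 * rows + cols + t - 2

def energy_estimate(macs: int, sram_acc: int, dram_acc: int,
                    e_mac: float = 1.0, e_sram: float = 6.0,
                    e_dram: float = 200.0) -> float:
    """Event-count energy model: every MAC, SRAM access, and DRAM access
    is weighted by a user-defined cost (relative units; the constants
    here are illustrative, not calibrated)."""
    return macs * e_mac + sram_acc * e_sram + dram_acc * e_dram
```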
LLM Multi-Agent Simulations
Agent memory management optimizes two operations:
- Prefetch selects all offloaded agents with $d_i$ below a prefetch threshold, sorts them by ascending $d_i$, and, for each, evicts the resident agent with the largest $d_j$ (provided $d_j > d_i$) before loading.
- On an inference request for agent $i$, if agent $i$ is not resident, the framework evicts the resident agent with the largest $d_j$ and synchronously loads agent $i$.
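The prefetch operation above can be sketched in Python; the threshold knob and the assumption that the GPU is already full (so every load requires a swap) are simplifications of this sketch, not details from the paper.

```python
def plan_prefetch(offloaded: dict, resident: dict, threshold: float):
    """Sketch of proactive prefetch: walk offloaded agents in ascending
    invocation distance, stopping at `threshold` (an assumed knob), and
    swap each against the largest-distance resident only when that
    resident is strictly less urgent. Assumes GPU memory is full, so
    every load pairs with an eviction. Returns (load, evict) actions."""
    actions = []
    res = dict(resident)  # agent_id -> invocation distance
    for agent, d in sorted(offloaded.items(), key=lambda kv: kv[1]):
        if d > threshold:
            break  # remaining agents are not urgent enough to prefetch
        victim = max(res, key=res.get) if res else None
        if victim is not None and res[victim] > d:
            del res[victim]
            res[agent] = d
            actions.append((agent, victim))
    return actions
```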
Key metrics include load latency, time-to-first-token, throughput speedup, and per-step execution time, all reported across varied agent population sizes and model distribution topologies.
5. Practical Application Domains
Accelerator Design
SCALE-Sim has been applied for architectural analysis of DNN accelerator proposals in vision, speech, NLP, games, and recommendation systems. By conducting sweeps over dataflows, buffer sizing, and array aspect ratios, the simulator exposes regime transitions between memory- and compute-bound operation, working set reuse saturation, and the sweet spots for energy-delay product (Samajdar et al., 2018). In LLM inference accelerator studies, it reveals how buffer sizing and frequency scaling interact under prefill (compute-bound) and decode (memory-bound) phases, quantitatively delineating memory bandwidth ceilings and the non-monotonicity of the energy-delay curve (Atmer et al., 26 Dec 2025).
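The array-aspect-ratio sweeps mentioned above hinge on mapping efficiency: when a workload tile does not divide evenly into the PE array, folded mappings leave PEs idle. The ceiling-based utilization model and the GEMM tile dimensions below are illustrative assumptions, not figures from the cited studies.

```python
import math

def pe_utilization(M: int, N: int, rows: int, cols: int) -> float:
    """Average PE utilization when an M x N output tile grid is folded
    onto a rows x cols array (simple ceiling-based mapping sketch)."""
    row_folds = math.ceil(M / rows)
    col_folds = math.ceil(N / cols)
    return (M * N) / (row_folds * rows * col_folds * cols)

# Sweep aspect ratios at a fixed budget of 1024 PEs for a 96 x 40 tile:
sweep = {(r, 1024 // r): pe_utilization(96, 40, r, 1024 // r)
         for r in (8, 16, 32, 64, 128)}
```

Plotting such a sweep exposes the regime transitions and sweet spots the studies describe: utilization is non-monotonic in aspect ratio because fold counts change discretely.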
Multi-Agent Simulation Infrastructure
ScaleSim's invocation distance-based framework is applied to large-scale simulations of agent societies, interaction networks, and information diffusion models, where memory contention emerges as the principal bottleneck due to non-uniform agent activity. The framework is benchmarked on Qwen2.5 transformers (7B and 32B) and demonstrates generality across single- and multi-GPU deployments (Pan et al., 29 Jan 2026).
Scientific Instrumentation
In spectrograph data pipeline design, ScaleSim simulates the optical PSF, lenslet and dispersion models, detector noise, and calibration frames, yielding rigorous predictions of instrument sensitivity, spatial/spectral resolution, and calibration stability under varying observation scenarios (Briesemeister et al., 2020).
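A toy forward model conveys the flavor of this pipeline stage: blur an input spectrum with a Gaussian line-spread function, then apply detector shot and read noise. All parameters and the 1-D simplification are illustrative assumptions; the published pipeline models full 2-D optical distortions and calibration frames.

```python
import numpy as np

def simulate_spectrum(scene: np.ndarray, lsf_sigma: float,
                      read_noise: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Toy 1-D forward model: convolve an input spectrum (photons/pixel)
    with a Gaussian line-spread function, then add Poisson shot noise
    and Gaussian read noise. Illustrative, not the published pipeline."""
    x = np.arange(-4 * lsf_sigma, 4 * lsf_sigma + 1)
    kernel = np.exp(-0.5 * (x / lsf_sigma) ** 2)
    kernel /= kernel.sum()                      # conserve total flux
    blurred = np.convolve(scene, kernel, mode="same")
    detected = rng.poisson(np.clip(blurred, 0, None)).astype(float)
    return detected + rng.normal(0.0, read_noise, size=scene.shape)
```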
6. Limitations, Assumptions, and Future Directions
Across its incarnations, ScaleSim generally assumes idealized overlaps of data transfer and computation, omits modeling of on-chip interconnect contention, and abstracts DRAM timing into burst models. In DNN acceleration, bank conflicts, NoC delays, and non-stationary dataflows must be modeled externally. For multi-agent memory management, invocation distance ordering relies on external hints; future work proposes dynamic or learning-based estimation. Heterogeneous agent memory sizes or highly non-uniform reference patterns present granularity challenges. In system integration scenarios, energy models are limited to SRAM and compute, not accounting for off-chip DRAM or control logic unless externally supplied. Calibration to silicon or cross-simulator validation is typically minimal (Atmer et al., 26 Dec 2025).
Proposed extensions include automatic invocation distance profiling, adaptive budget tuning for prefetch/eviction, integration of additional memory tiers (HBM, remote storage), and more general abstraction of shared-object urgency. In architectural simulation, planned work targets hierarchical synchronization, locality-optimized thread mapping, and event-driven adaptive clocking (Chalak et al., 2018, Pan et al., 29 Jan 2026).
7. Relationship to Related Frameworks
SCALE-Sim is distinct from event-driven system simulators such as ScaleSimulator (which targets cycle-accurate architectural pipelines using lock-free parallel models and a hierarchical synchronization protocol (Chalak et al., 2018)) or grid/computational infrastructure simulators such as CGSim (focused on distributed HPC/HTC environments with fair-share bandwidth arbitration and real-time calibration pipelines (Vatsavai et al., 1 Oct 2025)). In health data generation, candidate solutions share the large-scale, agent-based orientation but focus on state-space evolution for epidemiological, intervention, and bias analysis (Tikka et al., 2020). Each ScaleSim derivative is domain-specialized, and direct comparison requires careful attention to configuration, semantics, and target research questions.