DAMOV Simulation Results Analysis
- The paper identifies memory-bound functions using empirical profiling and benchmarks to quantify function-level data movement bottlenecks.
- It employs a three-phase simulation methodology combining ZSim, Ramulator, and custom locality analysis to extract both architecture-independent and dependent metrics.
- Findings reveal that NDP architectures can yield up to 4.8× speedup for bandwidth-bound workloads while reducing energy consumption, though compute-bound cases may underperform.
DAMOV simulation results refer to the empirical performance and profiling data generated during the evaluation of memory-bound application kernels in the DAMOV benchmark suite, focusing on data movement bottlenecks at the function level. These results are used to quantify and classify functions according to their dominant memory system constraints, comparing conventional processor-memory hierarchies, hardware prefetcher strategies, and Near-Data Processing (NDP) architectures. They serve both as a diagnostic for microarchitectural memory systems and as a rigorous basis for designing or choosing mitigation strategies for data movement overheads.
1. Methodology for DAMOV Simulation Studies
The DAMOV simulation results are produced using a three-phase methodological workflow:
- Function Selection: Hardware profiling, using tools such as Intel VTune, identifies memory-bound functions. A function is retained if it constitutes at least 3% of all execution cycles and has a “Memory Bound” metric over 30%.
- Metric Extraction:
- Architecture-independent metrics include:
- Spatial Locality: Quantified as $\frac{1}{N}\sum_{s \geq 1} \frac{H(s)}{s}$, where $H(s)$ is the count of accesses with stride $s$ and $N$ is the total number of memory accesses. Values near 1 indicate high sequentiality (mostly stride-1 accesses).
- Temporal Locality: Estimated as the number of accesses to previously referenced addresses divided by the total number of memory accesses; higher values reflect frequent address re-use.
- Architecture-dependent metrics:
- Arithmetic Intensity (AI)
- LLC Misses Per Kilo-Instruction (MPKI)
- Last-to-First Miss Ratio (LFMR): Defined as the ratio of last-level (L3) cache misses to L1 misses; values near 1 mean the deeper cache levels rarely capture L1 misses, while values near 0 indicate an effective cache hierarchy. (These metric computations are sketched in code at the end of this section.)
- Simulation Infrastructure: The DAMOV-SIM platform combines the ZSim parallel simulator (modeling in-order and out-of-order pipelines, caches, and prefetchers) with Ramulator (cycle-accurate DRAM simulator) and custom trace-based locality analyzers. Three major target system topologies are modeled:
- Standard host CPU with deep multi-level caches
- CPU with hardware stream prefetchers
- NDP system with the logic layer in stacked memory, using L1-only caches
Performance, energy, and scaling results are gathered by sweeping core counts from 1 to 256 and correlating each function's behavior with its bottleneck class assignment.
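As a concrete illustration of the function-selection filter and the architecture-independent metrics above, the following Python sketch computes spatial locality, temporal locality, and LFMR from a raw address trace. It is a minimal sketch under simplifying assumptions (cache-line-granularity strides between consecutive accesses), not the DAMOV toolchain's actual implementation.

```python
from collections import Counter

def is_candidate(cycles_pct: float, memory_bound_pct: float) -> bool:
    """Phase-1 filter: keep functions with >= 3% of execution cycles
    and a VTune 'Memory Bound' metric above 30%."""
    return cycles_pct >= 3.0 and memory_bound_pct > 30.0

def spatial_locality(addresses: list[int], line_size: int = 64) -> float:
    """Stride-weighted score: 1.0 when every access hits the next cache
    line (stride 1); decays toward 0 as strides grow."""
    lines = [a // line_size for a in addresses]
    hist = Counter(max(1, abs(curr - prev)) for prev, curr in zip(lines, lines[1:]))
    total = sum(hist.values())
    return sum(count / stride for stride, count in hist.items()) / total

def temporal_locality(addresses: list[int], line_size: int = 64) -> float:
    """Fraction of accesses that touch a previously seen cache line."""
    seen: set[int] = set()
    reuses = 0
    for a in addresses:
        line = a // line_size
        reuses += line in seen
        seen.add(line)
    return reuses / len(addresses)

def lfmr(llc_misses: int, l1_misses: int) -> float:
    """Last-to-First Miss Ratio: ~1 means L2/L3 rarely help; ~0 means they do."""
    return llc_misses / l1_misses if l1_misses else 0.0

# A streaming pattern: sequential cache-line accesses, no re-use.
trace = list(range(0, 64 * 1024, 64))
print(f"spatial={spatial_locality(trace):.2f}  temporal={temporal_locality(trace):.2f}")
```

On the streaming trace above, the sketch reports spatial locality 1.0 and temporal locality 0.0, matching the intuition that sequential scans are highly sequential but never re-use data.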
2. Composition and Properties of the DAMOV Benchmark Suite
From an initial set of approximately 77,000 functions extracted from 345 applications (spanning 37 benchmark suites), 144 functions are selected as representative of memory-bound workloads. These functions, drawn from 74 applications, deliberately cover a spectrum of data movement patterns, including high/low MPKI, spatial/temporal locality extremes, and varying cache effectiveness.
Each function is labeled according to its bottleneck class, e.g., DRAM bandwidth-bound, DRAM latency-bound, cache capacity-bound (L1/L2/L3), or compute-bound. This systematic diversity enables comprehensive assessment and targeted architectural exploration.
| Metric Signature | Bottleneck Class | Typical NDP Outcome |
|---|---|---|
| High MPKI, low temp. loc. | DRAM bandwidth-bound | NDP benefits most |
| Low MPKI, high LFMR | Latency-bound, cache hierarchy ineffective | NDP provides moderate benefit |
| High AI, low MPKI | Compute-bound | NDP may degrade performance |
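A thresholded classifier in the spirit of this taxonomy could look like the sketch below; the cutoff values are illustrative assumptions for exposition, not the calibrated boundaries used in the DAMOV study.

```python
from dataclasses import dataclass

@dataclass
class FunctionMetrics:
    mpki: float      # LLC misses per kilo-instruction
    lfmr: float      # LLC misses / L1 misses
    ai: float        # arithmetic intensity (operations per byte)
    temporal: float  # temporal locality score in [0, 1]

def classify(m: FunctionMetrics) -> str:
    # Thresholds are illustrative placeholders, not DAMOV's calibrated values.
    if m.mpki > 10 and m.temporal < 0.3:
        return "DRAM bandwidth-bound"   # floods memory with misses
    if m.lfmr > 0.9 and m.mpki <= 10:
        return "DRAM latency-bound"     # misses reach DRAM, but traffic is modest
    if m.ai > 1.0 and m.mpki < 1:
        return "compute-bound"          # NDP may degrade performance here
    return "cache capacity-bound"       # served by, and sensitive to, L1/L2/L3

print(classify(FunctionMetrics(mpki=25, lfmr=0.95, ai=0.1, temporal=0.1)))
# -> DRAM bandwidth-bound
```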
3. Key Simulation Results and Insights
The simulation outcomes show strong distinctions among function classes:
- DRAM Bandwidth-Bound Functions: With high LLC MPKI and low temporal locality, these benefit disproportionately from NDP, where the additional internal memory bandwidth (up to 3.7×, e.g., 431 GB/s versus 115 GB/s on the host CPU) can yield speedups approaching 4.8× as core counts increase.
- DRAM Latency-Bound Functions: Functions whose L1 misses almost always reach DRAM (LFMR ≈ 1) but whose low MPKI means bandwidth is not the core issue. For these, NDP architectures lower the average memory access time (AMAT) by bypassing much of the cache hierarchy, resulting in speedups of approximately 1.2–1.3× over host CPUs.
- Cache-Fitting or Compute-Bound Functions: Where temporal locality is very high, or AI is high and MPKI is low, NDP may perform worse than the host CPU because it cannot exploit a deep cache hierarchy and pays the additional cost of direct DRAM access.
Detailed energy breakdowns reveal NDP's efficacy in reducing energy consumption for bandwidth- and latency-bound functions by eliminating off-chip I/O, but potential net energy degradation for compute-bound use-cases.
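The AMAT argument for latency-bound functions can be made concrete with a small model. The sketch below uses the standard serial-lookup AMAT formulation; all latencies and miss rates are assumed, illustrative values, not DAMOV's measured parameters.

```python
def amat(hit_times: list[float], miss_rates: list[float], dram_latency: float) -> float:
    """Average memory access time for a serial-lookup cache hierarchy:
    each level adds its hit time, weighted by the probability that an
    access reaches that level; remaining misses pay the DRAM latency."""
    t, p_reach = 0.0, 1.0
    for hit_time, miss_rate in zip(hit_times, miss_rates):
        t += p_reach * hit_time
        p_reach *= miss_rate
    return t + p_reach * dram_latency

# Illustrative (assumed) latencies in ns; high L2/L3 miss rates model LFMR ~ 1.
host = amat(hit_times=[1, 4, 12], miss_rates=[0.4, 0.9, 0.95], dram_latency=80)
ndp  = amat(hit_times=[1],        miss_rates=[0.4],            dram_latency=50)
print(f"host AMAT = {host:.1f} ns, NDP AMAT = {ndp:.1f} ns")
```

With these numbers, the host's L2 and L3 lookups add latency without filtering many misses (the LFMR ≈ 1 regime), so the L1-only NDP configuration reaches DRAM sooner and achieves the lower AMAT.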
4. Case Studies and Implications for NDP System Design
In-depth simulations of representative functions across the function classes underscore key patterns:
- Bandwidth-bound: NDP’s attached logic efficiently leverages the high internal bandwidth of 3D-stacked memory (e.g., Hybrid Memory Cube), achieving speedups as large as 4.8× at large core counts.
- Latency-bound: By routing L1 misses directly to DRAM and avoiding cascaded latency from deep cache lookups and off-chip busses, NDP architectures lower the AMAT.
- Limitations: For compute-intensive or highly cache-favorable workloads, NDP may underperform because deep, high-capacity caches are absent and because the benefits of address interleaving and locality are diluted.
Further simulation cases address inter-vault traffic, dynamic task scheduling, and effects of microarchitectural choices (e.g., core pipeline type, NDP accelerator granularity) in stacked-memory deployments.
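The bandwidth-versus-compute trade-off running through these case studies can be approximated with a simple roofline-style estimate. In the sketch below, the peak-compute numbers are assumptions chosen only to echo the bandwidth figures quoted above (115 GB/s host versus 431 GB/s NDP-internal); they are not measured DAMOV parameters.

```python
def attainable_gflops(ai: float, peak_gflops: float, bw_gbs: float) -> float:
    """Roofline estimate: performance is capped either by peak compute
    or by arithmetic intensity times memory bandwidth."""
    return min(peak_gflops, ai * bw_gbs)

# Assumed peaks: a beefy host CPU on 115 GB/s off-chip bandwidth vs.
# simpler NDP cores with access to 431 GB/s of stack-internal bandwidth.
for ai in (0.05, 0.25, 2.0):  # ops/byte: bandwidth-bound ... compute-bound
    host = attainable_gflops(ai, peak_gflops=500, bw_gbs=115)
    ndp  = attainable_gflops(ai, peak_gflops=150, bw_gbs=431)
    print(f"AI={ai:>4}: host {host:6.1f} GFLOP/s, NDP {ndp:6.1f} GFLOP/s")
```

Under these assumptions, low-AI (bandwidth-bound) points gain roughly the 3.7× bandwidth ratio on NDP, while the high-AI point is compute-bound and runs faster on the host, mirroring the limitation noted above.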
5. Methodological Reproducibility and Artifact Availability
DAMOV’s simulation methodology is fully open-sourced, including benchmark suite functions, toolchains, and integration files for ZSim, Ramulator, and custom locality analyzers. The repository (https://github.com/CMU-SAFARI/DAMOV) ensures reproducibility and facilitates rigorous comparative studies of new data-movement mitigation techniques. This open-source policy aims to standardize benchmark-driven architectural evaluation and enable broad participation in the analysis of memory system bottlenecks.
6. Correction of Simulation Result Reporting and Reproducibility Issues
Recent investigation (Luo et al., 17 Oct 2025) identified critical errors in how DAMOV simulation results were reported in subsequent works. Specifically, some analyses used non-DRAM-sensitive statistics (e.g., the latGETnl/mGETs counters of ZSim's L1-D cache), which do not capture actual DRAM-induced latency. This led to unrealistic, constant latency figures (~25 ns) and an incorrect assessment of simulator fidelity.
A corrected methodology requires extracting latency and performance metrics directly from the DRAM simulator (e.g., Ramulator) by collecting timing-accurate statistics, including:
- DRAM protocol latencies (e.g., $t_{RCD}$, $t_{CL}$, $t_{RP}$)
- Variabilities due to refresh operations and queuing
- Congestion-induced delay increases with rising memory traffic
Proper configuration and open availability of simulation sources, traces, and configuration files are mandated to ensure accurate, interpretable, and reproducible results. Simulation results corrected in this manner display load-dependent latency curves and bandwidths that reflect real hardware behavior (see the extraction sketch below), refuting prior claims in the literature that DAMOV- and Ramulator-based memory system simulation is inaccurate.
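As one way to apply the corrected methodology, the sketch below pulls an average DRAM read latency from a Ramulator-style statistics dump instead of ZSim's L1-D counters. The stats-file format (plain "name value" lines) and the stat name matched here are assumptions; actual Ramulator builds may label these statistics differently, so verify against your build's output.

```python
import re
from pathlib import Path

# Assumed stat name; verify against your Ramulator build's output.
READ_LATENCY = re.compile(r"^\S*read_latency_avg\S*\s+([0-9.]+)")

def dram_read_latency_ns(stats_path: str, tck_ns: float = 1.25) -> float:
    """Extract the average DRAM read latency (in DRAM cycles) from a plain
    'name value' stats dump and convert it to nanoseconds."""
    for line in Path(stats_path).read_text().splitlines():
        match = READ_LATENCY.match(line.strip())
        if match:
            return float(match.group(1)) * tck_ns
    raise ValueError("read-latency stat not found; check the stat name")

# Sweeping injected memory traffic and re-reading this statistic should yield
# load-dependent latency curves, unlike the constant ~25 ns L1-side figures.
# print(dram_read_latency_ns("DDR4.stats"))
```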
7. Broader Impact and Future Research Directions
The DAMOV simulation results, when correctly produced and thoroughly documented, offer both a practical and a methodological foundation for evaluating memory bottleneck mitigation in modern architectures—especially as the field expands toward memory-centric (NDP, PIM) solutions. The systematic approach linking low-level access metrics to architectural efficiency supports both hardware-software co-design and the development of analytically predictive performance models. Ongoing and future research can benefit from DAMOV’s transparent methodologies to explore scaling effects, heterogeneity in data movement patterns, and to rigorously validate new mitigation mechanisms in realistic, reproducible environments.