PrIM Benchmark Suite Overview
- The PrIM Benchmark Suite is a comprehensive, open-source collection of memory-bound workloads for evaluating processing-in-memory architectures.
- It covers diverse application domains such as linear algebra, databases, analytics, graph processing, neural networks, bioinformatics, and image processing.
- Implemented in C using the UPMEM SDK, it enables detailed performance analysis through metrics like execution time, energy consumption, and speedup.
The PrIM Benchmark Suite is a comprehensive, open-source collection of memory-bound workloads, explicitly crafted for evaluating real processing-in-memory (PIM) architectures. Developed in the context of assessing the UPMEM system, it is the first benchmark suite to target publicly available commercial PIM hardware, and it offers broad coverage of application domains critical to contemporary data-intensive computing (Gómez-Luna et al., 2021).
1. Motivation and Design Goals
The PrIM suite was designed to enable thorough performance characterization of PIM systems across representative, real-world workloads. Key design objectives include:
- Diversity: Selection of memory-bound tasks from domains encompassing dense and sparse linear algebra, databases, analytics, graph processing, neural networks, bioinformatics, image processing, and parallel primitives.
- Realism: Workloads are chosen for practical relevance and for exhibiting real bottlenecks under standard memory hierarchies.
- Open Accessibility: Code and datasets are publicly available (PrIM repo), allowing reproducibility and extension by academia and industry.
- Hardware Suitability: Developed specifically for the UPMEM system, an architecture integrating DRAM with embedded DPUs (DRAM Processing Units), yet abstract enough for general PIM evaluation.
These goals distinguish PrIM from prior PIM benchmark suites, which were largely domain-specific, simulation-only, or proprietary. PrIM’s focus on memory-boundedness is grounded in the roofline analysis of the original paper, which shows that all 16 workloads fall in the memory-bound region of the roofline on a typical CPU system (Gómez-Luna et al., 2021).
2. Benchmark Composition and Characteristics
PrIM comprises 16 workloads, each selected to represent both canonical and emerging computational patterns. The suite is organized by domain and operational characteristics, summarized in the table below:
| Domain | Benchmark Name | Operations |
|---|---|---|
| Dense Linear Algebra | Vector Addition (VA), Matrix-Vector Multiplication (GEMV) | add, mul |
| Sparse Linear Algebra | Sparse Matrix-Vector Multiplication (SpMV) | add, mul |
| Databases | Select (SEL), Unique (UNI) | add, cmp |
| Analytics | Binary Search (BS), Time Series Analysis (TS) | cmp, arithmetic |
| Graph Processing | Breadth-First Search (BFS) | bitwise logic |
| Neural Networks | Multilayer Perceptron (MLP) | add, mul, cmp |
| Bioinformatics | Needleman-Wunsch (NW) | add, sub, cmp |
| Image Processing | Histogram, short and large variants (HST-S, HST-L) | add |
| Parallel Primitives | Reduction (RED), Scan (SCAN-SSA, SCAN-RSS), Transposition (TRNS) | add, sub, mul |
Each benchmark is annotated with its memory access pattern (sequential, strided, random), data types (int32, float, uint64, etc.), the intra-DPU synchronization primitives it uses (barriers, mutexes, handshakes, semaphores), and whether it requires inter-DPU communication, which on UPMEM is mediated by the host. For parallel primitives such as reduction and scan, multiple implementation variants are supplied (e.g., scan-scan-add versus reduce-scan-scan; barrier- versus handshake-based synchronization) (Gómez-Luna et al., 2021).
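To make the intra-DPU synchronization concrete, the following minimal sketch, written against the UPMEM SDK's DPU-side API, shows barrier-coordinated tasklets in a reduction-style kernel in the spirit of RED. It is an illustrative sketch rather than PrIM's actual implementation; the buffer sizes and the `input`/`result` symbol names are assumptions.

```c
// Minimal DPU-side sketch of a barrier-coordinated reduction
// (illustrative; not PrIM's actual RED implementation).
// Each tasklet sums a disjoint slice of an MRAM-resident array,
// then tasklet 0 combines the partial sums after a barrier.
// NR_TASKLETS is set at compile time (e.g., -DNR_TASKLETS=16).
#include <stdint.h>
#include <defs.h>     // me()
#include <mram.h>     // mram_read()
#include <barrier.h>  // BARRIER_INIT, barrier_wait()

#define ELEMS 2048    // illustrative input size (multiple of CHUNK)
#define CHUNK 64      // elements staged per MRAM-to-WRAM transfer

__mram_noinit uint32_t input[ELEMS];  // filled by the host before launch
__host uint32_t result;               // read back by the host afterwards

BARRIER_INIT(red_barrier, NR_TASKLETS);
uint32_t partial[NR_TASKLETS];        // one WRAM slot per tasklet

int main(void) {
    __dma_aligned uint32_t buf[CHUNK];  // WRAM staging buffer
    uint32_t sum = 0;
    // Block-cyclic assignment of CHUNK-sized slices to this tasklet.
    for (uint32_t i = me() * CHUNK; i < ELEMS; i += NR_TASKLETS * CHUNK) {
        mram_read(&input[i], buf, CHUNK * sizeof(uint32_t));
        for (uint32_t j = 0; j < CHUNK; j++)
            sum += buf[j];
    }
    partial[me()] = sum;
    barrier_wait(&red_barrier);  // all partial sums are now visible
    if (me() == 0) {
        uint32_t total = 0;
        for (uint32_t t = 0; t < NR_TASKLETS; t++)
            total += partial[t];
        result = total;
    }
    return 0;
}
```

The barrier is what makes the final combination safe: every tasklet's partial sum is guaranteed to be written to WRAM before tasklet 0 reads it.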
3. Implementation Methodology
PrIM workloads are implemented in C using the UPMEM SDK, leveraging explicit orchestration between host-CPU and DPUs. Programming guidelines include:
- Data Block Partitioning: Tasks are divided into independent data blocks to maximize DPUs’ parallelism.
- Tasklet Utilization: Full use of the available tasklets on each DPU (≥11 per DPU recommended, enough to keep the DPU’s in-order pipeline full).
- Optimized Data Transfers: Use of large contiguous data blocks for maximal bandwidth between host and DPUs.
- Synchronization Primitives: Selection among barrier, mutex, and handshake synchronization based on workload dependency structure.
These practices align with general PIM programming strategies aimed at reducing synchronization overhead and maximizing compute locality; a minimal host-side sketch of the orchestration pattern follows.
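The sketch below uses the UPMEM SDK's C host API (`dpu_alloc`, `dpu_load`, `dpu_push_xfer`, `dpu_launch`). The binary name `./reduction_dpu`, the DPU count, and the `input`/`result` symbols (matching the DPU-side sketch in Section 2) are assumptions for illustration, not PrIM's actual host code.

```c
// Minimal host-side orchestration sketch (illustrative; not PrIM's
// actual host code). Partitions an input array into one contiguous
// block per DPU, launches the kernel, and gathers per-DPU results.
#include <stdio.h>
#include <stdint.h>
#include <dpu.h>

#define NR_DPUS 64            // illustrative DPU count
#define ELEMS_PER_DPU 2048    // must match ELEMS in the DPU binary

int main(void) {
    static uint32_t input[NR_DPUS * ELEMS_PER_DPU];
    for (size_t i = 0; i < NR_DPUS * ELEMS_PER_DPU; i++)
        input[i] = 1;  // toy data

    struct dpu_set_t set, dpu;
    DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));
    DPU_ASSERT(dpu_load(set, "./reduction_dpu", NULL));  // hypothetical binary

    // Scatter: one large contiguous block per DPU maximizes
    // CPU-to-DPU transfer bandwidth.
    uint32_t idx;
    DPU_FOREACH(set, dpu, idx) {
        DPU_ASSERT(dpu_prepare_xfer(dpu, &input[idx * ELEMS_PER_DPU]));
    }
    DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_TO_DPU, "input", 0,
                             ELEMS_PER_DPU * sizeof(uint32_t),
                             DPU_XFER_DEFAULT));

    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    // Gather: fetch each DPU's partial result and combine on the host
    // (inter-DPU communication is host-mediated on UPMEM).
    uint64_t total = 0;
    DPU_FOREACH(set, dpu) {
        uint32_t partial;
        DPU_ASSERT(dpu_copy_from(dpu, "result", 0, &partial, sizeof(partial)));
        total += partial;
    }
    printf("total = %llu\n", (unsigned long long)total);

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```

Note how the final reduction across DPUs happens on the host: this is the host-mediated inter-DPU communication pattern whose cost PrIM's time decomposition measures explicitly.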
4. Metrics and Evaluation Techniques
To facilitate rigorous evaluation across architectures, PrIM specifies several measurement metrics:
- Execution Time: Wall-clock time, decomposed into DPU computation, CPU-DPU and DPU-CPU data transfer, and multi-DPU synchronization costs.
- Strong Scaling: Fixed overall problem size, increasing DPU count.
- Weak Scaling: Fixed per-DPU workload, scaling up aggregate problem size with more DPUs.
- Speedup: Relative speedup versus single DPU, CPU, or GPU baselines.
- Energy Consumption: Measured via Intel RAPL (CPU), Nvidia SMI (GPU), and integrated circuit counters (PIM DIMMs).
- Operational Intensity: Computed per the roofline model as (floating-point or integer) operations per byte of memory traffic; the standard formulation is given after this list.
- Workload Breakdown: Detailed profiling of time partitioning (computation versus communication/synchronization).
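For reference, this is the textbook roofline bound underlying the operational-intensity metric (notation mine, consistent with the paper's usage):

```latex
% Operational intensity I and the roofline bound on attainable performance.
\[
  I = \frac{\#\,\text{operations}}{\#\,\text{bytes of memory traffic}},
  \qquad
  P_{\text{attainable}} = \min\left( P_{\text{peak}},\; I \cdot B_{\text{mem}} \right)
\]
% A kernel is memory-bound when I < P_peak / B_mem, i.e., when its
% intensity lies left of the ridge point of the roofline.
```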
The suite includes microbenchmarks for measuring UPMEM hardware pipeline utilization and memory bandwidth, enabling diagnostic studies of architectural limits (Gómez-Luna et al., 2021).
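In the same spirit as those microbenchmarks, the sketch below times MRAM-to-WRAM streaming on a single tasklet using the SDK's cycle counter. It is a simplified illustration with arbitrary sizes, not PrIM's actual microbenchmark code.

```c
// Simplified MRAM read-bandwidth probe (illustrative; not PrIM's
// actual microbenchmark). A single tasklet streams an MRAM region
// into WRAM and records the elapsed cycles; the host converts
// BYTES / cycles into bandwidth using the DPU clock frequency.
#include <stdint.h>
#include <defs.h>         // me()
#include <mram.h>         // mram_read()
#include <perfcounter.h>  // perfcounter_config(), perfcounter_get()

#define BYTES 65536  // MRAM region to stream (illustrative)
#define CHUNK 2048   // maximum bytes per mram_read() transfer

__mram_noinit uint8_t src[BYTES];
__host uint32_t cycles;  // read back by the host

int main(void) {
    if (me() != 0)  // assume a single measuring tasklet
        return 0;
    __dma_aligned uint8_t buf[CHUNK];        // WRAM staging buffer
    perfcounter_config(COUNT_CYCLES, true);  // reset and start counting
    for (uint32_t off = 0; off < BYTES; off += CHUNK)
        mram_read(&src[off], buf, CHUNK);
    cycles = (uint32_t)perfcounter_get();
    return 0;
}
```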
5. Role in PIM System Characterization
PrIM’s application in evaluating the UPMEM architecture reveals:
- Memory-Boundedness: Most benchmarks are highly memory bound, displaying linear or near-linear strong scaling with increasing DPUs, so long as inter-DPU/host synchronization is minimized.
- Architectural Bottlenecks: Workloads requiring inter-DPU communication (e.g., BFS, scan) exhibit reduced scaling due to reliance on host-mediated exchanges, highlighting lack of direct DPU-DPU communication as a principal limiting factor.
- Compute-Bound Exceptions: Tasks with high operational intensity, or with operations the DPU does not support natively (e.g., floating-point arithmetic and integer multiplication, which are emulated in software), expose PIM’s limitations relative to CPU/GPU.
- Comparative Performance: In suitable cases, UPMEM PIM outperforms modern CPUs and high-end GPUs (up to 2.5× speedup), with energy reductions apparent in the majority (10/16) of PrIM workloads (Gómez-Luna et al., 2021).
These results suggest workload characteristics strongly inform PIM suitability and accentuate the need for hardware and programming model enhancements focused on communication bottlenecks.
6. Use in Scheduling and Static Profiling Research
The PrIM suite’s diversity underpins research into hybrid scheduling policies, as exemplified by the analytic offloading strategy evaluated in APIM (Jiang et al., 2024). The suite serves as a testbed for examining the impact of static code analysis and fine-grained partitioning in mapping application segments to CPU or PIM cores.
For example, in the APIM paper, selected PrIM benchmarks (gemv, select, unique, hashjoin, mlp) are used to compare:
- Static analytic partitioning (APIM)
- Function-level and basic block-level offloading
- Naive, MPKI-based, and greedy scheduling heuristics (an MPKI-style decision rule is sketched below)
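As a rough illustration of the MPKI-based heuristic in this comparison, the decision rule can be sketched as follows; the threshold and the profile structure are hypothetical, and APIM's actual analytic model is considerably richer (Jiang et al., 2024).

```c
// Hypothetical sketch of an MPKI-threshold offloading rule (not
// APIM's actual model). A code region is sent to PIM when its
// last-level-cache misses per kilo-instruction exceed a threshold,
// i.e., when it is likely memory-bound on the CPU.
#include <stdbool.h>
#include <stdint.h>

struct region_profile {
    uint64_t instructions;  // dynamic instructions in the region
    uint64_t llc_misses;    // last-level cache misses in the region
};

// The threshold is illustrative; a real scheduler would calibrate it
// per platform and also weigh data-movement and switching costs.
static bool offload_to_pim(const struct region_profile *p,
                           double mpki_threshold) {
    if (p->instructions == 0)
        return false;
    double mpki = 1000.0 * (double)p->llc_misses / (double)p->instructions;
    return mpki > mpki_threshold;
}
```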
Empirical findings indicate:
- Fine-grained partitioning (basic-block offloading) achieves the largest speedups, often approaching theoretical upper bounds.
- Avoidance of excessive data movement and context switching is essential; naive PIM-only policies can degrade performance in certain applications (e.g., hashjoin, mlp).
- The analytic model used in APIM leverages PrIM’s coverage to generalize beyond classic graph workloads and expose the granularity-dependent trade-offs inherent to memory-centric computing.
Quantitatively, APIM at basic-block granularity demonstrated average speedups of 2.63× over CPU-only and 4.45× over PIM-only execution, with maxima of 7.14× and 10.64×, respectively (Jiang et al., 2024).
7. Comparison to Prior Benchmark Suites
A comparison to prior PIM benchmark suites highlights PrIM’s:
- Novelty: PrIM is the first benchmark suite tailored to a real-world, commercial PIM system (UPMEM), in contrast to previous simulation-only or synthetic suites.
- Breadth: Encompasses wider domains, including databases, analytics, graph, ML, bioinformatics, and image processing, whereas predecessors tended to focus narrowly (e.g., linear algebra or synthetic kernels).
- Depth: Multiple implementation variants per primitive and memory/computation pattern diversity foster nuanced evaluation.
- Accessibility: Open-sourced code and datasets, combined with public documentation, facilitate standardized benchmarking and community contributions (Gómez-Luna et al., 2021).
A plausible implication is that PrIM’s broad applicability and public availability provide a foundation for future comparative studies, hardware co-design, and generalization of PIM programming best practices.
8. Impact and Directions
The PrIM Benchmark Suite has established itself as a foundational tool for systematic, reproducible evaluation of processing-in-memory architectures. Its real-system orientation enables the exposure of architectural bottlenecks, scaling phenomena, and programming challenges not observable in simulation or emulation. PrIM also serves as a core reference in hybrid scheduling and static analysis research, influencing the development of analytic partitioning tools able to fully exploit heterogeneous CPU-PIM systems (Jiang et al., 2024).
Current limitations identified through PrIM-driven analysis, including host-mediated inter-DPU communication and the lack of native support for some arithmetic operations, suggest explicit directions for hardware and software evolution in general-purpose PIM platforms.
Summary: PrIM’s comprehensive workload coverage, open-source availability, and methodological rigor provide the infrastructure necessary for advancing both experimental PIM research and practical system implementation. It constitutes the empirical backbone for performance, scaling, and architectural studies in this emerging field.