- The paper introduces a broadcast-based approach that enables PIM-based spatial queries using a hierarchical R-tree model.
- It incorporates two-phase filtering and breadth-first serialization to drastically reduce communication overhead across thousands of DPUs.
- Empirical results show up to 3.66× kernel speedup over a multi-threaded CPU R-tree baseline and up to 3.4× lower energy consumption.
Parallel R-tree-Based Spatial Query Processing on Commercial Processing-in-Memory Systems
Introduction
The exponential growth of data in scientific, spatial, and geospatial workloads poses growing challenges for efficient query processing because of memory bandwidth limitations and the memory wall problem. Traditional CPU-centric systems suffer from inefficiencies related to data movement and limited parallel memory access. Processing-in-Memory (PIM) architectures, such as commercial UPMEM DPUs, provide a compelling alternative by integrating compute capability directly within the memory subsystem, enabling large-scale parallelism and substantially reducing energy consumption for memory-bound workloads. While prior PIM research has focused primarily on linear scans and hash-based operators, this paper presents the first design, implementation, and evaluation of hierarchical R-tree range queries on real commercial PIM hardware (2604.14445).
PIM Architecture and Challenges
UPMEM PIM systems comprise thousands of DPUs embedded within DRAM chips, with no direct inter-DPU communication; all coordination is mediated via explicit host-to-device bulk transfers. Each DPU is equipped with a custom core, DRAM-based MRAM (data storage), small WRAM (fast buffer), and limited instruction RAM (IRAM), supporting up to 16 tasklets for intra-DPU parallelism. The bulk-synchronous parallel (BSP) execution model and stringent local memory constraints necessitate careful index and data layout strategies.
Figure 1: Processing-in-Memory (PIM) system organization, showing DIMMs populated with multiple DRAM Processing Units (DPUs) integrated with DRAM dies.
Hierarchical structures like R-trees, which are standard in spatial query processing, are difficult to map onto PIM because of their irregular access patterns, multi-level pointer traversals, and output-sensitive result sizes. Adapting them to the UPMEM architecture is non-trivial: with no direct inter-DPU communication, a pointer-based traversal that crosses from one DPU's memory into another's is infeasible.
R-tree Algorithms and Bulk-Loading
The R-tree is a height-balanced hierarchical index organizing objects using Minimum Bounding Rectangles (MBRs), with leaf nodes storing data MBRs and internal nodes containing MBRs encapsulating their children. Bulk-loading, particularly the Sort-Tile-Recursive (STR) method, packs R-trees to ensure low overlap and efficient spatial partitioning—properties critical for parallel search.
The paper adopts an STR bulk-load variant with explicit control over node fanout and tree depth to fit the architecture constraints and maximize parallel execution.
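As a rough illustration of STR packing, the sketch below sorts entries by x-center, cuts them into vertical strips, sorts each strip by y-center, and packs runs of `fanout` entries into nodes, repeating level by level until one root remains. It is a minimal host-side sketch, not the paper's implementation: the names `str_pack` and `str_build` are hypothetical, and only the fanout is controlled here, whereas the paper's variant is described as also controlling tree depth.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Entry = Tuple[Rect, int]                    # (MBR, object id or child node index)


def mbr_of(entries: List[Entry]) -> Rect:
    """Smallest rectangle enclosing every entry's MBR."""
    return (min(e[0][0] for e in entries), min(e[0][1] for e in entries),
            max(e[0][2] for e in entries), max(e[0][3] for e in entries))


def str_pack(entries: List[Entry], fanout: int) -> List[List[Entry]]:
    """Pack entries into nodes of at most `fanout` entries (one STR pass)."""
    if not entries:
        return []
    n_nodes = math.ceil(len(entries) / fanout)
    n_slices = math.ceil(math.sqrt(n_nodes))       # number of vertical strips
    slice_size = n_slices * fanout                 # entries per strip
    # Sorting by (xmin + xmax) orders by x-center without dividing by 2.
    by_x = sorted(entries, key=lambda e: e[0][0] + e[0][2])
    nodes: List[List[Entry]] = []
    for i in range(0, len(by_x), slice_size):
        strip = sorted(by_x[i:i + slice_size], key=lambda e: e[0][1] + e[0][3])
        for j in range(0, len(strip), fanout):
            nodes.append(strip[j:j + fanout])
    return nodes


def str_build(rects: List[Rect], fanout: int) -> List[List[List[Entry]]]:
    """Return the tree as a list of levels, leaves first and root last."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    levels = [str_pack([(r, i) for i, r in enumerate(rects)], fanout)]
    while len(levels[-1]) > 1:
        # Each parent entry stores the child's MBR and its index in the level below.
        parents = [(mbr_of(node), idx) for idx, node in enumerate(levels[-1])]
        levels.append(str_pack(parents, fanout))
    return levels
```

Because every level is packed from sorted runs, nodes at the same level cover compact, mostly non-overlapping regions, which is the property the text identifies as important for parallel search.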
Figure 2: Three-level STR R-tree layout, illustrating hierarchical structure and partitioning across DPUs.
Executing R-tree Range Queries on PIM
Three approaches are explored:
- Multi-threaded CPU Baseline: Standard parallel R-tree querying using POSIX threads, exploiting host DRAM and multi-core CPUs for direct comparison.
- Subtree-Based PIM Baseline: The R-tree is constructed with a custom top-down approach, with the root fanout explicitly set to the number of DPUs; each DPU receives a disjoint subtree and processes queries locally, but at the cost of substantial host-to-DPU communication.
- Broadcast PIM R-tree (Proposed): The R-tree is built bottom-up with STR bulk-loading on the CPU, serialized in breadth-first order, and split into two parts: the upper-level headers (root and level-1) are compacted and broadcast to all DPUs, while the leaf nodes are partitioned and distributed across DPUs. Batched queries are then processed in parallel by per-DPU tasklets, which first prune globally using the upper-level MBRs and then scan the leaves held in MRAM.
Figure 3: Workflow for broadcast PIM R-tree range query processing with combined CPU orchestration and DPU parallel execution.
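The serialization and split described for the broadcast approach can be sketched on the host as follows. This is an illustrative model under stated assumptions, not the paper's code: the `Node` class, the record layout `(mbr, first_child, n_children)`, and the function names are hypothetical, and no real UPMEM transfer calls are used. In a height-balanced tree, breadth-first order places all internal nodes before all leaves, so the header is a prefix of the flat array and the leaves can be cut into contiguous per-DPU chunks.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Record = Tuple[Rect, int, int]              # (mbr, index of first child, child count)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)   # empty for leaves


def serialize_bfs(root: Node) -> Tuple[List[Node], List[Record]]:
    """Flatten the tree in breadth-first order; children of a node end up adjacent."""
    order: List[Node] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(node.children)
    position = {id(n): i for i, n in enumerate(order)}
    records = [
        (n.mbr, position[id(n.children[0])] if n.children else -1, len(n.children))
        for n in order
    ]
    return order, records


def split_for_dpus(order: List[Node], records: List[Record], n_dpus: int):
    """Return (header, partitions): header goes to every DPU, partitions[d] to DPU d."""
    n_internal = sum(1 for n in order if n.children)
    # Holds only when all leaves sit at the same depth (height-balanced tree).
    assert all(n.children for n in order[:n_internal]), "tree is not height-balanced"
    header = records[:n_internal]
    leaves = records[n_internal:]
    per_dpu = math.ceil(len(leaves) / n_dpus)
    partitions = [leaves[d * per_dpu:(d + 1) * per_dpu] for d in range(n_dpus)]
    return header, partitions
```

Replacing pointers with array offsets is what makes a single bulk copy of the header possible; each DPU can then resolve parent-child relationships with index arithmetic alone.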
Algorithmic Enhancements
- Two-phase Filtering: Phase 1 performs upper-level pruning using local WRAM-resident metadata, limiting the candidate set of subtrees per query to a small constant and dramatically reducing unnecessary memory accesses. Phase 2 performs local leaf node scanning for final overlap checks.
- Breadth-First Serialization: Tailored serialization enables highly efficient bulk broadcasting of metadata and contiguous leaf distribution, optimizing for UPMEM hardware's memory and communication constraints.
- Static Partitioning of Queries: Queries are batch-broadcast and partitioned across DPU tasklets, avoiding synchronization costs and balancing memory and compute loads.
Figure 4: DPU-index–guided upper-level filtering, where each DPU only evaluates overlap with a small neighborhood of adjacent level-1 MBRs, reducing filtering complexity from O(c) to O(1).
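A single DPU's work under two-phase filtering and static query partitioning might look like the simulation below. Everything here is an assumption made for illustration: the function `dpu_range_query`, the `owned` range of level-1 indices, and the flattening of leaf contents into per-parent lists are hypothetical, and tasklets are run one after another rather than concurrently. The point is the shape of the loop: each tasklet takes a fixed stride of the query batch, tests only the few level-1 MBRs whose leaves live on this DPU, and scans local leaves only for the survivors.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def overlaps(a: Rect, b: Rect) -> bool:
    """Closed-interval rectangle intersection test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def dpu_range_query(
    level1_mbrs: List[Rect],                       # broadcast header (same on every DPU)
    owned: range,                                  # level-1 indices whose leaves are local
    local_leaves: Dict[int, List[Tuple[int, Rect]]],  # level-1 index -> (object id, MBR)
    queries: List[Rect],                           # broadcast query batch
    n_tasklets: int,
) -> List[Tuple[int, int]]:
    """Return sorted (query index, object id) pairs found on this DPU."""
    results: List[Tuple[int, int]] = []
    for t in range(n_tasklets):                    # concurrent on hardware, sequential here
        # Static partitioning: tasklet t handles queries t, t + n_tasklets, ...
        for q in range(t, len(queries), n_tasklets):
            query = queries[q]
            # Phase 1: prune with the small set of locally relevant level-1 MBRs.
            candidates = [i for i in owned if overlaps(level1_mbrs[i], query)]
            # Phase 2: scan the local leaf entries under each surviving parent.
            for i in candidates:
                for obj_id, rect in local_leaves[i]:
                    if overlaps(rect, query):
                        results.append((q, obj_id))
    return sorted(results)
```

Since `owned` has a fixed small size per DPU, Phase 1 costs a constant number of overlap tests per query regardless of how many level-1 nodes the whole tree has, which matches the O(c) to O(1) reduction stated in the Figure 4 caption.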
Empirical Evaluation
The approach is benchmarked on two real datasets (Sports, Lakes) and one synthetic dataset, using an HPC system for the CPU baseline and a commercial UPMEM-PIM server with up to 2,540 DPUs. The performance metrics are kernel time (pure DPU execution), end-to-end time (including communication), and energy consumption; results are normalized to CPU-only execution and analyzed for strong scaling.
- Broadcast-Based vs. Subtree-Based: The broadcast-based method incurs substantially lower communication overhead than the subtree baseline. On the Lakes dataset, its kernel time drops from 64.9 s (512 DPUs) to 17.6 s (2,540 DPUs), and it reaches up to 3.66× kernel and 2.70× end-to-end speedup relative to CPU R-tree search. Communication is no longer the dominant cost, in sharp contrast to the subtree approach.
Figure 5: Kernel and communication time breakdown for the Lakes dataset—Broadcast-based execution vastly reduces communication overhead compared to the subtree baseline.
- Scalability: Strong scaling is demonstrated up to 2,540 DPUs, with kernel and end-to-end speedup growing near-linearly until communication overheads dominate at the highest DPU counts. Intra-DPU scaling saturates at 8–11 tasklets, which points to MRAM bandwidth as the limiting factor.
Figure 6: Strong scaling on a fixed dataset shows robust speedup as more DPUs are utilized in broadcast-based PIM R-tree search.
Figure 7: Effect of intra-DPU tasklet parallelism—performance saturates as MRAM bandwidth becomes a bottleneck.
Figure 8: Per-batch operation timing for query transfer, kernel execution, and results retrieval—kernel execution is now the primary bottleneck, not communication.
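The Lakes kernel times quoted above (64.9 s at 512 DPUs, 17.6 s at 2,540 DPUs) can be converted into a relative speedup and a parallel efficiency. The helper below is a small worked calculation, with `strong_scaling` as a hypothetical name; note that it compares two DPU configurations with each other, so its speedup is not the same quantity as the 3.66× figure, which is measured against the CPU baseline.

```python
def strong_scaling(t_base: float, p_base: int, t_scaled: float, p_scaled: int):
    """Return (speedup, efficiency) of the scaled run relative to the base run."""
    speedup = t_base / t_scaled          # how much faster the larger configuration is
    ideal = p_scaled / p_base            # speedup expected under perfect linear scaling
    return speedup, speedup / ideal


speedup, efficiency = strong_scaling(64.9, 512, 17.6, 2540)
print(f"relative speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```

With these inputs the relative kernel speedup is about 3.69× for roughly 4.96× more DPUs, which corresponds to a parallel efficiency of about 74%.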
Energy Efficiency
The broadcast-based PIM R-tree is reported to consume up to 3.4× less energy than CPU search on large, memory-bound real workloads. For Lakes, the quoted figures are 59.6 kJ vs. 167.0 kJ, a ratio of about 2.8×. For highly memory-intensive synthetic datasets, the reduction exceeds 14×. Gains are marginal for cache-resident tasks.
Discussion and Implications
The proposed broadcast-based PIM R-tree search is the first practical demonstration of hierarchical spatial indexing on commercial PIM hardware. It shows effective strong and weak scaling and reduced energy consumption, and it identifies communication-aware design as a critical enabler for PIM adoption in irregular, data-intensive database workloads. The memory-centric design exploits DPU-local bandwidth and reduces reliance on costly host/device transfers by aligning the index layout with device constraints.
Compared to CPU-parallel approaches, PIM excels as the problem size and memory access irregularity grow. For small or cache-resident workloads, CPU baselines remain competitive due to unavoidable communication overheads. For large-scale spatial analytics, the implications are clear: PIM architectures, if paired with suitable communication-minimizing algorithms, offer a scalable, energy-efficient alternative to traditional architectures.
Future Directions
Key directions for advancing PIM-based spatial analytics include:
- Overlapping Computation and Transfer: Dual DPU set strategies to hide communication latency.
- Analytical Performance Modeling: Development of theoretical models to guide dynamic workload partitioning between CPU and PIM.
- Extension to Other Hierarchical Indexes: Exploration of the technique’s generalizability to other structures and multi-query/complex spatial operators, including joins and overlays.
Such enhancements will further clarify the system break-even points, inform workload scheduling for hybrid CPU–PIM systems, and drive future improvements in index structure design for in-memory and near-memory architectures.
Conclusion
This paper provides a methodology and empirical validation for efficient R-tree spatial range queries on commercial PIM platforms via a broadcast-based execution model. The results show substantial performance and energy benefits for large, memory-bound workloads, provided that data layout and communication strategies are tightly coupled to hardware constraints. This work establishes a foundation for extending PIM acceleration to broader classes of irregular, hierarchical, and spatial data processing tasks.