Data Path Fusion in GPU for Analytical Query Processing

Published 11 May 2026 in cs.DB | (2605.10511v1)

Abstract: One major technical challenge for modern analytical database systems is how to leverage GPU to exploit their massive parallelism and high bandwidth. Yet, existing GPU-driven database engines suffer from inefficiencies caused by frequent host-device interactions and fragmented execution across multiple GPU kernels, limiting their ability to fully utilize GPU's computational and IO capabilities. This paper proposes Data Path Fusion (DPF), a novel GPU-driven data processing architecture that integrates a sequence of data path operations -- including IOs, decompression, and query operations -- into a single GPU kernel. By fusing the data path, DPF reduces host-device communication overheads and enables more efficient utilization of GPU resources for analytical query workloads. DPF seamlessly integrates GPU-friendly optimization techniques, including type-specific compression/decompression, variable-length attribute support, and state-of-the-art GPU-driven IO mechanism, to work in concert, enabling efficient end-to-end query execution directly on GPU. Through extensive experimental evaluation using a prototyped DPF-based GPU-driven database engine (DPFProto) with analytical benchmark workloads, this paper demonstrates that DPF achieves speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach in the representative configuration. Our results show that DPF effectively unlocks the computational and IO potential of modern GPU, providing a promising direction for next-generation analytical database systems.


Authors (2)

Summary

The paper introduces a novel GPU data path fusion method that merges IO, decompression, and query processing into a single kernel.
It leverages GPU-initiated IO via the BaM system and type-specific compression, which reduces overall IO volume by 29–45%.
Reported speedups range from 2.66× to 16.81× over a fragmented GPU baseline (GiDP) on TPC-H and SSB, and from 7.3× to 176.8× over Polars, Spark-RAPIDS, and DuckDB.

Data Path Fusion for GPU-Driven Analytical Query Processing

Introduction and Motivation

"Data Path Fusion in GPU for Analytical Query Processing" (2605.10511) introduces Data Path Fusion (DPF), a GPU-driven data processing architecture that addresses efficiency limitations in current GPU-accelerated analytical database systems. Modern GPUs offer massive parallelism and high bandwidth, but existing analytical DBMSs drive them mainly through fragmented, host-orchestrated execution pipelines: IO, decompression, and core query operators are spread across multiple independent kernels, with intermediate results materialized between them and frequent host-device synchronization. This fragmentation causes kernel launch latencies, excess global memory transactions, and poor GPU utilization.

DPF challenges this design by unifying the entire OLAP data path (IO, decompression, and query logic) within a single GPU kernel. Through kernel fusion and accompanying GPU-optimized data path techniques, DPF aims to exploit both the computational and IO throughput of commodity high-end GPUs for large-scale analytics, while minimizing CPU intervention after kernel launch.

Technical Contributions

Unified Kernel Fusion Architecture

At the core of DPF is a kernel fusion design that absorbs all stages of the analytical data path within a single kernel invocation. Rather than executing IO, decompression, and relational operators (scan, join, aggregation) in distinct kernels, DPF statically fuses these operations along the query plan, maintaining thread residency and persistent memory locality across all processing stages. This approach eliminates intermediate result materialization in global memory and removes host-device control for IO orchestration.
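The difference between fragmented and fused execution can be illustrated with a small CPU-side analogy. This is only a sketch under stated assumptions, not DPF's kernel: the function names, the delta-plus-base "decompression", and the filter-and-sum query are all hypothetical stand-ins chosen for illustration. The fragmented version runs each stage as a separate pass and materializes every intermediate, while the fused version carries each value through all stages before touching the next one.

```python
from typing import List, Tuple

Pages = List[List[int]]


def fragmented_sum(pages: Pages, base: int, threshold: int) -> Tuple[int, int]:
    """Run each stage as its own pass and count the values materialized in between."""
    loaded = [list(p) for p in pages]                        # stage 1: "IO"
    decoded = [[base + d for d in p] for p in loaded]        # stage 2: decompression
    selected = [v for p in decoded for v in p if v > threshold]  # stage 3: filter
    materialized = (
        sum(len(p) for p in loaded)
        + sum(len(p) for p in decoded)
        + len(selected)
    )
    return sum(selected), materialized                       # stage 4: aggregate


def fused_sum(pages: Pages, base: int, threshold: int) -> Tuple[int, int]:
    """Carry each value through all stages at once; no intermediate lists are built."""
    total = 0
    for page in pages:              # "IO"
        for delta in page:
            value = base + delta    # decompression
            if value > threshold:   # filter
                total += value      # aggregate
    return total, 0


pages = [[0, 5, 10], [3, 20]]
print(fragmented_sum(pages, base=100, threshold=104))
print(fused_sum(pages, base=100, threshold=104))
```

Both functions return the same aggregate; only the number of intermediate values written out differs, which is the cost the fused design targets.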

GPU-Initiated IO via BaM

DPF uses the BaM system for GPU-initiated, fine-grained IO in place of host-controlled mechanisms such as GPUDirect Storage. GPU threads initiate, coordinate, and complete NVMe storage requests directly from within the fused kernel, with adaptive queue management to balance IO and computation concurrency.
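The submit-and-poll pattern behind this kind of IO can be sketched with a toy, single-threaded model. It does not reflect the BaM API or NVMe queue semantics: `ToyQueuePair`, `read_blocks`, and the dictionary standing in for storage are hypothetical names invented for this illustration. The point is only that the consumer of the data issues requests itself, is bounded by a queue depth, and overlaps submission with completion.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class ToyQueuePair:
    """Bounded submission queue over a dict that stands in for a block device."""

    def __init__(self, depth: int, storage: Dict[int, bytes]) -> None:
        self.depth = depth
        self.storage = storage
        self.in_flight: Deque[int] = deque()

    def submit(self, block_id: int) -> bool:
        """Enqueue a read request; refuse it if the queue is already full."""
        if len(self.in_flight) >= self.depth:
            return False
        self.in_flight.append(block_id)
        return True

    def poll(self) -> Optional[Tuple[int, bytes]]:
        """Complete the oldest in-flight request, if there is one."""
        if not self.in_flight:
            return None
        block_id = self.in_flight.popleft()
        return block_id, self.storage[block_id]


def read_blocks(qp: ToyQueuePair, block_ids: List[int]) -> Dict[int, bytes]:
    """Keep the queue as full as its depth allows while draining completions."""
    pending = deque(block_ids)
    results: Dict[int, bytes] = {}
    while pending or qp.in_flight:
        while pending and qp.submit(pending[0]):
            pending.popleft()
        done = qp.poll()
        if done is not None:
            results[done[0]] = done[1]
    return results


storage = {0: b"a", 1: b"b", 2: b"c"}
print(read_blocks(ToyQueuePair(depth=2, storage=storage), [2, 0, 1]))
```

In a real GPU-initiated design many threads would share such queues concurrently, which is where the queue management mentioned above comes in; the toy model leaves that out.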

Type-Specific Compression and Variable-Length Attribute Support

Recognizing the heterogeneity of analytical datasets, DPF applies type-specific columnar compression, using methods such as GPU-FOR for fixed-width attributes and FSST for variable-length types. Each compression codec is selected based on attribute type and value distribution at load time. Importantly, FSST-based variable-length attribute support is integrated into the fused execution model, enabling parallel in-place decompression and string predicate evaluation within the same kernel.
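For fixed-width attributes, the underlying idea of frame-of-reference (FOR) encoding can be shown in a few lines. The sketch below is plain FOR with bit-packing into a single Python integer; it is not GPU-FOR's actual on-disk layout, and `for_encode` and `for_decode` are hypothetical names. It assumes a non-empty list of integers.

```python
from typing import List, Tuple

Encoded = Tuple[int, int, int, int]  # (reference, bit_width, count, packed)


def for_encode(values: List[int]) -> Encoded:
    """Store the minimum once and pack each value's offset from it in bit_width bits."""
    reference = min(values)
    deltas = [v - reference for v in values]
    bit_width = max(deltas).bit_length()  # 0 when all values are equal
    packed = 0
    for i, delta in enumerate(deltas):
        packed |= delta << (i * bit_width)
    return reference, bit_width, len(values), packed


def for_decode(reference: int, bit_width: int, count: int, packed: int) -> List[int]:
    """Unpack each fixed-width offset and add the reference back."""
    mask = (1 << bit_width) - 1
    return [reference + ((packed >> (i * bit_width)) & mask) for i in range(count)]


column = [1000, 1003, 1001, 1007]
encoded = for_encode(column)
print(encoded[1], "bits per value instead of", max(column).bit_length())
print(for_decode(*encoded) == column)
```

Because every offset has the same width, any value can be located by position without decoding its neighbors, which is what makes this family of schemes suitable for parallel decompression.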

Data Loader and Metadata Management

A multi-threaded loader preprocesses tables into compressed, type-specific columnar layouts (with min/max zone maps and auxiliary RID indexes for variable-length types), preparing the data for efficient GPU-side pruning and random access. The loader introduces negligible additional overhead (<1.5%), maintaining practical import times even at scale.
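Min/max zone maps allow a scan to skip pages that cannot contain matching values. A minimal sketch follows, assuming one (min, max) pair per fixed-size page and an inclusive range predicate; `build_zone_map` and `pages_to_scan` are hypothetical names, not part of the prototype.

```python
from typing import List, Tuple

Zone = Tuple[int, int]  # (min, max) of one page


def build_zone_map(values: List[int], page_size: int) -> List[Zone]:
    """Record the minimum and maximum of every page of page_size values."""
    zones = []
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        zones.append((min(page), max(page)))
    return zones


def pages_to_scan(zones: List[Zone], lo: int, hi: int) -> List[int]:
    """Return indexes of pages whose [min, max] range overlaps the predicate [lo, hi]."""
    return [i for i, (zmin, zmax) in enumerate(zones) if zmax >= lo and zmin <= hi]


column = [1, 2, 3, 10, 11, 12, 20, 21, 22]
zones = build_zone_map(column, page_size=3)
print(zones)
print(pages_to_scan(zones, lo=11, hi=20))
```

Pages outside the returned list never need to be read or decompressed, so pruning works best when values are clustered by page, as they are in this example.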

Experimental Results

The reported results on standard analytics benchmarks are as follows:

End-to-End Query Speedup: On TPC-H, DPF achieves 2.66×–6.22× speedup over a GiDP baseline (fragmented GPU-native pipeline). On SSB, the speedup reaches 3.84×–16.81×.
Reduction in Host-GPU Interactions: Fusion reduces kernel invocation counts by up to two orders of magnitude, and adaptive, type-specific compression reduces overall IO volume by 29–45%.
Robustness: DPF outperforms the baselines in every tested configuration of database page size, query selectivity, and data scale, with speedups of up to 22× on certain queries.
Query Engine Comparison: Against Polars (GPU), Spark-RAPIDS (GPU), and DuckDB (CPU), DPF achieves 7.3×–176.8× speedups (dependent on workload and dataset size) with identical or lower hardware resource utilization.

Implications for Analytical Systems

DPF marks a shift in GPU-accelerated analytics: it removes the tight coupling between kernel design and host-driven scheduling and treats the GPU as a self-sufficient query engine from storage to result, with minimal CPU intervention in the data path after kernel launch. The adoption of BaM for in-kernel IO provides a scalable template for storage-class memory architectures, while pervasive kernel fusion improves memory bandwidth efficiency and reduces PCIe/host-GPU bottlenecks.

Type-specific compression as part of the core architecture (not just at load or storage) unlocks meaningful IO reduction, especially relevant for mixed-schema workloads with substantial variable-length data.

Limitations and Future Directions

While the DPF prototype validates its approach comprehensively, several limitations are present: joins require build-side hash tables to fit entirely in GPU memory; only GPU-FOR and FSST compression schemes are integrated; only LIKE-style variable-length string predicates are accelerated; and the IO stack presently targets NVMe block devices.

Future work includes:

Out-of-core join algorithms for analytical workloads that exceed device memory.
Generalized support for more compression codecs, especially for floating point and fixed-point decimals.
Multi-GPU scaling with interconnect-aware data sharding and join/aggregate parallelism.
Extension of DPF kernel fusion to additional workload types (e.g., graph analytics, OLTP) and exploration of GPU-driven file systems.

Conclusion

DPF represents a system-level advancement for GPU-native database architectures, demonstrating clearly that full data path fusion, GPU-initiated IO, and type-aware columnar compression can jointly yield order-of-magnitude gains in analytical query processing. By systematically removing persistent host-GPU boundaries and aligning all stages of query execution with the GPU’s execution model, DPF sets a practical precedent for future analytical engines that aim to fully exploit the parallelism and IO capabilities of modern accelerators (2605.10511). As analytical workloads continue to grow in volume and complexity, approaches in the DPF paradigm will increasingly shape the design of next-generation data systems.
