
GPU_EXT: Extensible GPU Frameworks

Updated 21 December 2025
  • gpu_ext is a suite of GPU extensions that enhance programmability, observability, and control by extending driver, runtime, and application functionalities.
  • It leverages eBPF hooks, SIMT-aware verification, and middleware discovery to integrate policy enforcement and resource management for improved GPU scheduling.
  • Implementations demonstrate transformative speedups (up to 200–250×) and enhanced multi-tenant isolation, despite current challenges in portability and expressivity.

The term gpu_ext encompasses a diverse family of GPU extensions spanning OS-level resource management, middleware discovery, algorithmic acceleration modules, and custom policy runtimes. Core to all gpu_ext approaches is the extension of GPU stack functionality—at the driver, runtime, or application level—to support programmability, observability, and fine-grained control over memory, scheduling, and task orchestration. Recent research has crystallized around technically distinct, but conceptually unified, gpu_ext systems, notably eBPF-based OS policy runtimes (Zheng et al., 14 Dec 2025), GRID middleware GPU discovery (Isacson et al., 2019), domain-optimized application kernels (Gong et al., 2 Dec 2024, Neep et al., 18 Sep 2025), and high-throughput algorithmic primitives (Kanzaki, 2010, Cardoso et al., 2010). This article delineates the architecture, interfaces, implementation methods, and performance characteristics of modern gpu_ext systems.

1. Architectural Paradigms for gpu_ext

Modern gpu_ext paradigms map to several system layers and technical domains. A structured overview is presented in the table below.

| Layer | Principal Mechanisms | Cited Example |
|---|---|---|
| OS/Driver | eBPF policy hooks, SIMT-aware verification, device-side maps | (Zheng et al., 14 Dec 2025) |
| Middleware | Resource discovery, GLUE2 schema extension | (Isacson et al., 2019) |
| Application Runtime | Algorithmic restructuring for kernel batching, memory hierarchy optimization | (Gong et al., 2 Dec 2024; Cardoso et al., 2010) |
| Scientific Libraries | CUDA/CUDA Fortran kernel extensions, memory layout rewrites | (Neep et al., 18 Sep 2025) |

In gpu_ext OS-level frameworks, such as the eBPF-based gpu_ext, the GPU driver and device are programmatically exposed as subsystems with safe, dynamic hooks for scheduling, memory placement, and performance monitoring. Middleware extensions (e.g., ARC’s gpu_ext) provide discovery pipelines enabling cluster and grid resource managers to recognize and schedule tasks based on GPU availability and attributes. Runtime and library-level gpu_ext variants focus on batching, fusing, and parallelizing key computational routines to maximize GPU occupancy and memory throughput, as seen in DFT and lattice QCD.

2. Extensible OS Policy via eBPF-Based gpu_ext

eBPF-driven gpu_ext treats the GPU driver/device as a programmable OS component, combining user-space policy provisioning with device-side managed enforcement (Zheng et al., 14 Dec 2025).

Design and Mechanisms

  • SIMT-Aware Verifier: Guarantees warp-uniformity in device code by statically prohibiting divergent control flow and divergent atomic operations, and by bounding loop iterations.
  • Sandboxes and Hierarchical Maps: Device-side eBPF runtimes dispatch policy hooks to warp leaders (SIMT model), aggregate results across lanes, and maintain strongly isolated BPF maps split across host, device, and SM memory.
  • Safe Hook Points: Exposed for key lifecycle events: memory activation/access/prefetch/eviction and scheduling task init/exit/queue operations.
  • Formal Type Signatures: For example, memory-access hooks receive a structured ctx and return an integer r ∈ ℤ; a return value r ≥ 0 triggers a custom policy.
  • Helper Functions (kfuncs): Trusted operations for list reordering, attribute setting, or triggering preemption.
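The hook contract above can be illustrated with a minimal sketch. This is not the real gpu_ext API: the ctx fields, threshold, and return-code mapping are assumptions for illustration; only the contract itself (a structured ctx in, an integer r out, with r ≥ 0 selecting a custom policy) comes from the description above.

```python
# Hypothetical sketch of a gpu_ext-style memory-access hook (assumed names).
# Contract: hook receives a structured ctx and returns an integer r;
# r >= 0 triggers a custom policy, r < 0 defers to the driver default.

from dataclasses import dataclass

@dataclass
class AccessCtx:
    page_addr: int      # faulting page address
    access_count: int   # accesses observed for this page
    on_device: bool     # current placement

HOT_THRESHOLD = 8       # illustrative tunable, not from the paper

def memory_access_hook(ctx: AccessCtx) -> int:
    """Return 0 to migrate the page to device, 1 to evict it to host,
    or -1 to defer to the driver's default policy."""
    if not ctx.on_device and ctx.access_count >= HOT_THRESHOLD:
        return 0        # hot host-resident page: migrate to device memory
    if ctx.on_device and ctx.access_count == 0:
        return 1        # cold device-resident page: eviction candidate
    return -1           # no opinion; driver default applies

def dispatch(ctx: AccessCtx) -> str:
    """Mimic the runtime: only r >= 0 invokes a custom policy."""
    r = memory_access_hook(ctx)
    return {0: "migrate_to_device", 1: "evict_to_host"}.get(r, "driver_default")
```

For instance, a host page with 12 observed accesses dispatches to `migrate_to_device`, while a lightly used page falls through to the driver default.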

Safety and Overhead

Verification statically guarantees memory safety, loop termination, and deadlock-freedom. The runtime bounds per-hook code complexity and forbids inter-SM synchronization. End-to-end overhead on microbenchmarks is below 0.2%, supporting real-time deployment.
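A toy model of the loop-termination guarantee: every loop must carry a statically known trip-count bound, and the worst-case instruction count must fit a per-hook budget. The budget value and function shape are illustrative assumptions, not the actual gpu_ext verifier.

```python
# Illustrative sketch of a verifier-style loop-bounding rule (assumed form,
# not the real gpu_ext verifier): a hook is accepted only if every loop has
# a statically known positive bound and the worst-case instruction count
# (body length times the product of nested loop bounds) fits a budget.

MAX_HOOK_INSTRUCTIONS = 4096  # hypothetical per-hook budget

def verify_hook(loop_bounds, body_len):
    """Accept iff all bounds are statically known (> 0) and the
    worst-case instruction count stays within the budget."""
    total = body_len
    for bound in loop_bounds:
        if bound is None or bound <= 0:   # unbounded or unknown loop
            return False
        total *= bound
    return total <= MAX_HOOK_INSTRUCTIONS
```

A hook with two nested loops bounded at 4 and 8 over a 16-instruction body passes (worst case 512); an unbounded loop is rejected outright.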

3. gpu_ext in Middleware and Resource Discovery

In GRID, cluster, and cloud environments, gpu_ext extensions enable discovery and advertisement of GPU resources through existing information-provider chains (Isacson et al., 2019).

Key Technical Steps

  • Direct integration with SLURM resource reporting via sinfo -a -h -o "gresinfo=%G".
  • Propagation of GPU resource descriptors as opaque strings in GLUE2 XML, which are parsed on the client.
  • Zero modification of ARC core libraries; only resource plug-ins and GLUE2 printers are extended.
  • Example: Runtime environments for job submission specify --gres=gpu:k80:1 and are surfaced to user queries via arcinfo.
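Because the middleware forwards the GRES descriptor as an opaque string, the client side must parse forms like `gpu:k80:1`. The sketch below shows one plausible client-side parse of SLURM's `name[:type][:count]` convention; the dictionary keys are illustrative and do not reflect the actual GLUE2 rendering.

```python
# Hedged sketch: parse an opaque SLURM GRES descriptor (as surfaced via
# sinfo's %G field and propagated through GLUE2) into structured fields.
# Follows SLURM's 'name[:type][:count]' convention; output keys are
# illustrative, not the real ARC/GLUE2 schema.

def parse_gres(gres: str):
    """Split 'name[:type][:count]' into a dict with a default count of 1."""
    parts = gres.split(":")
    entry = {"name": parts[0], "type": None, "count": 1}
    if len(parts) == 3:
        entry["type"], entry["count"] = parts[1], int(parts[2])
    elif len(parts) == 2:
        # a lone second field may be either a type or a count
        if parts[1].isdigit():
            entry["count"] = int(parts[1])
        else:
            entry["type"] = parts[1]
    return entry
```

For example, the job-submission descriptor `gpu:k80:1` from above parses to name `gpu`, type `k80`, count 1.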

Limitations

Initial prototypes are SLURM-only; resource descriptors are string-based rather than semantically structured for automated matchmaking. GPU utilization metrics are not natively reported.

4. Algorithmic gpu_ext in Scientific Applications

Algorithmic gpu_ext accelerates domain tasks by restructuring code for fine-grained parallelism and hardware-aware memory use.

  • Parallelization: Blocks of N_k k-points are grouped to maximize GPU memory occupancy, processing Hamiltonian applications and projectors across all k simultaneously.
  • Data Locality: DEVICE-allocated arrays and batched kernel launches minimize PCIe transfers.
  • FFT/LAPACK Optimization: Hand-ported DEVICE routines for FFT and diagonalization replace vendor GPU libraries, which are poorly suited to small-batch, many-k workloads.
  • Performance: Yields up to 14× speedup on non-FFT routines over standard GPU libraries, with total wall-time improvements of up to 4× over CPU for metallic systems with many k-points.
  • Monte Carlo Workflows: Particle stacks are managed via device-side buffers, and simulation steps are implemented as parallel kernels using CUDA Thrust for compaction.
  • Randomness and Data Movement: Pre-generated host-side RNG arrays guarantee reproducibility; shared/global memory minimizes device-host transfers.
  • Scaling: For N_f > 10^4 electrons, GPU execution enables 60–100× speedup over multi-threaded CPU; overhead is dominated by dynamic stack management and RNG.
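The blocked-k strategy above can be sketched in a few lines: choose the largest block size N_k that fits device memory, then launch one batched kernel per block. The function name and the memory heuristic are illustrative assumptions, not the cited codes' actual API.

```python
# Minimal sketch of blocked-k batching: group k-points into blocks of N_k
# so a single batched launch processes all k in a block, amortizing launch
# overhead and filling device memory. The sizing heuristic is illustrative.

def block_kpoints(num_k, mem_per_k, device_mem):
    """Pick the largest block size N_k that fits device memory, then
    partition k-point indices into contiguous blocks of that size."""
    n_k = max(1, min(num_k, device_mem // mem_per_k))
    return [list(range(i, min(i + n_k, num_k))) for i in range(0, num_k, n_k)]
```

With 10 k-points, unit memory per k-point, and room for 4, this yields blocks [0–3], [4–7], [8–9]; when even one k-point exceeds the budget, it degrades gracefully to one k-point per launch.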

5. gpu_ext in General-Purpose and System Call Contexts

GENESYS (Veselý et al., 2017) and related work extend gpu_ext to system-call and OS service invocation:

  • System Call Granularity: Invocation is supported at thread (work-item), work-group, and kernel scope; group-level invocation strikes the best throughput/latency balance.
  • Kernel Support and Coalescing: Kernel and driver extensions manage shared syscall areas, deliver and process requests via workqueues, and support interrupt coalescing.
  • Ordering: Blocking/non-blocking and strong/relaxed ordering semantics are configurable, enabling up to 30% better throughput by overlapping computation with system-call handling.
  • Applicability: 79% of Linux syscalls are directly implementable for GPU-originating requests.
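The group-granularity and coalescing ideas above can be modeled with a small sketch: each work-group elects one leader request, and identical leader requests across groups are coalesced before the host processes them. All structures here are simplified illustrations, not the GENESYS kernel interface.

```python
# Sketch of group-scope syscall invocation with coalescing, in the spirit
# of GENESYS: work-items deposit requests into a shared area, one leader
# request per work-group is forwarded, and identical requests across
# groups are merged into (request, count) pairs for batched handling.
# Simplified illustration only; not the actual kernel/driver interface.

from collections import Counter

def coalesce_group_requests(groups):
    """groups: list of per-work-group request lists. Elect each group's
    first request as leader, then coalesce identical leaders."""
    leaders = [g[0] for g in groups if g]   # one forwarded request per group
    return sorted(Counter(leaders).items())
```

Three groups issuing ("write", "write", "read") coalesce to two host-side operations instead of three, which is the mechanism behind the throughput gain from overlapping computation with batched system-call handling.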

6. Performance and Impact Metrics

Across representative gpu_ext implementations:

  • OS Policy/Driver Level: Up to 4.8× throughput gains and 2× tail-latency reductions on real-world inference and training workloads (Zheng et al., 14 Dec 2025).
  • Domain Science Kernels: Lattice QCD and SU(2) heat-bath simulations on GPU yield 200–250× speedups vs single-core CPUs (Cardoso et al., 2010); Monte Carlo integration achieves 40–100× speedup compared to legacy codes (Kanzaki, 2010).
  • Strong Scaling: Blocked-k and many-task algorithms maintain near-linear scaling up to k-point or problem size limits imposed by device memory.
  • Multi-Tenant Isolation: eBPF-based gpu_ext policies reduce P99 latency by up to 95% for latency-critical workloads while preserving best-effort throughput.

7. Limitations and Future Directions

Current gpu_ext systems face several established limitations:

  • Portability: eBPF-based gpu_ext currently targets NVIDIA; AMD/Intel require device-specific backends.
  • Semantics and Expressivity: eBPF subset excludes recursion and global synchronization, constraining some policy schemes (Zheng et al., 14 Dec 2025).
  • Resource Discovery Semantics: Middleware-level gpu_ext propagates opaque resource descriptors, lacking standardized dynamic metrics.
  • Algorithmic GPU Library Support: Some application-level gpu_ext variants require custom DEVICE implementations of math libraries due to limitations or inefficiencies in standard vendor libraries (Gong et al., 2 Dec 2024).

Prospective directions include device-local RNG schemes, event-driven monitoring hooks, portable DSLs targeting gpu_eBPF runtimes, integration with NVML for richer profiling, and broader backend support for AMD/Intel architectures. Extension points to power management, cache replacement, or persistent monitoring can further generalize the gpu_ext framework.


In summary, gpu_ext defines an extensible set of hardware, OS, middleware, and application-layer techniques and frameworks that systematically expose and augment GPU functionality for programmability, policy enforcement, high-throughput scheduling, and accelerated scientific computation. Its architectures are characterized by safe, verifiable interfaces (e.g., eBPF hooks), cross-layer data models, and application-transparent acceleration, delivering transformative gains in scientific computing, workload management, and multi-tenant platforms without extensive manual intervention or codebase rewrites (Zheng et al., 14 Dec 2025, Isacson et al., 2019, Gong et al., 2 Dec 2024, Cardoso et al., 2010, Veselý et al., 2017, Neep et al., 18 Sep 2025).
