GPIR: Enabling Practical Private Information Retrieval with GPUs

Published 6 Apr 2026 in cs.CR and cs.AR | (2604.04696v1)

Abstract: Private information retrieval (PIR) allows private database queries but is hindered by intense server-side computation and memory traffic. Modern lattice-based PIR protocols typically involve three phases: ExpandQuery (expanding a query into encrypted indices), RowSel (encrypted row selection), and ColTor (recursive "column tournament" for final selection). ExpandQuery and ColTor primarily perform number-theoretic transforms (NTTs), whereas RowSel reduces to large-scale independent matrix-matrix multiplications (GEMMs). GPUs are theoretically ideal for these tasks, provided multi-client batching is used to achieve high throughput. However, batching fundamentally reshapes performance bottlenecks; while it amortizes database access costs, it expands working sets beyond the L2 cache capacity, causing divergent memory behaviors and excessive DRAM traffic. We present GPIR, a GPU-accelerated PIR system that rethinks kernel design, data layout, and execution scheduling. We introduce a stage-aware hybrid execution model that dynamically switches between operation-level kernels, which execute each primitive operation separately, and stage-level kernels, which fuse all operations within a protocol stage into a single kernel to maximize on-chip data reuse. For RowSel, we identify a performance gap caused by a structural mismatch between NTT-driven data layouts and tiled GEMM access patterns, which is exacerbated by multi-client batching. We resolve this through a transposed-layout GEMM design and fine-grained pipelining. Finally, we extend GPIR to multi-GPU systems, scaling both query throughput and database capacity with negligible communication overhead. GPIR achieves up to 305.7x higher throughput than PIRonGPU, the state-of-the-art GPU implementation.

Summary

  • The paper introduces a stage-aware hybrid GPU execution model that fuses cryptographic operations to reduce DRAM traffic and latency in privacy-preserving queries.
  • It optimizes data layouts for NTT and GEMM computations, achieving up to 305.7× higher throughput and significantly reducing memory overhead.
  • Multi-GPU orchestration alongside extensive pipelining techniques enables scalability and sub-millisecond latencies on GB-scale databases, setting a new performance standard.

Background and Motivation

Lattice-based Private Information Retrieval (PIR) protocols enable privacy-preserving queries on public databases without revealing access patterns; however, such cryptographic primitives introduce severe server-side computational and memory bottlenecks due to heavy use of homomorphic encryption (HE) and associated polynomial arithmetic over large moduli. The OnionPIR family of schemes exemplifies state-of-the-art single-server PIR with minimal communication cost but high server workload. Modern GPUs offer massive parallelism and memory bandwidth, making them attractive for accelerating these cryptographic workloads, especially when exploiting multi-client batching to amortize costs. Despite this theoretical hardware fit, naive GPU implementations exhibit severe cache and bandwidth bottlenecks due to phase-specific working set and data layout challenges, notably the cache-capacity wall in batched workflows and a structural conflict between NTT-optimized and GEMM-optimized data layouts.

Computation Pipeline and Systemic Bottlenecks

GPIR targets OnionPIRv2, which organizes PIR queries into three computationally distinct phases: ExpandQuery, RowSel, and ColTor.

Figure 1: The end-to-end computation flow for OnionPIRv2, highlighting the transformation and selection phases required per encrypted query.

ExpandQuery recursively expands a compact encrypted query into a large set of HE ciphertexts (ct) via binary-tree homomorphic substitutions mainly involving digit decomposition, inverse/forward NTT, and polynomial arithmetic. RowSel executes a large batch of encrypted row selections as independent GEMMs using the NTT-domain, RNS-packed ct outputs, but is encumbered by the physical layout imposed by prior NTT steps. ColTor completes the PIR with a tournament-like column selection, again invoking external product operations (with digit decomposition) to reduce the ct outputs to the correct index.
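To make the NTT primitive used by ExpandQuery and ColTor concrete, here is a minimal sketch of a toy transform. The modulus, length, and root of unity are illustrative assumptions, not GPIR or OnionPIRv2 parameters, and the sketch uses a naive quadratic-time cyclic transform, whereas lattice schemes typically work modulo $X^N + 1$ (negacyclic) with fast butterfly networks.

```python
# Toy number-theoretic transform (NTT) over Z_17 with length 8.
# All parameters are illustrative assumptions; real PIR schemes use far
# larger moduli and polynomial degrees, and O(N log N) butterflies.

Q = 17        # small prime modulus
N = 8         # transform length; N divides Q - 1
OMEGA = 2     # primitive N-th root of unity mod Q (2**8 % 17 == 1, 2**4 % 17 == 16)


def ntt(values, root=OMEGA):
    """Naive O(N^2) forward transform: X[k] = sum_j x[j] * root^(j*k) mod Q."""
    return [
        sum(v * pow(root, j * k, Q) for j, v in enumerate(values)) % Q
        for k in range(N)
    ]


def intt(values):
    """Inverse transform: use root^-1 and scale by N^-1 mod Q."""
    inv_root = pow(OMEGA, -1, Q)   # modular inverse (Python 3.8+)
    inv_n = pow(N, -1, Q)
    return [(x * inv_n) % Q for x in ntt(values, inv_root)]


def cyclic_convolution(a, b):
    """Schoolbook product of a and b modulo (X^N - 1, Q), for reference."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % N] = (out[(i + j) % N] + ai * bj) % Q
    return out


def ntt_multiply(a, b):
    """Multiply in the NTT domain: pointwise products replace convolution."""
    fa, fb = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(fa, fb)])
```

The point of the transform is visible in `ntt_multiply`: once operands are in the NTT domain, polynomial multiplication becomes element-wise, which is why RowSel can consume NTT-domain ciphertexts as plain matrix operands.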

These phases impose heterogeneous data partitioning requirements (Figure 2): NTTs need limb-major layouts (in RNS parlance, "rows" of the $4N$ prime-packed polynomial coefficients), GEMMs are optimal with $k$-major or $m$/$n$-major layouts, while digit decomposition aligns coefficient-wise. Transitioning between these layouts induces costly data movement and frequent cache-line invalidations, especially when working sets per batch and per protocol stage breach the 96 MB L2 cache, severely amplifying off-chip DRAM transactions.

Figure 2: Comparison of partitioning strategies for core PIR operations, illustrating the mismatch between ideal layouts for NTT, coefficient operations, and GEMMs.
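The layout mismatch can be illustrated with flat-offset formulas. The sketch below is a simplified model, not GPIR's actual memory layout: `LIMBS` and `N` are hypothetical sizes, and only two of the layouts discussed (limb-major and coefficient-major) are shown.

```python
# Flat-offset formulas for one ciphertext polynomial stored in RNS form.
# LIMBS (number of RNS primes) and N (coefficients per limb) are
# hypothetical sizes chosen for illustration.

LIMBS = 4
N = 8


def limb_major_offset(limb, coeff):
    """All coefficients of one limb are contiguous (NTT-friendly)."""
    return limb * N + coeff


def coeff_major_offset(limb, coeff):
    """All limbs of one coefficient are contiguous (coefficient-wise ops)."""
    return coeff * LIMBS + limb


def stride(offset_fn, axis):
    """Distance in elements between neighbours along 'limb' or 'coeff'."""
    if axis == "coeff":
        return offset_fn(0, 1) - offset_fn(0, 0)
    return offset_fn(1, 0) - offset_fn(0, 0)


def transpose_limb_to_coeff(buf):
    """Re-layout a limb-major buffer as coefficient-major (explicit data movement)."""
    out = [0] * (LIMBS * N)
    for limb in range(LIMBS):
        for coeff in range(N):
            out[coeff_major_offset(limb, coeff)] = buf[limb_major_offset(limb, coeff)]
    return out
```

In the limb-major layout, walking the coefficients of one limb has stride 1, while gathering all limbs of one coefficient has stride `N`. An operation that wants the other axis contiguous either pays strided accesses or an explicit re-layout like `transpose_limb_to_coeff`, which touches every element once.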

A detailed roofline model (Figure 3) exposes these architectural consequences: under batching, RowSel shifts from a memory-bound regime (dominated by $D_0 \times D_1$ DB access) to being compute-bound but suboptimal: the current layout reduces throughput to a fraction of peak IMAD. Conversely, ExpandQuery and ColTor move deeper into DRAM-bottlenecked regimes under larger batches due to explosive transient working sets in digit decompositions and inadequate data reuse.

Figure 3: Roofline analysis of ExpandQuery, RowSel, and ColTor on RTX 5090, showing compute/memory throughput and bottleneck shifts with working set size.

Figure 4: Working set growth per operation across expanding batch sizes, illustrating the cache pressure induced during ExpandQuery and ColTor.
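The regime shift described for RowSel follows from the standard roofline bound, sketched below. The traffic model (only the database read counts toward DRAM bytes) and the hardware numbers used in the usage note are simplifying assumptions, not measurements from the paper or RTX 5090 specifications.

```python
def attainable_throughput(peak_ops, bandwidth_bytes, intensity):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_ops, bandwidth_bytes * intensity)


def rowsel_intensity(batch, d0, d1, bytes_per_word=4):
    """Ops per DRAM byte for a batched D0 x D1 selection, counting only
    the database read as traffic (simplifying assumption)."""
    ops = 2 * batch * d0 * d1            # one multiply + one add per DB word per query
    db_bytes = d0 * d1 * bytes_per_word  # the DB is read once and shared by the batch
    return ops / db_bytes


def is_memory_bound(peak_ops, bandwidth_bytes, intensity):
    """True when the bandwidth roof lies below the compute roof."""
    return bandwidth_bytes * intensity < peak_ops
```

Under this model the intensity is `batch / 2` ops per byte for 4-byte words, independent of the database dimensions. With a hypothetical device offering 100e12 ops/s and 1.5e12 bytes/s, a single query sits far below the ridge point, while a batch of 256 crosses it and becomes compute-bound, at which point tile efficiency rather than DB bandwidth sets the limit.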

Stage-Aware Hybrid Execution

The paper proposes a stage-aware hybrid execution paradigm to dynamically select kernel granularity for ExpandQuery and ColTor phases. When the working set is L2-resident, fine-grained (operation-level) kernels are used to maximize occupancy and parallel resource utilization. When working sets exceed the cache, GPIR fuses all stage-specific operations into monolithic (stage-level) kernels per ct/tree-node, physically retaining all transient intermediates in SM-local registers/shared memory to minimize DRAM churn. Multi-client batching ensures enough parallel ciphertexts are available to saturate hardware with these coarser kernels while confining working set spikes and significantly reducing DRAM traffic and latency.

Figure 5: DRAM transactions and execution time for operation-level vs. stage-level kernels, showing stage-dependent traffic and occupancy trade-offs.

The transition point between kernel types is statically defined based on batch size and protocol tree depth, ensuring that L2 bandwidth is maximized before DRAM pressure dominates (as empirically validated in Figure 6).

Figure 6: Bandwidth utilization under occupancy scaling, contrasting L2 scaling needs versus DRAM saturation endpoints.
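A static selection rule of this kind could look like the sketch below. The working-set model, the function names, and the `expansion` factor are hypothetical illustrations; only the 96 MB L2 figure comes from the text (interpreted here as 96 MiB), and the paper's actual threshold may be derived differently.

```python
L2_CAPACITY_BYTES = 96 * 1024 * 1024   # 96 MB L2, the figure quoted in the text


def working_set_bytes(batch, live_cts_per_query, bytes_per_ct, expansion=1):
    """Hypothetical model: live ciphertexts times size, times a transient
    expansion factor (e.g. from digit decomposition)."""
    return batch * live_cts_per_query * bytes_per_ct * expansion


def choose_kernel(batch, tree_depth, bytes_per_ct, expansion=1,
                  l2_bytes=L2_CAPACITY_BYTES):
    """Pick 'operation-level' while the stage fits in L2, else 'stage-level'."""
    live_cts = 2 ** tree_depth          # nodes at this level of the binary tree
    ws = working_set_bytes(batch, live_cts, bytes_per_ct, expansion)
    return "operation-level" if ws <= l2_bytes else "stage-level"
```

Because the inputs are batch size and tree depth, the decision can be made once per stage before launch rather than measured at run time. For example, `choose_kernel(1, 3, 64 * 1024)` stays operation-level (512 KiB), while `choose_kernel(64, 8, 64 * 1024, expansion=4)` switches to stage-level (4 GiB under this model).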

Data Layout–Aware RowSel Optimizations

GPIR identifies that the conventional NTT-centric, $p$-major memory layout fundamentally impedes the efficient, large-tile GEMM computation required for batched RowSel. The system introduces an explicit transposition step, reorienting the RowSel GEMM operands so that large tiles can be computed with high resource efficiency using standard $k$/$m$/$n$-major layouts, directly enabling increased tile dimensions, reduced global transactions, and higher SM occupancy. For instance, in a 2GB DB with batch 32 on RTX 5090, optimized tiling yields 1.76× higher compute throughput and increased occupancy.

Figure 7: Schematic of required RowSel GEMMs (a), and comparison of baseline ($p$-major, b) and transposed ($k$-major, c) tiling approaches.

However, repeated global memory transpositions before/after each batched RowSel kernel could dominate overall latency. The authors resolve this by pipelining transpose and GEMM kernels along the $p$ dimension: RNS-prime parallelism is exploited via CUDA streams with chunking over the $N$ polynomial coefficient axis, further overlapped and efficiently scheduled using CUDA Graphs. This multi-level pipelining hides up to 6.7% of total runtime that would otherwise be lost to data copies.

Figure 8: Pipelined execution design for RowSel, using prime-level and $N$-chunked partitioning to overlap transpositions and computations.

Multi-GPU Orchestration

GPIR supports both throughput scaling and scale-out via multi-GPU systems. Three execution paradigms are analyzed: pure batch parallelism (DB replication, near-linear QPS scaling), sharding (partitioning DB across devices, with partial result aggregation for scale-up), and all-gather for batched ExpandQuery (distributing ExpandQuery load and aggregating expanded ct using high-bandwidth interconnects). The protocols are carefully profiled for inter-GPU communication: RowSel and ColTor aggregation costs are negligible, but ExpandQuery all-gather requires NVLink to prevent throughput loss at high batch sizes.

Figure 9: Multi-GPU PIR execution with DB sharding and all-gather patterns, demarcating communication points.
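The sharding paradigm relies on row selection being a linear operation, so partial results from each device can simply be summed. The sketch below shows this on plain integers modulo a toy prime; in the real protocol the selector is encrypted and the arithmetic is homomorphic, and the function names and modulus here are hypothetical.

```python
Q = 65537   # toy modulus (assumption)


def select_rows(db_rows, selector):
    """Plain-integer stand-in for RowSel: sum_i selector[i] * db_rows[i] mod Q."""
    width = len(db_rows[0])
    out = [0] * width
    for s, row in zip(selector, db_rows):
        for j, value in enumerate(row):
            out[j] = (out[j] + s * value) % Q
    return out


def sharded_select(db_rows, selector, num_devices):
    """Split rows across devices, compute partial results, then aggregate."""
    shard = (len(db_rows) + num_devices - 1) // num_devices
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, min((d + 1) * shard, len(db_rows))
        if lo < hi:   # skip devices that receive no rows
            partials.append(select_rows(db_rows[lo:hi], selector[lo:hi]))
    # Aggregation step: element-wise modular sum of the per-device partials.
    return [sum(col) % Q for col in zip(*partials)]
```

Each device only needs its own shard of the database and the matching slice of the selector, and the only cross-device traffic is one result-sized vector per device, which is consistent with the aggregation cost being small relative to the selection work.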

Quantitative Evaluation

A comprehensive ablation study is presented (Figure 10), evaluating GPIR on an RTX 5090 at multiple DB and batch sizes, isolating the effects of hybrid execution, transposed RowSel, $p$-dimension pipelining, and multi-GPU scaling. The system delivers up to 305.7× higher throughput than PIRonGPU and 1.96–2.29× end-to-end speedup over a tightly optimized modern baseline. Notably, these gains persist as DB size increases, demonstrating scalability not observed in prior PIR GPU implementations, which sharply degrade above 2GB. DRAM savings of up to 1.83× (ExpandQuery) and 1.52× (ColTor) over prior work are shown (Figure 11).

Figure 10: Execution time per batch after incremental application of each optimization technique.

Figure 11: DRAM traffic reduction for ExpandQuery and ColTor under batching with hybrid kernels, normalized to baseline kernels.

Multi-GPU scaling experiments (Figure 12) indicate 1.73×–1.94× throughput scaling with two GPUs, with NVLink essentially eliminating communication as a bottleneck.

Figure 12: Throughput and scalability under multi-GPU sharding and all-gather strategies using PCIe or NVLink.
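The two-GPU scaling figures can be restated as parallel efficiency (speedup divided by device count). This is a small worked calculation on the numbers quoted above; the helper name is illustrative.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved: speedup / device count."""
    return speedup / num_gpus


# The reported range of 1.73x to 1.94x on two GPUs.
low = parallel_efficiency(1.73, 2)
high = parallel_efficiency(1.94, 2)
```

Here `low` is about 0.865 and `high` about 0.97, so the reported range corresponds to roughly 86.5% to 97% of ideal two-GPU scaling.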

Implications and Future Trajectories

GPIR demonstrates that software/hardware co-design, including phase- and data layout–adaptive execution and fine-grained pipelines, is essential to realizing the full potential of high-throughput, single-server PIR. The results suggest that properly engineered software can render even GB-scale PIR practical on contemporary GPU platforms, without custom accelerators. Multi-GPU approaches support both larger databases and greater batch-QPS, suggesting deployment feasibility for use-cases such as private search, DNS, and blockchain data access.

On the theoretical front, the work reveals that phase-wise analytic modeling is fundamental in mapping cryptographic protocol complexity to parallel hardware. Practically, the outlined optimizations set a new performance reference for single-server PIR, potentially informing GPU-aware cryptographic library and protocol design. This carries cross-domain implications for large-scale privacy-preserving analytics, ML inference, and federated data property testing, where oblivious access patterns are required.

Given contemporary interest in HE, FHE, and privacy infrastructure, the methods in GPIR are immediately relevant for constructing secure, high-throughput cloud services. Companion future work could explore asynchronous protocol variations that interleave protocol phases (allowing for further hardware pipelining), cross-node load balancing, and architectural extensions to further reduce data layout friction at the hardware level.

Conclusion

GPIR provides a blueprint for practical, high-performance PIR on GPUs by resolving bottlenecks at the intersection of homomorphic cryptography, memory hierarchy, and parallel hardware. The demonstrated architectural awareness—hybrid kernel selection, layout-conscious execution, and orchestration for multi-GPU scaleout—enables GB-scale PIR queries at sub-millisecond latencies, setting a standard for software-only solutions in privacy-preserving database services (2604.04696).
