SOLANET: Distributed Neighbor Graph Construction on GPU-Accelerated Systems

Published 26 May 2026 in cs.DC | (2605.27691v1)

Abstract: Neighbor graphs capture relationships among data points and are widely used in data analytics and AI workloads. Many studies have explored approximate construction methods for single-node systems, including GPUs. However, extending this to distributed systems for larger data and further acceleration remains challenging due to irregular computation patterns. We present SOLANET, a GPU-accelerated distributed neighbor graph construction toolkit. SOLANET first constructs local graphs on each GPU after data partitioning and then refines them via approximate nearest neighbor (ANN) searches over remote graphs pulled from other GPUs using MPI one-sided operations. SOLANET also provides a lock-free single-GPU neighbor graph construction algorithm for AMD GPUs. Our single-GPU implementation outperforms a state-of-the-art GPU-based approximate neighbor graph construction implementation across multiple datasets on a single MI300A APU. Furthermore, SOLANET demonstrates 11x speedup from 32 to 512 APUs for 1 billion data points and 6.9x speedup from 64 to 512 APUs for 2 billion points.


Authors (6)

Summary

The paper introduces SOLANET, a novel framework that scales GPU-accelerated kNN graph construction to billion-point datasets using distributed and lock-free approaches.
The methodology partitions data across GPUs, employs a lock-free NN-Descent algorithm, and uses binary-tree-structured refinement to optimize communication and computation.
Experimental results show significant speedups (up to 11.7x) and high graph quality (recall@32 above 99%), highlighting SOLANET's potential for large-scale AI and data analytics.

SOLANET: Scalable Distributed GPU-Accelerated Neighbor Graph Construction

Introduction and Motivation

Approximate $k$-nearest neighbor graphs ($k$NNGs) are foundational in high-dimensional data analysis, vector databases, clustering, and modern AI workloads including LLM retrieval-augmentation and large-scale recommendation systems. As vector and embedding datasets grow to billions of items and high dimensions, the shortcomings of single-node and single-GPU solutions become significant. The construction of $k$NNG at this scale presents unique challenges: super-linear time complexity, irregular computation and communication, massive memory footprints, and the necessity for hardware and communication-aware distributed implementations.

SOLANET introduces a distributed, GPU-accelerated neighbor graph construction toolkit that specifically targets these issues, designed and evaluated on contemporary heterogeneous high-performance computing systems with AMD MI300A APUs. Core design elements include partitioned local graph induction, distributed cross-partition refinement via high-throughput approximate nearest neighbor (ANN) search, and lock-free local graph update algorithms. The framework leverages advanced communication primitives (MPI one-sided operations) and is decoupled from local ANN backend implementations.

Methodological Contributions

Distributed $k$NNG Construction Framework

SOLANET decomposes neighbor graph construction for large-scale data into two main phases:

Local Graph Construction: The input dataset $D$ is partitioned into $P$ subsets, each processed on a separate GPU/MPI rank. Each partition forms an initial local $k$NNG using a GPU-optimized NN-Descent algorithm.
Cross-Partition Refinement: To recover neighbor edges across partitions, each rank fetches remote graphs and datasets using high-bandwidth, low-latency MPI one-sided gets, executes batched graph-based ANN search, and updates its local $k$NNG with new candidates.
Figure 1: High-level depiction of local and distributed stages in nearest neighbor graph construction.

This design ensures strong arithmetic intensity, regular and coarse-grained communication, and natural support for asynchronous overlap of communication and computation. By abstracting the local graph construction and search algorithms, SOLANET can adapt to improvements in single-GPU ANN methods.
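The two phases can be illustrated with a small single-process sketch. This is not SOLANET's implementation: the function names (`local_knn`, `refine`, `build_knng`) and the round-robin partitioning are assumptions made for illustration, an exact brute-force search stands in for both GPU NN-Descent and graph-based ANN search, and plain Python lists stand in for data pulled with MPI one-sided gets.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_knn(points, ids, k):
    """Phase 1 stand-in: exact k-NN restricted to one partition.
    (A real system would run approximate NN-Descent on the GPU here.)"""
    graph = {}
    for i in ids:
        cands = [(sq_dist(points[i], points[j]), j) for j in ids if j != i]
        graph[i] = sorted(cands)[:k]
    return graph

def refine(points, graph, my_ids, remote_ids, k):
    """Phase 2 stand-in: search a remote partition for better neighbors
    and merge them into the local lists, keeping the k closest."""
    for i in my_ids:
        remote = [(sq_dist(points[i], points[j]), j) for j in remote_ids]
        graph[i] = sorted(graph[i] + remote)[:k]

def build_knng(points, num_parts, k):
    n = len(points)
    # Assumed partitioning: round-robin assignment of point ids to partitions.
    parts = [list(range(p, n, num_parts)) for p in range(num_parts)]
    graph = {}
    for ids in parts:                      # local graph construction
        graph.update(local_knn(points, ids, k))
    for p, ids in enumerate(parts):        # cross-partition refinement
        for q, remote_ids in enumerate(parts):
            if q != p:
                refine(points, graph, ids, remote_ids, k)
    return graph

random.seed(0)
points = [[random.random() for _ in range(4)] for _ in range(60)]
graph = build_knng(points, num_parts=4, k=5)
exact = local_knn(points, list(range(len(points))), k=5)
print("matches exact k-NN graph:", graph == exact)
```

Because the stand-in searches are exact, the refined graph here coincides with the exact one; with approximate local construction and ANN search, the refinement step recovers cross-partition edges only approximately.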

Lock-Free GPU NN-Descent

The local NN-Descent implementation for AMD GPUs eschews the traditional global memory locks for candidate list updates. Instead, it employs atomic append operations to per-point candidate buffers, followed by thread-serialized graph updates—maximizing parallel efficiency and mitigating serialization bottlenecks known to afflict prior lock-based GPU implementations.
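The append-then-update pattern described above can be sketched sequentially in Python. This is an illustrative simulation only: the class `CandidateBuffers`, the fixed buffer capacity, the drop-on-overflow policy, and the one-worker-per-point merge are all assumptions, and on a GPU the slot reservation in `append` would be a single atomic fetch-and-add rather than two Python statements.

```python
class CandidateBuffers:
    """Fixed-capacity per-point candidate buffers (hypothetical layout)."""

    def __init__(self, num_points, capacity):
        self.capacity = capacity
        self.counts = [0] * num_points
        self.slots = [[None] * capacity for _ in range(num_points)]

    def append(self, point, cand_id, dist):
        # Reserve a slot. On a GPU this read-and-increment would be one
        # atomic fetch-and-add, so concurrent writers never take a lock.
        slot = self.counts[point]
        self.counts[point] = slot + 1
        if slot >= self.capacity:
            return False  # assumed policy: drop candidates on overflow
        self.slots[point][slot] = (dist, cand_id)
        return True


def apply_updates(graph, buffers, k):
    """Merge buffered candidates into each neighbor list.
    Each list is touched by exactly one worker (here: one loop iteration),
    so the update itself needs no synchronization."""
    updated = 0
    for point, neighbors in enumerate(graph):
        filled = min(buffers.counts[point], buffers.capacity)
        best = {}
        for dist, cand_id in neighbors + buffers.slots[point][:filled]:
            if cand_id not in best or dist < best[cand_id]:
                best[cand_id] = dist  # deduplicate by candidate id
        new_list = sorted((d, c) for c, d in best.items())[:k]
        if new_list != neighbors:
            graph[point] = new_list
            updated += 1
        buffers.counts[point] = 0  # reset buffer for the next iteration
    return updated


# Neighbor lists hold (distance, neighbor_id) pairs, sorted by distance.
graph = [
    [(0.9, 1), (1.5, 2)],
    [(0.9, 0), (2.0, 2)],
    [(1.5, 0), (2.0, 1)],
]
buffers = CandidateBuffers(num_points=3, capacity=4)
buffers.append(0, 3, 0.4)   # a closer candidate for point 0
buffers.append(0, 1, 0.9)   # a duplicate of an existing neighbor
buffers.append(2, 3, 0.7)   # a closer candidate for point 2
changed = apply_updates(graph, buffers, k=2)
print(changed, graph)
```

Separating the two steps means that contention is confined to the slot counter, while the more expensive sort-and-truncate work runs without any shared writes.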

To achieve scalability beyond all-to-all refinement—where communication and search costs saturate early—SOLANET introduces a hierarchical, binary-tree-structured merge pattern for subgraph refinement. At each level, datasets and graphs from groups of ranks are merged, and cross-group ANN search is executed, reducing both the number of refinement steps and data movement volume per refinement phase. This approach underpins the high scalability demonstrated, particularly for billion-scale graphs.
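One way to picture a binary-tree merge pattern is as a schedule of group pairs per level. The sketch below is an assumption about how such a schedule could be laid out (adjacent groups paired, power-of-two rank count, hypothetical function name `tree_merge_schedule`); the summary does not specify SOLANET's exact pairing or how work is divided inside a group.

```python
def tree_merge_schedule(num_ranks):
    """Return one list per tree level; each entry is a (left_group,
    right_group) pair whose members refine against the other group."""
    if num_ranks < 1 or num_ranks & (num_ranks - 1):
        raise ValueError("this sketch assumes a power-of-two rank count")
    groups = [[rank] for rank in range(num_ranks)]
    levels = []
    while len(groups) > 1:
        pairs = [(groups[i], groups[i + 1]) for i in range(0, len(groups), 2)]
        levels.append(pairs)
        # Merged groups become the inputs of the next level.
        groups = [left + right for left, right in pairs]
    return levels


schedule = tree_merge_schedule(8)
for level, pairs in enumerate(schedule):
    print(f"level {level}: {pairs}")
```

In this layout, every pair of ranks ends up in opposing groups exactly once, and the number of levels is log2 of the rank count (9 levels for 512 ranks), compared with one step per remote peer in a flat all-to-all scheme.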

Figure 2: Execution pipeline of ANN search-based distributed kNNG refinement with hierarchical (tree-based) merging.

Experimental Evaluation

Evaluation is conducted on LLNL's Tuolumne cluster, using up to 512 AMD MI300A APUs. SOLANET is tested across diverse datasets, from Fashion-MNIST and GIST to DEEP-1B, SIFT-1B, and synthetic DEEP-2B, up to 2 billion points.

Figure 3: Visualization and characteristics of the Fashion-MNIST dataset as used in evaluation.

Figure 4: Schematic of DEEP-100M and SIFT-100M billion-scale datasets.

Key findings include:

The lock-free single-GPU implementation outperforms hipVS (itself state-of-the-art for AMD) in both construction time and recall on all evaluated datasets except NYTimes.
The distributed engine exhibits near-linear strong scaling up to the point where per-partition size falls below computational saturation or communication overheads dominate. For DEEP-1B and SIFT-1B, SOLANET achieves 11x–11.7x speedup scaling from 32 to 512 APUs, and for DEEP-2B, 6.9x from 64 to 512 APUs.
Compared to NEO-DNND, a distributed CPU-based NN-Descent, SOLANET provides an 8.3x runtime reduction at competitive recall for billion-scale inputs.
Graph quality remains high, with recall@32 above 99% for $L_2$ benchmarks. NEO-DNND graphs used as a reference yield 70–75% recall when compared against SOLANET’s output, whereas SOLANET reaches 99% compared to NEO-DNND, demonstrating both accelerated runtime and improved neighbor accuracy.
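To put the reported speedups in context, they can be converted to strong-scaling efficiency, that is, speedup divided by the increase in APU count. The helper names below are hypothetical, and `recall_at_k` uses the common set-overlap definition against a reference graph, which is an assumption here since the summary does not spell out how recall@32 is computed.

```python
def strong_scaling_efficiency(speedup, base_units, scaled_units):
    """Fraction of ideal speedup achieved when scaling the unit count."""
    return speedup / (scaled_units / base_units)


def recall_at_k(result, reference, k):
    """Average fraction of each point's k reference neighbors that also
    appear among the k neighbors in the result graph."""
    hits = sum(
        len(set(result[p][:k]) & set(reference[p][:k])) for p in reference
    )
    return hits / (k * len(reference))


# Speedups and APU counts as reported in the findings above.
reported = {
    "1B points, 32 -> 512 APUs (11x)": (11.0, 32, 512),
    "1B points, 32 -> 512 APUs (11.7x)": (11.7, 32, 512),
    "DEEP-2B, 64 -> 512 APUs (6.9x)": (6.9, 64, 512),
}
efficiency = {
    label: strong_scaling_efficiency(*args) for label, args in reported.items()
}
for label, value in efficiency.items():
    print(f"{label}: {value:.3f}")

# Toy recall example with made-up neighbor ids.
toy_result = {0: [1, 2, 3], 1: [0, 2, 3]}
toy_reference = {0: [1, 2, 4], 1: [0, 2, 3]}
toy_recall = recall_at_k(toy_result, toy_reference, k=3)
```

Going from 32 to 512 APUs is a 16-fold increase, so 11x to 11.7x corresponds to roughly 69% to 73% of ideal; going from 64 to 512 is an 8-fold increase, so 6.9x corresponds to roughly 86%.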

Runtime and Scalability Analysis

The authors provide formal runtime complexity estimates that account for both computation (ANN search and local $k$NNG construction) and communication (latency and bandwidth). The estimates exploit the empirical finding that ANN search time at fixed $k$ is largely insensitive to the number of source points, because the search radius in graph traversal and optimization is bounded. The hierarchical merge pattern and group-wise flat refinement minimize redundant data pulls and maximize communication efficiency, which is critical for practical scaling on modern supercomputers.
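As a rough illustration of how such an estimate is commonly structured, a generic latency-bandwidth model can be written as follows. This is not the paper's formula; every symbol is an assumption introduced here: $\alpha$ is the per-message latency, $\beta$ the bandwidth, $m_r$ the bytes pulled in refinement round $r$, $q$ the number of local query points, $R$ the number of refinement rounds, and $T_{\text{local}}$ the local $k$NNG construction time.

$$T_{\text{total}} \approx T_{\text{local}} + \sum_{r=1}^{R} \left( \alpha + \frac{m_r}{\beta} + T_{\text{search}}(q, k) \right)$$

Under the empirical observation above, $T_{\text{search}}$ is treated as depending on $q$ and $k$ but only weakly on the number of source points, so in a model of this shape the per-round search cost stays roughly constant while the communication term grows with $m_r$.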

Figure 5: Absolute execution time breakdown across distributed phases for the DEEP-1B dataset.

Implications and Future Directions

SOLANET advances neighbor graph construction methodology by introducing a GPU-accelerated, distributed design tailored to the demands and properties of billion-point, high-dimensional workloads. From a practical standpoint, this enables real-time or near-real-time construction of high-quality neighbor graphs for retrieval, clustering, and embedding similarity computation at a scale previously not accessible without custom or disk-based solutions.

Theoretically, the architecture decouples distributed strategies from the specifics of ANN backends, allowing transparent exploitation of future improvements in GPU graph search and construction. The lock-free approach and communication-optimized refinement outline a general paradigm for future large-scale distributed similarity indexing beyond ANN, including non-Euclidean or nonmetric similarities.

Future research opportunities include integration with additional ANN algorithms, adaptation to NVIDIA accelerators, and intelligent partitioning schemes exploiting domain semantics to further reduce cross-partition refinement.

Conclusion

SOLANET establishes new standards for scalable, distributed $k$NNG construction on GPU-accelerated platforms. Through architectural innovations in communication, lock-free concurrency, and graph refinement, it demonstrates both superior empirical scaling and graph quality on billion-scale vector workloads. This toolkit provides a high-performance foundation for a host of contemporary AI and data-intensive applications, directly supporting the growth and increasing complexity of scientific and industrial vector analytics.