FaSTED: Fast Tensor Core Euclidean Distance
- The paper introduces FaSTED, a GPU-based algorithm that uses tensor core MMA operations for efficient mixed precision Euclidean distance computation.
- It employs multi-level hierarchical tiling and memory reuse strategies to optimize throughput and reduce latency on modern GPUs.
- Performance evaluations show up to 51× speedups over FP32/FP64 methods with negligible accuracy loss (<0.06%).
The Fast and Scalable Tensor core Euclidean Distance (FaSTED) algorithm is a high-throughput method for computing Euclidean distances on GPU tensor cores, specifically optimized for mixed-precision arithmetic. FaSTED exploits large-scale data reuse, aggressive pipelining of memory transfers, careful register/shared-memory coordination, and the native matrix multiply–accumulate (MMA) capabilities of tensor cores to achieve significant speedups over FP32 and FP64 CUDA core algorithms and prior tensor core methods. It is primarily designed for similarity search and nearest neighbor operations in high-dimensional data analytics workloads.
1. Algorithmic Architecture and Tiling Strategies
FaSTED is architected around multi-level hierarchical tiling to maximize memory reuse and meet the throughput requirements of FP16-32 tensor cores on modern GPUs (e.g., NVIDIA A100). At the top level, the n × n Euclidean distance matrix for n points in d dimensions is decomposed into block tiles, each processed by a group of warps mapped onto a streaming multiprocessor (SM).
Within each tile:
- Data is asynchronously loaded from global memory to shared memory using `cuda::memcpy_async`, arranged in a two-stage pipeline. This bypasses the L1 cache and reduces latency while overlapping memory transfers with computation.
- Shared memory is further partitioned into warp tiles. Each warp tile is laid out so that data fragments (sub-matrices) are loaded into registers with bank-conflict–free addressing, achieved by address swizzling (XOR on bank indices) and compatible with efficient `ldmatrix` instructions for MMA.
- Within warp tiles, fragments (typically 16×16 for FP16 input) are reused up to 98 times from global memory and 35 times from shared memory to saturate the available tensor core throughput (reuse factors derived analytically in Box #1 for global memory, with similar calculations for shared memory).
This hierarchical and swizzled tiling is implemented using explicit PTX mma instructions, and fragments are iteratively multiplied and accumulated across all dimensions in FP16-32 MMA operations. The algorithm slices the distance calculation into segments (each MMA covering 16 dimensions), accumulating partial results in FP32 registers.
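The dimension-sliced accumulation can be illustrated in plain numpy (a host-side sketch, not the CUDA/PTX implementation; the function name `tiled_inner_products` is illustrative). Each pass over a 16-wide segment stands in for one MMA, with partial results accumulated in FP32:

```python
import numpy as np

def tiled_inner_products(A, B, seg=16):
    """Accumulate A @ B.T over 16-dimension segments, mirroring how
    each MMA covers 16 dimensions while partial results accumulate
    in FP32 registers."""
    n, d = A.shape
    m = B.shape[0]
    acc = np.zeros((n, m), dtype=np.float32)
    for k in range(0, d, seg):
        # One "fragment" pass over dimensions k .. k+seg.
        acc += A[:, k:k + seg].astype(np.float32) @ \
               B[:, k:k + seg].astype(np.float32).T
    return acc
```

On the GPU, the same per-segment accumulation is carried out by explicit `mma` instructions on register-resident fragments rather than a host loop.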
2. Mixed Precision Euclidean Distance Computation via Tensor Cores
FaSTED performs computations in FP16-32 mode: each operand of the matrix multiplication is FP16, and accumulation is FP32. For two points $a, b \in \mathbb{R}^d$, the squared Euclidean distance expands as

$\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\langle a, b \rangle.$

FaSTED computes the inner products $\langle a, b \rangle$ (the last term) in FP16-32 MMA fragments, with the norm sums $\|a\|^2$ and $\|b\|^2$ precomputed in FP32. Final squared distances are accumulated as $d^2(a, b) = \|a\|^2 + \|b\|^2 - 2c$, where $c$ is the MMA-accumulated inner product and the norms are the precomputed sums.
This structure is highly favorable for tensor core hardware, as the MMA step matches the primitive operation optimized at the circuit level.
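A numerical sketch of the norm expansion described above, using numpy as a stand-in for tensor core MMA (the function name is illustrative): inputs to the inner-product term are rounded to FP16 while products and norms are kept in FP32, mimicking FP16-32 mode.

```python
import numpy as np

def fasted_style_distances(A, B):
    """Squared Euclidean distances via the expansion
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b>.
    The inner-product operands are rounded to FP16 and products are
    accumulated in FP32, mimicking FP16-32 tensor core MMA."""
    # Norm sums precomputed in FP32, as in FaSTED.
    a2 = np.sum(A.astype(np.float32) ** 2, axis=1)
    b2 = np.sum(B.astype(np.float32) ** 2, axis=1)
    # FP16 operands; FP32 accumulation (FP16 products are exact in FP32).
    A16 = A.astype(np.float16).astype(np.float32)
    B16 = B.astype(np.float16).astype(np.float32)
    G = A16 @ B16.T
    return a2[:, None] + b2[None, :] - 2.0 * G
```

Note that only the inner-product term carries FP16 rounding error; the norm terms retain full FP32 precision, which is part of why the overall error stays small.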
3. Performance Outcomes and Throughput Characteristics
On four real-world high-dimensional datasets (128–960 dimensions, up to 10 million points), FaSTED’s brute-force approach yields speedups of up to 51× over state-of-the-art FP64 tensor core (TED-Join) and FP32 CUDA core (GDS-Join, MiSTIC) algorithms. Key factors include:
- Full brute-force evaluation (all $n^2$ pairwise distances) with deterministic and regular memory access patterns, which are highly suited to tensor core parallelization.
- Asynchronous memory transfers (global to shared) and explicit scheduling of MMA operations minimize both data starvation and register pressure.
- The highest throughput (up to 312 TFLOP/s at peak for FP16-32 on an SXM A100) is achieved when the algorithm reorganizes and reuses all data fragments many times, which the multi-level tiling and swizzling pipeline facilitates.
Even without the aid of an index (which would prune unnecessary comparisons), FaSTED outperforms indexing-based methods at high selectivity regimes due to the superior raw throughput of the tensor core path.
4. Accuracy Guarantees and Precision Loss Analysis
FaSTED quantifies its mixed-precision accuracy loss against FP64 baselines. Two metrics are employed:
- Overlap of nearest neighbor sets: measured overlap with the FP64 neighbor sets stays above 99.94% (i.e., under 0.06% loss) across the selectivity levels tested.
- Error in computed distances: mean absolute errors are small, with narrow, centered error distributions, as illustrated in Figure 1 of (Curless et al., 28 Aug 2025).
This precision loss remains negligible only when the input data lie within the FP16 dynamic range. For typical data analytics tasks and similarity searches, the loss is considered acceptable; for scientific applications requiring strict numerical guarantees, iterative refinement or other techniques may be required at the cost of throughput.
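The neighbor-set overlap metric described above can be sketched as follows (an illustrative helper, not the paper's evaluation code; `neighbor_overlap` is an assumed name):

```python
import numpy as np

def neighbor_overlap(D_ref, D_test, k=10):
    """Mean fraction of k-nearest-neighbor indices shared between a
    reference (e.g., FP64) distance matrix and a mixed-precision one:
    1.0 means the neighbor sets agree exactly."""
    ref = np.argsort(D_ref, axis=1)[:, :k]
    test = np.argsort(D_test, axis=1)[:, :k]
    hits = [len(set(r) & set(t)) for r, t in zip(ref, test)]
    return float(np.mean(hits)) / k
```

Ties near the k-th neighbor can flip set membership under tiny perturbations, which is why the reported overlap dips very slightly below 100% even when distance errors are minuscule.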
5. Integration with Indexing Structures and Methodological Context
While the brute force FaSTED implementation performs all comparisons, it can be integrated with external or internal indexes. For example, when used as a subroutine within a grid or tree index:
- Indexes can eliminate pairs guaranteed to be farther apart than the similarity threshold $\epsilon$.
- FaSTED’s regular data access pattern avoids the branching and load imbalance of traditional CUDA core indexing methods.
- The algorithm’s deterministic tiling is robust to thread divergence, which is a common problem in index-augmented kernels.
This suggests further synergy with indexing for applications where the average number of neighbors is small and brute force would be wasteful.
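As a minimal stand-in for the grid or tree indexes discussed above, a pruning rule can discard candidate pairs before the brute-force kernel runs. The sketch below uses the reverse triangle inequality on precomputed norms (an illustrative technique, not FaSTED's index; the function name is assumed):

```python
import numpy as np

def norm_pruned_pairs(A, B, eps):
    """Prune candidate pairs using | ||a|| - ||b|| | <= ||a - b||:
    if the norm gap already exceeds eps, the pair cannot lie within
    the similarity threshold, so the distance kernel can skip it."""
    na = np.linalg.norm(A, axis=1)
    nb = np.linalg.norm(B, axis=1)
    gap = np.abs(na[:, None] - nb[None, :])
    return np.argwhere(gap <= eps)  # surviving candidate pairs
```

The bound is conservative: it never discards a true within-threshold pair, so a brute-force pass over the survivors still returns the exact result.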
6. Applicability, Benchmark Datasets, and Dimensional Scalability
FaSTED finds application in similarity search, k-nearest neighbors, clustering, image retrieval, outlier detection, and unsupervised representation learning. Benchmark datasets employed include:
- Sift10M: 10,000,000 points, 128 dimensions
- Tiny5M: 5,000,000 points, 384 dimensions
- Cifar60K: 60,000 points, 512 dimensions
- Gist1M: 1,000,000 points, 960 dimensions
Performance scales favorably with dimensionality and saturates tensor core capacity at the higher dimensionalities tested; lower dimensions may require zero-padding for full fragment occupancy, a minor inefficiency.
7. Limitations and Prospective Improvements
FaSTED’s performance depends on several factors:
- Quadratic Complexity: in pure brute-force mode, it computes all $n^2$ pairwise distances; indexing integration mitigates this in sparse-neighbor regimes.
- Precision Sensitivity: Datasets outside FP16 dynamic range may experience higher rounding errors or require input renormalization.
- Platform Constraints: Power-throttling on PCIe A100 GPUs can reduce achievable throughput compared to SXM A100; hardware-aware adaptation can improve results.
- Padding Overhead: for dimensionalities or tile extents not divisible by 16 or 64, zero-padding incurs slight overhead.
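Zero-padding is benign for correctness because padded dimensions contribute nothing to any distance. A minimal sketch (helper name assumed):

```python
import numpy as np

def pad_dims(X, multiple=16):
    """Zero-pad the dimensionality up to the next multiple of the
    MMA fragment width. Padded zeros leave all pairwise Euclidean
    distances unchanged; the only cost is wasted fragment slots."""
    n, d = X.shape
    pad = (-d) % multiple
    return np.pad(X, ((0, 0), (0, pad)))
```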
Future directions proposed include direct incorporation of indexing structures within FaSTED, normalization/scaling for FP16-sensitive datasets, and transferring the architecture to updated hardware or custom ASICs for even higher throughput.
FaSTED exemplifies a shift in GPU algorithm design, moving from generic vectorized summations toward fully staged, fragment-based MMA computation, leveraging the unique properties of tensor cores for data analytics workloads. Its methodology is highly relevant for high-dimensional, large-scale settings requiring extreme throughput and where minor loss of precision is tolerable. The algorithm’s acceleration and engineering practices set a new standard for Euclidean distance calculations on modern hardware accelerators (Curless et al., 28 Aug 2025).