KeOps: GPU-Accelerated Kernel Computation
- KeOps is a GPU-accelerated library that efficiently computes large-scale kernel and distance matrices without materializing dense arrays.
- It supports automatic differentiation and a flexible domain-specific language for formulating custom kernels, with bindings for NumPy, PyTorch, Matlab, and R.
- KeOps achieves significant performance improvements by compiling tailored CUDA kernels and employing tiled streaming to manage memory effectively.
The KeOps library is a GPU-accelerated framework for the efficient computation of kernel and distance matrices whose entries are defined by mathematical formulas, with support for automatic differentiation and minimal memory footprint. Designed to address the prohibitive memory scaling of conventional tensor-based approaches, KeOps enables large-scale kernel computations (typical in Gaussian processes, kernel density estimation, shape matching, and optimal transport) by avoiding the explicit materialization of dense matrices and streaming operations directly on the GPU (Charlier et al., 2020).
1. Motivation and Core Objectives
Kernel methods frequently require manipulation of pairwise matrices of the form

$$K_{ij} = k(x_i, y_j), \qquad 1 \le i \le M,\ 1 \le j \le N,$$

for points $x_i, y_j \in \mathbb{R}^D$ and a kernel function $k$. When $M, N \sim 10^5$–$10^6$, the memory required to store $(K_{ij})$ or even stream it to GPU DRAM constitutes a significant bottleneck.
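A quick back-of-the-envelope calculation (a sketch assuming float32 storage, 4 bytes per entry) makes the scaling concrete:

```python
def dense_matrix_bytes(M, N, bytes_per_entry=4):
    """Memory needed to materialize a dense M x N kernel matrix."""
    return M * N * bytes_per_entry

# At M = N = 10^5, the dense matrix already needs 4e10 bytes (~40 GB),
# beyond any single GPU's DRAM; at M = N = 10^6 it needs ~4 TB.
gb_at_1e5 = dense_matrix_bytes(10**5, 10**5) / 1e9   # 40.0 GB
tb_at_1e6 = dense_matrix_bytes(10**6, 10**6) / 1e12  # 4.0 TB
```

This is why the matrix must be defined by a formula and reduced on the fly rather than stored.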
KeOps targets two fundamental limitations of mainstream tensor libraries:
- Formula-defined tensors: Entries are specified by user-supplied formulas $F(x_i, y_j)$, avoiding allocation of full $M \times N$ arrays.
- Automatic differentiation (autodiff): Supports PyTorch-style backpropagation through large reductions without ever allocating $M \times N$ arrays.
Design principles include lazy symbolic representation, tiled Map-Reduce/streaming on the GPU (processing blocks that fit in fast memory), and compile-time code generation for formula- and reduction-specific CUDA kernels (Charlier et al., 2020).
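The tiled Map-Reduce principle can be illustrated with a pure-Python sketch (not the KeOps implementation; 1-D points and a Gaussian kernel are chosen for brevity): a kernel matrix–vector product is accumulated tile by tile, so only one block of kernel values exists at any time.

```python
import math

def gaussian_matvec_tiled(x, y, b, sigma=1.0, tile=64):
    """Compute a_i = sum_j exp(-(x_i - y_j)^2 / (2 sigma^2)) * b_j
    over 1-D points, streaming over tiles of j so the full M x N
    kernel matrix is never materialized."""
    a = [0.0] * len(x)
    for j0 in range(0, len(y), tile):       # stream over tiles of j
        y_blk = y[j0:j0 + tile]             # analogous to a shared-memory load
        b_blk = b[j0:j0 + tile]
        for i, xi in enumerate(x):          # accumulate partial reductions
            a[i] += sum(math.exp(-(xi - yj) ** 2 / (2 * sigma ** 2)) * bj
                        for yj, bj in zip(y_blk, b_blk))
    return a
```

On the GPU, each tile corresponds to a block of $y_j$ values loaded into fast shared memory while one thread per index $i$ accumulates its running sum in registers.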
2. Mathematical Foundation
KeOps supports any differentiable formula $F(x_i, y_j)$ for defining matrix entries, with canonical examples including:
- Distance matrix: $D_{ij} = \|x_i - y_j\|^2$
- Gaussian kernel: $K_{ij} = \exp\!\left(-\|x_i - y_j\|^2 / (2\sigma^2)\right)$
- Laplacian kernel: $K_{ij} = \exp\!\left(-\|x_i - y_j\| / \lambda\right)$
The library natively supports automatic differentiation through such reductions. For the Gaussian kernel, the partial derivative with respect to $x_i$ is:

$$\frac{\partial K_{ij}}{\partial x_i} = -\frac{x_i - y_j}{\sigma^2}\, K_{ij}.$$

For a matrix–vector product $a_i = \sum_j K_{ij} b_j$, the backward pass with respect to $x_i$ yields:

$$\frac{\partial \mathcal{L}}{\partial x_i} = \sum_j g_i\, \frac{\partial K_{ij}}{\partial x_i}\, b_j,$$

where $g_i = \partial \mathcal{L} / \partial a_i$ is the incoming gradient. KeOps generates the required chain rules at compile time for seamless backpropagation (Charlier et al., 2020).
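The chain rule above can be sanity-checked numerically. The sketch below (pure Python, 1-D points, scalar $b_j$; the helper names are made up for illustration) compares the analytic gradient of $\mathcal{L} = \sum_i a_i$ against central finite differences:

```python
import math

def loss(x, y, b, sigma):
    """L = sum_i a_i with a_i = sum_j exp(-(x_i - y_j)^2 / (2 sigma^2)) b_j."""
    return sum(math.exp(-(xi - yj) ** 2 / (2 * sigma ** 2)) * bj
               for xi in x for yj, bj in zip(y, b))

def grad_x(x, y, b, sigma):
    """Analytic gradient dL/dx_i = sum_j -(x_i - y_j)/sigma^2 * K_ij * b_j;
    the incoming gradient g_i = dL/da_i equals 1 for this loss."""
    return [sum(-(xi - yj) / sigma ** 2
                * math.exp(-(xi - yj) ** 2 / (2 * sigma ** 2)) * bj
                for yj, bj in zip(y, b))
            for xi in x]
```

Perturbing each $x_i$ by $\pm\varepsilon$ and differencing the loss reproduces `grad_x` to numerical precision, confirming the formula term by term.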
3. System Architecture and Implementation
KeOps employs a threefold strategy for large-scale kernel computation:
- Lazy Evaluation with Symbolic Formulas: On the Python side, `LazyTensor` objects maintain only formal symbols for tensor indices and a formula DAG for $F(x_i, y_j)$. No computation or memory allocation for the full matrix occurs until an explicit reduction (such as `.sum()`, the matrix product `@`, or `.logsumexp()`) is invoked.
- Tiled Streaming Avoiding $O(MN)$ Memory: The reduction is implemented by iterating over tiles of index pairs $(i, j)$: subarrays of $x$ and $y$ are loaded into registers/shared memory, the partial result is computed for each block and accumulated, without ever allocating the full $M \times N$ matrix.
- Compile-time CUDA Kernel Generation: Upon invocation, KeOps translates the formula DAG into a specialized C++/CUDA template instantiation (e.g., `Exp<Minus<SqDist<X,Y>>>`) and compiles the numeric kernel with `nvcc`, adapting block sizes to the hardware. Binaries are cached, so subsequent calls with the same signature incur no recompilation.
This architecture yields hardware-level efficiency and scalability (Charlier et al., 2020).
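The lazy-evaluation idea can be mimicked in a few lines of pure Python. The toy class below (an illustrative sketch with made-up names, not the actual `LazyTensor` internals) records the entry formula and the data, and evaluates kernel entries only when a reduction is requested:

```python
import math

class SymbolicKernel:
    """Toy stand-in for a lazy symbolic tensor: stores the entry formula
    f(x_i, y_j) and the data, and evaluates only on an explicit reduction."""
    def __init__(self, x, y, formula):
        self.x, self.y, self.formula = x, y, formula  # no M x N allocation here

    def matvec(self, b):
        # The reduction triggers the actual computation, one row at a time.
        return [sum(self.formula(xi, yj) * bj for yj, bj in zip(self.y, b))
                for xi in self.x]

K = SymbolicKernel([0.0, 1.0], [0.0, 2.0],
                   formula=lambda xi, yj: math.exp(-(xi - yj) ** 2))
a = K.matvec([1.0, 1.0])  # only now are kernel entries evaluated
```

KeOps applies the same principle, except that the formula is a DAG compiled to a CUDA kernel rather than a Python lambda evaluated entry by entry.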
4. Language Bindings and User APIs
KeOps offers bindings and APIs for multiple high-level languages:
- Python and PyTorch: The `pykeops.torch` module enables lazy symbolic tensors with tight integration with PyTorch data structures; backpropagation through reductions is supported out of the box.
- Matlab: The `mexKeOps` interface provides symbolic reduction capabilities.
- GNU R: R users install via CRAN (`install.packages("rkeops")`), exposing an API mirroring the NumPy/PyTorch syntax.
The primary user interaction paradigm involves constructing LazyTensor objects representing and , expressing formulas symbolically, then triggering reduction via operators such as @, .sum(), or .logsumexp(). Full documentation and example galleries are available at https://www.kernel-operations.io (Charlier et al., 2020).
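Reductions such as `.logsumexp()` are numerically delicate when streamed in tiles. The sketch below shows the standard online running-max log-sum-exp trick in plain Python (the textbook technique a tiled reduction of this kind relies on, not KeOps' exact code):

```python
import math

def streaming_logsumexp(values, tile=64):
    """Compute log(sum_j exp(v_j)) over a stream of tiles, keeping a
    running maximum m and a rescaled accumulator s for stability."""
    m, s = float("-inf"), 0.0
    for j0 in range(0, len(values), tile):
        blk = values[j0:j0 + tile]
        m_new = max(m, max(blk))
        # Rescale the old accumulator to the new maximum, then add the tile.
        s = s * math.exp(m - m_new) + sum(math.exp(v - m_new) for v in blk)
        m = m_new
    return m + math.log(s)
```

Because every exponential is taken relative to the running maximum, the accumulator never overflows even when the raw values (e.g., negative squared distances scaled by a small bandwidth) are in the hundreds or thousands.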
5. Performance Analysis
KeOps achieves substantial performance gains compared to standard frameworks, both in compute time and memory overhead. Benchmarks conducted on an NVIDIA RTX 2080 Ti (float32, $M = N$) report the following timings for a Gaussian kernel matrix–vector product:

| $N$ | PyTorch | TF-XLA | Halide | TVM | PyKeOps | KeOps |
|---|---|---|---|---|---|---|
| $10^4$ | 9 ms | 13 ms | 1.0 ms | 3.8 ms | 0.7 ms | 0.4 ms |
| $10^5$ | OOM | 89 ms | 34.1 ms | 36.8 ms | 15.0 ms | 14.6 ms |
| $10^6$ | OOM | OOM | 3.8 s | 2.79 s | 1.39 s | 1.38 s |
KeOps maintains $O(M + N)$ memory usage and outperforms general-purpose frameworks by up to two orders of magnitude at large scales ($N \sim 10^6$) (Charlier et al., 2020).
6. Practical Usage and Customization
A canonical workflow in Python involves declaring lazy tensors, formulating a symbolic kernel (e.g., Gaussian or Laplacian), and triggering reductions, with gradients propagating automatically:
```python
from pykeops.torch import LazyTensor

x_i = LazyTensor(x[:, None, :])  # (M, 1, D) symbolic variable
y_j = LazyTensor(y[None, :, :])  # (1, N, D) symbolic variable
D_ij = ((x_i - y_j) ** 2).sum(dim=2)      # symbolic (M, N) squared distances
K_ij = (-D_ij / (2 * sigma ** 2)).exp()   # symbolic Gaussian kernel matrix
a = K_ij @ b         # reduction: the actual GPU computation happens here
loss = a.sum()
loss.backward()      # computes ∂loss/∂x, ∂loss/∂y, ∂loss/∂b via KeOps
```
Customization is achieved by modifying the symbolic formula (e.g., a Laplacian kernel: `K_ij = (-D_ij.sqrt() / lambda_).exp()`), using batch dimensions, or integrating KeOps into `torch.nn.Module` pipelines.
7. Availability and Resources
KeOps is distributed via multiple standard repositories:
- Python: Install via pip (`pip install pykeops`), supporting both NumPy and PyTorch.
- R: Available via CRAN (`install.packages("rkeops")`).
- Matlab: Access through the `mexKeOps` binder.
Comprehensive documentation, practical tutorials (e.g., kernel density estimation, Gaussian processes, Sinkhorn divergences), and source code are accessible at https://www.kernel-operations.io and https://github.com/getkeops/keops (Charlier et al., 2020).
KeOps thus provides a transparent, auto-differentiable, GPU-accelerated computational toolset for large-scale quadratic kernel operations on commodity hardware, enabling research and applications previously limited by memory and performance constraints.