
KeOps: GPU-Accelerated Kernel Computation

Updated 10 February 2026
  • KeOps is a GPU-accelerated library that efficiently computes large-scale kernel and distance matrices without materializing dense arrays.
  • It supports automatic differentiation and a flexible domain-specific language for formulating custom kernels across PyTorch, Matlab, and R.
  • KeOps achieves significant performance improvements by compiling tailored CUDA kernels and employing tiled streaming to manage memory effectively.

The KeOps library is a GPU-accelerated framework for the efficient computation of kernel and distance matrices whose entries are defined by mathematical formulas, with support for automatic differentiation and minimal memory footprint. Designed to address the prohibitive memory scaling of conventional tensor-based approaches, KeOps enables large-scale kernel computations (typical in Gaussian processes, kernel density estimation, shape matching, and optimal transport) by avoiding the explicit materialization of dense matrices and streaming operations directly on the GPU (Charlier et al., 2020).

1. Motivation and Core Objectives

Kernel methods frequently require manipulation of pairwise matrices of the form

$$K = (K_{ij})_{1 \leq i \leq M,\, 1 \leq j \leq N}, \qquad K_{ij} = k(x_i, y_j)$$

for $x_i, y_j \in \mathbb{R}^d$ and a kernel function $k$. When $M, N \sim 10^4$–$10^6$, the $\mathcal{O}(N^2)$ memory required to store or even stream $K$ to GPU DRAM constitutes a significant bottleneck.
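To make the scaling concrete, a quick back-of-the-envelope computation (plain Python; the helper name and sizes are illustrative) shows why a dense float32 kernel matrix cannot be materialized at these scales:

```python
# Memory needed to materialize a dense M x N float32 kernel matrix.
def dense_kernel_bytes(M, N, dtype_bytes=4):
    return M * N * dtype_bytes

for n in (10**4, 10**5, 10**6):
    gib = dense_kernel_bytes(n, n) / 2**30
    print(f"N = {n:>9,}: {gib:>9,.1f} GiB")
```

At $N = 10^6$ the dense matrix alone would require roughly 3.6 TiB, far beyond any GPU's DRAM, which is why KeOps streams tiles instead of allocating $K$.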

KeOps targets two fundamental limitations of mainstream tensor libraries:

  • Formula-defined tensors: Entries are specified by user-supplied formulas $F(x_i, y_j)$, avoiding allocation of full arrays.
  • Automatic differentiation (autodiff): Supports PyTorch-style backpropagation through large reductions without ever allocating $\mathcal{O}(N^2)$ arrays.

Design principles include lazy symbolic representation, tiled Map-Reduce/streaming on the GPU (processing blocks that fit in fast memory), and compile-time code generation for formula- and reduction-specific CUDA kernels (Charlier et al., 2020).

2. Mathematical Foundation

KeOps supports any differentiable formula $F(x, y)$ for defining matrix entries, with canonical examples including:

  • Distance matrix: $D_{ij} = \|x_i - y_j\|^2$
  • Gaussian kernel:

$$k_{\rm G}(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$

  • Laplacian kernel:

$$k_{\rm L}(x, y) = \exp\left(-\frac{\|x - y\|}{\lambda}\right)$$

The library natively supports automatic differentiation through such reductions. For the Gaussian kernel, the partial derivative with respect to $x$ is:

$$\frac{\partial k_{\rm G}}{\partial x}(x, y) = -\frac{x - y}{\sigma^2}\, k_{\rm G}(x, y)$$

For a matrix–vector product $a_i = \sum_j K_{ij} b_j$, the backward pass with respect to $x_i$ yields:

$$\frac{\partial \mathcal{L}}{\partial x_i} = \sum_{j=1}^N \frac{\partial k(x_i, y_j)}{\partial x_i}\, b_j\, g_i$$

where $g_i = \partial \mathcal{L} / \partial a_i$. KeOps generates the required chain rules at compile time, enabling seamless backpropagation (Charlier et al., 2020).
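This chain rule can be verified numerically with a dense NumPy reference. The dimensions, random data, and the scalar loss $\mathcal{L} = \sum_i g_i a_i$ below are assumptions for illustration; KeOps itself never builds the dense matrix:

```python
import numpy as np

# Dense NumPy check of the backward formula for the Gaussian kernel.
rng = np.random.default_rng(0)
M, N, d, sigma = 5, 7, 3, 0.8
x = rng.normal(size=(M, d))      # query points
y = rng.normal(size=(N, d))      # source points
b = rng.normal(size=N)           # signal
g = rng.normal(size=M)           # g_i = dL/da_i for the scalar loss L = g . a

def loss(x):
    D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D / (2 * sigma**2))
    return g @ (K @ b)           # L = sum_i g_i a_i

# Analytic gradient: dL/dx_i = sum_j [-(x_i - y_j)/sigma^2 * k(x_i, y_j)] b_j g_i
diff = x[:, None, :] - y[None, :, :]                  # (M, N, d)
K = np.exp(-(diff**2).sum(-1) / (2 * sigma**2))       # (M, N)
grad = ((-diff / sigma**2) * (K * b)[:, :, None]).sum(1) * g[:, None]

# Central finite differences for comparison
eps = 1e-6
fd = np.zeros_like(x)
for i in range(M):
    for k in range(d):
        e = np.zeros_like(x)
        e[i, k] = eps
        fd[i, k] = (loss(x + e) - loss(x - e)) / (2 * eps)
```

The analytic gradient and the finite-difference estimate agree to within floating-point tolerance, matching the backward formula above.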

3. System Architecture and Implementation

KeOps employs a threefold strategy for large-scale kernel computation:

  • Lazy Evaluation with Symbolic Formulas: On the Python side, LazyTensor objects maintain only formal symbols for the tensor indices and a formula DAG for $F$. No computation or memory allocation for the full matrix occurs until an explicit reduction (such as .sum(), the matrix product @, or .logsumexp()) is invoked.
  • Tiled Streaming Avoiding $\mathcal{O}(N^2)$ Memory: The reduction $a_i = \sum_j F(x_i, y_j)$ is implemented by iterating over tiles $(i_0\!:\!i_0+T,\ j_0\!:\!j_0+T)$: subarrays of $x$ and $y$ are loaded into registers and shared memory, and the partial result for each block is computed and accumulated, without ever allocating the full $M \times N$ matrix.
  • Compile-time CUDA Kernel Generation: Upon invocation, KeOps translates the formula DAG into a specialized C++/CUDA template instantiation (e.g., Exp<Minus<SqDist<X,Y>>>) and compiles the numeric kernel with nvcc, adapting the block size to the hardware. Binaries are cached, so subsequent calls with the same signature incur no compilation overhead.

This architecture yields hardware-level efficiency and scalability (Charlier et al., 2020).
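The tiled Map-Reduce scheme can be sketched in NumPy. This is a simplified CPU model of the strategy for a Gaussian kernel matrix–vector product; the function name, tile size, and shapes are illustrative, and the real implementation is generated CUDA:

```python
import numpy as np

def gaussian_matvec_tiled(x, y, b, sigma, tile=64):
    """a_i = sum_j exp(-|x_i - y_j|^2 / (2 sigma^2)) b_j, computed tile by
    tile: only one tile x tile block of the kernel matrix exists at a time."""
    M = x.shape[0]
    a = np.zeros((M, b.shape[1]))
    for i0 in range(0, M, tile):
        xi = x[i0:i0 + tile]                       # block of query points
        for j0 in range(0, y.shape[0], tile):
            yj = y[j0:j0 + tile]                   # block of source points
            D = ((xi[:, None, :] - yj[None, :, :]) ** 2).sum(-1)
            a[i0:i0 + tile] += np.exp(-D / (2 * sigma**2)) @ b[j0:j0 + tile]
    return a
```

The result is identical to the dense computation, but peak memory is governed by the tile size rather than by $M \times N$.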

4. Language Bindings and User APIs

KeOps offers bindings and APIs for multiple high-level languages:

  • Python and PyTorch: The pykeops.torch module enables lazy symbolic tensors with tight integration with PyTorch data structures. Backpropagation through reductions is supported out of the box.
  • Matlab: The mexKeOps interface provides symbolic reduction capabilities.
  • GNU R: R users install the RKeOps package via CRAN (install.packages("rkeops")), exposing an API mirroring the Python bindings.

The primary user interaction paradigm involves constructing LazyTensor objects representing xix_i and yjy_j, expressing formulas symbolically, then triggering reduction via operators such as @, .sum(), or .logsumexp(). Full documentation and example galleries are available at https://www.kernel-operations.io (Charlier et al., 2020).
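For the .logsumexp() reduction, numerical stability across a stream of tiles is typically obtained with a running-maximum rescaling. A minimal NumPy sketch of that online scheme follows (an illustration of the idea, with an assumed function name, not the KeOps CUDA code):

```python
import numpy as np

def streaming_logsumexp(blocks):
    """Log-sum-exp over a stream of value blocks, keeping only a running
    maximum m and a rescaled running sum s with sum(exp(v)) = s * exp(m)."""
    m, s = -np.inf, 0.0
    for v in blocks:
        m_new = max(m, float(v.max()))
        s = s * np.exp(m - m_new) + np.exp(v - m_new).sum()  # rescale, accumulate
        m = m_new
    return m + np.log(s)
```

Because each block is rescaled by the current maximum before exponentiation, the reduction never overflows even for large logits, while still visiting the data one tile at a time.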

5. Performance Analysis

KeOps achieves substantial performance gains over standard frameworks, in both compute time and memory overhead. Benchmarks conducted on an NVIDIA RTX 2080 Ti (float32, $d = 3$) report the following timings for a Gaussian kernel matrix–vector product:

  N       PyTorch   TF-XLA   Halide    TVM       PyKeOps   KeOps++
  10k     9 ms      13 ms    1.0 ms    3.8 ms    0.7 ms    0.4 ms
  100k    OOM       89 ms    34.1 ms   36.8 ms   15.0 ms   14.6 ms
  1M      OOM       OOM      3.8 s     2.79 s    1.39 s    1.38 s

KeOps maintains $\mathcal{O}(N)$ memory usage and outperforms general-purpose frameworks by up to two orders of magnitude at large scales ($N \gtrsim 10^4$) (Charlier et al., 2020).

6. Practical Usage and Customization

A canonical workflow in Python involves declaring lazy tensors, formulating a symbolic kernel (e.g., Gaussian or Laplacian), and triggering reductions, with gradients propagating automatically:

import torch
from pykeops.torch import LazyTensor

# Example data (shapes and values are illustrative)
x = torch.randn(1000, 3, requires_grad=True)    # (M, d) query points
y = torch.randn(2000, 3, requires_grad=True)    # (N, d) source points
b = torch.randn(2000, 1, requires_grad=True)    # (N, 1) signal
sigma = 0.5

x_i = LazyTensor(x[:, None, :])   # (M, 1, d) symbolic tensor
y_j = LazyTensor(y[None, :, :])   # (1, N, d) symbolic tensor
D_ij = ((x_i - y_j) ** 2).sum(dim=2)      # symbolic squared distances
K_ij = (-D_ij / (2 * sigma**2)).exp()     # symbolic Gaussian kernel
a = K_ij @ b                              # tiled reduction: (M, 1) result
loss = a.sum()
loss.backward()  # computes ∂loss/∂x, ∂loss/∂y, ∂loss/∂b via KeOps

Customization is achieved by modifying formulas (e.g., Laplacian kernel: K_ij = (-(D_ij.sqrt())/lambda_).exp()), using batch dimensions, or integrating KeOps into torch.nn.Module pipelines.

7. Availability and Resources

KeOps is distributed via multiple standard repositories:

  • Python: Install via pip (pip install pykeops), supporting both NumPy and PyTorch.
  • R: Available via CRAN (install.packages("rkeops")).
  • Matlab: Access through the mexKeOps binder.

Comprehensive documentation, practical tutorials (e.g., kernel density estimation, Gaussian processes, Sinkhorn divergences), and source code are accessible at https://www.kernel-operations.io and https://github.com/getkeops/keops (Charlier et al., 2020).

KeOps thus provides a transparent, auto-differentiable, GPU-accelerated computational toolset for large-scale quadratic kernel operations on commodity hardware, enabling research and applications previously limited by memory and performance constraints.

References

  • B. Charlier, J. Feydy, J. A. Glaunès, F.-D. Collin, and G. Durif (2020). Kernel Operations on the GPU, with Autodiff, without Memory Overflows. arXiv:2004.11127.
