
Locality-Aware Automatic Differentiation on the GPU for Mesh-Based Computations (2509.00406v1)

Published 30 Aug 2025 in cs.GR

Abstract: We present a high-performance system for automatic differentiation (AD) of functions defined on triangle meshes that exploits the inherent sparsity and locality of mesh-based energy functions to achieve fast gradient and Hessian computation on the GPU. Our system is designed around per-element forward-mode differentiation, enabling all local computations to remain in GPU registers or shared memory. Unlike reverse-mode approaches that construct and traverse global computation graphs, our method performs differentiation on the fly, minimizing memory traffic and avoiding global synchronization. Our programming model allows users to define local energy terms while the system handles parallel evaluation, derivative computation, and sparse Hessian assembly. We benchmark our system on a range of applications--cloth simulation, surface parameterization, mesh smoothing, and spherical manifold optimization. We achieve a geometric mean speedup of 6.2x over optimized PyTorch implementations for second-order derivatives, and 2.76x speedup for Hessian-vector products. For first-order derivatives, our system is 6.38x, 2.89x, and 1.98x faster than Warp, JAX, and Dr.JIT, respectively, while remaining on par with hand-written derivatives.

Summary

  • The paper introduces a forward-mode AD system that leverages GPU registers and shared memory to exploit mesh sparsity and locality.
  • It uses patch-based partitioning and operator overloading to optimize local derivative computations, achieving notable speedups in gradient and Hessian computation.
  • Benchmark results on simulations like cloth deformation and mesh parameterization demonstrate performance improvements ranging from 2.76× to 7.28× over conventional AD frameworks.

Locality-Aware Automatic Differentiation on the GPU for Mesh-Based Computations

Introduction and Motivation

This work introduces a GPU-centric automatic differentiation (AD) system tailored for mesh-based computations, exploiting the inherent locality and sparsity of mesh energy functions. Unlike conventional AD frameworks optimized for dense tensor operations and reverse-mode differentiation, this system leverages forward-mode AD, confining all derivative calculations to GPU registers and shared memory. The approach is motivated by the observation that mesh-based problems—ubiquitous in simulation, geometry processing, and scientific computing—feature shallow, sparsely connected computation graphs, making forward-mode AD more efficient for GPU execution.

Figure 1: Fast and flexible AD for mesh-based functions via locality-aware forward-mode AD on the GPU, outperforming state-of-the-art AD frameworks in gradient, Hessian, and Hessian-vector product computations.

AD is essential for optimization-driven tasks in mesh-based applications, but existing frameworks (PyTorch, JAX, Warp, Dr.JIT) are designed for dense, global computation graphs typical of machine learning workloads. These frameworks fail to exploit mesh sparsity, resulting in poor runtime and excessive memory usage. Prior mesh-specific AD tools either run on the CPU (TinyAD), lack higher-order derivatives (Opt, Thallo, Warp), or focus on symbolic simplification (Herholz et al.). The presented system builds on the mesh-centric, element-wise differentiation paradigm but is explicitly designed for GPU execution, optimizing for memory locality and parallel throughput.

Figure 2: Dense connectivity in neural networks favors reverse-mode AD (left), while mesh-based problems yield sparse, local computation graphs where forward-mode AD is more efficient (right).

System Architecture and Programming Model

Programming Model

The user defines local energy terms over mesh elements (vertices, edges, faces) using a declarative interface. The system automatically applies these terms in parallel across the mesh, handling derivative tracking, memory management, and parallel execution. Active variables (subject to differentiation) and passive variables (constants) are mixed freely, and the system assembles global objective, gradient, and sparse Hessian structures from local contributions.
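
As a rough illustration of this programming model, the sketch below shows what a user-supplied per-edge energy term could look like. The identifiers (add_edge_term, EdgeHandle, ActiveVec3, rest_length, stiffness) are hypothetical stand-ins, not the system's actual API; the point is only that the user writes a local energy while the framework handles mapping, differentiation, and assembly.

    // Hypothetical sketch only: names and signatures are illustrative, not the real API.
    // The user supplies the local energy; the system evaluates it per edge in parallel,
    // differentiates it in forward mode, and assembles the gradient and sparse Hessian.
    solver.add_edge_term([=] __device__ (EdgeHandle e, ActiveVec3 x0, ActiveVec3 x1) {
        auto d       = x0 - x1;                       // active (differentiated) positions
        auto len     = sqrt(dot(d, d));
        auto stretch = len - rest_length[e];          // rest_length is a passive constant
        return 0.5f * stiffness * stretch * stretch;  // local elastic energy
    });

Everything inside the lambda is per-element, which is what allows the system to keep evaluation and differentiation in registers and shared memory.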

Internal Design

  • Patch-Based Partitioning: The mesh is divided into small patches that fit in shared memory, enabling efficient local computation and minimizing global memory accesses.
  • Forward-Mode AD via Operator Overloading: Each CUDA thread block loads patch data once and performs all subsequent computations in registers/shared memory. The custom ActiveT type tracks primal and tangent values for local variables, propagating derivatives via overloaded arithmetic operations; a simplified, self-contained sketch of this pattern follows the list below.
  • Sparse Hessian Assembly: The system precomputes the sparsity pattern from mesh topology and allocates the global Hessian in CSR format. Local derivatives are accumulated using atomic operations, which are efficient due to low contention in mesh workloads.
  • Matrix-Free Hessian-Vector Products: For optimization routines requiring only Hessian-vector products, the system computes these implicitly, avoiding explicit Hessian construction and reducing memory usage.

    Figure 3: Per-element derivative computation and assembly: local gradients and Hessians are computed in registers/shared memory and mapped to global indices for accumulation.
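
To make the forward-mode/operator-overloading pattern concrete, here is a minimal, self-contained CUDA sketch assuming a mass-spring edge energy. It uses a plain scalar dual number (one tangent per thread) rather than the system's ActiveT, which carries a full local tangent block; local Hessian entries would be scattered into the precomputed CSR pattern with the same atomic pattern used for the gradient below.

    #include <cuda_runtime.h>
    #include <math.h>

    // Scalar dual number: primal value v and one tangent component d.
    struct Dual {
        float v, d;
        __device__ Dual(float v_ = 0.f, float d_ = 0.f) : v(v_), d(d_) {}
    };
    __device__ Dual operator+(Dual a, Dual b) { return Dual(a.v + b.v, a.d + b.d); }
    __device__ Dual operator-(Dual a, Dual b) { return Dual(a.v - b.v, a.d - b.d); }
    __device__ Dual operator*(Dual a, Dual b) { return Dual(a.v * b.v, a.d * b.v + a.v * b.d); }
    __device__ Dual sqrtD(Dual a) { float s = sqrtf(a.v); return Dual(s, a.d / (2.f * s)); }

    // One thread per (edge, endpoint, coordinate): seed that coordinate's tangent,
    // evaluate the local spring energy entirely in registers, scatter dE/dx atomically.
    __global__ void springGradient(const int2* edges, const float3* x, const float* rest,
                                   float k, int nEdges, float* grad)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= 6 * nEdges) return;
        int e = t / 6, local = t % 6;                 // local = endpoint * 3 + axis
        int2 ed = edges[e];
        float p[6] = { x[ed.x].x, x[ed.x].y, x[ed.x].z,
                       x[ed.y].x, x[ed.y].y, x[ed.y].z };
        Dual q[6];
        for (int i = 0; i < 6; ++i) q[i] = Dual(p[i], i == local ? 1.f : 0.f);

        Dual dx = q[0] - q[3], dy = q[1] - q[4], dz = q[2] - q[5];
        Dual len = sqrtD(dx * dx + dy * dy + dz * dz);
        Dual s   = len - Dual(rest[e]);
        Dual E   = Dual(0.5f * k) * s * s;            // E = k/2 * (|xi - xj| - L)^2

        int vtx  = (local < 3) ? ed.x : ed.y;
        atomicAdd(&grad[3 * vtx + (local % 3)], E.d); // low-contention atomic scatter
    }

In the actual system, ActiveT presumably carries all local tangents at once so each element is evaluated a single time per patch; the sketch trades that efficiency for brevity.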

Applications and Performance Evaluation

Cloth Simulation (Mass-Spring Model)

The system is benchmarked on classical mass-spring cloth simulation, where each mesh edge acts as a spring. The energy function combines inertial, elastic, and gravitational terms, optimized via Newton's method. Compared to PyTorch (reverse-mode, dense Hessians) and IndexedSum (sparse Hessian assembly), the presented system achieves 6.2× speedup over IndexedSum and is orders of magnitude faster than PyTorch, which suffers from out-of-memory errors on large meshes.
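
For concreteness, a typical incremental-potential form of such an objective, minimized once per time step of size h, is (the paper's exact terms and constants may differ):

    E(x) = \frac{1}{2h^2}\,\|x - (x^n + h\,v^n)\|_M^2
         + \sum_{(i,j)\in \mathcal{E}} \frac{k_{ij}}{2}\,\big(\|x_i - x_j\| - L_{ij}\big)^2
         - g^\top M\,x

Here M is the lumped mass matrix, L_{ij} are rest lengths, and g is the gravitational acceleration; each Newton iteration then solves H \Delta x = -\nabla E(x) with the assembled sparse Hessian H.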

Mesh Parameterization (Symmetric Dirichlet Energy)

For mesh parameterization, the system minimizes face-based distortion using the symmetric Dirichlet energy. The optimization is performed using Newton's method with matrix-free conjugate gradient, relying on Hessian-vector products. Compared to PyTorch's double backward mechanism, the system achieves a geometric mean 2.76× speedup across mesh resolutions.

Figure 4: Symmetric Dirichlet energy minimization for mesh parameterization, showing progressive reduction in distortion and efficient matrix-free optimization.
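
The symmetric Dirichlet energy being minimized has the standard per-face form

    E(u) = \sum_{f \in F} A_f \left( \|J_f\|_F^2 + \|J_f^{-1}\|_F^2 \right)

where J_f is the 2×2 Jacobian of the map from triangle f's rest shape to its image in the parameter domain and A_f is the rest area. Matrix-free Newton-CG only needs products \nabla^2 E(u)\,v, which the system computes implicitly without ever forming the Hessian, as described in the matrix-free mode above.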

Manifold Optimization

Spherical parameterization of genus-0 meshes is implemented, combining a barrier function for injectivity and a stretch penalty. Optimization is performed in the tangent space of the sphere using L-BFGS. The system achieves 1.78–1.87× speedup over JAXopt, with differentiation remaining the dominant cost.

Figure 5: Spherical parameterization of a genus-0 mesh, showing input, initial mapping, intermediate optimization, and final parameterization with distortion/injectivity visualization.
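
For reference, a standard way to optimize in the tangent space of the unit sphere (the paper's exact retraction may differ) is to project the Euclidean gradient at each vertex position p and retract by normalization:

    \operatorname{grad} f(p) = (I - p\,p^\top)\,\nabla f(p), \qquad R_p(v) = \frac{p + v}{\|p + v\|}

L-BFGS then operates on these projected gradients, with the barrier and stretch terms differentiated by the same per-element machinery.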

Laplacian Smoothing

Gradient computation for Laplacian smoothing is benchmarked against PyTorch, JAX, Warp, Dr.JIT, TinyAD, and a manual baseline. The system delivers speedups of 7.28× over PyTorch, 6.38× over Warp, 2.89× over JAX, and 1.98× over Dr.JIT, with negligible overhead relative to the manual implementation. Atomic scatter for gradient accumulation is validated as efficient due to low write conflicts.

Figure 6: Scaling performance of the system versus GPU/CPU frameworks on Laplacian smoothing, demonstrating superior gradient computation throughput.
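
The atomic-scatter pattern validated here can be illustrated with a hand-written baseline for a uniform Laplacian (spring) energy E = 1/2 \sum_{(i,j)} \|x_i - x_j\|^2; this is an assumed form of the smoothing energy, and the paper's may differ.

    #include <cuda_runtime.h>

    // One thread per edge: each edge contributes (x_i - x_j) to vertex i's gradient
    // and its negation to vertex j's. Contention stays low because only a handful of
    // edges share any given vertex.
    __global__ void laplacianGradient(const int2* edges, const float3* x,
                                      int nEdges, float* grad)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nEdges) return;
        int i = edges[e].x, j = edges[e].y;
        float3 d = make_float3(x[i].x - x[j].x, x[i].y - x[j].y, x[i].z - x[j].z);
        atomicAdd(&grad[3 * i + 0],  d.x); atomicAdd(&grad[3 * i + 1],  d.y); atomicAdd(&grad[3 * i + 2],  d.z);
        atomicAdd(&grad[3 * j + 0], -d.x); atomicAdd(&grad[3 * j + 1], -d.y); atomicAdd(&grad[3 * j + 2], -d.z);
    }

An AD version replaces the hand-derived (x_i - x_j) with the tangent produced by dual-number evaluation of the edge energy, which is consistent with the reported negligible overhead relative to the manual baseline.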

Implementation Considerations

  • Computational Requirements: The system is designed for NVIDIA GPUs, leveraging CUDA and shared memory. Memory usage is minimized by confining local computations to registers/shared memory and preallocating global structures.
  • Limitations: The current design enforces strict locality (immediate neighbors) to predict Hessian sparsity; wider stencils (e.g., k-ring neighborhoods) and tetrahedral meshes are not yet supported.
  • Deployment: The system is suitable for large-scale mesh-based optimization, simulation, and geometry processing tasks where high throughput and memory efficiency are critical.

Implications and Future Directions

The presented locality-aware AD system demonstrates that mesh-based differentiation is fundamentally local and sparse, enabling substantial performance gains on GPU architectures. The approach shifts the bottleneck from differentiation to linear solvers, opening opportunities for further acceleration in mesh-based optimization pipelines. Future work includes support for tetrahedral meshes, relaxation of locality constraints for wider neighborhood operations, and integration with advanced linear algebra libraries (e.g., cuSPARSE, cuSOLVER).

Conclusion

This work establishes a new paradigm for automatic differentiation in mesh-based computations, aligning algorithmic structure with GPU memory hierarchy and exploiting problem sparsity. The system consistently outperforms state-of-the-art AD frameworks in gradient, Hessian, and Hessian-vector product computations across diverse applications. The locality-aware design is a significant step toward scalable, high-performance geometry processing and simulation on modern parallel hardware.
