DiffTaichi: Differentiable Physics Simulation
- DiffTaichi is a differentiable programming language designed for high-performance physical simulations with end-to-end gradient computation using a lightweight tape.
- It integrates Python AST transformation, JIT compilation, and megakernel fusion for efficient parallel execution on both CPUs and GPUs.
- Benchmarks show DiffTaichi outperforms traditional autodiff frameworks in speed and code brevity, making it valuable for research and advanced simulation tasks.
DiffTaichi is a differentiable programming language tailored for constructing high-performance differentiable physical simulators. Embedded as a statically-typed, just-in-time compiled language within Python, DiffTaichi provides end-to-end backpropagation capabilities through source code transformation, preserving arithmetic intensity and parallelism. Its lightweight tape mechanism records and replays simulation program structure efficiently, supporting gradient-based learning and optimization tasks on diverse physics engines. DiffTaichi matches hand-written CUDA performance and surpasses array-based autodiff frameworks such as TensorFlow in execution speed and code conciseness, as demonstrated across ten distinct physical simulation scenarios (Hu et al., 2019).
1. Language Architecture and Compilation Model
DiffTaichi is embedded in Python and utilizes a small Python AST transformer to capture all @ti.kernel, @ti.func, and data structure declarations, lowering these constructs into an intermediate Taichi IR. Programmers write simulation code in a direct, imperative style. This code is compiled down to fused "megakernels" for CPU and GPU, with automatic data layout and parallel-for optimizations. For example, a mass-spring force application kernel can be expressed as:
```python
@ti.kernel
def apply_spring_force(t: ti.i32):
    for i in range(n_springs):
        a, b = spring_a[i], spring_b[i]
        x_a, x_b = x[t - 1, a], x[t - 1, b]
        dist = x_a - x_b
        length = dist.norm() + 1e-4
        F = (length - rest_length[i]) * k * dist / length
        force[t, a] += -F
        force[t, b] += +F
```
The compiler preprocesses each kernel by flattening if-statements into select expressions, forward-substituting mutable local variables, and reducing kernel bodies to a (hierarchical) SSA form amenable to source-to-source reverse-mode autodiff.
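To illustrate the first of these steps, the sketch below contrasts a data-dependent branch with an equivalent flattened form using Taichi's ti.select; the kernel names and the tiny fields are made up for this example, and the real transformation is applied automatically in the compiler IR rather than written by the user.

```python
import taichi as ti

ti.init(arch=ti.cpu)

n = 8
x = ti.field(dtype=ti.f32, shape=n)
y = ti.field(dtype=ti.f32, shape=n)

@ti.kernel
def clamp_positive_if():
    for i in range(n):
        # Data-dependent branch as a user might write it.
        if x[i] > 0:
            y[i] = x[i]
        else:
            y[i] = 0.0

@ti.kernel
def clamp_positive_select():
    for i in range(n):
        # Flattened form: both candidate values are evaluated and one is selected,
        # leaving a straight-line body that is easy to differentiate.
        y[i] = ti.select(x[i] > 0, x[i], 0.0)
```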
2. Two-Scale Automatic Differentiation and Tape System
DiffTaichi adopts a two-scale automatic differentiation (AD) strategy. Within each kernel, a source code transformation (SCT) applies reverse-mode AD, yielding an adjoint kernel that retains the large-kernel, high-arithmetic-intensity structure. Across kernels, a minimal execution tape records only each kernel function and its scalar arguments during the forward pass; global buffer fields (e.g., the per-timestep x, v, and force arrays) act as checkpoints, obviating the need to store intermediate tensor values:
```python
class Tape:
    def __init__(self):
        self.entries = []  # (kernel_fn, args)

    def record(self, kernel_fn, *args):
        self.entries.append((kernel_fn, args))
        kernel_fn(*args)

    def backward(self):
        for kernel_fn, args in reversed(self.entries):
            kernel_fn.grad(*args)
```
The user wraps the forward pass in with ti.Tape(loss):; on exiting the context, the tape replays the recorded kernel launches in reverse order, triggering the adjoint computations.
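A minimal usage sketch follows, assuming the apply_spring_force kernel above together with hypothetical time_integrate and compute_loss kernels and a 0-D loss field declared with needs_grad=True:

```python
# Hypothetical driver code; apply_spring_force is the kernel shown earlier,
# time_integrate and compute_loss are assumed companion kernels.
loss = ti.field(dtype=ti.f32, shape=(), needs_grad=True)

with ti.Tape(loss):
    # Every kernel launched inside this context is recorded along with its scalar arguments.
    for t in range(1, steps):
        apply_spring_force(t)
        time_integrate(t)
    compute_loss(steps - 1)

# On exit, the tape seeds loss.grad and replays the recorded launches in reverse,
# invoking each kernel's generated adjoint; gradients such as rest_length.grad
# and x.grad are then available for optimization.
```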
3. Gradient Accumulation and Backpropagation Semantics
Every @ti.kernel call inside a with ti.Tape() context logs its invocation. Only kernel names and scalar arguments are retained; no high-dimensional field data is stored. When backward() is called, each recorded kernel is replayed via its generated grad function. Both primal and adjoint kernels accumulate into global buffers using atomic adds for thread safety. For example, in a mass–spring simulation using semi-implicit Euler, the per-spring force update

$$\mathbf{d} = \mathbf{x}_{t-1,a} - \mathbf{x}_{t-1,b}, \qquad \mathbf{F}_i = k\,(\lVert\mathbf{d}\rVert - l_i)\,\frac{\mathbf{d}}{\lVert\mathbf{d}\rVert}, \qquad \mathbf{f}_{t,a} \mathrel{+}= -\mathbf{F}_i, \quad \mathbf{f}_{t,b} \mathrel{+}= \mathbf{F}_i$$

is differentiated in reverse as

$$\frac{\partial L}{\partial \mathbf{x}_{t-1,a}} \mathrel{+}= \left(\frac{\partial \mathbf{F}_i}{\partial \mathbf{x}_{t-1,a}}\right)^{\top}\left(\frac{\partial L}{\partial \mathbf{f}_{t,b}} - \frac{\partial L}{\partial \mathbf{f}_{t,a}}\right), \qquad \frac{\partial L}{\partial \mathbf{x}_{t-1,b}} \mathrel{+}= -\left(\frac{\partial \mathbf{F}_i}{\partial \mathbf{x}_{t-1,a}}\right)^{\top}\left(\frac{\partial L}{\partial \mathbf{f}_{t,b}} - \frac{\partial L}{\partial \mathbf{f}_{t,a}}\right),$$

where $a$ and $b$ are the endpoints of spring $i$, $l_i$ its rest length, and $k$ the stiffness. These adjoint updates propagate via atomic accumulation into x.grad[t-1, a] and x.grad[t-1, b]. For collision handling, branch flattening alone is not enough to obtain useful derivatives: naive discrete event handling yields misleading gradients of the final state with respect to initial conditions (such as the initial height of a bouncing ball), while continuous time-of-impact (TOI) logic recovers the physically correct gradient.
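For concreteness, below is a minimal 1-D sketch of TOI-corrected integration for a ball bouncing on the ground; the field names, restitution coefficient, and time step are illustrative assumptions rather than the paper's exact code, and gravity is omitted for brevity.

```python
import taichi as ti

ti.init(arch=ti.cpu)

steps = 1024
dt = 1e-3
ground = 0.0
elasticity = 0.8  # assumed restitution coefficient

x = ti.field(dtype=ti.f32, shape=steps, needs_grad=True)  # height
v = ti.field(dtype=ti.f32, shape=steps, needs_grad=True)  # vertical velocity

@ti.kernel
def advance_toi(t: ti.i32):
    old_v = v[t - 1]
    new_v = old_v
    toi = 0.0
    # If the naive update would tunnel below the ground, compute the time of impact
    # within this step and reflect the velocity at that instant.
    if x[t - 1] + dt * old_v < ground and old_v < 0:
        toi = -(x[t - 1] - ground) / old_v  # time spent falling before impact
        new_v = -elasticity * old_v
    v[t] = new_v
    # Integrate with the old velocity before impact and the new velocity after it,
    # so x[t] depends continuously on x[t - 1] and gradients stay physically meaningful.
    x[t] = x[t - 1] + toi * old_v + (dt - toi) * new_v
```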
4. Performance Characteristics and Benchmarks
DiffTaichi permits fusing multiple simulation stages into single megakernels, minimizing memory traffic and maximizing arithmetic intensity. Parallel-for constructs are mapped directly to GPU kernels; global field updates use atomic addition for differentiability. Benchmarks executed on NVIDIA GTX 1080 Ti hardware demonstrate that DiffTaichi can match the runtime of hand-optimized CUDA code, significantly outperforming array-autodiff frameworks in both speed and code brevity:
| Approach | Forward (ms) | Backward (ms) | Total (ms, relative) | Lines of Code |
|---|---|---|---|---|
| TensorFlow | 13.20 | 35.70 | 48.90 (188×) | 190 |
| Hand-CUDA | 0.10 | 0.14 | 0.24 (0.92×) | 460 |
| DiffTaichi | 0.11 | 0.15 | 0.26 (1.00×) | 110 |
For continuum MPM with 6.4K particles, DiffTaichi is over 180× faster than TensorFlow and requires less than 25% the code of a hand-engineered CUDA implementation. On grid-based smoke simulation (110×110, 100 steps), DiffTaichi matches or exceeds the efficiency of PyTorch, JAX, and Autograd, achieving the shortest runtime and competitive code size.
5. Application Scenarios and Example Problems
DiffTaichi natively supports differentiable simulation and control tasks. Example problems include:
- Mass–Spring Rest-Length Optimization: Given a triangle mesh of springs, the rest lengths are optimized so that the final enclosed area matches a prescribed target value. A differentiable loss kernel computes the squared area error, and 200 iterations of direct gradient descent on the rest lengths drive the loss to zero:
```python
for iter in range(200):
    with ti.Tape(loss):
        forward()
        compute_loss(steps - 1)
    for i in range(n_springs):
        rest_length[i] -= lr * rest_length.grad[i]
```
- Differentiable Continuum MPM with Neural Controller: A 2D soft robot (∼6K particles) is actuated by a per-particle MLP controller whose parameters are optimized to maximize forward displacement. End-to-end differentiation (forward + backward ∼0.26 ms/step) enables controller convergence in less than 50 gradient iterations, with only ∼110 lines of simulation code.
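To give a flavor of the controller side, a heavily simplified actuation kernel in the spirit of this example is sketched below; all sizes and field names (weights, bias, center_of_mass, actuation) are illustrative assumptions, and the actual controller in the reference simulators may differ.

```python
import math
import taichi as ti

ti.init(arch=ti.cpu)

# Illustrative sizes and fields (assumptions, not the paper's exact setup).
steps, n_actuators, n_sin_waves = 1024, 4, 4
dt = 2e-3
weights = ti.field(dtype=ti.f32, shape=(n_actuators, n_sin_waves + 1), needs_grad=True)
bias = ti.field(dtype=ti.f32, shape=n_actuators, needs_grad=True)
center_of_mass = ti.Vector.field(2, dtype=ti.f32, shape=steps, needs_grad=True)
actuation = ti.field(dtype=ti.f32, shape=(steps, n_actuators), needs_grad=True)

@ti.kernel
def nn_controller(t: ti.i32):
    for i in range(n_actuators):
        act = 0.0
        # Periodic features plus the robot's horizontal center of mass as observations.
        for j in ti.static(range(n_sin_waves)):
            act += weights[i, j] * ti.sin(2 * math.pi * j * t * dt)
        act += weights[i, n_sin_waves] * center_of_mass[t][0]
        actuation[t, i] = ti.tanh(act + bias[i])
```

Because the kernel writes into differentiable fields, the same tape mechanism backpropagates the displacement loss into weights.grad and bias.grad for gradient-based controller updates.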
Across all ten reference simulators (including rigid bodies, fluids, deformables, and rendering scenarios), the same workflow applies.
6. System Innovations and Research Contributions
DiffTaichi introduces several design choices that address the limitations of prior autodiff frameworks for physical simulation:
- Two-scale AD fuses complex simulation code into megakernels, maintaining hardware occupancy and high arithmetic intensity.
- Imperative indexing and C++/CUDA-inspired control flow are supported natively, enabling straightforward porting of legacy simulation code.
- Lightweight tape records only kernel metadata, not arrays, thus drastically reducing memory usage and overhead.
- User-defined complex kernels and checkpointing logic are built-in, supporting intricate control flow, custom gradients, and physics-specific routines (e.g., continuous collision detection for physically correct gradients).
Comprehensive benchmarks on rigid bodies, fluids, deformable objects, mass-spring systems, billiards, robotic systems, liquid coupling, physics-based rendering, and electric field simulation demonstrate the broad applicability and performance profile of DiffTaichi (Hu et al., 2019).