Deep Legendre Transform (DLT): Neural Convex Conjugates
- DLT is a neural framework for evaluating convex conjugates by leveraging the analytic relationship between a convex function and its Legendre transform.
- The method trains neural networks to minimize an empirical squared residual, thereby sidestepping computationally expensive supremum optimizations.
- DLT offers scalability and unbiased a posteriori error estimation, outperforming grid-based methods in high-dimensional convex optimization tasks.
The Deep Legendre Transform (DLT) is a neural framework for learning and evaluating convex conjugates (Legendre–Fenchel transforms) of differentiable convex functions in high dimensions. By exploiting an implicit analytic relation between a convex function and its conjugate, DLT circumvents direct supremum optimization or grid-based discretizations, facilitating scalable computation and a posteriori estimation of approximation error. DLT is applicable in convex optimization, variational analysis, Hamilton–Jacobi PDEs, and elsewhere in mathematics, physics, and economics (Minabutdinov et al., 22 Dec 2025).
1. Theoretical Foundations
For a differentiable convex function $f:\mathbb{R}^d \to \mathbb{R}$, the Legendre–Fenchel transform (convex conjugate) is

$$f^*(y) = \sup_{x \in \mathbb{R}^d} \big\{\langle x, y\rangle - f(x)\big\}.$$

When $f$ is differentiable on an open domain $\Omega \subseteq \mathbb{R}^d$, this relation simplifies on the dual domain $\nabla f(\Omega)$:

$$f^*(\nabla f(x)) = \langle x, \nabla f(x)\rangle - f(x), \qquad x \in \Omega.$$
This identity follows from the Fenchel–Young theorem. Traditional grid-based approaches, even in optimized form (e.g., Lucet's nested LLT algorithm), are computationally intractable for large $d$, with costs scaling exponentially in the dimension (on the order of $N^d$ for uniform grids with $N$ points per dimension) (Minabutdinov et al., 22 Dec 2025). Neural approaches aiming to avoid grid enumeration often require solving max-min problems, which similarly scale poorly as $d$ grows.
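As a concrete sanity check of the identity (an illustrative example, not taken from the source): for the quadratic $f(x) = \tfrac{1}{2}\|x\|^2$ one has $\nabla f(x) = x$, so

$$f^*(\nabla f(x)) = \langle x, \nabla f(x)\rangle - f(x) = \|x\|^2 - \tfrac{1}{2}\|x\|^2 = \tfrac{1}{2}\|\nabla f(x)\|^2,$$

which recovers the known self-conjugacy $f^*(y) = \tfrac{1}{2}\|y\|^2$ without ever solving the supremum.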
2. DLT Methodology and Implicit Optimization Objective
DLT leverages the analytical identity above to train a function $g_\theta$, parameterized by a neural network, to approximate $f^*$. Instead of optimizing over $x$ for each $y$, DLT minimizes the empirical squared residual:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(g_\theta(\nabla f(x_i)) - \langle x_i, \nabla f(x_i)\rangle + f(x_i)\Big)^2 + \lambda\,\|\theta\|_2^2,$$

where $\{x_i\}_{i=1}^n$ is a training set sampled from a measure $\mu$ on $\Omega$, $\lambda$ is the weight-decay coefficient, and gradient computations are performed via autodiff frameworks such as JAX or PyTorch. The residual term coincides with the mean squared error between $g_\theta$ and the true $f^*$ under the pushforward distribution $(\nabla f)_{\#}\mu$. This approach avoids the need to compute $f^*$ explicitly or solve a supremum for each $y$.
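A minimal sketch of this residual loss in JAX (the quadratic `f` and the `g_theta(params, y)` calling convention are illustrative assumptions, not the source's code):

```python
import jax
import jax.numpy as jnp

def f(x):
    """Illustrative differentiable convex target (a simple quadratic); stand-in for the real f."""
    return 0.5 * jnp.sum(x ** 2)

grad_f = jax.grad(f)  # ∇f via autodiff

def dlt_loss(params, g_theta, xs, weight_decay=1e-4):
    """Empirical squared residual of the Fenchel–Young identity plus weight decay.

    xs      : (n, d) samples from the primal domain Ω.
    g_theta : g_theta(params, y) -> scalar, neural approximation of f*.
    """
    ys = jax.vmap(grad_f)(xs)                                    # dual points y_i = ∇f(x_i)
    targets = jnp.einsum("nd,nd->n", xs, ys) - jax.vmap(f)(xs)   # ⟨x_i, ∇f(x_i)⟩ − f(x_i)
    preds = jax.vmap(lambda y: g_theta(params, y))(ys)
    residual = jnp.mean((preds - targets) ** 2)
    l2 = sum(jnp.sum(p ** 2) for p in jax.tree_util.tree_leaves(params))
    return residual + weight_decay * l2
```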
Regularization (e.g., weight decay) is applied in the standard way, and optimization uses Adam, with the learning rate and batch size chosen empirically.
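A corresponding minibatch loop might look like the following (a sketch assuming the `dlt_loss` above, a user-supplied sampler `sample_xs`, and optax for Adam; all hyperparameter values are illustrative):

```python
import optax

def train(params, g_theta, sample_xs, steps=5_000, batch_size=1_024, lr=1e-3):
    """Standard stochastic minibatch loop for the DLT loss (hyperparameters illustrative)."""
    opt = optax.adam(lr)
    opt_state = opt.init(params)
    loss_and_grad = jax.value_and_grad(dlt_loss)   # differentiates w.r.t. params (argnum 0)

    for step in range(steps):
        xs = sample_xs(batch_size)                 # fresh minibatch from the primal domain
        loss, grads = loss_and_grad(params, g_theta, xs)
        updates, opt_state = opt.update(grads, opt_state)
        params = optax.apply_updates(params, updates)
    return params
```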
3. Neural Network Architectures and Training Protocol
DLT can be instantiated using various neural network architectures with specific design considerations:
| Architecture | Key Properties | Convexity Guarantee |
|---|---|---|
| MLP | 2×128 units, GELU | None |
| ResNet | 2 residual blocks (each Dense(128) → GELU → Dense(128)); skip connections | None |
| ICNN | Input-Convex Network; non-negative hidden weights, softplus nonlinearity; direct input skip (sketched below the table) | Yes |
| MLP_ICNN | MLP variant, non-negative weights, softplus, no skip | Yes |
| KAN | Kolmogorov–Arnold Network; basis expansion; for symbolic regression | No |
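A minimal JAX sketch of the ICNN row (the nonnegativity reparameterization via softplus and the initialization scheme are illustrative assumptions; the source specifies only nonnegative hidden weights, a softplus nonlinearity, and a direct input skip):

```python
import jax
import jax.numpy as jnp

def init_icnn(key, dim_in, hidden=(128, 128)):
    """Illustrative ICNN parameters: per-layer direct input skip W_x; hidden weights kept nonnegative."""
    sizes = (dim_in,) + tuple(hidden) + (1,)
    params = []
    for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:])):
        key, k1, k2 = jax.random.split(key, 3)
        layer = {"W_x": jax.random.normal(k1, (dim_in, n)) / jnp.sqrt(dim_in), "b": jnp.zeros(n)}
        if i > 0:  # hidden-to-hidden weights, made nonnegative via softplus in the forward pass
            layer["W_z_raw"] = jax.random.normal(k2, (m, n)) / jnp.sqrt(m)
        params.append(layer)
    return params

def icnn(params, x):
    """Output is convex in x: softplus is convex and nondecreasing, and hidden weights are nonnegative."""
    *hidden_layers, out_layer = params
    z = None
    for layer in hidden_layers:
        pre = x @ layer["W_x"] + layer["b"]
        if z is not None:
            pre = pre + z @ jax.nn.softplus(layer["W_z_raw"])  # nonnegative mixing of convex features
        z = jax.nn.softplus(pre)
    out = x @ out_layer["W_x"] + out_layer["b"] + z @ jax.nn.softplus(out_layer["W_z_raw"])
    return out.squeeze(-1)
```

With this parameterization, `icnn` can be passed as `g_theta` to the `dlt_loss` sketch above while guaranteeing a convex approximation of $f^*$.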
For KANs, each network layer sums over one-dimensional basis functions (including polynomial, exponential, logarithmic, and trigonometric forms); post-training, symbolic regression (e.g., sparse linear regression over the dictionary of basis functions) enables closed-form recovery of learned conjugates.
Training follows a standard stochastic minibatch loop. For improved sampling in cases where $\nabla f$ distorts the sampling measure (notably in the Neg-Log benchmark), DLT augments training with an inverse network approximating the inverse gradient map $(\nabla f)^{-1}$, reducing error by an order of magnitude in high dimensions at modest additional computational cost.
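The source does not spell out how the inverse network is fit; one plausible objective, shown purely as an assumption (the network `h_phi` and its loss are hypothetical), is to penalize the round-trip error through $\nabla f$:

```python
def inverse_net_loss(phi, h_phi, ys):
    """Fit h_phi ≈ (∇f)^{-1} by penalizing the round-trip error ∇f(h_phi(y)) ≈ y (assumed objective)."""
    xs = jax.vmap(lambda y: h_phi(phi, y))(ys)      # candidate primal points
    ys_rec = jax.vmap(grad_f)(xs)                   # map them back to the dual domain
    return jnp.mean(jnp.sum((ys_rec - ys) ** 2, axis=-1))

# Usage sketch: sample ys (approximately) uniformly over the dual region of interest, train h_phi
# on inverse_net_loss, and feed xs = vmap(h_phi)(ys) into dlt_loss so that the residual is
# controlled roughly uniformly in y rather than only under the pushforward (∇f)#μ.
```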
4. Error Estimation and A Posteriori Guarantees
DLT provides unbiased a posteriori error estimation without requiring knowledge of the exact $f^*$. Given $n$ i.i.d. samples $x_1,\dots,x_n$ from $\mu$ on $\Omega$,

$$\widehat{E}_n = \frac{1}{n}\sum_{i=1}^{n}\Big(g_\theta(\nabla f(x_i)) - \langle x_i, \nabla f(x_i)\rangle + f(x_i)\Big)^2$$

estimates the error $\|g_\theta - f^*\|_{L^2((\nabla f)_{\#}\mu)}^2$; the variance of the estimator decreases as $1/n$, enabling confidence intervals via the CLT. No ground-truth conjugate is necessary, an advantage in domains where $f^*$ lacks a closed analytic form.
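Reusing the illustrative `f`, `grad_f`, and `g_theta` conventions from the sketches above, the estimator and a normal-approximation confidence interval could be computed as follows:

```python
def error_estimate(params, g_theta, xs, z=1.96):
    """Unbiased estimate of the squared L2 error on the pushforward, with a CLT confidence interval."""
    ys = jax.vmap(grad_f)(xs)
    targets = jnp.einsum("nd,nd->n", xs, ys) - jax.vmap(f)(xs)
    sq_res = (jax.vmap(lambda y: g_theta(params, y))(ys) - targets) ** 2
    mean = jnp.mean(sq_res)
    half_width = z * jnp.std(sq_res, ddof=1) / jnp.sqrt(sq_res.shape[0])
    return mean, (mean - half_width, mean + half_width)
```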
5. Numerical Benchmarks and Comparative Performance
DLT achieves high-accuracy approximations across a spectrum of convex test functions, including high-dimensional quadratic, negative logarithm, negative entropy, and more complex cases without closed-form conjugates, such as quadratic-over-linear. Benchmark results show RMSE for DLT nearly identical to direct supervised learning (when $f^*$ is available in closed form), with comparable training times. For the quadratic-over-linear function, errors remain small across the tested dimensions.
In comparison to Lucet's grid-based nested LLT, DLT remains computationally tractable as dimension increases, with active memory usage on the order of 1 MB and runtimes on the order of seconds, while grid-based methods become infeasible due to exponential cost in time (up to thousands of seconds) and memory (exceeding gigabytes) already at moderate dimensions. Inverse-sampling for distorted gradients further reduces estimation error.
| Method | Runtime (s) | Memory (MB) | RMSE |
|---|---|---|---|
| Lucet () | 1960 | 1530 | 29.3 |
| DLT () | 26.22 | 1.1 | 0.133 |
6. Advanced Variants: Symbolic Regression and Hamilton–Jacobi Extension
Utilizing Kolmogorov–Arnold Networks with symbolic regression, DLT can extract exact closed-form representations of $f^*$ for classes of separable convex functions, recovering, for example, the closed-form conjugate of a quadratic test function with very small residuals. This regime permits both high-precision approximation and symbolic interpretability.
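The source does not detail the regression step; the following is a crude, hedged stand-in that fits a separable basis expansion by least squares and hard-thresholds small coefficients (function and parameter names are hypothetical):

```python
def symbolic_fit(params, g_theta, ys, threshold=1e-3):
    """Stand-in for sparse symbolic regression (assumed recipe, not the source's exact method).

    Fits g_theta(y) ≈ Σ_k c_k · Σ_j φ_k(y_j) (a separable ansatz) by least squares over a small
    basis library, then drops tiny coefficients to obtain a readable expression. The library must
    be chosen so every φ_k is defined on the dual domain (e.g. log requires y > 0).
    """
    basis = {
        "y": lambda y: y,
        "y^2": lambda y: y ** 2,
        "exp(y)": jnp.exp,
        "log(y)": jnp.log,
        "sin(y)": jnp.sin,
    }
    A = jnp.stack([jnp.sum(phi(ys), axis=-1) for phi in basis.values()], axis=-1)  # design matrix
    b = jax.vmap(lambda y: g_theta(params, y))(ys)
    coeffs, *_ = jnp.linalg.lstsq(A, b)
    return {name: float(c) for name, c in zip(basis, coeffs) if abs(c) > threshold}
```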
DLT generalizes to time-dependent convex-conjugate computations, such as those arising in Hamilton–Jacobi equations. For instance, the Hopf formula expresses the solution of a Hamilton–Jacobi PDE with convex initial data $g$ and Hamiltonian $H$ as $u(x,t) = (g^* + tH)^*(x)$, a family of conjugates that can be approximated by a time-augmented DLT network taking $t$ as an additional input. In benchmarks with quadratic initial data and Hamiltonians, the time-parameterized DLT attains at least $3\times$ smaller error than the Deep Galerkin Method (DGM), despite the latter's lower PDE residual.
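Under the assumption that the same Fenchel–Young identity is applied to $\phi_t := g^* + tH$ for each $t$ (the source states the time augmentation but not this exact construction), the time-augmented residual would read

$$G_\theta\big(\nabla \phi_t(y),\, t\big) \;\approx\; \langle y, \nabla \phi_t(y)\rangle - \phi_t(y), \qquad \nabla \phi_t(y) = \nabla g^*(y) + t\,\nabla H(y),$$

with training pairs $(y_i, t_i)$ sampled jointly, so that $u(x,t) \approx G_\theta(x,t)$ recovers the Hopf-formula solution.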
7. Limitations and Domain of Applicability
DLT requires $f$ to be convex and differentiable on an open set $\Omega$; its learned approximation is valid on the dual domain $\nabla f(\Omega)$, which may not coincide with the full effective domain of $f^*$. The framework's predictive quality depends critically on neural network expressivity and sufficient computational resources. DLT does not directly address nonconvex or nondifferentiable targets.
DLT is particularly suited to settings where grid-based or direct-supremum computation is prohibitive, including high-dimensional convex optimization duality, optimal transport, variational constructs (Moreau envelopes), indirect utility in economics, thermodynamic duality, Hamilton–Jacobi theory, and deep generative modeling (e.g., Wasserstein gradient flows) (Minabutdinov et al., 22 Dec 2025).