
Deep Legendre Transform (DLT): Neural Convex Conjugates

Updated 13 March 2026
  • DLT is a neural framework for evaluating convex conjugates by leveraging the analytic relationship between a convex function and its Legendre transform.
  • The method trains neural networks to minimize an empirical squared residual, thereby sidestepping computationally expensive supremum optimizations.
  • DLT offers scalability and unbiased a posteriori error estimation, outperforming grid-based methods in high-dimensional convex optimization tasks.

The Deep Legendre Transform (DLT) is a neural framework for learning and evaluating convex conjugates (Legendre–Fenchel transforms) of differentiable convex functions in high dimensions. By exploiting an implicit analytic relation between a convex function and its conjugate, DLT circumvents direct supremum optimization or grid-based discretizations, facilitating scalable computation and a posteriori estimation of approximation error. DLT is applicable in convex optimization, variational analysis, Hamilton–Jacobi PDEs, and elsewhere in mathematics, physics, and economics (Minabutdinov et al., 22 Dec 2025).

1. Theoretical Foundations

For a differentiable convex function $f:\mathbb{R}^n\to\mathbb{R}$, the Legendre–Fenchel transform (convex conjugate) is

$$f^*(y)=\sup_{x\in\mathbb{R}^n}\{\langle x,y\rangle - f(x)\}.$$

When $f$ is differentiable on an open domain $C\subseteq\mathbb{R}^n$, this relation simplifies on the dual domain $D=\nabla f(C)$:

$$f^*(\nabla f(x)) = \langle x,\nabla f(x)\rangle - f(x), \quad x\in C.$$

This identity follows from the Fenchel–Young theorem. Traditional grid-based approaches, even in optimized form (e.g., Lucet's nested LLT algorithm), are computationally intractable for large $d$, with costs scaling as $O(N^{2d})$ or $O(dN^{d+1})$ for a uniform grid size $N$ per dimension (Minabutdinov et al., 22 Dec 2025). Neural approaches aiming to avoid grid enumeration often require solving max–min problems, which similarly scale poorly as $d$ grows.
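As a quick numerical illustration (a minimal sketch in JAX, not taken from the paper), the Fenchel–Young identity can be checked for a quadratic $f(x)=\tfrac12\langle x, Ax\rangle$ with symmetric positive-definite $A$, whose conjugate is $f^*(y)=\tfrac12\langle y, A^{-1}y\rangle$:

```python
import jax
import jax.numpy as jnp

# Quadratic f(x) = 0.5 x^T A x with A symmetric positive definite;
# its conjugate is f*(y) = 0.5 y^T A^{-1} y.
A = jnp.array([[2.0, 0.5],
               [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
f_star = lambda y: 0.5 * y @ jnp.linalg.inv(A) @ y

x = jnp.array([1.0, -2.0])
y = jax.grad(f)(x)          # y = grad f(x) = A x

lhs = f_star(y)             # f*(grad f(x))
rhs = x @ y - f(x)          # <x, grad f(x)> - f(x)
print(lhs, rhs)             # the two values agree up to floating-point error
```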

2. DLT Methodology and Implicit Optimization Objective

DLT leverages the analytical identity above to train a neural network $g_\theta:D\to\mathbb{R}$ to approximate $f^*$. Instead of optimizing over $x$ for each $y$, DLT minimizes the empirical squared residual:

$$\mathcal{L}(\theta) = \sum_{x\in X}\bigl[g_\theta(\nabla f(x)) + f(x) - \langle x,\nabla f(x)\rangle \bigr]^2 + \lambda\|\theta\|^2,$$

where $X$ is a training set sampled from $C$ according to a distribution $\mu$, $\lambda$ is the weight-decay coefficient, and the gradients $\nabla f(x)$ are computed via autodiff frameworks such as JAX or PyTorch. In expectation, this loss corresponds to the squared $L^2$ error between $g_\theta$ and the true $f^*$ on the pushforward distribution $\nu = \mu\circ(\nabla f)^{-1}$, so minimizing it avoids computing $f^*$ explicitly or solving a supremum for each $y$.
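A minimal sketch of this residual loss in JAX (illustrative only; `g_apply`, the parameter pytree `theta`, and the hyperparameter names are assumptions, not the paper's code):

```python
import jax
import jax.numpy as jnp

def dlt_loss(theta, g_apply, f, xs, lam=1e-4):
    """Empirical DLT loss: squared Fenchel-Young residual plus weight decay.

    theta   -- model parameters (a pytree)
    g_apply -- g_apply(theta, y), the network approximating f*(y)
    f       -- the convex function, mapping a vector in C to a scalar
    xs      -- training batch sampled from C, shape (batch, n)
    """
    grad_f = jax.vmap(jax.grad(f))(xs)                        # grad f(x) per sample
    residual = (jax.vmap(g_apply, in_axes=(None, 0))(theta, grad_f)
                + jax.vmap(f)(xs)
                - jnp.einsum('bi,bi->b', xs, grad_f))         # g(grad f(x)) + f(x) - <x, grad f(x)>
    weight_decay = sum(jnp.sum(p ** 2) for p in jax.tree_util.tree_leaves(theta))
    return jnp.sum(residual ** 2) + lam * weight_decay
```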

Regularization (e.g., weight decay) is applied in the standard way, and optimization uses Adam with a learning rate of $10^{-3}$ and a batch size determined empirically.
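A corresponding training step might look as follows (a sketch assuming Optax's Adam and the `dlt_loss`, `g_apply`, and `f` from the snippet above):

```python
import jax
import optax

optimizer = optax.adam(learning_rate=1e-3)   # learning rate as stated above

@jax.jit
def train_step(theta, opt_state, xs):
    # Differentiate the DLT loss with respect to theta only; g_apply and f are
    # assumed to be defined at module level as in the previous sketch.
    loss, grads = jax.value_and_grad(dlt_loss)(theta, g_apply, f, xs)
    updates, opt_state = optimizer.update(grads, opt_state, theta)
    theta = optax.apply_updates(theta, updates)
    return theta, opt_state, loss
```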

3. Neural Network Architectures and Training Protocol

DLT can be instantiated using various neural network architectures with specific design considerations:

| Architecture | Key Properties | Convexity Guarantee |
|---|---|---|
| MLP | 2×128 units, GELU | None |
| ResNet | 2 residual blocks (each Dense(128) → GELU → Dense(128)); skip connections | None |
| ICNN | Input-Convex Neural Network; non-negative hidden weights, softplus nonlinearity; direct input skip | Yes |
| MLP_ICNN | MLP variant, non-negative weights, softplus, no skip | Yes |
| KAN | Kolmogorov–Arnold Network; basis expansion; used for symbolic regression | No |
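To illustrate the convexity-preserving design noted for the ICNN row above, the sketch below uses non-negative hidden-to-hidden weights (enforced by a softplus reparameterization) and a direct input skip; this is a generic ICNN construction, not necessarily the paper's exact parameterization:

```python
import jax
import jax.numpy as jnp

def icnn_forward(params, y):
    """Minimal input-convex network g(y), convex in y by construction."""
    # First layer: affine in y, followed by a convex, nondecreasing activation.
    z = jax.nn.softplus(params['W_y0'] @ y + params['b0'])
    # Hidden layers: hidden-to-hidden weights reparameterized to be non-negative,
    # plus a direct skip connection from the input y (affine terms keep convexity).
    for Wz_raw, Wy, b in params['hidden']:
        z = jax.nn.softplus(jax.nn.softplus(Wz_raw) @ z + Wy @ y + b)
    # Output: a non-negative combination of convex activations stays convex in y.
    return jax.nn.softplus(params['w_out_raw']) @ z + params['b_out']
```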

For KANs, each network layer sums over one-dimensional basis functions $B_m$ (including polynomial, exponential, logarithmic, and trigonometric forms); post-training, symbolic regression (e.g., sparse linear regression over the $B_m$) enables closed-form recovery of the learned conjugate.
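The post-training symbolic step can be sketched as a least-squares fit of the learned conjugate onto a small dictionary of candidate basis functions, with near-zero coefficients pruned afterwards (the dictionary and fitting procedure here are illustrative, not the paper's exact recipe):

```python
import jax.numpy as jnp

def fit_symbolic_1d(ys, g_vals):
    """Fit g(y) ~ sum_m c_m B_m(y) over a 1-D dictionary by least squares.

    ys     -- sample points, shape (n,)
    g_vals -- learned conjugate values g_theta(y) at those points, shape (n,)
    """
    basis = jnp.stack([ys, ys ** 2, jnp.exp(ys),
                       jnp.log(jnp.abs(ys) + 1e-8),
                       jnp.sin(ys), jnp.ones_like(ys)], axis=1)   # candidate B_m
    coeffs, *_ = jnp.linalg.lstsq(basis, g_vals)
    return coeffs   # e.g. a quadratic conjugate shows up as a single y^2 coefficient
```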

Training follows a standard stochastic minibatch loop. For improved sampling in cases where $\nabla f$ distorts the sampling measure (notably for the negative-logarithm benchmark), DLT augments training with an inverse network $h_\varphi\approx(\nabla f)^{-1}$, reducing the $L^2$ error by an order of magnitude in high dimensions at modest additional cost.
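One plausible way to realize this augmentation (the paper's exact objective for $h_\varphi$ is not spelled out here, so this is an assumption) is to train $h_\varphi$ to invert $\nabla f$ on the sampled points and then use it to map dual-domain samples back into $C$:

```python
import jax
import jax.numpy as jnp

def inverse_net_loss(phi, h_apply, f, xs):
    """Train h_phi so that h_phi(grad f(x)) recovers x (illustrative objective)."""
    grad_f = jax.vmap(jax.grad(f))(xs)
    x_hat = jax.vmap(h_apply, in_axes=(None, 0))(phi, grad_f)
    return jnp.mean(jnp.sum((x_hat - xs) ** 2, axis=-1))
```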

4. Error Estimation and A Posteriori Guarantees

DLT provides unbiased a posteriori error estimation without requiring knowledge of the exact $f^*$. Given $n$ i.i.d. samples $x_i$ drawn from $\mu$ on $C$,

$$\widehat E_n(g) = \frac{1}{n} \sum_{i=1}^n \left[ g(\nabla f(x_i)) + f(x_i) - \langle x_i, \nabla f(x_i) \rangle \right]^2$$

estimates the $L^2$ error $\| g - f^* \|^2_{L^2(D,\nu)}$; the estimator's variance decreases as $1/n$, enabling confidence intervals via the CLT. No ground-truth conjugate is necessary, which is an advantage in domains where $f^*$ lacks a closed analytic form.
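A minimal sketch of this estimator, including a CLT-based confidence half-width over the per-sample squared residuals (illustrative implementation; `g` here takes a dual point directly):

```python
import jax
import jax.numpy as jnp

def dlt_error_estimate(g, f, xs, z=1.96):
    """Unbiased estimate of ||g - f*||^2 in L^2(D, nu) from samples x_i ~ mu on C,
    with a normal-approximation ~95% confidence half-width (z = 1.96)."""
    grad_f = jax.vmap(jax.grad(f))(xs)
    sq_res = (jax.vmap(g)(grad_f) + jax.vmap(f)(xs)
              - jnp.einsum('bi,bi->b', xs, grad_f)) ** 2
    n = sq_res.shape[0]
    estimate = jnp.mean(sq_res)
    half_width = z * jnp.std(sq_res, ddof=1) / jnp.sqrt(n)
    return estimate, half_width
```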

5. Numerical Benchmarks and Comparative Performance

DLT achieves high-accuracy approximations across a spectrum of convex test functions, including high-dimensional quadratic, negative-logarithm, and negative-entropy functions, as well as more complex cases without closed-form conjugates, such as quadratic-over-linear. Benchmark results show RMSE for DLT nearly identical to direct learning (when $f^*$ is available), with comparable training times. For the quadratic-over-linear function, small $L^2$ errors persist up to $d=100$ dimensions.

In comparison to Lucet's grid-based nested LLT, DLT remains computationally tractable as the dimension increases, with active memory usage of $\sim$1 MB and second-scale runtimes, while grid-based methods become infeasible for $d>8$ due to exponential cost in time (up to thousands of seconds) and memory (exceeding gigabytes). Inverse sampling for distorted gradients further reduces estimation error.

| Method | $t_\text{solve}$ (s) | Memory (MB) | RMSE |
|---|---|---|---|
| Lucet ($d=8$) | 1960 | 1530 | 29.3 |
| DLT ($d=8$) | 26.22 | 1.1 | 0.133 |

6. Advanced Variants: Symbolic Regression and Hamilton–Jacobi Extension

Utilizing Kolmogorov–Arnold Networks with symbolic regression, DLT can extract exact closed-form representations of $f^*$ for classes of separable convex functions, recovering, for example, $f^*(y)=0.500\,y_1^2 + 0.500\,y_2^2$ for a quadratic in $d=2$ with residuals below $10^{-14}$. This regime permits both high-precision approximation and symbolic interpretability.

DLT generalizes to time-dependent convex-conjugate computations, such as those arising in Hamilton–Jacobi equations. For instance, the Hopf-formula solution $u(x,t)$ of a Hamilton–Jacobi PDE is written in terms of conjugates and can be approximated by a time-augmented DLT model $G_\theta(y,t)$. In benchmarks with quadratic initial data and Hamiltonians at $d=10, 30$, time-parameterized DLT attains a $3$–$9\times$ smaller $L^2$ error than the Deep Galerkin Method (DGM) for $t\ge 0.5$, despite the latter's lower PDE residual.
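For context, one standard form of the Hopf formula (assuming convex initial data $g$ and a Hamiltonian $H$ depending only on $\nabla u$; this form is supplied here for orientation rather than quoted from the paper) expresses the solution through conjugates, which is the structure the time-augmented $G_\theta(y,t)$ approximates:

$$u(x,t) = \sup_{y\in\mathbb{R}^n}\bigl\{\langle x, y\rangle - g^*(y) - t\,H(y)\bigr\} = (g^* + tH)^*(x), \qquad u(x,0) = g(x).$$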

7. Limitations and Domain of Applicability

DLT requires $f$ to be convex and differentiable on an open set $C$; its learned approximation is valid on $D=\nabla f(C)$, which may not coincide with the full effective domain of $f^*$. The framework's predictive quality depends critically on neural network expressivity and sufficient computational resources. DLT does not directly address nonconvex or nondifferentiable targets.

DLT is particularly suited for situations prohibitive to grid-based or direct-supremum computation, including high-dimensional convex optimization duality, optimal transport, variational constructs (Moreau envelopes), indirect utility in economics, thermodynamic duality, Hamilton–Jacobi theory, and deep generative modeling (e.g., Wasserstein gradient flows) (Minabutdinov et al., 22 Dec 2025).
