
Deep Legendre Transform (DLT): Neural Convex Conjugates

Updated 13 March 2026
  • DLT is a neural framework for evaluating convex conjugates by leveraging the analytic relationship between a convex function and its Legendre transform.
  • The method trains neural networks to minimize an empirical squared residual, thereby sidestepping computationally expensive supremum optimizations.
  • DLT offers scalability and unbiased a posteriori error estimation, outperforming grid-based methods in high-dimensional convex optimization tasks.

The Deep Legendre Transform (DLT) is a neural framework for learning and evaluating convex conjugates (Legendre–Fenchel transforms) of differentiable convex functions in high dimensions. By exploiting an implicit analytic relation between a convex function and its conjugate, DLT circumvents direct supremum optimization or grid-based discretizations, facilitating scalable computation and a posteriori estimation of approximation error. DLT is applicable in convex optimization, variational analysis, Hamilton–Jacobi PDEs, and elsewhere in mathematics, physics, and economics (Minabutdinov et al., 22 Dec 2025).

1. Theoretical Foundations

For a differentiable convex function $f:\mathbb{R}^n\to\mathbb{R}$, the Legendre–Fenchel transform (convex conjugate) is

$$f^*(y)=\sup_{x\in\mathbb{R}^n}\{\langle x,y\rangle - f(x)\}.$$

When $f$ is differentiable on an open domain $C\subseteq\mathbb{R}^n$, this relation simplifies on the dual domain $D=\nabla f(C)$:

$$f^*(\nabla f(x)) = \langle x,\nabla f(x)\rangle - f(x), \quad x\in C.$$

This identity follows from the Fenchel–Young theorem. Traditional grid-based approaches, even in optimized form (e.g., Lucet's nested LLT algorithm), are computationally intractable for large $d$, with costs scaling as $O(N^{2d})$ or $O(dN^{d+1})$ for a uniform grid size $N$ per dimension (Minabutdinov et al., 22 Dec 2025). Neural approaches aiming to avoid grid enumeration often require solving max–min problems, which similarly scale poorly as $d$ grows.
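As a quick numerical illustration (a minimal sketch in JAX, not taken from the paper), the Fenchel–Young identity can be checked for a quadratic $f(x)=\tfrac12\langle x, Ax\rangle$ with symmetric positive-definite $A$, whose conjugate is $f^*(y)=\tfrac12\langle y, A^{-1}y\rangle$:

```python
import jax
import jax.numpy as jnp

# Quadratic f(x) = 0.5 x^T A x with A symmetric positive definite;
# its conjugate is f*(y) = 0.5 y^T A^{-1} y.
A = jnp.array([[2.0, 0.5],
               [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
f_star = lambda y: 0.5 * y @ jnp.linalg.inv(A) @ y

x = jnp.array([1.0, -2.0])
y = jax.grad(f)(x)          # y = grad f(x) = A x

lhs = f_star(y)             # f*(grad f(x))
rhs = x @ y - f(x)          # <x, grad f(x)> - f(x)
print(lhs, rhs)             # the two values agree up to floating-point error
```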

2. DLT Methodology and Implicit Optimization Objective

DLT leverages the analytical identity above to train a neural network $g_\theta:D\to\mathbb{R}$ to approximate $f^*$. Instead of optimizing over $x$ for each $y$, DLT minimizes the empirical squared residual:

$$\mathcal{L}(\theta) = \sum_{x\in X}\bigl[g_\theta(\nabla f(x)) + f(x) - \langle x,\nabla f(x)\rangle \bigr]^2 + \lambda\|\theta\|^2,$$

where $X$ is a training set sampled from $C$ according to a distribution $\mu$, $\lambda$ is the weight-decay coefficient, and the gradients $\nabla f(x)$ are computed via autodiff frameworks such as JAX or PyTorch. In expectation, this loss corresponds to the squared $L^2$ error between $g_\theta$ and the true $f^*$ on the pushforward distribution $\nu = \mu\circ(\nabla f)^{-1}$, so minimizing it avoids computing $f^*$ explicitly or solving a supremum for each $y$.
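A minimal sketch of this residual loss in JAX (illustrative only; `g_apply`, the parameter pytree `theta`, and the hyperparameter names are assumptions, not the paper's code):

```python
import jax
import jax.numpy as jnp

def dlt_loss(theta, g_apply, f, xs, lam=1e-4):
    """Empirical DLT loss: squared Fenchel-Young residual plus weight decay.

    theta   -- model parameters (a pytree)
    g_apply -- g_apply(theta, y), the network approximating f*(y)
    f       -- the convex function, mapping a vector in C to a scalar
    xs      -- training batch sampled from C, shape (batch, n)
    """
    grad_f = jax.vmap(jax.grad(f))(xs)                        # grad f(x) per sample
    residual = (jax.vmap(g_apply, in_axes=(None, 0))(theta, grad_f)
                + jax.vmap(f)(xs)
                - jnp.einsum('bi,bi->b', xs, grad_f))         # g(grad f(x)) + f(x) - <x, grad f(x)>
    weight_decay = sum(jnp.sum(p ** 2) for p in jax.tree_util.tree_leaves(theta))
    return jnp.sum(residual ** 2) + lam * weight_decay
```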

Regularization (e.g., weight decay) is applied in the standard way, and optimization uses Adam with a learning rate of $10^{-3}$ and a batch size determined empirically.
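A corresponding training step might look as follows (a sketch assuming Optax's Adam and the `dlt_loss`, `g_apply`, and `f` from the snippet above):

```python
import jax
import optax

optimizer = optax.adam(learning_rate=1e-3)   # learning rate as stated above

@jax.jit
def train_step(theta, opt_state, xs):
    # Differentiate the DLT loss with respect to theta only; g_apply and f are
    # assumed to be defined at module level as in the previous sketch.
    loss, grads = jax.value_and_grad(dlt_loss)(theta, g_apply, f, xs)
    updates, opt_state = optimizer.update(grads, opt_state, theta)
    theta = optax.apply_updates(theta, updates)
    return theta, opt_state, loss
```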

3. Neural Network Architectures and Training Protocol

DLT can be instantiated using various neural network architectures with specific design considerations:

| Architecture | Key Properties | Convexity Guarantee |
|---|---|---|
| MLP | 2×128 units, GELU | None |
| ResNet | 2 residual blocks (each Dense(128) → GELU → Dense(128)); skip connections | None |
| ICNN | Input-Convex Neural Network; non-negative hidden weights, softplus nonlinearity; direct input skip | Yes |
| MLP_ICNN | MLP variant, non-negative weights, softplus, no skip | Yes |
| KAN | Kolmogorov–Arnold Network; basis expansion; used for symbolic regression | No |
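To illustrate the convexity-preserving design noted for the ICNN row above, the sketch below uses non-negative hidden-to-hidden weights (enforced by a softplus reparameterization) and a direct input skip; this is a generic ICNN construction, not necessarily the paper's exact parameterization:

```python
import jax
import jax.numpy as jnp

def icnn_forward(params, y):
    """Minimal input-convex network g(y), convex in y by construction."""
    # First layer: affine in y, followed by a convex, nondecreasing activation.
    z = jax.nn.softplus(params['W_y0'] @ y + params['b0'])
    # Hidden layers: hidden-to-hidden weights reparameterized to be non-negative,
    # plus a direct skip connection from the input y (affine terms keep convexity).
    for Wz_raw, Wy, b in params['hidden']:
        z = jax.nn.softplus(jax.nn.softplus(Wz_raw) @ z + Wy @ y + b)
    # Output: a non-negative combination of convex activations stays convex in y.
    return jax.nn.softplus(params['w_out_raw']) @ z + params['b_out']
```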

For KANs, each network layer sums over one-dimensional basis functions $B_m$ (including polynomial, exponential, logarithmic, and trigonometric forms); post-training, symbolic regression (e.g., sparse linear regression over the $B_m$) enables closed-form recovery of the learned conjugate.
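The post-training symbolic step can be sketched as a least-squares fit of the learned conjugate onto a small dictionary of candidate basis functions, with near-zero coefficients pruned afterwards (the dictionary and fitting procedure here are illustrative, not the paper's exact recipe):

```python
import jax.numpy as jnp

def fit_symbolic_1d(ys, g_vals):
    """Fit g(y) ~ sum_m c_m B_m(y) over a 1-D dictionary by least squares.

    ys     -- sample points, shape (n,)
    g_vals -- learned conjugate values g_theta(y) at those points, shape (n,)
    """
    basis = jnp.stack([ys, ys ** 2, jnp.exp(ys),
                       jnp.log(jnp.abs(ys) + 1e-8),
                       jnp.sin(ys), jnp.ones_like(ys)], axis=1)   # candidate B_m
    coeffs, *_ = jnp.linalg.lstsq(basis, g_vals)
    return coeffs   # e.g. a quadratic conjugate shows up as a single y^2 coefficient
```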

Training follows a standard stochastic minibatch loop. For improved sampling in cases where $\nabla f$ distorts the sampling measure (notably for the negative-logarithm benchmark), DLT augments training with an inverse network $h_\varphi\approx(\nabla f)^{-1}$, reducing the $L^2$ error by an order of magnitude in high dimensions at modest additional cost.
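One plausible way to realize this augmentation (the paper's exact objective for $h_\varphi$ is not spelled out here, so this is an assumption) is to train $h_\varphi$ to invert $\nabla f$ on the sampled points and then use it to map dual-domain samples back into $C$:

```python
import jax
import jax.numpy as jnp

def inverse_net_loss(phi, h_apply, f, xs):
    """Train h_phi so that h_phi(grad f(x)) recovers x (illustrative objective)."""
    grad_f = jax.vmap(jax.grad(f))(xs)
    x_hat = jax.vmap(h_apply, in_axes=(None, 0))(phi, grad_f)
    return jnp.mean(jnp.sum((x_hat - xs) ** 2, axis=-1))
```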

4. Error Estimation and A Posteriori Guarantees

DLT provides unbiased a posteriori error estimation without requiring knowledge of the exact $f^*$. Given $n$ i.i.d. samples $x_i$ drawn from $\mu$ on $C$,

$$\widehat E_n(g) = \frac{1}{n} \sum_{i=1}^n \left[ g(\nabla f(x_i)) + f(x_i) - \langle x_i, \nabla f(x_i) \rangle \right]^2$$

estimates the $L^2$ error $\| g - f^* \|^2_{L^2(D,\nu)}$; the estimator's variance decreases as $1/n$, enabling confidence intervals via the CLT. No ground-truth conjugate is necessary, which is an advantage in domains where $f^*$ lacks a closed analytic form.
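A minimal sketch of this estimator, including a CLT-based confidence half-width over the per-sample squared residuals (illustrative implementation; `g` here takes a dual point directly):

```python
import jax
import jax.numpy as jnp

def dlt_error_estimate(g, f, xs, z=1.96):
    """Unbiased estimate of ||g - f*||^2 in L^2(D, nu) from samples x_i ~ mu on C,
    with a normal-approximation ~95% confidence half-width (z = 1.96)."""
    grad_f = jax.vmap(jax.grad(f))(xs)
    sq_res = (jax.vmap(g)(grad_f) + jax.vmap(f)(xs)
              - jnp.einsum('bi,bi->b', xs, grad_f)) ** 2
    n = sq_res.shape[0]
    estimate = jnp.mean(sq_res)
    half_width = z * jnp.std(sq_res, ddof=1) / jnp.sqrt(n)
    return estimate, half_width
```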

5. Numerical Benchmarks and Comparative Performance

DLT achieves high-accuracy approximations across a spectrum of convex test functions, including high-dimensional quadratic, negative-logarithm, and negative-entropy functions, as well as more complex cases without closed-form conjugates, such as quadratic-over-linear. Benchmark results show RMSE for DLT nearly identical to direct learning (when $f^*$ is available), with comparable training times. For the quadratic-over-linear function, small $L^2$ errors persist up to $d=100$ dimensions.

In comparison to Lucet's grid-based nested LLT, DLT remains computationally tractable as the dimension increases, with active memory usage of $\sim$1 MB and second-scale runtimes, while grid-based methods become infeasible for $d>8$ due to exponential cost in time (up to thousands of seconds) and memory (exceeding gigabytes). Inverse sampling for distorted gradients further reduces estimation error.

| Method | $t_\text{solve}$ (s) | Memory (MB) | RMSE |
|---|---|---|---|
| Lucet ($d=8$) | 1960 | 1530 | 29.3 |
| DLT ($d=8$) | 26.22 | 1.1 | 0.133 |

6. Advanced Variants: Symbolic Regression and Hamilton–Jacobi Extension

Utilizing Kolmogorov–Arnold Networks with symbolic regression, DLT can extract exact closed-form representations of $f^*$ for classes of separable convex functions, recovering, for example, $f^*(y)=0.500\,y_1^2 + 0.500\,y_2^2$ for a quadratic in $d=2$ with residuals below $10^{-14}$. This regime permits both high-precision approximation and symbolic interpretability.

DLT generalizes to time-dependent convex-conjugate computations, such as those arising in Hamilton–Jacobi equations. For instance, the Hopf-formula solution $u(x,t)$ of a Hamilton–Jacobi PDE is written in terms of conjugates and can be approximated by a time-augmented DLT model $G_\theta(y,t)$. In benchmarks with quadratic initial data and Hamiltonians at $d=10, 30$, time-parameterized DLT attains a $3$–$9\times$ smaller $L^2$ error than the Deep Galerkin Method (DGM) for $t\ge 0.5$, despite the latter's lower PDE residual.
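For context, one standard form of the Hopf formula (assuming convex initial data $g$ and a Hamiltonian $H$ depending only on $\nabla u$; this form is supplied here for orientation rather than quoted from the paper) expresses the solution through conjugates, which is the structure the time-augmented $G_\theta(y,t)$ approximates:

$$u(x,t) = \sup_{y\in\mathbb{R}^n}\bigl\{\langle x, y\rangle - g^*(y) - t\,H(y)\bigr\} = (g^* + tH)^*(x), \qquad u(x,0) = g(x).$$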

7. Limitations and Domain of Applicability

DLT requires $f$ to be convex and differentiable on an open set $C$; its learned approximation is valid on $D=\nabla f(C)$, which may not coincide with the full effective domain of $f^*$. The framework's predictive quality depends critically on neural network expressivity and sufficient computational resources. DLT does not directly address nonconvex or nondifferentiable targets.

DLT is particularly suited for situations prohibitive to grid-based or direct-supremum computation, including high-dimensional convex optimization duality, optimal transport, variational constructs (Moreau envelopes), indirect utility in economics, thermodynamic duality, Hamilton–Jacobi theory, and deep generative modeling (e.g., Wasserstein gradient flows) (Minabutdinov et al., 22 Dec 2025).
