- The paper introduces a minimax optimization framework that leverages Wasserstein geometry to generate worst-case distributions in adversarial settings.
- It employs a single-loop gradient descent-ascent algorithm with neural transport mapping to ensure both computational efficiency and robust convergence.
- Empirical results on synthetic regression and on image datasets such as MNIST and CIFAR-10 validate the framework's scalability and effectiveness in robust risk mitigation.
Worst-case Sample Generation via Minimax Optimization in Wasserstein Space
The paper "Worst-case generation via minimax optimization in Wasserstein space" (2512.08176) develops a minimax framework for risk-driven adversarial data generation with an explicit focus on distributionally robust optimization (DRO) using Wasserstein geometry. The central objective is to construct and sample worst-case distributions that exhibit maximal risk with respect to a decision model, surpassing traditional discrete DRO approaches both in expressiveness and computational tractability.
The minimax formulation commences from the penalized DRO objective:
$$\min_{\theta \in \mathbb{R}^p}\ \max_{Q \in \mathcal{P}_2}\ \mathbb{E}_{v \sim Q}\big[\ell(\theta, v)\big] \;-\; \frac{1}{2\gamma}\, W_2^2(Q, P),$$
where P is the reference distribution, θ parametrizes the decision model, ℓ is a loss function, and γ controls the radius of the ambiguity set. By Brenier's theorem, the inner maximization over Q can be equivalently reformulated via a transport map T with Q = T#P, reducing adversarial learning to an optimization in function space.
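
As a concrete illustration, the penalized objective under the map reparametrization can be estimated from a minibatch drawn from P. The following is a minimal Monte-Carlo sketch in PyTorch; the names `theta_loss`, `T`, and `gamma` are illustrative, and the squared displacement E‖T(x)−x‖² stands in for W₂²(T#P, P) under the Brenier reformulation.

```python
import torch

def penalized_objective(theta_loss, T, x, gamma):
    """Monte-Carlo estimate of L(theta, T) on a minibatch x ~ P of shape (m, d).

    theta_loss: callable v -> per-sample loss l(theta, v), with theta captured inside.
    T:          callable x -> transported sample T(x), same shape as x.
    gamma:      penalty parameter controlling the effective ambiguity radius.
    """
    v = T(x)                                            # adversarial samples, Q = T#P
    risk = theta_loss(v).mean()                         # estimate of E_{v~Q}[ l(theta, v) ]
    transport_cost = ((v - x) ** 2).sum(dim=1).mean()   # estimate of E_{x~P} || T(x) - x ||^2
    return risk - transport_cost / (2.0 * gamma)        # penalized DRO objective
```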
Algorithmic Framework: Gradient Descent-Ascent and Neural Transport
The core algorithmic contribution is a single-loop gradient descent-ascent (GDA) scheme that jointly and efficiently updates the model parameters θ and the transport map T. The gradients with respect to (θ, T) are derived as:
- $\partial_\theta L(\theta, T) = \mathbb{E}_{x \sim P}\big[\partial_\theta \ell(\theta, T(x))\big]$
- $\partial_T L(\theta, T)(x) = \partial_v \ell(\theta, T(x)) - \tfrac{1}{\gamma}\big(T(x) - x\big)$
The GDA recursion is given by:
$$\theta_{k+1} = \theta_k - \tau\, \partial_\theta L(\theta_k, T_k), \qquad T_{k+1}(x) = T_k(x) + \eta\, \partial_T L(\theta_k, T_k)(x),$$
where (τ, η) are step sizes, and the T-update can be carried out either in vector (particle) form or directly in function space.
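
A minimal single-loop GDA step in particle form (one particle per reference sample, so T(x_i) is stored explicitly) might look as follows. This is a sketch under the assumption that θ is a flat parameter tensor and `loss_fn` returns per-sample losses, not the paper's reference implementation.

```python
import torch

def gda_particle_step(theta, particles, x, loss_fn, gamma, tau, eta):
    """One single-loop GDA step; row i of `particles` stores the transported sample T(x_i).

    Assumes theta is a tensor with requires_grad=True and loss_fn(theta, v)
    returns the per-sample losses l(theta, v_i) as a vector of shape (m,).
    """
    particles = particles.detach().requires_grad_(True)
    per_sample = loss_fn(theta, particles)                     # l(theta, T(x_i)) for each i

    # d_theta L = E_x[ d_theta l(theta, T(x)) ]
    g_theta = torch.autograd.grad(per_sample.mean(), theta, retain_graph=True)[0]
    # d_T L(x_i) = d_v l(theta, T(x_i)) - (1/gamma) * (T(x_i) - x_i)
    g_v = torch.autograd.grad(per_sample.sum(), particles)[0]
    ascent = g_v - (particles - x) / gamma

    with torch.no_grad():
        theta = (theta - tau * g_theta).requires_grad_(True)   # descent on the decision model
        particles = particles + eta * ascent                   # ascent on the worst-case samples
    return theta, particles
```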
Further, a neural-network parametrization Tφ of the transport map is introduced and trained with an L2-matching loss, which provides out-of-sample generalization and scalability to high-dimensional data such as images.
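
One possible realization of this step, assuming the neural map is regressed onto the particles produced by the GDA iterations, is sketched below; the residual architecture, optimizer, and hyperparameters are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class NeuralTransportMap(nn.Module):
    """Residual map T_phi(x) = x + f_phi(x), so the identity (no perturbation) is easy to represent."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.f(x)

def fit_transport_map(T_phi, x, particles, steps=200, lr=1e-3):
    """L2-matching: regress T_phi onto the optimized particles, i.e. minimize E_x || T_phi(x) - T(x) ||^2.

    x and particles are paired minibatches of shape (m, d).
    """
    opt = torch.optim.Adam(T_phi.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((T_phi(x) - particles) ** 2).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return T_phi
```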
Convergence Theory
The paper establishes nonconvex convergence guarantees for the GDA algorithm under Polyak-Łojasiewicz (PL) or strong concavity conditions, without requiring convexity-concavity of the objective. The minimax problem admits two regimes:
- T-fast (inner maximization): under a nonconvex–PL (NC-PL) assumption, GDA finds an ϵ-stationary point in O(ϵ⁻²) iterations for arbitrary γ, with step sizes set according to the smoothness constants and condition number derived from ℓ and γ, ensuring stability even for large perturbations.
- θ-fast (inner minimization): when L(θ,T) is strongly convex (or PL) in θ after averaging over T#P, similar O(ϵ⁻²) convergence rates apply, contingent on a class of "good" transport maps that preserve this structure.
A more general regime with nonconvex-nonconcave structure (NC-NC) is handled using a one-sided proximal point method (PPM), also achieving O(ϵ⁻²) stationary-point complexity under interaction-dominant landscape assumptions.
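
For intuition, a one-sided proximal point update of this kind can be written schematically as below; which side is proximally regularized and how the subproblem is solved are assumptions made here for illustration, not details taken from the paper.

```latex
% Schematic one-sided PPM step (illustrative only): the T-side subproblem is
% regularized by a proximal term with parameter \lambda > 0, while \theta takes
% an ordinary gradient step.
\begin{align*}
  T_{k+1} &\in \arg\max_{T}\; L(\theta_k, T)
            \;-\; \frac{1}{2\lambda}\,\mathbb{E}_{x\sim P}\bigl\|T(x) - T_k(x)\bigr\|^2, \\[2pt]
  \theta_{k+1} &= \theta_k \;-\; \tau\,\partial_\theta L(\theta_k, T_{k+1}).
\end{align*}
```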
Empirical Analysis
Validation is performed on synthetic regression and image classification tasks (e.g., MNIST, CIFAR-10 in the latent VAE space), comparing GDA's particle optimization and neural map learning to classical elimination approaches.
Empirical results demonstrate:
- Steady and monotonic decrease of gradient norms for both θ and T (Figure 1, Figure 2, Figure 3),
- Rapid convergence that matches or surpasses elimination-based (double-loop) methods, especially in terms of the number of gradient evaluations,
- Successful generalization of the neural transport map Tφ to test samples, evidenced by semantically meaningful adversarial deformations interpolated in latent space (Figure 4).
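
To make the latent-space stress-testing concrete, deployment of a trained map can be pictured as in the sketch below; the encoder/decoder pair and function names are hypothetical placeholders for the pretrained VAE, not the paper's code.

```python
import torch

@torch.no_grad()
def worst_case_images(encoder, decoder, T_phi, images):
    """Push held-out images toward the worst-case distribution via the latent-space transport map."""
    z = encoder(images)      # embed test images into the VAE latent space
    z_adv = T_phi(z)         # apply the learned transport map T_phi
    return decoder(z_adv)    # decode back to image space for inspection / stress testing
```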


Figure 1: GDA with momentum on MNIST using different batch sizes yields consistent convergence, with gradient norms GNθ and GNT steadily decreasing across mini-batches for γ=8.0.
Figure 5: Illustrative results of particle optimization and sample trajectories in MNIST latent space, showing effective worst-case sample generation and model stress testing.

Figure 2: GDA without momentum on MNIST displays similar convergence properties, confirming the robustness of the single-loop minimax strategy.

Figure 3: GDA with momentum and varying step sizes η, τ highlights the algorithm’s stability and efficacy for different configurations with batch size m=500.
Figure 4: Interpolation panels document the learned deformations for test samples and their nearest neighbors, underscoring the neural transport map’s generalization capacity beyond the training distribution.
Theoretical and Practical Implications
In contrast to discrete DRO, which is computationally prohibitive and does not generalize beyond training samples, the proposed continuous transport-map-based minimax approach enables expressive, risk-tailored generation of worst-case data points and remains computationally scalable via neural parameterization and stochastic GDA.
The convergence theory provided accommodates large ambiguity radii γ, enabling exploration of worst-case distributions that can induce substantial distributional shifts. This is relevant for robust machine learning, stress-testing in operational domains (autonomous systems, power grids, healthcare), and for designing models with strong risk mitigation capabilities.
The framework’s reliance on Wasserstein geometry and functional optimization opens possibilities for future use of more advanced first-order minimax algorithms (e.g., OGDA, extragradient), extension to multimodal or highly structured distributions, and further analysis of neural approximation error in functional spaces.
Conclusion
This work advances risk-induced generative modeling under distributional uncertainty by reformulating adversarial DRO as functional minimax optimization in Wasserstein space, validated by strong theoretical convergence guarantees and empirical studies on high-dimensional tasks. The neural transport map ensures scalability and generalization, offering a principled route toward simulation-free, robust worst-case sample generation. Future directions include extending minimax theory to richer neural architectures and broader domains, and analyzing approximation effects induced by neural parameterizations.