
Stochastic Optimization with Optimal Importance Sampling (2504.03560v1)

Published 4 Apr 2025 in math.OC, cs.LG, math.ST, stat.ML, and stat.TH

Abstract: Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its power, the performance of IS is often highly sensitive to the choice of the proposal distribution and frequently requires stochastic calibration techniques. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a unique challenge: the decision and the IS distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both the analysis of convergence for decision iterates and the efficiency of the IS scheme. In this paper, we propose an iterative gradient-based algorithm that jointly updates the decision variable and the IS distribution without requiring time-scale separation between the two. Our method achieves the lowest possible asymptotic variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family. Furthermore, we show that these properties are preserved under linear constraints by incorporating a recent variant of Nesterov's dual averaging method.

Summary

  • The paper presents a joint update scheme for decision variables and IS parameters, eliminating nested loops and reducing variance.
  • It leverages a variant of Nesterov's Dual Averaging to efficiently handle linear constraints while updating both optimization and sampling variables.
  • Averaged iterates achieve theoretically optimal asymptotic performance, matching the variance of an ideal importance sampling distribution.

This paper introduces a novel algorithm for solving constrained convex stochastic optimization problems of the form $\min_{\theta\in\Theta} \mathbb{E}_{X \sim \mathbb{P}}\left[ F(\theta, X) \right]$, where $\Theta = \{\theta \in \mathbb{R}^s : A\theta \leq b\}$. The key challenge addressed is the high variance often encountered when using standard stochastic gradient methods, particularly when the expectation involves rare events. Importance Sampling (IS) is leveraged as a variance reduction technique.
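
To see why rare events make plain Monte Carlo inefficient, consider a toy example (ours, not from the paper): estimating $\mathbb{P}(X > 4) \approx 3.2 \times 10^{-5}$ for $X \sim \mathcal{N}(0,1)$. Plain sampling almost never hits the event, while a mean-shifted proposal with likelihood-ratio weights concentrates samples on it; the threshold and proposal below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 4.0          # rare-event threshold: P(X > 4) ~ 3.2e-5 for X ~ N(0, 1)
n = 100_000

# Plain Monte Carlo: almost every sample misses the event.
x = rng.normal(0.0, 1.0, n)
mc = (x > t).astype(float)

# Importance sampling with a mean-shifted proposal N(t, 1).
# Likelihood ratio dP/dP_mu for mean translation: exp(-mu*y + mu^2/2), mu = t.
y = rng.normal(t, 1.0, n)
w = np.exp(-t * y + 0.5 * t**2)
est_is = (y > t) * w

print(f"MC : mean={mc.mean():.2e}, std err={mc.std(ddof=1)/np.sqrt(n):.2e}")
print(f"IS : mean={est_is.mean():.2e}, std err={est_is.std(ddof=1)/np.sqrt(n):.2e}")
```

The IS estimator has the same mean but a standard error several orders of magnitude smaller, which is the effect the paper seeks inside a stochastic optimization loop.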

The core difficulty with applying IS in optimization is the "curse of circularity": the optimal IS distribution depends on the unknown optimal solution $\theta^\star$, while finding $\theta^\star$ efficiently requires a good IS distribution. Existing methods often require nested loops, time-scale separation, or prior knowledge of the mapping from $\theta$ to the optimal IS parameters.

Key Contributions and Method

The paper proposes a single-loop iterative algorithm that jointly updates the decision variable $\theta$ and the parameters $\mu$ of an IS distribution $\mathbb{P}_\mu$ from a predefined family $\mathcal{M} = \{\mu \in \mathbb{R}^m : C\mu \leq d\}$. The algorithm avoids the need for time-scale separation or nested optimization.

  1. Joint Update Scheme: The algorithm uses a variant of Nesterov's Dual Averaging (NDA) applied to the combined state vector $(\theta, \mu)$. The update rule is (a runnable sketch follows this list):

     $(\theta_{n+1}, \mu_{n+1}) = \underset{(\theta, \mu) \in \Theta \times \mathcal{M}}{\arg\min} \left\{ \left\langle \textstyle\sum_{k=0}^{n} \alpha_{k+1} (G_k, H_k),\, (\theta, \mu) \right\rangle + \frac{1}{2} \left\| (\theta - \theta_0,\, \mu - \mu_0) \right\|^2 \right\}$

     $\bar{\theta}_n = \frac{1}{n} \sum_{i=0}^{n-1} \theta_i, \qquad \bar{\mu}_n = \frac{1}{n} \sum_{i=0}^{n-1} \mu_i$

     where $\alpha_{n+1} = \alpha/(n+1)^\gamma$ with $\gamma \in (1/2, 1)$.
  2. Stochastic Gradients:
    • $G_k = G_{\mu_k}(\theta_k, X_{k+1}^{(\mu_k)}) = G(\theta_k, X_{k+1}^{(\mu_k)})\, \ell(X_{k+1}^{(\mu_k)}, \mu_k)$. This is the IS gradient for the primary objective $f(\theta)$. It uses a sample $X_{k+1}^{(\mu_k)}$ drawn from the current IS distribution $\mathbb{P}_{\mu_k}$. Here $G(\theta, x) = \nabla_\theta F(\theta, x)$ and $\ell(x, \mu) = d\mathbb{P}/d\mathbb{P}_\mu(x)$ is the likelihood ratio.
    • $H_k = H(\theta_k, \mu_k, X_{k+1}) = \|\mathrm{P}_{A_a^{\theta_k}} G(\theta_k, X_{k+1})\|^2\, \nabla_\mu \ell(X_{k+1}, \mu_k)$. This is the stochastic gradient for the IS parameter update. It aims to minimize the variance $v(\theta, \mu) = \mathbb{E}_{X \sim \mathbb{P}}\left[\|\mathrm{P}_{A_a^\theta} G(\theta, X)\|^2\, \ell(X, \mu)\right]$, specifically evaluated at $\theta^\star$. It uses a sample $X_{k+1}$ drawn from the original distribution $\mathbb{P}$. $\mathrm{P}_{A_a^\theta}$ is the projector onto the null space of the active constraints at $\theta$.
  3. Theoretical Guarantees:
    • Global Convergence: Under convexity of $f(\theta)$ and log-convexity of $\ell(x, \mu)$ with respect to $\mu$, along with other regularity conditions, the iterates $(\theta_n, \mu_n)$ converge almost surely to $(\theta^\star, \mu^\star)$, where $\theta^\star$ minimizes $f(\theta)$ and $\mu^\star$ minimizes the asymptotic variance $v(\theta^\star, \mu)$.
    • Asymptotic Optimality: The averaged iterates $\bar{\theta}_n$ achieve the minimal possible asymptotic variance among all methods using the given IS family $\{\mathbb{P}_\mu\}$. The central limit theorem (CLT) holds:

      $\sqrt{n}\,(\bar{\theta}_n - \theta^\star) \overset{d}{\to} \mathcal{N}(0, \Sigma_G^\star)$

      where $\Sigma_G^\star = \mathrm{Q}^{\dagger}\, \mathrm{Var}_{X^{(\mu^\star)} \sim \mathbb{P}_{\mu^\star}}\!\left[G_{\mu^\star}(\theta^\star, X^{(\mu^\star)})\right] \mathrm{Q}^{\dagger}$ and $\mathrm{Q} = \mathrm{P}_{A_a^\star} \nabla^2 f(\theta^\star)\, \mathrm{P}_{A_a^\star}$. This matches the variance achievable if the optimal IS distribution $\mathbb{P}_{\mu^\star}$ were known beforehand.

    • Constraint Handling: The use of NDA ensures the method correctly handles linear constraints on both θ\theta and μ\mu, and identifies the active constraints in finite time almost surely.
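
To make the scheme concrete, here is a minimal single-loop sketch under toy assumptions of ours (not the paper's experiments): a scalar quantile-type objective, a Gaussian mean-translation IS family, box constraints (so the NDA subproblem reduces to a clipped step from the anchor point), and an identity active-set projector since the iterates stay in the interior. All names and constants are illustrative.

```python
import numpy as np

# Toy objective: f(theta) = lam*theta + E[(X - theta)^+], X ~ N(0, 1).
# f'(theta) = lam - P(X > theta), so theta* is the (1 - lam)-quantile of X:
# with lam = 0.01, theta* = z_{0.99} ~ 2.326 (a rare-event regime).
rng = np.random.default_rng(0)
lam = 0.01

def G(theta, x):                 # subgradient of F(theta, x) in theta
    return lam - float(x > theta)

def lik_ratio(x, mu):            # ell(x, mu) = dP/dP_mu(x) for P_mu = N(mu, 1)
    return np.exp(-mu * x + 0.5 * mu**2)

def grad_lik_ratio(x, mu):       # d ell / d mu (log-convex family)
    return (mu - x) * lik_ratio(x, mu)

lo, hi = -10.0, 10.0             # Theta = M = [-10, 10] (box polytope)
alpha, gamma = 0.5, 0.75         # alpha_{n+1} = alpha / (n+1)^gamma
theta0, mu0 = 0.0, 0.0           # NDA anchor point
theta, mu = theta0, mu0
s_theta, s_mu = 0.0, 0.0         # weighted running gradient sums
theta_bar = 0.0

for n in range(100_000):
    a = alpha / (n + 1) ** gamma
    # IS gradient for f: sample from the current proposal P_mu, reweight.
    x_is = rng.normal(mu, 1.0)
    Gk = G(theta, x_is) * lik_ratio(x_is, mu)
    # Gradient for the IS parameter: sample from the original P.
    x_p = rng.normal(0.0, 1.0)
    Hk = G(theta, x_p) ** 2 * grad_lik_ratio(x_p, mu)
    s_theta += a * Gk
    s_mu += a * Hk
    # NDA subproblem over a box: argmin <s, z> + 0.5*(z - z0)^2 = clip(z0 - s).
    theta = float(np.clip(theta0 - s_theta, lo, hi))
    mu = float(np.clip(mu0 - s_mu, lo, hi))
    theta_bar += (theta - theta_bar) / (n + 1)   # running Polyak average

print(f"theta_bar = {theta_bar:.3f} (target ~2.326), mu = {mu:.3f}")
```

Note how the two samples mirror the gradient definitions above: $G_k$ reweights a proposal-drawn sample by the likelihood ratio, while $H_k$ uses a sample from the original distribution. With these step sizes, $\bar{\theta}_n$ should drift toward the 0.99-quantile while $\mu$ shifts toward the tail, lowering the variance of subsequent $G_k$'s.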

Practical Implementation Details

  • Applicable IS Families: The method works for common IS families where $\ell(x, \mu)$ is log-convex and differentiable in $\mu$, such as:
    • Exponential Tilting (ET)
    • Mean Translation (MT) for log-concave base distributions
    • Mixture Models
  • Computational Requirements:
    • Ability to sample from the original distribution $\mathbb{P}$.
    • Ability to sample from the IS distribution $\mathbb{P}_\mu$ for any feasible $\mu$.
    • Ability to compute the gradient $G(\theta, x) = \nabla_\theta F(\theta, x)$.
    • Ability to compute the likelihood ratio $\ell(x, \mu)$ and its gradient $\nabla_\mu \ell(x, \mu)$.
    • Ability to compute the active constraint projector $\mathrm{P}_{A_a^\theta}$ (though the analysis relies on $\mathrm{P}_{A_a^\star}$, the algorithm uses the current iterate's active set).
    • Solving the NDA subproblem at each iteration, which involves minimizing a quadratic function over the feasible set $\Theta \times \mathcal{M}$. This is often a projection-like operation (see the sketch after this list).
  • Assumptions for Implementation: The underlying objective $f(\theta)$ must be convex. The chosen IS family must satisfy the log-convexity and differentiability assumption on $\ell(x, \mu)$. The feasible sets $\Theta$ and $\mathcal{M}$ must be convex, closed, and bounded polytopes (defined by linear inequalities).
  • Secondary IS: The paper notes that the variance of the $H_k$ gradient itself could be reduced using another layer of IS (secondary IS), suggesting potential strategies but not exploring them in depth.
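
For general polytopes, completing the square shows the NDA subproblem is the Euclidean projection of $(\theta_0, \mu_0) - s_n$ onto $\Theta \times \mathcal{M}$, where $s_n$ is the weighted gradient sum. Below is a minimal sketch using SciPy's SLSQP solver; the solver choice and the helper name `nda_step` are ours, not the paper's (any QP solver works).

```python
import numpy as np
from scipy.optimize import minimize

def nda_step(z0, s, A, b):
    """Solve  min_z  <s, z> + 0.5 * ||z - z0||^2   s.t.  A z <= b.

    Completing the square: this is the projection of (z0 - s) onto
    the polytope {z : A z <= b}.
    """
    target = z0 - s
    res = minimize(
        fun=lambda z: 0.5 * np.sum((z - target) ** 2),
        x0=z0,                                  # anchor assumed feasible
        jac=lambda z: z - target,
        constraints=[{"type": "ineq",           # SLSQP wants g(z) >= 0
                      "fun": lambda z: b - A @ z,
                      "jac": lambda z: -A}],
        method="SLSQP",
    )
    return res.x

# Example: project onto the box [-1, 1]^2 written as a polytope.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
print(nda_step(z0=np.zeros(2), s=np.array([3.0, -0.2]), A=A, b=b))
# -> approximately [-1.0, 0.2]
```

Since the quadratic term is the fixed anchor regularizer $\frac{1}{2}\|z - z_0\|^2$ rather than a shrinking-step proximal term, the per-iteration cost is a single small QP whose structure never changes across iterations.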

Summary of Benefits

  • Provides a principled way to adapt the IS distribution during optimization without needing prior knowledge of the optimal IS parameters.
  • Achieves theoretically optimal asymptotic performance within the chosen IS family.
  • Unified single-loop approach simplifies implementation compared to multi-level or alternating methods.
  • Naturally handles linear constraints on both decision and IS parameters.

This work offers a theoretically sound framework for integrating adaptive importance sampling into constrained stochastic optimization, potentially leading to significant efficiency gains in problems with high variance or rare events.
