
Stochastic Optimization with Optimal Importance Sampling (2504.03560v1)

Published 4 Apr 2025 in math.OC, cs.LG, math.ST, stat.ML, and stat.TH

Abstract: Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its power, the performance of IS is often highly sensitive to the choice of the proposal distribution and frequently requires stochastic calibration techniques. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a unique challenge: the decision and the IS distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both the analysis of convergence for decision iterates and the efficiency of the IS scheme. In this paper, we propose an iterative gradient-based algorithm that jointly updates the decision variable and the IS distribution without requiring time-scale separation between the two. Our method achieves the lowest possible asymptotic variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family. Furthermore, we show that these properties are preserved under linear constraints by incorporating a recent variant of Nesterov's dual averaging method.

Summary

  • The paper presents a joint update scheme for decision variables and IS parameters, eliminating nested loops and reducing variance.
  • It leverages a variant of Nesterov's Dual Averaging to efficiently handle linear constraints while updating both optimization and sampling variables.
  • Averaged iterates achieve theoretically optimal asymptotic performance, matching the variance of an ideal importance sampling distribution.

This paper introduces a novel algorithm for solving constrained convex stochastic optimization problems of the form $\min_{\theta\in\Theta} \mathbb{E}_{X \sim \mathbb{P}}\left[ F(\theta, X) \right]$, where $\Theta = \{\theta \in \mathbb{R}^s : A\theta \leq b\}$. The key challenge addressed is the high variance often encountered when using standard stochastic gradient methods, particularly when the expectation involves rare events. Importance Sampling (IS) is leveraged as a variance reduction technique.
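
To see why rare events make plain Monte Carlo inefficient, consider a toy example (ours, not from the paper): estimating $\mathbb{P}(X > 4) \approx 3.2 \times 10^{-5}$ for $X \sim \mathcal{N}(0,1)$. Plain sampling almost never hits the event, while a mean-shifted proposal with likelihood-ratio weights concentrates samples on it; the threshold and proposal below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 4.0          # rare-event threshold: P(X > 4) ~ 3.2e-5 for X ~ N(0, 1)
n = 100_000

# Plain Monte Carlo: almost every sample misses the event.
x = rng.normal(0.0, 1.0, n)
mc = (x > t).astype(float)

# Importance sampling with a mean-shifted proposal N(t, 1).
# Likelihood ratio dP/dP_mu for mean translation: exp(-mu*y + mu^2/2), mu = t.
y = rng.normal(t, 1.0, n)
w = np.exp(-t * y + 0.5 * t**2)
est_is = (y > t) * w

print(f"MC : mean={mc.mean():.2e}, std err={mc.std(ddof=1)/np.sqrt(n):.2e}")
print(f"IS : mean={est_is.mean():.2e}, std err={est_is.std(ddof=1)/np.sqrt(n):.2e}")
```

The IS estimator has the same mean but a standard error several orders of magnitude smaller, which is the effect the paper seeks inside a stochastic optimization loop.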

The core difficulty with applying IS in optimization is the "curse of circularity": the optimal IS distribution depends on the unknown optimal solution $\theta^\star$, while finding $\theta^\star$ efficiently requires a good IS distribution. Existing methods often require nested loops, time-scale separation, or prior knowledge of the mapping from $\theta$ to the optimal IS parameters.

Key Contributions and Method

The paper proposes a single-loop iterative algorithm that jointly updates the decision variable $\theta$ and the parameters $\mu$ of an IS distribution $\mathbb{P}_\mu$ from a predefined family $\mathcal{M} = \{\mu \in \mathbb{R}^m : C\mu \leq d\}$. The algorithm avoids the need for time-scale separation or nested optimization.

  1. Joint Update Scheme: The algorithm uses a variant of Nesterov's Dual Averaging (NDA) applied to the combined state vector $(\theta, \mu)$. The update rule is (a runnable sketch follows this list):

     $(\theta_{n+1}, \mu_{n+1}) = \underset{(\theta, \mu) \in \Theta \times \mathcal{M}}{\arg\min} \left\{ \left\langle \textstyle\sum_{k=0}^{n} \alpha_{k+1} (G_k, H_k),\, (\theta, \mu) \right\rangle + \frac{1}{2} \left\| (\theta - \theta_0,\, \mu - \mu_0) \right\|^2 \right\}$

     $\bar{\theta}_n = \frac{1}{n} \sum_{i=0}^{n-1} \theta_i, \qquad \bar{\mu}_n = \frac{1}{n} \sum_{i=0}^{n-1} \mu_i$

     where $\alpha_{n+1} = \alpha/(n+1)^\gamma$ with $\gamma \in (1/2, 1)$.
  2. Stochastic Gradients:
    • $G_k = G_{\mu_k}(\theta_k, X_{k+1}^{(\mu_k)}) = G(\theta_k, X_{k+1}^{(\mu_k)})\, \ell(X_{k+1}^{(\mu_k)}, \mu_k)$. This is the IS gradient for the primary objective $f(\theta)$. It uses a sample $X_{k+1}^{(\mu_k)}$ drawn from the current IS distribution $\mathbb{P}_{\mu_k}$. Here $G(\theta, x) = \nabla_\theta F(\theta, x)$ and $\ell(x, \mu) = d\mathbb{P}/d\mathbb{P}_\mu(x)$ is the likelihood ratio.
    • $H_k = H(\theta_k, \mu_k, X_{k+1}) = \|\mathrm{P}_{A_a^{\theta_k}} G(\theta_k, X_{k+1})\|^2\, \nabla_\mu \ell(X_{k+1}, \mu_k)$. This is the stochastic gradient for the IS parameter update. It aims to minimize the variance $v(\theta, \mu) = \mathbb{E}_{X \sim \mathbb{P}}\left[\|\mathrm{P}_{A_a^\theta} G(\theta, X)\|^2\, \ell(X, \mu)\right]$, specifically evaluated at $\theta^\star$. It uses a sample $X_{k+1}$ drawn from the original distribution $\mathbb{P}$. $\mathrm{P}_{A_a^\theta}$ is the projector onto the null space of the active constraints at $\theta$.
  3. Theoretical Guarantees:
    • Global Convergence: Under convexity of $f(\theta)$ and log-convexity of $\ell(x, \mu)$ with respect to $\mu$, along with other regularity conditions, the iterates $(\theta_n, \mu_n)$ converge almost surely to $(\theta^\star, \mu^\star)$, where $\theta^\star$ minimizes $f(\theta)$ and $\mu^\star$ minimizes the asymptotic variance $v(\theta^\star, \mu)$.
    • Asymptotic Optimality: The averaged iterates $\bar{\theta}_n$ achieve the minimal possible asymptotic variance among all methods using the given IS family $\{\mathbb{P}_\mu\}$. The central limit theorem (CLT) holds:

      $\sqrt{n}\,(\bar{\theta}_n - \theta^\star) \overset{d}{\to} \mathcal{N}(0, \Sigma_G^\star)$

      where $\Sigma_G^\star = \mathrm{Q}^{\dagger}\, \mathrm{Var}_{X^{(\mu^\star)} \sim \mathbb{P}_{\mu^\star}}\!\left[G_{\mu^\star}(\theta^\star, X^{(\mu^\star)})\right] \mathrm{Q}^{\dagger}$ and $\mathrm{Q} = \mathrm{P}_{A_a^\star} \nabla^2 f(\theta^\star)\, \mathrm{P}_{A_a^\star}$. This matches the variance achievable if the optimal IS distribution $\mathbb{P}_{\mu^\star}$ were known beforehand.

    • Constraint Handling: The use of NDA ensures the method correctly handles linear constraints on both θ\theta and μ\mu, and identifies the active constraints in finite time almost surely.
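
To make the scheme concrete, here is a minimal single-loop sketch under toy assumptions of ours (not the paper's experiments): a scalar quantile-type objective, a Gaussian mean-translation IS family, box constraints (so the NDA subproblem reduces to a clipped step from the anchor point), and an identity active-set projector since the iterates stay in the interior. All names and constants are illustrative.

```python
import numpy as np

# Toy objective: f(theta) = lam*theta + E[(X - theta)^+], X ~ N(0, 1).
# f'(theta) = lam - P(X > theta), so theta* is the (1 - lam)-quantile of X:
# with lam = 0.01, theta* = z_{0.99} ~ 2.326 (a rare-event regime).
rng = np.random.default_rng(0)
lam = 0.01

def G(theta, x):                 # subgradient of F(theta, x) in theta
    return lam - float(x > theta)

def lik_ratio(x, mu):            # ell(x, mu) = dP/dP_mu(x) for P_mu = N(mu, 1)
    return np.exp(-mu * x + 0.5 * mu**2)

def grad_lik_ratio(x, mu):       # d ell / d mu (log-convex family)
    return (mu - x) * lik_ratio(x, mu)

lo, hi = -10.0, 10.0             # Theta = M = [-10, 10] (box polytope)
alpha, gamma = 0.5, 0.75         # alpha_{n+1} = alpha / (n+1)^gamma
theta0, mu0 = 0.0, 0.0           # NDA anchor point
theta, mu = theta0, mu0
s_theta, s_mu = 0.0, 0.0         # weighted running gradient sums
theta_bar = 0.0

for n in range(100_000):
    a = alpha / (n + 1) ** gamma
    # IS gradient for f: sample from the current proposal P_mu, reweight.
    x_is = rng.normal(mu, 1.0)
    Gk = G(theta, x_is) * lik_ratio(x_is, mu)
    # Gradient for the IS parameter: sample from the original P.
    x_p = rng.normal(0.0, 1.0)
    Hk = G(theta, x_p) ** 2 * grad_lik_ratio(x_p, mu)
    s_theta += a * Gk
    s_mu += a * Hk
    # NDA subproblem over a box: argmin <s, z> + 0.5*(z - z0)^2 = clip(z0 - s).
    theta = float(np.clip(theta0 - s_theta, lo, hi))
    mu = float(np.clip(mu0 - s_mu, lo, hi))
    theta_bar += (theta - theta_bar) / (n + 1)   # running Polyak average

print(f"theta_bar = {theta_bar:.3f} (target ~2.326), mu = {mu:.3f}")
```

Note how the two samples mirror the gradient definitions above: $G_k$ reweights a proposal-drawn sample by the likelihood ratio, while $H_k$ uses a sample from the original distribution. With these step sizes, $\bar{\theta}_n$ should drift toward the 0.99-quantile while $\mu$ shifts toward the tail, lowering the variance of subsequent $G_k$'s.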

Practical Implementation Details

  • Applicable IS Families: The method works for common IS families where $\ell(x, \mu)$ is log-convex and differentiable in $\mu$, such as:
    • Exponential Tilting (ET)
    • Mean Translation (MT) for log-concave base distributions
    • Mixture Models
  • Computational Requirements:
    • Ability to sample from the original distribution $\mathbb{P}$.
    • Ability to sample from the IS distribution $\mathbb{P}_\mu$ for any feasible $\mu$.
    • Ability to compute the gradient $G(\theta, x) = \nabla_\theta F(\theta, x)$.
    • Ability to compute the likelihood ratio $\ell(x, \mu)$ and its gradient $\nabla_\mu \ell(x, \mu)$.
    • Ability to compute the active constraint projector $\mathrm{P}_{A_a^\theta}$ (though the analysis relies on $\mathrm{P}_{A_a^\star}$, the algorithm uses the current iterate's active set).
    • Solving the NDA subproblem at each iteration, which involves minimizing a quadratic function over the feasible set $\Theta \times \mathcal{M}$. This is often a projection-like operation (see the sketch after this list).
  • Assumptions for Implementation: The underlying objective $f(\theta)$ must be convex. The chosen IS family must satisfy the log-convexity and differentiability assumption on $\ell(x, \mu)$. The feasible sets $\Theta$ and $\mathcal{M}$ must be convex, closed, and bounded polytopes (defined by linear inequalities).
  • Secondary IS: The paper notes that the variance of the $H_k$ gradient itself could be reduced using another layer of IS (secondary IS), suggesting potential strategies but not exploring them in depth.
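
For general polytopes, completing the square shows the NDA subproblem is the Euclidean projection of $(\theta_0, \mu_0) - s_n$ onto $\Theta \times \mathcal{M}$, where $s_n$ is the weighted gradient sum. Below is a minimal sketch using SciPy's SLSQP solver; the solver choice and the helper name `nda_step` are ours, not the paper's (any QP solver works).

```python
import numpy as np
from scipy.optimize import minimize

def nda_step(z0, s, A, b):
    """Solve  min_z  <s, z> + 0.5 * ||z - z0||^2   s.t.  A z <= b.

    Completing the square: this is the projection of (z0 - s) onto
    the polytope {z : A z <= b}.
    """
    target = z0 - s
    res = minimize(
        fun=lambda z: 0.5 * np.sum((z - target) ** 2),
        x0=z0,                                  # anchor assumed feasible
        jac=lambda z: z - target,
        constraints=[{"type": "ineq",           # SLSQP wants g(z) >= 0
                      "fun": lambda z: b - A @ z,
                      "jac": lambda z: -A}],
        method="SLSQP",
    )
    return res.x

# Example: project onto the box [-1, 1]^2 written as a polytope.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
print(nda_step(z0=np.zeros(2), s=np.array([3.0, -0.2]), A=A, b=b))
# -> approximately [-1.0, 0.2]
```

Since the quadratic term is the fixed anchor regularizer $\frac{1}{2}\|z - z_0\|^2$ rather than a shrinking-step proximal term, the per-iteration cost is a single small QP whose structure never changes across iterations.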

Summary of Benefits

  • Provides a principled way to adapt the IS distribution during optimization without needing prior knowledge of the optimal IS parameters.
  • Achieves theoretically optimal asymptotic performance within the chosen IS family.
  • Unified single-loop approach simplifies implementation compared to multi-level or alternating methods.
  • Naturally handles linear constraints on both decision and IS parameters.

This work offers a theoretically sound framework for integrating adaptive importance sampling into constrained stochastic optimization, potentially leading to significant efficiency gains in problems with high variance or rare events.
