SAGE: Adaptive Set-Based Gradient Estimator

Updated 29 August 2025
  • SAGE is a set-based adaptive gradient estimator that uses Taylor series analysis and convex constraints to quantify uncertainty in gradient estimation.
  • It employs an adaptive sampling strategy by measuring the polytope diameter to select the most informative sample for reducing estimation error.
  • The method demonstrates robust performance and sample efficiency in noisy, expensive function evaluations, outperforming traditional finite-difference techniques.

The Set-based Adaptive Gradient Estimator (SAGE) is an approach to gradient estimation for black-box scalar functions, which formulates and refines a set of admissible gradients using sample-based constraints derived from Taylor series analysis. SAGE provides both a principled framework for quantifying the uncertainty in the gradient estimate and an adaptive sampling mechanism for efficiently reducing this uncertainty, offering robustness particularly in the presence of bounded noise and sample efficiency for expensive function evaluations (Jr. et al., 26 Aug 2025).

1. Theoretical Underpinnings: Taylor Series and Set-Membership Formulation

At the core of SAGE is a rigorous use of the multivariate Taylor series. For a function $f: \mathbb{R}^D \rightarrow \mathbb{R}$ possessing a Lipschitz continuous Hessian, the difference in function values between two sample points $(x_i, z_i)$ and $(x_j, z_j)$ is expanded to relate the empirical slope to the true gradient and the Hessian properties. The two-point directional slope,

\tilde{g}_{ij} = \frac{z_j - z_i}{\mu_{ij}},

approximates the projection of the true gradient $g(x_i)$ along the unit vector $u_{ij} = (x_j - x_i)/\mu_{ij}$, where $\mu_{ij} = \|x_j - x_i\|$ is the distance between the samples. Lemma 1 in the cited work establishes the critical bound:

|\tilde{g}_{ij} - g(x_i)^\top u_{ij}| \leq \frac{1}{2} H_i \mu_{ij} + \frac{1}{6}\gamma_H \mu_{ij}^2,

where $H_i$ is the spectral norm of the Hessian at $x_i$ and $\gamma_H$ is the Hessian's Lipschitz constant. This inequality defines a “slab” of admissible projections for each sample pair.
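The following minimal sketch illustrates the Lemma 1 inequality numerically on a synthetic quadratic test function (the function, sample points, and constants are illustrative choices, not taken from the paper; for a quadratic the Hessian is constant, so $\gamma_H = 0$):

```python
import numpy as np

# Illustrative quadratic test function f(x) = 0.5 x^T A x + c^T x with known gradient.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x + c @ x
grad = lambda x: A @ x + c

x_i = np.array([0.2, -0.4])
x_j = np.array([0.7, 0.1])

mu_ij = np.linalg.norm(x_j - x_i)      # sample spacing
u_ij = (x_j - x_i) / mu_ij             # unit direction between the samples
g_tilde = (f(x_j) - f(x_i)) / mu_ij    # two-point directional slope

H_i = np.linalg.norm(A, 2)             # spectral norm of the (constant) Hessian
gamma_H = 0.0                          # quadratic => Hessian Lipschitz constant is 0

lhs = abs(g_tilde - grad(x_i) @ u_ij)
rhs = 0.5 * H_i * mu_ij + (1.0 / 6.0) * gamma_H * mu_ij**2
print(f"|slope - projection| = {lhs:.4f} <= slab half-width = {rhs:.4f}")
```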

By aggregating these pairwise bounds over a set of samples $X_n$, every sampled direction constrains the true gradient. Theorem 1 formalizes that $g(x_i)$ must lie inside a convex polytope in $\mathbb{R}^D$ defined by the intersection of all such slabs:

\mathcal{G}^{(i)} = \left\{ g \in \mathbb{R}^D : \begin{bmatrix} -u_{ij}^\top \\ u_{ij}^\top \end{bmatrix} g \le \begin{bmatrix} -\tilde{g}_{ij} \\ \tilde{g}_{ij} \end{bmatrix} + \frac{1}{2}\begin{bmatrix} \mu_{ij} \\ \mu_{ij} \end{bmatrix} H_i + \frac{1}{6}\begin{bmatrix} \mu_{ij}^2 \\ \mu_{ij}^2 \end{bmatrix}\gamma_H, \quad \forall x_j \in X_n \setminus \{x_i\} \right\}.

This set-based perspective contrasts sharply with classic gradient estimators, which typically yield only a point estimate.
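A sketch of how the slab inequalities can be stacked into a linear system $A g \le b$ describing $\mathcal{G}^{(i)}$ is shown below. Function and variable names are illustrative, and $H_i$ and $\gamma_H$ are treated here as known constants (the paper's LP can also treat them as variables):

```python
import numpy as np

def build_gradient_polytope(x_i, z_i, X, Z, H_i, gamma_H):
    """Stack the pairwise slab inequalities into A g <= b.

    Each sample pair (x_i, x_j) contributes two half-spaces:
        u_ij^T g <=  g_tilde_ij + s_ij
       -u_ij^T g <= -g_tilde_ij + s_ij
    with slab half-width s_ij = 0.5*H_i*mu_ij + (1/6)*gamma_H*mu_ij**2.
    """
    rows, rhs = [], []
    for x_j, z_j in zip(X, Z):
        diff = x_j - x_i
        mu = np.linalg.norm(diff)
        if mu == 0.0:
            continue                       # skip the query point itself
        u = diff / mu
        g_tilde = (z_j - z_i) / mu
        s = 0.5 * H_i * mu + (1.0 / 6.0) * gamma_H * mu**2
        rows += [u, -u]
        rhs += [g_tilde + s, -g_tilde + s]
    return np.array(rows), np.array(rhs)   # A has shape (2m, D), b has shape (2m,)
```

Every admissible gradient $g$ then satisfies $A g \le b$; the feasible set is exactly the intersection of slabs described by Theorem 1.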

2. Adaptive Sampling and Refinement of the Gradient Set

The polytope $\mathcal{G}^{(i)}$ initially constructed from the available samples may exhibit large uncertainty, quantified by the polytope diameter $\rho(\mathcal{G}^{(i)})$. SAGE addresses this with an adaptive sampling strategy:

  • The diameter $\rho(\mathcal{G}^{(i)})$ is computed (the maximum distance between any two vertices of the polytope).
  • If $\rho(\mathcal{G}^{(i)})$ exceeds a user-specified precision $\rho^*$ (or the theoretical limit under bounded noise), the direction $d$ of maximum uncertainty (the line connecting the two most distant polytope vertices) is identified.
  • A new sample is taken along $d$ at distance $\alpha$ from $x_i$, where $\alpha$ is chosen optimally in the noisy case by solving a cubic equation rooted in the Taylor bound coefficients (a hedged sketch of this trade-off follows the list). This ensures new samples are maximally informative.
  • The new sample is incorporated, further constraining $\mathcal{G}^{(i)}$, and the process continues until the desired precision is attained.
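One illustrative way to see why a cubic arises, sketched under the assumption that each evaluation carries additive noise bounded by $\bar{\epsilon}$, so the two-point slope incurs an extra error of at most $2\bar{\epsilon}/\alpha$ (this reconstruction is not taken from the paper, and the exact constants may differ): the total slab half-width as a function of the step length is

w(\alpha) = \frac{1}{2} H_i \alpha + \frac{1}{6}\gamma_H \alpha^2 + \frac{2\bar{\epsilon}}{\alpha}.

Setting $w'(\alpha) = 0$ and multiplying through by $\alpha^2$ gives the cubic

\frac{1}{3}\gamma_H \alpha^3 + \frac{1}{2} H_i \alpha^2 - 2\bar{\epsilon} = 0,

whose positive root balances the Taylor truncation terms against noise amplification.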

This mechanism circumvents conventional heuristics for step size selection (e.g., as in finite differences) and produces near-optimal sample placement with respect to the attainable gradient precision.
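A minimal sketch of how the diameter and the direction of maximum uncertainty could be approximated in practice is given below. It probes random unit directions and measures the polytope's width along each via two linear programs; this is an illustrative approximation (a lower bound on the true vertex-to-vertex diameter), not the paper's exact procedure, and it assumes `scipy` is available:

```python
import numpy as np
from scipy.optimize import linprog

def directional_width(A, b, d):
    """Width of the polytope {g : A g <= b} along unit direction d,
    computed as (max d^T g) - (min d^T g) via two LPs."""
    free = [(None, None)] * A.shape[1]     # gradient components are unbounded a priori
    hi = linprog(-d, A_ub=A, b_ub=b, bounds=free, method="highs")
    lo = linprog(d, A_ub=A, b_ub=b, bounds=free, method="highs")
    if not (hi.success and lo.success):
        return np.inf                      # unbounded or infeasible set
    return (-hi.fun) - lo.fun

def widest_direction(A, b, n_probes=64, seed=None):
    """Approximate the diameter and its direction by probing random unit vectors."""
    rng = np.random.default_rng(seed)
    D = A.shape[1]
    dirs = rng.normal(size=(n_probes, D))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    widths = np.array([directional_width(A, b, d) for d in dirs])
    k = int(np.argmax(widths))
    return widths[k], dirs[k]              # (approx. diameter, direction of max uncertainty)
```

If the returned width exceeds $\rho^*$, a new sample is placed along the returned direction at the chosen step length $\alpha$.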

3. Algorithmic Implementation

SAGE proceeds iteratively as follows:

  1. Initialization: Begin with an initial (possibly minimal) data set $X_n$, including the target point $x_k$ and any prior samples.
  2. Sample Selection: Filter for samples at distances close to the theoretically optimal radius, to keep the constraint set tractable.
  3. Linear Program Formulation: Define variables $(g^{(i)}, H_i, \gamma_H)$ and, if noise is present, a noise bound $\bar{\epsilon}$. The constraints are the linear inequalities described above.
  4. LP Solution: Solve the LP to find the central/optimal point in $\mathcal{G}^{(i)}$.
  5. Uncertainty Check: Measure the diameter of the solution set. If the gradient estimate meets the prescribed precision, terminate.
  6. Adaptive Sampling: Otherwise, compute the uncertainty-maximizing direction $d$, select the optimal $\alpha$, and sample anew at $x_j = x_k + \alpha \hat{d}$.
  7. Repeat: Update the sample set and repeat until the stopping criterion is met.

This approach leverages convex optimization techniques (for the LP) and only adds samples if and when necessary for refinement.
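The sketch below illustrates one way Steps 3–5 could be realized: it picks a central point of the gradient set by solving a Chebyshev-center LP (largest inscribed ball). The centering criterion and function names are illustrative assumptions, and unlike the paper's LP this sketch treats $H_i$, $\gamma_H$, and $\bar{\epsilon}$ as fixed rather than as decision variables:

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_center(A, b):
    """Center of the largest ball inscribed in {g : A g <= b}.

    Decision variables are (g, r); we maximize r subject to
        a_k^T g + ||a_k|| r <= b_k   for every constraint row a_k.
    The returned g is one reasonable point estimate inside the gradient set.
    """
    m, D = A.shape
    norms = np.linalg.norm(A, axis=1)
    A_lp = np.hstack([A, norms[:, None]])        # append the radius column
    c = np.zeros(D + 1)
    c[-1] = -1.0                                 # maximize r == minimize -r
    bounds = [(None, None)] * D + [(0.0, None)]  # g free, r >= 0
    res = linprog(c, A_ub=A_lp, b_ub=b, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError("gradient set is empty or unbounded")
    g_hat, r = res.x[:D], res.x[-1]
    return g_hat, r
```

The returned `g_hat` plays the role of the gradient estimate in Step 4, and the diameter check of Step 5 can reuse a directional-width routine such as the one sketched in the previous section.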

4. Performance Characteristics and Comparative Evaluation

Comprehensive statistical testing demonstrates SAGE’s capabilities:

  • In noiseless settings, standard finite-difference methods (forward, central) can exhibit strong performance; however, as noise increases, their utility sharply decreases.
  • At high noise levels (e.g., $\bar{\epsilon} = 1.0$), SAGE’s performance surpasses all competitors, including state-of-the-art methods such as NMXFD, GSG, and CGSG. SAGE delivered up to an order of magnitude improvement over these methods when averaged across test cases.
  • This superior behavior is attributed to SAGE’s explicit modeling of noise effects in its constraints and its optimal selection of sampling radius, which avoids the pathologies of sampling too close to the point of interest under high noise.

Performance, measured as improvement at process end and average improvement throughout optimization, confirms that SAGE robustly adapts to both smooth and noisy environments.

5. Sample Efficiency and Robustness in Practice

SAGE is particularly well-suited for scenarios where function evaluations are costly, such as simulation-based design or real-world experimental optimization. Instead of employing a fixed sample schedule, SAGE maximally exploits existing data and only acquires new evaluations as necessary to tighten the gradient estimate. The algorithm’s formulation guarantees a worst-case error bound (the polytope diameter), providing a certificate of robustness.

Importantly, SAGE's LP structure accommodates explicit knowledge or estimates of Hessian norms, Lipschitz constants, and noise bounds, making it a flexible tool for settings with variable and quantifiable uncertainty.

The limiting factor is computational: formulating and solving an LP with many constraints may become expensive in high dimensions, but this is mitigated by filtering the most informative constraints (i.e., those closest to the optimal distance from the query point).
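A small illustrative filter for this mitigation is sketched below: it keeps only samples whose distance from the query point lies within a tolerance band around a target radius, so the LP carries only the most informative slab constraints. The band and all names are assumptions, not the paper's exact rule:

```python
import numpy as np

def filter_samples(x_k, X, Z, r_target, rel_tol=0.5):
    """Keep samples whose distance from x_k is within (1 +/- rel_tol) * r_target."""
    X, Z = np.asarray(X), np.asarray(Z)
    dist = np.linalg.norm(X - x_k, axis=1)
    mask = np.abs(dist - r_target) <= rel_tol * r_target
    return X[mask], Z[mask]
```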

6. Relation to Prior and Contemporary Work

Relative to prior and contemporary work:

  • Unlike finite-difference and Gaussian smoothed gradient estimators, SAGE formalizes sample-based uncertainty as a convex set, explicitly quantifies attainable accuracy, and algorithmically controls it via adaptive sampling.
  • The set-membership and polytope approach is distinct from methods such as Smart Gradient (which adaptively changes coordinate systems or combines descent directions using orthogonalization) (Fattah et al., 2021); SAGE’s core innovation lies in the exploitation of validated inequalities to define the gradient feasible region.
  • Computational cost for large point sets or high dimensions remains a challenge; plausible future directions include enhanced constraint selection mechanisms, scalable convex approximations, and extension to cases where loss landscape regularity assumptions may be relaxed.

7. Applications and Implications

SAGE’s framework makes it attractive for any application requiring high-precision gradient estimation under practical sampling and noise constraints, notably:

  • Expensive simulation-based optimization (engineering, science).
  • Black-box optimization where function smoothness is known or can be estimated.
  • Safety-critical optimization, where guaranteed gradient bounds are desirable.
  • Cases where robustness to noise is paramount.

By integrating data-driven gradient set characterization, adaptive experimental design, and convex optimization, SAGE establishes a paradigm for gradient estimation that is both theoretically sound and practically robust (Jr. et al., 26 Aug 2025).
