SAGE: Adaptive Set-Based Gradient Estimator

Updated 29 August 2025
  • SAGE is a set-based adaptive gradient estimator that uses Taylor series analysis and convex constraints to quantify uncertainty in gradient estimation.
  • It employs an adaptive sampling strategy by measuring the polytope diameter to select the most informative sample for reducing estimation error.
  • The method demonstrates robust performance and sample efficiency in noisy, expensive function evaluations, outperforming traditional finite-difference techniques.

The Set-based Adaptive Gradient Estimator (SAGE) is an approach to gradient estimation for black-box scalar functions, which formulates and refines a set of admissible gradients using sample-based constraints derived from Taylor series analysis. SAGE provides both a principled framework for quantifying the uncertainty in the gradient estimate and an adaptive sampling mechanism for efficiently reducing this uncertainty, offering robustness particularly in the presence of bounded noise and sample efficiency for expensive function evaluations (Jr. et al., 26 Aug 2025).

1. Theoretical Underpinnings: Taylor Series and Set-Membership Formulation

At the core of SAGE is a rigorous use of the multivariate Taylor series. For a function $f: \mathbb{R}^D \rightarrow \mathbb{R}$ possessing a Lipschitz continuous Hessian, the difference in function values between two sample points $(x_i, z_i)$ and $(x_j, z_j)$ is expanded to relate the empirical slope to the true gradient and the Hessian properties. The two-point directional slope,

\tilde{g}_{ij} = \frac{z_j - z_i}{\mu_{ij}},

approximates the projection of the true gradient $g(x_i)$ along the unit vector $u_{ij} = (x_j - x_i)/\mu_{ij}$, where $\mu_{ij} = \|x_j - x_i\|$ is the distance between the samples. Lemma 1 in the cited work establishes the critical bound:

|\tilde{g}_{ij} - g(x_i)^\top u_{ij}| \leq \frac{1}{2} H_i \mu_{ij} + \frac{1}{6}\gamma_H \mu_{ij}^2,

where $H_i$ is the spectral norm of the Hessian at $x_i$ and $\gamma_H$ is the Hessian's Lipschitz constant. This inequality defines a “slab” of admissible projections for each sample pair.
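The following minimal sketch illustrates the Lemma 1 inequality numerically on a synthetic quadratic test function (the function, sample points, and constants are illustrative choices, not taken from the paper; for a quadratic the Hessian is constant, so $\gamma_H = 0$):

```python
import numpy as np

# Illustrative quadratic test function f(x) = 0.5 x^T A x + c^T x with known gradient.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x + c @ x
grad = lambda x: A @ x + c

x_i = np.array([0.2, -0.4])
x_j = np.array([0.7, 0.1])

mu_ij = np.linalg.norm(x_j - x_i)      # sample spacing
u_ij = (x_j - x_i) / mu_ij             # unit direction between the samples
g_tilde = (f(x_j) - f(x_i)) / mu_ij    # two-point directional slope

H_i = np.linalg.norm(A, 2)             # spectral norm of the (constant) Hessian
gamma_H = 0.0                          # quadratic => Hessian Lipschitz constant is 0

lhs = abs(g_tilde - grad(x_i) @ u_ij)
rhs = 0.5 * H_i * mu_ij + (1.0 / 6.0) * gamma_H * mu_ij**2
print(f"|slope - projection| = {lhs:.4f} <= slab half-width = {rhs:.4f}")
```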

By aggregating these pairwise bounds over a set of samples $X_n$, every sampled direction constrains the true gradient. Theorem 1 formalizes that $g(x_i)$ must lie inside a convex polytope in $\mathbb{R}^D$ defined by the intersection of all such slabs:

\mathcal{G}^{(i)} = \left\{ g \in \mathbb{R}^D : \begin{bmatrix} -u_{ij}^\top \\ u_{ij}^\top \end{bmatrix} g \le \begin{bmatrix} -\tilde{g}_{ij} \\ \tilde{g}_{ij} \end{bmatrix} + \frac{1}{2}\begin{bmatrix} \mu_{ij} \\ \mu_{ij} \end{bmatrix} H_i + \frac{1}{6}\begin{bmatrix} \mu_{ij}^2 \\ \mu_{ij}^2 \end{bmatrix}\gamma_H, \quad \forall x_j \in X_n \setminus \{x_i\} \right\}.

This set-based perspective contrasts sharply with classic gradient estimators, which typically yield only a point estimate.
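A sketch of how the slab inequalities can be stacked into a linear system $A g \le b$ describing $\mathcal{G}^{(i)}$ is shown below. Function and variable names are illustrative, and $H_i$ and $\gamma_H$ are treated here as known constants (the paper's LP can also treat them as variables):

```python
import numpy as np

def build_gradient_polytope(x_i, z_i, X, Z, H_i, gamma_H):
    """Stack the pairwise slab inequalities into A g <= b.

    Each sample pair (x_i, x_j) contributes two half-spaces:
        u_ij^T g <=  g_tilde_ij + s_ij
       -u_ij^T g <= -g_tilde_ij + s_ij
    with slab half-width s_ij = 0.5*H_i*mu_ij + (1/6)*gamma_H*mu_ij**2.
    """
    rows, rhs = [], []
    for x_j, z_j in zip(X, Z):
        diff = x_j - x_i
        mu = np.linalg.norm(diff)
        if mu == 0.0:
            continue                       # skip the query point itself
        u = diff / mu
        g_tilde = (z_j - z_i) / mu
        s = 0.5 * H_i * mu + (1.0 / 6.0) * gamma_H * mu**2
        rows += [u, -u]
        rhs += [g_tilde + s, -g_tilde + s]
    return np.array(rows), np.array(rhs)   # A has shape (2m, D), b has shape (2m,)
```

Every admissible gradient $g$ then satisfies $A g \le b$; the feasible set is exactly the intersection of slabs described by Theorem 1.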

2. Adaptive Sampling and Refinement of the Gradient Set

The polytope $\mathcal{G}^{(i)}$ initially constructed from the available samples may exhibit large uncertainty, quantified by the polytope diameter $\rho(\mathcal{G}^{(i)})$. SAGE addresses this with an adaptive sampling strategy:

  • The diameter $\rho(\mathcal{G}^{(i)})$ is computed (the maximum distance between any two vertices of the polytope).
  • If $\rho(\mathcal{G}^{(i)})$ exceeds a user-specified precision $\rho^*$ (or the theoretical limit under bounded noise), the direction $d$ of maximum uncertainty (the line connecting the two most distant polytope vertices) is identified.
  • A new sample is taken along $d$ at distance $\alpha$ from $x_i$, where $\alpha$ is chosen optimally in the noisy case by solving a cubic equation rooted in the Taylor bound coefficients (a hedged sketch of this trade-off follows the list). This ensures new samples are maximally informative.
  • The new sample is incorporated, further constraining $\mathcal{G}^{(i)}$, and the process continues until the desired precision is attained.
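One illustrative way to see why a cubic arises, sketched under the assumption that each evaluation carries additive noise bounded by $\bar{\epsilon}$, so the two-point slope incurs an extra error of at most $2\bar{\epsilon}/\alpha$ (this reconstruction is not taken from the paper, and the exact constants may differ): the total slab half-width as a function of the step length is

w(\alpha) = \frac{1}{2} H_i \alpha + \frac{1}{6}\gamma_H \alpha^2 + \frac{2\bar{\epsilon}}{\alpha}.

Setting $w'(\alpha) = 0$ and multiplying through by $\alpha^2$ gives the cubic

\frac{1}{3}\gamma_H \alpha^3 + \frac{1}{2} H_i \alpha^2 - 2\bar{\epsilon} = 0,

whose positive root balances the Taylor truncation terms against noise amplification.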

This mechanism circumvents conventional heuristics for step size selection (e.g., as in finite differences) and produces near-optimal sample placement with respect to the attainable gradient precision.
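A minimal sketch of how the diameter and the direction of maximum uncertainty could be approximated in practice is given below. It probes random unit directions and measures the polytope's width along each via two linear programs; this is an illustrative approximation (a lower bound on the true vertex-to-vertex diameter), not the paper's exact procedure, and it assumes `scipy` is available:

```python
import numpy as np
from scipy.optimize import linprog

def directional_width(A, b, d):
    """Width of the polytope {g : A g <= b} along unit direction d,
    computed as (max d^T g) - (min d^T g) via two LPs."""
    free = [(None, None)] * A.shape[1]     # gradient components are unbounded a priori
    hi = linprog(-d, A_ub=A, b_ub=b, bounds=free, method="highs")
    lo = linprog(d, A_ub=A, b_ub=b, bounds=free, method="highs")
    if not (hi.success and lo.success):
        return np.inf                      # unbounded or infeasible set
    return (-hi.fun) - lo.fun

def widest_direction(A, b, n_probes=64, seed=None):
    """Approximate the diameter and its direction by probing random unit vectors."""
    rng = np.random.default_rng(seed)
    D = A.shape[1]
    dirs = rng.normal(size=(n_probes, D))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    widths = np.array([directional_width(A, b, d) for d in dirs])
    k = int(np.argmax(widths))
    return widths[k], dirs[k]              # (approx. diameter, direction of max uncertainty)
```

If the returned width exceeds $\rho^*$, a new sample is placed along the returned direction at the chosen step length $\alpha$.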

3. Algorithmic Implementation

SAGE proceeds iteratively as follows:

  1. Initialization: Begin with an initial (possibly minimal) data set $X_n$, including the target point $x_k$ and any prior samples.
  2. Sample Selection: Filter for samples at distances close to the theoretically optimal radius, to keep the constraint set tractable.
  3. Linear Program Formulation: Define variables $(g^{(i)}, H_i, \gamma_H)$ and, if noise is present, a noise bound $\bar{\epsilon}$. The constraints are the linear inequalities described above.
  4. LP Solution: Solve the LP to find the central/optimal point in $\mathcal{G}^{(i)}$.
  5. Uncertainty Check: Measure the diameter of the solution set. If the gradient estimate meets the prescribed precision, terminate.
  6. Adaptive Sampling: Otherwise, compute the uncertainty-maximizing direction $d$, select the optimal $\alpha$, and sample anew at $x_j = x_k + \alpha \hat{d}$.
  7. Repeat: Update the sample set and repeat until the stopping criterion is met.

This approach leverages convex optimization techniques (for the LP) and only adds samples if and when necessary for refinement.
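The sketch below illustrates one way Steps 3–5 could be realized: it picks a central point of the gradient set by solving a Chebyshev-center LP (largest inscribed ball). The centering criterion and function names are illustrative assumptions, and unlike the paper's LP this sketch treats $H_i$, $\gamma_H$, and $\bar{\epsilon}$ as fixed rather than as decision variables:

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_center(A, b):
    """Center of the largest ball inscribed in {g : A g <= b}.

    Decision variables are (g, r); we maximize r subject to
        a_k^T g + ||a_k|| r <= b_k   for every constraint row a_k.
    The returned g is one reasonable point estimate inside the gradient set.
    """
    m, D = A.shape
    norms = np.linalg.norm(A, axis=1)
    A_lp = np.hstack([A, norms[:, None]])        # append the radius column
    c = np.zeros(D + 1)
    c[-1] = -1.0                                 # maximize r == minimize -r
    bounds = [(None, None)] * D + [(0.0, None)]  # g free, r >= 0
    res = linprog(c, A_ub=A_lp, b_ub=b, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError("gradient set is empty or unbounded")
    g_hat, r = res.x[:D], res.x[-1]
    return g_hat, r
```

The returned `g_hat` plays the role of the gradient estimate in Step 4, and the diameter check of Step 5 can reuse a directional-width routine such as the one sketched in the previous section.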

4. Performance Characteristics and Comparative Evaluation

Comprehensive statistical testing demonstrates SAGE’s capabilities:

  • In noiseless settings, standard finite-difference methods (forward, central) can exhibit strong performance; however, as noise increases, their utility sharply decreases.
  • At high noise levels (e.g., $\bar{\epsilon} = 1.0$), SAGE’s performance surpasses all competitors, including state-of-the-art methods such as NMXFD, GSG, and CGSG. SAGE delivered up to an order of magnitude improvement over these methods when averaged across test cases.
  • This superior behavior is attributed to SAGE’s explicit modeling of noise effects in its constraints and its optimal selection of sampling radius, which avoids the pathologies of sampling too close to the point of interest under high noise.

Performance, measured as improvement at process end and average improvement throughout optimization, confirms that SAGE robustly adapts to both smooth and noisy environments.

5. Sample Efficiency and Robustness in Practice

SAGE is particularly well-suited for scenarios where function evaluations are costly, such as simulation-based design or real-world experimental optimization. Instead of employing a fixed sample schedule, SAGE maximally exploits existing data and only acquires new evaluations as necessary to tighten the gradient estimate. The algorithm’s formulation guarantees a worst-case error bound (the polytope diameter), providing a certificate of robustness.

Importantly, SAGE's LP structure accommodates explicit knowledge or estimates of Hessian norms, Lipschitz constants, and noise bounds, making it a flexible tool for settings with variable and quantifiable uncertainty.

The limiting factor is computational: formulating and solving an LP with many constraints may become expensive in high dimensions, but this is mitigated by filtering the most informative constraints (i.e., those closest to the optimal distance from the query point).
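A small illustrative filter for this mitigation is sketched below: it keeps only samples whose distance from the query point lies within a tolerance band around a target radius, so the LP carries only the most informative slab constraints. The band and all names are assumptions, not the paper's exact rule:

```python
import numpy as np

def filter_samples(x_k, X, Z, r_target, rel_tol=0.5):
    """Keep samples whose distance from x_k is within (1 +/- rel_tol) * r_target."""
    X, Z = np.asarray(X), np.asarray(Z)
    dist = np.linalg.norm(X - x_k, axis=1)
    mask = np.abs(dist - r_target) <= rel_tol * r_target
    return X[mask], Z[mask]
```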

6. Relation to Prior and Contemporary Work

Relative to prior and contemporary work:

  • Unlike finite-difference and Gaussian smoothed gradient estimators, SAGE formalizes sample-based uncertainty as a convex set, explicitly quantifies attainable accuracy, and algorithmically controls it via adaptive sampling.
  • The set-membership and polytope approach is distinct from methods such as Smart Gradient (which adaptively changes coordinate systems or combines descent directions using orthogonalization) (Fattah et al., 2021); SAGE’s core innovation lies in the exploitation of validated inequalities to define the gradient feasible region.
  • Computational cost for large point sets or high dimensions remains a challenge; plausible future directions include enhanced constraint selection mechanisms, scalable convex approximations, and extension to cases where loss landscape regularity assumptions may be relaxed.

7. Applications and Implications

SAGE’s framework makes it attractive for any application requiring high-precision gradient estimation under practical sampling and noise constraints, notably:

  • Expensive simulation-based optimization (engineering, science).
  • Black-box optimization where function smoothness is known or can be estimated.
  • Safety-critical optimization, where guaranteed gradient bounds are desirable.
  • Cases where robustness to noise is paramount.

By integrating data-driven gradient set characterization, adaptive experimental design, and convex optimization, SAGE establishes a paradigm for gradient estimation that is both theoretically sound and practically robust (Jr. et al., 26 Aug 2025).
