
Sparse Polyak for High-Dimensional Estimation

Updated 15 September 2025
  • The paper introduces Sparse Polyak, an adaptive step-size method that uses hard-thresholding to estimate a restricted Lipschitz constant, ensuring linear convergence.
  • It achieves rate-invariance by decoupling the iteration count from the ambient dimension, making it efficient for high-dimensional sparse estimation.
  • Empirical results in sparse linear and logistic regression demonstrate that Sparse Polyak recovers support accurately and outperforms classical methods even when d exceeds n.

Sparse Polyak refers to a family of adaptive step-size rules, motivated by the original Polyak step-size, that are designed to address the statistical and computational challenges in optimization for high-dimensional M-estimation. Its distinctive feature is the use of restricted (sparse) gradient information in the construction of the step-size, aligning the optimization geometry with the low-dimensional structure inherent in many modern estimation problems. Unlike conventional adaptive methods, which base the step-size on global smoothness properties, Sparse Polyak uses a hard-thresholded gradient to estimate a restricted Lipschitz constant, yielding performance that is rate-invariant to the ambient dimension. This approach achieves linear convergence up to the optimal statistical precision and maintains efficiency even when the number of parameters dramatically exceeds the sample size (Qiao et al., 11 Sep 2025).

1. Motivation and Context: High-dimensional M-Estimation

Sparse Polyak was proposed to address the inefficient scaling of standard Polyak step-sizes in high-dimensional statistical estimation. When the number of parameters $d$ surpasses the sample size $n$, and only a sparse vector $\theta^*$ is to be estimated (with $\Vert \theta^* \Vert_0 = s^* \ll d$), full-gradient-based step-size rules can become overly pessimistic. The norm $\lVert \nabla f(\theta_t) \rVert^2$ may grow with $d$, causing the adaptive step-size to be excessively small and drastically increasing the number of iterations required to reach optimal statistical accuracy, even when the statistical difficulty itself does not scale with $d$. Thus, classical approaches lack "rate-invariance" in high dimensions (Qiao et al., 11 Sep 2025).

Sparse Polyak overcomes this by focusing the step-size adaptation on the relevant sparse coordinates (the active support), exploiting the low-dimensional structure of high-dimensional M-estimation problems. The method is principally motivated by problems in sparse linear and logistic regression, but its principle is broadly applicable to any regime where low-dimensional structure can be captured via hard-thresholding.

2. Sparse Polyak Step-Size: Restricted Smoothness and Construction

The central innovation in Sparse Polyak is the use of hard-thresholding in the step-size denominator:

$$\gamma_t = \frac{\max\{ f(\theta_t) - \widehat{f},\, 0 \}}{5\, \| \operatorname{HT}_s(\nabla f(\theta_t)) \|^2}$$

where $\operatorname{HT}_s$ denotes the hard-thresholding operator (keeping only the $s$ largest-magnitude components), and $\widehat{f}$ is a lower-bound surrogate for the optimal function value (e.g., the statistical minimum). The update is then

$$\theta_{t+1} = \operatorname{HT}_s\left( \theta_t - \gamma_t \nabla f(\theta_t) \right)$$

This construction directly targets the restricted smoothness constant. In classical Polyak step-size,

$$\gamma_t^{\text{classic}} = \frac{f(\theta_t) - f^*}{\| \nabla f(\theta_t) \|^2}$$

the denominator reflects the full Lipschitz constant $L$. In Sparse Polyak, restricting the gradient norm to its top $s$ coordinates replaces $L$ by a restricted Lipschitz constant $\overline{L} = L + 3\tau s$, where $L$ is the global smoothness constant and $\tau$ quantifies additional curvature along sparse directions (the restricted smoothness (RSS) assumption). Unlike $L$, $\overline{L}$ does not scale with $d$ for fixed $s$, yielding a "dimension-free" step-size even when $n \ll d$.
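
To make the construction concrete, here is a minimal NumPy sketch of one Sparse Polyak iteration, assuming the objective `f` and gradient `grad_f` are supplied by the caller; the function names and the plain full-sort implementation of $\operatorname{HT}_s$ are illustrative choices, not the paper's code.

```python
import numpy as np

def hard_threshold(v, s):
    """HT_s: keep the s largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]            # indices of the top-s magnitudes
    out[idx] = v[idx]
    return out

def sparse_polyak_step(theta, f, grad_f, f_hat, s):
    """One Sparse Polyak update: restricted step-size, then hard-thresholded gradient step."""
    g = grad_f(theta)
    g_s = hard_threshold(g, s)                  # restricted (top-s) gradient
    denom = 5.0 * float(g_s @ g_s)              # 5 * ||HT_s(grad f(theta))||^2
    gap = max(f(theta) - f_hat, 0.0)            # clipped surrogate optimality gap
    gamma = gap / denom if denom > 0 else 0.0   # Sparse Polyak step-size
    return hard_threshold(theta - gamma * g, s) # descend along the full gradient, then threshold
```

Note that the denominator uses only the thresholded gradient, while the descent direction uses the full gradient before the iterate is re-thresholded, exactly as in the update above.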

3. Theoretical Guarantees and Convergence Analysis

Sparse Polyak provides strong theoretical guarantees in the context of high-dimensional sparse M-estimation. Under the restricted strong convexity and restricted smoothness assumptions (standard in high-dimensional theory), and with the sparsity level $s$ chosen to cover the true support plus a margin (e.g., $s \geq c s^*$ for a constant $c$ dependent on the condition number), the following guarantee is established [(Qiao et al., 11 Sep 2025), Theorem 1]:

$$\Vert \theta_{t+1} - \widehat{\theta} \Vert^2 \leq \left( 1 - \frac{1}{80\,\overline{\kappa}} \right) \Vert \theta_t - \widehat{\theta} \Vert^2$$

where $\overline{\kappa} = \overline{L} / \overline{\mu}$ is the restricted condition number and $\widehat{\theta}$ is any $s$-sparse minimizer of $f$ (or an estimator achieving the lower-bound target function value $\widehat{f}$). Once an iterate enters a ball of radius proportional to the statistical error, subsequent iterates remain in a neighborhood of that statistically optimal radius.
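
A useful consequence of this contraction, and a standard calculation rather than a statement from the paper, is the implied iteration count: unrolling the recursion gives

$$\Vert \theta_T - \widehat{\theta} \Vert^2 \leq \left(1 - \frac{1}{80\,\overline{\kappa}}\right)^{T} \Vert \theta_0 - \widehat{\theta} \Vert^2 \leq e^{-T/(80\,\overline{\kappa})}\, \Vert \theta_0 - \widehat{\theta} \Vert^2,$$

so $T \geq 80\,\overline{\kappa}\, \log\!\left(\Vert \theta_0 - \widehat{\theta} \Vert^2 / \epsilon\right)$ iterations suffice to reach squared error $\epsilon$. The count depends only on the restricted condition number $\overline{\kappa}$, which is the quantitative content of the rate-invariance discussed in Section 4.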

An adaptive ("double-loop") variant is developed in the absence of a precise lower bound $\widehat{f}$, using repeated surrogate refinement analogous to the epoch-based variants of adaptive Polyak step-size in the low-dimensional setting.

Additionally, support recovery is guaranteed under a quantitative signal-to-noise condition: the minimal nonzero component of $\widehat{\theta}$ must dominate the restricted gradient norm at the optimum, i.e.,

$$|\widehat{\theta}|_{\min} \geq \frac{7\, \| \operatorname{HT}(\nabla f(\widehat{\theta})) \|}{\overline{\mu}}$$

where $\overline{\mu}$ is the restricted strong convexity constant. Upon entering this regime, the iterates recover the exact support and maintain it.
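
As a small illustration, this condition can be checked numerically once an estimate of $\overline{\mu}$ and a sparsity level are fixed; the sketch below uses illustrative names and applies the thresholding at the same level $s$ used by the algorithm, which is an assumption since the displayed condition does not specify the level.

```python
import numpy as np

def support_condition_holds(theta_hat, grad_at_theta_hat, mu_bar, s):
    """Check |theta_hat|_min >= 7 * ||HT_s(grad f(theta_hat))|| / mu_bar."""
    support = np.flatnonzero(theta_hat)
    if support.size == 0:
        return False
    theta_min = np.min(np.abs(theta_hat[support]))      # smallest nonzero magnitude
    top_s = np.sort(np.abs(grad_at_theta_hat))[-s:]     # magnitudes kept by HT_s
    return theta_min >= 7.0 * np.linalg.norm(top_s) / mu_bar
```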

4. Rate-Invariance and High-dimensional Adaptivity

Rate-invariance denotes the property that the number of optimization iterations required to achieve a prescribed statistical accuracy is independent of the ambient dimension $d$, so long as the effective problem dimension (e.g., $s \log d / n$) is fixed. This is in stark contrast to fixed step-size and classical Polyak step-size rules, where the iteration count grows with $d$ due to the scaling of $L$ in the denominator.

Sparse Polyak achieves rate-invariance: as $d$ increases, the statistical error is unchanged and the number of iterations required to reach it remains constant. Empirically, this is supported by experiments in both logistic and linear regression, where the same optimization precision is reached in essentially the same number of iterations as $d$ grows, provided $s \log d / n$ remains fixed (Qiao et al., 11 Sep 2025). In contrast, the iteration count for the classical Polyak method increases substantially.

5. Algorithmic Formulation: Hard-Thresholded Iterative Updates

The method is implemented as hard-thresholded iterations:

$$\theta_{t+1} = \operatorname{HT}_s\big( \theta_t - \gamma_t \nabla f(\theta_t) \big)$$

where the step-size is

$$\gamma_t = \frac{\max\{ f(\theta_t) - \widehat{f},\, 0 \}}{5\, \|\operatorname{HT}_s(\nabla f(\theta_t))\|^2}$$

The choice of $s$ must exceed the true sparsity $s^*$ and be proportional to the effective sparsity (depending on the restricted condition number). The denominator is always computed over the hard-thresholded gradient, never the full gradient, to preserve dimension independence. The numerator uses a lower bound on the optimal value, which may be estimated adaptively.

The operator $\operatorname{HT}_s$ can be computed efficiently in $O(d + s\log s)$ time via quick-selection over the gradient.
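
The full-sort helper in the earlier sketch is the simplest correct implementation; a version closer to the stated cost uses partial selection, sketched below with NumPy's `argpartition` (an introselect-based partial selection standing in for the quick-selection mentioned above).

```python
import numpy as np

def hard_threshold_fast(v, s):
    """HT_s via partial selection: find the top-s indices without sorting all of v."""
    out = np.zeros_like(v)
    if s <= 0:
        return out
    idx = np.argpartition(np.abs(v), -s)[-s:]   # indices of the s largest magnitudes (unordered)
    out[idx] = v[idx]
    return out
```

Sorting the selected $s$ entries, when an ordered list is needed, contributes the $s \log s$ term in the stated bound.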

An adaptive surrogate refinement (Algorithm 2 in (Qiao et al., 11 Sep 2025)) sequentially updates the target function value to mitigate the lack of knowledge about the true statistical optimum, yielding global convergence to the minimax error rate up to constant factors.
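
Since Algorithm 2 is not reproduced here, the following is only a generic double-loop wrapper in the spirit of epoch-based adaptive Polyak schemes, not the paper's procedure: each epoch targets the best objective value seen so far minus a slack, and the slack is refined between epochs. The refinement rule, epoch lengths, and all names are assumptions for illustration; it reuses `sparse_polyak_step` from the sketch in Section 2.

```python
def adaptive_sparse_polyak(theta0, f, grad_f, s, n_epochs=8, inner_iters=200, slack0=1.0):
    """Hypothetical double-loop wrapper (NOT the paper's Algorithm 2): refine the surrogate
    target f_hat across epochs when the true optimal value is unknown."""
    theta = theta0
    best_val = f(theta0)
    slack = slack0
    for _ in range(n_epochs):
        f_hat = best_val - slack                  # current surrogate for the unknown optimum
        for _ in range(inner_iters):
            theta = sparse_polyak_step(theta, f, grad_f, f_hat, s)
            best_val = min(best_val, f(theta))
        slack *= 0.5                              # tighten the surrogate toward the optimum
    return theta
```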

6. Empirical Performance and Comparisons

Extensive numerical experiments on both synthetic and real datasets confirm Sparse Polyak's rate-invariance and superior performance as $d$ increases (Qiao et al., 11 Sep 2025). Key findings:

  • Synthetic regression (logistic or linear): As $d/n$ increases with $s$ and $s \log d / n$ held constant, Sparse Polyak maintains a nearly constant optimization iteration count to reach statistical precision, while classical Polyak requires progressively more iterations (an illustrative synthetic-data sketch follows at the end of this section).
  • Real data (Wave Energy Farm, Molecule Musk): While highly tuned fixed step-size schemes perform comparably, Sparse Polyak always outperforms classical Polyak, particularly as problem dimensionality increases.
  • Active support adjustment: When the signal-to-noise condition is met, the method correctly identifies the sparse support of the solution in finite time.

The adaptivity to the restricted smoothness is the critical factor in this empirical robustness.
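
For concreteness, the synthetic-regression setting can be reproduced in miniature with the sketches above. The dimensions, noise level, and the oracle surrogate $\widehat{f} = f(\theta^*)$ below are illustrative choices, not the paper's experimental settings; `sparse_polyak_step` is the helper from Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_true, s = 200, 2000, 5, 15                 # n << d; s chosen with a margin over s_true

X = rng.standard_normal((n, d)) / np.sqrt(n)       # design matrix
theta_star = np.zeros(d)
theta_star[rng.choice(d, s_true, replace=False)] = rng.choice([-1.0, 1.0], s_true)
y = X @ theta_star + 0.05 * rng.standard_normal(n) # noisy sparse linear model

f = lambda th: 0.5 * np.sum((X @ th - y) ** 2)     # least-squares objective
grad_f = lambda th: X.T @ (X @ th - y)
f_hat = f(theta_star)                              # oracle surrogate ("statistical minimum"), illustration only

theta = np.zeros(d)
for _ in range(300):
    theta = sparse_polyak_step(theta, f, grad_f, f_hat, s)

print("estimation error:", np.linalg.norm(theta - theta_star))
print("true support contained in estimate:",
      set(np.flatnonzero(theta_star)) <= set(np.flatnonzero(theta)))
```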

7. Implications and Extensions

Sparse Polyak's principle (replacing global notions of smoothness with restricted versions tailored to the underlying low-dimensional structure) addresses a fundamental limitation of adaptive step-size approaches in high dimensions. This methodology is not limited to $l_0$-sparse estimation: it can, in principle, be extended to other structured problems (e.g., low-rank matrix recovery or group sparsity) by appropriately redefining the "active" set and the thresholding/selection routine.

A plausible implication is that similar adaptivity may yield benefits for other adaptive rules (e.g., Barzilai-Borwein or AdaGrad variants) provided the restricted constants (curvature or variance) can be reliably estimated from the relevant subspace.

The Sparse Polyak technique thus stands as an effective solution for optimization within the high-dimensional, sparse statistical estimation regime, combining minimax optimality in statistical error with rate-invariant computational efficiency.

