
Sparse Polyak for High-Dimensional Estimation

Updated 15 September 2025
  • The paper introduces Sparse Polyak, an adaptive step-size method that uses hard-thresholding to estimate a restricted Lipschitz constant, ensuring linear convergence.
  • It achieves rate-invariance by decoupling the iteration count from the ambient dimension, making it efficient for high-dimensional sparse estimation.
  • Empirical results in sparse linear and logistic regression demonstrate that Sparse Polyak recovers support accurately and outperforms classical methods even when d exceeds n.

Sparse Polyak refers to a family of adaptive step-size rules, motivated by the original Polyak step-size, that are designed to address the statistical and computational challenges in optimization for high-dimensional M-estimation. Its distinctive feature is the use of restricted (sparse) gradient information in the construction of the step-size, aligning the optimization geometry with the low-dimensional structure inherent in many modern estimation problems. Unlike conventional adaptive methods, which base the step-size on global smoothness properties, Sparse Polyak uses a hard-thresholded gradient to estimate a restricted Lipschitz constant, yielding performance that is rate-invariant to the ambient dimension. This approach achieves linear convergence up to the optimal statistical precision and maintains efficiency even when the number of parameters dramatically exceeds the sample size (Qiao et al., 11 Sep 2025).

1. Motivation and Context: High-dimensional M-Estimation

Sparse Polyak was proposed to address the inefficient scaling of standard Polyak step-sizes in high-dimensional statistical estimation. When the number of parameters $d$ surpasses the sample size $n$, and only a sparse vector $\theta^*$ is to be estimated (with $\Vert \theta^* \Vert_0 = s^* \ll d$), full-gradient-based step-size rules can become overly pessimistic. The norm $\lVert \nabla f(\theta_t) \rVert^2$ may grow with $d$, causing the adaptive step-size to be excessively small and drastically increasing the number of iterations required to reach optimal statistical accuracy, even when the statistical difficulty itself does not scale with $d$. Thus, classical approaches lack "rate-invariance" in high dimensions (Qiao et al., 11 Sep 2025).

Sparse Polyak overcomes this by focusing the step-size adaptation on the relevant sparse coordinates (the active support), exploiting the low-dimensional structure of high-dimensional M-estimation problems. The method is principally motivated by problems in sparse linear and logistic regression, but its principle is broadly applicable to any regime where low-dimensional structure can be captured via hard-thresholding.

2. Sparse Polyak Step-Size: Restricted Smoothness and Construction

The central innovation in Sparse Polyak is the use of hard-thresholding in the step-size denominator:

$$\gamma_t = \frac{\max\{ f(\theta_t) - \widehat{f},\, 0 \}}{5\, \| \operatorname{HT}_s(\nabla f(\theta_t)) \|^2}$$

where $\operatorname{HT}_s$ denotes the hard-thresholding operator (keeping only the $s$ largest-magnitude components), and $\widehat{f}$ is a lower-bound surrogate for the optimal function value (e.g., the statistical minimum). The update is then

$$\theta_{t+1} = \operatorname{HT}_s\left( \theta_t - \gamma_t \nabla f(\theta_t) \right)$$

This construction directly targets the restricted smoothness constant. In classical Polyak step-size,

$$\gamma_t^{\text{classic}} = \frac{f(\theta_t) - f^*}{\| \nabla f(\theta_t) \|^2}$$

the denominator reflects the full Lipschitz constant $L$. In Sparse Polyak, restricting the gradient norm to its top $s$ coordinates replaces $L$ by a restricted Lipschitz constant $\overline{L} = L + 3\tau s$, where $L$ is the global smoothness constant and $\tau$ quantifies additional curvature along sparse directions (the restricted smoothness (RSS) assumption). Unlike $L$, $\overline{L}$ does not scale with $d$ for fixed $s$, yielding a "dimension-free" step-size even when $n \ll d$.
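
To make the construction concrete, here is a minimal NumPy sketch of one Sparse Polyak iteration, assuming the objective `f` and gradient `grad_f` are supplied by the caller; the function names and the plain full-sort implementation of $\operatorname{HT}_s$ are illustrative choices, not the paper's code.

```python
import numpy as np

def hard_threshold(v, s):
    """HT_s: keep the s largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]            # indices of the top-s magnitudes
    out[idx] = v[idx]
    return out

def sparse_polyak_step(theta, f, grad_f, f_hat, s):
    """One Sparse Polyak update: restricted step-size, then hard-thresholded gradient step."""
    g = grad_f(theta)
    g_s = hard_threshold(g, s)                  # restricted (top-s) gradient
    denom = 5.0 * float(g_s @ g_s)              # 5 * ||HT_s(grad f(theta))||^2
    gap = max(f(theta) - f_hat, 0.0)            # clipped surrogate optimality gap
    gamma = gap / denom if denom > 0 else 0.0   # Sparse Polyak step-size
    return hard_threshold(theta - gamma * g, s) # descend along the full gradient, then threshold
```

Note that the denominator uses only the thresholded gradient, while the descent direction uses the full gradient before the iterate is re-thresholded, exactly as in the update above.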

3. Theoretical Guarantees and Convergence Analysis

Sparse Polyak provides strong theoretical guarantees in the context of high-dimensional sparse M-estimation. Under the restricted strong convexity and restricted smoothness assumptions (standard in high-dimensional theory), and with the sparsity level $s$ chosen to cover the true support plus a margin (e.g., $s \geq c s^*$ for a constant $c$ dependent on the condition number), the following guarantee is established [(Qiao et al., 11 Sep 2025), Theorem 1]:

$$\Vert \theta_{t+1} - \widehat{\theta} \Vert^2 \leq \left( 1 - \frac{1}{80\,\overline{\kappa}} \right) \Vert \theta_t - \widehat{\theta} \Vert^2$$

where $\overline{\kappa} = \overline{L} / \overline{\mu}$ is the restricted condition number and $\widehat{\theta}$ is any $s$-sparse minimizer of $f$ (or an estimator achieving the lower-bound target function value $\widehat{f}$). Once an iterate enters a ball of radius proportional to the statistical error, subsequent iterates remain in a neighborhood of that statistically optimal radius.
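
A useful consequence of this contraction, and a standard calculation rather than a statement from the paper, is the implied iteration count: unrolling the recursion gives

$$\Vert \theta_T - \widehat{\theta} \Vert^2 \leq \left(1 - \frac{1}{80\,\overline{\kappa}}\right)^{T} \Vert \theta_0 - \widehat{\theta} \Vert^2 \leq e^{-T/(80\,\overline{\kappa})}\, \Vert \theta_0 - \widehat{\theta} \Vert^2,$$

so $T \geq 80\,\overline{\kappa}\, \log\!\left(\Vert \theta_0 - \widehat{\theta} \Vert^2 / \epsilon\right)$ iterations suffice to reach squared error $\epsilon$. The count depends only on the restricted condition number $\overline{\kappa}$, which is the quantitative content of the rate-invariance discussed in Section 4.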

An adaptive ("double-loop") variant is developed in the absence of a precise lower bound $\widehat{f}$, using repeated surrogate refinement analogous to the epoch-based variants of adaptive Polyak step-size in the low-dimensional setting.

Additionally, support recovery is guaranteed under a quantitative signal-to-noise condition: the minimal nonzero component of $\widehat{\theta}$ must dominate the restricted gradient norm at the optimum, i.e.,

$$|\widehat{\theta}|_{\min} \geq \frac{7\, \| \operatorname{HT}(\nabla f(\widehat{\theta})) \|}{\overline{\mu}}$$

where $\overline{\mu}$ is the restricted strong convexity constant. Upon entering this regime, the iterates recover the exact support and maintain it.
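
As a small illustration, this condition can be checked numerically once an estimate of $\overline{\mu}$ and a sparsity level are fixed; the sketch below uses illustrative names and applies the thresholding at the same level $s$ used by the algorithm, which is an assumption since the displayed condition does not specify the level.

```python
import numpy as np

def support_condition_holds(theta_hat, grad_at_theta_hat, mu_bar, s):
    """Check |theta_hat|_min >= 7 * ||HT_s(grad f(theta_hat))|| / mu_bar."""
    support = np.flatnonzero(theta_hat)
    if support.size == 0:
        return False
    theta_min = np.min(np.abs(theta_hat[support]))      # smallest nonzero magnitude
    top_s = np.sort(np.abs(grad_at_theta_hat))[-s:]     # magnitudes kept by HT_s
    return theta_min >= 7.0 * np.linalg.norm(top_s) / mu_bar
```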

4. Rate-Invariance and High-dimensional Adaptivity

Rate-invariance denotes the property that the number of optimization iterations required to achieve a prescribed statistical accuracy is independent of the ambient dimension $d$, so long as the effective problem dimension (e.g., $s \log d / n$) is fixed. This is in stark contrast to fixed step-size and classical Polyak step-size rules, where the iteration count grows with $d$ due to the scaling of $L$ in the denominator.

Sparse Polyak achieves rate-invariance: as $d$ increases, the statistical error is unchanged and the number of iterations required to reach it remains constant. Empirically, this is supported by experiments in both logistic and linear regression, where the same optimization precision is reached in essentially the same number of iterations as $d$ grows, provided $s \log d / n$ remains fixed (Qiao et al., 11 Sep 2025). In contrast, the iteration count for the classical Polyak method increases substantially.

5. Algorithmic Formulation: Hard-Thresholded Iterative Updates

The method is implemented as hard-thresholded iterations:

$$\theta_{t+1} = \operatorname{HT}_s\big( \theta_t - \gamma_t \nabla f(\theta_t) \big)$$

where the step-size is

$$\gamma_t = \frac{\max\{ f(\theta_t) - \widehat{f},\, 0 \}}{5\, \|\operatorname{HT}_s(\nabla f(\theta_t))\|^2}$$

The choice of $s$ must exceed the true sparsity $s^*$ and be proportional to the effective sparsity (depending on the restricted condition number). The denominator is always computed over the hard-thresholded gradient, never the full gradient, to preserve dimension independence. The numerator uses a lower bound on the optimal value, which may be estimated adaptively.

The operator $\operatorname{HT}_s$ can be computed efficiently in $O(d + s\log s)$ time via quick-selection over the gradient.
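
The full-sort helper in the earlier sketch is the simplest correct implementation; a version closer to the stated cost uses partial selection, sketched below with NumPy's `argpartition` (an introselect-based partial selection standing in for the quick-selection mentioned above).

```python
import numpy as np

def hard_threshold_fast(v, s):
    """HT_s via partial selection: find the top-s indices without sorting all of v."""
    out = np.zeros_like(v)
    if s <= 0:
        return out
    idx = np.argpartition(np.abs(v), -s)[-s:]   # indices of the s largest magnitudes (unordered)
    out[idx] = v[idx]
    return out
```

Sorting the selected $s$ entries, when an ordered list is needed, contributes the $s \log s$ term in the stated bound.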

An adaptive surrogate refinement (Algorithm 2 in (Qiao et al., 11 Sep 2025)) sequentially updates the target function value to mitigate the lack of knowledge about the true statistical optimum, yielding global convergence to the minimax error rate up to constant factors.
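
Since Algorithm 2 is not reproduced here, the following is only a generic double-loop wrapper in the spirit of epoch-based adaptive Polyak schemes, not the paper's procedure: each epoch targets the best objective value seen so far minus a slack, and the slack is refined between epochs. The refinement rule, epoch lengths, and all names are assumptions for illustration; it reuses `sparse_polyak_step` from the sketch in Section 2.

```python
def adaptive_sparse_polyak(theta0, f, grad_f, s, n_epochs=8, inner_iters=200, slack0=1.0):
    """Hypothetical double-loop wrapper (NOT the paper's Algorithm 2): refine the surrogate
    target f_hat across epochs when the true optimal value is unknown."""
    theta = theta0
    best_val = f(theta0)
    slack = slack0
    for _ in range(n_epochs):
        f_hat = best_val - slack                  # current surrogate for the unknown optimum
        for _ in range(inner_iters):
            theta = sparse_polyak_step(theta, f, grad_f, f_hat, s)
            best_val = min(best_val, f(theta))
        slack *= 0.5                              # tighten the surrogate toward the optimum
    return theta
```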

6. Empirical Performance and Comparisons

Extensive numerical experiments on both synthetic and real datasets confirm Sparse Polyak's rate-invariance and superior performance as $d$ increases (Qiao et al., 11 Sep 2025). Key findings:

  • Synthetic regression (logistic or linear): As $d/n$ increases with $s$ and $s \log d / n$ held constant, Sparse Polyak maintains a nearly constant optimization iteration count to reach statistical precision, while classical Polyak requires progressively more iterations (an illustrative synthetic-data sketch follows at the end of this section).
  • Real data (Wave Energy Farm, Molecule Musk): While highly tuned fixed step-size schemes perform comparably, Sparse Polyak always outperforms classical Polyak, particularly as problem dimensionality increases.
  • Active support adjustment: When the signal-to-noise condition is met, the method correctly identifies the sparse support of the solution in finite time.

The adaptivity to the restricted smoothness is the critical factor in this empirical robustness.
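
For concreteness, the synthetic-regression setting can be reproduced in miniature with the sketches above. The dimensions, noise level, and the oracle surrogate $\widehat{f} = f(\theta^*)$ below are illustrative choices, not the paper's experimental settings; `sparse_polyak_step` is the helper from Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s_true, s = 200, 2000, 5, 15                 # n << d; s chosen with a margin over s_true

X = rng.standard_normal((n, d)) / np.sqrt(n)       # design matrix
theta_star = np.zeros(d)
theta_star[rng.choice(d, s_true, replace=False)] = rng.choice([-1.0, 1.0], s_true)
y = X @ theta_star + 0.05 * rng.standard_normal(n) # noisy sparse linear model

f = lambda th: 0.5 * np.sum((X @ th - y) ** 2)     # least-squares objective
grad_f = lambda th: X.T @ (X @ th - y)
f_hat = f(theta_star)                              # oracle surrogate ("statistical minimum"), illustration only

theta = np.zeros(d)
for _ in range(300):
    theta = sparse_polyak_step(theta, f, grad_f, f_hat, s)

print("estimation error:", np.linalg.norm(theta - theta_star))
print("true support contained in estimate:",
      set(np.flatnonzero(theta_star)) <= set(np.flatnonzero(theta)))
```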

7. Implications and Extensions

Sparse Polyak's principle (replacing global notions of smoothness with restricted versions tailored to the underlying low-dimensional structure) addresses a fundamental limitation of adaptive step-size approaches in high dimensions. This methodology is not limited to $l_0$-sparse estimation: it can, in principle, be extended to other structured problems (e.g., low-rank matrix recovery or group sparsity) by appropriately redefining the "active" set and the thresholding/selection routine.

A plausible implication is that similar adaptivity may yield benefits for other adaptive rules (e.g., Barzilai-Borwein or AdaGrad variants) provided the restricted constants (curvature or variance) can be reliably estimated from the relevant subspace.

The Sparse Polyak technique thus stands as an effective solution for optimization within the high-dimensional, sparse statistical estimation regime, combining minimax optimality in statistical error with rate-invariant computational efficiency.

