
Conic Descent Optimization

Updated 24 January 2026
  • Conic Descent is a family of first-order optimization methods that exploit conic models of the objective and geometric duality, applicable to both unconstrained and conic-constrained problems.
  • CD methods compute effective stepsizes via trial steps and rigorous safeguards, ensuring convergence under smoothness and convexity assumptions.
  • Variants like MOCO integrate momentum and memory-efficient strategies to tackle large-scale semidefinite programming with proven O(1/k) convergence rates.

Conic Descent (CD) refers to a family of first-order optimization algorithms designed for both unconstrained and conic-constrained problems. Distinguished by their use of conic models and geometric duality, CD methods deliver efficient stepsizes along search directions and are particularly suited to large-scale, low-storage scenarios in signal processing, machine learning, and semidefinite programming (SDP). CD combines structure-exploiting model construction, rigorous convergence guarantees, and explicit dual certificates, and admits momentum and memory-efficient variants.

1. Conic Descent for Unconstrained Optimization

The archetypal Conic Descent method for unconstrained smooth minimization was introduced and rigorously analyzed by Liu et al. (Liu et al., 2019). At each iterate $x_k$, the method constructs a local conic model of the objective $f$:

$$\phi_k(d) = f_k + \frac{g_k^T d}{1 + b_k^T d} + \frac{1}{2}\,\frac{d^T B_k d}{(1 + b_k^T d)^2},$$

where $f_k = f(x_k)$, $g_k = \nabla f(x_k)$, $b_k$ is the vector parameterizing the conic distortion, and $B_k$ is a positive definite matrix. The conic model is selected whenever a quadratic model is poorly justified, as detected by the "closeness-to-quadratic" statistic

$$\mu_k = \left|\, 2\,\frac{f_{k-1} - f_k + g_k^T s_{k-1}}{s_{k-1}^T y_{k-1}} - 1 \,\right|$$

with $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$.

The trial step $d = -\alpha g_k$ turns the conic model into a scalar function of $\alpha$,

$$\phi_k(\alpha) = f_k - \frac{\alpha\,\|g_k\|^2}{1 - \alpha\, b_k^T g_k} + \frac{1}{2}\,\frac{\alpha^2\, g_k^T B_k g_k}{(1 - \alpha\, b_k^T g_k)^2},$$

whose stationary point,

$$\alpha_k^{*} = \frac{\|g_k\|^2}{g_k^T B_k g_k + (b_k^T g_k)\,\|g_k\|^2},$$

is used as the stepsize, subject to safeguards and projection onto Barzilai-Borwein stepsize bounds when gradient history allows.

The full algorithm incorporates a Zhang–Hager nonmonotone line search and, as a fallback, quadratic or derivative-based models when the conic step is invalid. Inner products and updates incur only $\mathcal{O}(n)$ storage, and the method is robust to poor curvature information.
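
As a concrete illustration of this stepsize rule, here is a minimal sketch assuming the caller supplies the product $B_k g_k$ and the vector $b_k$; the function name, the quadraticity threshold `tau`, and the clamp bounds are illustrative choices, not the paper's:

```python
import numpy as np

def conic_stepsize(f_prev, f_curr, g_prev, g_curr, x_prev, x_curr,
                   Bg, b, tau=0.1, bounds=(1e-10, 1e10)):
    """One CD-style stepsize along d = -g_curr. `Bg` stands for
    B_k @ g_k and `b` for the conic vector b_k; `tau` and `bounds`
    are illustrative safeguard values."""
    s = x_curr - x_prev                    # s_{k-1}
    y = g_curr - g_prev                    # y_{k-1}
    # Closeness-to-quadratic statistic mu_k (assumes s @ y != 0).
    mu = abs(2.0 * (f_prev - f_curr + g_curr @ s) / (s @ y) - 1.0)
    gamma = g_curr @ g_curr                # ||g_k||^2
    q = g_curr @ Bg                        # g_k^T B_k g_k
    if mu > tau:                           # quadratic model poorly justified
        denom = q + (b @ g_curr) * gamma   # conic stationary point alpha_k^*
        alpha = gamma / denom if denom > 0 else gamma / q
    else:                                  # quadratic model adequate
        alpha = gamma / q
    # The paper projects onto Barzilai-Borwein-type bounds when gradient
    # history allows; a plain clamp stands in for that safeguard here.
    return float(np.clip(alpha, *bounds))
```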

2. Conic Descent for General Conic Programs

CD was extended to general conic-constrained optimization, minimizing $f(x)$ subject to $x \in \mathcal{K}$ for a closed convex cone $\mathcal{K}$, within a clear geometric and dual framework (Li et al., 2023). The primal-dual structure motivates an update that alternates between:

  • Ray minimization: for the current iterate $x_k$, find the scaling $\eta_k \ge 0$ minimizing $f(\eta\, x_k)$, enforcing the complementary-slackness condition $\langle \nabla f(\eta_k x_k), \eta_k x_k \rangle = 0$ whenever $\eta_k > 0$.
  • Ray search: at $\hat{x}_k = \eta_k x_k$, minimize the linear surrogate $\langle \nabla f(\hat{x}_k), v \rangle$ over $v \in \mathcal{K}$, $\|v\| \le 1$; this is a Frank–Wolfe subproblem on the dual. Then perform a one-dimensional search in the $v_k$ direction.

The update is thus

$$x_{k+1} = \eta_k\, x_k + t_k\, v_k,$$

with $t_k \ge 0$ chosen by the one-dimensional ray search. This structure allows a unified view of CD as alternating between complementary-slackness and dual-feasibility pushes.
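
To make the two-step update concrete, the following is a minimal sketch for the special case $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ over the nonnegative orthant $\mathcal{K} = \mathbb{R}^n_{+}$; the closed-form ray minimization and line search exploit this quadratic instance, and the LMO shown is specific to the orthant rather than a general cone:

```python
import numpy as np

def cd_nonneg_least_squares(A, b, iters=200):
    """Conic descent sketch for min 0.5*||Ax - b||^2 over K = R^n_+.
    Each iteration: (1) ray minimization, rescaling the iterate along
    its own ray; (2) a Frank-Wolfe-style ray search over {v in K,
    ||v|| <= 1}, followed by an exact 1-D line search."""
    n = A.shape[1]
    x = np.zeros(n)
    for _ in range(iters):
        # (1) Ray minimization: eta = argmin_{eta >= 0} f(eta * x),
        # closed-form because f is quadratic.
        Ax = A @ x
        nrm = Ax @ Ax
        x = (max(0.0, (Ax @ b) / nrm) if nrm > 0 else 0.0) * x
        # (2) Ray search: the LMO over the unit ball of the orthant
        # picks the coordinate with the most negative gradient entry.
        g = A.T @ (A @ x - b)
        i = int(np.argmin(g))
        if g[i] >= 0:                 # no descent direction left in K
            break
        Av = A[:, i]
        t = -g[i] / (Av @ Av)         # exact step along v = e_i
        x[i] += t
    return x

# Tiny usage check: a random overdetermined nonnegative least-squares fit.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x = cd_nonneg_least_squares(A, b)
```

For a general cone, the coordinate-picking LMO would be replaced by the cone's own linear minimization oracle (for instance, a leading eigenvector computation for the PSD cone).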

3. Rigorous Convergence Theory and Stopping Criteria

For $L$-smooth and strictly convex $f$ over a closed convex cone $\mathcal{K}$, CD achieves explicit $\mathcal{O}(1/k)$ convergence rates in both the primal and dual gaps (Li et al., 2023): the primal gap satisfies

$$f(x_k) - f(x^{\star}) = \mathcal{O}(1/k),$$

and an analogous $\mathcal{O}(1/k)$ bound controls the dual gap, measured through the distance of the running-average gradient to the dual cone $\mathcal{K}^{*}$, up to a nonnegative term. These bounds translate directly to the number of iterations required for a target primal or dual accuracy.

A distinctive feature is an analytic stopping certificate built from the running average of gradients $\bar{g}_k$: the distance $\mathrm{dist}(\bar{g}_k, \mathcal{K}^{*})$, together with the complementary-slackness residual $|\langle \bar{g}_k, x_k \rangle|$, monitors the KKT residuals, certifying $\epsilon$-solution status when it drops below $\epsilon$.
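
As a simple illustration, the certificate below specializes to the self-dual cone $\mathcal{K} = \mathbb{R}^n_{+}$, where the distance to the dual cone has a closed form; the exact way (Li et al., 2023) combines these residuals may differ:

```python
import numpy as np

def kkt_certificate(gbar, x):
    """KKT residual monitor for min f over K = R^n_+, built from the
    running-average gradient gbar: dual feasibility asks gbar in K*
    (= K here), complementary slackness asks <gbar, x> = 0. Stop when
    the sum drops below a target epsilon."""
    dual_infeas = np.linalg.norm(np.minimum(gbar, 0.0))  # dist(gbar, K*)
    comp_slack = abs(gbar @ x)                           # |<gbar, x>|
    return dual_infeas + comp_slack
```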

For unconstrained (non-conic) problems, global convergence is established under standard smoothness and convexity assumptions, with an $R$-linear rate for strongly convex $f$ (Liu et al., 2019).

4. Momentum and Preconditioning Variants

The MOCO (Momentum Conic Descent) algorithm introduces heavy-ball-type averaging of gradients,

$$\bar{g}_k = (1 - \beta_k)\,\bar{g}_{k-1} + \beta_k\, \nabla f(x_k),$$

with weights $\beta_k \in (0, 1]$, yielding smoothed dual steps and mitigated oscillations (Li et al., 2023). CD and MOCO share the same $\mathcal{O}(1/k)$ rate, though MOCO's bound includes a nonnegative gap term expressing its momentum benefit.

Preconditioning is enabled by the dual convergence bounds' dependence on the conditioning of $f$. Changing variables $x = Pz$ for a well-conditioned matrix $P$ can dramatically reduce the iteration count for a given dual error. The ideal $P$ balances coordinate-wise Lipschitz constants, reducing both the effective smoothness constant and the dual-error bound.
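
A minimal sketch of the averaging step follows; the constant weight `beta` and the `lmo` oracle interface are illustrative assumptions (the paper's weight sequence may vary with $k$):

```python
def moco_direction(gbar_prev, g_k, lmo, beta=0.1):
    """MOCO-style search direction: heavy-ball averaging of gradients,
    then the Frank-Wolfe LMO applied to the smoothed average instead of
    the raw gradient. Returns the direction and the updated average."""
    gbar = (1.0 - beta) * gbar_prev + beta * g_k   # smoothed gradient
    return lmo(gbar), gbar
```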

5. Algorithmic Summaries and Key Parameters

In unconstrained smooth optimization, CD employs the following scheme (a runnable skeleton follows the list):

  • For the first iteration ($k = 0$), select the step length heuristically.
  • For $k \ge 1$, test quadraticity via $\mu_k$. If the objective is not locally quadratic, use the conic model and its stationary stepsize $\alpha_k^{*}$; otherwise, fall back to quadratic models or finite differences.
  • Safeguard key model parameters: clamp the conic term $b_k^T g_k$ and the curvature term $g_k^T B_k g_k$ away from degenerate values; limit the stepsize $\alpha_k$ to safeguard bounds $[\alpha_{\min}, \alpha_{\max}]$.
  • Apply a nonmonotone line search (Zhang–Hager) to validate new iterates.
  • For fallback models, test for gradient collinearity and adapt the stepsize accordingly, with threshold parameters governing when to increase the step and when to switch to a finite-difference Hessian estimate.
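
Here is a runnable skeleton of this outer loop, reusing the `conic_stepsize` sketch from Section 1; the choices $B_k = I$ and $b_k = y_{k-1}/(s_{k-1}^T y_{k-1})$, the max-of-history acceptance test (a simple stand-in for the Zhang–Hager weighted average), and all thresholds are placeholders, not the paper's settings:

```python
import numpy as np

def cd_loop(f, grad, x0, iters=500, tol=1e-6):
    """Outer-loop skeleton of the scheme above. B_k = I and
    b_k = y/(s @ y) are stand-ins for the paper's model updates;
    `conic_stepsize` is the sketch from Section 1."""
    x = np.asarray(x0, dtype=float)
    g, fx = grad(x), f(x)
    alpha = 1.0 / max(1.0, np.linalg.norm(g))   # k = 0: heuristic step
    hist = [fx]                                 # nonmonotone reference window
    for _ in range(iters):
        if np.linalg.norm(g) <= tol:
            break
        # Nonmonotone acceptance: backtrack until the trial point beats
        # the worst recent value minus a sufficient-decrease margin.
        while alpha > 1e-12 and f(x - alpha * g) > max(hist) - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        x_new = x - alpha * g
        g_new, f_new = grad(x_new), f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 0:                           # curvature usable: conic step
            alpha = conic_stepsize(fx, f_new, g, g_new, x, x_new,
                                   Bg=g_new, b=y / (s @ y))
        x, g, fx = x_new, g_new, f_new
        hist = (hist + [fx])[-10:]              # bounded history window
    return x
```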

In the conic case, each iteration of CD (or MOCO) alternates the explicit ray-minimization and Frank–Wolfe subproblems on $\mathcal{K}$, with memory and per-iteration cost scaling linearly in the problem dimension, given access to gradients and projections.

6. Memory-Efficient Variants for Large-Scale Semidefinite Programs

Large SDPs, especially in lifted formulations where an $n$-dimensional signal is lifted to an $n \times n$ matrix variable, pose severe memory barriers. The memory-efficient MOCO (Li et al., 2023):

  • Works with reduced variables, such as the image of the matrix variable under the measurement map, in $\mathbb{R}^m$ rather than with the full $n \times n$ matrix.
  • Maintains random sketches $S_k = X_k \Omega$ for a fixed Gaussian matrix $\Omega \in \mathbb{R}^{n \times r}$, updating via $S_{k+1} = \eta_k S_k + t_k\, v_k (v_k^T \Omega)$, where $v_k$ solves the current Frank–Wolfe subproblem.
  • Enables recovery of low-rank approximations to $X_k$ from $S_k$ and $\Omega$ with controlled error, reducing storage from $\mathcal{O}(n^2)$ to $\mathcal{O}(nr)$; see the sketch after this list.
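
The sketch updates and a standard Nyström-style factored recovery fit in a few lines; the shift parameter `nu` and the recovery recipe are common choices from the sketching literature, not necessarily the papers' exact procedure:

```python
import numpy as np

def init_sketch(n, r, seed=0):
    """Fixed Gaussian test matrix Omega and the sketch S = X @ Omega
    for X = 0; r is the user's memory/accuracy knob."""
    Omega = np.random.default_rng(seed).standard_normal((n, r))
    return Omega, np.zeros((n, r))

def update_sketch(S, Omega, eta, t, v):
    """Track the rank-one CD/MOCO update X <- eta*X + t * v v^T without
    ever forming X: S = X @ Omega obeys S <- eta*S + t * v (v^T Omega)."""
    return eta * S + t * np.outer(v, v @ Omega)

def recover_factor(S, Omega, nu=1e-8):
    """Nystrom-style PSD recovery in factored form: returns U with
    X_hat = U @ U.T, where X_hat = Y (Omega^T Y)^+ Y^T, Y = S + nu*Omega."""
    Y = S + nu * Omega                        # shift for numerical stability
    B = Omega.T @ Y
    L = np.linalg.cholesky((B + B.T) / 2.0)   # symmetrize before factoring
    return np.linalg.solve(L, Y.T).T          # U = Y @ inv(L).T
```

Storage is two $n \times r$ arrays ($S$ and $\Omega$), so memory scales as $\mathcal{O}(nr)$ rather than $\mathcal{O}(n^2)$.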

This scheme preserves the $\mathcal{O}(1/k)$ convergence of the primal and dual certificates and has been validated empirically on matrix completion and phase retrieval tasks, where it achieves solution quality comparable to classic CD and Frank–Wolfe with significantly lower memory usage and runtime (Li et al., 2023).

7. Empirical Performance and Practical Implementation

Numerical experiments on benchmark suites (80 problems from Andrei's collection and 144 from CUTEr) (Liu et al., 2019) and on large SDPs (Li et al., 2023) demonstrate:

  • On unconstrained problems, CD matches or outperforms Barzilai-Borwein (BB), spectral BB, CGOPT, and CG_DESCENT in both iteration count and total function/gradient evaluations.
  • CD solves all 80 large-scale Andrei problems, compared to 76 for its closest fallback variant; in function calls, CD wins on 77% of problems over BB/SBB4.
  • In SDPs, memory-efficient MOCO (and MOCO with greedy rank-update steps) achieves the lowest primal error versus time for large-scale matrix completion (matrix dimensions up to 1600) and recovers high-dimensional images in lifted phase retrieval at half the runtime and memory of naive approaches.

Algorithmic cost per iteration involves only a single gradient evaluation, several inner products, and sparse matrix-vector updates in standard settings; in memory-optimized SDP variants, the extra cost is limited to operations on sketches of size $n \times r$.

Safeguard and tuning parameters, such as the quadraticity threshold, stepsize bounds, momentum weights, sketch size, and nonmonotone search settings, should be adapted to the problem's scaling to control step lengths, convergence speed, and stability.


References:

  • "An Improved Gradient Method with Approximately Optimal Stepsize Based on Conic model for Unconstrained Optimization" (Liu et al., 2019)
  • "Conic Descent Redux for Memory-Efficient Optimization" (Li et al., 2023)
