
Conic Descent Optimization

Updated 24 January 2026
  • Conic Descent is a family of first-order optimization techniques that use conic models of the objective together with geometric duality to handle both unconstrained and conic-constrained problems.
  • CD methods compute effective stepsizes via trial steps and rigorous safeguards, ensuring convergence under smoothness and convexity assumptions.
  • Variants like MOCO integrate momentum and memory-efficient strategies to tackle large-scale semidefinite programming with proven O(1/k) convergence rates.

Conic Descent (CD) refers to a family of first-order optimization algorithms designed for both unconstrained and conic-constrained problems. Distinguished by their use of conic models and geometric duality, CD methods systematically deliver efficient stepsizes along search directions and are particularly suited for large-scale, low-storage scenarios in signal processing, machine learning, and semidefinite programming (SDP). CD fuses structure-exploiting model construction, rigorous convergence guarantees, and explicit dual certificates, and it can be extended with momentum and memory-efficient variants.

1. Conic Descent for Unconstrained Optimization

The archetypal Conic Descent method for unconstrained smooth minimization was introduced and rigorously analyzed in (Liu et al., 2019). At each iterate $x_k$, the method constructs a local conic model of the objective $f$:

$$\phi_k(d) = f_k + \frac{g_k^T d}{1 + b_k^T d} + \frac{1}{2}\,\frac{d^T B_k d}{(1 + b_k^T d)^2}$$

where $f_k = f(x_k)$, $g_k = \nabla f(x_k)$, and $B_k$ is a positive definite matrix. The conic model is selected whenever a quadratic model is poorly justified, as detected by the "closeness-to-quadratic" statistic

$$\mu_k = \left|\, 2\,\frac{f_{k-1}-f_k + g_k^T s_{k-1}}{s_{k-1}^T y_{k-1}} - 1 \,\right|$$

with $s_{k-1}=x_k-x_{k-1}$ and $y_{k-1}=g_k-g_{k-1}$.

The trial step $d = -\alpha g_k$ leads to a scalarized conic model in $\alpha$,

$$\phi_k^1(\alpha) = f_k - \frac{\alpha\, g_k^T g_k}{1 - \alpha\, b_k^T g_k} + \frac{\alpha^2}{2}\,\frac{g_k^T B_k g_k}{(1 - \alpha\, b_k^T g_k)^2}$$

whose stationary point,

$$\alpha_k^S = \frac{g_k^T g_k}{g_k^T B_k g_k + (g_k^T g_k)(b_k^T g_k)}$$

is used—subject to safeguards and projection onto Barzilai-Borwein stepsize bounds, when gradient history allows.

The full algorithm incorporates a Zhang–Hager nonmonotone line search and, as a fallback, quadratic or derivative-based models when the conic step is invalid. Inner products and updates incur only $O(n)$ storage, and the method is robust to poor curvature information.
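
A minimal NumPy sketch of this stepsize rule is given below; the function name, the safeguard defaults, and the handling of an invalid denominator are illustrative assumptions rather than the reference implementation of (Liu et al., 2019).

```python
import numpy as np

def conic_trial_stepsize(g, B, b, lam_min=1e-30, lam_max=1e30):
    """Stationary point alpha_k^S of the scalarized conic model along d = -alpha*g,
    clamped to the safeguard interval [lam_min, lam_max].

    g : gradient g_k, shape (n,)
    B : positive definite model matrix B_k, shape (n, n)
    b : horizon vector b_k of the conic model, shape (n,)
    """
    gg = g @ g                 # g_k^T g_k
    gBg = g @ (B @ g)          # g_k^T B_k g_k
    bg = b @ g                 # b_k^T g_k
    denom = gBg + gg * bg
    if denom <= 0.0:           # conic step ill-defined; caller should fall back
        return None
    return float(np.clip(gg / denom, lam_min, lam_max))
```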

2. Conic Descent for General Conic Programs

CD was extended to general conic-constrained optimization, i.e., minimizing $f(x)$ subject to $x\in\mathcal{K}$ for a closed convex cone $\mathcal{K}$, within a clear geometric and dual framework (Li et al., 2023). The primal-dual structure motivates an update that alternates between:

  • Ray minimization: For the current $x_k\in\mathcal{K}$, find a scaling $\eta_k \geq 0$ minimizing $f(\eta_k x_k)$, enforcing $\langle \eta_k x_k, \nabla f(\eta_k x_k)\rangle = 0$.
  • Ray search: At $\eta_k x_k$, minimize the linear surrogate $\langle \nabla f(\eta_k x_k), v\rangle$ over $v\in\mathcal{K}$, $\|v\|\leq 1$; this is a Frank–Wolfe subproblem on the dual. Then perform a one-dimensional search in the direction $v_k$.

The update is thus

$$x_{k+1} = \eta_k x_k + \theta_k v_k$$

with $\theta_k = \arg\min_{\theta\geq 0} f(\eta_k x_k + \theta v_k)$. This structure allows a unified view of CD as alternating between complementary-slackness and dual-feasibility pushes.
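
A concrete, self-contained illustration of this update is sketched below for the nonnegative orthant $\mathcal{K} = \mathbb{R}^n_+$ and a strongly convex quadratic objective; the cone, the random problem instance, and the closed-form ray and line minimizations are assumptions chosen so every subproblem is exact, not the setting of (Li et al., 2023).

```python
import numpy as np

# Random strongly convex quadratic f(x) = 0.5 x'Ax - c'x over K = R^n_+.
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # positive definite
c = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - c @ x
grad = lambda x: A @ x - c

x = np.ones(n)                       # start inside the cone
for k in range(200):
    # 1) Ray minimization: eta_k = argmin_{eta >= 0} f(eta * x_k), exact for a quadratic.
    xAx = x @ A @ x
    eta = max(0.0, (c @ x) / xAx) if xAx > 0 else 0.0
    y = eta * x
    g = grad(y)

    # 2) Ray search: v_k = argmin_{v in K, ||v|| <= 1} <g, v>.
    #    For the orthant this keeps only the negative gradient coordinates.
    v = np.maximum(-g, 0.0)
    nv = np.linalg.norm(v)
    if nv < 1e-10:                   # g lies in the dual cone K*: KKT residual ~ 0
        x = y
        break
    v /= nv

    # 3) One-dimensional search: theta_k = argmin_{theta >= 0} f(y + theta v), exact here.
    theta = max(0.0, -(g @ v) / (v @ A @ v))
    x = y + theta * v

print("final objective:", f(x))
```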

3. Rigorous Convergence Theory and Stopping Criteria

For $L$-smooth and strictly convex $f$ with a convex cone $\mathcal{K}$, CD achieves explicit $O(1/k)$ convergence rates in both primal and dual gaps (Li et al., 2023):

$$f(\eta_{k+1}x_{k+1}) - f(x^*) \leq \frac{2L\|x^*\|^2}{k+2} - \rho_k$$

and

$$[\mathrm{dist}_*(\nabla f(\eta_k x_k), \mathcal{K}^*)]^2 \leq \frac{4L^2 \|x^*\|^2}{k+1}$$

where $\mathcal{K}^*$ is the dual cone and $\rho_k$ is a nonnegative term. These bounds translate directly into the number of iterations required for a target primal or dual accuracy.

A distinctive feature is an analytic stopping certificate: $[\mathrm{dist}_*(g_k,\mathcal{K}^*)]^2 \leq C\,L^2 \|x^*\|^2 / (k+1)$, where $g_k$ is a running average of gradients. The quantity $-\langle g_k, v_k\rangle$ monitors the KKT residual, certifying $\varepsilon$-solution status once it drops below $\sqrt{\varepsilon}$.
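
Reusing the earlier orthant sketch, where $\mathcal{K} = \mathcal{K}^* = \mathbb{R}^n_+$ so that the dual distance is simply the norm of the averaged gradient's negative part, the stopping test could be coded as follows; the helper name and return convention are assumptions.

```python
import numpy as np

def certificate(g_bar, v, eps):
    """Dual-distance and KKT-residual checks for K = K* = R^n_+ (an assumed cone).

    g_bar : running average of gradients
    v     : current ray-search direction v_k
    eps   : target accuracy epsilon
    """
    dual_dist = np.linalg.norm(np.minimum(g_bar, 0.0))   # dist_*(g_bar, K*)
    kkt_residual = -(g_bar @ v)                           # -<g_bar, v_k>
    return dual_dist, kkt_residual, kkt_residual <= np.sqrt(eps)
```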

For unconstrained (non-conic) problems, under standard smoothness and convexity assumptions, global convergence and an $R$-linear rate for strongly convex $f$ are established (Liu et al., 2019).

4. Momentum and Preconditioning Variants

The MOCO (Momentum Conic Descent) algorithm introduces a heavy-ball type averaging of gradients,

$$g_k = (1-\delta_k)\,g_{k-1} + \delta_k\, \nabla f(\eta_k x_k)$$

with $\delta_k = 2/(k+2)$, yielding smoothed dual steps and mitigated oscillations (Li et al., 2023). CD and MOCO share the same $O(1/k)$ rate, though MOCO introduces a nonnegative gap term expressing its momentum benefit.
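
In code, the averaging is a one-line modification of CD; the fragment below (an illustrative sketch, not the paper's implementation) shows the smoothed gradient that would replace the fresh gradient in the ray-search step.

```python
def moco_average(g_prev, grad_new, k):
    """MOCO averaging: g_k = (1 - delta_k) g_{k-1} + delta_k * grad f(eta_k x_k),
    with delta_k = 2 / (k + 2)."""
    delta = 2.0 / (k + 2)
    return (1.0 - delta) * g_prev + delta * grad_new

# Inside the conic CD loop of Section 2 one would replace
#     g = grad(y)
# with
#     g = moco_average(g, grad(y), k)
# and use this smoothed g in the ray-search (Frank-Wolfe) subproblem.
```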

Preconditioning is enabled by the dual convergence bounds' dependence on $L^2\|x^*\|^2$. Changing variables $x=Pz$ for a well-conditioned $P$ can dramatically reduce the iteration count for a given dual error. The ideal $P$ balances coordinate-wise Lipschitz constants, reducing both the Lipschitz constant $L'$ and the solution norm $\|z^*\|$ of the transformed problem.

5. Algorithmic Summaries and Key Parameters

In unconstrained smooth optimization, CD employs the following scheme:

  • For $k=0$, select the step length heuristically.
  • For $k>0$, test quadraticity via $\mu_k$. If the model is not close to quadratic, use the conic model and $\alpha_k^S$; otherwise, fall back to quadratic or difference-based models (see the sketch after this list).
  • Safeguard key model parameters: clamp $\gamma_k\in[0.01,2]$ and $\beta_k\in[-5000,5000]$; limit the stepsize to $\alpha_k \in [\lambda_{\min}, \lambda_{\max}]$ (typically $[10^{-30},10^{30}]$).
  • Apply nonmonotone line search (Zhang–Hager) to validate new iterates.
  • For fallback models, test for gradient collinearity and adapt $\alpha_k$ accordingly (threshold $\xi_3 = 0.9$, $\delta=10$ for step increase, $\tau_k=\min\{0.1\alpha_{k-1},\,0.01\}$ for the finite-difference Hessian).
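
A skeleton of this decision logic is sketched below; the quadraticity threshold mu_tol and the Barzilai-Borwein fallback stepsize are assumptions standing in for the paper's quadratic-model alternatives, and the helper reuses conic_trial_stepsize from Section 1.

```python
import numpy as np

def mu_statistic(f_prev, f_cur, g_cur, s_prev, y_prev):
    """Closeness-to-quadratic statistic mu_k from Section 1."""
    return abs(2.0 * (f_prev - f_cur + g_cur @ s_prev) / (s_prev @ y_prev) - 1.0)

def cd_stepsize(f_prev, f_cur, g_cur, s_prev, y_prev, B, b,
                mu_tol=0.1, lam_min=1e-30, lam_max=1e30):
    """Stepsize choice for k > 0: conic step when far from quadratic,
    otherwise (or if the conic step is invalid) a BB-type fallback.
    mu_tol is an assumed threshold, not a value from the paper."""
    mu = mu_statistic(f_prev, f_cur, g_cur, s_prev, y_prev)
    if mu > mu_tol:
        alpha = conic_trial_stepsize(g_cur, B, b, lam_min, lam_max)
        if alpha is not None:
            return alpha
    alpha_bb = (s_prev @ y_prev) / (y_prev @ y_prev)   # BB fallback (assumption)
    return float(np.clip(alpha_bb, lam_min, lam_max))
```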

In the conic case, at each iteration CD (or MOCO) alternates explicit ray and Frank–Wolfe subproblems on $\mathcal{K}$, with memory and per-iteration cost scaling linearly in $n$ given access to gradients and projections.

6. Memory-Efficient Variants for Large-Scale Semidefinite Programs

Large SDPs, especially in lifted formulations where $X \in \mathbb{S}^n_+$, pose severe memory barriers. The memory-efficient MOCO variant (Li et al., 2023):

  • Works with reduced variables $y_k = \mathcal{G}(X_k)-z$ in $\mathbb{R}^d$ rather than $X_k \in \mathbb{R}^{n \times n}$.
  • Maintains random sketches $S_k = X_k \Omega$ for a fixed Gaussian matrix $\Omega \in \mathbb{R}^{n \times R}$, updating via $S_{k+1} = \eta_k S_k + \theta_k q_k (q_k^T \Omega)$, where $q_k$ solves the current subproblem (see the sketch after this list).
  • Enables recovery of low-rank approximations to $X_k$ from $S_k$ and $\Omega$ with controlled error, reducing storage cost to $O(d + nR)$.
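
The sketched update and a possible recovery step are illustrated below; the rank-$R$ reconstruction shown is the generic Nyström formula for a PSD matrix from a one-sided sketch, included as one reasonable choice rather than the paper's exact recovery routine, and the scalars in the usage example are placeholders.

```python
import numpy as np

n, R = 1000, 10
rng = np.random.default_rng(0)
Omega = rng.standard_normal((n, R))      # fixed Gaussian test matrix
S = np.zeros((n, R))                     # sketch of X_0 = 0; X is never formed

def sketch_update(S, eta, theta, q, Omega):
    """S_{k+1} = eta_k S_k + theta_k q_k (q_k^T Omega), at O(nR) cost per iteration."""
    return eta * S + theta * np.outer(q, q @ Omega)

def nystrom_recovery(S, Omega):
    """Rank-R PSD approximation X_hat = S (Omega^T S)^+ S^T from S = X Omega
    (generic Nystrom reconstruction; an assumption, not the paper's routine)."""
    return S @ np.linalg.pinv(Omega.T @ S) @ S.T

# Example update with placeholder scalars and direction from one CD/MOCO iteration:
q = rng.standard_normal(n)
S = sketch_update(S, eta=0.9, theta=0.1, q=q, Omega=Omega)
# X_hat = nystrom_recovery(S, Omega)   # n x n; form only if/when a full iterate is needed
```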

This scheme preserves $O(1/k)$ convergence for primal and dual certificates and has been validated empirically for matrix completion and phase retrieval tasks, where it achieves solution quality comparable to classic CD and FW but with significantly lower memory usage and runtime (Li et al., 2023).

7. Empirical Performance and Practical Implementation

Numerical experiments (Liu et al., 2019) on benchmark suites (80 problems from Andrei's collection and 144 from CUTEr) and advanced SDPs (Li et al., 2023) demonstrate:

  • On unconstrained problems, CD matches or outperforms Barzilai-Borwein (BB), spectral BB, CGOPT, and CG_DESCENT in both iteration count and total function/gradient evaluations.
  • CD solves all 80 large-scale Andrei problems compared to 76 for its closest fallback-variant; in function calls, CD wins on 77% of problems over BB/SBB4.
  • In SDPs, memory-efficient MOCO (and MOCO with greedy rank-update steps) achieves the lowest primal error versus time for large-scale matrix completion ($n$ up to 1600) and recovers high-dimensional images in lifted phase retrieval at half the runtime and memory of naive approaches.

Algorithmic cost per iteration involves only a single gradient, several inner products, and sparse matrix-vector updates in standard settings; in memory-optimized SDP variants, the extra cost is limited to the sketch size $R$.

Safeguard and tuning parameters such as $\lambda_{\min}$, $\lambda_{\max}$, $\delta$, $\xi_3$, and the nonmonotone search settings should be adapted to the problem's scaling in order to control step lengths, convergence speed, and stability.


References:

  • "An Improved Gradient Method with Approximately Optimal Stepsize Based on Conic model for Unconstrained Optimization" (Liu et al., 2019)
  • "Conic Descent Redux for Memory-Efficient Optimization" (Li et al., 2023)
