
Conic Descent Optimization

Updated 24 January 2026
  • Conic Descent is a family of first-order optimization methods that exploit conic models of the objective and geometric duality, applicable to both unconstrained and conic-constrained problems.
  • CD methods compute effective stepsizes via trial steps and rigorous safeguards, ensuring convergence under smoothness and convexity assumptions.
  • Variants like MOCO integrate momentum and memory-efficient strategies to tackle large-scale semidefinite programming with proven O(1/k) convergence rates.

Conic Descent (CD) refers to a family of first-order optimization algorithms designed for both unconstrained and conic-constrained problems. Distinguished by their use of conic models and geometric duality, CD methods deliver efficient stepsizes along search directions and are particularly suited to large-scale, low-storage scenarios in signal processing, machine learning, and semidefinite programming (SDP). CD combines structure-exploiting model construction, rigorous convergence guarantees, and explicit dual certificates, and admits momentum and memory-efficient variants.

1. Conic Descent for Unconstrained Optimization

The archetypal Conic Descent method for unconstrained smooth minimization was introduced and rigorously analyzed by Liu et al. (Liu et al., 2019). At each iterate $x_k$, the method constructs a local conic model of the objective $f$:

$$\phi_k(d) = f_k + \frac{g_k^T d}{1 + b_k^T d} + \frac{1}{2}\,\frac{d^T B_k d}{(1 + b_k^T d)^2},$$

where $f_k = f(x_k)$, $g_k = \nabla f(x_k)$, $b_k$ is the vector parameterizing the conic distortion, and $B_k$ is a positive definite matrix. The conic model is selected whenever a quadratic model is poorly justified, as detected by the "closeness-to-quadratic" statistic

$$\mu_k = \left|\, 2\,\frac{f_{k-1} - f_k + g_k^T s_{k-1}}{s_{k-1}^T y_{k-1}} - 1 \,\right|$$

with $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$.

The trial step $d = -\alpha g_k$ turns the conic model into a scalar function of $\alpha$,

$$\phi_k(\alpha) = f_k - \frac{\alpha\,\|g_k\|^2}{1 - \alpha\, b_k^T g_k} + \frac{1}{2}\,\frac{\alpha^2\, g_k^T B_k g_k}{(1 - \alpha\, b_k^T g_k)^2},$$

whose stationary point,

$$\alpha_k^{*} = \frac{\|g_k\|^2}{g_k^T B_k g_k + (b_k^T g_k)\,\|g_k\|^2},$$

is used as the stepsize, subject to safeguards and projection onto Barzilai-Borwein stepsize bounds when gradient history allows.

The full algorithm incorporates a Zhang–Hager nonmonotone line search and, as a fallback, quadratic or derivative-based models when the conic step is invalid. Inner products and updates incur only $\mathcal{O}(n)$ storage, and the method is robust to poor curvature information.
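
As a concrete illustration of this stepsize rule, here is a minimal sketch assuming the caller supplies the product $B_k g_k$ and the vector $b_k$; the function name, the quadraticity threshold `tau`, and the clamp bounds are illustrative choices, not the paper's:

```python
import numpy as np

def conic_stepsize(f_prev, f_curr, g_prev, g_curr, x_prev, x_curr,
                   Bg, b, tau=0.1, bounds=(1e-10, 1e10)):
    """One CD-style stepsize along d = -g_curr. `Bg` stands for
    B_k @ g_k and `b` for the conic vector b_k; `tau` and `bounds`
    are illustrative safeguard values."""
    s = x_curr - x_prev                    # s_{k-1}
    y = g_curr - g_prev                    # y_{k-1}
    # Closeness-to-quadratic statistic mu_k (assumes s @ y != 0).
    mu = abs(2.0 * (f_prev - f_curr + g_curr @ s) / (s @ y) - 1.0)
    gamma = g_curr @ g_curr                # ||g_k||^2
    q = g_curr @ Bg                        # g_k^T B_k g_k
    if mu > tau:                           # quadratic model poorly justified
        denom = q + (b @ g_curr) * gamma   # conic stationary point alpha_k^*
        alpha = gamma / denom if denom > 0 else gamma / q
    else:                                  # quadratic model adequate
        alpha = gamma / q
    # The paper projects onto Barzilai-Borwein-type bounds when gradient
    # history allows; a plain clamp stands in for that safeguard here.
    return float(np.clip(alpha, *bounds))
```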

2. Conic Descent for General Conic Programs

CD was extended to general conic-constrained optimization, minimizing $f(x)$ subject to $x \in \mathcal{K}$ for a closed convex cone $\mathcal{K}$, within a clear geometric and dual framework (Li et al., 2023). The primal-dual structure motivates an update that alternates between:

  • Ray minimization: for the current iterate $x_k$, find the scaling $\eta_k \ge 0$ minimizing $f(\eta\, x_k)$, enforcing the complementary-slackness condition $\langle \nabla f(\eta_k x_k), \eta_k x_k \rangle = 0$ whenever $\eta_k > 0$.
  • Ray search: at $\hat{x}_k = \eta_k x_k$, minimize the linear surrogate $\langle \nabla f(\hat{x}_k), v \rangle$ over $v \in \mathcal{K}$, $\|v\| \le 1$; this is a Frank–Wolfe subproblem on the dual. Then perform a one-dimensional search in the $v_k$ direction.

The update is thus

$$x_{k+1} = \eta_k\, x_k + t_k\, v_k,$$

with $t_k \ge 0$ chosen by the one-dimensional ray search. This structure allows a unified view of CD as alternating between complementary-slackness and dual-feasibility pushes.
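
To make the two-step update concrete, the following is a minimal sketch for the special case $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ over the nonnegative orthant $\mathcal{K} = \mathbb{R}^n_{+}$; the closed-form ray minimization and line search exploit this quadratic instance, and the LMO shown is specific to the orthant rather than a general cone:

```python
import numpy as np

def cd_nonneg_least_squares(A, b, iters=200):
    """Conic descent sketch for min 0.5*||Ax - b||^2 over K = R^n_+.
    Each iteration: (1) ray minimization, rescaling the iterate along
    its own ray; (2) a Frank-Wolfe-style ray search over {v in K,
    ||v|| <= 1}, followed by an exact 1-D line search."""
    n = A.shape[1]
    x = np.zeros(n)
    for _ in range(iters):
        # (1) Ray minimization: eta = argmin_{eta >= 0} f(eta * x),
        # closed-form because f is quadratic.
        Ax = A @ x
        nrm = Ax @ Ax
        x = (max(0.0, (Ax @ b) / nrm) if nrm > 0 else 0.0) * x
        # (2) Ray search: the LMO over the unit ball of the orthant
        # picks the coordinate with the most negative gradient entry.
        g = A.T @ (A @ x - b)
        i = int(np.argmin(g))
        if g[i] >= 0:                 # no descent direction left in K
            break
        Av = A[:, i]
        t = -g[i] / (Av @ Av)         # exact step along v = e_i
        x[i] += t
    return x

# Tiny usage check: a random overdetermined nonnegative least-squares fit.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x = cd_nonneg_least_squares(A, b)
```

For a general cone, the coordinate-picking LMO would be replaced by the cone's own linear minimization oracle (for instance, a leading eigenvector computation for the PSD cone).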

3. Rigorous Convergence Theory and Stopping Criteria

For $L$-smooth and strictly convex $f$ over a closed convex cone $\mathcal{K}$, CD achieves explicit $\mathcal{O}(1/k)$ convergence rates in both the primal and dual gaps (Li et al., 2023): the primal gap satisfies

$$f(x_k) - f(x^{\star}) = \mathcal{O}(1/k),$$

and an analogous $\mathcal{O}(1/k)$ bound controls the dual gap, measured through the distance of the running-average gradient to the dual cone $\mathcal{K}^{*}$, up to a nonnegative term. These bounds translate directly to the number of iterations required for a target primal or dual accuracy.

A distinctive feature is an analytic stopping certificate built from the running average of gradients $\bar{g}_k$: the distance $\mathrm{dist}(\bar{g}_k, \mathcal{K}^{*})$, together with the complementary-slackness residual $|\langle \bar{g}_k, x_k \rangle|$, monitors the KKT residuals, certifying $\epsilon$-solution status when it drops below $\epsilon$.
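
As a simple illustration, the certificate below specializes to the self-dual cone $\mathcal{K} = \mathbb{R}^n_{+}$, where the distance to the dual cone has a closed form; the exact way (Li et al., 2023) combines these residuals may differ:

```python
import numpy as np

def kkt_certificate(gbar, x):
    """KKT residual monitor for min f over K = R^n_+, built from the
    running-average gradient gbar: dual feasibility asks gbar in K*
    (= K here), complementary slackness asks <gbar, x> = 0. Stop when
    the sum drops below a target epsilon."""
    dual_infeas = np.linalg.norm(np.minimum(gbar, 0.0))  # dist(gbar, K*)
    comp_slack = abs(gbar @ x)                           # |<gbar, x>|
    return dual_infeas + comp_slack
```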

For unconstrained (non-conic) problems, global convergence is established under standard smoothness and convexity assumptions, with an $R$-linear rate for strongly convex $f$ (Liu et al., 2019).

4. Momentum and Preconditioning Variants

The MOCO (Momentum Conic Descent) algorithm introduces heavy-ball-type averaging of gradients,

$$\bar{g}_k = (1 - \beta_k)\,\bar{g}_{k-1} + \beta_k\, \nabla f(x_k),$$

with weights $\beta_k \in (0, 1]$, yielding smoothed dual steps and mitigated oscillations (Li et al., 2023). CD and MOCO share the same $\mathcal{O}(1/k)$ rate, though MOCO's bound includes a nonnegative gap term expressing its momentum benefit.

Preconditioning is enabled by the dual convergence bounds' dependence on the conditioning of $f$. Changing variables $x = Pz$ for a well-conditioned matrix $P$ can dramatically reduce the iteration count for a given dual error. The ideal $P$ balances coordinate-wise Lipschitz constants, reducing both the effective smoothness constant and the dual-error bound.
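
A minimal sketch of the averaging step follows; the constant weight `beta` and the `lmo` oracle interface are illustrative assumptions (the paper's weight sequence may vary with $k$):

```python
def moco_direction(gbar_prev, g_k, lmo, beta=0.1):
    """MOCO-style search direction: heavy-ball averaging of gradients,
    then the Frank-Wolfe LMO applied to the smoothed average instead of
    the raw gradient. Returns the direction and the updated average."""
    gbar = (1.0 - beta) * gbar_prev + beta * g_k   # smoothed gradient
    return lmo(gbar), gbar
```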

5. Algorithmic Summaries and Key Parameters

In unconstrained smooth optimization, CD employs the following scheme (a runnable skeleton follows the list):

  • For the first iteration ($k = 0$), select the step length heuristically.
  • For $k \ge 1$, test quadraticity via $\mu_k$. If the objective is not locally quadratic, use the conic model and its stationary stepsize $\alpha_k^{*}$; otherwise, fall back to quadratic models or finite differences.
  • Safeguard key model parameters: clamp the conic term $b_k^T g_k$ and the curvature term $g_k^T B_k g_k$ away from degenerate values; limit the stepsize $\alpha_k$ to safeguard bounds $[\alpha_{\min}, \alpha_{\max}]$.
  • Apply a nonmonotone line search (Zhang–Hager) to validate new iterates.
  • For fallback models, test for gradient collinearity and adapt the stepsize accordingly, with threshold parameters governing when to increase the step and when to switch to a finite-difference Hessian estimate.
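
Here is a runnable skeleton of this outer loop, reusing the `conic_stepsize` sketch from Section 1; the choices $B_k = I$ and $b_k = y_{k-1}/(s_{k-1}^T y_{k-1})$, the max-of-history acceptance test (a simple stand-in for the Zhang–Hager weighted average), and all thresholds are placeholders, not the paper's settings:

```python
import numpy as np

def cd_loop(f, grad, x0, iters=500, tol=1e-6):
    """Outer-loop skeleton of the scheme above. B_k = I and
    b_k = y/(s @ y) are stand-ins for the paper's model updates;
    `conic_stepsize` is the sketch from Section 1."""
    x = np.asarray(x0, dtype=float)
    g, fx = grad(x), f(x)
    alpha = 1.0 / max(1.0, np.linalg.norm(g))   # k = 0: heuristic step
    hist = [fx]                                 # nonmonotone reference window
    for _ in range(iters):
        if np.linalg.norm(g) <= tol:
            break
        # Nonmonotone acceptance: backtrack until the trial point beats
        # the worst recent value minus a sufficient-decrease margin.
        while alpha > 1e-12 and f(x - alpha * g) > max(hist) - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        x_new = x - alpha * g
        g_new, f_new = grad(x_new), f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 0:                           # curvature usable: conic step
            alpha = conic_stepsize(fx, f_new, g, g_new, x, x_new,
                                   Bg=g_new, b=y / (s @ y))
        x, g, fx = x_new, g_new, f_new
        hist = (hist + [fx])[-10:]              # bounded history window
    return x
```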

In the conic case, each iteration of CD (or MOCO) alternates the explicit ray-minimization and Frank–Wolfe subproblems on $\mathcal{K}$, with memory and per-iteration cost scaling linearly in the problem dimension, given access to gradients and projections.

6. Memory-Efficient Variants for Large-Scale Semidefinite Programs

Large SDPs, especially in lifted formulations where an $n$-dimensional signal is lifted to an $n \times n$ matrix variable, pose severe memory barriers. The memory-efficient MOCO (Li et al., 2023):

  • Works with reduced variables, such as the image of the matrix variable under the measurement map, in $\mathbb{R}^m$ rather than with the full $n \times n$ matrix.
  • Maintains random sketches $S_k = X_k \Omega$ for a fixed Gaussian matrix $\Omega \in \mathbb{R}^{n \times r}$, updating via $S_{k+1} = \eta_k S_k + t_k\, v_k (v_k^T \Omega)$, where $v_k$ solves the current Frank–Wolfe subproblem.
  • Enables recovery of low-rank approximations to $X_k$ from $S_k$ and $\Omega$ with controlled error, reducing storage from $\mathcal{O}(n^2)$ to $\mathcal{O}(nr)$; see the sketch after this list.
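
The sketch updates and a standard Nyström-style factored recovery fit in a few lines; the shift parameter `nu` and the recovery recipe are common choices from the sketching literature, not necessarily the papers' exact procedure:

```python
import numpy as np

def init_sketch(n, r, seed=0):
    """Fixed Gaussian test matrix Omega and the sketch S = X @ Omega
    for X = 0; r is the user's memory/accuracy knob."""
    Omega = np.random.default_rng(seed).standard_normal((n, r))
    return Omega, np.zeros((n, r))

def update_sketch(S, Omega, eta, t, v):
    """Track the rank-one CD/MOCO update X <- eta*X + t * v v^T without
    ever forming X: S = X @ Omega obeys S <- eta*S + t * v (v^T Omega)."""
    return eta * S + t * np.outer(v, v @ Omega)

def recover_factor(S, Omega, nu=1e-8):
    """Nystrom-style PSD recovery in factored form: returns U with
    X_hat = U @ U.T, where X_hat = Y (Omega^T Y)^+ Y^T, Y = S + nu*Omega."""
    Y = S + nu * Omega                        # shift for numerical stability
    B = Omega.T @ Y
    L = np.linalg.cholesky((B + B.T) / 2.0)   # symmetrize before factoring
    return np.linalg.solve(L, Y.T).T          # U = Y @ inv(L).T
```

Storage is two $n \times r$ arrays ($S$ and $\Omega$), so memory scales as $\mathcal{O}(nr)$ rather than $\mathcal{O}(n^2)$.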

This scheme preserves the $\mathcal{O}(1/k)$ convergence of the primal and dual certificates and has been validated empirically on matrix completion and phase retrieval tasks, where it achieves solution quality comparable to classic CD and Frank–Wolfe with significantly lower memory usage and runtime (Li et al., 2023).

7. Empirical Performance and Practical Implementation

Numerical experiments on benchmark suites (80 problems from Andrei's collection and 144 from CUTEr) (Liu et al., 2019) and on large SDPs (Li et al., 2023) demonstrate:

  • On unconstrained problems, CD matches or outperforms Barzilai-Borwein (BB), spectral BB, CGOPT, and CG_DESCENT in both iteration count and total function/gradient evaluations.
  • CD solves all 80 large-scale Andrei problems, compared to 76 for its closest fallback variant; in function calls, CD wins on 77% of problems over BB/SBB4.
  • In SDPs, memory-efficient MOCO (and MOCO with greedy rank-update steps) achieves the lowest primal error versus time for large-scale matrix completion (matrix dimensions up to 1600) and recovers high-dimensional images in lifted phase retrieval at half the runtime and memory of naive approaches.

Algorithmic cost per iteration involves only a single gradient evaluation, several inner products, and sparse matrix-vector updates in standard settings; in memory-optimized SDP variants, the extra cost is limited to operations on sketches of size $n \times r$.

Safeguard and tuning parameters, such as the quadraticity threshold, stepsize bounds, momentum weights, sketch size, and nonmonotone search settings, should be adapted to the problem's scaling to control step lengths, convergence speed, and stability.


References:

  • "An Improved Gradient Method with Approximately Optimal Stepsize Based on Conic model for Unconstrained Optimization" (Liu et al., 2019)
  • "Conic Descent Redux for Memory-Efficient Optimization" (Li et al., 2023)
