
Momentum-Conic Descent (MOCO) Optimization

Updated 24 January 2026
  • Momentum-Conic Descent (MOCO) is a first-order optimization method for convex conic programs that augments Conic Descent with heavy-ball momentum to accelerate convergence in both primal and dual formulations.
  • It employs a geometric ray-search strategy that alternates between ray minimization and a Frank–Wolfe-type subproblem for efficient descent over closed convex cones.
  • MOCO integrates preconditioning and memory-efficient sketching techniques, making it highly effective for large-scale semidefinite programming in signal processing and machine learning.

Momentum-Conic Descent (MOCO) is an advanced first-order optimization method designed for convex conic programs where the objective is minimized over a closed convex cone. MOCO generalizes the original Conic Descent (CD) algorithm by incorporating a heavy-ball momentum term, yielding enhanced convergence rates and efficiency in both primal and dual formulations. This algorithm is particularly relevant for large-scale semidefinite programming (SDP) problems in signal processing and machine learning, and introduces innovations in stopping criteria, preconditioning, and memory-efficient computation for low-rank solutions (Li et al., 2023).

1. Primal and Dual Formulation of Conic Programs

Consider the convex conic program:

\min_{x} \; f(x) \quad \text{subject to} \;\; x \in \mathcal{K},

where \mathcal{K} \subseteq \mathbb{R}^d is a closed convex cone, and f:\mathcal{K} \rightarrow \mathbb{R} is convex and differentiable. The equivalent unconstrained formulation leverages the indicator function:

\min_{x} \; F(x) := f(x) + \mathbb{I}_{\mathcal{K}}(x).

Introducing a multiplier y for the conic constraint yields the Lagrangian:

L(x, y) = f(x) + \mathbb{I}_{\mathcal{K}}(x) - \langle y, x \rangle, \quad y \in \mathbb{R}^d,

resulting in the dual problem:

\sup_{y \in \mathcal{K}^*} \; [-f^*(y)],

where f^* is the convex conjugate of f, and \mathcal{K}^* = \{y \in \mathbb{R}^d : \langle y, x \rangle \geq 0 \;\; \forall x \in \mathcal{K}\} is the dual cone. Strong duality holds under mild regularity conditions such as Slater's condition.
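As a concrete illustration of the dual cone (our example, not from the paper): the nonnegative orthant \mathbb{R}^d_+ is self-dual, which can be checked numerically by sampling.

```python
import numpy as np

# Illustration (our example): the nonnegative orthant K = R^d_+ is self-dual,
# i.e. K* = {y : <y, x> >= 0 for all x in K} equals K itself.
rng = np.random.default_rng(0)
d = 5

def in_dual_cone(y, n_samples=1000):
    """Sample-based check that <y, x> >= 0 for random x in K = R^d_+."""
    X = rng.uniform(0.0, 1.0, size=(n_samples, d))  # nonnegative test points
    return bool(np.all(X @ y >= -1e-12))

y_in = np.abs(rng.normal(size=d))   # y in K, hence in K* by self-duality
y_out = -np.ones(d)                 # negative vector: outside the dual cone
print(in_dual_cone(y_in), in_dual_cone(y_out))  # True False
```

A sampled check like this is only a heuristic certificate; for the orthant the membership test y \geq 0 is of course exact.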

2. Geometric Ray-Search Intuition and Algorithmic Structure

Every x \in \mathcal{K} admits the representation x = \eta v, with v \in \mathcal{K}, \|v\| = 1, and scalar \eta \geq 0. The algorithm first solves a univariate problem along each ray:

\eta^*(v) = \arg\min_{\eta \geq 0} \; f(\eta v).

Finding the optimal v^* reduces to a compact search over directions on the cone. Conic Descent alternates between ray minimization and a Frank–Wolfe-type subproblem for ray search:

  • Ray minimization: \eta_k = \arg\min_{\eta \geq 0} f(\eta x_k).
  • Ray search: v_k = \arg\min_{v \in \mathcal{K}, \|v\| \leq 1} \langle g_k, v \rangle, with g_k the current descent direction.

MOCO extends this by incorporating a momentum term via heavy-ball averaging for g_k, enhancing descent speed.

3. Momentum-Conic Descent (MOCO) Algorithm

MOCO iteratively updates both the search direction and scaling using momentum-augmented gradients. The principal steps per iteration k are:

  1. Ray Minimization: \eta_k = \arg\min_{\eta \geq 0} f(\eta x_k).
  2. Momentum Update: g_k = (1 - \delta_k) g_{k-1} + \delta_k \nabla f(\eta_k x_k), with \delta_k = 2/(k+2).
  3. Frank–Wolfe Subproblem (Ray Search): v_k = \arg\min_{v \in \mathcal{K}, \|v\| \leq 1} \langle g_k, v \rangle.
  4. Step-Size Line Search: \theta_k = \arg\min_{\theta \geq 0} f(\eta_k x_k + \theta v_k).
  5. Primal Update: x_{k+1} = \eta_k x_k + \theta_k v_k.

At termination after K iterations, the solution is given by \hat{x} = \eta_K x_K.
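To make the five steps concrete, here is a minimal runnable sketch (our construction, not the paper's reference code) for the toy objective f(x) = \tfrac{1}{2}\|Ax - b\|^2 over K = \mathbb{R}^d_+, for which both line searches and the ray search have closed forms:

```python
import numpy as np

# Illustrative MOCO sketch (our toy instance): f(x) = 0.5 * ||A x - b||^2 over
# K = R^d_+, where the Frank-Wolfe ray search has the closed form
# v_k = u / ||u|| with u = max(-g_k, 0).
rng = np.random.default_rng(1)
n, d = 30, 10
A = rng.normal(size=(n, d))
b = A @ np.maximum(rng.normal(size=d), 0.0)      # consistent nonnegative target

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

def ray_search(g):                               # step 3 on the orthant
    u = np.maximum(-g, 0.0)
    nu = np.linalg.norm(u)
    return u / nu if nu > 0 else u

x, g = np.ones(d), np.zeros(d)
f0 = f(x)
for k in range(500):
    Ax = A @ x
    eta = max(Ax @ b, 0.0) / (Ax @ Ax) if Ax @ Ax > 0 else 0.0  # step 1 (exact)
    delta = 2.0 / (k + 2.0)
    g = (1.0 - delta) * g + delta * grad(eta * x)               # step 2 (momentum)
    v = ray_search(g)                                           # step 3
    Av = A @ v
    theta = max(Av @ (b - eta * Ax), 0.0) / (Av @ Av) if Av @ Av > 0 else 0.0  # step 4
    x = eta * x + theta * v                                     # step 5
print(f0, f(x))  # the exact line searches make the objective nonincreasing
```

Because both \eta_k and \theta_k are exact minimizers over a feasible set containing the previous point, each iteration can only decrease f; the closed-form ray search is specific to the orthant, and a general cone would require its own subproblem solver.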


4. Convergence Rates and Proof Sketches

Convergence analysis for MOCO under strict convexity and Lipschitz gradient conditions shows:

  • Primal Rate:

f(\eta_{k+1} x_{k+1}) - f(x^*) \leq \frac{2L\|x^*\|^2}{k+2} - \rho_k,

where \rho_k \geq 0 quantifies additional reduction from momentum.

  • Dual Rate:

[\mathrm{dist}_*(\nabla f(\eta_k x_k), \mathcal{K}^*)]^2 \leq \frac{4L^2\|x^*\|^2}{k+1}.

Thus, an \epsilon-approximate KKT point is obtained in O(L^2\|x^*\|^2/\epsilon) iterations.
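Concretely, the iteration count follows by demanding that the squared dual residual in the bound above fall below \epsilon:

\frac{4L^2\|x^*\|^2}{k+1} \leq \epsilon \quad \Longleftrightarrow \quad k \geq \frac{4L^2\|x^*\|^2}{\epsilon} - 1 = O\!\left(\frac{L^2\|x^*\|^2}{\epsilon}\right).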

The proof leverages Bregman-type lower bounds built from linearizations of ff, and invokes a generalization of Nesterov’s lemma to relate primal and dual gaps.

5. Stopping Criterion and Preconditioning Techniques

Direct computation of the dual residual requires a projection onto \mathcal{K}^*, often computationally expensive. Instead, MOCO uses the subproblem multiplier:

\mathrm{dist}_*(g_k, \mathcal{K}^*) = -\min_{v \in \mathcal{K}, \|v\| \leq 1} \langle g_k, v \rangle = -\langle g_k, v_k \rangle,

with guaranteed rate:

[\mathrm{dist}_*(g_k, \mathcal{K}^*)]^2 \leq \frac{9.7 L^2 \|x^*\|^2}{k+1}.

Termination is certified when \langle g_k, v_k \rangle \geq -O(\sqrt{\epsilon}), yielding \mathrm{dist}_*(\nabla f(\eta_k x_k), \mathcal{K}^*) \leq O(\sqrt{\epsilon}).
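For intuition, the identity between the subproblem multiplier and the dual distance can be verified directly in the self-dual case K = K* = \mathbb{R}^d_+ (our illustrative example), where the dual distance is simply the norm of the negative part of g_k:

```python
import numpy as np

# Illustration for the self-dual cone K = K* = R^d_+ (our example): the
# Frank-Wolfe multiplier -<g, v*> equals the dual distance of g to K*,
# which for the orthant is the norm of the negative part of g.
def fw_multiplier(g):
    u = np.maximum(-g, 0.0)            # unnormalized minimizer direction
    nu = np.linalg.norm(u)
    v = u / nu if nu > 0 else u        # v* = argmin_{v in K, ||v||<=1} <g, v>
    return -float(g @ v)

g = np.array([0.7, -1.2, 3.0, -0.9])
direct = float(np.linalg.norm(np.minimum(g, 0.0)))  # projection-based distance
print(fw_multiplier(g), direct)        # both equal 1.5
```

The multiplier is a free by-product of the ray search, which is why no separate projection onto \mathcal{K}^* is ever needed at run time.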

Preconditioning by a linear change of variables x = Pz can sharply reduce the dual error constant. An appropriately chosen positive-definite P that balances the Hessian and the cone geometry minimizes L\|P^{-1}x^*\|, thereby accelerating convergence.
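The effect can be seen on a toy example (our construction): for a badly column-scaled least-squares objective, a diagonal P > 0 (which maps the orthant onto itself) shrinks the governing constant dramatically.

```python
import numpy as np

# Illustration (our toy example): diagonal preconditioning x = P z for
# f(x) = 0.5 * ||A x - b||^2 replaces L ||x*|| with L_P ||P^{-1} x*||,
# where L_P = ||P A^T A P|| is the Lipschitz constant after the substitution.
rng = np.random.default_rng(2)
scales = np.array([1.0, 1, 1, 1, 1, 1, 1, 100.0])   # one badly scaled column
A = rng.normal(size=(40, 8)) * scales
x_star = np.abs(rng.normal(size=8))                 # hypothetical solution

L = np.linalg.norm(A.T @ A, 2)                      # Lipschitz const. of grad f
base = L * np.linalg.norm(x_star)

col_norms = np.linalg.norm(A, axis=0)
P = np.diag(1.0 / col_norms)                        # equalize column scales
L_P = np.linalg.norm(P @ A.T @ A @ P, 2)
precond = L_P * np.linalg.norm(col_norms * x_star)  # L_P * ||P^{-1} x*||

print(base > precond)  # True: the preconditioned constant is much smaller
```

The column-equilibration choice of P here is one simple heuristic; the paper's criterion balances the Hessian against the cone geometry more generally.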

6. Memory-Efficient MOCO for SDP with Low-Rank Structure

MOCO can be adapted to large-scale semidefinite programs (SDPs) of the form:

\min_X f(\mathcal{G}(X) - z) \quad \text{subject to} \;\; X \in S_n^+,

with \mathcal{G}: S_n \rightarrow \mathbb{R}^d linear and f having an L_f-Lipschitz gradient. To circumvent storing X_k \in \mathbb{R}^{n \times n}, MOCO maintains:

  • y_k = \mathcal{G}(X_k) - z \in \mathbb{R}^d
  • a random sketch S_k = X_k \Omega \in \mathbb{R}^{n \times R}, using a fixed Gaussian \Omega with R \ll n

The affine update for the sketch:

S_{k+1} = \eta_k S_k + \theta_k q_k (q_k^T \Omega),

where q_k is the minimal eigenvector of G_k = \mathcal{G}^*(\nabla f(y_k)). Each Frank–Wolfe iteration costs O(n^2) via Lanczos, with total memory O(d + nR). Recovery of an \epsilon-accurate \hat{X}_k from the sketch is controlled by the true rank r and the excess singular values, provided r < R.
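The sketch update and recovery can be illustrated as follows (our construction: the rank-one directions and step sizes are stand-ins, and the Nyström-type recovery \hat{X} = S(\Omega^T S)^{\dagger} S^T is our assumed procedure, which is exact when rank(X) = r < R; the paper's recovery may differ in details):

```python
import numpy as np

# Illustration of the memory-efficient sketch (our construction): maintain
# S = X @ Omega under affine rank-one updates X <- eta*X + theta*q q^T
# without forming X, then recover X from S.
rng = np.random.default_rng(3)
n, R, r = 50, 8, 3
Omega = rng.normal(size=(n, R))          # fixed Gaussian test matrix

S = np.zeros((n, R))                     # the only large object MOCO keeps
X = np.zeros((n, n))                     # kept here solely to verify recovery
for _ in range(r):
    q = rng.normal(size=n)               # stand-in for the minimal eigenvector
    eta, theta = 0.9, 1.0                # illustrative step sizes
    S = eta * S + theta * np.outer(q, q @ Omega)   # O(nR) sketch update
    X = eta * X + theta * np.outer(q, q)           # ground truth for checking

X_hat = S @ np.linalg.pinv(Omega.T @ S) @ S.T      # Nystrom-type recovery
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))  # tiny: exact since r < R
```

Note that only the O(nR) object S and the O(d) residual vector are ever updated per iteration, matching the stated O(d + nR) memory footprint.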

7. Empirical Performance and Practical Guidelines

Numerical experiments demonstrate MOCO's effectiveness on lifted SDP problems such as matrix completion and phase retrieval:

  • For matrix completion (recovery of A \in S_n^+ from noisy, partial entries), MOCO and CD have comparable runtime–primal-error profiles, but the greedy-accelerated variant MOCOg outperforms all compared methods at large n.
  • For phase-retrieval (rank-1 SDP lifted from quadratic measurements), MOCOg and a heuristic step-size variant (MOCOh) match or surpass CDg in visual quality and runtime-loss performance, significantly outperforming standard Frank–Wolfe approaches.

Noteworthy practical observations include:

  • Momentum-augmented Frank–Wolfe within the conic framework (MOCO) yields tighter convergence bounds via the nonnegative momentum term \rho_k.
  • The stopping criterion \langle g_k, v_k \rangle \approx 0 is efficiently computed and directly certifies dual feasibility.
  • Preconditioners significantly reduce dual residual constants, allowing earlier termination.
  • The memory-efficient variant using sketching is effective for large-scale SDP with rigorous low-rank recovery.
  • Greedy acceleration (Burer–Monteiro step) and heuristic step-size selection (e.g., \theta_k = 2M/(k+2) with M \approx \mathbb{E}[b_i]/m) expedite convergence without substantial additional memory cost (Li et al., 2023).