
Momentum-Conic Descent (MOCO) Optimization

Updated 24 January 2026
  • Momentum-Conic Descent (MOCO) is a first-order optimization method for convex conic programs that augments Conic Descent with heavy-ball momentum to accelerate convergence in both primal and dual formulations.
  • It employs a geometric ray-search strategy that alternates between ray minimization and a Frank–Wolfe-type subproblem for efficient descent over closed convex cones.
  • MOCO integrates preconditioning and memory-efficient sketching techniques, making it highly effective for large-scale semidefinite programming in signal processing and machine learning.

Momentum-Conic Descent (MOCO) is an advanced first-order optimization method designed for convex conic programs where the objective is minimized over a closed convex cone. MOCO generalizes the original Conic Descent (CD) algorithm by incorporating a heavy-ball momentum term, yielding enhanced convergence rates and efficiency in both primal and dual formulations. This algorithm is particularly relevant for large-scale semidefinite programming (SDP) problems in signal processing and machine learning, and introduces innovations in stopping criteria, preconditioning, and memory-efficient computation for low-rank solutions (Li et al., 2023).

1. Primal and Dual Formulation of Conic Programs

Consider the convex conic program:

\min_{x} \; f(x) \quad \text{subject to} \;\; x \in \mathcal{K},

where \mathcal{K} \subseteq \mathbb{R}^d is a closed convex cone, and f:\mathcal{K} \rightarrow \mathbb{R} is convex and differentiable. The equivalent unconstrained formulation leverages the indicator function:

\min_{x} \; F(x) := f(x) + \mathbb{I}_{\mathcal{K}}(x).

Introducing a multiplier y for the conic constraint yields the Lagrangian:

L(x, y) = f(x) + \mathbb{I}_{\mathcal{K}}(x) - \langle y, x \rangle, \quad y \in \mathbb{R}^d,

resulting in the dual problem:

\sup_{y \in \mathcal{K}^*} \; [-f^*(y)],

where f^* is the convex conjugate of f, and \mathcal{K}^* = \{y \in \mathbb{R}^d : \langle y, x \rangle \geq 0 \;\; \forall x \in \mathcal{K}\} is the dual cone. Strong duality holds under mild regularity conditions such as Slater's condition.
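As a concrete illustration of the dual cone (our example, not from the paper): the nonnegative orthant \mathbb{R}^d_+ is self-dual, which can be checked numerically by sampling.

```python
import numpy as np

# Illustration (our example): the nonnegative orthant K = R^d_+ is self-dual,
# i.e. K* = {y : <y, x> >= 0 for all x in K} equals K itself.
rng = np.random.default_rng(0)
d = 5

def in_dual_cone(y, n_samples=1000):
    """Sample-based check that <y, x> >= 0 for random x in K = R^d_+."""
    X = rng.uniform(0.0, 1.0, size=(n_samples, d))  # nonnegative test points
    return bool(np.all(X @ y >= -1e-12))

y_in = np.abs(rng.normal(size=d))   # y in K, hence in K* by self-duality
y_out = -np.ones(d)                 # negative vector: outside the dual cone
print(in_dual_cone(y_in), in_dual_cone(y_out))  # True False
```

A sampled check like this is only a heuristic certificate; for the orthant the membership test y \geq 0 is of course exact.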

2. Geometric Ray-Search Intuition and Algorithmic Structure

Every x \in \mathcal{K} admits the representation x = \eta v, with v \in \mathcal{K}, \|v\| = 1, and scalar \eta \geq 0. The algorithm first solves a univariate problem along each ray:

\eta^*(v) = \arg\min_{\eta \geq 0} \; f(\eta v).

Finding the optimal v^* reduces to a compact search over directions on the cone. Conic Descent alternates between ray minimization and a Frank–Wolfe-type subproblem for ray search:

  • Ray minimization: \eta_k = \arg\min_{\eta \geq 0} f(\eta x_k).
  • Ray search: v_k = \arg\min_{v \in \mathcal{K}, \|v\| \leq 1} \langle g_k, v \rangle, with g_k the current descent direction.

MOCO extends this by incorporating a momentum term via heavy-ball averaging for g_k, enhancing descent speed.

3. Momentum-Conic Descent (MOCO) Algorithm

MOCO iteratively updates both the search direction and scaling using momentum-augmented gradients. The principal steps per iteration k are:

  1. Ray Minimization: \eta_k = \arg\min_{\eta \geq 0} f(\eta x_k).
  2. Momentum Update: g_k = (1 - \delta_k) g_{k-1} + \delta_k \nabla f(\eta_k x_k), with \delta_k = 2/(k+2).
  3. Frank–Wolfe Subproblem (Ray Search): v_k = \arg\min_{v \in \mathcal{K}, \|v\| \leq 1} \langle g_k, v \rangle.
  4. Step-Size Line Search: \theta_k = \arg\min_{\theta \geq 0} f(\eta_k x_k + \theta v_k).
  5. Primal Update: x_{k+1} = \eta_k x_k + \theta_k v_k.

At termination after K iterations, the solution is given by \hat{x} = \eta_K x_K.
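To make the five steps concrete, here is a minimal runnable sketch (our construction, not the paper's reference code) for the toy objective f(x) = \tfrac{1}{2}\|Ax - b\|^2 over K = \mathbb{R}^d_+, for which both line searches and the ray search have closed forms:

```python
import numpy as np

# Illustrative MOCO sketch (our toy instance): f(x) = 0.5 * ||A x - b||^2 over
# K = R^d_+, where the Frank-Wolfe ray search has the closed form
# v_k = u / ||u|| with u = max(-g_k, 0).
rng = np.random.default_rng(1)
n, d = 30, 10
A = rng.normal(size=(n, d))
b = A @ np.maximum(rng.normal(size=d), 0.0)      # consistent nonnegative target

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

def ray_search(g):                               # step 3 on the orthant
    u = np.maximum(-g, 0.0)
    nu = np.linalg.norm(u)
    return u / nu if nu > 0 else u

x, g = np.ones(d), np.zeros(d)
f0 = f(x)
for k in range(500):
    Ax = A @ x
    eta = max(Ax @ b, 0.0) / (Ax @ Ax) if Ax @ Ax > 0 else 0.0  # step 1 (exact)
    delta = 2.0 / (k + 2.0)
    g = (1.0 - delta) * g + delta * grad(eta * x)               # step 2 (momentum)
    v = ray_search(g)                                           # step 3
    Av = A @ v
    theta = max(Av @ (b - eta * Ax), 0.0) / (Av @ Av) if Av @ Av > 0 else 0.0  # step 4
    x = eta * x + theta * v                                     # step 5
print(f0, f(x))  # the exact line searches make the objective nonincreasing
```

Because both \eta_k and \theta_k are exact minimizers over a feasible set containing the previous point, each iteration can only decrease f; the closed-form ray search is specific to the orthant, and a general cone would require its own subproblem solver.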


4. Convergence Rates and Proof Sketches

Convergence analysis for MOCO under strict convexity and Lipschitz gradient conditions shows:

  • Primal Rate:

f(\eta_{k+1} x_{k+1}) - f(x^*) \leq \frac{2L\|x^*\|^2}{k+2} - \rho_k,

where \rho_k \geq 0 quantifies additional reduction from momentum.

  • Dual Rate:

[\mathrm{dist}_*(\nabla f(\eta_k x_k), \mathcal{K}^*)]^2 \leq \frac{4L^2\|x^*\|^2}{k+1}.

Thus, an \epsilon-approximate KKT point is obtained in O(L^2\|x^*\|^2/\epsilon) iterations.
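Concretely, the iteration count follows by demanding that the squared dual residual in the bound above fall below \epsilon:

\frac{4L^2\|x^*\|^2}{k+1} \leq \epsilon \quad \Longleftrightarrow \quad k \geq \frac{4L^2\|x^*\|^2}{\epsilon} - 1 = O\!\left(\frac{L^2\|x^*\|^2}{\epsilon}\right).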

The proof leverages Bregman-type lower bounds built from linearizations of ff, and invokes a generalization of Nesterov’s lemma to relate primal and dual gaps.

5. Stopping Criterion and Preconditioning Techniques

Direct computation of the dual residual requires a projection onto \mathcal{K}^*, often computationally expensive. Instead, MOCO uses the subproblem multiplier:

\mathrm{dist}_*(g_k, \mathcal{K}^*) = -\min_{v \in \mathcal{K}, \|v\| \leq 1} \langle g_k, v \rangle = -\langle g_k, v_k \rangle,

with guaranteed rate:

[\mathrm{dist}_*(g_k, \mathcal{K}^*)]^2 \leq \frac{9.7 L^2 \|x^*\|^2}{k+1}.

Termination is certified when \langle g_k, v_k \rangle \geq -O(\sqrt{\epsilon}), yielding \mathrm{dist}_*(\nabla f(\eta_k x_k), \mathcal{K}^*) \leq O(\sqrt{\epsilon}).
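For intuition, the identity between the subproblem multiplier and the dual distance can be verified directly in the self-dual case K = K* = \mathbb{R}^d_+ (our illustrative example), where the dual distance is simply the norm of the negative part of g_k:

```python
import numpy as np

# Illustration for the self-dual cone K = K* = R^d_+ (our example): the
# Frank-Wolfe multiplier -<g, v*> equals the dual distance of g to K*,
# which for the orthant is the norm of the negative part of g.
def fw_multiplier(g):
    u = np.maximum(-g, 0.0)            # unnormalized minimizer direction
    nu = np.linalg.norm(u)
    v = u / nu if nu > 0 else u        # v* = argmin_{v in K, ||v||<=1} <g, v>
    return -float(g @ v)

g = np.array([0.7, -1.2, 3.0, -0.9])
direct = float(np.linalg.norm(np.minimum(g, 0.0)))  # projection-based distance
print(fw_multiplier(g), direct)        # both equal 1.5
```

The multiplier is a free by-product of the ray search, which is why no separate projection onto \mathcal{K}^* is ever needed at run time.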

Preconditioning by a linear change of variables x = Pz can sharply reduce the dual error constant. An appropriately chosen positive-definite P that balances the Hessian and the cone geometry minimizes L\|P^{-1}x^*\|, thereby accelerating convergence.
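The effect can be seen on a toy example (our construction): for a badly column-scaled least-squares objective, a diagonal P > 0 (which maps the orthant onto itself) shrinks the governing constant dramatically.

```python
import numpy as np

# Illustration (our toy example): diagonal preconditioning x = P z for
# f(x) = 0.5 * ||A x - b||^2 replaces L ||x*|| with L_P ||P^{-1} x*||,
# where L_P = ||P A^T A P|| is the Lipschitz constant after the substitution.
rng = np.random.default_rng(2)
scales = np.array([1.0, 1, 1, 1, 1, 1, 1, 100.0])   # one badly scaled column
A = rng.normal(size=(40, 8)) * scales
x_star = np.abs(rng.normal(size=8))                 # hypothetical solution

L = np.linalg.norm(A.T @ A, 2)                      # Lipschitz const. of grad f
base = L * np.linalg.norm(x_star)

col_norms = np.linalg.norm(A, axis=0)
P = np.diag(1.0 / col_norms)                        # equalize column scales
L_P = np.linalg.norm(P @ A.T @ A @ P, 2)
precond = L_P * np.linalg.norm(col_norms * x_star)  # L_P * ||P^{-1} x*||

print(base > precond)  # True: the preconditioned constant is much smaller
```

The column-equilibration choice of P here is one simple heuristic; the paper's criterion balances the Hessian against the cone geometry more generally.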

6. Memory-Efficient MOCO for SDP with Low-Rank Structure

MOCO can be adapted to large-scale semidefinite programs (SDPs) of the form:

\min_X f(\mathcal{G}(X) - z) \quad \text{subject to} \;\; X \in S_n^+,

with \mathcal{G}: S_n \rightarrow \mathbb{R}^d linear and f having an L_f-Lipschitz gradient. To circumvent storing X_k \in \mathbb{R}^{n \times n}, MOCO maintains:

  • y_k = \mathcal{G}(X_k) - z \in \mathbb{R}^d
  • a random sketch S_k = X_k \Omega \in \mathbb{R}^{n \times R}, using a fixed Gaussian \Omega with R \ll n

The affine update for the sketch:

S_{k+1} = \eta_k S_k + \theta_k q_k (q_k^T \Omega),

where q_k is the minimal eigenvector of G_k = \mathcal{G}^*(\nabla f(y_k)). Each Frank–Wolfe iteration costs O(n^2) via Lanczos, with total memory O(d + nR). Recovery of an \epsilon-accurate \hat{X}_k from the sketch is controlled by the true rank r and the excess singular values, provided r < R.
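The sketch update and recovery can be illustrated as follows (our construction: the rank-one directions and step sizes are stand-ins, and the Nyström-type recovery \hat{X} = S(\Omega^T S)^{\dagger} S^T is our assumed procedure, which is exact when rank(X) = r < R; the paper's recovery may differ in details):

```python
import numpy as np

# Illustration of the memory-efficient sketch (our construction): maintain
# S = X @ Omega under affine rank-one updates X <- eta*X + theta*q q^T
# without forming X, then recover X from S.
rng = np.random.default_rng(3)
n, R, r = 50, 8, 3
Omega = rng.normal(size=(n, R))          # fixed Gaussian test matrix

S = np.zeros((n, R))                     # the only large object MOCO keeps
X = np.zeros((n, n))                     # kept here solely to verify recovery
for _ in range(r):
    q = rng.normal(size=n)               # stand-in for the minimal eigenvector
    eta, theta = 0.9, 1.0                # illustrative step sizes
    S = eta * S + theta * np.outer(q, q @ Omega)   # O(nR) sketch update
    X = eta * X + theta * np.outer(q, q)           # ground truth for checking

X_hat = S @ np.linalg.pinv(Omega.T @ S) @ S.T      # Nystrom-type recovery
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))  # tiny: exact since r < R
```

Note that only the O(nR) object S and the O(d) residual vector are ever updated per iteration, matching the stated O(d + nR) memory footprint.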

7. Empirical Performance and Practical Guidelines

Numerical experiments demonstrate MOCO's effectiveness on lifted SDP problems such as matrix completion and phase retrieval:

  • For matrix completion (recovery of A \in S_n^+ from noisy, partial entries), MOCO and CD have comparable runtime–primal-error profiles, but the greedy-accelerated variant MOCOg outperforms all compared methods at large n.
  • For phase-retrieval (rank-1 SDP lifted from quadratic measurements), MOCOg and a heuristic step-size variant (MOCOh) match or surpass CDg in visual quality and runtime-loss performance, significantly outperforming standard Frank–Wolfe approaches.

Noteworthy practical observations include:

  • Momentum-augmented Frank–Wolfe within the conic framework (MOCO) yields tighter convergence bounds via the nonnegative momentum term \rho_k.
  • The stopping criterion \langle g_k, v_k \rangle \approx 0 is efficiently computed and directly certifies dual feasibility.
  • Preconditioners significantly reduce dual residual constants, allowing earlier termination.
  • The memory-efficient variant using sketching is effective for large-scale SDP with rigorous low-rank recovery.
  • Greedy acceleration (Burer–Monteiro step) and heuristic step-size selection (e.g., \theta_k = 2M/(k+2) with M \approx \mathbb{E}[b_i]/m) expedite convergence without substantial additional memory cost (Li et al., 2023).