
SPG-AHOC: Smoothed Proximal Gradient for DAG Recovery

Updated 2 February 2026
  • SPG-AHOC is an optimization framework that uses a smoothing surrogate and proximal gradient updates to achieve exact support recovery in high-dimensional linear SEMs.
  • It combines structured sparsity regularization with a hybrid-order acyclicity constraint to enable finite-iteration support identification in causal discovery.
  • Empirical results demonstrate improved sparsity, superior structural recovery, and enhanced computational efficiency compared to methods like NOTEARS and DAGMA.

Smoothed Proximal Gradient (SPG-AHOC) is an optimization framework that combines the machinery of the smoothed proximal gradient (SPG) method with new structural constraints, primarily for exact support recovery in high-dimensional, constrained estimation settings. Originally developed for structured sparse regression, SPG has recently been adapted to causal discovery with the SPG-AHOC algorithm, enabling finite-iteration support identification in linear structural equation models under acyclicity constraints (Wu et al., 26 Jan 2026). The integration of smoothed regularization, non-trivial manifold identification, and constraint formulation achieves superior structure recovery compared to traditional continuous optimization approaches.

1. Fundamental Principles of Smoothed Proximal Gradient Methods

The SPG framework addresses optimization of composite objectives in which a smooth data term is combined with a nonsmooth regularizer, as in structured sparse regression. A distinctive feature is the use of a smoothing surrogate, typically via Nesterov's technique, for the nonseparable and nonsmooth regularizer. The general form is

$\min_{\beta} f(\beta) := g(\beta) + \Omega(\beta) + \lambda\|\beta\|_1$

where $g$ is smooth (e.g., squared loss) and $\Omega$ encodes structured sparsity (overlapping group lasso, graph-fused lasso, etc.). Smoothing replaces $\Omega$ with a differentiable approximation $\Omega_\mu$, yielding a regularized problem amenable to first-order methods with controlled bias:

$\Omega_\mu(\beta) = \max_{\alpha \in Q} \left[\alpha^\top C \beta - \mu\, d(\alpha)\right]$

with gradient $\nabla \Omega_\mu(\beta) = C^\top \alpha^*$ and Lipschitz constant $L = L_g + L_\Omega$ (Chen et al., 2010; Chen et al., 2012).
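As an illustration, for the simple structured penalty $\Omega(\beta) = \|C\beta\|_1$ with $Q = \{\alpha : \|\alpha\|_\infty \le 1\}$ and $d(\alpha) = \tfrac{1}{2}\|\alpha\|^2$, the inner maximizer has the closed form $\alpha^* = \mathrm{clip}(C\beta/\mu, -1, 1)$ and the smoothed penalty reduces to a Huber-type function. The sketch below is illustrative (not code from the cited papers) and evaluates $\Omega_\mu$ together with its gradient $C^\top \alpha^*$:

```python
import numpy as np

def smoothed_penalty(beta, C, mu):
    """Nesterov-smoothed surrogate of Omega(beta) = ||C beta||_1.

    Uses Q = {alpha : ||alpha||_inf <= 1} and d(alpha) = ||alpha||^2 / 2,
    so the inner maximizer is alpha* = clip(C beta / mu, -1, 1).
    Returns the value Omega_mu(beta) and the gradient C^T alpha*.
    """
    u = C @ beta
    alpha_star = np.clip(u / mu, -1.0, 1.0)
    value = alpha_star @ u - 0.5 * mu * np.sum(alpha_star ** 2)
    grad = C.T @ alpha_star
    return value, grad
```

As $\mu \to 0$ the surrogate approaches $\Omega$ uniformly, with bias at most $\mu \max_{\alpha \in Q} d(\alpha)$, which is the controlled-bias property exploited throughout.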

Acceleration strategies, e.g., FISTA-type extrapolation, further improve the iteration complexity to $O(1/\varepsilon)$ for the objective-value error, surpassing classical subgradient schemes (Chen et al., 2012; Zhang et al., 2021).

2. Hybrid-Order Acyclicity Constraint and Problem Formulation

In the causal discovery setting, SPG-AHOC extends the conventional SPG machinery to optimization over the space of weighted adjacency matrices $W \in \mathbb{R}^{d \times d}$ for linear structural equation models (SEMs). The optimization problem is

$\min_{W} L(W;X) + \lambda_1 \|W\|_1 \quad \text{subject to } h_{\mathrm{AHOC}}(W) = 0$

with $L(W;X)$ the least-squares data fit and $\|W\|_1$ promoting sparsity (Wu et al., 26 Jan 2026). Critical to the approach is the Hybrid-Order Acyclicity Constraint (AHOC), which unifies quadratic and $\ell_1$ core terms:

$M(W; \alpha) = \alpha\,(W \circ W) + (1-\alpha)\,|W|$

$B(W; \alpha, \epsilon) = \dfrac{M(W; \alpha)}{\|M(W; \alpha)\|_F + \epsilon}$

$h_{\mathrm{AHOC}}(W; \alpha, \epsilon) = \mathrm{tr}\,\exp\big(B(W; \alpha, \epsilon)\big) - d$

AHOC exhibits essential stability properties: its gradient does not vanish near $W = 0$ and remains uniformly bounded near cyclic regions, unlike alternatives such as $h_{\exp}(W) = \mathrm{tr}\,\exp(W \circ W) - d$ (Wu et al., 26 Jan 2026).
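A direct transcription of these formulas can be written as follows; this is a sketch under the stated definitions, and the function name and default parameters are illustrative, not taken from the paper's code:

```python
import numpy as np
from scipy.linalg import expm

def h_ahoc(W, alpha=0.5, eps=1e-8):
    """Hybrid-order acyclicity value: tr exp(B(W)) - d, where B is the
    Frobenius-normalized hybrid core M = alpha*(W o W) + (1-alpha)*|W|."""
    d = W.shape[0]
    M = alpha * (W * W) + (1.0 - alpha) * np.abs(W)
    B = M / (np.linalg.norm(M, 'fro') + eps)
    return np.trace(expm(B)) - d
```

For any DAG (a permutation of a strictly triangular $W$), $B$ is nilpotent, so $\mathrm{tr}\,\exp(B) = d$ and the constraint value is zero; any cycle makes the value strictly positive.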

3. Smoothed Proximal Gradient Algorithm with AHOC Constraint

SPG-AHOC operates by replacing the nonsmooth $|W|$ within $h_{\mathrm{AHOC}}$ with a smooth surrogate, $\mathrm{smooth\_abs}(x; \delta) = \sqrt{x^2 + \delta^2}$, yielding the smoothed core

$\tilde{M}(W; \alpha, \delta) = \alpha\,(W \circ W) + (1-\alpha)\,\sqrt{W \circ W + \delta^2}$

(with the square root applied entrywise) and consequently a smooth, Lipschitz-continuous constraint function $\tilde{h}_{\mathrm{AHOC}}(W; \alpha, \epsilon, \delta)$.
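A quick numerical check (illustrative, not from the paper) confirms the surrogate's key properties: the bias $\sqrt{x^2+\delta^2} - |x|$ lies in $(0, \delta]$ uniformly, and the derivative $x/\sqrt{x^2+\delta^2}$ is bounded by 1 and well-defined at $x = 0$, unlike the subdifferential of $|x|$:

```python
import numpy as np

def smooth_abs(x, delta):
    """Smooth surrogate for |x|; overestimates by at most delta."""
    return np.sqrt(x * x + delta * delta)

def smooth_abs_grad(x, delta):
    """Derivative of the surrogate: bounded by 1 and defined at x = 0."""
    return x / np.sqrt(x * x + delta * delta)

x = np.linspace(-2.0, 2.0, 401)
delta = 1e-2
bias = smooth_abs(x, delta) - np.abs(x)  # maximal (= delta) at x = 0
```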

The constraint is enforced via an augmented Lagrangian approach:

$F_t(W) = L(W; X) + \mu_t\,\tilde{h}_{\mathrm{AHOC}}(W) + \frac{\rho_t}{2}\,\big[\tilde{h}_{\mathrm{AHOC}}(W)\big]^2 + \lambda_1 \|W\|_1$

At each iteration, a proximal gradient update is performed on the differentiable part, followed by coordinatewise soft-thresholding to enforce sparsity:

    G = ∇f(W_k)                          # gradient of smooth loss plus constraint penalty
    Z = W_k - η * G                      # gradient descent step
    W_{k+1} = soft_threshold(Z, η * λ1)  # proximal (soft-thresholding) step
where $f$ encapsulates the smooth loss plus constraint penalty. The step size $\eta$ is adapted by backtracking to ensure a sufficient decrease of the objective (Wu et al., 26 Jan 2026).
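A minimal sketch of this update follows; the helper names are hypothetical, and `grad_f` stands for the gradient of the smooth part of $F_t$:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Coordinatewise proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def spg_step(W, grad_f, eta, lam1):
    """One proximal gradient update: gradient step on the smooth part,
    then soft-thresholding, which sets small coordinates exactly to zero."""
    Z = W - eta * grad_f(W)
    return soft_threshold(Z, eta * lam1)
```

Coordinates whose gradient-step magnitude stays below $\eta\lambda_1$ are mapped exactly to zero, which is what makes finite-iteration support identification possible in the first place.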

This iterative scheme leverages the “manifold identification” property: after finitely many iterations, the support (zero pattern) of WW stabilizes exactly, owing to the geometric structure of the proximal operator and constraint (Wu et al., 26 Jan 2026).

4. Finite-Time Oracle Property and Theoretical Guarantees

A central advance of SPG-AHOC is the finite-time oracle property. Under standard Lasso-type identifiability conditions—local restricted strong convexity, irrepresentability, and a minimal signal strength (beta-min)—it is proven:

  • There exists a critical neighborhood such that, for a sufficiently small smoothing parameter $\delta < \delta^*$, after a finite iteration $K$, all subsequent iterates $W_k$ ($k \geq K$) exactly recover the support and signs of the true DAG:
    • $\mathrm{supp}(W_k) = \mathrm{supp}(W^*)$
    • $\mathrm{sign}(W_k) = \mathrm{sign}(W^*)$
    • with high probability $1 - O(\exp(-cN))$ (Wu et al., 26 Jan 2026).

The argument proceeds in two stages: first, establishing primal–dual statistical consistency of the underlying estimator; second, leveraging the strict feasibility gap in the dual certificate to obtain an explicit finite $K$ for support stabilization. The smoothing bias is quantifiably controlled at $O(\delta)$.
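This support-stabilization behavior is easy to probe empirically: record the iterates and find the first index after which the zero pattern never changes. The helper below is an illustrative diagnostic, not part of the cited algorithm:

```python
import numpy as np

def support_stabilization_iter(iterates, tol=0.0):
    """First index K such that every iterate from K onward shares the
    final iterate's support (zero pattern)."""
    final_supp = np.abs(iterates[-1]) > tol
    K = len(iterates) - 1
    for k in range(len(iterates) - 2, -1, -1):
        if np.array_equal(np.abs(iterates[k]) > tol, final_supp):
            K = k
        else:
            break
    return K
```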

5. Numerical and Empirical Performance

SPG-AHOC demonstrates strong empirical advantages relative to earlier continuous DAG structure learning algorithms such as NOTEARS and DAGMA:

  • On benchmark synthetic and real graph datasets, SPG-AHOC achieves Structural Hamming Distance (SHD) at least on par with or superior to NOTEARS/DAGMA.
  • Sparsity in the returned $W$ is markedly improved: NOTEARS typically returns fully dense matrices (requiring heuristic post-thresholding), while SPG-AHOC produces 98–99.8% exact zeros, fully eliminating the need for post hoc support selection.
  • Computationally, for moderate $d$ ($d \leq 200$), SPG-AHOC is 2–5x faster than NOTEARS and competitive with DAGMA, attributed to the active-set restriction induced by early topological locking (Wu et al., 26 Jan 2026).
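For reference, the Structural Hamming Distance used in these comparisons can be computed as follows; this is one common convention (each missing, extra, or reversed edge counts once), and the helper is illustrative rather than the benchmark's exact implementation:

```python
import numpy as np

def shd(W_est, W_true, tol=1e-8):
    """Structural Hamming Distance between weighted adjacency matrices:
    the number of unordered node pairs whose edge status (absent,
    i->j, or j->i) differs, so a reversed edge counts once."""
    A = (np.abs(W_est) > tol).astype(int)
    B = (np.abs(W_true) > tol).astype(int)
    d = A.shape[0]
    count = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (A[i, j], A[j, i]) != (B[i, j], B[j, i]):
                count += 1
    return count
```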

6. Comparison to Other SPG Applications and Extensions

Core advances in SPG-AHOC build upon earlier SPG innovations for structured sparse regression and clustering:

  • General SPG approaches smooth composite nonsmooth penalties and leverage FISTA-type momentum (Chen et al., 2012, Chen et al., 2010).
  • The Smoothing Proximal Gradient with Extrapolation (SPGE) method achieves convergence rates $o(1/k^{1-\sigma})$ and reduced zig-zagging in nonconvex $\ell_0$-relaxation (Zhang et al., 2021).
  • Convex clustering via SPG requires only $O(knp)$ operations per iteration and achieves accuracy comparable to interior-point and AMA/ADMM solvers at drastically reduced computation and memory cost (Zhou et al., 2020).
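The FISTA-type extrapolation shared by these methods can be sketched as a generic accelerated proximal gradient loop with the standard momentum sequence; this is not code from the cited papers:

```python
import numpy as np

def fista(grad_g, prox, x0, L, n_iter=100):
    """Accelerated proximal gradient: a gradient step on the smooth part g
    (step size 1/L), a prox step on the nonsmooth part, and momentum
    extrapolation with the standard t-sequence."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        x = prox(y - grad_g(y) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev
```

With $g(x) = \tfrac{1}{2}\|x - b\|^2$ and an $\ell_1$ prox, for example, the loop converges to the soft-thresholded solution.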

A recurring insight is that controlled smoothing enables exact recovery and sharply improves the interpretability of first-order optimization output, no longer requiring dense solutions or hand-thresholding. Extensions of SPG-AHOC may incorporate adaptive restarts, alternative constraint stabilization (e.g., line search for the Lipschitz constant $L$), or asynchronous coordinate updates (Zhang et al., 2021).

7. Limitations and Open Directions

Despite the theoretical and empirical strengths, several limitations persist:

  • Parameter selection (e.g., step-size, smoothing parameter, constraint weight) can be delicate and problem-specific, requiring explicit calibration for optimal practical results (Zhang et al., 2021).
  • Sublinear convergence remains fundamental to these first-order methods; further improvements hinge on structural properties (e.g., strong convexity or Kurdyka-Łojasiewicz conditions).
  • While SPG-AHOC's finite-time identification holds under standard assumptions, behavior in the presence of model misspecification or strong correlations requires further investigation (Wu et al., 26 Jan 2026).

Potential extensions include adaptive restart schemes, improved line-search for step-size selection, handling nonconvex differentiable loss terms, and distributed or stochastic variants for large-scale DAG recovery.


References:

  • (Chen et al., 2010): "Smoothing proximal gradient method for general structured sparse regression"
  • (Chen et al., 2012): "Smoothing Proximal Gradient Method for General Structured Sparse Learning"
  • (Zhou et al., 2020): "An Efficient Smoothing Proximal Gradient Algorithm for Convex Clustering"
  • (Zhang et al., 2021): "The Smoothing Proximal Gradient Algorithm with Extrapolation for the Relaxation of $\ell_0$ Regularization Problem"
  • (Wu et al., 26 Jan 2026): "Smooth, Sparse, and Stable: Finite-Time Exact Skeleton Recovery via Smoothed Proximal Gradients"
