ArcGD Optimiser: Phase-Aware Gradient Descent
- ArcGD Optimiser is an optimization algorithm based on arc-length principles with phase-aware, user-controlled dynamics to prevent overshooting and stalling in non-convex landscapes.
- Its three-phase update mechanism—exploration, transition, and vanishing phases—ensures bounded updates, smooth acceleration, and minimum progress in low-gradient zones.
- Empirical results across benchmarks show ArcGD achieves faster convergence, reduced overfitting, and enhanced generalization compared to Adam and other traditional optimizers.
Arc Gradient Descent (ArcGD) is an optimization algorithm derived from a mathematical reformulation of classical gradient descent. Employing an arc-length principle and phase-aware, user-controlled step dynamics, ArcGD constrains parameter updates within tunable bounds to address issues of overshooting in high-gradient regions and stalling near critical points. The method was formally derived, implemented, and empirically evaluated across highly non-convex optimization landscapes and standard deep-learning benchmarks. Notably, ArcGD embodies a spectrum of behaviors interpolating between classical gradient descent and sign-based optimizers, and connects to the Lion optimizer in special cases (Verma et al., 7 Dec 2025).
1. Mathematical Formulation
The core of ArcGD is a mathematically grounded step-size schedule based on the arc length of the loss function. For a one-dimensional differentiable function $f(x)$, the arc length is
$$s = \int \sqrt{1 + \big(f'(x)\big)^2}\, dx.$$
Discretization over a small step $\Delta x$ yields
$$\Delta s \approx \sqrt{1 + \big(f'(x)\big)^2}\, \Delta x.$$
Enforcing an upper bound on the arc-length increment, with a ceiling $a > 0$, gives the elementwise update rule
$$x \leftarrow x - a\,T,$$
where $T = f'(x)\big/\sqrt{1 + \big(f'(x)\big)^2} \in (-1, 1)$. Generalizing to the multidimensional case, the algorithm applies this update per parameter dimension, with $T_i = g_i\big/\sqrt{1 + g_i^2}$ for gradient components $g_i$.
To address near-zero gradients and inertial regions, the update comprises three phases, each regulated by a user-controlled parameter decomposing the total step ceiling:
$$\Delta_i = \underbrace{a\,T_i}_{\text{high}} \;+\; \underbrace{b\,T_i\,(1-|T_i|)}_{\text{transition}} \;+\; \underbrace{c\,\operatorname{sign}(T_i)\,(1-|T_i|)}_{\text{low}},$$
where $T_i = g_i/\sqrt{1+g_i^2}$ and $a, b, c > 0$. The terms represent:
- High-phase: $a\,T_i$, dominant as $|T_i| \to 1$ (bounded ceiling $a$).
- Transition-phase: $b\,T_i\,(1-|T_i|)$ for moderate $|T_i|$, shaping the mid-range step.
- Low-phase: $c\,\operatorname{sign}(T_i)\,(1-|T_i|)$, enforcing a nonzero lower bound in vanishing-gradient regimes.
An adaptive floor variant is also provided to prevent divergence induced by the low-phase term:
$$c_i = \min\!\left(c,\ \frac{\eta_{\text{low}}\,|T_i|}{1 - |T_i|}\right),$$
and the update becomes
$$\Delta_i = a\,T_i + b\,T_i\,(1-|T_i|) + c_i\,\operatorname{sign}(T_i)\,(1-|T_i|).$$
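As a concrete instance of these formulas, the following NumPy sketch (our own illustration; the function name, argument names, and the small epsilon guarding the division are not from the paper) computes the phase-aware step $\Delta$ for a gradient vector, with the optional adaptive floor:

```python
import numpy as np

def arcgd_delta(g, a=0.01, b=0.001, c=0.0001, eta_low=None):
    """Phase-aware ArcGD step for a gradient vector g (illustrative sketch).

    a: high-phase ceiling, b: transition weight, c: low-phase floor;
    eta_low: if given, use the adaptive floor c_i = min(c, eta_low*|T|/(1-|T|)).
    """
    T = g / np.sqrt(1.0 + g**2)                     # normalized gradient, |T| < 1
    absT = np.abs(T)
    if eta_low is not None:
        # adaptive floor; tiny epsilon guards the division as |T| -> 1
        c_i = np.minimum(c, eta_low * absT / (1.0 - absT + 1e-12))
    else:
        c_i = c
    return a * T + b * T * (1 - absT) + c_i * np.sign(T) * (1 - absT)

# The step saturates near the ceiling a for a huge gradient component and stays
# near the floor c for a tiny one; the adaptive floor lets the tiny one shrink.
g = np.array([1e6, 1.0, 1e-8])
print(arcgd_delta(g))                  # approx [1.0e-02, 7.3e-03, 1.0e-04]
print(arcgd_delta(g, eta_low=0.01))    # last entry collapses toward zero
```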
2. Algorithmic Procedure
ArcGD uses a simultaneous, component-wise update. The procedure is as follows:
- Inputs: initial parameters $x^{(0)}$; hyperparameters $a$, $b$, $c$; and optionally $\beta$ for moving-averaged gradients and $\eta_{\text{low}}$ for the adaptive floor $c_i$.
- Iterative update: at each step, compute the gradient $g = \nabla f(x)$; if using the noisy-landscape variant, replace $g$ by the exponentially weighted moving average $m \leftarrow \beta m + (1-\beta)\,g$.
- For each parameter $x_i$:
  - Compute the phase-aware update $\Delta_i$ as per the full update rule above.
  - Optionally apply the adaptive floor $c_i$ in place of the fixed $c$.
- Update all parameters simultaneously: $x \leftarrow x - \Delta$.
Pseudocode (verbatim, see (Verma et al., 7 Dec 2025)):
```
x ← x^(0)
if using noisy-landscape variant:
    m ← 0                                   # moving-avg of gradients
for t = 0, 1, 2, ... until convergence:
    g ← ∇f(x)                               # compute gradient
    if using moving-avg:
        m ← β⋅m + (1−β)⋅g
        use g_eff ← m
    else:
        use g_eff ← g
    # component-wise transform
    for i in 1..n:
        T_i ← g_eff,i / sqrt(1 + g_eff,i^2)
        # optional adaptive floor
        c_i ← min(c, (η_low ⋅ |T_i|)/(1−|T_i|))   # skip if non-adaptive
        Δ_i ← a⋅T_i + b⋅T_i⋅(1−|T_i|) + c_i⋅sign(T_i)⋅(1−|T_i|)
    end for
    x ← x − Δ                               # simultaneous update
end for
```
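For readers who want runnable code, the sketch below casts the same loop in PyTorch's `torch.optim.Optimizer` interface. It is a minimal illustration of the pseudocode above, not the authors' reference implementation; the class name, keyword arguments, and the small epsilon guarding the adaptive-floor division are our own choices.

```python
import torch

class ArcGD(torch.optim.Optimizer):
    """Illustrative ArcGD sketch following the pseudocode above."""

    def __init__(self, params, a=0.01, b=0.001, c=0.0001,
                 beta=0.9, eta_low=None, use_moving_avg=False):
        defaults = dict(a=a, b=b, c=c, beta=beta,
                        eta_low=eta_low, use_moving_avg=use_moving_avg)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            a, b, c = group["a"], group["b"], group["c"]
            beta, eta_low = group["beta"], group["eta_low"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                if group["use_moving_avg"]:
                    state = self.state[p]
                    if "m" not in state:
                        state["m"] = torch.zeros_like(p)
                    m = state["m"]
                    m.mul_(beta).add_(g, alpha=1 - beta)   # m = beta*m + (1-beta)*g
                    g = m
                T = g / torch.sqrt(1 + g * g)              # normalized gradient, |T| < 1
                absT = T.abs()
                if eta_low is not None:                    # optional adaptive floor
                    c_eff = torch.clamp(eta_low * absT / (1 - absT + 1e-12), max=c)
                else:
                    c_eff = c
                delta = a * T + b * T * (1 - absT) + c_eff * torch.sign(T) * (1 - absT)
                p.add_(delta, alpha=-1)                    # x <- x - delta
```

In use it behaves like any `torch.optim` optimizer: construct it over `model.parameters()`, call `loss.backward()`, then `opt.step()`.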
3. Phase-Aware Step Dynamics and User Control
ArcGD explicitly partitions the update regime into three dynamically weighted phases determined by the normalized gradient $T_i = g_i/\sqrt{1+g_i^2}$.
- Exploration (Saturation) Phase: $|T_i| \to 1$ (large gradients). The update is capped by $a$, implementing a hard ceiling.
- Transition Phase: For moderate gradients (intermediate $|T_i|$), the $b$-term facilitates smooth nonlinearity in the update magnitude.
- Vanishing Phase: For $|T_i| \to 0$, the $c$-term enforces a nonzero floor, guaranteeing minimum progress when gradients are nearly vanishing.
User control is provided via the parameters, as illustrated in the sketch after this list:
- Increasing $a$ raises the ceiling (maximum step).
- Adjusting $b$ modifies the nonlinearity and acceleration in the transition phase.
- Setting $c$ or $\eta_{\text{low}}$ tunes stalling resistance at small gradients.
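To make the phase weighting concrete, the short sketch below (our own illustration, using the default hyperparameters listed in the next section) tabulates the three contributions across gradient magnitudes:

```python
import numpy as np

a, b, c = 0.01, 0.001, 0.0001                        # defaults from the next section

print(f"{'g':>8} {'|T|':>8} {'high a*T':>12} {'trans b*T*(1-T)':>16} {'low c*(1-T)':>12}")
for g in [1e-6, 1e-2, 0.3, 1.0, 10.0, 1e4]:
    T = g / np.sqrt(1 + g**2)                        # g > 0 here, so T = |T|
    high = a * T
    trans = b * T * (1 - T)
    low = c * (1 - T)                                # sign(T) = +1 for g > 0
    print(f"{g:8.0e} {T:8.4f} {high:12.2e} {trans:16.2e} {low:12.2e}")
```

With these defaults, the floor term is the largest contribution only when gradients nearly vanish, the transition term peaks at intermediate $|T_i|$, and the ceiling term saturates at $a$ as $|T_i|$ approaches 1.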
4. Hyperparameter Selection and Implementation Guidelines
Default values for general tasks are:
| Parameter | Default Value | Role |
|---|---|---|
| $a$ | $0.01$ | Ceiling |
| $b$ | $0.001$ | Transition |
| $c$ | $0.0001$ | Floor |
| $\beta$ | $0.9$ | Gradient moving average |
| $\eta_{\text{low}}$ | $0.01$ | Adaptive floor (optional) |
- Ensure the ordering $a > b > c$ for stability, consistent with the defaults above.
- Set $c \ll a$ to prevent domination by the floor.
- For aggressive stalling prevention, increase $c$ (or $\eta_{\text{low}}$); for conservative behavior, decrease it.
Initialization uses standard neural network schemes (e.g. He-normal), with initial points $x^{(0)}$ selected uniformly in an appropriate range.
5. Empirical Evaluation and Comparative Performance
5.1 Stochastic Rosenbrock Benchmark
ArcGD was benchmarked against Adam on a noisy, non-convex Rosenbrock function across dimensions ranging from $n = 2$ to $n = 50{,}000$ (see the table below). Two protocol configurations eliminated learning-rate bias: (A) matched effective learning rates; (B) both optimizers using Adam's default learning rate.
| Dim | Adam conv.% | ArcGD conv.% | Avg iters (Adam/ArcGD) | Avg time (s) (Adam/ArcGD) |
|---|---|---|---|---|
| 2 | 100% | 100% | 9,440 / 2,802 | 0.47 / 0.15 |
| 10 | 60% | 80% | 11,370 / 2,897 | 0.85 / 0.25 |
| 100 | 90% | 90% | 13,432 / 4,378 | 1.29 / 0.40 |
| 1,000 | 90% | 100% | 15,658 / 9,197 | 1.99 / 1.22 |
| 50,000 | 0% | 100% | – / 22,993 | – / 104.4 |
ArcGD demonstrated superior speed, reliability, and precision, especially as dimensionality increased. Under Configuration B (Adam's smaller default rate), ArcGD produced more precise solutions at the cost of additional iterations in high dimensions.
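For reference, the deterministic base of this benchmark is the standard $n$-dimensional Rosenbrock function $f(x) = \sum_{i=1}^{n-1}\big[100\,(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\big]$. The sketch below gives its analytic gradient together with a stochastic variant; the additive Gaussian gradient noise is an assumption for illustration, as the paper's exact noise model is not reproduced here.

```python
import numpy as np

def rosenbrock_grad(x):
    """Analytic gradient of the standard n-dimensional Rosenbrock function."""
    g = np.zeros_like(x)
    g[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1])
    g[1:] += 200 * (x[1:] - x[:-1]**2)
    return g

def noisy_rosenbrock_grad(x, sigma=0.1, rng=None):
    """Stochastic gradient with assumed additive Gaussian noise (illustration only)."""
    rng = np.random.default_rng() if rng is None else rng
    return rosenbrock_grad(x) + sigma * rng.standard_normal(x.shape)
```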
5.2 CIFAR-10 Neural Network Benchmark
Eight MLPs of 1–5 hidden layers (parameters: –) were trained on CIFAR-10, comparing ArcGD to Adam, AdamW, Lion, and SGD. Average test accuracy:
| Optimiser | Test accuracy % (5,000 iters) | Test accuracy % (20,000 iters) |
|---|---|---|
| ArcGD | 48.4 | 50.7 |
| Adam | 47.7 | 46.6 |
| AdamW | 47.6 | 46.8 |
| Lion | 42.7 | 43.3 |
| SGD | 44.1 | 49.6 |
Adam and AdamW converged rapidly by 5,000 iterations but regressed by 20,000 due to overfitting. ArcGD, in contrast, continued improving throughout training on average, required no tuning of early stopping, and delivered the highest accuracy on six of the eight architectures at the 20,000-iteration mark. This suggests greater resistance to overfitting and enhanced generalization without tuning the training duration as a hyperparameter.
6. Connections to the Lion Optimiser
A variant of ArcGD recovers the Lion optimizer's behavior. Omitting the transition phase ($b = 0$) and setting $c = 0$ reduces the update to
$$\Delta_i = a\,T_i = \frac{a\,g_i}{\sqrt{1 + g_i^2}},$$
which, in the regime $|g_i| \gg 1$ (equivalently $|T_i| \to 1$), yields
$$\Delta_i \approx a\,\operatorname{sign}(g_i),$$
matching the core Lion sign-momentum update when $g$ is replaced by a momentum-accumulated gradient $m \leftarrow \beta m + (1-\beta)\,g$ (see Chen et al. 2023). This unifies ArcGD's bounded ceiling/floor approach with Lion's sign-based step rule, highlighting a structural correspondence between the optimizers (Verma et al., 7 Dec 2025).
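A quick numeric illustration of this limit (our own check, using the reduced update with $b = c = 0$): as $|g|$ grows, the normalized ArcGD step approaches the sign step used by Lion.

```python
import numpy as np

a = 0.01
for g in [0.1, 1.0, 10.0, 100.0]:
    T = g / np.sqrt(1 + g**2)
    delta = a * T                          # ArcGD update with b = 0, c = 0
    print(f"g = {g:>6}:  delta / a = {delta / a:.5f}   (sign(g) = 1)")
```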