
ArcGD Optimiser: Phase-Aware Gradient Descent

Updated 14 December 2025
  • ArcGD Optimiser is an optimization algorithm based on arc-length principles with phase-aware, user-controlled dynamics to prevent overshooting and stalling in non-convex landscapes.
  • Its three-phase update mechanism—exploration, transition, and vanishing phases—ensures bounded updates, smooth acceleration, and minimum progress in low-gradient zones.
  • Empirical results across benchmarks show ArcGD achieves faster convergence, reduced overfitting, and enhanced generalization compared to Adam and other traditional optimizers.

Arc Gradient Descent (ArcGD) is an optimization algorithm derived from a mathematical reformulation of classical gradient descent. Employing an arc-length principle and phase-aware, user-controlled step dynamics, ArcGD constrains parameter updates within tunable bounds to address issues of overshooting in high-gradient regions and stalling near critical points. The method was formally derived, implemented, and empirically evaluated across highly non-convex optimization landscapes and standard deep-learning benchmarks. Notably, ArcGD embodies a spectrum of behaviors interpolating between classical gradient descent and sign-based optimizers, and connects to the Lion optimizer in special cases (Verma et al., 7 Dec 2025).

1. Mathematical Formulation

The core of ArcGD is a mathematically grounded step-size schedule based on the arc length of the loss function. For a one-dimensional differentiable function $f(x)$, the arc length is

$$s = \int \sqrt{1 + [f'(x)]^2}\, dx.$$

Discretization over a small step $\Delta x$ yields

$$\Delta s \approx \Delta x\, \sqrt{1 + [f'(x)]^2}.$$

Enforcing an upper bound on the arc-length increment, $\Delta s \approx a\,|f'(x)|$ with a ceiling $a \ll 1$, gives the elementwise update rule

$$\Delta x = -a\, \frac{g}{\sqrt{1 + g^2}},$$

where $g \equiv f'(x)$. Generalizing to the multidimensional case, the algorithm applies this update per parameter dimension.
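
As a concrete illustration (a minimal NumPy sketch, not taken from the paper; the function name and the quadratic test objective are illustrative), the bounded update keeps every step below the ceiling $a$ no matter how large the gradient is:

```python
import numpy as np

def arc_step(g, a=0.01):
    """Arc-length-bounded step: |step| = a*|g|/sqrt(1+g^2) < a for any g."""
    return -a * g / np.sqrt(1.0 + g**2)

# Minimise f(x) = 0.5 * x^2 (so f'(x) = x), starting in a high-gradient region.
x = 50.0
for _ in range(5):
    g = x
    step = arc_step(g)
    print(f"x = {x:10.4f}  g = {g:10.4f}  step = {step:+.6f}")
    x += step
```

Even with $g = 50$ the printed step stays just under $0.01$ in magnitude, which is exactly the behavior the ceiling is designed to enforce.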

To address near-zero gradients and inertial regions, the update comprises three phases, each regulated by user-controlled parameters $(a, b, c)$ that decompose the total step ceiling:

$$\Delta x = -\left[a T + b T (1 - |T|) + c\, \mathrm{sign}(T) (1 - |T|)\right],$$

where $T = g/\sqrt{1+g^2} \in (-1, 1)$. The terms represent:

  • High-phase: $aT$ for $|g| \gg 1$ (bounded ceiling).
  • Transition-phase: $bT(1 - |T|)$ for moderate $|g|$, shaping the mid-range step.
  • Low-phase: $c\, \mathrm{sign}(T)(1 - |T|)$, enforcing a nonzero lower bound in vanishing-gradient regimes.

An adaptive floor variant is also provided to prevent divergence of the low-phase term:

$$c_{\text{adapt}} = \min\!\left(c,\ \frac{c\, |T|}{1 - |T|}\right),$$

and the update becomes

$$\Delta x = -\left[a T + b T (1 - |T|) + c_{\text{adapt}}\, \mathrm{sign}(T)(1 - |T|)\right].$$
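
A minimal vectorised sketch of the full three-phase update, including the adaptive floor, might look as follows; the function name, defaults, and the 1e-12 guard are illustrative assumptions rather than the reference implementation:

```python
import numpy as np

def arcgd_update(g, a=0.01, b=0.001, c=0.0001, adaptive_floor=True):
    """Elementwise phase-aware ArcGD step for a gradient vector g."""
    T = g / np.sqrt(1.0 + g**2)               # normalised gradient, T in (-1, 1)
    absT = np.abs(T)
    if adaptive_floor:
        # c_adapt = min(c, c|T| / (1 - |T|)); the small epsilon guards |T| -> 1.
        c_eff = np.minimum(c, c * absT / (1.0 - absT + 1e-12))
    else:
        c_eff = c
    return -(a * T                             # high-phase ceiling
             + b * T * (1.0 - absT)            # transition-phase shaping
             + c_eff * np.sign(T) * (1.0 - absT))  # low-phase floor

# One update for a mix of large, moderate, and tiny gradient components.
g = np.array([12.0, -0.3, 2e-5])
print(arcgd_update(g))
```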

2. Algorithmic Procedure

ArcGD uses a simultaneous, component-wise update. The procedure is as follows:

  • Inputs: Initial parameters $x^{(0)} \in \mathbb{R}^n$; hyperparameters $a > 0$, $b \geq 0$, $c \geq 0$; and optionally $\beta \in [0, 1)$ for moving-averaged gradients and $\eta_{\text{low}}$ for the adaptive $c$.
  • Iterative update: At each step, compute the gradient $g = \nabla f(x)$; if using the noisy-landscape variant, replace $g$ by an exponentially weighted moving average.
  • For each parameter $i$:
    • $T_i = g_{\text{eff},i} / \sqrt{1 + g_{\text{eff},i}^2}$
    • Compute the phase-aware update $\Delta_i$ as per the full update rule.
    • Optionally apply the adaptive floor for $c_i$.
  • Update all parameters: $x \leftarrow x - \Delta$.

Pseudocode (following Verma et al., 7 Dec 2025):

x ← x^(0)
if using noisy-landscape variant:
  m ← 0                       # moving average of gradients
for t = 0, 1, 2, ... until convergence:
  g ← ∇f(x)                   # compute gradient
  if using moving-avg:
    m ← β·m + (1 − β)·g
    use g_eff ← m
  else:
    use g_eff ← g
  # component-wise transform
  for i in 1..n:
    T_i ← g_eff,i / sqrt(1 + g_eff,i^2)
    # optional adaptive floor
    c_i ← min(c, (η_low · |T_i|) / (1 − |T_i|))    # skip if non-adaptive
    Δ_i ← a·T_i + b·T_i·(1 − |T_i|) + c_i·sign(T_i)·(1 − |T_i|)
  end for
  x ← x − Δ                   # simultaneous update
end for
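
The pseudocode maps directly onto a standard deep-learning optimiser interface. The following is a minimal, illustrative PyTorch sketch of the same update (not the authors' reference implementation; the class name, argument names, and the 1e-12 guard are assumptions):

```python
import torch

class ArcGD(torch.optim.Optimizer):
    """Minimal ArcGD sketch: phase-aware, arc-length-bounded updates."""

    def __init__(self, params, a=0.01, b=0.001, c=0.0001,
                 beta=0.9, use_moving_avg=False, adaptive_floor=True):
        defaults = dict(a=a, b=b, c=c, beta=beta,
                        use_moving_avg=use_moving_avg,
                        adaptive_floor=adaptive_floor)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            a, b, c = group["a"], group["b"], group["c"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                if group["use_moving_avg"]:
                    state = self.state[p]
                    if "m" not in state:
                        state["m"] = torch.zeros_like(p)
                    m = state["m"]
                    m.mul_(group["beta"]).add_(g, alpha=1 - group["beta"])
                    g_eff = m
                else:
                    g_eff = g
                T = g_eff / torch.sqrt(1.0 + g_eff * g_eff)
                absT = T.abs()
                if group["adaptive_floor"]:
                    # c_adapt = min(c, c|T| / (1 - |T|)); epsilon guards |T| -> 1
                    c_eff = torch.minimum(torch.full_like(absT, c),
                                          c * absT / (1.0 - absT + 1e-12))
                else:
                    c_eff = c
                delta = (a * T + b * T * (1 - absT)
                         + c_eff * T.sign() * (1 - absT))
                p.sub_(delta)             # x <- x - delta (simultaneous update)
        return loss
```

Usage mirrors any built-in optimiser: construct it from model.parameters(), then run the usual loss.backward() and opt.step() loop.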

3. Phase-Aware Step Dynamics and User Control

ArcGD explicitly partitions the update regime into three dynamically weighted phases determined by the normalized gradient $T$.

  • Exploration (Saturation) Phase: $|g| \gg 1 \implies |T| \approx 1$. The update is capped by $a$, implementing a hard ceiling.
  • Transition Phase: For moderate gradients ($0.01 \lesssim |g| \lesssim 10$), the $b$-term facilitates smooth nonlinearity in the update magnitude.
  • Vanishing Phase: For $|g| \ll 1$, the $c$-term enforces a nonzero floor, guaranteeing minimum progress when gradients nearly vanish (see the numerical sketch after this list).
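
The three regimes can be made concrete with a quick numerical probe (illustrative only, using the default $a, b, c$ values listed in Section 4):

```python
import numpy as np

a, b, c = 0.01, 0.001, 0.0001            # defaults from Section 4
for g in [100.0, 1.0, 1e-4]:             # saturation, transition, vanishing
    T = g / np.sqrt(1 + g**2)
    step = -(a*T + b*T*(1 - abs(T)) + c*np.sign(T)*(1 - abs(T)))
    print(f"g = {g:8g}  T = {T:.6f}  step = {step:+.6e}")
```

The large gradient is clipped near the ceiling $a$, the moderate gradient falls in the shaped mid-range, and the tiny gradient still produces a step of roughly $c$, the guaranteed floor.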

User control is provided via the parameters:

  • Increasing $a$ raises the ceiling (maximum step).
  • Adjusting $b$ modifies the nonlinearity and acceleration in the transition phase.
  • Setting $c$ or $\eta_{\mathrm{low}}$ tunes stalling resistance at small gradients.

4. Hyperparameter Selection and Implementation Guidelines

Default values for general tasks are:

Parameter   Default value   Role
a           0.01            Ceiling
b           0.001           Transition
c           0.0001          Floor
β           0.9             Gradient moving average
η_low       0.01            Adaptive c (optional)
  • Ensure $a + b + c \ll 1$ for stability: $|\Delta x| \leq a + b + c < 1$.
  • Set $c \leq a + b$ to prevent domination by the floor.
  • For aggressive stalling prevention, use $c \approx 10^{-2}$; for a conservative setting, $c \approx 10^{-4}$.

Initialization uses standard neural network schemes (e.g. He-normal), with $x^{(0)}$ selected uniformly in an appropriate range.
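
The guidelines above are straightforward to encode as a quick sanity check; the helper below is a hypothetical convenience (its name and return value are not from the paper):

```python
def check_arcgd_hparams(a=0.01, b=0.001, c=0.0001):
    """Validate the ArcGD stability guidelines from Section 4."""
    assert a > 0 and b >= 0 and c >= 0, "require a > 0, b >= 0, c >= 0"
    assert a + b + c < 1, "step bound a + b + c must stay below 1"
    assert c <= a + b, "floor c should not dominate the ceiling terms"
    return {"a": a, "b": b, "c": c, "max_step": a + b + c}

print(check_arcgd_hparams())             # defaults -> max_step = 0.0111
print(check_arcgd_hparams(c=1e-2))       # aggressive stalling prevention
```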

5. Empirical Evaluation and Comparative Performance

5.1 Stochastic Rosenbrock Benchmark

ArcGD was benchmarked against Adam on a noisy, non-convex Rosenbrock function across dimensions $n = 2, 10, 100, 1000, 50{,}000$. Two protocol configurations eliminated learning-rate bias: (A) matched effective learning rates; (B) both optimizers using Adam's default learning rate.

Dim      Adam conv. %   ArcGD conv. %   Avg iters (Adam / ArcGD)   Avg time, s (Adam / ArcGD)
2        100%           100%            9,440 / 2,802              0.47 / 0.15
10       60%            80%             11,370 / 2,897             0.85 / 0.25
100      90%            90%             13,432 / 4,378             1.29 / 0.40
1,000    90%            100%            15,658 / 9,197             1.99 / 1.22
50,000   0%             100%            – / 22,993                 – / 104.4

ArcGD demonstrated superior speed, reliability, and precision, especially as dimensionality increased. With the smaller default learning rate (Configuration B), ArcGD produced more precise solutions at the cost of more iterations in high dimensions.
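
For readers who want to reproduce a run of this kind, the sketch below sets up a noisy Rosenbrock instance and applies the ArcGD update with default hyperparameters; the additive-Gaussian noise model, noise scale, and iteration budget are assumptions, not the paper's exact protocol:

```python
import numpy as np

def rosenbrock_grad(x):
    """Gradient of the coupled n-dimensional Rosenbrock function."""
    g = np.zeros_like(x)
    g[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1])
    g[1:] += 200 * (x[1:] - x[:-1]**2)
    return g

rng = np.random.default_rng(0)
a, b, c = 0.01, 0.001, 0.0001
x = rng.uniform(-2.0, 2.0, size=10)                                # n = 10 instance
for t in range(20_000):
    g = rosenbrock_grad(x) + 0.1 * rng.standard_normal(x.shape)    # noisy gradient
    T = g / np.sqrt(1 + g**2)
    x -= a*T + b*T*(1 - np.abs(T)) + c*np.sign(T)*(1 - np.abs(T))
print("final distance to optimum:", np.linalg.norm(x - 1.0))
```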

5.2 CIFAR-10 Neural Network Benchmark

Eight MLPs with 1–5 hidden layers ($10^5$–$5.5\times10^6$ parameters) were trained on CIFAR-10, comparing ArcGD to Adam, AdamW, Lion, and SGD. Average test accuracy (%):

Optimiser   5,000 iters   20,000 iters
ArcGD       48.4          50.7
Adam        47.7          46.6
AdamW       47.6          46.8
Lion        42.7          43.3
SGD         44.1          49.6

Adam and AdamW converged rapidly early on (by 5,000 iterations) but regressed at 20,000 iterations due to overfitting. ArcGD continued improving throughout training (+2.3% on average), required no early-stopping tuning, and delivered the highest accuracy on six of the eight architectures at the late stage. This suggests greater resistance to overfitting and enhanced generalization without tuning the training duration.

6. Connections to the Lion Optimiser

A variant of ArcGD recovers the Lion optimizer's behavior. Omitting the transition phase ($b = 0$) and setting $a = c = \gamma \ll 1$ gives

$$\Delta x = -\gamma T - \gamma\, \mathrm{sign}(T)(1 - |T|),$$

which, in the regime $|T| \approx 1$, yields

$$\Delta x \approx -\gamma\,\mathrm{sign}(T),$$

matching the core Lion sign-momentum update when $T$ is replaced by a momentum-accumulated gradient ($T \leftarrow m_t$, $\beta_2 = 1$; see Chen et al. 2023). This unifies ArcGD's bounded ceiling/floor approach with Lion's sign-based step rule, highlighting a structural correspondence between the optimizers (Verma et al., 7 Dec 2025).
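
A small numerical check makes the correspondence concrete (illustrative only; Lion itself also uses interpolated momentum and weight decay, which are omitted here). With the fixed floor, the $b = 0$, $a = c = \gamma$ step coincides with $-\gamma\,\mathrm{sign}(T)$ for any nonzero gradient, while the adaptive-floor variant interpolates from a gradient-proportional step at small $|T|$ to the sign step at large $|T|$:

```python
import numpy as np

gamma = 1e-3                                  # a = c = gamma, b = 0
for g in [0.05, 0.5, 5.0, 50.0]:
    T = g / np.sqrt(1 + g**2)
    absT = abs(T)
    fixed = -(gamma * T + gamma * np.sign(T) * (1 - absT))       # fixed floor
    c_ad = min(gamma, gamma * absT / (1 - absT))                 # adaptive floor
    adapt = -(gamma * T + c_ad * np.sign(T) * (1 - absT))
    lion_like = -gamma * np.sign(g)                              # sign-based step
    print(f"g = {g:6.2f}  fixed = {fixed:+.6e}  adaptive = {adapt:+.6e}  "
          f"sign = {lion_like:+.6e}")
```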
