
ArcGD Optimiser: Phase-Aware Gradient Descent

Updated 14 December 2025
  • ArcGD Optimiser is an optimization algorithm based on arc-length principles with phase-aware, user-controlled dynamics to prevent overshooting and stalling in non-convex landscapes.
  • Its three-phase update mechanism—exploration, transition, and vanishing phases—ensures bounded updates, smooth acceleration, and minimum progress in low-gradient zones.
  • Empirical results across benchmarks show ArcGD achieves faster convergence, reduced overfitting, and enhanced generalization compared to Adam and other traditional optimizers.

Arc Gradient Descent (ArcGD) is an optimization algorithm derived from a mathematical reformulation of classical gradient descent. Employing an arc-length principle and phase-aware, user-controlled step dynamics, ArcGD constrains parameter updates within tunable bounds to address issues of overshooting in high-gradient regions and stalling near critical points. The method was formally derived, implemented, and empirically evaluated across highly non-convex optimization landscapes and standard deep-learning benchmarks. Notably, ArcGD embodies a spectrum of behaviors interpolating between classical gradient descent and sign-based optimizers, and connects to the Lion optimizer in special cases (Verma et al., 7 Dec 2025).

1. Mathematical Formulation

The core of ArcGD is a mathematically grounded step-size schedule based on the arc length of the loss function. For a one-dimensional differentiable function $f(x)$, the arc length is

$$s = \int \sqrt{1 + [f'(x)]^2}\, dx.$$

Discretization over a small step $\Delta x$ yields

$$\Delta s \approx \Delta x\, \sqrt{1 + [f'(x)]^2}.$$

Enforcing an upper bound on the arc-length increment, $\Delta s \approx a\,|f'(x)|$ with a ceiling $a \ll 1$, gives the elementwise update rule

$$\Delta x = -a\, \frac{g}{\sqrt{1 + g^2}},$$

where $g \equiv f'(x)$. Generalizing to the multidimensional case, the algorithm applies this update per parameter dimension.
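
As a concrete illustration (a minimal NumPy sketch, not taken from the paper; the function name and the quadratic test objective are illustrative), the bounded update keeps every step below the ceiling $a$ no matter how large the gradient is:

```python
import numpy as np

def arc_step(g, a=0.01):
    """Arc-length-bounded step: |step| = a*|g|/sqrt(1+g^2) < a for any g."""
    return -a * g / np.sqrt(1.0 + g**2)

# Minimise f(x) = 0.5 * x^2 (so f'(x) = x), starting in a high-gradient region.
x = 50.0
for _ in range(5):
    g = x
    step = arc_step(g)
    print(f"x = {x:10.4f}  g = {g:10.4f}  step = {step:+.6f}")
    x += step
```

Even with $g = 50$ the printed step stays just under $0.01$ in magnitude, which is exactly the behavior the ceiling is designed to enforce.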

To address near-zero gradients and inertial regions, the update comprises three phases, each regulated by user-controlled parameters $(a, b, c)$ that decompose the total step ceiling:

$$\Delta x = -\left[a T + b T (1 - |T|) + c\, \mathrm{sign}(T) (1 - |T|)\right],$$

where $T = g/\sqrt{1+g^2} \in (-1, 1)$. The terms represent:

  • High-phase: $aT$ for $|g| \gg 1$ (bounded ceiling).
  • Transition-phase: $bT(1 - |T|)$ for moderate $|g|$, shaping the mid-range step.
  • Low-phase: $c\, \mathrm{sign}(T)(1 - |T|)$, enforcing a nonzero lower bound in vanishing-gradient regimes.

An adaptive floor variant is also provided to prevent divergence of the low-phase term:

$$c_{\text{adapt}} = \min\!\left(c,\ \frac{c\, |T|}{1 - |T|}\right),$$

and the update becomes

$$\Delta x = -\left[a T + b T (1 - |T|) + c_{\text{adapt}}\, \mathrm{sign}(T)(1 - |T|)\right].$$
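
A minimal vectorised sketch of the full three-phase update, including the adaptive floor, might look as follows; the function name, defaults, and the 1e-12 guard are illustrative assumptions rather than the reference implementation:

```python
import numpy as np

def arcgd_update(g, a=0.01, b=0.001, c=0.0001, adaptive_floor=True):
    """Elementwise phase-aware ArcGD step for a gradient vector g."""
    T = g / np.sqrt(1.0 + g**2)               # normalised gradient, T in (-1, 1)
    absT = np.abs(T)
    if adaptive_floor:
        # c_adapt = min(c, c|T| / (1 - |T|)); the small epsilon guards |T| -> 1.
        c_eff = np.minimum(c, c * absT / (1.0 - absT + 1e-12))
    else:
        c_eff = c
    return -(a * T                             # high-phase ceiling
             + b * T * (1.0 - absT)            # transition-phase shaping
             + c_eff * np.sign(T) * (1.0 - absT))  # low-phase floor

# One update for a mix of large, moderate, and tiny gradient components.
g = np.array([12.0, -0.3, 2e-5])
print(arcgd_update(g))
```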

2. Algorithmic Procedure

ArcGD uses a simultaneous, component-wise update. The procedure is as follows:

  • Inputs: Initial parameters $x^{(0)} \in \mathbb{R}^n$; hyperparameters $a > 0$, $b \geq 0$, $c \geq 0$; and optionally $\beta \in [0, 1)$ for moving-averaged gradients and $\eta_{\text{low}}$ for the adaptive $c$.
  • Iterative update: At each step, compute the gradient $g = \nabla f(x)$; if using the noisy-landscape variant, replace $g$ by an exponentially weighted moving average.
  • For each parameter $i$:
    • $T_i = g_{\text{eff},i} / \sqrt{1 + g_{\text{eff},i}^2}$
    • Compute the phase-aware update $\Delta_i$ as per the full update rule.
    • Optionally apply the adaptive floor for $c_i$.
  • Update all parameters: $x \leftarrow x - \Delta$.

Pseudocode (following Verma et al., 7 Dec 2025):

x ← x^(0)
if using noisy-landscape variant:
  m ← 0                       # moving average of gradients
for t = 0, 1, 2, ... until convergence:
  g ← ∇f(x)                   # compute gradient
  if using moving-avg:
    m ← β·m + (1 − β)·g
    use g_eff ← m
  else:
    use g_eff ← g
  # component-wise transform
  for i in 1..n:
    T_i ← g_eff,i / sqrt(1 + g_eff,i^2)
    # optional adaptive floor
    c_i ← min(c, (η_low · |T_i|) / (1 − |T_i|))    # skip if non-adaptive
    Δ_i ← a·T_i + b·T_i·(1 − |T_i|) + c_i·sign(T_i)·(1 − |T_i|)
  end for
  x ← x − Δ                   # simultaneous update
end for
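
The pseudocode maps directly onto a standard deep-learning optimiser interface. The following is a minimal, illustrative PyTorch sketch of the same update (not the authors' reference implementation; the class name, argument names, and the 1e-12 guard are assumptions):

```python
import torch

class ArcGD(torch.optim.Optimizer):
    """Minimal ArcGD sketch: phase-aware, arc-length-bounded updates."""

    def __init__(self, params, a=0.01, b=0.001, c=0.0001,
                 beta=0.9, use_moving_avg=False, adaptive_floor=True):
        defaults = dict(a=a, b=b, c=c, beta=beta,
                        use_moving_avg=use_moving_avg,
                        adaptive_floor=adaptive_floor)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            a, b, c = group["a"], group["b"], group["c"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                if group["use_moving_avg"]:
                    state = self.state[p]
                    if "m" not in state:
                        state["m"] = torch.zeros_like(p)
                    m = state["m"]
                    m.mul_(group["beta"]).add_(g, alpha=1 - group["beta"])
                    g_eff = m
                else:
                    g_eff = g
                T = g_eff / torch.sqrt(1.0 + g_eff * g_eff)
                absT = T.abs()
                if group["adaptive_floor"]:
                    # c_adapt = min(c, c|T| / (1 - |T|)); epsilon guards |T| -> 1
                    c_eff = torch.minimum(torch.full_like(absT, c),
                                          c * absT / (1.0 - absT + 1e-12))
                else:
                    c_eff = c
                delta = (a * T + b * T * (1 - absT)
                         + c_eff * T.sign() * (1 - absT))
                p.sub_(delta)             # x <- x - delta (simultaneous update)
        return loss
```

Usage mirrors any built-in optimiser: construct it from model.parameters(), then run the usual loss.backward() and opt.step() loop.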

3. Phase-Aware Step Dynamics and User Control

ArcGD explicitly partitions the update regime into three dynamically weighted phases determined by the normalized gradient $T$.

  • Exploration (Saturation) Phase: $|g| \gg 1 \implies |T| \approx 1$. The update is capped by $a$, implementing a hard ceiling.
  • Transition Phase: For moderate gradients ($0.01 \lesssim |g| \lesssim 10$), the $b$-term facilitates smooth nonlinearity in the update magnitude.
  • Vanishing Phase: For $|g| \ll 1$, the $c$-term enforces a nonzero floor, guaranteeing minimum progress when gradients nearly vanish (see the numerical sketch after this list).
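
The three regimes can be made concrete with a quick numerical probe (illustrative only, using the default $a, b, c$ values listed in Section 4):

```python
import numpy as np

a, b, c = 0.01, 0.001, 0.0001            # defaults from Section 4
for g in [100.0, 1.0, 1e-4]:             # saturation, transition, vanishing
    T = g / np.sqrt(1 + g**2)
    step = -(a*T + b*T*(1 - abs(T)) + c*np.sign(T)*(1 - abs(T)))
    print(f"g = {g:8g}  T = {T:.6f}  step = {step:+.6e}")
```

The large gradient is clipped near the ceiling $a$, the moderate gradient falls in the shaped mid-range, and the tiny gradient still produces a step of roughly $c$, the guaranteed floor.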

User control is provided via the parameters:

  • Increasing $a$ raises the ceiling (maximum step).
  • Adjusting $b$ modifies the nonlinearity and acceleration in the transition phase.
  • Setting $c$ or $\eta_{\mathrm{low}}$ tunes stalling resistance at small gradients.

4. Hyperparameter Selection and Implementation Guidelines

Default values for general tasks are:

Parameter   Default value   Role
a           0.01            Ceiling
b           0.001           Transition
c           0.0001          Floor
β           0.9             Gradient moving average
η_low       0.01            Adaptive c (optional)
  • Ensure $a + b + c \ll 1$ for stability: $|\Delta x| \leq a + b + c < 1$.
  • Set $c \leq a + b$ to prevent domination by the floor.
  • For aggressive stalling prevention, use $c \approx 10^{-2}$; for a conservative setting, $c \approx 10^{-4}$.

Initialization uses standard neural network schemes (e.g. He-normal), with $x^{(0)}$ selected uniformly in an appropriate range.
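
The guidelines above are straightforward to encode as a quick sanity check; the helper below is a hypothetical convenience (its name and return value are not from the paper):

```python
def check_arcgd_hparams(a=0.01, b=0.001, c=0.0001):
    """Validate the ArcGD stability guidelines from Section 4."""
    assert a > 0 and b >= 0 and c >= 0, "require a > 0, b >= 0, c >= 0"
    assert a + b + c < 1, "step bound a + b + c must stay below 1"
    assert c <= a + b, "floor c should not dominate the ceiling terms"
    return {"a": a, "b": b, "c": c, "max_step": a + b + c}

print(check_arcgd_hparams())             # defaults -> max_step = 0.0111
print(check_arcgd_hparams(c=1e-2))       # aggressive stalling prevention
```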

5. Empirical Evaluation and Comparative Performance

5.1 Stochastic Rosenbrock Benchmark

ArcGD was benchmarked against Adam on a noisy, non-convex Rosenbrock function across dimensions $n = 2, 10, 100, 1000, 50{,}000$. Two protocol configurations eliminated learning-rate bias: (A) matched effective learning rates; (B) both optimizers using Adam's default learning rate.

Dim      Adam conv. %   ArcGD conv. %   Avg iters (Adam / ArcGD)   Avg time, s (Adam / ArcGD)
2        100%           100%            9,440 / 2,802              0.47 / 0.15
10       60%            80%             11,370 / 2,897             0.85 / 0.25
100      90%            90%             13,432 / 4,378             1.29 / 0.40
1,000    90%            100%            15,658 / 9,197             1.99 / 1.22
50,000   0%             100%            – / 22,993                 – / 104.4

ArcGD demonstrated superior speed, reliability, and precision, especially as dimensionality increased. With the smaller default learning rate (Configuration B), ArcGD produced more precise solutions at the cost of more iterations in high dimensions.
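
For readers who want to reproduce a run of this kind, the sketch below sets up a noisy Rosenbrock instance and applies the ArcGD update with default hyperparameters; the additive-Gaussian noise model, noise scale, and iteration budget are assumptions, not the paper's exact protocol:

```python
import numpy as np

def rosenbrock_grad(x):
    """Gradient of the coupled n-dimensional Rosenbrock function."""
    g = np.zeros_like(x)
    g[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1])
    g[1:] += 200 * (x[1:] - x[:-1]**2)
    return g

rng = np.random.default_rng(0)
a, b, c = 0.01, 0.001, 0.0001
x = rng.uniform(-2.0, 2.0, size=10)                                # n = 10 instance
for t in range(20_000):
    g = rosenbrock_grad(x) + 0.1 * rng.standard_normal(x.shape)    # noisy gradient
    T = g / np.sqrt(1 + g**2)
    x -= a*T + b*T*(1 - np.abs(T)) + c*np.sign(T)*(1 - np.abs(T))
print("final distance to optimum:", np.linalg.norm(x - 1.0))
```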

5.2 CIFAR-10 Neural Network Benchmark

Eight MLPs with 1–5 hidden layers ($10^5$–$5.5\times10^6$ parameters) were trained on CIFAR-10, comparing ArcGD to Adam, AdamW, Lion, and SGD. Average test accuracy (%):

Optimiser   5,000 iters   20,000 iters
ArcGD       48.4          50.7
Adam        47.7          46.6
AdamW       47.6          46.8
Lion        42.7          43.3
SGD         44.1          49.6

Adam and AdamW converged rapidly early on (by 5,000 iterations) but regressed at 20,000 iterations due to overfitting. ArcGD continued improving throughout training (+2.3% on average), required no early-stopping tuning, and delivered the highest accuracy on six of the eight architectures at the late stage. This suggests greater resistance to overfitting and enhanced generalization without tuning the training duration.

6. Connections to the Lion Optimiser

A variant of ArcGD recovers the Lion optimizer's behavior. Omitting the transition phase ($b = 0$) and setting $a = c = \gamma \ll 1$ gives

$$\Delta x = -\gamma T - \gamma\, \mathrm{sign}(T)(1 - |T|),$$

which, in the regime $|T| \approx 1$, yields

$$\Delta x \approx -\gamma\,\mathrm{sign}(T),$$

matching the core Lion sign-momentum update when $T$ is replaced by a momentum-accumulated gradient ($T \leftarrow m_t$, $\beta_2 = 1$; see Chen et al. 2023). This unifies ArcGD's bounded ceiling/floor approach with Lion's sign-based step rule, highlighting a structural correspondence between the optimizers (Verma et al., 7 Dec 2025).
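
A small numerical check makes the correspondence concrete (illustrative only; Lion itself also uses interpolated momentum and weight decay, which are omitted here). With the fixed floor, the $b = 0$, $a = c = \gamma$ step coincides with $-\gamma\,\mathrm{sign}(T)$ for any nonzero gradient, while the adaptive-floor variant interpolates from a gradient-proportional step at small $|T|$ to the sign step at large $|T|$:

```python
import numpy as np

gamma = 1e-3                                  # a = c = gamma, b = 0
for g in [0.05, 0.5, 5.0, 50.0]:
    T = g / np.sqrt(1 + g**2)
    absT = abs(T)
    fixed = -(gamma * T + gamma * np.sign(T) * (1 - absT))       # fixed floor
    c_ad = min(gamma, gamma * absT / (1 - absT))                 # adaptive floor
    adapt = -(gamma * T + c_ad * np.sign(T) * (1 - absT))
    lion_like = -gamma * np.sign(g)                              # sign-based step
    print(f"g = {g:6.2f}  fixed = {fixed:+.6e}  adaptive = {adapt:+.6e}  "
          f"sign = {lion_like:+.6e}")
```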
