Multiplicative Backprop Updates

Updated 14 March 2026

Multiplicative backpropagation-style updates scale parameter changes by their current magnitude, offering robust normalization and mitigating vanishing gradients.
Hybrid methods blend multiplicative and additive rules using inner/outer learning rates and a blending parameter to control nonlinearity and update clipping.
Empirical studies on benchmarks like CIFAR-10 and MNIST show accelerated convergence and improved performance compared to traditional optimization techniques.

Multiplicative backpropagation-style updates are optimization algorithms for deep learning that replace or complement the canonical additive gradient-descent rule with elementwise updates that scale parameter changes proportionally to their current magnitude. Such methods—rooted in the Winnow and exponentiated-gradient literatures—have seen a rigorous resurgence in modern deep networks, as they naturally normalize updates, can ameliorate vanishing/shattering gradients, yield robustness to parameter rescaling, and facilitate new regimes of training acceleration and stability. Multiplicative updates can be implemented in pure form, as a hybrid with standard additive updates, or via more general frameworks such as hypentropy-based mirror descent and Expectation Reflection, all of which have been integrated into backpropagation pipelines and empirically validated at scale (Kirtas et al., 2023, Ghai et al., 2019, Kim et al., 13 Mar 2025, Köpp et al., 2016).

1. Mathematical Formulation of Multiplicative Updates

The canonical gradient descent rule is additive: at iteration $t$, with parameter vector $\theta_{t-1}$, gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$, momentum $m_t$, and adaptive preconditioner $l_t$,

$\Delta\theta_t = \eta\, m_t\, l_t \qquad \Rightarrow \qquad \theta_t = \theta_{t-1} - \Delta\theta_t.$

Multiplicative updates fundamentally alter this structure by making the step proportional to the magnitude of each parameter. In the framework of the Generic Optimization Framework for Alternative Updates (GOFAU), the pure multiplicative update is

$\boxed{ \Delta\theta_t = |\theta_{t-1}|\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\, \eta_{\mathrm{out}} }$

where $\eta_{\mathrm{in}}>0$ and $\eta_{\mathrm{out}}\in(0,1]$ are hyperparameters controlling the nonlinearity and outer scaling, respectively. This form ensures (i) moves proportional to parameter scale, (ii) controlled, clipped updates via $\tanh$, and (iii) elementwise operation.
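The boxed rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names and the default values of `eta_in` and `eta_out` are assumptions chosen for the example, and the momentum is simply set to the raw gradient with an identity preconditioner.

```python
import math

def multiplicative_delta(theta, m, l, eta_in=1.0, eta_out=0.1):
    """Elementwise pure multiplicative step: |theta| * tanh(eta_in * m * l) * eta_out.

    eta_in and eta_out defaults are illustrative, not values from the source.
    """
    return [abs(th) * math.tanh(eta_in * mi * li) * eta_out
            for th, mi, li in zip(theta, m, l)]

def apply_step(theta, delta):
    """theta_t = theta_{t-1} - delta_t, elementwise."""
    return [th - d for th, d in zip(theta, delta)]

theta = [2.0, -0.5, 0.0]   # current parameters theta_{t-1}
m = [1.0, -3.0, 10.0]      # momentum (here: raw gradients, for illustration)
l = [1.0, 1.0, 1.0]        # preconditioner (identity, as in plain SGD)

delta = multiplicative_delta(theta, m, l)
new_theta = apply_step(theta, delta)
print(delta)
print(new_theta)
```

Each step is at most `eta_out * |theta|` in magnitude, and the third parameter, which starts at zero, does not move regardless of how large its gradient is.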

A hybrid variant interpolates with the additive rule via a blending parameter. The additive and multiplicative methods are recovered at the two endpoints of that parameter's range (Kirtas et al., 2023).

2. Algorithmic Implementation and Integration with Backpropagation

Multiplicative and hybrid updates are implemented within standard backward passes. At each layer:

Compute $m_t$ and $l_t$ as for Adam, RMSProp, or Adagrad.
Compute the multiplicative factor $|\theta_{t-1}|\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\, \eta_{\mathrm{out}}$.
Compute the additive factor $\eta\, m_t\, l_t$.
Choose the update $\Delta\theta_t$ via hybrid blending or the pure multiplicative rule.
Apply elementwise subtraction to update the parameters: $\theta_t = \theta_{t-1} - \Delta\theta_t$.
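The five steps above can be sketched as a single update function. Several pieces are assumptions made for illustration only: the hybrid rule is written as a convex combination with a hypothetical weight `alpha` (the exact blending formula of Kirtas et al. is not reproduced here), the preconditioner is an RMSProp-style inverse root of a running squared-gradient average without bias correction, and all default hyperparameter values are placeholders.

```python
import math

def init_state(n):
    """Per-parameter optimizer state: momentum m and squared-gradient average v."""
    return {"m": [0.0] * n, "v": [0.0] * n}

def hybrid_step(theta, grad, state, eta=0.01, eta_in=1.0, eta_out=0.1,
                alpha=0.5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One elementwise update following the five listed steps.

    ASSUMPTION: the hybrid is a convex combination with hypothetical weight
    `alpha` (alpha=1: pure multiplicative, alpha=0: purely additive).
    """
    new_theta = []
    for i, (th, g) in enumerate(zip(theta, grad)):
        # Step 1: momentum m_t and preconditioner l_t (RMSProp-style).
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        m_t = state["m"][i]
        l_t = 1.0 / (math.sqrt(state["v"][i]) + eps)
        # Step 2: multiplicative factor.
        mult = abs(th) * math.tanh(eta_in * m_t * l_t) * eta_out
        # Step 3: additive factor.
        add = eta * m_t * l_t
        # Step 4: blend the two factors (assumed convex combination).
        delta = alpha * mult + (1.0 - alpha) * add
        # Step 5: elementwise subtraction.
        new_theta.append(th - delta)
    return new_theta

# Usage: minimise f(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta = [1.5, -2.0]
state = init_state(len(theta))
for _ in range(200):
    grad = [2.0 * th for th in theta]
    theta = hybrid_step(theta, grad, state)
print(theta)
```

Keeping the state in a plain dictionary mirrors how per-parameter buffers are usually stored in optimizer implementations. With `alpha=1.0` a zero-valued parameter stays at zero, while any `alpha < 1.0` lets the additive term move it.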

The update is composable with any standard initialization (Glorot, He) and optimizer-driven preconditioning. Table 1 summarizes the principal update rules:

Update Type	Formula	Sign flip possible?
Pure Multiplicative	$\Delta\theta_t = |\theta_{t-1}|\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\, \eta_{\mathrm{out}}$	No
Additive (SGD)	$\Delta\theta_t = \eta\, m_t\, l_t$	Yes
Hybrid	Blend of the multiplicative term and the additive term	Yes, when the additive term contributes

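Pure multiplicative updates preserve the sign of each parameter: because $|\tanh(\cdot)| < 1$ and $\eta_{\mathrm{out}} \le 1$, the step satisfies $|\Delta\theta_t| < |\theta_{t-1}|$ for any nonzero parameter and so can never carry it across zero. Only the additive term, on its own or inside the hybrid rule, allows sign flips. Zero-initialized parameters remain unchanged under pure multiplicative rules, motivating hybridization.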
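The sign behaviour described here is easy to check numerically. The sketch below drives one positive, one negative, and one zero parameter with the same arbitrary random signal standing in for $m_t\, l_t$, and tracks an additive parameter for contrast. The function names, learning rates, and signal range are illustrative assumptions.

```python
import math
import random

def multiplicative_step(theta, signal, eta_in=1.0, eta_out=0.5):
    """One scalar step: theta - |theta| * tanh(eta_in * signal) * eta_out."""
    return theta - abs(theta) * math.tanh(eta_in * signal) * eta_out

def additive_step(theta, signal, eta=0.1):
    """One scalar additive step: theta - eta * signal."""
    return theta - eta * signal

random.seed(0)
mult_pos, mult_neg, mult_zero = 1.0, -1.0, 0.0
add_pos, add_flipped = 1.0, False
for _ in range(200):
    signal = random.uniform(-50.0, 50.0)   # arbitrary stand-in for m_t * l_t
    mult_pos = multiplicative_step(mult_pos, signal)
    mult_neg = multiplicative_step(mult_neg, signal)
    mult_zero = multiplicative_step(mult_zero, signal)
    add_pos = additive_step(add_pos, signal)
    add_flipped = add_flipped or add_pos < 0.0

# Multiplicative parameters keep their sign; the additive one crosses zero.
print(mult_pos > 0.0, mult_neg < 0.0, mult_zero == 0.0, add_flipped)
```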

3. Theoretical Properties and Regret Bounds

Multiplicative and hybrid rules exhibit several theoretical properties that differentiate them from additive SGD (Kirtas et al., 2023, Ghai et al., 2019):

Scale-Adaptivity: Updates scale proportionally under a rescaling of the parameters, naturally adapting to parameter magnitude. Additive SGD lacks this property and would require learning-rate retuning.
Vanishing-Gradient Mitigation: For small gradients $g_t$, multiplicative updates still produce a nonzero $\Delta\theta_t$ due to proportional scaling, alleviating stagnation.
Clipping and Robustness: The $\tanh$ nonlinearity ensures that, even with large gradients, step sizes remain bounded, $|\Delta\theta_t| \le \eta_{\mathrm{out}}\, |\theta_{t-1}|$, preventing parameter blow-up.
Mistake-Bound Guarantees: In settings analogous to Winnow and exponentiated-gradient (EG), multiplicative updates are known to enjoy logarithmic mistake bounds when many features are irrelevant (Kirtas et al., 2023).
Mirror Descent Unification: The hypentropy framework (Ghai et al., 2019) provides a continuous family interpolating between additive (GD) and EG multiplicative rules via a scalar “temperature” $\beta$, with explicit regret bounds that reduce to the classic GD or EG rates in the appropriate limits.
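The scale-adaptivity and clipping properties listed above can be illustrated with a small numerical check. The comparison holds the signal $m_t\, l_t$ fixed while the parameter is rescaled, which is a simplifying assumption of this sketch (in a real network the gradient generally changes under rescaling as well). The function names and constants are hypothetical.

```python
import math

def mult_delta(theta, signal, eta_in=1.0, eta_out=0.1):
    """Pure multiplicative step for one scalar parameter."""
    return abs(theta) * math.tanh(eta_in * signal) * eta_out

def add_delta(signal, eta=0.01):
    """Additive step; note that it does not depend on theta at all."""
    return eta * signal

signal = 0.3            # stand-in for m_t * l_t, held fixed for the comparison
theta, c = 0.8, 100.0   # parameter and rescaling factor (illustrative)

# Scale-adaptivity: rescaling theta by c rescales the multiplicative step by c,
# while the additive step is unchanged.
ratio_mult = mult_delta(c * theta, signal) / mult_delta(theta, signal)
ratio_add = add_delta(signal) / add_delta(signal)

# Clipping: even a huge signal cannot push the step past eta_out * |theta|.
huge = mult_delta(theta, 1e6)

print(ratio_mult, ratio_add)
print(huge, 0.1 * abs(theta))
```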

4. Variants: Hypentropy and Expectation Reflection

Alternative multiplicative forms have been advanced:

Hypentropy Update (HU, SHU family): Based on the hypentropy potential,

$\phi_\beta(x) = \sum_i \left( x_i \operatorname{arcsinh}(x_i/\beta) - \sqrt{x_i^2 + \beta^2} \right)$

HU smoothly interpolates between GD ($\beta \to \infty$) and positive EG ($\beta \to 0$) elementwise, and applies to vectors and, via spectral decomposition, to matrices. The SHU extension enables matrix-valued updates for general rectangular matrices (Ghai et al., 2019).
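A minimal sketch of the elementwise hypentropy step follows, taking the mirror map to be $\nabla\phi_\beta(w) = \operatorname{arcsinh}(w/\beta)$ so that one mirror-descent step reads $w_{t+1} = \beta \sinh(\operatorname{arcsinh}(w_t/\beta) - \eta\, g_t)$. The function name and the test values are assumptions for illustration. To expose the GD limit, the step size is divided by $\beta$, a rescaling adopted in this sketch so that the large-$\beta$ update matches $w - \eta\, g$.

```python
import math

def hypentropy_step(w, grad, eta, beta):
    """Elementwise mirror-descent step: beta * sinh(asinh(w / beta) - eta * grad)."""
    return [beta * math.sinh(math.asinh(wi / beta) - eta * gi)
            for wi, gi in zip(w, grad)]

w = [0.5, 2.0]       # positive weights, so the EG comparison is meaningful
g = [1.0, -0.5]      # illustrative gradient
eta = 0.1

# Small beta: the step approaches exponentiated gradient, w * exp(-eta * g).
near_eg = hypentropy_step(w, g, eta, beta=1e-6)
eg = [wi * math.exp(-eta * gi) for wi, gi in zip(w, g)]

# Large beta (with step size eta / beta): the step approaches GD, w - eta * g.
big_beta = 1e6
near_gd = hypentropy_step(w, g, eta / big_beta, beta=big_beta)
gd = [wi - eta * gi for wi, gi in zip(w, g)]

print(near_eg, eg)
print(near_gd, gd)
```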

Expectation Reflection (ER): Updates weights via a multiplicative correction based on the ratio of the true target to the prediction. In the full stacked-network case, ER also updates the pre-activations, with new weights computed via pseudoinverse regression. ER is hyperparameter-free, can converge in a single iteration in ideal cases, and reinterprets backpropagation as inverse target propagation (Kim et al., 13 Mar 2025).

Differentiable Addiplicative Units: A smoothly parameterized transition between addition and multiplication at the neuron level uses non-integer iterates of the exponential function, allowing each neuron to interpolate between summation and multiplication, with gradients computable in closed form (Köpp et al., 2016).
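The two endpoints of such a transition can be illustrated without any fractional iteration. The sketch below assumes a unit of the form $\exp^{(n)}\big(\sum_j w_j \exp^{(-n)}(x_j)\big)$, where $\exp^{(n)}$ denotes the $n$-fold iterate of the exponential and $\exp^{(-1)}$ is the natural logarithm. This form is an assumption used for illustration, not a formula taken from the text above, and only the integer cases are implemented. The non-integer iterates that make the transition smooth are not covered.

```python
import math

def iterated_exp(x, n):
    """n-fold iterate of exp for n in {-1, 0, 1}; n = -1 is the natural log."""
    if n == 0:
        return x
    if n == 1:
        return math.exp(x)
    if n == -1:
        return math.log(x)
    raise ValueError("this sketch covers only n in {-1, 0, 1}")

def unit(inputs, weights, n):
    """Hypothetical unit: exp^(n)( sum_j w_j * exp^(-n)(x_j) )."""
    inner = sum(w * iterated_exp(x, -n) for w, x in zip(weights, inputs))
    return iterated_exp(inner, n)

x = [2.0, 3.0]       # inputs must be positive for the multiplicative endpoint
w = [1.0, 1.0]

additive = unit(x, w, n=0)         # weighted sum of the inputs
multiplicative = unit(x, w, n=1)   # product of inputs raised to their weights

print(additive, multiplicative)
```

At `n=0` the unit is an ordinary weighted sum, and at `n=1` it computes $\prod_j x_j^{w_j}$ through the identity $\exp(\sum_j w_j \log x_j)$.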

5. Empirical Results and Benchmarking

Multiplicative and hybrid updates have been empirically validated across convex, non-convex, and deep neural network settings:

Low-dimensional benchmarks: On 2D convex problems and the non-convex Rosenbrock function, hybrid updates reduce the normalized distance to the global minimum relative to additive baselines; after 100 steps, an improvement of up to five orders of magnitude is observed (Kirtas et al., 2023).
Deep image classification:
- CIFAR-10: Hybrid SGD + GOFAU yields higher absolute accuracy than SGD at epoch 5, and final-accuracy gains are reported for ResNet18 with both SGD and Adagrad. Multiplicative-only rules also yield improvements.
- CIFAR-100 and Tiny ImageNet: Hybrid updates improve final accuracy on ResNet18 under both SGD and Adagrad, with gains also reported on Tiny ImageNet (Kirtas et al., 2023).
- Training speed: Major accuracy and loss improvements concentrate in the earliest epochs of training.
Expectation Reflection is evaluated on MNIST and CIFAR-10 after a single full-batch iteration, outperforming other hyperparameter-free algorithms and rapidly approaching BP performance (Kim et al., 13 Mar 2025).
Addiplicative units enable networks to fit polynomial targets with lower test error and faster convergence than either purely additive or purely multiplicative architectures (Köpp et al., 2016).

6. Practical Recommendations, Hyperparameter Tuning, and Limitations

Rate selection: The inner and outer rates $\eta_{\mathrm{in}}$ and $\eta_{\mathrm{out}}$ control the impact of $\tanh$-clipping and scaling. Separate recommended settings are reported for SGD, Adagrad, and RMSProp (Kirtas et al., 2023).
Blending parameter: The recommended setting balances sign flexibility and multiplicative robustness.
Initialization: Use standard Xavier or He schemes. Pure multiplicative updates are acceptable when weights are initialized away from zero.
Limitations:
- Pure multiplicative rules cannot switch parameter sign, and zero-initialized weights remain fixed unless the rule is hybridized.
- Shallow nets may require additive steps to avoid suboptimal minima.
- ER’s full-batch pseudoinverse updates scale poorly in high dimensions; ridge regularization and mini-batch approximations trade off the no-hyperparameter property for scalability (Kim et al., 13 Mar 2025).
- Hybrid and addiplicative schemes increase per-parameter state or introduce additional parameters to optimize.
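The role of the inner and outer rates mentioned under rate selection can be made concrete with a short numerical sketch. For small $\eta_{\mathrm{in}}$ the approximation $\tanh(x) \approx x$ makes the step nearly linear in the signal, whereas for large $\eta_{\mathrm{in}}$ the $\tanh$ saturates and the step approaches the cap $\eta_{\mathrm{out}}\, |\theta_{t-1}|$. All numeric values below are illustrative assumptions, not recommended settings.

```python
import math

def mult_delta(theta, signal, eta_in, eta_out):
    """Pure multiplicative step |theta| * tanh(eta_in * signal) * eta_out."""
    return abs(theta) * math.tanh(eta_in * signal) * eta_out

theta, signal, eta_out = 2.0, 0.5, 0.1   # illustrative values only

# Small eta_in: tanh(x) ~ x, so the step is nearly linear in the signal.
small = mult_delta(theta, signal, eta_in=0.01, eta_out=eta_out)
linear = abs(theta) * 0.01 * signal * eta_out

# Large eta_in: tanh saturates, so only the sign of the signal matters.
large = mult_delta(theta, signal, eta_in=100.0, eta_out=eta_out)
cap = eta_out * abs(theta)

print(small, linear)
print(large, cap)
```

In this reading, $\eta_{\mathrm{in}}$ sets how quickly the update turns into a sign-like step, and $\eta_{\mathrm{out}}$ sets the largest fraction of a parameter's magnitude that can change in one iteration.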

7. Broader Landscape and Theoretical Unification

Multiplicative backpropagation-style updates connect classical theory and modern deep learning:

Multiplicative rules derive rigorously from mirror descent in non-Euclidean geometries, with Bregman divergences such as hypentropy smoothly bridging additive and multiplicative regimes (Ghai et al., 2019).
The unification enables regret bounds and update steps that extend to matrices and general nonlinear transformations.
Structural variants, including neuron-level addiplicative control and model-level multiplicative scaling, provide inductive bias for multiplicative effects relevant to problem structure (e.g., polynomial interaction or logical gating) (Köpp et al., 2016).
Expectation Reflection links multiplicative consistency updates directly to target propagation and classical regression (Kim et al., 13 Mar 2025).
The robustness of multiplicative schemes to hyperparameter selection, initialization scale, and irrelevant parameter directions aligns with theoretical advantages known from the Winnow and EG literatures, now instantiated at scale in modern deep architectures (Kirtas et al., 2023).

In synthesis, multiplicative backpropagation-style updates and their hybridizations furnish a principled, robust, and empirically validated toolkit for optimization in deep learning, with favorable convergence, scale-adaptivity, and performance characteristics documented across a range of benchmarks and model classes.

References

Kirtas et al. (2023). Multiplicative update rules for accelerating deep learning training and increasing robustness.

Ghai et al. (2019). Exponentiated Gradient Meets Gradient Descent.

Kim et al. (2025). Multiplicative Learning.

Köpp et al. (2016). A Differentiable Transition Between Additive and Multiplicative Neurons.
