Multiplicative Backprop Updates

Updated 14 March 2026
  • Multiplicative backpropagation-style updates scale parameter changes by their current magnitude, offering robust normalization and mitigating vanishing gradients.
  • Hybrid methods blend multiplicative and additive rules using inner/outer learning rates and a blending parameter to control nonlinearity and update clipping.
  • Empirical studies on benchmarks like CIFAR-10 and MNIST show accelerated convergence and improved performance compared to traditional optimization techniques.

Multiplicative backpropagation-style updates are optimization algorithms for deep learning that replace or complement the canonical additive gradient-descent rule with elementwise updates that scale parameter changes in proportion to their current magnitude. Such methods, rooted in the Winnow and exponentiated-gradient literatures, have seen renewed interest in modern deep networks: they naturally normalize updates, can ameliorate vanishing or shattering gradients, are robust to parameter rescaling, and can accelerate and stabilize training. Multiplicative updates can be implemented in pure form, as a hybrid with standard additive updates, or via more general frameworks such as hypentropy-based mirror descent and Expectation Reflection, all of which have been integrated into backpropagation pipelines and empirically validated at scale (Kirtas et al., 2023, Ghai et al., 2019, Kim et al., 13 Mar 2025, Köpp et al., 2016).

1. Mathematical Formulation of Multiplicative Updates

The canonical gradient descent rule is additive: at iteration $t$, with parameter vector $\theta_{t-1}$, gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$, momentum $m_t$, and adaptive preconditioner $l_t$,

$$\Delta\theta_t = \eta\, m_t\, l_t \qquad \Rightarrow \qquad \theta_t = \theta_{t-1} - \Delta\theta_t.$$

Multiplicative updates fundamentally alter this structure by making the step proportional to the magnitude of each parameter. In the framework of the Generic Optimization Framework for Alternative Updates (GOFAU), the pure multiplicative update is

$$\boxed{ \Delta\theta_t = |\theta_{t-1}|\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\, \eta_{\mathrm{out}} }$$

where $\eta_{\mathrm{in}}>0$ and $\eta_{\mathrm{out}}\in(0,1]$ are hyperparameters controlling the nonlinearity and outer scaling, respectively. This form ensures (i) moves proportional to parameter scale, (ii) controlled, clipped updates via $\tanh$, and (iii) elementwise operation.
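The pure multiplicative rule is simple enough to try by hand. The following is a minimal sketch in plain Python, assuming the momentum and preconditioner values are already available as lists; the function name `multiplicative_step` and the default rates are illustrative choices, not values taken from the cited work.

```python
import math

def multiplicative_step(theta, m, l, eta_in=1.0, eta_out=0.1):
    """Apply one pure multiplicative update elementwise.

    theta, m, l are equal-length lists holding the parameters, the
    momentum m_t, and the preconditioner l_t. The default rates are
    illustrative assumptions.
    """
    new_theta = []
    for th, mi, li in zip(theta, m, l):
        # Step size is proportional to |theta| and clipped by tanh.
        delta = abs(th) * math.tanh(eta_in * mi * li) * eta_out
        new_theta.append(th - delta)
    return new_theta

theta = [0.5, -2.0, 0.0]
m = [2.0, 2.0, 2.0]
l = [1.0, 1.0, 1.0]
updated = multiplicative_step(theta, m, l)
```

With these inputs the larger-magnitude parameter takes the larger step, both nonzero parameters keep their sign, and the zero entry does not move.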

A hybrid variant interpolates with the additive rule via a blending parameter; the purely additive and purely multiplicative methods are recovered at the two endpoints of its range (Kirtas et al., 2023).

2. Algorithmic Implementation and Integration with Backpropagation

Multiplicative and hybrid updates are implemented within standard backward passes. At each layer:

  • Compute $m_t$, $l_t$ as for Adam, RMSProp, or Adagrad.
  • Compute the multiplicative factor $|\theta_{t-1}|\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\, \eta_{\mathrm{out}}$.
  • Compute the additive factor $\eta\, m_t\, l_t$.
  • Choose the update via hybrid blending or the pure multiplicative rule.
  • Apply elementwise subtraction to obtain $\theta_t$.

The update is composable with any standard initialization (Glorot, He) and optimizer-driven preconditioning. Table 1 summarizes the principal update rules:

| Update Type | Formula | Sign-Flip? |
|---|---|---|
| Pure Multiplicative | $\Delta\theta_t = \lvert\theta_{t-1}\rvert\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\, \eta_{\mathrm{out}}$ | No |
| Additive (SGD) | $\Delta\theta_t = \eta\, m_t\, l_t$ | Yes |
| Hybrid | Blend of the multiplicative step and the additive step | Yes, via the additive component |

Pure multiplicative updates are sign-preserving: a parameter can shrink or grow in magnitude but never crosses zero. Only the additive term, alone or within the hybrid rule, allows sign flips. Zero-initialized parameters remain unchanged under pure multiplicative rules, which motivates hybridization.
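Both statements follow in one line from the pure multiplicative rule. As a short worked check, using only $|\tanh(x)| < 1$ and $\eta_{\mathrm{out}} \in (0,1]$:

```latex
\theta_{t-1} \neq 0:\quad
|\Delta\theta_t|
  = \eta_{\mathrm{out}}\,|\theta_{t-1}|\,\bigl|\tanh(\eta_{\mathrm{in}}\, m_t\, l_t)\bigr|
  < \eta_{\mathrm{out}}\,|\theta_{t-1}|
  \le |\theta_{t-1}|
\;\Longrightarrow\;
\operatorname{sign}(\theta_{t-1} - \Delta\theta_t) = \operatorname{sign}(\theta_{t-1}),
\\[4pt]
\theta_{t-1} = 0:\quad
\Delta\theta_t = 0 \;\Longrightarrow\; \theta_t = 0 .
```

The step is always strictly smaller than the parameter's magnitude, so the update cannot carry the parameter across zero.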

3. Theoretical Properties and Regret Bounds

Multiplicative and hybrid rules exhibit several theoretical properties that differentiate them from additive SGD (Kirtas et al., 2023, Ghai et al., 2019):

  • Scale-Adaptivity: Updates scale proportionally under a rescaling $\theta \to c\,\theta$, naturally adapting to parameter magnitude. Additive SGD lacks this property and would require learning-rate retuning.
  • Vanishing-Gradient Mitigation: For small $g_t$, multiplicative updates still result in nonzero $\Delta\theta_t$ due to proportional scaling, alleviating stagnation.
  • Clipping and Robustness: The $\tanh$ nonlinearity ensures that, even with large $g_t$, step sizes remain bounded within $[-\eta_{\mathrm{out}}|\theta_{t-1}|,\ \eta_{\mathrm{out}}|\theta_{t-1}|]$, preventing parameter blow-up.
  • Mistake-Bound Guarantees: In settings analogous to Winnow and exponentiated-gradient (EG), multiplicative updates are known to enjoy logarithmic mistake bounds when many features are irrelevant (Kirtas et al., 2023).
  • Mirror Descent Unification: The hypentropy framework (Ghai et al., 2019) provides a continuous family interpolating between additive (GD) and EG multiplicative rules, via a scalar “temperature” $\beta$, with explicit regret bounds that reduce to the classic GD and EG rates in the appropriate limits.
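The scale-adaptivity and clipping properties can be read off the pure multiplicative rule by looking at the relative step. The comparison below is a sketch under an idealizing assumption: the product $m_t l_t$ is held fixed while the parameter is rescaled by a constant $c \neq 0$ (in practice the gradient may also change with the rescaling).

```latex
\text{multiplicative:}\quad
\frac{\Delta\theta_t}{|c\,\theta_{t-1}|}
  = \eta_{\mathrm{out}}\, \tanh(\eta_{\mathrm{in}}\, m_t\, l_t)
  \in (-\eta_{\mathrm{out}},\ \eta_{\mathrm{out}})
  \quad \text{for every } c,
\\[4pt]
\text{additive:}\quad
\frac{\Delta\theta_t}{|c\,\theta_{t-1}|}
  = \frac{\eta\, m_t\, l_t}{|c|\,|\theta_{t-1}|}
  \quad \text{which shrinks as } |c| \text{ grows.}
```

Under this assumption the multiplicative rule moves every parameter by the same bounded fraction of its size, whereas the additive rule needs $\eta$ rescaled by $|c|$ to achieve the same relative progress.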

4. Variants: Hypentropy and Expectation Reflection

Alternative multiplicative forms have been advanced:

  • Hypentropy Update (HU, SHU family): Based on the hypentropy potential,

$$\phi_\beta(x) = \sum_i \left( x_i \operatorname{arcsinh}\!\left(\frac{x_i}{\beta}\right) - \sqrt{x_i^2 + \beta^2} \right)$$

HU smoothly interpolates between GD ($\beta \to \infty$) and positive EG ($\beta \to 0$) elementwise, applicable to vectors and via spectral decomposition to matrices. The SHU extension enables matrix-valued updates for general rectangular matrices (Ghai et al., 2019).

  • Expectation Reflection (ER): Updates weights via a multiplicative correction based on the ratio of the true target to the prediction. In the full stacked-network case, ER also updates the pre-activations layer by layer, with new weights computed via pseudoinverse regression. ER is hyperparameter-free, can converge in a single iteration in ideal cases, and reinterprets backpropagation as inverse target propagation (Kim et al., 13 Mar 2025).

  • Differentiable Addiplicative Units: A smoothly parameterized transition between addition and multiplication at the neuron level, built on non-integer iterates of the exponential function, allows each neuron to interpolate between summation and multiplication, with gradients computable in closed form (Köpp et al., 2016).
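The hypentropy update in the first bullet above can be sketched as mirror descent whose mirror map is the gradient of the hypentropy potential, $\operatorname{arcsinh}(w/\beta)$. The elementwise form used below, $w_{t+1} = \beta \sinh(\operatorname{arcsinh}(w_t/\beta) - \eta g_t)$, follows from that mirror map; the function name `hypentropy_step` and the numeric values are illustrative assumptions.

```python
import math

def hypentropy_step(w, g, eta, beta):
    """Elementwise mirror-descent step with mirror map arcsinh(w / beta)."""
    return [beta * math.sinh(math.asinh(wi / beta) - eta * gi)
            for wi, gi in zip(w, g)]

w, g, lr = [1.0], [2.0], 0.1

# Large beta: the mirror map is nearly linear, so with eta = lr / beta
# the step approaches plain gradient descent, w - lr * g.
gd_like = hypentropy_step(w, g, eta=lr / 1e6, beta=1e6)

# Small beta (positive weight): the step approaches the exponentiated-
# gradient form, w * exp(-lr * g).
eg_like = hypentropy_step(w, g, eta=lr, beta=1e-8)
```

Note that the gradient-descent limit needs the step size divided by $\beta$, because the mirror map behaves like $w/\beta$ for large $\beta$. Unlike the pure multiplicative rule, this update can cross zero, since $\sinh$ is defined on the whole real line.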

5. Empirical Results and Benchmarking

Multiplicative and hybrid updates have been empirically validated across convex, non-convex, and deep neural network settings:

  • Convex benchmarks: On 2D convex and Rosenbrock problems, hybrid updates reduce the normalized distance to the global minimum relative to additive baselines; after 100 steps, improvements of up to five orders of magnitude are observed (Kirtas et al., 2023).
  • Deep image classification:
    • CIFAR-10: Hybrid SGD + GOFAU reaches higher absolute accuracy than SGD at epoch 5, with final gains also reported for ResNet18 with SGD and with Adagrad. Multiplicative-only rules also yield improvements.
    • CIFAR-100 and Tiny ImageNet: Hybrid updates improve final accuracy on ResNet18 for both SGD and Adagrad, with gains also reported on Tiny ImageNet (Kirtas et al., 2023).
    • Training speed: Major accuracy and loss improvements concentrate in the early epochs.
  • Expectation Reflection is evaluated on MNIST and CIFAR-10 after a single full-batch iteration, outperforming other hyperparameter-free algorithms and rapidly approaching BP performance (Kim et al., 13 Mar 2025).
  • Addiplicative units enable networks to fit polynomial targets with lower test error and faster convergence than either purely additive or purely multiplicative architectures (Köpp et al., 2016).

6. Practical Recommendations, Hyperparameter Tuning, and Limitations

  • Rate selection: The inner and outer rates $\eta_{\mathrm{in}}$ and $\eta_{\mathrm{out}}$ control the impact of $\tanh$-clipping and scaling; recommended values are given separately for SGD, Adagrad, and RMSProp (Kirtas et al., 2023).
  • Blending parameter: Its value is chosen to balance sign flexibility and multiplicative robustness.
  • Initialization: Use standard Xavier or He schemes. Pure multiplicative updates are acceptable with nonnegative-initialized weights.
  • Limitations:
    • Pure multiplicative rules cannot switch parameter sign, and zero-initialized weights remain fixed unless hybridized.
    • Shallow nets may require additive steps to avoid suboptimal minima.
    • ER’s full-batch pseudoinverse updates scale poorly in high dimensions; ridge regularization and mini-batch approximations trade off the no-hyperparameter property for scalability (Kim et al., 13 Mar 2025).
    • Hybrid and addiplicative schemes increase per-parameter state or introduce additional parameters to optimize.

7. Broader Landscape and Theoretical Unification

Multiplicative backpropagation-style updates connect classical theory and modern deep learning:

  • Multiplicative rules derive rigorously from mirror descent in non-Euclidean geometries, with Bregman divergences such as hypentropy smoothly bridging additive and multiplicative regimes (Ghai et al., 2019).
  • The unification enables regret bounds and update steps that extend to matrices and general nonlinear transformations.
  • Structural variants, including neuron-level addiplicative control and model-level multiplicative scaling, provide inductive bias for multiplicative effects relevant to problem structure (e.g., polynomial interaction or logical gating) (Köpp et al., 2016).
  • Expectation Reflection links multiplicative consistency updates directly to target propagation and classical regression (Kim et al., 13 Mar 2025).
  • The robustness of multiplicative schemes to hyperparameter selection, initialization scale, and irrelevant parameter directions aligns with theoretical advantages known from the Winnow and EG literatures, now instantiated at scale in modern deep architectures (Kirtas et al., 2023).
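As a reference point for the mirror-descent view, the generic update and its two classical special cases can be written out explicitly. This is the standard textbook form; the symbols $\phi$ (mirror potential), $w_t$ (parameters), and $\eta$ (step size) are notation assumed here rather than taken from the cited papers.

```latex
\nabla\phi(w_{t+1}) = \nabla\phi(w_t) - \eta\, g_t
\\[4pt]
\phi(w) = \tfrac{1}{2}\,\|w\|_2^2
  \;\Longrightarrow\;
  w_{t+1} = w_t - \eta\, g_t
  \quad \text{(gradient descent)}
\\[4pt]
\phi(w) = \sum_i \bigl( w_i \log w_i - w_i \bigr),\ w_i > 0
  \;\Longrightarrow\;
  w_{t+1,i} = w_{t,i}\, \exp(-\eta\, g_{t,i})
  \quad \text{(unnormalized exponentiated gradient)}
```

The Euclidean potential gives an additive step, while the entropic potential gives a step that multiplies each coordinate by a positive factor; hypentropy-style potentials sit between these two cases.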

In summary, multiplicative backpropagation-style updates and their hybridizations provide a principled and empirically validated toolkit for optimization in deep learning, with favorable convergence, scale-adaptivity, and performance characteristics documented across a range of benchmarks and model classes.
