Hybrid Quasi-Newton Backpropagation

Updated 29 March 2026

The paper demonstrates that hybrid quasi-Newton backpropagation improves MLP training by integrating BFGS updates to effectively approximate second-order curvature.
It incorporates trust-region methods and Wolfe condition-based line searches to ensure robust step size selection and faster convergence compared to gradient descent.
Empirical results indicate lower training/test MSE and reduced convergence times, highlighting the method’s practical benefits over standard backpropagation.

Hybrid Quasi-Newton Backpropagation is a supervised learning algorithm for training multi-layer perceptrons (MLPs) that integrates quasi-Newton optimization (specifically BFGS matrix updates, trust-region methods, and Wolfe-condition-based line search) within the backpropagation framework. It is designed to address shortcomings of standard gradient-descent backpropagation, such as poor minimization of the error objective over the weights, slow learning, and general instability, by using approximate second-order information to improve convergence and robustness (Chakraborty et al., 2012).

1. Problem Formulation and Error Objective

Let $W$ denote the vector of all adjustable weights, including biases, in an MLP with $o$ outputs, $h$ hidden neurons, and $n$ inputs. Given a supervised training set $\{(x^p, T^p)\}_{p=1}^P$ with $x^p \in \mathbb{R}^n$ and $T^p \in \mathbb{R}^o$ , network predictions are $O^p(W)$ . The learning objective is to minimize the mean-square error (MSE): $E(W) = \frac{1}{2P} \sum_{p=1}^P \|O^p(W) - T^p\|^2,$ yielding the minimization problem $W^* = \arg\min_W E(W)$ . The hybrid algorithm uses a quadratic model: $m_k(s) = E(W_k) + g_k^T s + \frac{1}{2} s^T B_k s,$ where $g_k = \nabla E(W_k)$ and $B_k$ is a positive-definite approximation to the Hessian.
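As a rough illustration of the objective above, the following NumPy sketch evaluates $E(W)$ for a single-hidden-layer MLP from a flat weight vector. The logistic sigmoid activation on both layers, the bias-as-last-column layout, and the helper names `unpack`, `forward`, and `mse` are assumptions made for this example, not details confirmed by the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(W, n, h, o):
    """Split the flat vector W into two layer matrices (hypothetical layout:
    hidden weights first, bias stored as the last column of each matrix)."""
    W1 = W[: h * (n + 1)].reshape(h, n + 1)
    W2 = W[h * (n + 1):].reshape(o, h + 1)
    return W1, W2

def forward(W, X, n, h, o):
    """Network outputs O^p(W) for every row x^p of X (shape P x n)."""
    W1, W2 = unpack(W, n, h, o)
    ones = np.ones((X.shape[0], 1))
    hidden = sigmoid(np.hstack([X, ones]) @ W1.T)      # shape (P, h)
    return sigmoid(np.hstack([hidden, ones]) @ W2.T)   # shape (P, o)

def mse(W, X, T, n, h, o):
    """E(W) = 1/(2P) * sum_p ||O^p(W) - T^p||^2."""
    O = forward(W, X, n, h, o)
    return 0.5 * np.mean(np.sum((O - T) ** 2, axis=1))
```

With all weights at zero every sigmoid outputs 0.5, so against all-zero targets and a single output the error is $\frac{1}{2} \cdot 0.25 = 0.125$, which is a convenient sanity check.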

2. Quasi-Newton Updates and BFGS Formula

At the core of this method is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton update. Beginning with $B_0 = I$ , the weight update steps $s_k = W_{k+1} - W_k$ and gradient differences $y_k = g_{k+1} - g_k$ yield: $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}.$ An equivalent recursion maintains the inverse Hessian approximation $H_k \approx B_k^{-1}$ : $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T,$ where $\rho_k = 1 / (y_k^T s_k)$ . These updates efficiently approximate local curvature, improving search directions and circumventing explicit Hessian computation.
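The inverse-Hessian recursion above translates almost directly into code. The sketch below is a minimal NumPy version; the function name `bfgs_inverse_update` and the safeguard that skips the update when $y_k^T s_k$ is not safely positive are assumptions added for illustration, not part of the source.

```python
import numpy as np

def bfgs_inverse_update(H, s, y, tol=1e-12):
    """Return H_{k+1} from H_k, the step s_k and the gradient difference y_k.

    Implements H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    with rho = 1 / (y^T s).
    """
    ys = float(y @ s)
    if ys <= tol:
        # Assumed safeguard: skip the update when the curvature
        # condition y^T s > 0 does not hold.
        return H
    rho = 1.0 / ys
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Usage: the updated matrix satisfies the secant equation H_{k+1} y_k = s_k.
H1 = bfgs_inverse_update(np.eye(2), np.array([1.0, 2.0]), np.array([0.5, 1.5]))
```

Checking the secant equation $H_{k+1} y_k = s_k$ after each update is a cheap way to catch indexing or transposition mistakes.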

3. Trust-Region and Line Search Mechanisms

The optimization step either restricts candidate steps $s$ to a trust region ($\|s\| \leq \Delta_k$) or seeks a step $\alpha_k p_k$ along the search direction $p_k = -B_k^{-1} g_k$. In both cases, step acceptability is governed by the agreement between actual and predicted reductions: $\rho = \frac{E(W_k) - E(W_k + s_k)}{m_k(0) - m_k(s_k)}.$ If $\rho$ is large (the quadratic model predicts the reduction well), the trust region is expanded; if $\rho$ is small, it is contracted. This ratio is unrelated to the BFGS scalar $\rho_k = 1/(y_k^T s_k)$ of Section 2, despite the shared symbol. The hybrid algorithm as presented employs an augmented line search, enforcing the strong Wolfe conditions for $\alpha_k$:

Sufficient decrease (Armijo):

$E(W_k + \alpha p_k) \leq E(W_k) + c_1 \alpha g_k^T p_k, \quad 0 < c_1 < \frac{1}{2}$

Curvature:

$|\nabla E(W_k + \alpha p_k)^T p_k| \leq c_2 |g_k^T p_k|, \quad c_1 < c_2 < 1$

A bracketing (zoom) approach iteratively refines $\alpha_k$ until both conditions are met.
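The two acceptance tests described in this section can be sketched in a few lines of plain Python. The function names are hypothetical, and the constants ($c_1 = 10^{-4}$, $c_2 = 0.9$, and the radius thresholds 0.25 and 0.75) are common textbook defaults assumed here; the source does not state which values it used.

```python
def satisfies_strong_wolfe(phi, dphi, alpha, c1=1e-4, c2=0.9):
    """Check both strong Wolfe conditions for a trial step alpha.

    phi(a)  = E(W_k + a * p_k)
    dphi(a) = grad E(W_k + a * p_k)^T p_k   (directional derivative)
    """
    phi0, dphi0 = phi(0.0), dphi(0.0)
    armijo = phi(alpha) <= phi0 + c1 * alpha * dphi0
    curvature = abs(dphi(alpha)) <= c2 * abs(dphi0)
    return armijo and curvature

def update_radius(ratio, delta, low=0.25, high=0.75, shrink=0.25, grow=2.0):
    """Simplified trust-region rule: contract on poor model agreement,
    expand on good agreement, otherwise keep the radius unchanged."""
    if ratio < low:
        return shrink * delta
    if ratio > high:
        return grow * delta
    return delta

# Usage on the 1-D model phi(a) = (a - 1)^2, a descent direction at a = 0.
phi = lambda a: (a - 1.0) ** 2
dphi = lambda a: 2.0 * (a - 1.0)
print(satisfies_strong_wolfe(phi, dphi, 1.0))   # exact minimizer: accepted
print(satisfies_strong_wolfe(phi, dphi, 0.01))  # too short: curvature test fails
```

A step that is too long fails the Armijo test, while a step that is too short fails the curvature test, so together the two conditions bracket an acceptable range of $\alpha$.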

4. Algorithmic Workflow

The hybrid backpropagation procedure iterates as follows (batch or pattern-by-pattern):

Initialization: $W_0$ drawn from $\mathcal{U}(-0.1, 0.1)$ ; $B_0$ is the identity.
Forward pass: Compute $O^p(W_k)$ for all inputs.
Gradient computation: Backpropagation yields $g_k = \nabla E(W_k)$. Explicitly:
- For each output neuron $k$ : $\delta_k = O_k (1 - O_k)(T_k - O_k)$ .
- For each hidden neuron $j$ : $\delta_j = O_j (1 - O_j) \sum_k w_{jk} \delta_k$ .
- Gradient for weight $w_{ij}$ : $g_{w_{ij}} = -\sum_p \delta_{\text{node}} \cdot \text{input}$ .
Search direction: Solve $B_k p_k = -g_k$ .
Line search: Find $\alpha_k$ satisfying Wolfe conditions.
Weight update: $s_k = \alpha_k p_k$ , $W_{k+1} = W_k + s_k$ .
BFGS update: Form $y_k = g_{k+1} - g_k$ , then update $B_{k+1}$ .
Stopping check: Halt if $\|g_k\| < \epsilon$ or $k \geq K_{\max}$ .
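The forward pass and delta rules listed above can be written in vectorized form as follows. This is a sketch under stated assumptions: logistic sigmoid units in both layers (which the $O(1-O)$ factors suggest), biases stored as an extra column, and a hypothetical flat weight layout. The $1/P$ factor in the gradient follows from the MSE definition in Section 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W, X, T, h):
    """Return E(W) and its gradient for an n -> h -> o sigmoid MLP.

    W is a flat vector: hidden-layer weights (h x (n+1)) followed by
    output-layer weights (o x (h+1)); the last column holds the biases.
    """
    P, n = X.shape
    o = T.shape[1]
    W1 = W[: h * (n + 1)].reshape(h, n + 1)
    W2 = W[h * (n + 1):].reshape(o, h + 1)

    # Forward pass
    Xb = np.hstack([X, np.ones((P, 1))])
    hid = sigmoid(Xb @ W1.T)                      # (P, h)
    Hb = np.hstack([hid, np.ones((P, 1))])
    O = sigmoid(Hb @ W2.T)                        # (P, o)
    E = 0.5 * np.mean(np.sum((O - T) ** 2, axis=1))

    # Backward pass: delta rules for output and hidden neurons
    delta_out = O * (1 - O) * (T - O)                       # (P, o)
    delta_hid = hid * (1 - hid) * (delta_out @ W2[:, :h])   # (P, h)

    # Gradient = -(1/P) * sum over patterns of delta * input
    gW2 = -(delta_out.T @ Hb) / P
    gW1 = -(delta_hid.T @ Xb) / P
    return E, np.concatenate([gW1.ravel(), gW2.ravel()])
```

Comparing the returned gradient with central finite differences of `E` on a few random weight vectors is a standard way to validate a backpropagation implementation before plugging it into a quasi-Newton loop.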

Pseudocode for the full batch hybrid quasi-Newton backpropagation:

Input: training set {(x^p, T^p)}, network, f(·), ε, K_max, c₁, c₂
Initialize: W₀ ← U(−0.1,0.1), B₀ ← I, k ← 0
repeat
  Forward pass for all patterns → O^p(W_k)
  Compute g_k = ∇E(W_k) via backprop
  Solve B_k p_k = −g_k
  Line search for α_k (Wolfe conditions)
  Set s_k = α_k p_k, W_{k+1} = W_k + s_k
  Compute g_{k+1} = ∇E(W_{k+1})
  Set y_k = g_{k+1} − g_k
  Update B_{k+1}
  k ← k + 1
until (‖g_k‖ < ε) or (k ≥ K_max)
Output: W_k
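The pseudocode above might be realized along the following lines. This is a sketch, not the authors' implementation: `fg` is a hypothetical callable returning the pair $(E(W), \nabla E(W))$, for example an MLP loss with a backpropagation gradient, and the line search is a simple bisection scheme for the weak Wolfe conditions, swapped in for the strong-Wolfe zoom procedure to keep the example short. The constants are assumed defaults.

```python
import numpy as np

def wolfe_bisection(fg, W, p, E0, g0, c1=1e-4, c2=0.9, max_iter=50):
    """Bisection search for a step satisfying the weak Wolfe conditions."""
    lo, hi, alpha = 0.0, np.inf, 1.0
    slope0 = g0 @ p
    for _ in range(max_iter):
        E, g = fg(W + alpha * p)
        if E > E0 + c1 * alpha * slope0:   # Armijo fails: step too long
            hi = alpha
        elif g @ p < c2 * slope0:          # curvature fails: step too short
            lo = alpha
        else:
            return alpha, E, g
        alpha = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
    E, g = fg(W + alpha * p)               # best effort after max_iter
    return alpha, E, g

def quasi_newton_train(fg, W0, eps=1e-6, k_max=200):
    """BFGS loop following the pseudocode: solve, line search, update."""
    W = np.asarray(W0, dtype=float).copy()
    B = np.eye(W.size)                     # B_0 = I
    E, g = fg(W)
    for _ in range(k_max):
        if np.linalg.norm(g) < eps:
            break
        p = np.linalg.solve(B, -g)         # B_k p_k = -g_k
        alpha, E_new, g_new = wolfe_bisection(fg, W, p, E, g)
        s = alpha * p
        y = g_new - g
        if y @ s > 1e-12:                  # keep B positive definite
            Bs = B @ s
            B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        W, E, g = W + s, E_new, g_new
    return W, E, np.linalg.norm(g)

# Usage on a small convex quadratic E(W) = 0.5 W^T A W - b^T W.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
W_opt, E_opt, g_norm = quasi_newton_train(
    lambda W: (0.5 * W @ A @ W - b @ W, A @ W - b), np.zeros(2))
```

Reusing the gradient returned by the line search as $g_{k+1}$ avoids the redundant gradient evaluation that a literal reading of the pseudocode would perform at the top of each iteration.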

5. Theoretical Convergence Properties

Global convergence is ensured under standard assumptions:

$E(W)$ is twice continuously differentiable, bounded below with compact level sets.
BFGS updates in conjunction with line search satisfying the strong Wolfe conditions preserve positive definiteness of $B_k$ .
It is ensured that $\|g_k\| \rightarrow 0$ , i.e., the method converges globally to a stationary point. These properties are underpinned by theory established in Dennis & Schnabel (1983) and Nocedal & Wright (1999) as cited in the source.
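The link between the Wolfe curvature condition and positive definiteness can be made explicit with a short derivation. It uses only the definitions from Sections 2 and 3 and assumes that $p_k$ is a descent direction, i.e. $g_k^T p_k < 0$, and that $\alpha_k > 0$.

```latex
% Strong Wolfe curvature condition with g_k^T p_k < 0:
%   |g_{k+1}^T p_k| \le c_2 |g_k^T p_k|  implies  g_{k+1}^T p_k \ge c_2\, g_k^T p_k .
\begin{aligned}
y_k^T s_k &= \alpha_k \,(g_{k+1} - g_k)^T p_k \\
          &\ge \alpha_k \,(c_2 - 1)\, g_k^T p_k \\
          &> 0 \qquad \text{since } \alpha_k > 0,\; c_2 < 1,\; g_k^T p_k < 0 .
\end{aligned}
```

Because $y_k^T s_k > 0$, the BFGS formula maps a positive-definite $B_k$ to a positive-definite $B_{k+1}$, so each subsequent $p_k = -B_k^{-1} g_k$ remains a descent direction.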

6. Empirical Evaluation and Results

The algorithm was evaluated on MLPs with a single hidden layer and architecture $2 \to h \to 1$ (hidden $h$ in the typical range 5–20), using standard benchmark problems:

Task	Training MSE	Test MSE	CPU Time (s)
Beale function	0.0010709 (0.107%)	0.013954 (1.40%)	69.37
Booth function	0.00009874 (0.01%)	0.0144 (1.44%)	70.25
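For reference, the two benchmark targets have standard closed forms, sketched below in plain Python. How the source sampled and scaled these functions to build its training and test sets is not stated here, so the snippet defines only the functions themselves.

```python
def booth(x, y):
    """Booth function; global minimum 0 at (1, 3)."""
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def beale(x, y):
    """Beale function; global minimum 0 at (3, 0.5)."""
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)
```

Both take two inputs and return one value, which matches the $2 \to h \to 1$ architecture used in the evaluation. If the output neuron is a sigmoid, the targets would need rescaling into $(0, 1)$ before training; that preprocessing step is an assumption, not something the source describes.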

A comparison was made with standard gradient-descent backpropagation (hand-tuned learning rate):

Algorithm	Booth error	Beale error
Quasi-Newton (proposed)	1.44%	1.3954%
Gradient Descent	13.59%	16.77%

The hybrid quasi-Newton method consistently achieved lower training and test MSE.
Training converged roughly an order of magnitude faster, measured in required epochs.
Empirical regression plots indicated near-linear fit ( $R \approx 1$ ).

A plausible implication is that quasi-Newton refinement of curvature avoids the need for learning-rate tuning and increases robustness for non-linear MLP optimization. This suggests significant advantages for moderate-dimensional networks where fully second-order information is intractable but first-order methods are insufficiently stable or too slow.

Hybrid Quasi-Newton Backpropagation as detailed by Ghosh & Chakraborty (Chakraborty et al., 2012) demonstrates robust convergence and efficiency improvements over plain gradient-based backpropagation for MLP training, especially on structured low-dimensional tasks. Its reliance on batch-mode curvature estimation and matrix updates scales less favorably with very high-dimensional weight spaces, limiting applicability for large-scale modern deep architectures without further adaptation.
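The scaling concern can be made concrete with some back-of-the-envelope arithmetic. The sketch below counts the weights of a single-hidden-layer MLP (biases included, as in Section 1) and the storage needed for a dense $B_k$; the helper names and the 8-bytes-per-entry (float64) assumption are illustrative.

```python
def mlp_weight_count(n, h, o):
    """Number of weights, including biases, in an n -> h -> o MLP."""
    return (n + 1) * h + (h + 1) * o

def dense_bfgs_bytes(num_weights, bytes_per_entry=8):
    """Memory for a dense num_weights x num_weights Hessian approximation."""
    return num_weights ** 2 * bytes_per_entry

small = mlp_weight_count(2, 20, 1)
print(small, dense_bfgs_bytes(small))   # 81 weights -> 52488 bytes
print(dense_bfgs_bytes(1_000_000))      # 8000000000000 bytes (about 8 TB)
```

The benchmark-sized network needs only tens of kilobytes for $B_k$, whereas a model with a million weights would need terabytes, which is why dense quasi-Newton updates do not carry over directly to large architectures.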

The trust-region and line search concepts are foundational in classical unconstrained optimization, bridging first-order neural learning with robust numerical methods. While the approach predates recent advances in adaptive first-order optimizers, a plausible implication is that such hybrid quasi-Newton enhancements remain relevant for domains where convergence reliability and hand-tuning avoidance are critical.

This method connects directly to established theory on BFGS and trust-region optimization in machine learning and serves as an explicit illustration of second-order optimization within the backpropagation paradigm.


References

Chakraborty et al. (2012). Hybrid Optimized Back propagation Learning Algorithm For Multi-layer Perceptron.
