
Modified BCD with Skip Connections

Updated 29 October 2025
  • The modified BCD with skip connections is a training method that reformulates layer dependencies in deep neural networks, enabling global convergence guarantees for networks with ReLU activations.
  • It integrates mathematical recursion formulas to incorporate skip paths and non-negative projection, ensuring feasible optimization at each block update.
  • Empirical studies demonstrate improved training loss decay and robust performance on both synthetic and real-world networks compared to standard BCD.

A modified Block Coordinate Descent (BCD) algorithm with skip connections is a training procedure for deep neural networks in which optimization over network parameters and auxiliary activations proceeds blockwise, while the architectural recursion formulas explicitly encode skip (shortcut) connections between layers. This modification enables global convergence guarantees for architectures employing rectified activations (e.g., ReLU), by reformulating layerwise dependencies and optimization constraints to account for skip connections and their non-negative projection properties. These developments are grounded in both theoretical analysis and empirical validation, establishing rigorous solution properties even for networks where standard BCD may fail.

1. Mathematical Foundations of Skip Connections for Layerwise Recursion

Skip connections are formalized through recursion formulas specifying the dependencies of each layer output on its preceding layers. In a classical feed-forward network, layer $i$ computes

$$X_i = g_i\big[F(X_{i-1}, \theta_i)\big],$$

which implies strictly sequential propagation. In networks with skip connections (e.g., ResNet), the formula is

$$X_i = g_i\big[X_{i-1} + F(X_{i-1}, \theta_i)\big],$$

where the addition constitutes the skip connection, allowing direct propagation from previous layers and parallelizing signal flow.
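
For concreteness, the contrast between the two recursions can be sketched in a few lines of NumPy. This is an illustrative toy, assuming square weight matrices (so the skip addition is dimensionally valid), ReLU as $g_i$, and a purely linear $F(X_{i-1}, \theta_i) = W_i X_{i-1}$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_plain(x, weights):
    """Classical recursion: X_i = g_i[F(X_{i-1}, theta_i)] with F = W_i X_{i-1}, g_i = ReLU."""
    for W in weights:
        x = relu(W @ x)
    return x

def forward_skip(x, weights):
    """Skip-connected recursion: X_i = g_i[X_{i-1} + F(X_{i-1}, theta_i)]."""
    for W in weights:
        x = relu(x + W @ x)
    return x

# toy example: three square layers of width 4
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(3)]
x0 = rng.normal(size=4)
print(forward_plain(x0, weights))
print(forward_skip(x0, weights))
```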

The effect of skip connections is evident in the derivative expansion

$$\frac{\partial X_L}{\partial X_{L-i}} = (1 + W_L)(1 + W_{L-1})\cdots(1 + W_{L-i+1}),$$

where each $(1 + W_k)$ factor encodes the possibility of traversing the corresponding skip or transformation path. Expansions such as

$$(1 + W_1)(1 + W_2) = 1 + W_1 + W_2 + W_1 W_2$$

illustrate how shortcut paths create combinatorially many routes from input to output, enhancing expressivity and gradient flow.
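
As a quick numerical check of this expansion, the following snippet verifies that the product of the two factors equals the sum over the four paths, with identity matrices standing in for the scalar 1 (an assumption about how the factors are read for matrix-valued weights):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
I = np.eye(d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

lhs = (I + W1) @ (I + W2)        # product of per-layer Jacobian factors
rhs = I + W1 + W2 + W1 @ W2      # sum over the four input-to-output paths
print(np.allclose(lhs, rhs))     # True
```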

2. Formal Integration of Skip Connections in BCD Optimization

The modified BCD procedure explicitly captures skip connections within the layerwise activation variables. For an $L$-layer network with activation function $\sigma$, the recursion formula with skip connections is written as

$$V_{j,i} = \sigma(W_j V_{j-1,i}) + V_{j-1,i},$$

where $V_{j,i}$ are auxiliary variables for the layer outputs and $W_j$ are the layer weights. The BCD objective is formulated as

$$\min_{\mathbf{W}, \mathbf{V}} F(\mathbf{W}, \mathbf{V}) = \sum_{i=1}^{n} \Bigg[ \big(W_L V_{L-1,i} - y_i\big)^2 + \gamma \sum_{j=2}^{L-1} \big\| \sigma(W_j V_{j-1,i}) + V_{j-1,i} - V_{j,i} \big\|^2 + \gamma \big\| \sigma(W_1 V_{0,i}) - V_{1,i} \big\|^2 \Bigg],$$

where $y_i$ are targets and $\gamma$ is a regularization parameter. This structure enforces that each layer's activation aligns with the skip-connected recursion, and all BCD block updates must respect these expanded dependencies.
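
A minimal sketch of this objective in NumPy may help fix the indexing. It assumes ReLU as $\sigma$, a low-dimensional output layer, and that $V_{0,i}$ denotes the input $x_i$; the variable names are illustrative rather than taken from the reference implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bcd_objective(W, V, X, y, gamma):
    """Penalized BCD objective F(W, V) for the skip-connected recursion.

    W : list [W_1, ..., W_L]              (W_L maps the last hidden layer to the output)
    V : list of per-sample arrays with V[j-1][i] = V_{j,i} for j = 1, ..., L-1
    X : inputs, X[i] = V_{0,i}
    y : targets
    """
    n, L = len(y), len(W)
    total = 0.0
    for i in range(n):
        # output-layer data-fit term
        total += float(np.sum((W[L - 1] @ V[L - 2][i] - y[i]) ** 2))
        # skip-connected recursion penalties for hidden layers j = 2, ..., L-1
        for j in range(2, L):
            r = relu(W[j - 1] @ V[j - 2][i]) + V[j - 2][i] - V[j - 1][i]
            total += gamma * float(np.sum(r ** 2))
        # first layer has no skip term
        r1 = relu(W[0] @ X[i]) - V[0][i]
        total += gamma * float(np.sum(r1 ** 2))
    return total
```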

3. Algorithmic Steps: Block Updates and Non-Negative Projection

The algorithm comprises the following main steps (as outlined in (Akiyama, 26 Oct 2025)); a schematic code sketch is given after this list:

  • Initialization: Each $V_{j,i}$ is initialized by the skip-connected recursion, i.e., $V_{j,i} = \sigma(W_j V_{j-1,i}) + V_{j-1,i}$ for $j \geq 2$.
  • Output Layer Update: For each $i$, optimize $V_{L-1,i}$ to reduce the output loss, $V_{L-1,i} \gets V_{L-1,i} - \eta_V \nabla_{V_{L-1,i}} \|W_L V_{L-1,i} - y_i\|^2$, followed by the projection $V_{L-1,i} \gets (V_{L-1,i})^+$ to ensure non-negativity.
  • Hidden Layer Updates: For $j = L-1, \dots, 2$:
    • Update $W_j$ via gradient steps on the skip-constraint loss.
    • Update $V_{j-1,i}$ accordingly and project it onto the non-negative orthant.
  • First Layer Weights: Optimize $W_1$ via multiple inner updates to align the input activations with the projected outputs.

The non-negative projection $(V_{j,i})^+ = \max\{V_{j,i}, 0\}$ ensures compatibility with the codomain of the ReLU activation.
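
The following is a minimal, schematic sketch of one BCD sweep under simplifying assumptions (one gradient step per block, hand-written gradients of the penalty terms, square hidden layers so the skip addition is well defined); it illustrates the listed steps rather than reproducing the reference implementation of (Akiyama, 26 Oct 2025):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(z):
    # subgradient of ReLU: 1 where the pre-activation is positive
    return (z > 0).astype(float)

def modified_bcd_epoch(W, V, X, y, gamma, eta_w=1e-3, eta_v=1e-3, inner_first=5):
    """One sweep of the modified BCD block updates (schematic).

    W : list [W_1, ..., W_L]; V : per-sample arrays with V[j-1][i] = V_{j,i};
    X[i] = V_{0,i}; y[i] = target. Gradients of the penalty terms of F(W, V)
    are written out by hand; one gradient step per block for brevity.
    """
    n, L = len(y), len(W)

    # output-layer update of V_{L-1,i}, followed by non-negative projection
    for i in range(n):
        resid = W[L - 1] @ V[L - 2][i] - y[i]
        grad = 2.0 * W[L - 1].T @ resid
        V[L - 2][i] = np.maximum(V[L - 2][i] - eta_v * grad, 0.0)

    # hidden-layer updates for j = L-1, ..., 2
    for j in range(L - 1, 1, -1):
        # gradient step on W_j for the skip-constraint penalty
        grad_W = np.zeros_like(W[j - 1])
        for i in range(n):
            z = W[j - 1] @ V[j - 2][i]
            r = relu(z) + V[j - 2][i] - V[j - 1][i]
            grad_W += 2.0 * np.outer(relu_grad(z) * r, V[j - 2][i])
        W[j - 1] -= eta_w * gamma * grad_W

        # gradient step on V_{j-1,i}, then project onto the non-negative orthant
        for i in range(n):
            z = W[j - 1] @ V[j - 2][i]
            r = relu(z) + V[j - 2][i] - V[j - 1][i]
            grad_V = 2.0 * (W[j - 1].T @ (relu_grad(z) * r) + r)
            V[j - 2][i] = np.maximum(V[j - 2][i] - eta_v * gamma * grad_V, 0.0)

    # first-layer weights: several inner gradient steps on the input penalty
    for _ in range(inner_first):
        grad_W1 = np.zeros_like(W[0])
        for i in range(n):
            z = W[0] @ X[i]
            r = relu(z) - V[0][i]
            grad_W1 += 2.0 * np.outer(relu_grad(z) * r, X[i])
        W[0] -= eta_w * gamma * grad_W1
    return W, V
```

A full training run would repeat such sweeps for the number of iterations, and with the step sizes, required by the convergence analysis.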

4. Theoretical Guarantees for Convergence and Feasibility

The modified BCD algorithm with skip connections and non-negative projection is proven to converge to a global minimum even when $\sigma$ is non-strictly monotonic, as with ReLU ((Akiyama, 26 Oct 2025), Thm. 6.1):

  • For any $\epsilon > 0$, there exist step sizes and an iteration count such that $F(\mathbf{W}, \mathbf{V}) \leq \epsilon$.
  • The skip-connection structure guarantees that a feasible preimage exists for any activation value required by the optimization, circumventing the dead zones and non-bijectivity inherent to ReLU.
  • Feasibility conditions (Lemmas 6.2, 6.3) require the output-layer weights $W_L$ to contain entries of mixed sign; in practice, random Gaussian initialization suffices as the width grows.
  • The output-layer loss decays exponentially during BCD, while the error in the auxiliary variables remains controlled.

These guarantees represent a substantial advance over standard BCD, which may converge only for strictly monotonic activation functions and can stall for ReLU because negative components of the auxiliary variables are unattainable.

5. Empirical Performance: Convergence and Architecture Impact

Empirical studies ((Akiyama, 26 Oct 2025), Sec. 7.2) demonstrate:

  • On synthetic regression tasks with deep ReLU networks (4–12 layers, width 30), the modified BCD achieves a monotonic decrease in training loss toward zero, as predicted by the theory.
  • An ablation without skip connections shows BCD stagnating on ReLU networks (the loss fails to decrease), underscoring the necessity of the skip-connected recursion and the non-negative projection.
  • Initialization via Singular Value Bounding (SVB) further stabilizes training.

Architectural experiments inspired by mathematical recursion formulas (e.g., (Liao et al., 2021)) show systematic performance improvements in ResNet variants, with statistically significant accuracy gains on CIFAR and ImageNet, and minimal computational cost increase.

6. Broader Architectural and Optimization Implications

Recursion formulas encapsulate signal dependency and propagation; network architectures are tightly determined by their mathematical recursion. Designing architectures via explicit recursive formulas enables principled control of information flow, skip paths, memory, and parallelism.

For iterative training algorithms such as BCD, skip connections must be incorporated within the update rules and gradient computations. This includes extending block dependencies beyond direct predecessors, calculating relevant Jacobians with skip path expansions, and ensuring projection constraints are enforced as necessitated by activation type.
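
To make the Jacobian point concrete, the following sketch computes the Jacobian of a single skip-connected ReLU block and verifies it by finite differences; the block form $\sigma(Wx) + x$ and the helper names are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def skip_block(W, x):
    # skip-connected block: sigma(W x) + x
    return relu(W @ x) + x

def skip_block_jacobian(W, x):
    # analytic Jacobian: diag(relu'(W x)) @ W + I  (the "+ I" term is the skip path)
    z = W @ x
    return (z > 0).astype(float)[:, None] * W + np.eye(len(x))

# finite-difference check of the analytic Jacobian
rng = np.random.default_rng(2)
d = 4
W, x = rng.normal(size=(d, d)), rng.normal(size=d)
eps = 1e-6
J_fd = np.stack([(skip_block(W, x + eps * e) - skip_block(W, x - eps * e)) / (2 * eps)
                 for e in np.eye(d)], axis=1)
print(np.allclose(skip_block_jacobian(W, x), J_fd, atol=1e-5))  # True (away from ReLU kinks)
```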

In encoder–decoder arrangements (Xiang et al., 2022), bi-directional skip connections further extend blockwise updating capabilities, reflecting principles similar to BCD: alternating updates, recursive feature flow, and optional NAS-driven pruning for efficient implementation.

7. Summary Table: Algorithmic Components

| Component | Role | Implementation/Equation |
| --- | --- | --- |
| Skip-connected recursion | Signal propagation & block dependencies | $V_{j,i} = \sigma(W_j V_{j-1,i}) + V_{j-1,i}$ |
| Objective function | Enforces recursion and data fit | See $F(\mathbf{W}, \mathbf{V})$ formula above |
| Non-negative projection | Feasibility for ReLU outputs | $V_{j,i} \gets \max\{V_{j,i}, 0\}$ |
| Output layer update | Minimizes prediction error | Gradient step on $V_{L-1,i}$, then projection |
| Hidden layer update | Satisfies recursion and architectural constraints | Gradient steps for $W_j$, $V_{j-1,i}$ |

Each component implements an explicit mathematical principle from skip-connected architectures, yielding both practical and theoretical benefits for deep network training.

8. Conclusion

Modified BCD algorithms with skip connections systematically integrate architectural recursion formulas into blockwise optimization, yielding global convergence guarantees for deep networks equipped with ReLU and other modern activation functions. The approach is deeply rooted in the explicit mathematical description of information propagation and leverages additive features, multi-path signal routes, and projection steps to ensure feasibility and optimization efficacy. Empirical evidence substantiates these guarantees, and methodological insights apply broadly to neural architecture design, iterative optimization, and algorithmic co-design where skip connections enhance signal and gradient flow.
