Modified BCD with Skip Connections
- The modified BCD with skip connections is a training method that reformulates layer dependencies for deep neural networks, enabling global convergence guarantees for networks with ReLU activations.
- It integrates mathematical recursion formulas to incorporate skip paths and non-negative projection, ensuring feasible optimization at each block update.
- Empirical studies demonstrate improved training loss decay and robust performance on both synthetic and real-world networks compared to standard BCD.
A modified Block Coordinate Descent (BCD) algorithm with skip connections is a training procedure for deep neural networks in which optimization over network parameters and auxiliary activations proceeds blockwise, while the architectural recursion formulas explicitly encode skip (shortcut) connections between layers. This modification enables global convergence guarantees for architectures employing rectified activations (e.g., ReLU), by reformulating layerwise dependencies and optimization constraints to account for skip connections and their non-negative projection properties. These developments are grounded in both theoretical analysis and empirical validation, establishing rigorous solution properties even for networks where standard BCD may fail.
1. Mathematical Foundations of Skip Connections for Layerwise Recursion
Skip connections are formalized through recursion formulas specifying the dependencies of each layer output on its preceding layers. In a classical feedforward network, layer $l$ computes
$$h_l = \sigma(W_l h_{l-1}),$$
which implies strictly sequential propagation. In networks with skip connections (e.g., ResNet), the formula becomes
$$h_l = \sigma(W_l h_{l-1}) + h_{l-1},$$
where the additive term constitutes the skip connection, allowing direct propagation from previous layers and serving to parallelize signal flow.
The effect of skip connections is evident in the derivative expansion
$$\frac{\partial h_L}{\partial h_0} \;=\; \prod_{l=1}^{L}\left(\frac{\partial\,\sigma(W_l h_{l-1})}{\partial h_{l-1}} + I\right),$$
where each factor encodes the possibility of traversing either the corresponding transformation or its skip path. Expanding the product into a sum of $2^L$ terms, one per subset of layers traversed through their transformations, illustrates how shortcut paths create combinatorially many routes from input to output, enhancing expressivity and gradient flow.
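To make the two recursions concrete, here is a minimal NumPy sketch of the plain and skip-connected forward passes. It assumes all layers share the same width (so the additive skip is well defined); the specific dimensions and random initialization are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_plain(X, weights):
    """Classical recursion: h_l = sigma(W_l h_{l-1})."""
    h = X
    for W in weights:
        h = relu(W @ h)
    return h

def forward_skip(X, weights):
    """Skip-connected recursion: h_l = sigma(W_l h_{l-1}) + h_{l-1}.
    All layers are assumed to share the same width so the addition is defined."""
    h = X
    for W in weights:
        h = relu(W @ h) + h
    return h

# Illustrative setting: 4 layers of width 30 (as in the synthetic experiments below).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((30, 30)) / np.sqrt(30) for _ in range(4)]
X = rng.standard_normal((30, 8))
print(forward_plain(X, weights).shape, forward_skip(X, weights).shape)
```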
2. Formal Integration of Skip Connections in BCD Optimization
The modified BCD procedure explicitly captures skip connections within the layerwise activation variables. For an $L$-layer network with activation function $\sigma$, the recursion formula with skip connections is written as
$$V_l = \sigma(W_l V_{l-1}) + V_{l-1}, \qquad l = 2, \dots, L, \qquad V_1 = \sigma(W_1 X),$$
where the $V_l$ are auxiliary variables for the layer outputs and the $W_l$ are layer weights. The BCD objective is formulated as
$$\min_{\{W_l\},\{V_l\}} \; \frac{1}{2}\|V_L - Y\|^2 \;+\; \frac{\gamma}{2}\|V_1 - \sigma(W_1 X)\|^2 \;+\; \frac{\gamma}{2}\sum_{l=2}^{L}\bigl\|V_l - \sigma(W_l V_{l-1}) - V_{l-1}\bigr\|^2,$$
where $Y$ denotes the targets and $\gamma > 0$ is a regularization parameter. This structure enforces that each layer's activation aligns with the skip-connected recursion, and all BCD block updates must respect these expanded dependencies.
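As a concrete reading of this objective, the following NumPy sketch evaluates it for given weights and auxiliary variables. The treatment of the first layer (mapping the input `X` without a skip term) and the unweighted squared norms are assumptions of this sketch rather than a verbatim transcription of the cited formulation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def bcd_objective(X, Y, W, V, gamma):
    """Evaluate the quadratic-penalty objective sketched above.
    W = [W_1, ..., W_L], V = [V_1, ..., V_L] (0-based Python lists);
    the first layer maps the input X without a skip term (an assumption)."""
    fit = 0.5 * np.sum((V[-1] - Y) ** 2)                      # output fit: 1/2 ||V_L - Y||^2
    pen = 0.5 * gamma * np.sum((V[0] - relu(W[0] @ X)) ** 2)  # first-layer constraint
    for l in range(1, len(W)):                                # skip-connected constraints
        pen += 0.5 * gamma * np.sum((V[l] - relu(W[l] @ V[l - 1]) - V[l - 1]) ** 2)
    return fit + pen
```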
3. Algorithmic Steps: Block Updates and Non-Negative Projection
The algorithm comprises the following main steps (as outlined in (Akiyama, 26 Oct 2025)):
- Initialization: Each $V_l$ is initialized by the skip-connected recursion; i.e., $V_1 = \sigma(W_1 X)$ and $V_l = \sigma(W_l V_{l-1}) + V_{l-1}$ for $l = 2, \dots, L$.
- Output Layer Update: In each sweep, update the output activations $V_L$ by gradient steps on the output loss $\tfrac{1}{2}\|V_L - Y\|^2$, followed by the projection $V_L \leftarrow \max(V_L, 0)$, ensuring non-negativity.
- Hidden Layer Updates: For each hidden layer $l$, proceeding from the output side toward the input:
  - Update $W_l$ via gradient steps w.r.t. the skip-constraint loss $\|V_l - \sigma(W_l V_{l-1}) - V_{l-1}\|^2$.
  - Update the associated activation variable $V_{l-1}$ accordingly and project it onto the non-negative orthant.
- First Layer Weights: Optimize $W_1$ via multiple inner updates to align the first-layer activations $\sigma(W_1 X)$ with the projected outputs $V_1$.
The non-negative projection $\max(\cdot, 0)$ ensures compatibility with the ReLU activation's codomain; a schematic sweep is sketched below.
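The following NumPy sketch illustrates one such blockwise sweep under the quadratic-penalty objective of Section 2. It is a schematic rendering, not the reference implementation: the step sizes, the number of inner updates, and the simplification of using only the layer-$l$ penalty term when updating each activation block are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def bcd_sweep(X, Y, W, V, gamma=1.0, eta=1e-2, inner_steps=5):
    """One illustrative blockwise sweep. W = [W_1, ..., W_L], V = [V_1, ..., V_L]
    (0-based lists). Step sizes, update order, and the use of only the layer-l
    penalty term for each activation block are simplifying assumptions."""
    L = len(W)

    # Output block: gradient step on the fit term 1/2 ||V_L - Y||^2, then projection.
    V[L - 1] -= eta * (V[L - 1] - Y)
    V[L - 1] = np.maximum(V[L - 1], 0.0)

    # Hidden blocks, swept from the output side toward the input (layers L, ..., 2).
    for l in range(L - 1, 0, -1):
        pre = W[l] @ V[l - 1]
        resid = V[l] - relu(pre) - V[l - 1]      # skip-constraint residual
        mask = (pre > 0).astype(pre.dtype)       # ReLU derivative

        # Gradient step on W_l for the penalty term gamma/2 * ||resid||^2.
        W[l] += eta * gamma * (mask * resid) @ V[l - 1].T

        # Gradient step on V_{l-1}: it enters both through sigma(W_l V_{l-1}) and
        # through the additive skip term, hence the extra "+ resid" contribution.
        pre = W[l] @ V[l - 1]
        resid = V[l] - relu(pre) - V[l - 1]
        mask = (pre > 0).astype(pre.dtype)
        V[l - 1] += eta * gamma * (W[l].T @ (mask * resid) + resid)
        V[l - 1] = np.maximum(V[l - 1], 0.0)     # non-negative projection

    # First-layer weights: several inner steps aligning sigma(W_1 X) with V_1.
    for _ in range(inner_steps):
        pre = W[0] @ X
        resid = V[0] - relu(pre)
        mask = (pre > 0).astype(pre.dtype)
        W[0] += eta * gamma * (mask * resid) @ X.T
    return W, V
```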
4. Theoretical Guarantees for Convergence and Feasibility
The modified BCD algorithm with skip connections and non-negative projection is proven to converge to global minima, even when $\sigma$ is not strictly monotonic, as with ReLU ((Akiyama, 26 Oct 2025), Thm. 6.1):
- For any target accuracy $\varepsilon > 0$, there exist step sizes and a number of iterations $K$ such that the training loss is driven below $\varepsilon$.
- The skip-connection structure guarantees a feasible preimage exists for any activation value required by optimization, circumventing dead zones or non-bijectivity inherent to ReLU.
- Feasibility conditions (Lemmas 6.2, 6.3) require the output-layer weights to contain entries of both signs; in practice, random Gaussian initialization suffices as the width grows.
- The loss over the output layers decays exponentially during BCD, while the error in the auxiliary variables remains controlled.
These guarantees represent a substantial advance over standard BCD, which may only converge for strictly monotonic activation functions, and may get stuck for ReLU due to unattainable negative components.
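Stated compactly, with $\mathcal{L}$ denoting the objective of Section 2 evaluated after $K$ BCD iterations (the precise dependence of the step sizes and of $K$ on $\varepsilon$ is given in the cited theorem):
$$\forall\, \varepsilon > 0 \;\; \exists\, \eta > 0,\; K \in \mathbb{N}: \qquad \mathcal{L}\bigl(\{W_l^{(K)}\}, \{V_l^{(K)}\}\bigr) \;\le\; \varepsilon.$$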
5. Empirical Performance: Convergence and Architecture Impact
Empirical studies ((Akiyama, 26 Oct 2025), Sec. 7.2) demonstrate:
- On synthetic regression tasks with deep ReLU networks (4–12 layers, width 30), the modified BCD achieves a monotonic decrease in training loss approaching zero, as theoretically predicted.
- Ablation without skip connections shows BCD stagnation for ReLU nets (loss fails to decrease), underscoring the necessity of skip-connected recursion and non-negative projection.
- Initialization via Singular Value Bounding (SVB) further stabilizes training.
Architectural experiments inspired by mathematical recursion formulas (e.g., (Liao et al., 2021)) show systematic performance improvements in ResNet variants, with statistically significant accuracy gains on CIFAR and ImageNet, and minimal computational cost increase.
6. Broader Architectural and Optimization Implications
Recursion formulas encapsulate signal dependency and propagation; network architectures are tightly determined by their mathematical recursion. Designing architectures via explicit recursive formulas enables principled control of information flow, skip paths, memory, and parallelism.
For iterative training algorithms such as BCD, skip connections must be incorporated within the update rules and gradient computations. This includes extending block dependencies beyond direct predecessors, calculating relevant Jacobians with skip path expansions, and ensuring projection constraints are enforced as necessitated by activation type.
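A minimal sketch of the Jacobian point: for a single skip-connected ReLU layer, the Jacobian with respect to its input picks up an additive identity term from the shortcut path, and the end-to-end Jacobian is a product of such factors. The function below is illustrative only (single sample, square weight matrix assumed).

```python
import numpy as np

def layer_jacobian_skip(W, h_prev):
    """Jacobian of h_l = relu(W @ h_prev) + h_prev with respect to h_prev
    (single sample; W is assumed square so the skip addition is defined)."""
    pre = W @ h_prev
    D = np.diag((pre > 0).astype(float))   # ReLU derivative (subgradient 0 at the kink)
    return D @ W + np.eye(W.shape[1])      # transformation path + identity skip path

# The full input-output Jacobian is the product of such per-layer factors;
# expanding that product enumerates the combinatorially many paths noted in Section 1.
```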
In encoder–decoder arrangements (Xiang et al., 2022), bi-directional skip connections further extend blockwise updating capabilities, reflecting principles similar to BCD: alternate updates, recursive feature flow, and possible NAS-driven pruning for efficient implementation.
7. Summary Table: Algorithmic Components
| Component | Role | Implementation/Equation |
|---|---|---|
| Skip-connected recursion | Signal propagation & block dependencies | $V_l = \sigma(W_l V_{l-1}) + V_{l-1}$ |
| Objective function | Enforces recursion and data fit | See formula above |
| Non-negative projection | Feasibility for ReLU outputs | $V_l \leftarrow \max(V_l, 0)$ |
| Output layer update | Minimizes prediction error | Gradient step on $V_L$, then projection |
| Hidden layer update | Satisfies recursion and architectural constraint | Gradient steps for $W_l$, $V_{l-1}$, then projection |
Each component implements an explicit mathematical principle from skip-connected architectures, yielding both practical and theoretical benefits for deep network training.
8. Conclusion
Modified BCD algorithms with skip connections systematically integrate architectural recursion formulas into blockwise optimization, yielding global convergence guarantees for deep networks equipped with ReLU and other modern activation functions. The approach is deeply rooted in the explicit mathematical description of information propagation and leverages additive features, multi-path signal routes, and projection steps to ensure feasibility and optimization efficacy. Empirical evidence substantiates these guarantees, and methodological insights apply broadly to neural architecture design, iterative optimization, and algorithmic co-design where skip connections enhance signal and gradient flow.