
Dynamic Routing in Capsule Networks

Updated 26 December 2025
  • Dynamic Routing Between Capsules is a neural algorithm that assigns part–whole relationships by iteratively computing routing coefficients via agreement maximization.
  • Its mathematical framework leverages softmax-normalized coupling coefficients and a nonlinear squash function to produce viewpoint-aware feature hierarchies.
  • Various algorithmic variants enhance efficiency and scalability, addressing challenges such as sparse gradients and computational cost in deep capsule networks.

Dynamic routing between capsules is a neural inference algorithm designed to facilitate flexible part–whole assignment in capsule networks. Unlike conventional deep neural units, which are scalar-valued, a capsule is a vector- (or matrix-) valued entity whose activity encodes both presence and instantiation parameters (pose, deformation, attributes) for a given entity or part. The dynamic routing algorithm deterministically computes the assignment weights (routing coefficients) between lower-level and higher-level capsules via an iterative process that maximizes agreement, thereby enabling compositional, viewpoint-aware feature hierarchies. The mechanism is central to the original Capsule Networks (CapsNets) and underpins their ability to encode spatial relationships, parse novel configurations, and generalize over complex transformations.

1. Mathematical Framework of Dynamic Routing

Dynamic routing operates between two adjacent layers of capsules: a lower layer with $N_l$ capsules and an upper layer with $N_{l+1}$ capsules. Each lower-layer capsule $i$ outputs $\mathbf{u}_i \in \mathbb{R}^d$, which is linearly transformed to produce "prediction vectors" for each upper capsule $j$:

$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i,$$

where $\mathbf{W}_{ij}$ is a learned transformation matrix.

The routing mechanism maintains a set of coupling coefficients $c_{ij}$ that specify how much information flows from capsule $i$ (layer $l$) to capsule $j$ (layer $l+1$). These are computed by a row-wise softmax over per-capsule logit variables $b_{ij}$:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{N_{l+1}} \exp(b_{ik})}, \qquad \sum_j c_{ij} = 1,$$

with $b_{ij}$ initialized to zero.

The upper-capsule representations are computed as

$$\mathbf{s}_j = \sum_{i=1}^{N_l} c_{ij}\,\hat{\mathbf{u}}_{j|i}, \qquad \mathbf{v}_j = \mathrm{squash}(\mathbf{s}_j) = \frac{\|\mathbf{s}_j\|^2}{1+\|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}.$$

Routing updates proceed by measuring agreement (typically a dot product) between each prediction vector and the resulting upper capsule, adjusting $b_{ij}$ iteratively:

$$b_{ij} \leftarrow b_{ij} + \mathbf{v}_j \cdot \hat{\mathbf{u}}_{j|i}.$$

This loop typically runs for 3 routing iterations, after which the upper-layer capsules $\{\mathbf{v}_j\}$ are used for subsequent layers or the final classification (Sabour et al., 2017).
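
The following NumPy sketch illustrates the routing loop described above; the tensor shapes, variable names (`u_hat`, `num_iters`), and the random example are illustrative assumptions rather than the reference implementation of Sabour et al. (2017).

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: preserves direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors of shape (N_l, N_{l+1}, d), u_hat[i, j] = W_ij @ u_i.
    Returns upper-layer capsule outputs v of shape (N_{l+1}, d)."""
    N_l, N_l1, _ = u_hat.shape
    b = np.zeros((N_l, N_l1))                                  # routing logits, initialized to zero
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # softmax over parents j
        s = np.einsum('ij,ijd->jd', c, u_hat)                  # weighted sum of votes per parent
        v = squash(s)                                          # squashed upper-capsule output
        b = b + np.einsum('ijd,jd->ij', u_hat, v)              # agreement (dot-product) update
    return v

# Example: 8 child capsules routed to 3 parent capsules with 16-dimensional poses.
v = dynamic_routing(np.random.randn(8, 3, 16))
print(v.shape)  # (3, 16)
```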

2. Compositionality and Grammar-based Motivation

Dynamic routing directly enforces a compositional (parse-tree) structure over capsule networks. Capsules at lower layers select upper-layer parents via learned, peaked distributions $c_{ij}$, corresponding to OR-nodes in a grammar. Upper capsules aggregate as AND-nodes, conjunctively combining lower-capsule contributions. When routing coefficients become one-hot, the resulting capsule activations instantiate a parse tree, mirroring the decomposition of objects into parts and subparts. In contrast, CNNs, lacking routing, either blend all inputs or discard them via pooling, and thus fail to recover grammar-like compositionality (Venkatraman et al., 2020).

To encourage sharp, discrete assignments, an entropy regularization on the routing coefficients is introduced:

$$H = -\sum_{i} \sum_{j} c_{ij} \log c_{ij},$$

leading to the total loss

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{margin}} + \lambda H.$$

Low-entropy routing enforces near one-hot assignments, which empirically enables the detection of compositional structure violations (e.g., permuted parts) and does not degrade standard classification accuracy (Venkatraman et al., 2020).
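
As a concrete illustration, here is a minimal sketch of the regularizer $H$, assuming the couplings are available as a dense array; the weight `lam` and the example arrays are hypothetical.

```python
import numpy as np

def routing_entropy(c, eps=1e-12):
    """c: coupling coefficients of shape (N_l, N_{l+1}); each row sums to 1."""
    return -np.sum(c * np.log(c + eps))

c_diffuse = np.full((8, 3), 1.0 / 3.0)                 # maximally spread routing
c_sharp = np.eye(3)[np.random.randint(0, 3, size=8)]   # near one-hot routing
print(routing_entropy(c_diffuse), routing_entropy(c_sharp))  # high vs. ~0

lam = 0.1  # hypothetical regularization weight
# total_loss = margin_loss + lam * routing_entropy(c)   # L_total = L_margin + lambda * H
```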

3. Optimization, Duality, and Convergence Properties

Dynamic routing can be formulated as minimizing a concave energy function $E(C)$ over the coupling coefficients $C = (c_{ij})$ under simplex constraints:

$$E(C) = -\sum_j \psi\bigl(\|\hat{U}_j C(:,j)\|\bigr), \qquad \psi(z) = z - \arctan(z),$$

subject to $c_{ij} \geq 0$ and $\sum_j c_{ij} = 1$ for each $i$, with $\hat{U}_j$ the matrix of votes for parent $j$. Updates follow a two-step nonlinear (mirror descent) gradient method:

  1. A gradient step in $C$-space that decreases the energy (equivalently, ascends the agreement objective $-E$).
  2. A softmax projection that enforces the simplex constraints.

Strict monotonic convergence to a local optimum is proven: the energy decreases monotonically along the sequence of iterates $C$ until convergence, with final solutions displaying "polarization" (peaked routing assignments) (Ye et al., 8 Jan 2025).

The full procedure, including its dual (Lagrange) interpretation and KKT conditions, establishes correspondence between agreement maximization and optimal probabilistic assignment under structural constraints (Ye et al., 8 Jan 2025).
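
A toy sketch of this view follows, under the assumption that the votes are a dense array and with an illustrative step size; it is not the exact algorithm of Ye et al. (2025). The energy $E(C)$ is evaluated directly, and each update adds the agreement gradient to the logits before re-normalizing with a softmax, which is the exponentiated-gradient (mirror descent) form of the two-step procedure above.

```python
import numpy as np

def energy(C, u_hat):
    """E(C) = -sum_j psi(||U_hat_j C(:, j)||) with psi(z) = z - arctan(z)."""
    s = np.einsum('ij,ijd->jd', C, u_hat)      # aggregated votes per parent j
    z = np.linalg.norm(s, axis=-1)
    return -np.sum(z - np.arctan(z))

def routing_step(B, u_hat, lr=1.0):
    """One update of the logits B; softmax(B) gives the couplings C on the simplex."""
    C = np.exp(B) / np.exp(B).sum(axis=1, keepdims=True)
    s = np.einsum('ij,ijd->jd', C, u_hat)
    z = np.linalg.norm(s, axis=-1, keepdims=True) + 1e-8
    # d psi(||s_j||)/d c_ij = psi'(||s_j||) * (s_j / ||s_j||) . u_hat_{j|i}, with psi'(z) = 1 - 1/(1+z^2)
    grad = np.einsum('ijd,jd->ij', u_hat, (1.0 - 1.0 / (1.0 + z ** 2)) * s / z)
    return B + lr * grad                       # ascend agreement, i.e. descend the energy E

u_hat = np.random.randn(8, 3, 16)
B = np.zeros((8, 3))
for _ in range(3):
    C = np.exp(B) / np.exp(B).sum(axis=1, keepdims=True)
    print(energy(C, u_hat))                    # energy should shrink across iterations
    B = routing_step(B, u_hat)
```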

4. Algorithmic Variants and Efficient Implementations

Numerous modifications to dynamic routing address computational and representational limitations:

  • Intermediate Feature Routing: Extracting feature capsules not tied to classes improves computational efficiency (reducing $O(N_{l+1})$ to $O(1)$ scaling in many-class regimes) and generalization; a final FC layer maps features to classes (Mandal et al., 2019).
  • Parallel/Batched Routing: Tensorizing all computations facilitates hardware efficiency (e.g., PDR-CapsNet achieves 87% parameter reduction, 3x faster inference, and 93% lower energy) (Javadinia et al., 2023).
  • Inverted Attention and Concurrent Routing: Swapping the direction of agreement computation (parents compete for children using dot-product attention) plus LayerNorm and concurrent routing across layers yields improved gradient flow, parameter efficiency, and stable deep capsule stacks (Tsai et al., 2020).
  • Efficient Solvers: Fast EM- or mean-shift–style algorithms recast routing as weighted kernel density estimation, accelerating computation by ≥37% without performance loss (Zhang et al., 2018).
  • Adaptive Routing Without Explicit Coupling: Eliminating coupling coefficients in favor of adaptive updates with a tunable gradient amplification parameter $\lambda$ enables convergence in deep nets, reduces vanishing gradients, and further lowers compute cost (Ren et al., 2019).
  • Stabilization for Text: Incorporating "orphan capsules," leaky-softmax, and activation-gated couplings improves robustness in text settings subject to routing noise (see the leaky-softmax sketch after this list) (Zhao et al., 2018).
  • Quadratic Programming-based Routing: Posing routing as a constrained QP enables direct margin maximization, inhibits only positive routing, and achieves more discriminative capsule activations (Yang et al., 2021).
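
Several of these variants amount to small changes in how the couplings are computed. As one example, here is a minimal sketch of the leaky-softmax idea from the text-routing bullet above: an extra "orphan" column of zero logits absorbs coupling mass from noisy child capsules so that no real parent is forced to claim them. Names and shapes are illustrative assumptions, not the exact formulation of Zhao et al. (2018).

```python
import numpy as np

def leaky_softmax(b):
    """b: routing logits of shape (N_l, N_{l+1}).
    Returns couplings over the real parents only; each row sums to less than 1."""
    leak = np.zeros((b.shape[0], 1))              # orphan (leak) logit per child capsule
    b_ext = np.concatenate([leak, b], axis=1)     # prepend the orphan column
    c_ext = np.exp(b_ext) / np.exp(b_ext).sum(axis=1, keepdims=True)
    return c_ext[:, 1:]                           # drop the orphan column

b = np.random.randn(8, 3)
c = leaky_softmax(b)
print(c.sum(axis=1))  # each row < 1; the remainder leaked to the orphan slot
```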

5. Applications and Empirical Findings

Capsule networks with dynamic routing demonstrate:

  • Significantly improved resilience to affine, rotational, and permutation-based structure violations (e.g., scrambled faces, affNIST), with low-entropy routing critical for compositionality detection (Venkatraman et al., 2020).
  • Parameter- and energy-efficient learning on high-class-count problems, e.g., up to $32\times$ faster epoch times and negligible accuracy loss on 199-class handwritten Bangla compound characters (Mandal et al., 2019).
  • Accurate segmentation and instance recognition in overlapped-object and multi-label text classification, outperforming similarly sized CNNs and LSTMs in both domains (Sabour et al., 2017, Zhao et al., 2018).
  • Theoretical and empirical support for unsupervised generative modeling: dynamic routing can serve as both an inference and MCMC-mixing engine in product-of-experts capsule Boltzmann machines, with contrastive divergence yielding sample generation and feature disentanglement (Hauser, 2019, Hauser, 2019).

Empirical results consistently show that low-entropy, agreement-driven routing is necessary for learning and exploiting compositional structures, and that variants leveraging feature-based, parallel, or efficient solvers enable higher scalability and robustness.

6. Limitations, Open Problems, and Future Directions

Key challenges and research questions include:

  • Stacking Depth and Gradient Propagation: Classic dynamic routing induces sparse, vanishing gradients beyond two layers due to highly peaked $c_{ij}$ coefficients; adaptive and attention-based methods address this, but fundamental theoretical limits remain to be fully characterized (Ren et al., 2019).
  • Computation vs. Sharpness Tradeoff: Over-polarization (overly sharp $c_{ij}$) and excess entropy regularization both degrade structure detection and may harm gradient flow (Venkatraman et al., 2020, Ye et al., 8 Jan 2025).
  • Negative Routing/Disinhibition: Standard dynamic routing only allows positive couplings; QP-based formulations accommodate inhibitory or negative routing, with implications for class discrimination and adversarial robustness (Yang et al., 2021).
  • Scalability on Large-Scale Vision and Text: Hardware-aware parallelism, attention mechanisms, and group-type routing address barriers, but optimal scaling with deep architectures and modern backbone integration (e.g., ResNets, transformers) remains under active investigation (Tsai et al., 2020, Javadinia et al., 2023).
  • Unsupervised and Generative Learning: Energy-based capsule networks with dynamic routing-bolstered products of experts offer a principled, generative alternative to prevalent variational and contrastive autoencoder techniques, with promising results on image generation and structure discovery (Hauser, 2019, Hauser, 2019).
  • Formal Grammar Induction: There is ongoing work on formalizing the relationship between dynamic routing and learnable parsing of structured objects as grammars, with entropy and routing mechanisms acting as differentiable relaxations of combinatorial parsing (Venkatraman et al., 2020).

Dynamic routing between capsules remains a critical mechanism for compositional deep learning, with a spectrum of variants addressing truncated gradients, computational efficiency, and representational sharpness. Recent advances provide both rigorous mathematical theory and scalable implementations, consolidating its central role in capsule-based architectures (Sabour et al., 2017, Venkatraman et al., 2020, Javadinia et al., 2023).
