
Capsule Routing in Neural Networks

Updated 25 December 2025
  • Capsule Routing is a dynamic algorithm that assigns votes from lower-level neurons to higher-level capsules based on agreement, enforcing part–whole relationships.
  • It leverages iterative and non-iterative methods, including EM, self-attention, and prototype-based schemes, to refine coupling coefficients effectively.
  • Optimizations such as geometric extensions and computational enhancements improve scalability and robustness in applications like sound detection and graph learning.

Capsule routing refers to a family of algorithms for dynamically assigning outputs (“votes”) from lower-layer groups of neurons—capsules—to upper-layer capsules in neural networks. Unlike static connections in conventional deep networks, capsule routing explicitly models part–whole relationships between entities (“capsules”) and selects how much each lower-level entity should contribute to candidate higher-level entities. Routing is central to the function and distinct representational properties of Capsule Networks (CapsNets).

1. Mathematical Formulation and Dynamic Routing by Agreement

The canonical capsule routing mechanism is the dynamic routing-by-agreement algorithm introduced by Sabour et al., which iteratively refines couplings between lower-layer and upper-layer capsules based on the degree of agreement among their predictions. The key steps can be summarized as follows:

Let layer $l$ contain $M$ input capsules $u_i \in \mathbb{R}^{D_1}$ and the next layer $l+1$ contain $N$ output capsules $v_j \in \mathbb{R}^{D_2}$. For each pair $(i, j)$, a trainable transformation $W_{ji} \in \mathbb{R}^{D_2 \times D_1}$ maps the input capsule to a “vote”: $\hat{u}_{j|i} = W_{ji} u_i$. The routing iteratively updates coupling coefficients $c_{ij}$:

  1. Initialize $b_{ij} = 0$ and set $c_{ij} = \operatorname{softmax}_j(b_{ij})$.
  2. Compute each output capsule’s input: $s_j = \sum_{i=1}^{M} c_{ij} \hat{u}_{j|i}$.
  3. Apply a squashing nonlinearity to ensure output vectors have length in $(0,1)$: $v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$.
  4. Update $b_{ij} \leftarrow b_{ij} + v_j \cdot \hat{u}_{j|i}$.

This sequence is repeated for a fixed number of iterations (commonly $r=3$) and constitutes one routing “pass” (Iqbal et al., 2018).
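The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the vote tensor layout `(M, N, D2)` and the numerical-stability epsilon are choices made here for clarity.

```python
import numpy as np

def squash(s, eps=1e-8):
    # v = (|s|^2 / (1 + |s|^2)) * s / |s|, so that 0 < |v| < 1
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, r=3):
    """u_hat: votes of shape (M, N, D2), i.e. M lower capsules each
    predicting all N upper capsules. Returns outputs v and couplings c."""
    M, N, _ = u_hat.shape
    b = np.zeros((M, N))                           # step 1: logits b_ij = 0
    for _ in range(r):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)       # softmax over outputs j
        s = np.einsum('ij,ijd->jd', c, u_hat)      # step 2: weighted sum
        v = squash(s)                              # step 3: squash
        b = b + np.einsum('jd,ijd->ij', v, u_hat)  # step 4: agreement update
    return v, c
```

Each iteration raises the logits of votes that agree with the current outputs, so couplings concentrate on consistent part–whole assignments.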

In capsule networks for sound event detection, this procedure has been implemented in a time-slice–wise fashion, inserted into a temporal architecture that models both part–whole coherence and attention to salient events (Iqbal et al., 2018). By running routing independently at each time slice and combining outputs with an auxiliary temporal attention mechanism, the architecture achieved state-of-the-art performance on polyphonic sound event detection.

2. Non-Iterative and Fast Routing Variants

The computational overhead of iterative routing has led to significant interest in non-iterative or single-pass routing schemes. Notable approaches include:

  • Cluster Routing: Each lower-level capsule produces a cluster of votes via multiple learned matrices acting on a spatial neighborhood. Variance within a cluster quantifies self-consistency, and centroids of low-variance clusters obtain higher routing weights. The next-layer capsule input is a weighted sum of centroids, with weights given by an elementwise softmax of $-\log \sigma_i$, where $\sigma_i$ is the cluster’s variance. The entire routing is performed in a single pass, avoiding any recurrent coefficient refinement (Zhao et al., 2021).
  • Self-Attention Routing: The routing-by-agreement step is replaced by an attention mechanism. Each lower-level capsule predicts every higher-level capsule’s pose, and similarity (e.g., scaled dot-product) between votes defines the “agreement.” Coupling coefficients are computed by a single softmax, and capsule outputs are obtained via attention-weighted aggregation, with a final nonlinearity (Mazzia et al., 2021, Duarte et al., 2021).
  • Prototype-Based (ProtoCaps) Routing: Each lower-layer capsule is projected into a shared subspace; similarity to a set of learned prototypes determines soft clustering coefficients (via softmax), which in turn weight contributions to the next layer. No per-iteration voting or agreement update is performed (Everett et al., 2023).
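To make the single-pass idea concrete, the sketch below computes couplings from scaled dot-product agreement among votes in one softmax. The exact formulations differ across the cited papers, so treat this as a generic illustration of attention-style routing rather than any specific method’s implementation; the agreement score (mean similarity of a vote with the other votes for the same upper capsule) is a choice made here.

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def attention_routing(u_hat):
    """Single-pass routing: u_hat has shape (M, N, D). The logit of vote
    (i, j) is its mean scaled dot-product with all votes for capsule j."""
    M, N, D = u_hat.shape
    # (N, M, M): pairwise vote similarities per upper capsule
    sim = np.einsum('ind,knd->nik', u_hat, u_hat) / np.sqrt(D)
    logits = sim.mean(axis=2).T                    # (M, N) agreement scores
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    c = e / e.sum(axis=1, keepdims=True)           # one softmax, no loop
    v = squash(np.einsum('ij,ijd->jd', c, u_hat))  # attention-weighted sum
    return v, c
```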

These non-iterative algorithms provide significant reductions in memory, FLOPs, and wall-clock runtime, while achieving competitive or superior accuracy to traditional dynamic/EM routing on standard vision and sequence datasets (Zhao et al., 2021, Mazzia et al., 2021, Everett et al., 2023).

3. Extensions: Probabilistic and Geometric Routing

Probabilistic formulations of capsule routing generalize the dynamic routing approach:

  • EM Routing: Treats each higher-level capsule as a Gaussian mixture component, with votes from lower-level capsules as samples. Routing alternates E-steps (computing responsibilities based on Gaussian likelihood) and M-steps (fitting the mean, variance, and activation of each capsule), with activation computed as a logistic function of the sum of log-variances and mixture assignment weights (Peer et al., 2019).
  • Variational Bayes Routing: Places explicit priors over mixture weights, means, and precisions of the Gaussian clusters, and infers posteriors via coordinate ascent. This regularization prunes degenerate solutions (e.g., variance collapse), helps prevent overfitting, and enables interpretation of capsules as VAE components. Bayesian routing maintains or improves accuracy while increasing stability and generalization to novel transformations (Ribeiro et al., 2019).
  • Pseudo-Riemannian Capsule Routing: Classical routing operates in Euclidean or constant-curvature spaces. PR-CapsNet extends capsule routing to pseudo-Riemannian manifolds with adaptive curvature, decomposing capsule representation spaces into spherical-temporal and Euclidean-spatial components. Adaptive curvature routing learns per-capsule mixtures over curvature perspectives and aggregates them via geometry-aware attention, achieving robust hierarchical and cluster structure modeling in graph domains (Qin et al., 9 Dec 2025).
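The E-step/M-step alternation of EM routing can be sketched as follows. This is a simplified illustration: the constants `beta_a`, `beta_u`, and `lam` stand in for the learned and annealed parameters of the original formulation, and per-capsule details (e.g. coordinate addition) are omitted.

```python
import numpy as np

def em_routing(votes, a_in, r=3, beta_a=1.0, beta_u=1.0, lam=1.0, eps=1e-8):
    """votes: (M, N, D); a_in: (M,) lower-capsule activations.
    Returns upper-capsule means, activations, and responsibilities."""
    M, N, D = votes.shape
    R = np.full((M, N), 1.0 / N)                 # uniform responsibilities
    for _ in range(r):
        # M-step: fit a diagonal Gaussian per upper capsule
        Ra = R * a_in[:, None]                   # activation-weighted resp.
        denom = Ra.sum(axis=0) + eps             # (N,)
        mu = np.einsum('ij,ijd->jd', Ra, votes) / denom[:, None]
        var = np.einsum('ij,ijd->jd', Ra, (votes - mu) ** 2) / denom[:, None] + eps
        # activation: logistic in (beta_a - summed per-dimension cost),
        # with cost growing in the log-variances and assignment mass
        cost = (beta_u + 0.5 * np.log(var)) * denom[:, None]
        a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(axis=1))))
        # E-step: responsibilities from Gaussian log-likelihoods
        log_p = -0.5 * (((votes - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)   # (M, N)
        logits = np.log(a_out + eps)[None, :] + log_p
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        R = e / e.sum(axis=1, keepdims=True)
    return mu, a_out, R
```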

4. Complexity, Scaling, and Speed

A recurrent bottleneck for capsule routing has been computational and memory cost:

| Method | Time per layer (big-O) | Memory per layer |
|---|---|---|
| Dynamic routing ($r$ iters) | $O(r\,m\,n\,d)$ | $O(m\,n\,d)$ |
| EM routing | $O(r\,m\,n\,d^2)$ | $O(m\,n\,d + m\,n)$ |
| Cluster / self-attn / ProtoCaps | $O(m\,n\,d)$ (1 pass) | $O(m\,n + (m+n)\,d)$ |
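Plugging illustrative layer sizes into these asymptotics (the particular values below are hypothetical, not from the cited papers) makes the gap explicit:

```python
# Rough operation counts for the asymptotics above
# (m lower capsules, n upper capsules, pose dimension d, r iterations).
m, n, d, r = 1152, 10, 16, 3

dynamic_ops = r * m * n * d        # O(r m n d)
em_ops = r * m * n * d ** 2        # O(r m n d^2)
single_pass_ops = m * n * d        # O(m n d)

print(dynamic_ops / single_pass_ops)   # 3.0 -- factor of r
print(em_ops / single_pass_ops)        # 48.0 -- factor of r * d
```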

Empirically, iterative approaches can require up to an order of magnitude more operations than non-iterative schemes (e.g., 0.40 G vs. 0.06 G ops per batch for standard vs. Efficient-CapsNet on MNIST- or smallNORB-sized inputs (Mazzia et al., 2021)). Memory requirements for iterative routing preclude training on ImageNet-scale tasks unless reduced-memory approximations—such as ProtoCaps or accumulated (master) routing coefficients—are employed (Everett et al., 2023, Zhao et al., 2019).

Optimizations for sequential or temporal tasks include sequential dynamic routing (accumulating coupling updates across time steps) to maintain linear decoding time in sequence length (Lee et al., 2020), and blockwise or shortcut routing to enable deep capsule stacks without exacerbating routing cost (Vu et al., 2023, Zhang et al., 2024).

5. Empirical Performance and Representational Properties

Routing enables capsule networks to exhibit viewpoint equivariance, part–whole compositionality, and disentangled generative factors even with shallow architectures and relatively few parameters. Notable experimental outcomes include:

  • On Fashion-MNIST, non-iterative cluster routing achieves 5.17% error with 146K parameters, outperforming earlier CapsNets (Zhao et al., 2021).
  • On smallNORB and MNIST-to-affNIST transfer, non-iterative and Bayesian methods exhibit better generalization to novel transforms than standard CNNs or dynamic routing (Ribeiro et al., 2019, Zhao et al., 2021, Iqbal et al., 2018).
  • In sequence and multimodal tasks, capsule routing enables parameter sharing over variable-length sequences or parallel streams, avoids parameter blow-up, and improves sample efficiency (Lee et al., 2020, Duarte et al., 2021).
  • Object-centric and hierarchical representations learned via routing are more robust to adversarial and out-of-distribution scenarios than their statically-aggregated equivalents (Hu et al., 23 Aug 2025).

However, practical benefits in classification accuracy have sometimes been marginal relative to parameter- and computation-matched CNNs, and some studies report that, in standard settings, non-learned or even random routing may achieve similar results, prompting reconsideration of routing objectives and regularization (Paik et al., 2019).

6. Limitations, Pathologies, and Theoretical Insights

Despite its promise, capsule routing exhibits several critical limitations:

  • Expressivity limitations: Peer et al. proved that both dynamic and EM routing are restricted to symmetric functions satisfying $f(x) = f(-x)$, failing to be universal approximators unless an explicit bias is introduced per capsule. Without these modifications, the networks cannot learn anti-symmetric or sign-sensitive mappings—even in principle. Adding small learned biases corrects this limitation and restores universality (Peer et al., 2019).
  • Over-polarization: Empirical studies found that repeated routing iterations can drive coupling coefficients to degenerate one-hot assignments, losing the soft, distributed representations capsule networks are meant to provide. This polarization renders routing behavior similar to max-pooling and undermines uncertainty modeling (Paik et al., 2019).
  • Limited impact on decision boundary: On standard datasets, the vast majority of input–class assignments are unchanged by the routing step—most of the classification power arises from the initial (unrouted) assignments. Only a small fraction of examples experience a change in the final decision due to routing (Paik et al., 2019).
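The sign-symmetry restriction can be checked numerically. Using the bias-free routing equations from Section 1, negating every vote leaves the couplings unchanged and merely flips each output’s sign, so any decision read off the capsule lengths cannot distinguish an input from its negation. A self-contained check (the vote layout `(M, N, D)` is an illustrative choice):

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, r=3):
    M, N, _ = u_hat.shape
    b = np.zeros((M, N))
    for _ in range(r):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        v = squash(np.einsum('ij,ijd->jd', c, u_hat))
        b = b + np.einsum('jd,ijd->ij', v, u_hat)
    return v

# squash is odd (squash(-s) = -squash(s)) and the agreement update
# (-v) . (-u_hat) = v . u_hat is sign-invariant, so negating the votes
# flips the outputs while preserving their lengths.
u_hat = np.random.RandomState(1).randn(6, 3, 8)
v_pos = dynamic_routing(u_hat)
v_neg = dynamic_routing(-u_hat)
assert np.allclose(v_neg, -v_pos)
assert np.allclose(np.linalg.norm(v_neg, axis=-1),
                   np.linalg.norm(v_pos, axis=-1))
```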

Remedial techniques include soft/entropy regularization, adaptive temperature in softmax, curvature-adaptive geometric routing, and explicit Bayesian uncertainty modeling to prevent collapse and maintain distributed part–whole credit assignment (Qin et al., 9 Dec 2025, Ribeiro et al., 2019).
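Two of these remedies are simple to state in code: a temperature in the routing softmax, and an entropy term on the couplings. The sketch below is a generic illustration of both ideas, not the formulation of any particular cited paper.

```python
import numpy as np

def tempered_softmax(b, tau=1.0, axis=-1):
    # tau > 1 softens the coupling distribution, resisting the
    # one-hot polarization of repeated routing iterations
    z = b / tau
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupling_entropy(c, eps=1e-12):
    # mean entropy of the per-input coupling distributions; subtracting
    # it from the loss penalizes collapsed (near one-hot) routings
    return -np.mean(np.sum(c * np.log(c + eps), axis=-1))
```

For example, `coupling_entropy(tempered_softmax(logits, tau=5.0))` stays close to the maximum $\log N$ unless the logits strongly favor one parent, giving a tunable knob between soft and hard assignment.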

7. Application Domains and Advanced Routing Architectures

Capsule routing algorithms have been adapted across vision, audio, natural language, graph representation learning, and multimodal tasks:

  • Sound event detection architectures use time-slice–wise routing, combined with temporal attention, to achieve robust performance and reduce overfitting on noisy or small track datasets (Iqbal et al., 2018).
  • In visual recognition, multi-scale routing mechanisms such as cross-agreement routing (CAR) efficiently combine features from multiple spatial resolutions, focusing on their maximally-agreeing cross-scale capsules in one pass, boosting both parameter efficiency and adversarial robustness (Hu et al., 23 Aug 2025).
  • Mamba capsule routing compresses the set of pixel-wise capsules with a selective state-space model, performs EM routing at the “type-level,” and employs Capsule-to-Spatial Detail Retrieval (CSDR) for improved segmentation efficiency (Zhang et al., 2024).
  • Graph learning has benefited from adaptive curvature pseudo-Riemannian capsule routing, integrating complex geometric inductive biases for modeling both hierarchical and cyclic graph structures (Qin et al., 9 Dec 2025).

These architectural innovations underscore the flexibility of the capsule routing paradigm, especially when adapted to domain-specific data structure and inductive bias requirements.


In sum, capsule routing constitutes the distinguishing ingredient of Capsule Networks, enforcing dynamic, data-dependent part–whole relationships between higher- and lower-order entities. The field has progressed from iterative dynamic and EM routing to a diversity of non-iterative, probabilistic, geometric, and attention-inspired schemes. Ongoing challenges include improving expressivity, stability, computational efficiency, and reliably demonstrating routing’s value beyond what is obtained from static aggregation or trivial assignments. The next phase of research will likely focus on routings that integrate principled uncertainty, robustness to degeneracy, and geometric adaptation to data and task structure (Iqbal et al., 2018, Zhao et al., 2021, Peer et al., 2019, Hu et al., 23 Aug 2025, Qin et al., 9 Dec 2025, Everett et al., 2023, Zhang et al., 2024, Paik et al., 2019).
