Capsule Networks: Structure and Performance
- Capsule Networks (CapsNets) are neural architectures that use groups of neurons called capsules to output vectors representing both the probability and detailed instantiation parameters of entities.
- They employ a dynamic routing-by-agreement mechanism that iteratively adjusts coupling coefficients to preserve spatial hierarchies and enable part-whole reasoning.
- Empirical results, notably on the MNIST dataset, show that CapsNets achieve state-of-the-art error rates with enhanced robustness and interpretability, despite increased computational overhead.
A Capsule Network (CapsNet) is a neural network architecture in which groups of neurons, termed capsules, output vectors whose lengths represent the presence probability of an entity and whose orientations encode detailed instantiation parameters such as pose, deformation, and scale. Unlike traditional convolutional neural networks (CNNs) with scalar outputs and pooling operations that lose spatial hierarchies, CapsNets preserve part-whole relationships using dynamic routing mechanisms. These routing algorithms iteratively adjust the assignment of lower-level capsule outputs to higher-level capsules based on agreement, enabling the construction of interpretable parse trees and improved robustness to variations such as viewpoint changes or overlapping objects in the input data.
1. Fundamental Principles and Architectural Components
A capsule comprises a set of neurons outputting an activity vector $\mathbf{v}$, with the vector’s length $\|\mathbf{v}\|$ representing the probability of an entity (object or object part) and its orientation encoding its instantiation parameters. The basic operations include:
- Prediction vectors: Each lower-level capsule $i$ computes a prediction $\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i$ for each higher-layer capsule $j$ using a learned transformation matrix $\mathbf{W}_{ij}$.
- Squashing function: To ensure the output vector lengths reliably represent probabilities, a nonlinear squashing function is employed:
$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$
where $\mathbf{s}_j$ is the aggregated input to capsule $j$.
- Dynamic routing: Iteratively, the network adjusts the “coupling coefficients” $c_{ij}$ (determined by a routing softmax) to maximize agreement (measured by scalar product) between predicted and actual outputs, formalized as:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$
where the $b_{ij}$ are routing logits updated by $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$.
Through multiple routing iterations, the network converges to a configuration that amplifies mutually consistent part-whole relationships.
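As a concrete illustration, the following is a minimal NumPy sketch of the squashing nonlinearity above; the example inputs and shapes are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing nonlinearity: shrinks short vectors toward zero length
    and long vectors toward unit length, preserving orientation."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# Toy check: output lengths lie in (0, 1) and grow with input length.
s = np.array([[0.1, 0.0], [3.0, 4.0]])   # two 2-D capsule inputs (assumed shapes)
v = squash(s)
print(np.linalg.norm(v, axis=-1))        # approx. [0.0099, 0.9615]
```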
2. Routing-by-Agreement and Information Flow
The dynamic routing procedure is distinguished from standard max-pooling or static routing strategies. Each iteration comprises:
- Initial uniform (or learned) assignment of coupling coefficients $c_{ij}$.
- Calculation of higher-layer capsule inputs: $\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}$.
- Squashing for output $\mathbf{v}_j = \mathrm{squash}(\mathbf{s}_j)$.
- Update of logits $b_{ij}$ by measuring agreement $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$, reinforcing assignments for strong matches.
This allows capsules to selectively route information and build compositional parse trees. Importantly, the architecture's vector-based intermediate states preserve equivariance, ensuring that transforms in the input result in predictable transforms in the capsule activations rather than invariance to those transformations.
3. Comparison with Standard Convolutional Neural Networks
CapsNets introduce representational and functional distinctions:
- Vector representations: Whereas CNN layers output scalars responding merely to the presence of features, capsules encode rich instantiation parameters—enabling the network to disentangle pose, deformation, and other attributes of parts or objects.
- Spatial preservation: Max-pooling, typical in CNNs for translation invariance, discards fine spatial details. Routing-by-agreement in CapsNets, in contrast, selectively aggregates supporting evidence based on geometric compatibility, crucial for part-whole reasoning.
- Multi-object handling: CapsNets are notably more effective at resolving multiple, even overlapping, objects—such as highly occluded digits—by assigning pixel evidence to different “hypothetical” objects via competing capsule activations (Sabour et al., 2017).
4. Empirical Performance and Robustness
Empirical results underscore the capability gains:
- On MNIST, a shallow CapsNet architecture (two convolutional layers, one DigitCaps layer) achieves state-of-the-art error rates: 0.34% test error (single routing iteration, no reconstruction), further reduced to 0.29% (with reconstruction regularizer), and 0.25% with three routing iterations plus regularizer.
- Robustness to spatial overlap and transformation is substantiated by improved segmentation and classification in ambiguous or occluded scenarios compared to standard convolutional networks.
- The reconstruction loss, an auxiliary task requiring the DigitCaps outputs to reconstruct the input, reinforces the encoding of instantiation parameters and acts as a regularizer (Sabour et al., 2017).
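For concreteness, the following is a minimal NumPy sketch of the training objective behind these results: the margin loss on DigitCaps vector lengths plus the down-weighted reconstruction term. The margin constants (0.9, 0.1, 0.5) and the 0.0005 reconstruction weight follow Sabour et al. (2017); the function names and batch conventions are illustrative assumptions, and the decoder that would produce `reconstruction` (a small fully connected network over the masked DigitCaps output) is assumed given.

```python
import numpy as np

def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss on DigitCaps vector lengths (Sabour et al., 2017).
    v_lengths: (batch, num_classes) capsule lengths; targets: one-hot labels."""
    present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return np.sum(present + absent, axis=1).mean()

def total_loss(v_lengths, targets, reconstruction, images, recon_weight=0.0005):
    """Margin loss plus the reconstruction term, scaled down (by 0.0005)
    so the auxiliary reconstruction objective does not dominate training."""
    recon_sse = np.sum((reconstruction - images) ** 2, axis=1).mean()
    return margin_loss(v_lengths, targets) + recon_weight * recon_sse
```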
5. Mathematical Formulation of the Routing Algorithm
The routing-by-agreement operation can be formalized as:
- Prediction step: $\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i$
- Routing softmax (over output capsules $j$): $c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$
- Aggregation and squashing: $\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}$, $\quad \mathbf{v}_j = \dfrac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\dfrac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$
- Agreement update: $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$
The process is repeated for a fixed number of routing iterations (typically three).
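Putting the four steps together, a minimal NumPy sketch of the routing loop is given below. It assumes the prediction vectors $\hat{\mathbf{u}}_{j|i}$ have already been computed; the array shapes and the toy usage at the end are illustrative assumptions rather than a definitive implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Nonlinear squashing: output length in (0, 1), orientation preserved."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def routing_by_agreement(u_hat, num_iters=3):
    """Dynamic routing over prediction vectors.

    u_hat: array of shape (num_lower, num_upper, dim) holding the predictions
    u_hat_{j|i} = W_ij u_i from lower capsule i to upper capsule j.
    Returns the upper-capsule outputs v_j and the coupling coefficients c_ij.
    """
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                # routing logits b_ij
    for _ in range(num_iters):
        # Routing softmax over upper capsules j (stabilized by max-subtraction).
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)           # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                                   # v_j = squash(s_j)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)       # b_ij += u_hat_{j|i} . v_j
    return v, c

# Toy usage (shapes assumed): 1152 lower-level capsules routed to 10 digit capsules of 16 dims.
u_hat = np.random.randn(1152, 10, 16) * 0.01
v, c = routing_by_agreement(u_hat)
print(v.shape, c.shape)                                 # (10, 16) (1152, 10)
```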
6. Architectural and Practical Implications
The key practical consequences of CapsNet architecture include:
- Parameter efficiency: CapsNet achieves comparable or better accuracy with fewer parameters than conventional CNNs for tasks requiring rich geometric modeling.
- Enhanced interpretability: The parse tree construction, with explicit instantiation parameter vectors, supports more interpretable inference—allowing, for example, the extraction of pose or deformation directly from capsule outputs.
- Robustness in challenging scenarios: The iterative, agreement-based routing architecture demonstrates resilience to occlusion and varying spatial arrangements, facilitating generalization in domains with structured compositionality or overlapping semantic entities.
However, CapsNets introduce added computational overhead due to matrix multiplications and iterative routing steps, and practical deployment for large-scale tasks requires further optimizations and scalable variants.
7. Limitations and Future Directions
While dynamic routing between capsules provides strong representational and generalization improvements, several challenges remain:
- Computational cost: The iterative routing algorithm adds significant computation per forward pass relative to CNNs.
- Scaling: For large and high-resolution inputs, managing the routing process and capsule parameterization becomes resource intensive.
- Extension to complex domains: While empirical performance on MNIST and MultiMNIST is robust, translation to large-scale, real-world datasets with diverse intra-class variation may require new design enhancements (multi-prototype routing, efficient routing algorithms, or hierarchical capsule arrangements, as explored in subsequent research).
Nevertheless, dynamic routing between capsules forms the foundation of a family of models capable of hierarchical, part-aware, and pose-aware reasoning in vision systems—demonstrating essential advances over previous approaches (Sabour et al., 2017).