- The paper proposes a novel Terminator architecture that replaces residual learning with a dual-path slow-fast network generating hyper-kernels for full-context interaction.
- It introduces the HyperZ·Z·W operator that fuses global and local information through efficient elementwise multiplications, achieving linear time and space complexity.
- Extensive experiments on image classification benchmarks demonstrate superior accuracy, faster convergence, and reduced parameters compared to state-of-the-art models.
The paper proposes a novel neural network architecture—Terminator—that eschews traditional residual learning in favor of an architecture built on large implicit convolution kernels. This design relies on a dual-network paradigm in which a slow network generates hyper-kernels via coordinate-based implicit MLP (Multi-Layer Perceptron) modules, and a fast network employs these hyper-kernels to perform context-dependent convolutions. The two key components are summarized as follows:
Hyper-Kernel Generation and HyperZ⋅Z⋅W Operator
- Slow Network and Hyper-Kernels:
A coordinate-based implicit MLP is used to generate two types of hyper-kernels. The global hyper-kernel $K_g \in \mathbb{R}^{1 \times C \times H \times W}$ is produced over normalized image coordinates, while a local hyper-kernel $K_l \in \mathbb{R}^{C \times 1 \times k \times k}$ is generated for depth-wise convolution. Although the slow network accounts for only 8% of the total model parameters, these hyper-kernels enable full contextual interaction.
- HyperZ⋅Z⋅W Operator:
Instead of dot product-based attention, which incurs quadratic cost, the paper introduces the HyperZ⋅Z⋅W operator. This operator bridges the hyper-kernels and the hidden activations $Z$ through simple elementwise multiplication. The resulting context-dependent weights (denoted $\hat{\mathcal{W}}$) are then used either for dot-product operations (in the global branch) or for sliding-window convolution (in the local branch). This mechanism achieves pixel-level scoring with optimal linear time and space complexity.
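A minimal PyTorch sketch of both ideas is given below. The module and function names (`CoordMLP`, `hyperzzw_global`, `hyperzzw_local`) are illustrative, and the final scoring step is a simplified stand-in for the paper's global dot-product/FFT formulation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordMLP(nn.Module):
    """Toy slow network: maps normalized (y, x) coordinates to C kernel values
    (an illustrative stand-in for the paper's coordinate-based implicit MLP)."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, height, width):
        ys = torch.linspace(-1.0, 1.0, height)
        xs = torch.linspace(-1.0, 1.0, width)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
        k = self.net(grid)                            # (H*W, C)
        return k.t().reshape(1, -1, height, width)    # hyper-kernel of shape (1, C, H, W)

def hyperzzw_global(z, k_global):
    """Global HyperZ.Z.W: elementwise product of activations Z with the global
    hyper-kernel gives context-dependent weights W_hat, which then score Z
    pixel by pixel -- linear in the number of pixels."""
    w_hat = z * k_global        # (B, C, H, W) context-dependent fast weights
    return w_hat * z            # simplified pixel-level scoring

def hyperzzw_local(z, k_local):
    """Local HyperZ.Z.W: sliding-window (depthwise) convolution with the
    generated local hyper-kernel of shape (C, 1, k, k)."""
    return F.conv2d(z, k_local, padding=k_local.shape[-1] // 2, groups=z.shape[1])

slow = CoordMLP(channels=32)
z = torch.randn(4, 32, 28, 28)                        # fast-network activations
scores = hyperzzw_global(z, slow(28, 28))             # (4, 32, 28, 28)
```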
Architecture Design: Slow-Fast Neural Encoding (SFNE) Block
The SFNE block is organized into multiple branches (nine in the primary design), each responsible for extracting features at a different scale. The branches are built from the global and local HyperZ⋅Z⋅W operators and supplemented by auxiliary modules, such as the parameter-free Si-GLU, the Recursive Gated Unit (RGU), and novel hyper-channel and hyper-interaction modules, to perform channel and spatial mixing as well as multi-scale feature fusion.
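Continuing the toy helpers above (`CoordMLP`, `hyperzzw_global`, `hyperzzw_local`), the schematic below shows only the multi-branch pattern the block is built around: one global HyperZ⋅Z⋅W branch plus local branches at several kernel sizes, concatenated for later fusion. The actual branch count, gating units, and mixing modules are omitted; this is a structural sketch, not the paper's design.

```python
class SFNEBlockSketch(nn.Module):
    """Schematic multi-branch block: one global HyperZ.Z.W branch plus depthwise
    local branches at different scales, fused by concatenation.
    Reuses CoordMLP / hyperzzw_global / hyperzzw_local from the sketch above."""
    def __init__(self, channels, height, width, local_sizes=(3, 5, 7)):
        super().__init__()
        self.h, self.w = height, width
        self.local_sizes = local_sizes
        self.global_slow = CoordMLP(channels)
        self.local_slow = nn.ModuleList(CoordMLP(channels) for _ in local_sizes)

    def forward(self, z):
        c = z.shape[1]
        outs = [hyperzzw_global(z, self.global_slow(self.h, self.w))]
        for mlp, k in zip(self.local_slow, self.local_sizes):
            k_local = mlp(k, k).reshape(c, 1, k, k)   # (C, 1, k, k) depthwise kernel
            outs.append(hyperzzw_local(z, k_local))
        return torch.cat(outs, dim=1)                 # multi-scale features for fusion

block = SFNEBlockSketch(channels=32, height=28, width=28)
x = torch.randn(4, 32, 28, 28)
print(block(x).shape)                                 # torch.Size([4, 128, 28, 28])
```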
- Elimination of Residual and Pooling Layers:
By leveraging large implicit convolution kernels, the architecture achieves full context interaction at every layer. This design obviates the need for residual (additive) connections, which are traditionally employed to compensate for limited layer-wise representation capacity. In parallel, no intermediate pooling layers are used, preserving spatial structure and mitigating distortion in feature maps.
- Normalization via Standardization:
Rather than using conventional normalization layers (e.g., Batch Normalization that incorporates affine parameters and momentum), the paper adopts a pure z-score standardization process. This modified procedure—termed Batch Standardization (BS) and Instance Standardization (IS)—ensures stable zero-mean, unit-variance features, which contributes to faster training convergence and improved in-context learning.
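A minimal sketch of what such standardization layers could look like, assuming 4D feature maps of shape (N, C, H, W); unlike `nn.BatchNorm2d`, there are no learnable affine parameters and no running statistics:

```python
import torch

def batch_standardize(x, eps=1e-5):
    """Pure z-score over the batch: per-channel mean/variance across (N, H, W),
    a sketch of Batch Standardization (no affine parameters, no momentum)."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def instance_standardize(x, eps=1e-5):
    """Per-sample, per-channel z-score across (H, W) only (Instance Standardization)."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

feats = torch.randn(8, 32, 32, 32)
print(batch_standardize(feats).mean().abs())   # ~0: zero-mean, unit-variance features
print(instance_standardize(feats).std())       # ~1
```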
Additional Innovations and Training Strategy
- Local Feedback Loss (Slow Neural Loss):
A novel loss term is introduced to update the slow network via local feedback. The slow neural loss penalizes discrepancies between hyper-kernels across SFNE blocks by computing the mean squared error between a block's hyper-kernel and the average of the hyper-kernels from preceding blocks. This promotes consistency in the generated context-dependent weights and refines the pixel-level scoring of the HyperZ⋅Z⋅W operator (see the sketch after this list).
- Channel Bottleneck and Group-Based Standardization:
Concatenated multi-scale features are compressed through a bottleneck layer modulated by a coefficient λ, ensuring that only the most informative features propagate. Additionally, group-based instance-batch standardization is applied to enhance the diversity and stability of channel variances.
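The sketch below illustrates both ideas under stated assumptions: a slow neural loss that, for each block, takes the MSE against the (detached) average of the preceding blocks' hyper-kernels, and a 1×1-convolution bottleneck whose output width is scaled by the coefficient λ. The weighting, detachment, and example λ value are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def slow_neural_loss(hyper_kernels):
    """Local feedback loss sketch: penalize each block's hyper-kernel against the
    averaged hyper-kernels of the preceding SFNE blocks (treated as a fixed target)."""
    loss = torch.zeros(())
    for i in range(1, len(hyper_kernels)):
        target = torch.stack(hyper_kernels[:i]).mean(dim=0).detach()
        loss = loss + F.mse_loss(hyper_kernels[i], target)
    return loss / max(len(hyper_kernels) - 1, 1)

def make_bottleneck(concat_channels, lam=0.25):
    """Channel bottleneck: a 1x1 conv compressing concatenated multi-scale
    features down to roughly lam * concat_channels channels."""
    return nn.Conv2d(concat_channels, max(int(concat_channels * lam), 1), kernel_size=1)

kernels = [torch.randn(1, 32, 28, 28, requires_grad=True) for _ in range(4)]
print(slow_neural_loss(kernels))                      # scalar local-feedback term
fuse = make_bottleneck(concat_channels=128, lam=0.25) # 128 -> 32 channels
```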
Empirical Performance and Analysis
- Superior Accuracy with Fewer Parameters:
- On 1D image classification benchmarks (sMNIST, pMNIST, sCIFAR10), Terminator achieves an accuracy of 99.72% on sMNIST and 93.21% on sCIFAR10 with only 1.3M parameters.
- On 2D datasets such as CIFAR10 and CIFAR100, Terminator significantly surpasses architectures like Inception, ResNet, and ViT, while also relying on dramatically fewer parameters.
- On the STL10 dataset, Terminator attains a test accuracy of 86.32%, outperforming a ResNet-152 variant that uses residual and normalization operations—and doing so with roughly one-seventh of the model parameters.
- Enhanced Training Dynamics:
Visualizations show that Terminator reaches the accuracy of a deep residual model (e.g., ResNet-152) in approximately one-sixth of the training epochs, a benefit attributed to the stable zero-mean features induced by the standardization layers. The absence of pooling and the direct full-context interaction at every layer further minimize information loss and reduce learning complexity.
- Context-Dependent Weighting and Global-Local Feature Fusion:
The dual-path strategy—combining global and local hyper-kernel branches—allows the model to simultaneously capture coarse outlines and fine details. The global branch performs pixel-level scoring through elementwise operations followed by dot products or FFTs (for 1D sequences), while the local branch uses sliding window-based convolutions. This fusion obviates the need for conventional residual shortcuts to mitigate feature loss.
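For the 1D-sequence case, the global branch's FFT path can be sketched as a sequence-length convolution applied in the frequency domain; the padding and causality details here are assumptions, not the paper's exact formulation:

```python
import torch

def fft_long_conv1d(z, k_global):
    """Apply a sequence-length hyper-kernel to 1D activations via FFT.

    z:        (B, C, L) hidden activations
    k_global: (C, L)    global hyper-kernel, one filter per channel
    """
    L = z.shape[-1]
    z_f = torch.fft.rfft(z, n=2 * L)            # zero-pad to avoid circular wrap-around
    k_f = torch.fft.rfft(k_global, n=2 * L)
    return torch.fft.irfft(z_f * k_f, n=2 * L)[..., :L]

z = torch.randn(4, 32, 1024)
k = torch.randn(32, 1024)
print(fft_long_conv1d(z, k).shape)              # torch.Size([4, 32, 1024])
```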
In summary, the paper introduces a comprehensive architectural redesign that replaces residual learning with a combination of slow-fast network dynamics and context-dependent fast weight generation. The design innovations successfully lead to faster convergence, enhanced feature representation, and state-of-the-art performance on challenging classification tasks with significantly reduced model complexity.