HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction

Published 31 Jan 2024 in cs.CV (arXiv:2401.17948v1)

Abstract: The self-attention mechanism utilizes large implicit weight matrices, programmed through dot product-based activations with very few trainable parameters, to enable long sequence modeling. In this paper, we investigate the possibility of discarding residual learning by employing large implicit kernels to achieve full context interaction at each layer of the network. To accomplish it, we introduce coordinate-based implicit MLPs as a slow network to generate hyper-kernels for another fast convolutional network. To get context-varying weights for fast dynamic encoding, we propose a $\mathrm{Hyper}\mathcal{Z{\cdot}Z{\cdot}W}$ operator that connects hyper-kernels ($\mathcal{W}$) and hidden activations ($\mathcal{Z}$) through simple elementwise multiplication, followed by convolution of $\mathcal{Z}$ using the context-dependent $\mathcal{W}$. Based on this design, we present a novel Terminator architecture that integrates hyper-kernels of different sizes to produce multi-branch hidden representations for enhancing the feature extraction capability of each layer. Additionally, a bottleneck layer is employed to compress the concatenated channels, allowing only valuable information to propagate to the subsequent layers. Notably, our model incorporates several innovative components and exhibits excellent properties, such as introducing local feedback error for updating the slow network, stable zero-mean features, faster training convergence, and fewer model parameters. Extensive experimental results on pixel-level 1D and 2D image classification benchmarks demonstrate the superior performance of our architecture.

Summary

  • The paper proposes a novel Terminator architecture that replaces residual learning with a dual-path slow-fast network generating hyper-kernels for full-context interaction.
  • It introduces the HyperZ·Z·W operator that fuses global and local information through efficient elementwise multiplications, achieving linear time and space complexity.
  • Extensive experiments on image classification benchmarks demonstrate superior accuracy, faster convergence, and reduced parameters compared to state-of-the-art models.

The paper proposes a novel neural network architecture—Terminator—that eschews traditional residual learning in favor of an architecture built on large implicit convolution kernels. This design relies on a dual-network paradigm in which a slow network generates hyper-kernels via coordinate-based implicit MLP (Multi-Layer Perceptron) modules, and a fast network employs these hyper-kernels to perform context-dependent convolutions. The two key components are summarized as follows:

Hyper-Kernel Generation and the HyperZ·Z·W Operator

  • Slow Network and Hyper-Kernels:

A coordinate-based implicit MLP is used to generate two types of hyper-kernels. The global hyper-kernel $\mathbf{K}_g \in \mathbb{R}^{1 \times C \times H \times W}$ is produced over normalized image coordinates, while a local hyper-kernel $\mathbf{K}_l \in \mathbb{R}^{C \times 1 \times k \times k}$ is generated for depth-wise convolution. Despite the very low parameter count of the slow network (accounting for only 8% of total model parameters), these hyper-kernels facilitate full contextual interaction.
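As a rough illustration of the slow network's role, the sketch below (a minimal, assumption-laden example, not the authors' code) uses a small coordinate-based MLP to emit a global hyper-kernel of shape $1 \times C \times H \times W$ from a normalized coordinate grid. The module name CoordMLP, the hidden width, and the GELU activation are illustrative choices; a local $k \times k$ kernel could be produced analogously from local coordinates.

```python
# Minimal sketch of a coordinate-based implicit MLP that emits a global
# hyper-kernel of shape (1, C, H, W). Illustrative only.
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    def __init__(self, channels: int, hidden_dim: int = 64):
        super().__init__()
        # Maps a normalized (y, x) coordinate to C kernel values.
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, channels),
        )

    def forward(self, height: int, width: int) -> torch.Tensor:
        # Normalized coordinate grid in [-1, 1], flattened to (H*W, 2).
        ys = torch.linspace(-1.0, 1.0, height)
        xs = torch.linspace(-1.0, 1.0, width)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        coords = grid.reshape(-1, 2)
        # (H*W, C) -> (1, C, H, W): one full-resolution kernel value per channel.
        return self.net(coords).t().reshape(1, -1, height, width)

# Example: a 1 x 64 x 32 x 32 global hyper-kernel for 32x32 inputs.
k_g = CoordMLP(channels=64)(32, 32)
```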

  • HyperZ·Z·W Operator:

Instead of dot product-based attention, which incurs quadratic cost, the paper introduces the HyperZ·Z·W operator. This operator bridges the hyper-kernels and the hidden activations $\mathcal{Z}$ by means of simple elementwise multiplication. The resulting context-dependent weights (denoted $\hat{\mathcal{W}}$) are then used either for dot product operations (in the global branch) or for sliding-window convolution (in the local branch). This mechanism achieves pixel-level scoring with linear time and space complexity.
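A hedged sketch of the operator's core computation is given below: the hyper-kernel and the hidden activations are fused by elementwise multiplication into context-dependent weights, which are then applied back to $\mathcal{Z}$. The exact composition of the paper's global and local branches may differ; in particular, the pooled-activation scaling and the per-sample grouped depth-wise convolution used for the local branch here are assumptions.

```python
# Hedged sketch of the HyperZ.Z.W idea: fuse hyper-kernel W and activations Z
# by elementwise multiplication into context-dependent weights W_hat, then
# apply W_hat back to Z. Illustrative, not the authors' implementation.
import torch
import torch.nn.functional as F

def hyperzzw_global(z: torch.Tensor, k_g: torch.Tensor) -> torch.Tensor:
    # z:   (B, C, H, W) hidden activations
    # k_g: (1, C, H, W) global hyper-kernel (broadcast over the batch)
    w_hat = k_g * z            # context-dependent weights, same shape as z
    return w_hat * z           # pixel-level scores via a second elementwise product

def hyperzzw_local(z: torch.Tensor, k_l: torch.Tensor) -> torch.Tensor:
    # k_l: (C, 1, k, k) local hyper-kernel for depth-wise convolution
    b, c, h, w = z.shape
    k = k_l.shape[-1]
    # Make the kernel context-dependent by scaling it with pooled activations
    # (an illustrative choice), then convolve each sample with its own kernel
    # via a grouped depth-wise convolution.
    scale = z.mean(dim=(2, 3)).reshape(b * c, 1, 1, 1)
    w_hat = k_l.repeat(b, 1, 1, 1) * scale
    out = F.conv2d(z.reshape(1, b * c, h, w), w_hat, padding=k // 2, groups=b * c)
    return out.reshape(b, c, h, w)
```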

Architecture Design: Slow-Fast Neural Encoding (SFNE) Block

  • Multi-Branch Structure:

The SFNE block is organized into multiple branches (nine in the primary design), each responsible for extracting features at a different scale. Branches are built from the global and local HyperZ·Z·W operators, supplemented by auxiliary modules such as the parameter-free Si-GLU, the Recursive Gated Unit (RGU), and novel hyper-channel and hyper-interaction modules, which perform channel and spatial mixing as well as multi-scale feature fusion.
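To make the dataflow concrete, here is a minimal skeleton of such a multi-branch block under the assumption that each branch maps the hidden state to a feature map of the same resolution and that branches are fused by channel concatenation. The branch internals (HyperZ·Z·W variants, Si-GLU, RGU, hyper-channel/hyper-interaction modules) are abstracted away as placeholder callables; this is not the paper's SFNE implementation.

```python
# Illustrative multi-branch skeleton: parallel branches operate on the same
# hidden state and their outputs are concatenated along the channel axis,
# ready for the subsequent bottleneck compression.
import torch
import torch.nn as nn

class SFNEBranches(nn.Module):
    def __init__(self, branches: list):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Each branch maps (B, C, H, W) -> (B, C, H, W), e.g. a global or local
        # HyperZ.Z.W operator followed by channel/spatial mixing modules.
        feats = [branch(z) for branch in self.branches]
        # Multi-branch hidden representation: (B, C * num_branches, H, W).
        return torch.cat(feats, dim=1)
```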

  • Elimination of Residual and Pooling Layers:

By leveraging large implicit convolution kernels, the architecture achieves full context interaction at every layer. This design obviates the need for residual (additive) connections, which are traditionally employed to compensate for limited layer-wise representation capacity. In parallel, no intermediate pooling layers are used, preserving spatial structure and mitigating distortion in feature maps.

  • Normalization via Standardization:

Rather than using conventional normalization layers (e.g., Batch Normalization, which incorporates affine parameters and running-statistics momentum), the paper adopts a pure z-score standardization procedure. This procedure, termed Batch Standardization (BS) and Instance Standardization (IS), ensures stable zero-mean, unit-variance features, which contributes to faster training convergence and improved in-context learning.
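A minimal sketch of such affine-free z-score standardization is shown below, assuming the usual convention that "batch" statistics are taken per channel across the batch and spatial dimensions, while "instance" statistics are taken per sample and channel; this mirrors the description, not the authors' exact code.

```python
# Pure z-score standardization without affine parameters or running statistics.
import torch

def batch_standardize(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (B, C, H, W); zero mean / unit variance per channel across the batch.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def instance_standardize(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Zero mean / unit variance per sample and channel.
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)
```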

Additional Innovations and Training Strategy

  • Local Feedback Loss (Slow Neural Loss):

A novel loss term is introduced to update the slow network via local feedback. The slow neural loss penalizes discrepancies between hyper-kernels across SFNE blocks by computing the mean squared error between a block's hyper-kernel and the averaged hyper-kernels from preceding blocks. This promotes consistency in the generated context-dependent weights and refines the pixel-level scoring of the HyperZ·Z·W operator.
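The sketch below expresses this loss as described, assuming the hyper-kernels of all blocks share a shape; the function name and the detaching of the averaged target are illustrative choices.

```python
# Hedged sketch of the local feedback (slow neural) loss: MSE between the
# current block's hyper-kernel and the average of preceding blocks' kernels.
import torch
import torch.nn.functional as F

def slow_neural_loss(current_kernel: torch.Tensor,
                     previous_kernels: list) -> torch.Tensor:
    if not previous_kernels:
        return current_kernel.new_zeros(())
    # Average the hyper-kernels from preceding SFNE blocks as the target.
    target = torch.stack(previous_kernels).mean(dim=0).detach()
    return F.mse_loss(current_kernel, target)
```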

  • Channel Bottleneck and Group-Based Standardization:

Concatenated multi-scale features are compressed through a bottleneck layer modulated by a coefficient $\lambda$, ensuring that only the most informative features propagate. Additionally, group-based instance-batch standardization is applied to enhance the diversity and stability of channel variances.
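A hedged sketch of the channel bottleneck follows: the concatenated multi-branch features are compressed by a $1 \times 1$ convolution whose output width is a fraction $\lambda$ of the input width. The particular value of $\lambda$ and its exact placement in the paper are not reproduced here.

```python
# Channel bottleneck sketch: compress concatenated features by a factor lam.
import torch
import torch.nn as nn

def make_bottleneck(in_channels: int, lam: float = 0.25) -> nn.Conv2d:
    out_channels = max(1, int(in_channels * lam))
    return nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

# Example: compress the concatenation of nine 64-channel branch outputs.
bottleneck = make_bottleneck(in_channels=9 * 64, lam=0.25)
x = torch.randn(2, 9 * 64, 32, 32)
y = bottleneck(x)  # (2, 144, 32, 32)
```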

Empirical Performance and Analysis

  • Superior Accuracy with Fewer Parameters:
    • On 1D image classification benchmarks (sMNIST, pMNIST, sCIFAR10), Terminator achieves an accuracy of 99.72% on sMNIST and 93.21% on sCIFAR10 with only 1.3M parameters.
    • On 2D datasets such as CIFAR10 and CIFAR100, Terminator significantly surpasses architectures like Inception, ResNet, and ViT, while also relying on dramatically fewer parameters.
    • On the STL10 dataset, Terminator attains a test accuracy of 86.32%, outperforming a ResNet-152 variant that uses residual and normalization operations—and doing so with roughly one-seventh of the model parameters.
  • Enhanced Training Dynamics:

Visualizations show that Terminator reaches the accuracy of a deep residual model (e.g., ResNet-152) in approximately one-sixth of the training epochs, a benefit attributed to the stable zero-mean features induced by the standardization layers. The absence of pooling layers and the direct full-context interaction at every layer further minimize information loss and reduce learning complexity.

  • Context-Dependent Weighting and Global-Local Feature Fusion:

The dual-path strategy—combining global and local hyper-kernel branches—allows the model to simultaneously capture coarse outlines and fine details. The global branch performs pixel-level scoring through elementwise operations followed by dot products or FFTs (for 1D sequences), while the local branch uses sliding window-based convolutions. This fusion obviates the need for conventional residual shortcuts to mitigate feature loss.
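For the 1D case, a sequence-length global kernel can be applied in $O(L \log L)$ via FFT-based convolution. The sketch below illustrates that standard technique under the assumption of a per-channel kernel as long as the sequence; it is not a claim about the authors' exact implementation.

```python
# Illustrative FFT-based long convolution for 1D sequences.
import torch

def fft_conv1d(z: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # z:      (B, C, L) hidden activations
    # kernel: (C, L)    sequence-length hyper-kernel
    L = z.shape[-1]
    z_f = torch.fft.rfft(z, n=2 * L)        # zero-pad to avoid circular wrap-around
    k_f = torch.fft.rfft(kernel, n=2 * L)
    return torch.fft.irfft(z_f * k_f, n=2 * L)[..., :L]
```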

In summary, the paper introduces a comprehensive architectural redesign that replaces residual learning with a combination of slow-fast network dynamics and context-dependent fast weight generation. The design innovations successfully lead to faster convergence, enhanced feature representation, and state-of-the-art performance on challenging classification tasks with significantly reduced model complexity.
