GSoP-Net1: Global Second-Order Pooling in ConvNets
- GSoP-Net1 is a convolutional neural network that incorporates global second-order pooling at various depths to capture richer feature statistics.
- It computes covariance matrices from convolutional outputs, enabling enhanced non-linear modeling compared to traditional first-order pooling methods.
- Experimental results on ImageNet show that GSoP-Net1 reduces error rates relative to SE-Net and CBAM while maintaining efficient computational overhead.
GSoP-Net1 is a convolutional neural network architecture that systematically injects global second-order pooling (GSoP) operations at multiple layers in a deep ConvNet, rather than only at the network’s output. This design, introduced by Gao et al. (2018), addresses the challenge of enhancing non-linear modeling capability in large-scale visual recognition by leveraging higher-order feature representations in both lower and higher layers. The result is a significant improvement over first-order and end-only second-order pooling approaches on tasks such as ImageNet-1K classification.
1. Formal Definition of Global Second-Order Pooling
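GSoP computes a covariance matrix from the feature map produced by a convolutional layer, providing a holistic second-order representation. Let $X \in \mathbb{R}^{h \times w \times c}$ denote the convolutional output, with $x_{ij} \in \mathbb{R}^{c}$ as the feature at spatial position $(i, j)$, where $i = 1, \dots, h$, $j = 1, \dots, w$, and $N = hw$. The mean feature is defined as:

$$\mu = \frac{1}{N} \sum_{i=1}^{h} \sum_{j=1}^{w} x_{ij}$$

The covariance matrix is:

$$\Sigma = \frac{1}{N} \sum_{i=1}^{h} \sum_{j=1}^{w} (x_{ij} - \mu)(x_{ij} - \mu)^{\top} \in \mathbb{R}^{c \times c}$$

Typically, mean-subtraction is omitted since a subsequent Batch Norm will re-center $\Sigma$.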
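The pooling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `global_second_order_pool`, the `subtract_mean` flag, and the toy tensor size are assumptions made for the example.

```python
import numpy as np

def global_second_order_pool(feature_map, subtract_mean=True):
    """Return the c x c covariance of an (h, w, c) feature map (illustrative sketch)."""
    h, w, c = feature_map.shape
    # Flatten the spatial grid: each row is the c-dimensional feature at one position.
    feats = feature_map.reshape(h * w, c)
    if subtract_mean:
        # Optional centering; the text notes it is typically skipped in practice.
        feats = feats - feats.mean(axis=0, keepdims=True)
    # Average of outer products over all N = h * w positions, as one matrix product.
    return feats.T @ feats / (h * w)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 16))   # toy feature map, sizes chosen arbitrarily
sigma = global_second_order_pool(x)
print(sigma.shape)
```

The result is symmetric and positive semi-definite, and its size depends only on the channel count, not on the spatial resolution of the input.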
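Following this, row-wise Batch Normalization (BN) is applied across each row of the covariance matrix $\Sigma$, yielding $\hat{\Sigma}$. A two-stage nonlinear transformation forms the channel scaling vector. A row-wise linear projection with weights $W_1$ gives $z = f_{W_1}(\hat{\Sigma})$, followed by LeakyReLU activation (slope 0.1): $a = \mathrm{LeakyReLU}_{0.1}(z)$, and a final linear projection with sigmoid nonlinearity: $s = \sigma(W_2\, a)$, where $W_2$ is the weight matrix of the final projection and $\sigma$ is the logistic sigmoid. This scaling vector, whose entries lie in $(0, 1)$, is used to adaptively weight feature channels.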
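The embedding from covariance matrix to scaling vector might look as follows in NumPy. This is a sketch under stated assumptions: the row standardization is a simplified stand-in for BN (no batch statistics, no learned scale or shift), and the weight shapes `w_row` (with `k` filters per row) and `w_out` are hypothetical choices for illustration.

```python
import numpy as np

def leaky_relu(v, slope=0.1):
    return np.where(v > 0, v, slope * v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_scaling_vector(sigma, w_row, w_out, eps=1e-5):
    """Map a c x c covariance to a scaling vector with entries in (0, 1).

    w_row: (c, k, c) array, k hypothetical filters per covariance row.
    w_out: (c_out, c * k) array, hypothetical final linear projection.
    """
    # Simplified stand-in for row-wise BN: standardize each row independently.
    mean = sigma.mean(axis=1, keepdims=True)
    std = np.sqrt(sigma.var(axis=1, keepdims=True) + eps)
    rows = (sigma - mean) / std
    # Row-wise projection: row r is seen only by its own k filters,
    # which mimics a grouped convolution with one group per row.
    z = np.einsum("rkc,rc->rk", w_row, rows).reshape(-1)
    a = leaky_relu(z, slope=0.1)
    return sigmoid(w_out @ a)

rng = np.random.default_rng(0)
c, k, c_out = 8, 4, 32                      # toy sizes, chosen arbitrarily
samples = rng.standard_normal((50, c))
sigma = samples.T @ samples / 50            # a toy covariance matrix
w_row = 0.1 * rng.standard_normal((c, k, c))
w_out = 0.1 * rng.standard_normal((c_out, c * k))
s = channel_scaling_vector(sigma, w_row, w_out)
print(s.shape, float(s.min()), float(s.max()))
```

Because the last step is a sigmoid, every weight is strictly between 0 and 1, so the block can attenuate a channel but never flip its sign.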
Matrix square-root normalization of the covariance matrix $\Sigma$ via eigendecomposition or Newton–Schulz iteration is explored for the GSoP-Net2 variant but is not used in GSoP-Net1, which uses standard average pooling in the final stage.
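For reference, the Newton–Schulz alternative mentioned above can be illustrated with the standard coupled iteration. This sketch is not part of GSoP-Net1 and is not taken from the authors' code; the function name, iteration count, and test matrix are assumptions, and the input is assumed symmetric positive definite.

```python
import numpy as np

def newton_schulz_sqrt(sigma, num_iters=20):
    """Approximate the square root of a symmetric positive-definite matrix."""
    eye = np.eye(sigma.shape[0])
    trace = np.trace(sigma)
    # Pre-normalize by the trace so all eigenvalues lie in (0, 1],
    # which keeps the iteration inside its convergence region.
    y = sigma / trace
    z = eye.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * eye - z @ y)
        y = y @ t   # tends to the square root of sigma / trace
        z = t @ z   # tends to the inverse square root of sigma / trace
    # Undo the pre-normalization.
    return np.sqrt(trace) * y

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
sigma = a @ a.T + 6.0 * np.eye(6)     # well-conditioned toy covariance
root = newton_schulz_sqrt(sigma)

# Reference square root via eigendecomposition.
evals, evecs = np.linalg.eigh(sigma)
reference = (evecs * np.sqrt(evals)) @ evecs.T
print(float(np.abs(root - reference).max()))
```

The iteration uses only matrix products, which is why it is often preferred over eigendecomposition on GPUs.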
2. Structure and Operation of GSoP Blocks
GSoP blocks come in two principal variants: channel-wise and spatial-wise, both designed to extract and apply second-order statistics for tensor scaling.
Channel-wise GSoP Block
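Given an input tensor $F \in \mathbb{R}^{h' \times w' \times c'}$, a $1 \times 1$ convolution reduces the channels to $c$, producing $X \in \mathbb{R}^{h' \times w' \times c}$. $X$ is reshaped as an $N \times c$ matrix ($N = h'w'$), and its $c \times c$ covariance matrix $\Sigma$ is computed. After row-wise BN and embedding, the resulting vector $s$ is broadcast and multiplied with $F$ along the channel dimension: $\tilde{F}_{ijk} = s_k \, F_{ijk}$.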
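The data flow of a channel-wise block can be sketched as below. This is a simplified illustration, not the reference implementation: the embedding is passed in as a callable so the sketch stays short, `toy_embed` is a hypothetical placeholder for the learned BN and projection layers, and applying the weights to the block input is an assumption of this example.

```python
import numpy as np

def channel_wise_gsop(x, w_reduce, embed):
    """Rescale each channel of x, shape (h, w, c_in), by a covariance-derived weight.

    w_reduce: (c_in, c) weights of the 1x1 channel-reducing convolution.
    embed: callable mapping a (c, c) covariance to a (c_in,) vector in (0, 1).
    """
    h, w, _ = x.shape
    # A 1x1 convolution is a per-position linear map over channels.
    reduced = x @ w_reduce                        # (h, w, c)
    feats = reduced.reshape(h * w, -1)            # N x c matrix, N = h * w
    feats = feats - feats.mean(axis=0, keepdims=True)
    sigma = feats.T @ feats / (h * w)             # c x c channel covariance
    s = embed(sigma)                              # one weight per input channel
    return x * s                                  # broadcast over both spatial axes

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 32))
w_reduce = 0.1 * rng.standard_normal((32, 8))
w_embed = 0.1 * rng.standard_normal((32, 8))      # hypothetical embedding weights

def toy_embed(sigma):
    # Placeholder for row-wise BN + two projections: a sigmoid of a linear map.
    return 1.0 / (1.0 + np.exp(-(w_embed @ sigma.mean(axis=1))))

y = channel_wise_gsop(x, w_reduce, toy_embed)
print(y.shape)
```

The output has the same shape as the input, so the block can be dropped between existing layers without changing the surrounding architecture.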
Spatial-wise GSoP Block
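For an input $F \in \mathbb{R}^{h' \times w' \times c'}$, a $1 \times 1$ convolution again reduces to $c$ channels, resulting in a tensor of size $h' \times w' \times c$. This tensor is spatially downsampled to $X \in \mathbb{R}^{h \times w \times c}$, which is viewed as an $N \times c$ matrix $M$ with one row per spatial position ($N = hw$). The spatial covariance $\Sigma_{\mathrm{sp}} \in \mathbb{R}^{N \times N}$ is

$$\Sigma_{\mathrm{sp}} = \frac{1}{c} \, \bar{M} \bar{M}^{\top},$$

where $\bar{M}$ is $M$ with the mean over channels subtracted from each row. It is normalized row-wise and nonlinearly embedded into a vector $m \in \mathbb{R}^{N}$, reshaped to $h \times w$, and upsampled to $m^{\uparrow} \in \mathbb{R}^{h' \times w'}$. Feature scaling is then

$$\tilde{F}_{ijk} = m^{\uparrow}_{ij} \, F_{ijk}.$$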
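A spatial-wise block follows the same pattern with the roles of positions and channels swapped. The sketch below is illustrative only: average pooling for downsampling, nearest-neighbour upsampling, the pooled grid size, and the placeholder `toy_embed` are all assumptions of this example rather than details confirmed by the text.

```python
import numpy as np

def spatial_wise_gsop(x, w_reduce, embed, pooled_hw=(4, 4)):
    """Rescale each spatial position of x, shape (h, w, c_in), by one weight.

    Assumes h and w are divisible by the pooled grid size.
    embed: callable mapping an (n, n) covariance to an (n,) vector in (0, 1).
    """
    h, w, _ = x.shape
    ph, pw = pooled_hw
    reduced = x @ w_reduce                                   # 1x1 conv: (h, w, c)
    c = reduced.shape[-1]
    # Downsample by averaging non-overlapping windows (an assumed choice).
    pooled = reduced.reshape(ph, h // ph, pw, w // pw, c).mean(axis=(1, 3))
    feats = pooled.reshape(ph * pw, c)                       # one row per position
    feats = feats - feats.mean(axis=1, keepdims=True)        # center over channels
    sigma_sp = feats @ feats.T / c                           # (ph*pw, ph*pw)
    m = embed(sigma_sp).reshape(ph, pw)                      # one weight per pooled cell
    # Nearest-neighbour upsampling back to the input resolution.
    m_up = np.repeat(np.repeat(m, h // ph, axis=0), w // pw, axis=1)
    return x * m_up[:, :, None]                              # broadcast over channels

def toy_embed(sigma_sp):
    # Hypothetical placeholder for row-wise BN and the learned projections.
    return 1.0 / (1.0 + np.exp(-sigma_sp.mean(axis=1)))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 6))
w_reduce = 0.1 * rng.standard_normal((6, 3))
y = spatial_wise_gsop(x, w_reduce, toy_embed, pooled_hw=(4, 4))
print(y.shape)
```

Downsampling before the covariance keeps the matrix at a fixed, small size regardless of the input resolution.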
3. Block Placement and Network Architecture
GSoP-Net1 is built upon a ResNet-50 backbone, with GSoP channel-wise blocks inserted at the end of each residual stage. Below is the block mapping:
| Stage | Output Resolution | Block Placement |
|---|---|---|
| conv2_x | $56 \times 56 \times 256$ | After last bottleneck |
| conv3_x | $28 \times 28 \times 512$ | After the stage |
| conv4_x | $14 \times 14 \times 1024$ | After the stage |
| conv5_x | $7 \times 7 \times 2048$ | After the stage |
The full forward pipeline is as follows:
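- Input: $224 \times 224 \times 3$
- conv1: $7 \times 7$, 64, stride 2 → BN → ReLU → $112 \times 112 \times 64$
- pool1: $3 \times 3$, max, stride 2 → $56 \times 56 \times 64$
- conv2_x: three bottleneck blocks → $56 \times 56 \times 256$
- GSoP block → $56 \times 56 \times 256$
- conv3_x: four bottlenecks → $28 \times 28 \times 512$
- GSoP block → $28 \times 28 \times 512$
- conv4_x: six bottlenecks → $14 \times 14 \times 1024$
- GSoP block → $14 \times 14 \times 1024$
- conv5_x: three bottlenecks → $7 \times 7 \times 2048$
- GSoP block → $7 \times 7 \times 2048$
- Global average pooling → $2048$
- Fully connected (2048 to 1000) → softmax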
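The spatial sizes in this pipeline follow from ordinary convolution arithmetic, which the short script below traces. It is a sketch under assumptions: a 224-pixel input, the usual ResNet padding values, and each stage modelled by a single 3 × 3 convolution carrying the stage's stride are simplifications chosen for the example.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224                                   # assumed input resolution
size = conv_out(size, kernel=7, stride=2, padding=3)
shapes = [("conv1", size, 64)]
size = conv_out(size, kernel=3, stride=2, padding=1)
shapes.append(("pool1", size, 64))

stages = [("conv2_x", 256, 1), ("conv3_x", 512, 2),
          ("conv4_x", 1024, 2), ("conv5_x", 2048, 2)]
for name, channels, stride in stages:
    # Only the stage's strided convolution changes the resolution.
    size = conv_out(size, kernel=3, stride=stride, padding=1)
    shapes.append((name, size, channels))
    # A GSoP block only rescales its input, so the shape is unchanged.
    shapes.append((name + " + GSoP", size, channels))

for name, side, channels in shapes:
    print(f"{name:16s} {side} x {side} x {channels}")
```

Because every GSoP block preserves the shape of its input, the backbone's stage resolutions are exactly those of the unmodified ResNet-50.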
Each GSoP block introduces approximately 0.7 million parameters and 30 million FLOPs (channel-wise) and can optionally include a spatial-wise branch (approximately 0.16 million parameters, 26 million FLOPs).
4. Implementation and Computational Considerations
GSoP blocks are implemented efficiently, with a uniform channel reduction via $1 \times 1$ convolutions prior to pooling. For channel-wise blocks, the covariance matrix is computed via a single matrix multiplication of a $c \times N$ matrix with its $N \times c$ transpose, where $c$ is the reduced channel count and $N$ is the number of spatial positions. Row-wise BN leverages a BatchNorm layer treating the $c$ dimension as channels. The two linear projections (row-convs) are implemented as grouped convolutions, with each output row corresponding to its input row. The final rescaling uses a broadcast multiply across tensors.
No large 3D convolutions or eigen-decomposition operations are required in GSoP-Net1, leading to an overhead of approximately 10% in parameters and 60% in FLOPs compared to ResNet-50.
5. Experimental Results and Ablation Analysis
On ImageNet-1K with standard $224 \times 224$ training and validation, GSoP-Net1 with blocks in conv2_x through conv5_x and final global average pooling achieves:
| Model | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|
| ResNet-50 (baseline) | 23.85 | 7.13 |
| GSoP-Net1 | 22.32 (↓1.53) | 6.02 (↓1.11) |
| SE-Net-50 | 23.29 | 6.62 |
| CBAM | 22.66 | 6.31 |
| MPN-COV | 22.74 | 6.54 |
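The margins implied by the table can be recomputed directly. The snippet below uses only the error rates listed above; the dictionary layout and variable names are illustrative.

```python
# Top-1 / Top-5 error rates (%) copied from the table above.
errors = {
    "ResNet-50": (23.85, 7.13),
    "GSoP-Net1": (22.32, 6.02),
    "SE-Net-50": (23.29, 6.62),
    "CBAM": (22.66, 6.31),
    "MPN-COV": (22.74, 6.54),
}

gsop_top1, gsop_top5 = errors["GSoP-Net1"]
# Absolute error reduction of GSoP-Net1 relative to each other model.
gaps = {
    name: (round(top1 - gsop_top1, 2), round(top5 - gsop_top5, 2))
    for name, (top1, top5) in errors.items()
    if name != "GSoP-Net1"
}

for name, (gap1, gap5) in gaps.items():
    print(f"vs {name:10s} Top-1 -{gap1:.2f}  Top-5 -{gap5:.2f}")
```

The gap to the ResNet-50 baseline reproduces the 1.53 and 1.11 point reductions shown in the table, while the gaps to the attention-based and end-only second-order models are all under one point.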
Ablation studies with ResNet-26 on ImageNet indicate that second-order blocks in intermediate layers reduce error incrementally:
- Single channel-wise GSoP at conv2_x: Top-1 18.45%
- Single at conv5_x: Top-1 18.33%
- All four stages: Top-1 17.42% (baseline 19.18%)
This suggests that incorporating second-order statistics at multiple stages, rather than exclusively at the tail of the network, provides cumulative benefit and outperforms both first-order in-network pooling (SE/CBAM) and classical end-only second-order schemes (MPN-COV).
6. Comparison to Related Approaches
GSoP-Net1 is distinct from SE-Net-50 and CBAM, both of which are based on first-order channel or spatial attention mechanisms, and from MPN-COV, which applies second-order pooling only at the network’s output. While SE-Net-50 achieves Top-1/Top-5 errors of 23.29%/6.62% and CBAM obtains 22.66%/6.31%, GSoP-Net1 outperforms these with 22.32%/6.02%. Compared to MPN-COV (Top-1/Top-5 22.74%/6.54%), GSoP-Net1’s strategy of placing GSoP blocks at multiple depths yields superior results (Gao et al., 2018).
7. Significance and Implications
The methodology of GSoP-Net1 demonstrates that the systematic distribution of global second-order pooling blocks throughout a ConvNet backbone yields substantial gains in non-linear representational capacity and overall recognition performance. By leveraging holistic image statistics beyond first-order pooling and attention mechanisms, and without the need for computationally expensive eigen-decompositions or square-root normalization at every stage, GSoP-Net1 sets a precedent for higher-order representation learning at all layers of deep convolutional architectures. This framework provides a foundation for future advances in hierarchical network design utilizing higher-order pooling mechanisms (Gao et al., 2018).