
GSoP-Net1: Global Second-Order Pooling in ConvNets

Updated 17 April 2026
  • GSoP-Net1 is a convolutional neural network that incorporates global second-order pooling at various depths to capture richer feature statistics.
  • It computes covariance matrices from convolutional outputs, enabling enhanced non-linear modeling compared to traditional first-order pooling methods.
  • Experimental results on ImageNet show that GSoP-Net1 reduces error rates relative to SE-Net and CBAM while maintaining efficient computational overhead.

GSoP-Net1 is a convolutional neural network architecture that systematically injects global second-order pooling (GSoP) operations at multiple layers in a deep ConvNet, rather than only at the network’s output. This design, introduced by Gao et al. (2018), targets stronger non-linear modeling capability in large-scale visual recognition by exploiting higher-order feature representations in both lower and higher layers. On ImageNet-1K classification it improves on both first-order and end-only second-order pooling approaches.

1. Formal Definition of Global Second-Order Pooling

GSoP computes a covariance matrix from the feature map produced by a convolutional layer, providing a holistic second-order representation. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the convolutional output with $x_{ij} \in \mathbb{R}^C$ as the feature at spatial position $(i,j)$, where $i=1\ldots H$, $j=1\ldots W$, and $N=HW$. The mean feature is defined as:

$$\mu = \frac{1}{N} \sum_{i=1}^H \sum_{j=1}^W x_{ij}.$$

The covariance matrix $\Sigma \in \mathbb{R}^{C \times C}$ is:

$$\Sigma = \frac{1}{N} \sum_{i=1}^H \sum_{j=1}^W (x_{ij} - \mu)(x_{ij} - \mu)^T.$$

In practice, mean subtraction is typically omitted, since the subsequent Batch Normalization re-centers $\Sigma$.
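The covariance definition above can be evaluated with a single matrix product instead of an explicit double sum. The following NumPy sketch is illustrative only: the function name `global_covariance` and the `subtract_mean` flag are assumptions made for this example, not names from the original work.

```python
import numpy as np

def global_covariance(x, subtract_mean=True):
    """Covariance of channel features over all spatial positions.

    x: array of shape (H, W, C). Returns a (C, C) matrix.
    """
    h, w, c = x.shape
    feats = x.reshape(h * w, c)              # N x C, one row per spatial position
    if subtract_mean:
        feats = feats - feats.mean(axis=0, keepdims=True)
    return feats.T @ feats / (h * w)         # (1/N) * sum of outer products

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 16))          # toy feature map, sizes chosen arbitrarily
cov = global_covariance(x)
print(cov.shape)
```

Setting `subtract_mean=False` corresponds to the variant in which centering is left to the later normalization step.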

Following this, row-wise Batch Normalization (BN) is applied to each row of the covariance matrix. A two-stage nonlinear transformation then forms the channel scaling vector: a row-wise linear projection of the normalized matrix, followed by LeakyReLU activation (slope 0.1), and a final linear projection with sigmoid nonlinearity. The resulting non-negative scaling vector, whose entries lie in $(0, 1)$, is used to adaptively weight feature channels.
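A minimal sketch of this embedding step, under stated assumptions, is shown below. Per-row standardization stands in for a trained BatchNorm layer, and the weight shapes (`w1`, `w2`), the per-row width `k`, and the function name `scaling_vector` are hypothetical choices for illustration; the text does not fix them.

```python
import numpy as np

def leaky_relu(z, slope=0.1):
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scaling_vector(cov, w1, w2, eps=1e-5):
    # Stand-in for row-wise BN: standardize each row with its own statistics.
    mean = cov.mean(axis=1, keepdims=True)
    var = cov.var(axis=1, keepdims=True)
    normed = (cov - mean) / np.sqrt(var + eps)
    # Row-wise projection: row i is combined with its own weights w1[i] (c x k).
    hidden = np.einsum("ij,ijk->ik", normed, w1).reshape(-1)   # length c * k
    hidden = leaky_relu(hidden, slope=0.1)
    # Final linear projection followed by a sigmoid: one weight per output channel.
    return sigmoid(w2 @ hidden)

rng = np.random.default_rng(0)
c, k, c_out = 8, 4, 32                       # hypothetical sizes
feats = rng.standard_normal((49, c))
cov = feats.T @ feats / 49
w1 = rng.standard_normal((c, c, k)) * 0.1
w2 = rng.standard_normal((c_out, c * k)) * 0.1
s = scaling_vector(cov, w1, w2)
print(s.shape, float(s.min()), float(s.max()))
```

Because the last step is a sigmoid, every entry of `s` lies strictly between 0 and 1, which is what makes it usable as a per-channel gate.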

Matrix square-root normalization of the covariance matrix via eigendecomposition or Newton–Schulz iteration is explored for the GSoP-Net2 variant but is not used in GSoP-Net1, which uses standard average pooling in the final stage.
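For context, the coupled Newton–Schulz iteration mentioned here approximates a matrix square root using only matrix products. The sketch below is a generic textbook form of the iteration, not the GSoP-Net2 implementation; the function name, the iteration count, and the test matrix are assumptions.

```python
import numpy as np

def newton_schulz_sqrt(a, num_iters=15):
    """Approximate the square root of a symmetric positive-definite matrix."""
    n = a.shape[0]
    norm = np.linalg.norm(a)                 # Frobenius norm; pre-scaling aids convergence
    y = a / norm
    z = np.eye(n)
    for _ in range(num_iters):
        t = 0.5 * (3.0 * np.eye(n) - z @ y)
        y = y @ t                            # y approaches sqrt(a / norm)
        z = t @ z                            # z approaches the inverse square root
    return y * np.sqrt(norm)                 # undo the pre-scaling

rng = np.random.default_rng(0)
b = rng.standard_normal((32, 6))
a = b.T @ b / 32 + 0.5 * np.eye(6)           # well-conditioned SPD test matrix
root = newton_schulz_sqrt(a)
print(np.abs(root @ root - a).max())
```

The iteration avoids eigendecomposition entirely, which is why it is attractive on GPUs; GSoP-Net1 sidesteps the question by not applying square-root normalization at all.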

2. Structure and Operation of GSoP Blocks

GSoP blocks come in two principal variants: channel-wise and spatial-wise, both designed to extract and apply second-order statistics for tensor scaling.

Channel-wise GSoP Block

Given an input tensor of size $H \times W \times C$, a $1 \times 1$ convolution reduces the number of channels to a smaller value $c$. The reduced tensor is reshaped into a two-dimensional matrix over the $N = HW$ spatial positions, and its $c \times c$ covariance matrix is computed. After row-wise BN and embedding, the resulting vector of channel weights is broadcast and multiplied with the input along the channel dimension. Each channel is rescaled by a single scalar, so the spatial dimensions of the tensor remain unaffected.
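The channel-wise block can be sketched end to end as follows. This is a simplified illustration under assumptions: the 1×1 convolution is written as a per-pixel matrix product, and the embedding is passed in as a callable (`toy_embed` here is a hypothetical stand-in for the row-wise BN and two-stage projection). All sizes are arbitrary.

```python
import numpy as np

def channel_gsop_block(x, w_reduce, embed):
    """x: (H, W, C_in); w_reduce: (C_in, c); embed maps a (c, c) matrix to (C_in,) weights."""
    h, w, _ = x.shape
    reduced = x @ w_reduce                       # a 1x1 convolution is a per-pixel matmul
    feats = reduced.reshape(h * w, -1)           # N x c
    feats = feats - feats.mean(axis=0, keepdims=True)
    cov = feats.T @ feats / (h * w)              # c x c channel covariance
    scale = embed(cov)                           # one weight per input channel
    return x * scale[None, None, :]              # broadcast over H and W

rng = np.random.default_rng(0)
c_in, c = 32, 8                                  # hypothetical channel counts
w_embed = rng.standard_normal((c_in, c * c)) * 0.1

def toy_embed(cov):
    # Hypothetical embedding: a single linear map plus sigmoid.
    return 1.0 / (1.0 + np.exp(-(w_embed @ cov.reshape(-1))))

x = rng.standard_normal((14, 14, c_in))
out = channel_gsop_block(x, rng.standard_normal((c_in, c)) * 0.1, toy_embed)
print(out.shape)
```

The output has the same shape as the input, so the block can be dropped between existing layers without changing the rest of the network.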

Spatial-wise GSoP Block

For the same kind of input, a $1 \times 1$ convolution again reduces the number of channels to $c$. The reduced tensor is spatially downsampled to a smaller $h \times w$ grid and viewed as a matrix over $n = hw$ positions. The $n \times n$ spatial covariance, which captures pairwise correlations between positions, is computed, normalized row-wise, and nonlinearly embedded into a vector of $n$ weights. This vector is reshaped to $h \times w$ and upsampled to the input resolution $H \times W$. Feature scaling is then an elementwise multiplication of every input channel by the upsampled weight map.
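A rough sketch of the spatial-wise variant follows. Several details are assumptions made only for illustration: average pooling as the downsampler, nearest-neighbour upsampling, channels treated as the samples when forming the position-by-position covariance, and a toy embedding (sigmoid of row means) in place of the learned one.

```python
import numpy as np

def spatial_gsop_block(x, w_reduce, embed, pool=2):
    """x: (H, W, C_in) with H and W divisible by `pool`."""
    h, w, _ = x.shape
    reduced = x @ w_reduce                                   # (H, W, c)
    c = reduced.shape[-1]
    hs, ws = h // pool, w // pool
    # Average-pool non-overlapping pool x pool windows (assumed downsampler).
    small = reduced.reshape(hs, pool, ws, pool, c).mean(axis=(1, 3))
    feats = small.reshape(hs * ws, c)                        # n positions x c channels
    feats = feats - feats.mean(axis=1, keepdims=True)
    cov = feats @ feats.T / c                                # n x n spatial covariance
    weights = embed(cov).reshape(hs, ws)                     # one weight per coarse position
    # Nearest-neighbour upsampling back to H x W (assumed upsampler).
    full = np.repeat(np.repeat(weights, pool, axis=0), pool, axis=1)
    return x * full[:, :, None]                              # broadcast over channels

rng = np.random.default_rng(0)
x_s = rng.standard_normal((8, 8, 16))
toy_embed_s = lambda cov: 1.0 / (1.0 + np.exp(-cov.mean(axis=1)))   # hypothetical
out_s = spatial_gsop_block(x_s, rng.standard_normal((16, 4)) * 0.1, toy_embed_s)
print(out_s.shape)
```

Here the gate varies across positions but is shared by all channels, the mirror image of the channel-wise block.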

3. Block Placement and Network Architecture

GSoP-Net1 is built upon a ResNet-50 backbone, with GSoP channel-wise blocks inserted at the end of each residual stage. Below is the block mapping:

| Stage | Output size (H × W × C) | Block placement |
|---|---|---|
| conv2_x | 56 × 56 × 256 | After last bottleneck |
| conv3_x | 28 × 28 × 512 | After the stage |
| conv4_x | 14 × 14 × 1024 | After the stage |
| conv5_x | 7 × 7 × 2048 | After the stage |

The full forward pipeline is as follows:

  1. Input: 224 × 224 × 3
  2. conv1: 7 × 7, 64, stride 2 → BN → ReLU → 112 × 112 × 64
  3. pool1: 3 × 3, max, stride 2 → 56 × 56 × 64
  4. conv2_x: three bottleneck blocks → 56 × 56 × 256
  5. GSoP block → 56 × 56 × 256
  6. conv3_x: four bottlenecks → 28 × 28 × 512
  7. GSoP block → 28 × 28 × 512
  8. conv4_x: six bottlenecks → 14 × 14 × 1024
  9. GSoP block → 14 × 14 × 1024
  10. conv5_x: three bottlenecks → 7 × 7 × 2048
  11. GSoP block → 7 × 7 × 2048
  12. Global average pooling → 2048-dimensional vector
  13. Fully connected (2048 to 1000) → softmax
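The spatial sizes in this pipeline can be traced with the usual convolution output-size formula. The short script below is a sanity check rather than a model definition; it assumes a 224-pixel input and the paddings commonly used in ResNet-50 (3 for the 7 × 7 convolution, 1 for 3 × 3 kernels), neither of which is stated explicitly in the list, and it treats each GSoP block as shape-preserving.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

size = 224
trace = [("input", size, 3)]
size = conv_out(size, kernel=7, stride=2, pad=3)
trace.append(("conv1", size, 64))
size = conv_out(size, kernel=3, stride=2, pad=1)
trace.append(("pool1", size, 64))
stages = [("conv2_x", 1, 256), ("conv3_x", 2, 512),
          ("conv4_x", 2, 1024), ("conv5_x", 2, 2048)]
for name, stride, channels in stages:
    size = conv_out(size, kernel=3, stride=stride, pad=1)   # stage-level downsampling
    trace.append((name, size, channels))
    trace.append((name + " + GSoP", size, channels))        # GSoP keeps the shape
for name, s, ch in trace:
    print(f"{name:16s} {s} x {s} x {ch}")
```

The traced sizes (112, 56, 28, 14, 7) agree with the stage resolutions in the block-placement table.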

Each GSoP block introduces approximately 0.7 million parameters and 30 million FLOPs (channel-wise) and can optionally include a spatial-wise branch (approximately 0.16 million parameters, 26 million FLOPs).

4. Implementation and Computational Considerations

GSoP blocks are implemented efficiently, with a uniform channel reduction via $1 \times 1$ convolutions prior to pooling. For channel-wise blocks, the covariance matrix is computed via a single matrix multiplication of the reshaped feature matrix with its transpose. Row-wise BN leverages a BatchNorm layer treating the C dimension as channels. The two linear projections (row-convs) are implemented as grouped convolutions, with each output row corresponding to its input row. The final rescaling is a broadcast multiplication between the scaling tensor and the input tensor.

No large 3D convolutions or eigen-decomposition operations are required in GSoP-Net1, leading to an overhead of approximately 10% in parameters and 60% in FLOPs compared to ResNet-50.
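The parameter overhead can be roughly cross-checked from the per-block figure quoted earlier. The calculation below assumes four channel-wise blocks and a ResNet-50 backbone of about 25.6 million parameters, a commonly cited size that does not come from the text. The FLOPs figure is not rederived, since it depends on per-layer costs that are not listed here.

```python
blocks = 4                        # one channel-wise block per residual stage
params_per_block_m = 0.7          # millions, as quoted for a channel-wise block
resnet50_params_m = 25.6          # millions; commonly cited backbone size (assumption)

added = blocks * params_per_block_m
overhead = added / resnet50_params_m
print(f"added parameters: {added:.1f}M ({overhead:.1%} of the backbone)")
```

Under these assumptions the added parameters amount to roughly 11% of the backbone, in line with the approximate 10% stated above.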

5. Experimental Results and Ablation Analysis

On ImageNet-1K with standard $224 \times 224$ training and validation, GSoP-Net1 with blocks in conv2_x through conv5_x and final global average pooling achieves:

| Model | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|
| ResNet-50 (baseline) | 23.85 | 7.13 |
| GSoP-Net1 | 22.32 (↓1.53) | 6.02 (↓1.11) |
| SE-Net-50 | 23.29 | 6.62 |
| CBAM | 22.66 | 6.31 |
| MPN-COV | 22.74 | 6.54 |
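The margins implied by the table can be tabulated directly. This snippet only re-expresses the reported Top-1 numbers as differences in percentage points; it introduces no new measurements.

```python
top1 = {"ResNet-50": 23.85, "SE-Net-50": 23.29, "CBAM": 22.66,
        "MPN-COV": 22.74, "GSoP-Net1": 22.32}

# Gap between each reference model and GSoP-Net1, in percentage points.
gaps = {name: round(err - top1["GSoP-Net1"], 2)
        for name, err in top1.items() if name != "GSoP-Net1"}
for name, gap in gaps.items():
    print(f"GSoP-Net1 vs {name}: {gap:.2f} points lower Top-1 error")
```

The gap to the ResNet-50 baseline matches the ↓1.53 annotation in the table, while the gaps to CBAM and MPN-COV are well under half a point.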

Ablation studies with ResNet-26 on ImageNet indicate that second-order blocks in intermediate layers reduce error incrementally:

  • Single channel-wise GSoP at conv2_x: Top-1 18.45%
  • Single at conv5_x: Top-1 18.33%
  • All four stages: Top-1 17.42% (baseline 19.18%)
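Expressed as gains over the 19.18% baseline, the ablation figures listed above work out as follows (simple arithmetic on the reported values only):

```python
baseline = 19.18
ablation = {"conv2_x only": 18.45, "conv5_x only": 18.33, "all four stages": 17.42}

# Absolute Top-1 improvement over the baseline, in percentage points.
gains = {name: round(baseline - err, 2) for name, err in ablation.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f} points")
```

A single block recovers well under one point, while placing blocks at all four stages recovers about 1.76 points.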

This suggests that incorporating second-order statistics at multiple stages, rather than exclusively at the tail of the network, provides cumulative benefit and outperforms both first-order in-network pooling (SE/CBAM) and classical end-only second-order schemes (MPN-COV).

6. Comparison with Related Architectures

GSoP-Net1 differs from SE-Net-50 and CBAM, both of which rely on first-order channel or spatial attention, and from MPN-COV, which applies second-order pooling only at the network’s output. SE-Net-50 reaches Top-1/Top-5 errors of 23.29%/6.62% and CBAM reaches 22.66%/6.31%, whereas GSoP-Net1 achieves 22.32%/6.02%. Compared to MPN-COV (22.74%/6.54%), placing GSoP blocks at multiple depths also gives lower error (Gao et al., 2018).

7. Significance and Implications

The methodology of GSoP-Net1 demonstrates that the systematic distribution of global second-order pooling blocks throughout a ConvNet backbone yields substantial gains in non-linear representational capacity and overall recognition performance. By leveraging holistic image statistics beyond first-order pooling and attention mechanisms, and without the need for computationally expensive eigen-decompositions or square-root normalization at every stage, GSoP-Net1 sets a precedent for higher-order representation learning at all layers of deep convolutional architectures. This framework provides a foundation for future advances in hierarchical network design utilizing higher-order pooling mechanisms (Gao et al., 2018).
