GSoP-Net1: Global Second-Order Pooling in ConvNets

Updated 17 April 2026

GSoP-Net1 is a convolutional neural network that incorporates global second-order pooling at various depths to capture richer feature statistics.
It computes covariance matrices from convolutional outputs, enabling enhanced non-linear modeling compared to traditional first-order pooling methods.
Experimental results on ImageNet show that GSoP-Net1 reduces error rates relative to SE-Net and CBAM, at an overhead of roughly 10% in parameters and 60% in FLOPs over ResNet-50.

GSoP-Net1 is a convolutional neural network architecture that systematically injects global second-order pooling (GSoP) operations at multiple layers in a deep ConvNet, rather than only at the network’s output. This design, introduced by Gao et al. (2018), addresses the challenge of enhancing non-linear modeling capability in large-scale visual recognition by leveraging higher-order feature representations in both lower and higher layers. On ImageNet-1K classification with a ResNet-50 backbone, it lowers Top-1 error from 23.85% to 22.32%, improving on both first-order attention and end-only second-order pooling approaches.

1. Formal Definition of Global Second-Order Pooling

GSoP computes a covariance matrix from the feature map produced by a convolutional layer, providing a holistic second-order representation. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the convolutional output, with $x_{ij} \in \mathbb{R}^C$ the feature at spatial position $(i,j)$, where $i=1\ldots H$, $j=1\ldots W$, and $N=HW$. The mean feature is defined as $\mu = \frac{1}{N} \sum_{i=1}^H \sum_{j=1}^W x_{ij}$. The covariance matrix $\Sigma \in \mathbb{R}^{C \times C}$ is $\Sigma = \frac{1}{N} \sum_{i=1}^H \sum_{j=1}^W (x_{ij} - \mu)(x_{ij} - \mu)^T$. Typically, mean-subtraction is omitted since a subsequent Batch Norm will re-center $\Sigma$.
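
The covariance definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name `global_covariance` and the `subtract_mean` flag are hypothetical, and the channels-last layout follows the $H \times W \times C$ convention of the text.

```python
import numpy as np

def global_covariance(x, subtract_mean=True):
    """Covariance of channel features over all spatial positions.

    x: array of shape (H, W, C). Returns a (C, C) matrix.
    """
    h, w, c = x.shape
    feats = x.reshape(h * w, c)  # N x C, one row per position (i, j)
    if subtract_mean:
        # Subtract the mean feature mu from every position.
        feats = feats - feats.mean(axis=0, keepdims=True)
    # (1/N) * sum of outer products, written as one matrix product.
    return feats.T @ feats / (h * w)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 16))
cov = global_covariance(x)
```

The result is symmetric and positive semi-definite; passing `subtract_mean=False` gives the uncentered second-moment matrix that corresponds to omitting mean-subtraction.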

Following this, row-wise Batch Normalization (BN) is applied to each row of the covariance matrix, yielding a normalized matrix $\hat{\Sigma}$. A two-stage nonlinear transformation forms the channel scaling vector: a row-wise linear projection $z = W_1(\hat{\Sigma})$ with learned weights $W_1$, followed by LeakyReLU activation (slope 0.1), $a = \mathrm{LeakyReLU}_{0.1}(z)$, and a final linear projection with sigmoid nonlinearity, $s = \sigma(W_2 a)$, where $W_2$ is a learned weight matrix and $\sigma$ is the logistic sigmoid. This non-negative scaling vector $s$, whose entries lie in $(0,1)$, is used to adaptively weight feature channels.
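
The embedding step can be sketched as follows. Several details are assumptions made for illustration only: per-row standardization stands in for row-wise BN (which would normally use batch statistics and a learned affine transform), and the weight shapes `w1` of size $(C, C, K)$ and `w2` of size $(C_{out}, CK)$, along with the names `embed_covariance`, `k` and `c_out`, are hypothetical.

```python
import numpy as np

def leaky_relu(z, slope=0.1):
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embed_covariance(cov, w1, w2, eps=1e-5):
    """Map a (C, C) covariance to a scaling vector with entries in (0, 1)."""
    # Stand-in for row-wise BN: standardize each row independently.
    mean = cov.mean(axis=1, keepdims=True)
    std = np.sqrt(cov.var(axis=1, keepdims=True) + eps)
    rows = (cov - mean) / std
    # Row-wise projection: row r is combined only with its own weights w1[r].
    hidden = leaky_relu(np.einsum("rc,rck->rk", rows, w1))  # (C, K)
    # Final linear projection followed by a sigmoid gate.
    return sigmoid(w2 @ hidden.reshape(-1))                 # (C_out,)

rng = np.random.default_rng(0)
c, k, c_out = 8, 4, 32
feats = rng.standard_normal((49, c))
cov = np.cov(feats, rowvar=False, bias=True)
w1 = 0.1 * rng.standard_normal((c, c, k))
w2 = 0.1 * rng.standard_normal((c_out, c * k))
scale = embed_covariance(cov, w1, w2)
```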

Matrix square-root normalization of the covariance matrix via eigendecomposition or Newton–Schulz iteration is explored for the GSoP-Net2 variant but is not utilized in GSoP-Net1, which uses standard average pooling in the final stage.
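
For readers unfamiliar with the Newton–Schulz iteration mentioned above, the following is a sketch of the standard coupled form for a symmetric positive-definite matrix. It is included for background only, since GSoP-Net1 does not use it; the function name, the Frobenius-norm pre-scaling, and the iteration count are choices made for this example rather than details taken from the source.

```python
import numpy as np

def newton_schulz_sqrt(a, num_iters=15):
    """Approximate the square root of a symmetric positive-definite matrix."""
    n = a.shape[0]
    identity = np.eye(n)
    norm = np.linalg.norm(a)   # Frobenius norm; pre-scaling ensures convergence
    y = a / norm               # converges to (a / norm)^(1/2)
    z = identity.copy()        # converges to (a / norm)^(-1/2)
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t
        z = t @ z
    return y * np.sqrt(norm)   # undo the pre-scaling

a = np.array([[4.0, 1.0], [1.0, 3.0]])
root = newton_schulz_sqrt(a)
```

The iteration uses only matrix products, which is why it is often preferred over eigendecomposition on GPUs.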

2. Structure and Operation of GSoP Blocks

GSoP blocks come in two principal variants: channel-wise and spatial-wise, both designed to extract and apply second-order statistics for tensor scaling.

Channel-wise GSoP Block

Given input $X \in \mathbb{R}^{H \times W \times C'}$, a $1 \times 1$ convolution reduces channels to $C$ (with $C < C'$), producing $\tilde{X} \in \mathbb{R}^{H \times W \times C}$. $\tilde{X}$ is reshaped as an $N \times C$ matrix ($N = HW$), and its $C \times C$ covariance matrix is computed. After row-wise BN and embedding, the resulting vector $s \in \mathbb{R}^{C'}$ is broadcast and multiplied with $X$ along the channel dimension: $Y_{ijk} = s_k X_{ijk}$. The output $Y$ therefore has the same shape as $X$.
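
Putting the steps of the channel-wise block together gives the following single-sample sketch. It assumes a channels-last layout, treats the $1 \times 1$ convolution as a per-position matrix product, and again replaces row-wise BN with per-row standardization; the function name and all weight shapes are hypothetical.

```python
import numpy as np

def channel_gsop_block(x, w_reduce, w1, w2, eps=1e-5):
    """x: (H, W, C_in); w_reduce: (C_in, C); w1: (C, C, K); w2: (C_in, C*K)."""
    h, w, _ = x.shape
    reduced = x @ w_reduce                      # 1x1 conv as a matmul -> (H, W, C)
    feats = reduced.reshape(h * w, -1)          # N x C
    feats = feats - feats.mean(axis=0, keepdims=True)
    cov = feats.T @ feats / (h * w)             # C x C covariance
    # Stand-in for row-wise BN (assumption): standardize each row.
    rows = (cov - cov.mean(axis=1, keepdims=True)) / np.sqrt(
        cov.var(axis=1, keepdims=True) + eps)
    hidden = np.einsum("rc,rck->rk", rows, w1)  # row-wise projection
    hidden = np.where(hidden > 0, hidden, 0.1 * hidden)  # LeakyReLU, slope 0.1
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden.reshape(-1))))  # (C_in,)
    return x * scale                            # broadcast over H and W

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 32))
y = channel_gsop_block(
    x,
    w_reduce=0.1 * rng.standard_normal((32, 8)),
    w1=0.1 * rng.standard_normal((8, 8, 4)),
    w2=0.1 * rng.standard_normal((32, 32)),
)
```

Because every gate value lies in $(0,1)$, the block can only attenuate channels, never amplify them.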

Spatial-wise GSoP Block

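For input $X \in \mathbb{R}^{H \times W \times C'}$, a $1 \times 1$ convolution again reduces to $C$ channels, resulting in $\tilde{X} \in \mathbb{R}^{H \times W \times C}$. $\tilde{X}$ is spatially downsampled to $h \times w$, giving $\tilde{X}_d \in \mathbb{R}^{h \times w \times C}$, which is viewed as a $C \times M$ matrix ($M = hw$) with one channel $z_k \in \mathbb{R}^{M}$ per row. The spatial covariance $\Sigma_s \in \mathbb{R}^{M \times M}$ is

$\Sigma_s = \frac{1}{C} \sum_{k=1}^{C} (z_k - \bar{z})(z_k - \bar{z})^T, \quad \bar{z} = \frac{1}{C} \sum_{k=1}^{C} z_k,$

normalized row-wise and nonlinearly embedded into a vector $m \in \mathbb{R}^{M}$, reshaped to $h \times w$, and upsampled to a map $\tilde{m} \in \mathbb{R}^{H \times W}$. Feature scaling is then

$Y_{ijk} = \tilde{m}_{ij} X_{ijk}.$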
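
The spatial-wise variant can be sketched in the same style. The choices below are assumptions for illustration: average pooling for downsampling, nearest-neighbour repetition for upsampling, each downsampled position treated as a variable with the channels as samples, and per-row standardization in place of row-wise BN. The function name, the `factor` parameter and all weight shapes are hypothetical.

```python
import numpy as np

def spatial_gsop_block(x, w_reduce, w1, w2, factor=2, eps=1e-5):
    """x: (H, W, C_in) with H, W divisible by factor; w_reduce: (C_in, C).

    With M = (H // factor) * (W // factor): w1 is (M, M, K), w2 is (M, M*K).
    """
    h, w, _ = x.shape
    reduced = x @ w_reduce                       # 1x1 conv -> (H, W, C)
    c = reduced.shape[-1]
    hs, ws = h // factor, w // factor
    # Average-pool downsampling over factor x factor windows.
    small = reduced.reshape(hs, factor, ws, factor, c).mean(axis=(1, 3))
    z = small.reshape(hs * ws, c)                # M x C: positions x channels
    z = z - z.mean(axis=1, keepdims=True)        # center over channels
    cov = z @ z.T / c                            # M x M spatial covariance
    rows = (cov - cov.mean(axis=1, keepdims=True)) / np.sqrt(
        cov.var(axis=1, keepdims=True) + eps)
    hidden = np.einsum("rc,rck->rk", rows, w1)
    hidden = np.where(hidden > 0, hidden, 0.1 * hidden)  # LeakyReLU, slope 0.1
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden.reshape(-1))))  # (M,)
    gate_map = gate.reshape(hs, ws)
    # Nearest-neighbour upsampling back to H x W.
    gate_map = np.repeat(np.repeat(gate_map, factor, axis=0), factor, axis=1)
    return x * gate_map[:, :, None]              # same gate for every channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = spatial_gsop_block(
    x,
    w_reduce=0.1 * rng.standard_normal((16, 4)),
    w1=0.1 * rng.standard_normal((16, 16, 3)),
    w2=0.1 * rng.standard_normal((16, 48)),
)
```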

3. Block Placement and Network Architecture

GSoP-Net1 is built upon a ResNet-50 backbone, with GSoP channel-wise blocks inserted at the end of each residual stage. Below is the block mapping:

Stage	Output Resolution	Block Placement
conv2_x	56 $\times$ 56 $\times$ 256	After last bottleneck
conv3_x	28 $\times$ 28 $\times$ 512	After the stage
conv4_x	14 $\times$ 14 $\times$ 1024	After the stage
conv5_x	7 $\times$ 7 $\times$ 2048	After the stage

The full forward pipeline is as follows:

1. Input: $224 \times 224 \times 3$
2. conv1: $7 \times 7$, 64, stride 2 $\to$ BN $\to$ ReLU $\to$ $112 \times 112 \times 64$
3. pool1: $3 \times 3$, max, stride 2 $\to$ $56 \times 56 \times 64$
4. conv2_x: three bottleneck blocks $\to$ $56 \times 56 \times 256$
5. GSoP block $\to$ $56 \times 56 \times 256$
6. conv3_x: four bottlenecks $\to$ $28 \times 28 \times 512$
7. GSoP block $\to$ $28 \times 28 \times 512$
8. conv4_x: six bottlenecks $\to$ $14 \times 14 \times 1024$
9. GSoP block $\to$ $14 \times 14 \times 1024$
10. conv5_x: three bottlenecks $\to$ $7 \times 7 \times 2048$
11. GSoP block $\to$ $7 \times 7 \times 2048$
12. Global average pooling $\to$ $1 \times 1 \times 2048$
13. Fully connected (2048 to 1000) $\to$ softmax
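
The resolutions in this pipeline can be checked with the usual convolution output-size formula. The sketch below assumes a 224-pixel input and the padding values commonly used for ResNet-50 (3 for conv1, 1 for the pooling layer and the strided $3 \times 3$ convolutions); neither is stated in the text, and the helper names are hypothetical. GSoP blocks only rescale features, so they leave each shape unchanged.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

def trace_shapes(input_size=224):
    """Return (layer name, spatial size, channels) for each pipeline step."""
    shapes = []
    s = conv_out(input_size, kernel=7, stride=2, pad=3)
    shapes.append(("conv1", s, 64))
    s = conv_out(s, kernel=3, stride=2, pad=1)
    shapes.append(("pool1", s, 64))
    stages = [("conv2_x", 256, 1), ("conv3_x", 512, 2),
              ("conv4_x", 1024, 2), ("conv5_x", 2048, 2)]
    for name, channels, stride in stages:
        # One strided 3x3 convolution per stage sets the resolution.
        s = conv_out(s, kernel=3, stride=stride, pad=1)
        shapes.append((name, s, channels))
        shapes.append((name + " + GSoP", s, channels))  # shape-preserving
    shapes.append(("global average pooling", 1, 2048))
    return shapes

for name, size, channels in trace_shapes():
    print(f"{name}: {size} x {size} x {channels}")
```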

Each channel-wise GSoP block introduces approximately 0.7 million parameters and 30 million FLOPs. A block can optionally include a spatial-wise branch, which adds approximately 0.16 million parameters and 26 million FLOPs.

4. Implementation and Computational Considerations

GSoP blocks are implemented efficiently, with uniform channel reduction via $1 \times 1$ convolutions prior to pooling. For channel-wise blocks, the covariance matrix is computed via a single matrix multiplication of a $C \times N$ matrix with its $N \times C$ transpose. Row-wise BN leverages a BatchNorm layer treating the C dimension as channels. The two linear projections (row-convs) are implemented as grouped convolutions, with each output row corresponding to its input row. The final rescaling uses a broadcast multiply across tensors.
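
As an illustration of the single-matrix-multiplication point, the sketch below computes covariances for a whole batch at once. It assumes a channels-first $(B, C, H, W)$ layout, as is common in deep learning frameworks, and the function name is hypothetical.

```python
import numpy as np

def batched_covariance(x):
    """x: (B, C, H, W) -> (B, C, C), one matrix product per sample."""
    b, c, h, w = x.shape
    feats = x.reshape(b, c, h * w)                     # B x C x N
    feats = feats - feats.mean(axis=2, keepdims=True)  # center each channel
    # (C x N) @ (N x C) for every sample in the batch.
    return feats @ feats.transpose(0, 2, 1) / (h * w)

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 16, 7, 7))
covs = batched_covariance(batch)
```

No per-position loop is needed: the sum of outer products over the $N$ positions is exactly what the matrix product evaluates.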

No large 3D convolutions or eigen-decomposition operations are required in GSoP-Net1, leading to an overhead of approximately 10% in parameters and 60% in FLOPs compared to ResNet-50.
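
The parameter figure can be sanity-checked with simple arithmetic. The per-block count comes from the text; the ResNet-50 baseline of about 25.6 million parameters is a commonly quoted figure and is an assumption here, not a number given in this article.

```python
GSOP_PARAMS_PER_BLOCK = 0.7e6  # channel-wise block, from the text
NUM_BLOCKS = 4                 # one block after each of conv2_x .. conv5_x
RESNET50_PARAMS = 25.6e6       # assumption: commonly quoted ResNet-50 size

added_params = GSOP_PARAMS_PER_BLOCK * NUM_BLOCKS
param_overhead = added_params / RESNET50_PARAMS

print(f"added parameters: {added_params / 1e6:.1f} M")
print(f"relative overhead: {param_overhead:.1%}")
```

Under these assumptions the four blocks add about 2.8 million parameters, which is in line with the stated overhead of approximately 10%.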

5. Experimental Results and Ablation Analysis

On ImageNet-1K with standard $224 \times 224$ training and validation, GSoP-Net1 with blocks in conv2_x through conv5_x and final global average pooling achieves:

Model	Top-1 Error (%)	Top-5 Error (%)
ResNet-50 (baseline)	23.85	7.13
GSoP-Net1	22.32 (↓1.53)	6.02 (↓1.11)
SE-Net-50	23.29	6.62
CBAM	22.66	6.31
MPN-COV	22.74	6.54
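
The margins implied by the table can be computed directly. The error rates below are copied from the table; the dictionary and the helper `gap_to` are hypothetical names used only for this example.

```python
# (Top-1 error %, Top-5 error %) as reported in the table above.
errors = {
    "ResNet-50": (23.85, 7.13),
    "GSoP-Net1": (22.32, 6.02),
    "SE-Net-50": (23.29, 6.62),
    "CBAM": (22.66, 6.31),
    "MPN-COV": (22.74, 6.54),
}

def gap_to(model, reference="GSoP-Net1"):
    """Error of `model` minus error of `reference`, in percentage points."""
    return tuple(round(m - r, 2)
                 for m, r in zip(errors[model], errors[reference]))

for name in errors:
    if name != "GSoP-Net1":
        top1, top5 = gap_to(name)
        print(f"{name}: +{top1:.2f} Top-1, +{top5:.2f} Top-5 vs GSoP-Net1")
```

A positive gap means GSoP-Net1 has the lower error; the ResNet-50 row reproduces the reductions shown in parentheses in the table.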

Ablation studies with ResNet-26 on ImageNet indicate that second-order blocks in intermediate layers reduce error incrementally:

Single channel-wise GSoP at conv2_x: Top-1 18.45%
Single at conv5_x: Top-1 18.33%
All four stages: Top-1 17.42% (baseline 19.18%)

This suggests that incorporating second-order statistics at multiple stages, rather than exclusively at the tail of the network, provides cumulative benefit and outperforms both first-order in-network pooling (SE/CBAM) and classical end-only second-order schemes (MPN-COV).

6. Comparison with Related Methods

GSoP-Net1 is distinct from SE-Net-50 and CBAM, both of which are based on first-order channel or spatial attention mechanisms, and from MPN-COV, which applies second-order pooling only at the network’s output. While SE-Net-50 achieves Top-1/Top-5 errors of 23.29%/6.62% and CBAM obtains 22.66%/6.31%, GSoP-Net1 outperforms these with 22.32%/6.02%. Compared to MPN-COV (Top-1/Top-5 22.74%/6.54%), GSoP-Net1’s strategy of placing GSoP blocks at multiple depths yields lower error on both metrics (Gao et al., 2018).

7. Significance and Implications

The methodology of GSoP-Net1 demonstrates that the systematic distribution of global second-order pooling blocks throughout a ConvNet backbone yields substantial gains in non-linear representational capacity and overall recognition performance. By leveraging holistic image statistics beyond first-order pooling and attention mechanisms, and without the need for computationally expensive eigen-decompositions or square-root normalization at every stage, GSoP-Net1 sets a precedent for higher-order representation learning at all layers of deep convolutional architectures. This framework provides a foundation for future advances in hierarchical network design utilizing higher-order pooling mechanisms (Gao et al., 2018).


References (1)

Gao et al. (2018). Global Second-order Pooling Convolutional Networks.
