Global Entropy Pooling in Deep Learning

Updated 31 May 2026

Global Entropy Pooling (GEP) is a pooling method that aggregates information across feature maps, graph nodes, or data distributions by directly measuring the entropy of activations.
GEP overcomes limitations of traditional pooling methods like GAP and GMP by capturing informational uncertainty and activation spread, thus improving robustness in vision, graph, and federated learning models.
Implemented via techniques such as optimal transport, random-walk graph kernels, and divergence minimization, GEP provides a tractable, end-to-end trainable layer that enhances performance and invariance.

Global Entropy Pooling (GEP) is a global aggregation operator that fuses information across feature maps, graph nodes, or data distributions by balancing statistical central tendency with entropy-based uncertainty, providing robust, information-aware summaries for deep learning. GEP generalizes classic pooling methods by directly measuring informational disorder, often via Shannon entropy, and has proven especially effective in vision, graph, and decentralized learning models. Depending on the implementation, GEP is derived from optimal transport principles, random-walk graph kernels, or constrained divergence minimization, in each case yielding a tractable, end-to-end trainable layer (Wu et al., 2022; Kong et al., 2023; Feng et al., 2024; Xu et al., 2022).

1. Theoretical Foundations

Global Entropy Pooling relies on quantifying the disorder or spread of activation patterns, with the prototypical realization being the Shannon entropy of normalized features. Its design seeks to overcome the limitations of pooling operators that consider only extreme or mean values, which may be insensitive to the distributional characteristics of activations.

In matrix-based pooling frameworks, such as Regularized Optimal Transport Pooling (ROTP), GEP can be posed as an entropic regularization problem. Given a feature matrix $X \in \mathbb{R}^{D \times N}$, GEP seeks a row-wise probability assignment $p_{dn}$ that maximizes the expected feature value while penalizing low-entropy (overconfident) assignments, under a row-wise marginal constraint:

$\min_{P_{d\star} \in \Delta^{N-1}} \sum_{d=1}^D \left[-\sum_{n=1}^N p_{dn} x_{dn} + \varepsilon \sum_{n=1}^N p_{dn} \log p_{dn}\right],\quad \sum_n p_{dn} = 1$

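Setting the derivative of each row's Lagrangian with respect to $p_{dn}$ to zero gives $p_{dn} \propto \exp(x_{dn}/\varepsilon)$, so the solution is a row-wise softmax, $p_{dn} = \exp(x_{dn}/\varepsilon) / \sum_{m=1}^N \exp(x_{dm}/\varepsilon)$, and the pooled feature is $f_d = \sum_{n=1}^N p_{dn} x_{dn}$ (Xu et al., 2022).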
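The closed-form solution can be sketched in a few lines. The following is a minimal plain-Python illustration, not the ROTP implementation; the function names `softmax` and `entropic_pool` and the toy matrix are assumptions made for this example, and a tensor library would normally replace the nested lists.

```python
import math

def softmax(values, eps):
    """Softmax of values / eps, computed stably by subtracting the maximum."""
    scaled = [v / eps for v in values]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropic_pool(X, eps=1.0):
    """Pool each row of a D x N matrix (list of lists) into one value.

    Row d gets weights p_dn = softmax_n(x_dn / eps), and the pooled
    feature is f_d = sum_n p_dn * x_dn.
    """
    pooled = []
    for row in X:
        p = softmax(row, eps)
        pooled.append(sum(p_n * x_n for p_n, x_n in zip(p, row)))
    return pooled

# Hypothetical toy input: D = 2 feature rows, N = 3 instances.
X = [[0.0, 1.0, 4.0],
     [2.0, 2.0, 2.0]]
print(entropic_pool(X, eps=1.0))
```

Here `eps` plays the role of $\varepsilon$: it controls how sharply the weights concentrate on the largest entries of each row.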

In graph domains, GEP scores nodes by an entropic centrality derived from the diagonal of the maximal entropy transition (MET) matrix, which arises from maximal-entropy random walks:

$M_{ii} = \frac{1}{Z_i} \sum_{l=0}^L e^{-E(l)/T} (A^l)_{ii}$

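Here $A$ is the adjacency matrix, so $(A^l)_{ii}$ counts the closed walks of length $l$ that start and end at node $i$; $L$ is the maximum walk length, $e^{-E(l)/T}$ is a Boltzmann-style weight with energy $E(l)$ and temperature $T$, and $Z_i$ is a normalization constant. The diagonal entry thus measures the "information-richness" of a node as a weighted sum over closed walks of different lengths (Kong et al., 2023).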
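A small sketch may make the formula concrete. It rests on two assumptions that are not stated in the text above: the energy is taken as $E(l) = l$, and $Z_i$ is taken as the $i$-th row sum of the weighted walk matrix $\sum_l e^{-E(l)/T} A^l$. The names `matmul` and `met_diagonal` are illustrative only.

```python
import math

def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def met_diagonal(A, L=3, T=1.0, energy=lambda l: float(l)):
    """Return M_ii for every node i of the graph with adjacency matrix A.

    Assumptions (not from the source): E(l) = l by default, and Z_i is the
    i-th row sum of the weighted walk matrix sum_l exp(-E(l)/T) * A^l.
    """
    n = len(A)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0
    S = [[0.0] * n for _ in range(n)]
    for l in range(L + 1):
        w = math.exp(-energy(l) / T)
        for i in range(n):
            for j in range(n):
                S[i][j] += w * power[i][j]
        power = matmul(power, A)  # advance to A^(l+1)
    return [S[i][i] / sum(S[i]) for i in range(n)]

# Path graph on three nodes: 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
scores = met_diagonal(path, L=2, T=1.0)
```

In a pooling layer these scores would be one input to node ranking; dense matrix powers are used here only for readability and would be replaced by sparse operations on larger graphs.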

In decentralized learning settings, GEP synthesizes client (or node) distributions under information-theoretic constraints. Clients fit local distributions (e.g., GMMs), then a global distribution $p^*$ is obtained by minimizing the Kullback-Leibler divergence to a prior $p_0$ while matching pooled statistical moments, using convex optimization with dual variables (Feng et al., 2024).
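The divergence-minimization step can be illustrated on a much simpler problem than the GMM setting described above. The sketch below assumes a discrete prior over a few atoms and a single scalar moment constraint; under those assumptions the minimizer is an exponential tilt of the prior, and the one dual variable can be found by bisection. The names `tilt` and `entropy_pool`, the bisection bounds, and the toy data are hypothetical and not taken from Feng et al.

```python
import math

def tilt(p0, g, lam):
    """Exponentially tilted distribution, p_k proportional to p0_k * exp(lam * g_k).

    Requires every prior weight p0_k to be strictly positive.
    """
    logits = [math.log(p) + lam * gk for p, gk in zip(p0, g)]
    top = max(logits)
    w = [math.exp(v - top) for v in logits]
    total = sum(w)
    return [x / total for x in w]

def entropy_pool(p0, g, target, lo=-50.0, hi=50.0, iters=200):
    """Minimize KL(p || p0) subject to sum_k p_k * g_k = target.

    The minimizer has the form p0_k * exp(lam * g_k) / Z, and the tilted mean
    increases with the dual variable lam, so bisection on lam suffices.
    The target must lie strictly between min(g) and max(g).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(p * gk for p, gk in zip(tilt(p0, g, mid), g))
        if mean < target:
            lo = mid
        else:
            hi = mid
    return tilt(p0, g, 0.5 * (lo + hi))

p0 = [0.25, 0.25, 0.25, 0.25]   # prior over four atoms
g = [0.0, 1.0, 2.0, 3.0]        # moment function evaluated at each atom
p_star = entropy_pool(p0, g, target=2.0)
```

With several moment constraints the same structure holds, but the dual problem becomes multivariate and is usually handed to a general convex solver.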

2. Motivation and Comparative Analysis

Traditional pooling schemes—Global Average Pooling (GAP) and Global Max Pooling (GMP)—focus on mean and extreme value aggregation, respectively. GAP is sensitive to large background regions with mild activations, leading to information dilution, while GMP is sensitive to noise and can ignore distributed informative patterns. Neither captures activation variance or the uncertainty in feature representation.

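GEP addresses these shortcomings by directly measuring the entropy (spread) of activations. Under the softmax-based definition used for feature maps, a nearly uniform map, such as a flat background, has entropy close to its maximum, whereas a map dominated by an isolated outlier has entropy close to zero because the softmax assignment is peaky. The entropy channel therefore separates flat background responses from outlier-driven ones, which neither the mean nor the maximum can do alone, and this signal is used to suppress background and limit the dominance of outliers. In visual attention modules (e.g., CAT), adding GEP to the GAP and GMP channels yields more robust, noise-suppressive, and discriminative attention signals (Wu et al., 2022).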

In GNNs, entropy-based pooling captures structural centrality and informativeness beyond node feature magnitude, preserving subgraph diversity and reducing variance in downstream tasks (Kong et al., 2023).

3. Mathematical Formulations and Implementation

Global Entropy Pooling can be implemented in several modes depending on context:

Feature Maps (CNNs): For input $F \in \mathbb{R}^{H \times W \times C}$, channel-wise GEP applies a softmax over the spatial positions $i$ of each channel $c$ and takes the entropy of the result:

$p^c_i = \mathrm{softmax}_i(F_{i, c}), \quad C'_{\text{Ent}}[c] = -\sum_i p^c_i \log p^c_i$

The entropy values are then normalized. An analogous computation is performed spatially (Wu et al., 2022).

Optimal Transport Pooling: For $X \in \mathbb{R}^{D \times N}$, solve

$\min_{P_{d\star} \in \Delta^{N-1}} \left[-\sum_{n=1}^N p_{dn} x_{dn} + \varepsilon \sum_{n=1}^N p_{dn} \log p_{dn}\right]$

along each row, then aggregate $f_d = \sum_{n=1}^N p_{dn} x_{dn}$, with Sinkhorn iterations as an optional accelerated solver for more general constraints (Xu et al., 2022).

Graph Pooling: For graph adjacency $A$, entropy-based centrality is extracted from the diagonal of MET, then combined with feature-based scoring via learnable weighting before top-$k$ node selection (Kong et al., 2023).

Federated Learning: Each client fits a local distribution, the local summaries are aggregated into pooled moments, and the global distribution is obtained by solving

$p^* = \arg\min_{p} \mathrm{KL}(p \,\|\, p_0) \quad \text{s.t.} \quad \mathbb{E}_{p}[g(x)] = \bar{m}$

writing $g$ for the moment functions and $\bar{m}$ for the pooled moments; the resulting $p^*$ is used to compute aggregation weights (Feng et al., 2024).
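Of the modes listed above, the feature-map variant is the simplest to sketch. The following plain-Python example computes one entropy value per channel from a softmax over spatial positions; the function name `channel_entropy`, the nested-list layout `F[h][w][c]`, and the toy input are assumptions for illustration, and the subsequent normalization step is omitted.

```python
import math

def channel_entropy(F):
    """Entropy descriptor per channel for a feature map F[h][w][c].

    For each channel c, a softmax over all H*W positions gives p^c_i,
    and the descriptor is -sum_i p^c_i * log(p^c_i).
    """
    H, W, C = len(F), len(F[0]), len(F[0][0])
    out = []
    for c in range(C):
        acts = [F[h][w][c] for h in range(H) for w in range(W)]
        top = max(acts)
        log_z = top + math.log(sum(math.exp(a - top) for a in acts))
        log_p = [a - log_z for a in acts]  # log-softmax, numerically stable
        out.append(-sum(math.exp(lp) * lp for lp in log_p))
    return out

# 2 x 2 map with two channels: channel 0 is flat, channel 1 has a single spike.
F = [[[1.0, 0.0], [1.0, 0.0]],
     [[1.0, 0.0], [1.0, 50.0]]]
ent = channel_entropy(F)
```

On this input the flat channel yields an entropy of $\log 4$, the maximum for four positions, while the spiked channel yields a value very close to zero.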

4. Integration into Network Architectures

Global Entropy Pooling is a plug-and-play replacement or augmentation for pooling operations in deep architectures:

Domain	GEP Integration Example	Reference
Vision (CNN)	Integrated alongside GAP and GMP in channel/spatial attention branches; fused via learnable weights	(Wu et al., 2022)
Graph Learning	Inserted as PANPool after message aggregation, feeding subgraph centrality into pooling scores	(Kong et al., 2023)
Set/MIL Models	Used as an ROTP implicit layer substituting for mean/max pooling	(Xu et al., 2022)
Federated Learning	Determines aggregation weights for neighbor models by entropy pooling of client distributions	(Feng et al., 2024)

In all of these settings, GEP makes the aggregation sensitive to how informative the feature or activation distribution is, rather than only to its mean or extreme values. When its parameters are learnable, or when it is combined with other pooling types, it can be adapted to the invariance and robustness requirements of the task.
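As a rough illustration of the vision row in the table, the sketch below fuses GAP, GMP, and GEP descriptors of each channel with three scalar weights and turns the result into a gate. This is a deliberate simplification and not the CAT architecture: it omits any shared MLP, treats the weights as fixed numbers rather than trained parameters, and all function names are hypothetical.

```python
import math

def gap(acts):
    """Global average pooling over one channel's flattened activations."""
    return sum(acts) / len(acts)

def gmp(acts):
    """Global max pooling over one channel's flattened activations."""
    return max(acts)

def gep(acts):
    """Entropy of the softmax over one channel's flattened activations."""
    top = max(acts)
    log_z = top + math.log(sum(math.exp(a - top) for a in acts))
    return -sum(math.exp(a - log_z) * (a - log_z) for a in acts)

def fused_channel_attention(channels, weights=(1.0, 1.0, 1.0)):
    """Gate each channel with sigmoid(w_avg*GAP + w_max*GMP + w_ent*GEP).

    `channels` is a list of flattened activation lists, one per channel.
    `weights` stands in for parameters that would be learned in training.
    """
    w_avg, w_max, w_ent = weights
    gates = []
    for acts in channels:
        score = w_avg * gap(acts) + w_max * gmp(acts) + w_ent * gep(acts)
        gates.append(1.0 / (1.0 + math.exp(-score)))
    return gates

channels = [[0.1, 0.1, 0.1, 0.1],   # flat response
            [0.0, 0.0, 0.0, 2.0]]   # one strong response
gates = fused_channel_attention(channels, weights=(0.5, 0.5, -0.2))
```

The sign and size of the entropy weight decide whether spread-out responses are amplified or damped, which is the kind of calibration a learnable fusion is meant to find.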

5. Empirical Evidence and Applications

Experimental ablations in computer vision show that GEP consistently improves performance: removing GEP from attention modules (CAT) induces significant drops in average precision, and full inclusion reaches state-of-the-art for object detection, segmentation, and classification benchmarks such as MS COCO, Pascal-VOC, and ImageNet (Wu et al., 2022). Attention maps produced via GEP are characterized by suppressed backgrounds and enhanced localization of object boundaries and textures.

In graph neural networks, GEP-equipped PANPool modules are parameter-efficient and reach accuracy competitive with or superior to classic GCN or GIN, with reduced run-to-run variance; this is attributed to the systematic preservation of informative subgraph structure (Kong et al., 2023).

In federated settings, GEP pooling of model distributions enables privacy-preserving, communication-efficient, and convergence-accelerated decentralized federated learning under non-IID data, outperforming other state-of-the-art aggregation protocols on diverse benchmarks (Feng et al., 2024).

6. Computational and Theoretical Properties

GEP and closely related ROT-based layers admit tractable, differentiable implementations. For feature maps and sets with $X \in \mathbb{R}^{D \times N}$, the closed-form softmax costs $O(DN)$ per forward step, and unrolled Sinkhorn-Knopp iterations cost $O(DN)$ per iteration; graph kernels for GEP pooling incur an additional per-layer cost for tracing adjacency powers, which grows with the maximum walk length $L$, generally kept small (Xu et al., 2022, Kong et al., 2023).

Theoretical analysis affirms permutation-invariance (with appropriate marginal constraints), convergence guarantees when using proximal-point/Sinkhorn unrolling, and numerical stability when log-stabilization is applied. In federated setups, empirical and optimization-theoretic analysis both confirm that GEP-based aggregation yields optimally smooth and strongly convex global objectives under mild standard assumptions (Feng et al., 2024).
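To show what log-stabilization buys, the sketch below runs Sinkhorn iterations entirely on log-domain potentials for an entropic transport problem with both row and column marginals. It is a generic textbook-style routine written for this example, not the solver used in the cited work; the function names, the marginals, and the toy matrix are assumptions. The entries of `X` are chosen so that a naive `exp(X / eps)` would overflow in double precision.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def sinkhorn_log(X, a, b, eps=1.0, iters=200):
    """Plan P maximizing <P, X> + eps * H(P) with row sums a and column sums b.

    Works on log-domain potentials f (rows) and g (columns), so that
    exp(X / eps) is never formed explicitly. Marginals must be positive.
    """
    D, N = len(X), len(X[0])
    f = [0.0] * D
    g = [0.0] * N
    for _ in range(iters):
        f = [math.log(a[d]) - logsumexp([X[d][n] / eps + g[n] for n in range(N)])
             for d in range(D)]
        g = [math.log(b[n]) - logsumexp([X[d][n] / eps + f[d] for d in range(D)])
             for n in range(N)]
    return [[math.exp(f[d] + X[d][n] / eps + g[n]) for n in range(N)]
            for d in range(D)]

X = [[1000.0, 1001.0, 1002.0],
     [1002.0, 1000.5, 1000.0]]
a = [0.5, 0.5]                # row marginals
b = [1 / 3, 1 / 3, 1 / 3]     # column marginals
P = sinkhorn_log(X, a, b)
```

When only the row constraint is kept, the loop is unnecessary and the plan reduces to a row-wise softmax; the iterations matter once additional marginal constraints are imposed.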

7. Relation to the Broader Pooling Landscape

Global Entropy Pooling arises as a natural specialization within the generalized Regularized Optimal Transport Pooling family. It can recover mean/max pooling in certain parameter regimes and further generalizes to sophisticated pooling paradigms (e.g., Gromov-Wasserstein, marginal-constrained, and attention-based pooling) (Xu et al., 2022).
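For the softmax special case, with weights $p_{dn} = \mathrm{softmax}_n(x_{dn}/\varepsilon)$ and pooled feature $f_d = \sum_{n=1}^N p_{dn} x_{dn}$, the two classic operators appear as limits of the regularization strength. This is offered as a worked illustration of one such regime; the cited work may recover mean and max pooling through other parameter settings of the ROTP family.

$\lim_{\varepsilon \to \infty} f_d = \frac{1}{N} \sum_{n=1}^N x_{dn}, \qquad \lim_{\varepsilon \to 0^+} f_d = \max_n x_{dn}$

As $\varepsilon \to \infty$ every weight tends to $1/N$, which gives the mean; as $\varepsilon \to 0^+$ the weights concentrate on the largest entries of the row, which gives the maximum.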

GEP's flexibility is reinforced by its compatibility with end-to-end gradient descent, its ability to inject learnable calibrations (e.g., colla-factors in CAT, β in PANPool), and robust empirical behavior across parameter choices and tasks. This positions GEP as a critical primitive in modern robust representation learning for images, graphs, and decentralized systems.