
CGAP: Hierarchical Pooling

Updated 22 April 2026
  • Hierarchical Pooling (CGAP) is a framework that creates multi-scale representations for graphs and sequences using tree-structured, attention-driven coarsening.
  • It employs learnable assignment matrices to aggregate local features into coarser summaries, effectively fusing spatial or temporal attributes.
  • Empirical results in urban region modeling and action recognition show that CGAP outperforms flat pooling methods by enhancing prediction accuracy and contextual integration.

Hierarchical pooling encompasses a broad family of methods for producing multi-scale, compressed representations of complex inputs such as graphs or temporal sequences. CGAP (Coarsened Graph Attention Pooling) denotes both a particular hierarchical graph pooling framework for urban region representation learning (Xu et al., 2024) and a family of coarse-to-fine, tree-structured pooling strategies for temporal or graph data. These methods construct a representation hierarchy, systematically aggregating local features into coarser structural summaries through learnable or data-driven pooling operations. CGAP variants have been developed for both graph-based urban modeling and temporal action recognition, each leveraging hierarchical pooling to address the limitations of flat or homogeneous aggregation.

1. Core Principles of Hierarchical Pooling (CGAP)

CGAP frameworks are distinguished by their explicit construction of a pooling hierarchy, typically realized as a tree-structured or multi-level process. In the graph context, CGAP operates on an initial graph $\mathcal{G} = (V, E)$, where nodes represent spatial units (e.g., urban regions) and edges encode adjacency or relational structure. At each hierarchical layer $\ell$, a pooling operation groups nodes into $n_{\ell+1} < n_\ell$ clusters, producing a coarsened graph with reduced granularity. This assignment is commonly governed by an attention-based mechanism: a learnable assignment matrix $S^\ell \in \mathbb{R}^{n_\ell \times n_{\ell+1}}$ encodes the strength with which each node is assigned to each cluster at the next level, computed via attention over the local receptive field and cluster prototypes.

This hierarchical process continues until a single "global" node summarizes the entire structure. The pooled features and coarsened adjacencies at each level are given by $X^{\ell+1} = (S^\ell)^\top Z^\ell$ and $A^{\ell+1} = (S^\ell)^\top A^\ell S^\ell$, enabling progressive abstraction and information propagation across scales (Xu et al., 2024).
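As a concrete illustration, the two pooling equations above can be sketched in plain Python. This is a minimal sketch with toy data; real implementations use tensor libraries and learn the assignment matrix $S$ end-to-end:

```python
# One CGAP-style coarsening step, assuming a soft assignment matrix S has
# already been produced by the attention unit. Names follow the text:
# Z (node features), A (adjacency), S (node-to-cluster assignment).

def matmul(P, Q):
    """Naive matrix product of nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def coarsen(S, Z, A):
    """Pool one level: X = S^T Z (cluster features), A' = S^T A S (coarsened adjacency)."""
    St = transpose(S)
    X = matmul(St, Z)                   # (m x d) pooled cluster features
    A_next = matmul(matmul(St, A), S)   # (m x m) coarsened adjacency
    return X, A_next

# Toy example: a 4-node path graph softly assigned to 2 clusters.
S = [[1.0, 0.0], [0.8, 0.2], [0.2, 0.8], [0.0, 1.0]]
Z = [[1.0], [2.0], [3.0], [4.0]]
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
X, A2 = coarsen(S, Z, A)
```

Note that the coarsened adjacency stays symmetric when the input adjacency is, which is what allows the same step to be applied recursively at the next level.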

2. Algorithmic Structure and Attention-Based Coarsening

At the heart of CGAP is the local attention unit, which implements node coarsening with learnable query and prototype vectors. For each pooling layer:

  • The attention score $e_{i,p} = \mathrm{LeakyReLU}\big(\widehat{w}^\top [W h_i^\ell \,\|\, W c_p^\ell]\big)$ measures the affinity between node $i$ and cluster prototype $p$; a softmax over prototypes then yields the assignment probabilities $S^\ell_{i,p}$.
  • Pooled features and adjacencies are formed by multiplying the transposed assignment matrix $(S^\ell)^\top$ with the features and adjacency matrices of the current level.
  • The pooling pipeline is typically implemented as a K-layer GNN feature extractor, followed by a cascade of attention-pooling layers and a global attention layer for feature fusion.
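The scoring-and-softmax step of the local attention unit can be sketched as follows. The weight matrix `W`, attention vector `w`, and the prototypes are toy stand-ins for learned parameters:

```python
import math

# Sketch of the local attention unit: e_{i,p} = LeakyReLU(w^T [W h_i || W c_p]),
# followed by a softmax over prototypes to get one row of the assignment matrix.

def leaky_relu(x, slope=0.2):
    return x if x >= 0 else slope * x

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def assignment_row(h_i, prototypes, W, w):
    """Soft assignment of one node over all cluster prototypes."""
    scores = []
    for c_p in prototypes:
        Wh = [sum(W[r][k] * h_i[k] for k in range(len(h_i))) for r in range(len(W))]
        Wc = [sum(W[r][k] * c_p[k] for k in range(len(c_p))) for r in range(len(W))]
        concat = Wh + Wc             # [W h_i || W c_p]
        scores.append(leaky_relu(sum(wk * xk for wk, xk in zip(w, concat))))
    return softmax(scores)           # row i of S^l, sums to 1

# Toy: 2-d features, 2 prototypes, identity projection.
W = [[1.0, 0.0], [0.0, 1.0]]
w = [0.6, 0.2, 0.6, 0.2]
S_row = assignment_row([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], W, w)
```

In this toy setup the node aligns more strongly with the first prototype, so the first assignment probability dominates while the row still sums to one.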

The global attention layer introduces a mechanism whereby the single global-node embedding is broadcast back to all original node positions and fused with the local embeddings via scaled dot-product attention, ensuring that city-wide or sequence-wide context is accessible to all local representations.
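A minimal sketch of this broadcast-and-fuse step, under the assumption that each local embedding attends over the pair consisting of itself and the global embedding; the exact query/key/value wiring in the paper may differ, and learned projections are omitted:

```python
import math

# Fuse a local node embedding h_i with the broadcast global-node embedding g
# via scaled dot-product attention over the two candidates {h_i, g}.

def fuse_with_global(h_i, g):
    d = len(h_i)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Query = h_i; keys/values = [h_i, g]; scores scaled by sqrt(d).
    scores = [dot(h_i, h_i) / math.sqrt(d), dot(h_i, g) / math.sqrt(d)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    w = [x / sum(w) for x in w]      # attention weights, sum to 1
    return [w[0] * h_i[k] + w[1] * g[k] for k in range(d)]

# Toy: orthogonal one-hot local and global embeddings.
fused = fuse_with_global([1.0, 0.0], [0.0, 1.0])
```

Because the query matches itself more strongly than the global vector here, the fused embedding leans toward the local component while still mixing in global context.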

3. CGAP in Urban Region Representation Learning

In the "CGAP: Urban Region Representation Learning with Coarsened Graph Attention Pooling" framework (Xu et al., 2024), these principles are instantiated for the modeling of city regions:

  • Initial node features encode region-level attributes such as POI (Points of Interest) counts and inter-regional mobility flows.
  • The hierarchical pooling process enables aggregation of fine-scale and coarse-scale urban information, countering over-smoothing and locality limitations of standard GNNs.
  • Downstream tasks include mobility prediction, POI similarity retrieval, and land-use classification.

CGAP optimizes a composite loss encompassing region-embedding consistency, cross-entropy for mobility transition prediction, and POI similarity regression. The total loss can be written as a weighted sum of the form

$\mathcal{L} = \mathcal{L}_{\text{con}} + \lambda_1 \mathcal{L}_{\text{mob}} + \lambda_2 \mathcal{L}_{\text{poi}},$

where each term regularizes a different aspect of the learned representation.
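A hedged sketch of such a composite objective follows. The term names (a consistency penalty, mobility cross-entropy, POI-similarity MSE) and the simple weighted-sum combination are assumptions for illustration, not the paper's exact formulation:

```python
import math

# Composite multi-task loss in the spirit of CGAP's objective: a weighted sum
# of a consistency term, a mobility cross-entropy term, and a POI MSE term.

def cross_entropy(pred, target_idx, eps=1e-12):
    """Negative log-likelihood of the target class given a probability vector."""
    return -math.log(max(pred[target_idx], eps))

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def composite_loss(consistency, mobility_pred, mobility_target,
                   poi_pred, poi_target, lam1=1.0, lam2=1.0):
    """L = L_con + lam1 * L_mob + lam2 * L_poi."""
    return (consistency
            + lam1 * cross_entropy(mobility_pred, mobility_target)
            + lam2 * mse(poi_pred, poi_target))

# Toy example: consistency term 0.5, correct mobility class predicted with
# probability 0.75, and a perfect POI similarity fit (MSE 0).
loss = composite_loss(0.5, [0.25, 0.75], 1, [1.0, 2.0], [1.0, 2.0])
```

The weights `lam1` and `lam2` are the usual multi-task balancing knobs; in practice they are tuned per dataset.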

Empirical results on Manhattan urban data show that CGAP outperforms advanced baselines such as ASAP, DiffPool, SAGPool, and MGFN, achieving lower mean absolute error (MAE) on both crime prediction and check-in prediction. Land-use classification further confirms the value of hierarchical pooling, with CGAP exceeding baselines on NMI and ARI metrics (Xu et al., 2024).

4. Temporal Hierarchical Pooling for Action Recognition

A related variant, also denoted CGAP in the action recognition context (Mazari et al., 2020), implements a temporal tree-structured pooling hierarchy. Here, input sequences (e.g., frame-wise video features) are recursively split into segments across successive hierarchy levels, with:

  • Each level containing a progressively larger number of equal-length temporal segments, each pooled via average aggregation.
  • Global representations formed either by concatenating or averaging all segment descriptors, with learnable non-negative weights determining each node's contribution:
    • Concatenation yields an alignment-sensitive representation.
    • Averaging grants resilience to temporal misalignments.

The weights are learned via constrained minimization over the probability simplex, optimizing loss functions such as SVM or contrastive objectives under the constraint that the weights are non-negative and sum to one. This mechanism enables adaptive attention over different levels of temporal granularity. Experiments on UCF-101 demonstrate steady performance improvements as hierarchy depth increases, with the best averaging variant outperforming global pooling and single-granularity variants (Mazari et al., 2020).
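The averaging variant can be sketched as below. Two assumptions are made for illustration: each level l splits the sequence into 2^l equal segments, and the simplex-constrained level weights are parameterized via a softmax over free parameters rather than learned by constrained optimization directly:

```python
import math

# Coarse-to-fine temporal pooling (averaging variant): average-pool the
# sequence at several granularities, then combine levels with weights that
# live on the probability simplex.

def segment_average(seq, n_segments):
    """Average-pool a 1-d feature sequence into n equal-length segments."""
    size = len(seq) // n_segments
    return [sum(seq[i * size:(i + 1) * size]) / size for i in range(n_segments)]

def pyramid_descriptor(seq, depth, theta):
    """Weighted average of per-level descriptors; weights sum to one."""
    m = max(theta)
    w = [math.exp(t - m) for t in theta]
    w = [x / sum(w) for x in w]                    # softmax: simplex constraint
    levels = [segment_average(seq, 2 ** l) for l in range(depth)]
    # Averaging variant: mean over segments at each level, then weight levels.
    return sum(wl * (sum(lv) / len(lv)) for wl, lv in zip(w, levels))

# Toy: 8-frame feature sequence, 2 levels, uniform level weights.
desc = pyramid_descriptor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2, [0.0, 0.0])
```

Averaging over segments makes the descriptor insensitive to how segments align with the action, which is exactly the misalignment resilience the text attributes to this variant; the concatenation variant would instead keep the per-segment descriptors in order.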

5. Comparison to Alternative Hierarchical Graph Pooling Paradigms

Several alternative approaches provide further context for the design space of hierarchical pooling:

  • Minimum Description Length Pooling (MDL-Pool/MapEqPool): This method (Pichowski et al., 2024) optimizes hierarchical cluster assignments by minimizing the total coding cost (in bits) of random-walk trajectories under multi-level partitions, integrating the multilevel map equation as a differentiable loss. It adaptively determines both the number of clusters and necessary hierarchy depth, automatically pruning trivial or redundant levels and capturing multi-scale community structure in a single joint objective.
  • Graph Parsing Networks (GPN): Instead of fixed-ratio or soft assignments, GPN (Song et al., 2024) utilizes a bottom-up grammar-inspired parsing procedure, constructing hard, sparse, and personalized pooling hierarchies for each individual graph via discrete assignments derived from syntactic "edge scoring." This enhances memory efficiency and retains node information across deep coarsening, outperforming or matching all prior hierarchical pooling methods on classical graph benchmarks.

The table below contrasts key design elements of leading hierarchical pooling strategies:

Method      Assignment Type            Hierarchy Adaptivity
CGAP        Soft / local attention     Fixed depth, user-set
MDL-Pool    Soft (via map equation)    Data-driven / adaptive
GPN         Hard / discrete parse      Data-driven / adaptive

6. Properties, Limitations, and Future Directions

CGAP and related hierarchical pooling methods demonstrate robust information aggregation across multiple scales, with theoretical and empirical advantages:

  • Multi-modal data propagation: CGAP effectively integrates both structural (e.g., adjacency) and attribute (e.g., POI, mobility) information in urban graphs, with multi-task objectives ensuring that embeddings reflect both spatial structure and semantic relations (Xu et al., 2024).
  • Invariance and resolution trade-off: Hierarchical pooling in action recognition systematically navigates the balance between global context and fine temporal detail, with the distribution of learned level weights revealing the relative importance of different granularities per action class (Mazari et al., 2020).
  • Alignment resilience and video-length agnosticism: The averaging variant of temporal CGAP avoids dependence on precise segment alignment and accommodates variable input length.

Limitations center on the predefined tree structure and uniform layer design; dynamic segmentation or a learnable hierarchy structure remains an open extension. Efficient optimization for very large inputs can be challenging, particularly for EM-style weight learning in temporal CGAP, although end-to-end deep contrastive variants provide some relief. Extensions to richer pooling operations, 3D spatio-temporal hierarchies, and finer adaptivity in both graph and sequence domains represent promising future research directions (Xu et al., 2024, Mazari et al., 2020).

7. Empirical Evidence and Performance Assessment

Experimentation across both graph and temporal domains substantiates the effectiveness of hierarchical pooling:

  • In urban graph learning, CGAP achieves state-of-the-art results on crime, mobility, and land-use prediction; ablation studies confirm the necessity of both local attention coarsening and global feature fusion, as well as the integration of multi-modal region attributes (Xu et al., 2024).
  • In temporal action recognition, coarse-to-fine CGAP outperforms both shallow and single-level pooling, with learned attention emphasizing task-relevant granularity. Removing intermediate levels degrades performance, affirming the complementary value of multi-scale representations (Mazari et al., 2020).

A plausible implication is that explicit hierarchical pooling, with careful integration of attention mechanisms and task-level supervision, is a crucial enabler for robust, context-enriched, and granular representation learning in structured domains.

