
CGAP: Hierarchical Pooling

Updated 22 April 2026
  • Hierarchical Pooling (CGAP) is a framework that creates multi-scale representations for graphs and sequences using tree-structured, attention-driven coarsening.
  • It employs learnable assignment matrices to aggregate local features into coarser summaries, effectively fusing spatial or temporal attributes.
  • Empirical results in urban region modeling and action recognition show that CGAP outperforms flat pooling methods by enhancing prediction accuracy and contextual integration.

Hierarchical pooling encompasses a broad family of methods for producing multi-scale, compressed representations of complex inputs such as graphs or temporal sequences. CGAP (Coarsened Graph Attention Pooling) denotes both a particular hierarchical graph pooling framework for urban region representation learning (Xu et al., 2024) and a family of coarse-to-fine, tree-structured pooling strategies for temporal or graph data. These methods construct a representation hierarchy, systematically aggregating local features into coarser structural summaries through learnable or data-driven pooling operations. CGAP variants have been developed for both graph-based urban modeling and temporal action recognition, each leveraging hierarchical pooling to address the limitations of flat or homogeneous aggregation.

1. Core Principles of Hierarchical Pooling (CGAP)

CGAP frameworks are distinguished by their explicit construction of a pooling hierarchy, typically realized as a tree-structured or multi-level process. In the graph context, CGAP operates on an initial graph $\mathcal{G} = (V, E)$, where nodes represent spatial units (e.g., urban regions) and edges encode adjacency or relational structure. At each hierarchical layer $\ell$, a pooling operation groups nodes into $n_{\ell+1} < n_\ell$ clusters, producing a coarsened graph with reduced granularity. This assignment is commonly governed by an attention-based mechanism: a learnable assignment matrix $S^\ell \in \mathbb{R}^{n_\ell \times n_{\ell+1}}$ encodes the strength with which each node is assigned to each cluster at the next level, computed via attention over the local receptive field and cluster prototypes.

This hierarchical process continues until a single "global" node summarizes the entire structure. The pooled features and coarsened adjacencies at each level are given by $X^{\ell+1} = (S^\ell)^\top Z^\ell$ and $A^{\ell+1} = (S^\ell)^\top A^\ell S^\ell$, enabling progressive abstraction and information propagation across scales (Xu et al., 2024).
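As a concrete illustration, the two pooling equations above can be sketched in plain Python. This is a minimal sketch with toy data; real implementations use tensor libraries and learn the assignment matrix $S$ end-to-end:

```python
# One CGAP-style coarsening step, assuming a soft assignment matrix S has
# already been produced by the attention unit. Names follow the text:
# Z (node features), A (adjacency), S (node-to-cluster assignment).

def matmul(P, Q):
    """Naive matrix product of nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def coarsen(S, Z, A):
    """Pool one level: X = S^T Z (cluster features), A' = S^T A S (coarsened adjacency)."""
    St = transpose(S)
    X = matmul(St, Z)                   # (m x d) pooled cluster features
    A_next = matmul(matmul(St, A), S)   # (m x m) coarsened adjacency
    return X, A_next

# Toy example: a 4-node path graph softly assigned to 2 clusters.
S = [[1.0, 0.0], [0.8, 0.2], [0.2, 0.8], [0.0, 1.0]]
Z = [[1.0], [2.0], [3.0], [4.0]]
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
X, A2 = coarsen(S, Z, A)
```

Note that the coarsened adjacency stays symmetric when the input adjacency is, which is what allows the same step to be applied recursively at the next level.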

2. Algorithmic Structure and Attention-Based Coarsening

At the heart of CGAP is the local attention unit, which implements node coarsening with learnable query and prototype vectors. For each pooling layer:

  • The attention score $e_{i,p} = \mathrm{LeakyReLU}\big(\widehat{w}^\top [W h_i^\ell \,\|\, W c_p^\ell]\big)$ measures the affinity between node $i$ and cluster prototype $p$; a softmax over prototypes then yields the assignment probabilities $S^\ell_{i,p}$.
  • Pooled features and adjacencies are formed by multiplying the transposed assignment matrix $(S^\ell)^\top$ with the features and adjacency matrices of the current level.
  • The pooling pipeline is typically implemented as a K-layer GNN feature extractor, followed by a cascade of attention-pooling layers and a global attention layer for feature fusion.
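The scoring-and-softmax step of the local attention unit can be sketched as follows. The weight matrix `W`, attention vector `w`, and the prototypes are toy stand-ins for learned parameters:

```python
import math

# Sketch of the local attention unit: e_{i,p} = LeakyReLU(w^T [W h_i || W c_p]),
# followed by a softmax over prototypes to get one row of the assignment matrix.

def leaky_relu(x, slope=0.2):
    return x if x >= 0 else slope * x

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def assignment_row(h_i, prototypes, W, w):
    """Soft assignment of one node over all cluster prototypes."""
    scores = []
    for c_p in prototypes:
        Wh = [sum(W[r][k] * h_i[k] for k in range(len(h_i))) for r in range(len(W))]
        Wc = [sum(W[r][k] * c_p[k] for k in range(len(c_p))) for r in range(len(W))]
        concat = Wh + Wc             # [W h_i || W c_p]
        scores.append(leaky_relu(sum(wk * xk for wk, xk in zip(w, concat))))
    return softmax(scores)           # row i of S^l, sums to 1

# Toy: 2-d features, 2 prototypes, identity projection.
W = [[1.0, 0.0], [0.0, 1.0]]
w = [0.6, 0.2, 0.6, 0.2]
S_row = assignment_row([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], W, w)
```

In this toy setup the node aligns more strongly with the first prototype, so the first assignment probability dominates while the row still sums to one.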

The global attention layer introduces a mechanism whereby the single global-node embedding is broadcast back to all original node positions and fused with the local embeddings via scaled dot-product attention, ensuring that city-wide or sequence-wide context is accessible to all local representations.
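A minimal sketch of this broadcast-and-fuse step, under the assumption that each local embedding attends over the pair consisting of itself and the global embedding; the exact query/key/value wiring in the paper may differ, and learned projections are omitted:

```python
import math

# Fuse a local node embedding h_i with the broadcast global-node embedding g
# via scaled dot-product attention over the two candidates {h_i, g}.

def fuse_with_global(h_i, g):
    d = len(h_i)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Query = h_i; keys/values = [h_i, g]; scores scaled by sqrt(d).
    scores = [dot(h_i, h_i) / math.sqrt(d), dot(h_i, g) / math.sqrt(d)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    w = [x / sum(w) for x in w]      # attention weights, sum to 1
    return [w[0] * h_i[k] + w[1] * g[k] for k in range(d)]

# Toy: orthogonal one-hot local and global embeddings.
fused = fuse_with_global([1.0, 0.0], [0.0, 1.0])
```

Because the query matches itself more strongly than the global vector here, the fused embedding leans toward the local component while still mixing in global context.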

3. CGAP in Urban Region Representation Learning

In the "CGAP: Urban Region Representation Learning with Coarsened Graph Attention Pooling" framework (Xu et al., 2024), these principles are instantiated for the modeling of city regions:

  • Initial node features encode region-level attributes such as POI (Points of Interest) counts and inter-regional mobility flows.
  • The hierarchical pooling process enables aggregation of fine-scale and coarse-scale urban information, countering over-smoothing and locality limitations of standard GNNs.
  • Downstream tasks include mobility prediction, POI similarity retrieval, and land-use classification.

CGAP optimizes a composite loss encompassing region-embedding consistency, cross-entropy for mobility transition prediction, and POI similarity regression. The total loss can be written as a weighted sum of the form

$\mathcal{L} = \mathcal{L}_{\text{con}} + \lambda_1 \mathcal{L}_{\text{mob}} + \lambda_2 \mathcal{L}_{\text{poi}},$

where each term regularizes a different aspect of the learned representation.
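A hedged sketch of such a composite objective follows. The term names (a consistency penalty, mobility cross-entropy, POI-similarity MSE) and the simple weighted-sum combination are assumptions for illustration, not the paper's exact formulation:

```python
import math

# Composite multi-task loss in the spirit of CGAP's objective: a weighted sum
# of a consistency term, a mobility cross-entropy term, and a POI MSE term.

def cross_entropy(pred, target_idx, eps=1e-12):
    """Negative log-likelihood of the target class given a probability vector."""
    return -math.log(max(pred[target_idx], eps))

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def composite_loss(consistency, mobility_pred, mobility_target,
                   poi_pred, poi_target, lam1=1.0, lam2=1.0):
    """L = L_con + lam1 * L_mob + lam2 * L_poi."""
    return (consistency
            + lam1 * cross_entropy(mobility_pred, mobility_target)
            + lam2 * mse(poi_pred, poi_target))

# Toy example: consistency term 0.5, correct mobility class predicted with
# probability 0.75, and a perfect POI similarity fit (MSE 0).
loss = composite_loss(0.5, [0.25, 0.75], 1, [1.0, 2.0], [1.0, 2.0])
```

The weights `lam1` and `lam2` are the usual multi-task balancing knobs; in practice they are tuned per dataset.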

Empirical results on Manhattan urban data show that CGAP outperforms advanced baselines such as ASAP, DiffPool, SAGPool, and MGFN, achieving lower mean absolute error (MAE) on both crime prediction and check-in prediction. Land-use classification further confirms the value of hierarchical pooling, with CGAP exceeding baselines on NMI and ARI metrics (Xu et al., 2024).

4. Temporal Hierarchical Pooling for Action Recognition

A related variant, also denoted CGAP in the action recognition context (Mazari et al., 2020), implements a temporal tree-structured pooling hierarchy. Here, input sequences (e.g., frame-wise video features) are recursively split into segments across successive hierarchy levels, with:

  • Each level containing a progressively larger number of equal-length temporal segments, each pooled via average aggregation.
  • Global representations formed either by concatenating or averaging all segment descriptors, with learnable non-negative weights determining each node's contribution:
    • Concatenation yields an alignment-sensitive representation.
    • Averaging grants resilience to temporal misalignments.

The weights are learned via constrained minimization over the probability simplex, optimizing loss functions such as SVM or contrastive objectives under the constraint that the weights are non-negative and sum to one. This mechanism enables adaptive attention over different levels of temporal granularity. Experiments on UCF-101 demonstrate steady performance improvements as hierarchy depth increases, with the best averaging variant outperforming global pooling and single-granularity variants (Mazari et al., 2020).
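The averaging variant can be sketched as below. Two assumptions are made for illustration: each level l splits the sequence into 2^l equal segments, and the simplex-constrained level weights are parameterized via a softmax over free parameters rather than learned by constrained optimization directly:

```python
import math

# Coarse-to-fine temporal pooling (averaging variant): average-pool the
# sequence at several granularities, then combine levels with weights that
# live on the probability simplex.

def segment_average(seq, n_segments):
    """Average-pool a 1-d feature sequence into n equal-length segments."""
    size = len(seq) // n_segments
    return [sum(seq[i * size:(i + 1) * size]) / size for i in range(n_segments)]

def pyramid_descriptor(seq, depth, theta):
    """Weighted average of per-level descriptors; weights sum to one."""
    m = max(theta)
    w = [math.exp(t - m) for t in theta]
    w = [x / sum(w) for x in w]                    # softmax: simplex constraint
    levels = [segment_average(seq, 2 ** l) for l in range(depth)]
    # Averaging variant: mean over segments at each level, then weight levels.
    return sum(wl * (sum(lv) / len(lv)) for wl, lv in zip(w, levels))

# Toy: 8-frame feature sequence, 2 levels, uniform level weights.
desc = pyramid_descriptor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2, [0.0, 0.0])
```

Averaging over segments makes the descriptor insensitive to how segments align with the action, which is exactly the misalignment resilience the text attributes to this variant; the concatenation variant would instead keep the per-segment descriptors in order.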

5. Comparison to Alternative Hierarchical Graph Pooling Paradigms

Several alternative approaches provide further context for the design space of hierarchical pooling:

  • Minimum Description Length Pooling (MDL-Pool/MapEqPool): This method (Pichowski et al., 2024) optimizes hierarchical cluster assignments by minimizing the total coding cost (in bits) of random-walk trajectories under multi-level partitions, integrating the multilevel map equation as a differentiable loss. It adaptively determines both the number of clusters and necessary hierarchy depth, automatically pruning trivial or redundant levels and capturing multi-scale community structure in a single joint objective.
  • Graph Parsing Networks (GPN): Instead of fixed-ratio or soft assignments, GPN (Song et al., 2024) utilizes a bottom-up grammar-inspired parsing procedure, constructing hard, sparse, and personalized pooling hierarchies for each individual graph via discrete assignments derived from syntactic "edge scoring." This enhances memory efficiency and retains node information across deep coarsening, outperforming or matching all prior hierarchical pooling methods on classical graph benchmarks.

The table below contrasts key design elements of leading hierarchical pooling strategies:

Method      Assignment Type            Hierarchy Adaptivity
CGAP        Soft / local attention     Fixed depth, user-set
MDL-Pool    Soft (via map equation)    Data-driven / adaptive
GPN         Hard / discrete parse      Data-driven / adaptive

6. Properties, Limitations, and Future Directions

CGAP and related hierarchical pooling methods demonstrate robust information aggregation across multiple scales, with theoretical and empirical advantages:

  • Multi-modal data propagation: CGAP effectively integrates both structural (e.g., adjacency) and attribute (e.g., POI, mobility) information in urban graphs, with multi-task objectives ensuring that embeddings reflect both spatial structure and semantic relations (Xu et al., 2024).
  • Invariance and resolution trade-off: Hierarchical pooling in action recognition systematically navigates the balance between global context and fine temporal detail, with the distribution of learned level weights revealing the relative importance of different granularities per action class (Mazari et al., 2020).
  • Alignment resilience and video-length agnosticism: The averaging variant of temporal CGAP avoids dependence on precise segment alignment and accommodates variable input length.

Limitations center on the predefined tree structure and uniform layer design; dynamic segmentation or a learnable hierarchy structure remains an open extension. Efficient optimization for very large inputs can be challenging, particularly for EM-style weight learning in temporal CGAP, although end-to-end deep contrastive variants provide some relief. Extensions to richer pooling operations, 3D spatio-temporal hierarchies, and finer adaptivity in both graph and sequence domains represent promising future research directions (Xu et al., 2024, Mazari et al., 2020).

7. Empirical Evidence and Performance Assessment

Experimentation across both graph and temporal domains substantiates the effectiveness of hierarchical pooling:

  • In urban graph learning, CGAP achieves state-of-the-art results on crime, mobility, and land-use prediction; ablation studies confirm the necessity of both local attention coarsening and global feature fusion, as well as the integration of multi-modal region attributes (Xu et al., 2024).
  • In temporal action recognition, coarse-to-fine CGAP outperforms both shallow and single-level pooling, with learned attention emphasizing task-relevant granularity. Removing intermediate levels degrades performance, affirming the complementary value of multi-scale representations (Mazari et al., 2020).

A plausible implication is that explicit hierarchical pooling, with careful integration of attention mechanisms and task-level supervision, is a crucial enabler for robust, context-enriched, and granular representation learning in structured domains.

