Geometric Hidden Community Model

Updated 31 January 2026

GHCM is a probabilistic generative framework that models spatial networks via latent community labels and continuous geometric features, bridging characteristics of RGG and SBM.
It employs motif-based clustering and connectivity thresholds to achieve near-optimal community recovery with efficient, nearly-linear algorithms in sparse regimes.
The model generalizes traditional approaches by incorporating distance-dependent connection kernels, offering rigorous information-theoretic recovery thresholds and robust empirical performance.

A Geometric Hidden Community Model (GHCM) is a probabilistic generative framework for spatially embedded networks in which community structure manifests through both discrete latent group labels and continuous geometric features. GHCM is formulated to model networks where edge formation is modulated by latent spatial proximity and community membership, generalizing classical random geometric graphs (RGG) and stochastic block models (SBM). Community detection in GHCM leverages both motif-based properties (e.g., triangles) and information-theoretic thresholds to characterize the fundamental limits of recovery in regimes where traditional SBM approaches do not suffice, particularly in the sparse-graph regime where edge dependencies induced by geometry lead to high motif counts and spatial transitivity.

1. Model Formulation

A prototypical GHCM specifies a set of $n$ nodes partitioned into $k$ hidden communities $\{V_1,\ldots,V_k\}$, with each node $i$ assigned an i.i.d. latent coordinate $X_i \in [0,1]^d$ or embedded in more general spaces such as a $d$-dimensional torus or the unit sphere. Edge formation is governed by geometric proximity modulated by community labels: for a pair $(i,j)$, an edge exists with probability $f_{c_1,c_2}(\|X_i-X_j\|)$, where $c_1,c_2$ are the (hidden) communities of $i,j$ and $f_{c_1,c_2}$ is a connection kernel depending on both label pair and distance. A canonical instantiation uses the step kernel:

Within-community edges: present when $\|X_i-X_j\| \le r_s$
Between-community edges: present when $\|X_i-X_j\| \le r_d$, with $r_s > r_d$ and community sizes often balanced (Galhotra et al., 2022, Galhotra et al., 2017). Generalizations further allow distance-dependent pairwise distributions, where for a node $u$ in community $a$ and a node $v$ in community $b$ at distance $x$, the observation on the pair is drawn from a distribution $P_{ab}(\cdot \mid x)$, enabling the modeling of weighted edges, multiple observations, or more complex relational data (Gaudio et al., 24 Jan 2026, Gaudio et al., 22 Jan 2025).
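As an illustration of the step-kernel construction, the following is a minimal sampling sketch for two communities on the unit circle (a one-dimensional torus). The function name, the i.i.d. uniform labels, and the symbols `r_s` (within-community radius) and `r_d` (between-community radius) are assumptions made for this example, not notation taken from the cited papers.

```python
import math
import random

def sample_step_kernel_ghcm(n, r_s, r_d, seed=0):
    """Sample a two-community step-kernel graph on the unit circle.

    Returns (labels, positions, edges). All names here are illustrative.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(2) for _ in range(n)]      # hidden community labels
    positions = [rng.random() for _ in range(n)]       # latent coordinates in [0, 1)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(positions[i] - positions[j])
            dist = min(gap, 1.0 - gap)                 # distance on the circle
            # Step kernel: the admissible radius depends on the label pair.
            radius = r_s if labels[i] == labels[j] else r_d
            if dist <= radius:
                edges.add((i, j))
    return labels, positions, edges

n = 200
r_s = 2.0 * math.log(n) / n   # within-community radius (assumed logarithmic scaling)
r_d = 1.0 * math.log(n) / n   # between-community radius, smaller than r_s
labels, positions, edges = sample_step_kernel_ghcm(n, r_s, r_d)
print(len(edges), "edges sampled")
```

The quadratic pair loop keeps the sketch short; a practical sampler would bucket nodes by position so that only nearby pairs are compared.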

2. Connectivity Thresholds in Geometric and Block Structures

Community detectability and recovery in GHCMs are intimately linked to the connectivity properties of random geometric and annulus graphs. For the random annulus graph (RAG) $\mathrm{RAG}_d(n,[r_1,r_2])$ on the $d$-sphere, edges are present if $r_1 \le \|X_i-X_j\| \le r_2$. The transition from disconnected to connected graphs (percolation) underpins the possibility of global community recovery. In $d=1$ (circle), the critical regime is $r_1 = b\log n/n$ and $r_2 = a\log n/n$. The graph is connected w.h.p. if $a-b > 0.5$ and $a > 1$; otherwise, it is disconnected w.h.p. (Galhotra et al., 2022). In higher dimensions, the isolation and connectivity thresholds are governed by the function

$\{V_1,\ldots,V_k\}$ 8

and radii scaling as $\{V_1,\ldots,V_k\}$ 9; connectivity typically requires $i$ 0 and $i$ 1 (Galhotra et al., 2022).
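To make the annulus-graph notion concrete, here is a small sketch that builds a graph on the unit circle in which two points are joined when their distance lies in an interval $[r_1, r_2]$, and then tests connectivity by depth-first search. The helper names and the toy point set are hypothetical; the sketch illustrates the definition only and does not reproduce the threshold analysis.

```python
def circle_distance(x, y):
    """Distance between two points of [0, 1) viewed as a circle."""
    gap = abs(x - y)
    return min(gap, 1.0 - gap)

def annulus_graph(positions, r1, r2):
    """Adjacency lists: i ~ j iff r1 <= dist(i, j) <= r2 on the unit circle."""
    n = len(positions)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if r1 <= circle_distance(positions[i], positions[j]) <= r2:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def is_connected(adj):
    """Depth-first search from node 0; True iff every node is reached."""
    if not adj:
        return True
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

points = [0.0, 0.1, 0.2, 0.3]
# Consecutive points (gap 0.1) are joined, forming a path.
print(is_connected(annulus_graph(points, 0.05, 0.15)))
# Only pairs at gap 0.2 are joined, leaving two separate pairs.
print(is_connected(annulus_graph(points, 0.15, 0.25)))
```

Setting `r1 = 0` recovers an ordinary random geometric graph, which is why the annulus model is viewed as a generalization of it.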

3. Information-Theoretic and Algorithmic Recovery Thresholds

GHCM admits sharp information-theoretic thresholds for exact recovery, formulated by evaluating whether sufficient information exists to break the (global relabeling) symmetry in the presence of edge sparsity and geometric correlations. In the classical step-kernel GHCM for $d=1$ with $r_s = a\log n/n$ and $r_d = b\log n/n$, exact recovery is impossible if $a-b < 0.5$ or $a < 1$ (Galhotra et al., 2022, Galhotra et al., 2017), with analogous results holding in higher dimensions under corresponding parameter scalings.

For the general distance-dependent pairwise observation setup, the sharp recovery threshold is

$i$ 5

where $\lambda$ is the Poisson process intensity, $\nu_d$ is the unit-ball volume in $\mathbb{R}^d$, and the divergence term is the Chernoff--Hellinger divergence integrated over spatial distance and label mixture (Gaudio et al., 22 Jan 2025, Gaudio et al., 24 Jan 2026). Above the threshold, there exist linear-time or polynomial-time algorithms achieving exact recovery; below it, no estimator achieves exact recovery with high probability. The precise formula for the divergence is

$X_i \in [0,1]^d$ 1

with $X_i \in [0,1]^d$ 2 the conditional densities and $X_i \in [0,1]^d$ 3 the distance density (Gaudio et al., 24 Jan 2026).
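For intuition about the divergence, the sketch below numerically evaluates the standard discrete Chernoff--Hellinger divergence from the SBM literature, $\max_{t\in[0,1]} \sum_x \left[t\,p(x) + (1-t)\,q(x) - p(x)^t q(x)^{1-t}\right]$, by a simple grid search over $t$. This is only the basic building block: it does not include the integration over spatial distance and label mixture used in the GHCM setting, and the function name and grid resolution are assumptions for illustration.

```python
def ch_divergence(p, q, steps=1000):
    """Grid-search approximation of the discrete Chernoff-Hellinger divergence.

    p and q are sequences of nonnegative weights on a common finite support.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    best = 0.0
    for k in range(steps + 1):
        t = k / steps
        value = sum(t * a + (1 - t) * b - (a ** t) * (b ** (1 - t))
                    for a, b in zip(p, q))
        best = max(best, value)
    return best

# Symmetric example: the maximum is attained at t = 1/2,
# where the value equals (sqrt(9) - sqrt(4))**2.
print(ch_divergence([9.0, 4.0], [4.0, 9.0]))
```

The divergence vanishes when the two weight vectors coincide, which matches the intuition that identical observation distributions carry no information for telling communities apart.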

4. Recovery Algorithms and Computational Aspects

GHCMs admit provably close-to-optimal recovery algorithms in the sparse regime, harnessing geometric transitivity and motif abundance:

Triangle-based clustering: For each edge $(u,v)$, the algorithm counts its triangles, that is, the number of common neighbors of $u$ and $v$, and prunes edges whose counts do not exceed statistically determined thresholds. Edges with triangle counts close to within-community expectations are retained. The final partition is extracted via connected components or union-find machinery. This scheme succeeds when the within- and between-community connection radii are sufficiently separated (with similar thresholds in higher dimensions) and runs in nearly linear time for $n$ nodes in the sparse regime (Galhotra et al., 2022, Galhotra et al., 2017).
Two-phase linear-time algorithms: Recent work (Gaudio et al., 22 Jan 2025, Gaudio et al., 24 Jan 2026) describes "seed-propagate-refine" meta-algorithms: (1) local MAP inference on small initial blocks, (2) label propagation across spatial blocks via likelihood ratios or motif aggregation, and (3) an exact labeling refinement phase using local MAP with the now-almost-correct labeling. Each edge is examined only a constant number of times, so the overall running time is linear in the number of edges.
Spectral methods: In hybrid models (SBM plus geometric noise), standard spectral clustering on the adjacency matrix is robust if the SBM eigen-gap exceeds the geometric "noise" by a sufficient factor. Explicit eigenvalue separation and Davis-Kahan-type arguments guarantee that the second (or, with more communities, the leading) eigenvectors retain significant alignment with the true community structure when this separation condition holds (Peche et al., 2020).
Active learning and label queries: Motif-based edge pruning can be combined with querying a vanishing fraction (a sublinear number) of node labels. For regimes where indirect motif separation is insufficient, adaptively querying the labels of a few strategically chosen nodes (e.g., one per connected component of the pruned graph) suffices for exact recovery with a sublinear number of queries (Chien et al., 2019).
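The triangle-based scheme in the first item can be sketched as follows: count common neighbors for each edge, keep edges whose count reaches a cutoff, and read off connected components with union-find. The single lower cutoff `min_common` is a simplifying assumption for this example; the cited algorithms choose their thresholds from the model parameters, and the function name is hypothetical.

```python
from itertools import combinations

def prune_and_cluster(n, edges, min_common):
    """Keep edges whose endpoints share at least `min_common` neighbours,
    then return the connected components of the pruned graph."""
    neighbours = [set() for _ in range(n)]
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    parent = list(range(n))            # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        triangles = len(neighbours[u] & neighbours[v])   # triangles through (u, v)
        if triangles >= min_common:
            parent[find(u)] = find(v)                    # merge the two components

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

# Two 4-cliques joined by a single bridge edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
print(prune_and_cluster(8, edges, min_common=1))
# prints [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In the toy input, every clique edge lies in two triangles while the bridge lies in none, so pruning removes only the bridge. In the active-learning variant, one label query per resulting component would then be enough to name the groups.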

5. Comparison to SBM and Other Random Graph Models

SBM and GHCM differ fundamentally in edge independence and motif structure:

SBM edges are independent given labels; thus, the expected number of triangles per edge in the sparse SBM is $o(1)$, and triangle counting does not offer useful separation (Galhotra et al., 2022, Galhotra et al., 2017).
GHCMs induce correlated edge formation: spatially nearby nodes participate in many triangles, particularly within communities. Motif (triangle)-counting is therefore an effective and nearly optimal community recovery tool for GHCM, but not for SBM in the sparse regime.
GHCM unifies and generalizes random geometric graphs, block models, and latent space models; with step kernels, it reduces to RGG or SBM in special cases (Gaudio et al., 24 Jan 2026, Gaudio et al., 22 Jan 2025).
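The contrast in motif structure can be seen in a small simulation. The sketch below compares the average number of common neighbors per edge in a geometric graph on the unit circle with that in a graph of independent edges at matched density. It ignores community labels entirely, and the parameter values and function names are arbitrary choices for illustration.

```python
import random

def mean_common_neighbours(n, edges):
    """Average number of common neighbours (triangles) per edge."""
    neighbours = [set() for _ in range(n)]
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    if not edges:
        return 0.0
    return sum(len(neighbours[u] & neighbours[v]) for u, v in edges) / len(edges)

def geometric_edges(n, radius, rng):
    """Points uniform on the unit circle, joined when within `radius`."""
    pos = [rng.random() for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            gap = abs(pos[i] - pos[j])
            if min(gap, 1.0 - gap) <= radius:
                edges.append((i, j))
    return edges

def independent_edges(n, p, rng):
    """Each pair joined independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

rng = random.Random(0)
n, radius = 300, 0.02
geo = mean_common_neighbours(n, geometric_edges(n, radius, rng))
# A pair is within `radius` on the circle with probability 2 * radius,
# so p = 2 * radius matches the expected edge density.
ind = mean_common_neighbours(n, independent_edges(n, 2 * radius, rng))
print(f"geometric: {geo:.2f}  independent: {ind:.2f}")
```

With these settings the geometric graph should show several common neighbors per edge while the independent-edge graph shows well under one, which is the gap that triangle counting exploits.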

6. Empirical Performance and Benchmarks

Empirical validation on real networks (Political-Blogs, DBLP collaboration graphs, LiveJournal) indicates that motif-based unsupervised GHCM recovery achieves 75–80% labeling accuracy (as measured against ground truth), outperforming spectral clustering and other SBM-inspired techniques (which achieve only 50–65%) (Galhotra et al., 2022, Galhotra et al., 2017). In synthetic datasets, experiments reveal sharp threshold behavior: below the predicted gap between the within- and between-community connection radii, algorithms fail; above it, recovery is perfect. Running times are near-linear in $n$ or linear in the number of edges, contrasting with the quadratic cost of spectral methods.

7. Extensions, Generalizations, and Open Problems

GHCM research has advanced to:

Recovery in arbitrary dimensions, inhomogeneous spatial domains, and with flexible distance-dependent kernels—both theory and methodology allow for more general geometric and weighted graphs (Avrachenkov et al., 2024, Eldan et al., 2020).
Incorporating percolation-theoretic arguments: threshold behavior links to continuum percolation and information flow (Kesten–Stigum thresholds) on branching processes with geometry (Eldan et al., 2020).
Bypassing "distinctness-of-distributions" assumptions when between-community and within-community observation distributions coincide for some pairs—data-driven block-propagation algorithms still achieve sharp recovery (Gaudio et al., 22 Jan 2025).
Active learning: sublinear label queries close the gap between unsupervised and information-theoretic recovery thresholds (Chien et al., 2019).
Open questions remain for scalability of optimal recovery algorithms to larger numbers of communities, adapting to highly sparse regimes, and understanding criticality in more general geometries and connection rules.

References:

"Community Recovery in the Geometric Block Model" (Galhotra et al., 2022)
"The Geometric Block Model" (Galhotra et al., 2017)
"Exact Recovery in the Geometric Hidden Community Model" (Gaudio et al., 24 Jan 2026)
"Sharp exact recovery threshold for two-community Euclidean random graphs" (Gaudio et al., 22 Jan 2025)
"Community Detection on Block Models with Geometric Kernels" (Avrachenkov et al., 2024)
"Active learning in the geometric block model" (Chien et al., 2019)
"Community detection and percolation of information in a geometric setting" (Eldan et al., 2020)
"Robustness of Community Detection to Random Geometric Perturbations" (Peche et al., 2020)