HIT-Leiden: Incremental Tree Community Detection
- HIT-Leiden is an incremental, hierarchical, and parallel community detection algorithm designed to maintain quality community partitions under frequent graph updates.
- It uses a tree-structured hierarchy and dynamic connectivity techniques to localize and bound update computations, enhancing efficiency over traditional methods.
- Empirical evaluations demonstrate significant speedups and modularity preservation, making HIT-Leiden effective for dynamic network applications from social graphs to biological datasets.
Hierarchical Incremental Tree Leiden (HIT-Leiden) is an incremental, hierarchical, and parallel community detection algorithm for large, dynamic networks that maintains high-quality community partitions under frequent updates. HIT-Leiden is distinguished by a hierarchical, tree-structured approach to managing community structure and update propagation, as well as provable boundedness and efficient locality—qualities that address the inefficiency and unboundedness of previous Leiden-based incremental methods. It has demonstrated significant speedups and modularity preservation in large-scale experimental evaluations across a variety of dynamic graph scenarios (Lin et al., 13 Jan 2026, Bokov et al., 20 Feb 2025).
1. Theoretical Foundation and Problem Setting
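In HIT-Leiden, the input is a dynamic, undirected, weighted graph $G = (V, E)$, where each node $v$ has weighted degree $d(v)$, and the global edge-weight sum is $m$. The community detection objective is modularity maximization:
$$Q = \sum_{C \in \mathcal{C}} \left( \frac{e(C)}{m} - \gamma \left( \frac{d(C)}{2m} \right)^2 \right),$$
where $\mathcal{C}$ is the community assignment, $e(C)$ the total internal edge weight of community $C$, $d(C)$ the total degree of $C$, and $\gamma$ is the resolution parameter.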
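As a concrete illustration of the objective, the following minimal sketch computes modularity with a resolution parameter for a small weighted graph. It assumes the common form in which each community contributes its internal edge weight divided by the total edge weight, minus the resolution parameter times the squared share of total degree. The function name `modularity` and the edge-list input format are illustrative choices, not taken from the cited papers.

```python
from collections import defaultdict

def modularity(edges, community, gamma=1.0):
    """Modularity of an undirected weighted graph.

    edges: iterable of (u, v, weight), each undirected edge listed once.
    community: dict mapping node -> community label.
    gamma: resolution parameter.
    """
    m = 0.0                          # total edge weight
    internal = defaultdict(float)    # weight of edges inside each community
    degree = defaultdict(float)      # summed weighted degree per community
    for u, v, w in edges:
        m += w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    if m == 0:
        return 0.0
    return sum(internal[c] / m - gamma * (degree[c] / (2 * m)) ** 2
               for c in degree)

# Two triangles joined by a single bridge edge (2, 3), unit weights.
EDGES = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
SPLIT = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
```

Placing every vertex in one community yields a modularity of zero at `gamma=1.0`, which is why the two-triangle split scores higher.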
The algorithm processes batch updates $\Delta G$ of edge insertions and deletions. The principal challenge addressed by HIT-Leiden is efficient community maintenance under these updates, without recomputing community structure from scratch. This is formalized in Problem 1: given $G$ and an update $\Delta G$, efficiently compute updated communities for the updated graph while preserving high modularity.
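To make the batch-update setting concrete, here is a small sketch of applying a batch to an adjacency map and collecting the touched endpoints, which are the natural seeds for any incremental routine. Encoding deletions as negative weight deltas, and the name `apply_batch`, are assumptions made for this example rather than the representation used in the cited work.

```python
def apply_batch(adj, batch):
    """Apply weighted edge updates in place and return the touched endpoints.

    adj: dict node -> dict neighbour -> weight (undirected, symmetric).
    batch: list of (u, v, delta); delta > 0 inserts or strengthens an edge,
           delta < 0 weakens it, and an edge whose weight drops to 0 or
           below is removed (an assumed encoding for this sketch).
    """
    touched = set()
    for u, v, delta in batch:
        new_w = adj.setdefault(u, {}).get(v, 0.0) + delta
        adj.setdefault(v, {})
        if new_w > 0:
            adj[u][v] = new_w
            adj[v][u] = new_w
        else:
            adj[u].pop(v, None)
            adj[v].pop(u, None)
        touched.update((u, v))
    return touched
```

Only the returned endpoints need to be examined first; everything else in the graph is left alone unless a later move reaches it.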
A key theoretical concept is boundedness: an incremental algorithm is bounded if its update time is polynomially related to the size of the affected region (AFF) and the size of the community structure. Existing incremental methods (DF-Leiden, ND-Leiden, DS-Leiden) are unbounded, requiring work proportional to the whole graph on each update (Lin et al., 13 Jan 2026).
2. Hierarchical Data Structures and Connectivity Maintenance
HIT-Leiden implements a hierarchical community structure using $P$ levels, each representing progressively coarser meta-communities. At each level $p \geq 1$, the nodes (supernodes) correspond to communities formed at level $p - 1$, and edges are induced by aggregating weights between underlying vertices. Supernodes are linked to parents via pointers, forming a tree.
Subcommunity structure is maintained by a dynamic connectivity index (e.g., DND-Tree) over a subgraph comprising intra-sub-community edges. Each connected component is a subcommunity; edge updates or vertex moves that split a component are tracked, and the smaller piece receives a new subcommunity ID. This enables efficient detection of subcommunity splits/merges and localizes updates.
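The split rule described above can be sketched with a plain breadth-first search standing in for the dynamic connectivity index. This is a deliberately naive substitute: a real index such as the DND-Tree avoids rescanning whole components, whereas this version only illustrates the "smaller piece receives a new ID" policy. The function name and argument layout are hypothetical.

```python
from collections import deque

def split_after_deletion(adj, sub_id, u, v, next_id):
    """Remove edge (u, v) from the intra-subcommunity graph; relabel on split.

    adj: dict node -> set of neighbours (intra-subcommunity edges only).
    sub_id: dict node -> subcommunity ID, updated in place.
    Returns the set of relabelled nodes (empty if the component stays whole).
    """
    adj[u].discard(v)
    adj[v].discard(u)

    def component(start):
        seen, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return seen

    side_u = component(u)
    if v in side_u:
        return set()                # still connected, no split
    side_v = component(v)
    smaller = side_u if len(side_u) <= len(side_v) else side_v
    for x in smaller:
        sub_id[x] = next_id         # smaller piece receives the new ID
    return smaller
```

Relabelling the smaller side keeps the number of ID changes per split at most half the component size.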
A representative hierarchy:
| Level | Entity | Nodes Represent |
|---|---|---|
| | Community Nodes | Meta-communities |
| | Refined Nodes | Subcommunities |
| $0$ | Ground Nodes | Singleton vertices |
Edges exist only within levels and inherit weights from underlying substructure (Bokov et al., 20 Feb 2025).
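A minimal sketch of how such a hierarchy might be represented: one parent map per level, plus a routine that induces the next level's weighted edges from the current one. Storing a supernode's internal weight as a self-loop `(a, a)` is an assumption of this sketch, and the names `aggregate` and `top_ancestor` are illustrative.

```python
from collections import defaultdict

def aggregate(edges, parent):
    """Induce the next-level weighted graph from a parent mapping.

    edges: dict {(u, v): weight} at the current level, each edge stored once.
    parent: dict child node -> supernode at the next level (labels must be
            mutually comparable, e.g. all strings or all ints).
    Returns dict {(a, b): weight} with a <= b; a == b holds internal weight.
    """
    coarse = defaultdict(float)
    for (u, v), w in edges.items():
        a, b = parent[u], parent[v]
        key = (a, b) if a <= b else (b, a)
        coarse[key] += w
    return dict(coarse)

def top_ancestor(node, parents_by_level):
    """Follow parent pointers from a ground node up through every level."""
    for parent in parents_by_level:
        node = parent[node]
    return node
```

Because each level only stores a child-to-parent map, relabelling a supernode at a higher level implicitly relabels every vertex beneath it.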
3. Incremental Update Algorithms
HIT-Leiden processes batch updates via efficient local routines at each hierarchical level. The update pipeline consists of:
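- Inc-movement: For a batch $\Delta G$, identifies affected vertices, maintains a working set of potentially moved vertices, and greedily applies modularity-improving moves based on the gain
$$\Delta Q(v \to C') = \frac{w(v, C') - w(v, C \setminus \{v\})}{m} - \gamma \, \frac{d(v)\,\bigl(d(C') - d(C) + d(v)\bigr)}{2m^2},$$
where $C$ is the current community of $v$, $w(v, X)$ is the total weight of edges between $v$ and the vertex set $X$, $d(\cdot)$ denotes weighted degree, $m$ is the total edge weight, and $\gamma$ is the resolution parameter. All such moves are performed until no positive $\Delta Q$ remains. The process marks both the “community-changed” region and the sub-community splits.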
- Inc-refinement: For each marked vertex that has been split from its subcommunity, reassigns it optimally based on local modularity, ensuring (nearly) $\gamma$-connected, locally optimal subcommunity partitions.
- Inc-aggregation: Lifts the batch of edge and subcommunity changes from level $p$ to level $p + 1$ by aggregating the updates on supernodes, maintaining consistent hierarchy representations.
- Deferred hierarchy update: Changes detected at higher levels are propagated downward: affected children inherit updated community labels, preserving hierarchical consistency.
A global driver sequentially processes each hierarchical level, applying inc-movement, inc-refinement, inc-aggregation, and final deferred updates (Lin et al., 13 Jan 2026, Bokov et al., 20 Feb 2025).
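The movement step of this pipeline can be sketched as a queue-driven local-moving loop that starts from the affected vertices only. The sketch below uses the standard modularity gain for moving a vertex between communities; it is a single-level, sequential simplification under assumed data structures (`adj` as a nested dict, self-loops ignored), not the authors' implementation, and it omits refinement, aggregation, and the deferred hierarchy update.

```python
from collections import defaultdict, deque

def local_move(adj, comm, affected, gamma=1.0):
    """Greedy local moving restricted to a queue seeded with affected vertices.

    adj: dict node -> dict neighbour -> weight (undirected, symmetric).
    comm: dict node -> community label, updated in place.
    Returns the set of vertices whose community changed.
    """
    deg = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    m = sum(deg.values()) / 2.0
    tot = defaultdict(float)                 # total degree per community
    for v, c in comm.items():
        tot[c] += deg[v]

    queue, queued, changed = deque(affected), set(affected), set()
    while queue:
        v = queue.popleft()
        queued.discard(v)
        old = comm[v]
        w_to = defaultdict(float)            # edge weight from v to each community
        for u, w in adj[v].items():
            if u != v:
                w_to[comm[u]] += w
        w_old = w_to.get(old, 0.0)
        best, best_gain = old, 0.0
        for c, w_vc in w_to.items():
            if c == old:
                continue
            gain = ((w_vc - w_old) / m
                    - gamma * deg[v] * (tot[c] - tot[old] + deg[v]) / (2 * m * m))
            if gain > best_gain:
                best, best_gain = c, gain
        if best != old:
            tot[old] -= deg[v]
            tot[best] += deg[v]
            comm[v] = best
            changed.add(v)
            for u in adj[v]:                 # neighbours may now prefer to move
                if comm[u] != best and u not in queued:
                    queue.append(u)
                    queued.add(u)
    return changed
```

Only vertices reachable through a chain of accepted moves ever enter the queue, which is the locality property the incremental routines rely on.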
4. Modularity Optimization and Parallelization
At each hierarchical level, HIT-Leiden applies a Leiden-style process consisting of Move and Refine stages:
- MoveStage: Computes, for each affected node and its neighbors, the best allowed move based on the modularity gain $\Delta Q$. Candidate moves are collected, sorted by reward, and greedily filtered to a non-conflicting set for simultaneous application.
- RefineStage: Restricts optimization to moves within the same parent community, refining substructure.
Parallelization leverages the largely local nature of these operations: affected node sets and their 2-hop neighborhoods are partitioned across threads. Most steps, including computation of modularity rewards and candidate moves, are parallel. Only the decoupling (conflict filtering) step is strictly sequential. Memory accesses remain localized, and synchronization cost is minimal (Bokov et al., 20 Feb 2025).
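The sequential decoupling step can be illustrated with a short greedy filter. The conflict rule used here (two moves conflict when they share a node or touch a common source or target community) is an assumption chosen for the example; the cited work may define conflicts differently. The function name `filter_conflicts` is likewise hypothetical.

```python
def filter_conflicts(candidates):
    """Pick a non-conflicting subset of moves, highest gain first.

    candidates: list of (gain, node, source_community, target_community).
    Two moves are treated as conflicting when they share a node or touch a
    common community (an assumed rule for this sketch).
    """
    accepted, used_nodes, used_comms = [], set(), set()
    for gain, node, src, dst in sorted(candidates, key=lambda c: -c[0]):
        if gain <= 0:
            break                      # remaining candidates cannot improve
        if node in used_nodes or src in used_comms or dst in used_comms:
            continue
        accepted.append((gain, node, src, dst))
        used_nodes.add(node)
        used_comms.update((src, dst))
    return accepted
```

Since accepted moves touch disjoint communities under this rule, their gains do not interfere and they can be applied simultaneously; the rejected candidates are simply reconsidered in the next iteration.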
5. Time and Space Complexity
Update cost per batch is a function of the number of unique affected nodes, the maximum degree, the number of hierarchy levels, the number of inner iterations, and the number of Move/Refine iterations. Formally,
where is the number of supernodes touched. Since in practical dynamic workloads, and are constants, time per update is effectively . This establishes HIT-Leiden as relatively bounded. Space overhead is , with in practice (Bokov et al., 20 Feb 2025, Lin et al., 13 Jan 2026).
6. Empirical Performance and Applications
HIT-Leiden demonstrates scalability and efficiency across multiple domains:
- On datasets with up to 201M nodes and 4B edges, HIT-Leiden achieves up to speedup over DF-Leiden and – over ND/DS-Leiden for batch sizes .
- Modularity matches static Leiden within $0.01$ and achieves -density.
- As batch size decreases, runtime grows sublinearly, validating dynamic locality, whereas baseline methods remain linear in .
- In long-term experiments over 999 update batches, HIT-Leiden remains both fast and quality-stable.
- In question-answering over graphs (Graph-RAG on HotpotQA), HIT-Leiden-RAG is 56$\times$ faster than static Leiden-RAG, with a lower summary token cost and no deterioration in QA accuracy (Lin et al., 13 Jan 2026).
The parallel implementation (LD-Leiden) achieves a 7–49$\times$ single-thread speedup over prominent baselines and scales to 64 threads with maintained or improved modularity (Bokov et al., 20 Feb 2025).
7. Relation to Previous Methods and Significance
HIT-Leiden’s design addresses the central deficiency of prior incremental Leiden algorithms—unboundedness—by tightly confining computation to the actual affected subregions and propagating changes only when necessary. This is achieved through integration of hierarchical representation, dynamic connectivity tracking, and efficient modularity optimization. The locality inherent in HIT-Leiden also enables efficient parallelization on shared-memory systems.
As the first relatively bounded incremental Leiden algorithm with provable update complexity, HIT-Leiden offers a foundational methodology for community detection in streaming or continuously evolving massive networks encountered in knowledge graphs, anomaly detection, biological datasets, and LLM-powered retrieval-augmented generation systems (Lin et al., 13 Jan 2026, Bokov et al., 20 Feb 2025).