
Incremental DBSCAN

Updated 6 February 2026
  • Incremental DBSCAN is a dynamic version of the classic DBSCAN that updates clusters locally without recomputing the entire structure upon data insertions or deletions.
  • It leverages spatial indices, region-growing techniques, and advanced data structures like Euler Tour Trees and HNSW to achieve sublinear update costs.
  • Empirical studies report 10–20× speedups for small update fractions while maintaining clustering accuracy through metrics such as the Rand Index.

Incremental DBSCAN refers to a family of algorithms and frameworks that extend the classical Density-Based Spatial Clustering of Applications with Noise (DBSCAN) paradigm to dynamic or streaming data settings. These methods support efficient integration of insertions and/or deletions into an existing clustering, eliminating the need to recompute the entire structure after each data modification. Incremental DBSCAN variants leverage advanced data structures, indexing strategies, graph-theoretic formulations, and sampling or approximation methods to achieve sublinear or polylogarithmic update costs, extending the applicability of density-based clustering to large-scale, high-velocity, and non-Euclidean datasets.

1. Classical DBSCAN and the Need for Incrementalization

The original DBSCAN algorithm [Ester et al., 1996] defines clusters as maximal sets of density-connected points in a metric space, given parameters $\varepsilon$ (radius) and MinPts (minimum neighborhood cardinality). Each point is classified as a core point (at least MinPts points in its $\varepsilon$-neighborhood), a border point, or noise. DBSCAN performs a global scan of the dataset using range queries, with complexity $O(n^2)$ in the absence of spatial indexing, or $O(n \log n)$ for low-dimensional Euclidean data with an R-tree. Classical DBSCAN is batch-oriented and not designed for evolving datasets.
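The density-connectivity definitions above can be made concrete with a minimal batch sketch. Brute-force range queries stand in for a spatial index, and all names are illustrative, not from any reference implementation:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Classic batch DBSCAN: brute-force range queries, flood-fill expansion.
    Labels: -1 = noise, 0..k-1 = cluster ids."""
    n = len(X)
    # Pairwise distances; an R-tree/kd-tree would replace this O(n^2) step.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])  # self counted
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Region-growing (ExpandCluster): BFS over density-reachable points.
        stack = [i]
        labels[i] = cluster
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster      # border or core point joins cluster
                    if is_core[q]:
                        stack.append(q)      # only cores propagate reachability
        cluster += 1
    return labels, is_core
```

Note that the $O(n^2)$ distance matrix is exactly the cost that spatial indexing removes in practice.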

The incremental DBSCAN problem is formally defined as maintaining a valid clustering under dynamic updates, i.e., for data evolving via insertions ($\Delta^+$) and deletions ($\Delta^-$) according to $D_{t+1} = (D_t \cup \Delta^+) \setminus \Delta^-$. The challenge is to adjust only affected neighborhoods and clusters without recomputing the full structure, achieving computational savings when the update fraction $\delta = |\Delta|/|D|$ is small (Chakraborty et al., 2014).

2. Algorithmic Techniques for Incremental DBSCAN

Incremental DBSCAN implementations share several algorithmic features:

  • Neighborhood Index Maintenance: A spatial index (e.g., R-tree, kd-tree) is kept over the data, supporting $O(\log n + |N_\varepsilon|)$ range queries. Upon insertion/deletion, affected points' neighborhoods and core/border/noise statuses are updated locally.
  • Region-Growing Expansion: When a newly inserted point becomes a core point, region-growing (flood fill) is performed using the classic ExpandCluster method to update density-reachable points' cluster labels. Deletions may cascade demotions or splits within affected clusters.
  • Switch-Level Incrementalization: For small updates ($\delta \lesssim 0.3$), incremental DBSCAN yields significant speedups ($10$--$20\times$ for $\delta = 0.01$--$0.05$). For larger updates, incremental strategies lose their computational advantage and full recomputation is recommended.
  • Complexity: For $|\Delta| = \delta n$ updates, the incremental cost is $O(\delta n \log n + \delta n \cdot \mathrm{MinPts})$ per batch, as opposed to $O(n \log n)$ for a full rerun, provided that update localization is preserved (Chakraborty et al., 2014).

Incremental DBSCAN pseudocode follows standard region-growing with localized re-indexing. Insertions perform $\varepsilon$-range queries against the current clustering, updating labels and possibly triggering further promotions to core/border status. Deletions demote neighbors and recursively update the cluster structure if density conditions change.
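The insertion path just described can be sketched as follows. This is an illustrative, insertion-only simplification (no promotion cascades, no deletions), and all class and method names are hypothetical:

```python
import numpy as np

class IncrementalDBSCAN:
    """Insertion-only incremental DBSCAN sketch. A brute-force range query
    stands in for the spatial index; promotion cascades (existing neighbors
    becoming core because of the new point) are omitted for brevity."""

    def __init__(self, eps, min_pts, dim=2):
        self.eps, self.min_pts = eps, min_pts
        self.X = np.empty((0, dim))
        self.labels = []              # -1 = noise
        self.next_cluster = 0

    def _range_query(self, p):
        if len(self.X) == 0:
            return np.array([], dtype=int)
        return np.flatnonzero(np.linalg.norm(self.X - p, axis=1) <= self.eps)

    def insert(self, p):
        nbrs = self._range_query(p)       # eps-neighbors among existing points
        idx = len(self.X)
        self.X = np.vstack([self.X, p])
        self.labels.append(-1)
        if len(nbrs) + 1 >= self.min_pts:  # new point is core (self counted)
            touched = {self.labels[j] for j in nbrs if self.labels[j] != -1}
            if not touched:
                cid = self.next_cluster    # a new cluster is born
                self.next_cluster += 1
            else:
                cid = min(touched)         # merge all reachable clusters
                for j, lbl in enumerate(self.labels):
                    if lbl in touched:
                        self.labels[j] = cid
            self.labels[idx] = cid
            for j in nbrs:                 # former noise becomes border
                if self.labels[j] == -1:
                    self.labels[j] = cid
        return self.labels[idx]
```

A full implementation would also re-check neighbors' core status on each update and handle deletions with connectivity re-verification.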

3. Scalable and Advanced Incremental DBSCAN Variants

3.1 Dynamic DBSCAN with Euler Tour Trees

DynamicDBSCAN (Shin et al., 11 Mar 2025) introduces Euler Tour Trees (ETT) to support fully dynamic connectivity queries and updates in the DBSCAN core-neighbor graph. Points are hashed into $t = O(\log(n/\delta))$ buckets using locality-sensitive hashing (LSH), supporting $O(d \log n)$-time $\varepsilon$-neighbor checks in $\mathbb{R}^d$. Core points are organized as a dynamic forest with ETT, linked according to $\varepsilon$-neighborhood connectivity. Each insertion or deletion triggers at most $O(\log^3 n + d \log^3 n)$ updates (worst case, with high probability). Non-core (border/noise) points are maintained as leaves attached to the core forest.
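The bucketed $\varepsilon$-neighbor lookups can be illustrated with a simple grid-hashing index: any $\varepsilon$-neighbor of a query lies in one of the $3^d$ cells surrounding its own cell of side $\varepsilon$. This is a stand-in sketch in the same spirit, not the LSH scheme of the paper:

```python
import itertools
import numpy as np
from collections import defaultdict

class GridEpsIndex:
    """Hash points into eps-sized grid cells; an eps-neighbor of a query must
    lie in one of the 3^d surrounding cells. Illustrative stand-in for the
    paper's LSH bucketing (not the published scheme)."""

    def __init__(self, eps, dim):
        self.eps, self.dim = eps, dim
        self.cells = defaultdict(list)   # cell key -> list of (id, point)

    def _key(self, p):
        return tuple(int(np.floor(c / self.eps)) for c in p)

    def insert(self, pid, p):
        self.cells[self._key(p)].append((pid, np.asarray(p, float)))

    def delete(self, pid, p):
        key = self._key(p)
        self.cells[key] = [(i, q) for i, q in self.cells[key] if i != pid]

    def eps_neighbors(self, p):
        p = np.asarray(p, float)
        base = self._key(p)
        out = []
        # Scan the 3^d neighboring cells, then apply the exact distance filter.
        for off in itertools.product((-1, 0, 1), repeat=self.dim):
            key = tuple(b + o for b, o in zip(base, off))
            for pid, q in self.cells.get(key, ()):
                if np.linalg.norm(p - q) <= self.eps:
                    out.append(pid)
        return out
```

The $3^d$ cell scan is why such grids suit low dimensions; LSH trades exactness for dimension-robust bucket counts.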

The approach guarantees, with probability at least $1 - \delta$, maintenance of the same connected components (clusters) as static DBSCAN (except for an $O(\varepsilon)$ boundary band). Empirical results indicate $10$--$100\times$ speedup for batches of up to $10^5$ updates, with Rand Index $> 0.995$ vs. static DBSCAN.

3.2 FISHDBC

FISHDBC (Dell'Amico, 2019) is a flexible, incremental, scalable, hierarchical density-based clustering framework for arbitrary data types and distance functions. It uses a Hierarchical Navigable Small World (HNSW) graph for approximate neighbor discovery, accumulating distance computations from index traversal. Reachability distances $\mathrm{reachDist}(a,b) = \max(d(a,b), c(a), c(b))$, where the core distance $c(\cdot)$ is the distance to the $m$-th nearest neighbor, are updated incrementally as points arrive. A candidate edge set is periodically merged into a global minimum spanning forest via Kruskal's algorithm. FISHDBC approximates HDBSCAN*'s hierarchical clustering, achieves complexity $O(n \log^2 n)$ via batched updates, and accommodates streaming and large-scale data.
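The mutual-reachability MST at the heart of this construction can be sketched in batch form. FISHDBC builds it incrementally from HNSW candidate edges; this one-shot version uses brute-force distances purely for clarity:

```python
import numpy as np

def mutual_reachability_mst(X, m):
    """Kruskal's algorithm over reachDist(a, b) = max(d(a,b), c(a), c(b)),
    where the core distance c(x) is the distance to x's m-th nearest neighbor.
    Batch sketch of the structure FISHDBC maintains incrementally."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    c = np.sort(d, axis=1)[:, m]       # column 0 is the point itself
    edges = sorted((max(d[i, j], c[i], c[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))       # edge enters the spanning forest
    return mst
```

Cutting this MST at increasing weight thresholds yields the HDBSCAN*-style density hierarchy.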

3.3 Incremental Prototype-based DBSCAN (IPD)

IPD (Saha et al., 2022) addresses the big-data and streaming limitations of classical DBSCAN by maintaining a small, incrementally updated prototype subset of size $\gamma \ll n$. DBSCAN is applied to the prototype; when batches of new points arrive, neighborhoods are checked only against the prototype, and clusters are merged accordingly. Representatives are selected per cluster using density-based criteria, enabling all remaining data to be labeled by nearest neighbor to a cluster representative. Stability is tracked by monitoring label consistency on a test sample; once stable (instability $\Delta = 0$), the clustering is summarized via the representatives. The cost is independent of $n$: $O(\gamma^2 + t_{\mathrm{ipd}}[\gamma + (\beta^2 + \gamma\beta) + \alpha^2 + \alpha k])$, where $t_{\mathrm{ipd}}$ is the number of incremental iterations. IPD achieves NMI scores of $0.82$--$0.99$ relative to DBSCAN/ground truth, with high scalability and rapid adaptation to point insertions.
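The final labeling step, assigning every point the cluster of its nearest representative, might look like the sketch below. The representative selection itself is assumed already done by the density-based criteria above; the function name is hypothetical:

```python
import numpy as np

def label_by_representatives(points, reps, rep_labels):
    """IPD-style final labeling (illustrative): each point inherits the
    cluster label of its nearest representative. `reps` and `rep_labels`
    would come from DBSCAN on the small prototype set."""
    points = np.asarray(points, float)
    reps = np.asarray(reps, float)
    # Distance from every point to every representative; argmin picks the
    # nearest representative per point.
    d = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=-1)
    return [rep_labels[j] for j in d.argmin(axis=1)]
```

This is what makes the per-batch cost independent of $n$: only distances to the few representatives are computed.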

4. Empirical Performance and Comparison to Other Incremental Methods

Empirical evaluations of incremental DBSCAN (Chakraborty et al., 2014, Shin et al., 11 Mar 2025, Dell'Amico, 2019, Saha et al., 2022) demonstrate:

  • For update fractions $\delta \leq 0.05$, incremental methods deliver $10$--$20\times$ speedup compared to full DBSCAN reruns.
  • For $\delta$ up to $0.3$--$0.72$ (reported thresholds), speedups remain $2$--$10\times$; above this, a batch rerun is preferable.
  • Cluster label consistency metrics (Rand index, NMI, ARI, silhouette) maintain agreement with static DBSCAN up to a critical $\delta^*$. Outlier detection and core/border assignments are preserved up to boundary noise.
  • Compared with incremental K-means, incremental DBSCAN is more computationally intensive (due to local region-growing) but supports arbitrary cluster topology and explicit noise handling.

Table: Empirical Thresholds and Speedup

$\delta$ (%)   Speedup ($T_{\mathrm{full}}/T_{\mathrm{incr}}$)   Regime
1              $\approx 20$                                      Strong (recommended)
5              $\approx 10$                                      Effective
30             $\approx 2$                                       Marginal
>30            $\leq 1$                                          Full rerun favored

Performance is robust provided update localization is maintained and spatial index/querying remains sublinear.
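The switch criterion implied by these thresholds reduces to a one-line rule. The crossover value below is an assumption of this sketch, since $\delta^*$ is dataset- and implementation-dependent:

```python
def update_strategy(delta, delta_star=0.3):
    """Pick an update strategy from the empirical regimes tabulated above.
    delta is |Delta|/|D|; delta_star is the dataset-dependent crossover
    point (0.3 here is an assumed default, not a universal constant)."""
    if delta >= delta_star:
        return "full-rerun"     # incremental bookkeeping no longer pays off
    return "incremental"        # expected 2-20x speedup for small delta
```

In practice $\delta^*$ would itself be estimated by occasionally timing both paths on the live workload.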

5. Methodological and Structural Enhancements

Incremental DBSCAN frameworks integrate several advanced structures and strategies:

  • Spatial and Graph Indexing: R-trees, kd-trees, LSH-buckets, HNSW, and Euler Tour Trees allow efficient neighborhood lookups and core/border tracking in evolving datasets.
  • Hierarchical and Prototype-Based Summaries: FISHDBC and IPD support hierarchical (dendrogram) outputs or representative-based reduced labeling, accommodating both exploratory and scalable analytics.
  • Approximation and Batch Buffering: For streaming/large-scale scenarios, algorithms trigger full MST or cluster merges in batched modes, amortizing complexity and bounding RAM usage.
  • Anytime/Online Processing: Some variants allow “anytime” cuts, where clustering structure may be queried at arbitrary intermediate times, consistent with the underlying density connectivity relations.

6. Parameterization and Practical Guidance

Selecting update parameters involves:

  • $\varepsilon$ and MinPts: Must be adapted over time; for high dimensions, $\mathrm{MinPts} \approx d+1$ or $2d$ is recommended. "Elbow" methods (from $k$-distance plots) can help adapt $\varepsilon$ as dataset density drifts.
  • Switch criteria ($\delta^*$): Empirical determination of $\delta^*$ (the fraction of data updated at which incremental methods lose their advantage) is essential for balancing recomputation costs.
  • Prototype, Batch, or Candidate Sizes: For prototype- or batch-based methods (e.g., IPD, FISHDBC), batch/candidate sizes must be chosen so that critical data structures remain memory-resident but are large enough to minimize rebuild frequency.
  • Distance Function Choices: Flexible variants (FISHDBC) support arbitrary symmetric distances, including non-metric or application-specific similarity functions.
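The $k$-distance "elbow" heuristic for choosing $\varepsilon$ can be sketched as follows; the steepest-drop rule used here is one simple choice among several:

```python
import numpy as np

def k_distance_elbow(X, k):
    """Sorted k-distance curve for choosing eps: compute each point's distance
    to its k-th nearest neighbor, sort descending, and return the value just
    past the largest drop (a simple 'elbow' heuristic, one of several)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kd = np.sort(d, axis=1)[:, k]        # column 0 is the point itself
    kd = np.sort(kd)[::-1]               # descending k-distance curve
    drops = kd[:-1] - kd[1:]             # drop between consecutive entries
    return kd[int(drops.argmax()) + 1]   # eps just past the steepest drop
```

Under density drift, re-running this on a recent window of data gives a cheap way to retune $\varepsilon$.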

Implementation guidelines emphasize optimized indexing libraries, parallelization of region-growing/batch updates, background processing of MST or index updates, and automated retuning based on data drift or density changes (Dell'Amico, 2019, Saha et al., 2022, Shin et al., 11 Mar 2025).

7. Limitations and Future Directions

Limitations persist, especially in the highly dynamic or high-dimensional setting:

  • For cluster boundaries or in regions of rapid density change, incremental maintenance may miss transient structure unless neighborhood recalculations are very aggressive.
  • Parameter selection remains nontrivial, particularly in the presence of concept drift or variable density; automatic scaling and monitoring are ongoing research topics.
  • Strong empirical performance holds only for moderate update rates; once $\delta$ exceeds $\delta^*$ (problem- and implementation-dependent, but typically $0.2$--$0.7$), full recomputation matches or exceeds incremental performance.
  • Extension to arbitrary data types, extremely high dimensions, and adversarial update regimes motivates further development of index-pruned methods, distributed architectures, and hierarchical summarizations.

Future directions include automated parameter tuning, adaptive batch or prototype sampling, efficient distributed/parallel implementations, and robust adaptation to extreme data velocity via lossy or out-of-core index strategies (Dell'Amico, 2019, Saha et al., 2022, Shin et al., 11 Mar 2025). Development of accurate yet fast incremental methods for non-metric or graph-based data remains an active research area.
