
Absolute DSI for Cluster Margins

Updated 18 May 2026
  • The paper introduces Absolute DSI, a normalized measure that uses the KS statistic to assess the gap between intra‐ and inter-cluster distance distributions.
  • It computes ECDFs for pairwise distances within and between clusters, averaging bi-directional KS distances to form a robust separability index.
  • The approach is applicable across various cluster shapes and sizes, providing an interpretable geometric margin for cluster validity and model selection.

Absolute DSI for Cluster Margins quantifies the degree of separation between clusters in a dataset through direct analysis of inter- and intra-cluster distance distributions. Using the Kolmogorov–Smirnov (KS) statistic applied to empirical cumulative distribution functions (ECDFs) of these distances, the Absolute Distance-based Separability Index (DSI) provides a scalar in $[0,1]$ that measures how well any cluster (or cluster pair) is isolated from others in arbitrary feature spaces. This index is “absolute” because it is inherently normalized, requires no class labels or external reference, and does not depend on the clustering method or data representation, thereby serving both as a cluster validity index and as a theoretically interpretable geometric margin.

1. Mathematical Formulation

Let $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^d$ be partitioned into $K$ clusters $C_1, \ldots, C_K$ of sizes $N_1, \ldots, N_K$, with $\sum_i N_i = N$. For any two clusters $C_i$, $C_j$ ($i \neq j$):

  • Intra-cluster distance set: $D_i = \{\|x - x'\| : x, x' \in C_i,\, x \neq x'\}$
  • Inter-cluster distance set: $D_{i,j} = \{\|x - y\| : x \in C_i,\, y \in C_j\}$

The ECDFs $F_i$ and $F_{i,j}$ are constructed for $D_i$ and $D_{i,j}$, respectively. The cluster margin between $C_i$ and $C_j$ is defined by the symmetrized KS distance:

$$\mathrm{DSI}(C_i, C_j) = \tfrac{1}{2}\left[\mathrm{KS}(F_i, F_{i,j}) + \mathrm{KS}(F_j, F_{i,j})\right]$$

where $\mathrm{KS}(F, G) = \sup_r |F(r) - G(r)|$. $\mathrm{DSI}(C_i, C_j)$ ranges in $[0,1]$ and quantifies the largest observed discrepancy between the intra-cluster and cross-cluster distance distributions, interpreted as the absolute cluster margin (Guan et al., 2021).
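As a hand-checkable illustration of the definition, consider two hypothetical toy clusters on the real line (the data are invented for this example; $D_i$, $D_{i,j}$ denote the intra- and inter-cluster distance sets and $F_i$, $F_{i,j}$ their ECDFs):

```latex
\begin{align*}
C_i &= \{0, 1\}, \quad C_j = \{10, 11\} \subset \mathbb{R} \\
D_i &= D_j = \{1\}, \quad D_{i,j} = \{9, 10, 10, 11\} \\
F_i(r) &= F_j(r) = 1 \ \text{for } r \ge 1, \qquad F_{i,j}(r) = 0 \ \text{for } r < 9 \\
\mathrm{KS}(F_i, F_{i,j}) &= \sup_r |F_i(r) - F_{i,j}(r)| = 1 \quad (\text{attained on } 1 \le r < 9) \\
\mathrm{DSI}(C_i, C_j) &= \tfrac{1}{2}(1 + 1) = 1
\end{align*}
```

If the second cluster is instead placed at $\{1.5, 2.5\}$, the inter-cluster distances become $\{0.5, 1.5, 1.5, 2.5\}$, the largest ECDF gap is $3/4$ (on $1 \le r < 1.5$), and the margin drops to $3/4$.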

2. Methodological Workflow

The computation of Absolute DSI for any cluster pair involves the following steps:

  1. Distance extraction: Compute all pairwise intra-cluster distances for $C_i$ and $C_j$, and all between-cluster distances for the pair $(C_i, C_j)$.
  2. Empirical distribution construction: Build ECDFs for the intra- and inter-cluster distance sets.
  3. KS statistic evaluation: Compute the KS distance between the intra-cluster ECDF and the inter-cluster ECDF, once for each of the two clusters.
  4. Margin aggregation: Symmetrize the result by averaging the KS distances for $C_i$ and $C_j$ to obtain $\mathrm{DSI}(C_i, C_j)$ in $[0,1]$.

This procedure applies to general feature spaces and cluster sizes. For large data, random subsampling of the distance sets reduces the computational cost (Guan et al., 2021, Guan et al., 2021, Guan et al., 2020).
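The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function names, the `max_pairs` parameter, and the choice to subsample by drawing random point pairs are all hypothetical, and Euclidean distance is assumed.

```python
import bisect
import itertools
import math
import random


def ks_distance(a, b):
    """Two-sample KS statistic sup_r |F_a(r) - F_b(r)| between the ECDFs of a and b."""
    a, b = sorted(a), sorted(b)
    # Both ECDFs are right-continuous step functions, so the supremum is
    # attained at one of the observed values.
    return max(
        abs(bisect.bisect_right(a, r) / len(a) - bisect.bisect_right(b, r) / len(b))
        for r in a + b
    )


def intra_distances(cluster, max_pairs, rng):
    """All within-cluster distances, or max_pairs randomly drawn ones (assumed scheme)."""
    n_pairs = len(cluster) * (len(cluster) - 1) // 2
    if max_pairs is None or n_pairs <= max_pairs:
        return [math.dist(p, q) for p, q in itertools.combinations(cluster, 2)]
    return [math.dist(*rng.sample(cluster, 2)) for _ in range(max_pairs)]


def inter_distances(cluster_i, cluster_j, max_pairs, rng):
    """All between-cluster distances, or max_pairs randomly drawn ones (assumed scheme)."""
    if max_pairs is None or len(cluster_i) * len(cluster_j) <= max_pairs:
        return [math.dist(p, q) for p in cluster_i for q in cluster_j]
    return [math.dist(rng.choice(cluster_i), rng.choice(cluster_j))
            for _ in range(max_pairs)]


def pairwise_dsi(cluster_i, cluster_j, max_pairs=None, seed=0):
    """Symmetrized KS margin between two clusters given as lists of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: distance extraction (optionally subsampled).
    d_i = intra_distances(cluster_i, max_pairs, rng)
    d_j = intra_distances(cluster_j, max_pairs, rng)
    d_ij = inter_distances(cluster_i, cluster_j, max_pairs, rng)
    # Steps 2-3: the ECDFs are built implicitly inside ks_distance.
    # Step 4: average the two KS distances.
    return 0.5 * (ks_distance(d_i, d_ij) + ks_distance(d_j, d_ij))


# Hypothetical demo data: three 2-D Gaussian blobs.
_rng = random.Random(1)
blob_a = [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(60)]
blob_b = [(_rng.gauss(20, 1), _rng.gauss(20, 1)) for _ in range(60)]
blob_c = [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(60)]

far = pairwise_dsi(blob_a, blob_b)                       # well-separated pair
far_sampled = pairwise_dsi(blob_a, blob_b, max_pairs=200)  # same pair, subsampled
mixed = pairwise_dsi(blob_a, blob_c)                     # pair drawn from one distribution
print(f"far={far:.3f} far_sampled={far_sampled:.3f} mixed={mixed:.3f}")
```

Where SciPy is available, `scipy.stats.ks_2samp` returns the same two-sample statistic and could replace the hand-written `ks_distance`.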

3. Theoretical Properties and Interpretation

Absolute DSI exhibits the following properties:

  • Range and normalization: $\mathrm{DSI} \in [0,1]$ by construction. No further normalization is required.
  • Separability semantics: $\mathrm{DSI} = 1$ when intra- and inter-cluster ECDFs are completely disjoint (perfect separation); $\mathrm{DSI} = 0$ when they coincide (no separation, random labeling, or complete overlap).
  • Monotonicity: Increasing the separation between clusters (by moving them apart without changing internal structure) increases $\mathrm{DSI}$.
  • Robustness: The index is agnostic to cluster shape, size, and center. It only depends on the set of pairwise distances and thus captures irregular or non-convex cluster boundaries (Guan et al., 2021).
  • Sensitivity: The KS statistic is determined by the single largest deviation between the ECDFs, so it can register a significant margin “gap” even if most of the two distributions overlap (Guan et al., 2021).
  • Label invariance: DSI is unaffected by the ordering or permutation of cluster assignments.

A plausible implication is that while DSI gives an interpretable and quantitative margin, it may underweight multi-modalities in the distance spectrum if a single “gap” dominates the ECDF difference.
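The range and monotonicity properties can be checked numerically on a small, fully deterministic case. The sketch below is an illustration only: it assumes one-dimensional integer data, a hypothetical helper `dsi_1d`, and a fixed set of offsets; it slides a copy of a cluster away from the original and records the margin at each offset.

```python
import bisect


def ks_distance(a, b):
    """Largest gap between the ECDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a, r) / len(a) - bisect.bisect_right(b, r) / len(b))
        for r in a + b
    )


def dsi_1d(ci, cj):
    """Symmetrized KS margin for two clusters of scalars (hypothetical helper)."""
    d_i = [abs(x - y) for k, x in enumerate(ci) for y in ci[k + 1:]]
    d_j = [abs(x - y) for k, x in enumerate(cj) for y in cj[k + 1:]]
    d_ij = [abs(x - y) for x in ci for y in cj]
    return 0.5 * (ks_distance(d_i, d_ij) + ks_distance(d_j, d_ij))


base = list(range(11))            # cluster C_i: the integers 0..10
offsets = [5, 10, 15, 20, 25]     # how far the copy C_j is shifted
margins = [dsi_1d(base, [x + s for x in base]) for s in offsets]

for s, m in zip(offsets, margins):
    print(f"offset={s:2d}  DSI={m:.3f}")
```

In this example the margin grows with the offset and saturates at 1 once every between-cluster distance exceeds every within-cluster distance.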

4. Comparison with Alternative Margin and Separability Indices

Absolute DSI generalizes classical margin concepts in cluster analysis by moving beyond minimum pairwise distances and worst-case scenarios (as in the Dunn index):

  • Dunn index: Relies on the ratio of the minimal inter-cluster distance to maximal intra-cluster diameter, targeting the “weakest” cluster separation. By contrast, absolute DSI is the average of maximal ECDF gaps aggregated over all cluster pairs, offering a more global and robust measure (Bagirov et al., 15 Oct 2025).
  • Margin-based absolute indices: Extensions using explicit cluster-center gaps, while interpretable in physical units, are sensitive to outliers and shape assumptions. DSI, being ECDF-based, adapts to arbitrary distributions and substructure (Bagirov et al., 15 Oct 2025).
  • Complexity and class separability measures: DSI is directly interpretable as “separability”; using $1 - \mathrm{DSI}$ yields an intrinsic class or cluster “complexity score” without additional normalization (Guan et al., 2021).

Comparison studies demonstrate that DSI is empirically competitive with or superior to established internal CVIs across a range of real and synthetic benchmarks (Guan et al., 2021, Guan et al., 2020, Bagirov et al., 15 Oct 2025).
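To make the contrast with the Dunn index concrete, the following sketch computes it exactly as described above (minimal inter-cluster distance divided by maximal intra-cluster diameter) on two hypothetical toy clusterings. The data and the function name `dunn_index` are illustrative assumptions; the point is only to show how a single stray point changes a worst-case index.

```python
import itertools
import math


def dunn_index(clusters):
    """Minimal inter-cluster distance divided by maximal intra-cluster diameter."""
    min_inter = min(
        math.dist(p, q)
        for a, b in itertools.combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q)
        for c in clusters
        for p, q in itertools.combinations(c, 2)
    )
    return min_inter / max_diameter


# Two compact, well-separated clusters (hypothetical data).
clean = [
    [(0, 0), (1, 0), (0, 1)],
    [(10, 0), (11, 0), (10, 1)],
]
# Same data with one stray point assigned to the first cluster.
with_stray = [clean[0] + [(5, 0)], clean[1]]

print(f"Dunn (clean)      = {dunn_index(clean):.3f}")
print(f"Dunn (with stray) = {dunn_index(with_stray):.3f}")
```

The single added point both shrinks the minimal gap and inflates the largest diameter, so the ratio falls from 9/√2 to 5/√26, which is the worst-case behaviour the ECDF-based margin is meant to soften.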

5. Extensions: One-dimensional Margins and Shrinkage Approaches

For one-dimensional clusters, an analogous absolute DSI is constructed using the diameter-shrinkage ratio under extreme-point trimming:

  • Define the span $D_0 = x_{(n)} - x_{(1)}$ for sorted samples $x_{(1)} \le \cdots \le x_{(n)}$.
  • Trim $k$ extreme points per end to obtain $D_k = x_{(n-k)} - x_{(k+1)}$.
  • Compute the shrinkage ratio $\rho_k = D_k / D_0$.

This profile can be compared against null models (uniform and Gaussian) for geometric validation. The “absolute DSI” is thus the vector of shrinkage ratios up to a maximum trimming depth, distinguishing compact clusters with heavy-tailed margins from uniform segments or noise (Dereure et al., 29 Aug 2025). The approach is robust for modest sample sizes and requires no parameter tuning or density estimation, integrating directly into clustering validation pipelines.
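A shrinkage profile of this kind might be computed as in the sketch below. The exact definition is an assumption made for illustration: the ratio is taken here as the span after trimming `k` points from each end divided by the untrimmed span, and the function name and `max_depth` parameter are hypothetical. The comparison against uniform or Gaussian null models is not included.

```python
def shrinkage_profile(samples, max_depth):
    """Trimmed-span / full-span ratios for trimming depths k = 1..max_depth.

    Assumed definition: ratio_k = (x_(n-k) - x_(k+1)) / (x_(n) - x_(1))
    for the sorted samples x_(1) <= ... <= x_(n).
    """
    xs = sorted(samples)
    n = len(xs)
    full_span = xs[-1] - xs[0]
    if full_span == 0:
        raise ValueError("all samples are identical; the span is zero")
    profile = []
    for k in range(1, max_depth + 1):
        if n - 2 * k < 2:  # fewer than two points would remain after trimming
            break
        profile.append((xs[n - 1 - k] - xs[k]) / full_span)
    return profile


# Evenly spaced points: the span shrinks linearly with the trimming depth.
even = shrinkage_profile(range(11), max_depth=3)

# Compact core plus one far-away point: the first trim removes most of the span.
with_outlier = shrinkage_profile(list(range(10)) + [100], max_depth=3)

print("evenly spaced:", even)
print("with outlier: ", with_outlier)
```

A sharp drop at the first trimming depth, as in the second case, indicates that the span is dominated by a few extreme points rather than by the bulk of the cluster.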

6. Practical Significance and Empirical Observations

Absolute DSI has proved effective as a cluster validity measure:

  • On UCI datasets and synthetic benchmarks, DSI achieves high correlation with external ground-truth metrics such as Adjusted Rand Index, often ranking among the top internal indices for selecting optimal clustering solutions (Guan et al., 2021, Guan et al., 2020).
  • Example: For the wine data with $K = 3$, k-means clustering achieves ARI = 0.913 and DSI = 0.635, underscoring the alignment of this margin-based internal index with external validation.
  • In synthetic settings (e.g., non-overlapping Gaussian blobs), $\mathrm{DSI}$ approaches 1, while for overlapping or mixed distributions, it drops towards 0, as predicted by its theoretical construction (Guan et al., 2021).

The index’s robustness against cluster cardinality, shape, and high-dimensional scaling is supported by extensive numerical experiments, although dimensionality effects may warrant pre-processing (e.g., PCA) when distance concentration becomes an issue.

7. Limitations, Interpretive Guidelines, and Application Advice

Interpreting $\mathrm{DSI}$ requires attention to the context:

  • Thresholding: DSI thresholds such as 0.8 are sometimes used heuristically to determine whether cluster pairs are sufficiently separated to remain distinct, or should be merged (Guan et al., 2021).
  • Scalability: For large $N$, subsampling of distance sets is recommended, as the full pairwise computation scales as $O(N^2)$.
  • Cluster structure distortions: In the presence of highly skewed clusters, the KS-based margin can capture only the dominant separation, and secondary or local modes may be missed.
  • Selection of $K$: Plotting DSI or average cluster margins against $K$ can inform model selection and cluster number validation, often seeking the non-negative maximum or joint optimization with compactness indices (Bagirov et al., 15 Oct 2025).
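The thresholding heuristic can be turned into a merge rule once pairwise margins are available. The sketch below is one possible reading, not a prescribed procedure: it assumes margins are supplied as a dictionary keyed by cluster-index pairs, merges transitively with a small union-find structure, and uses made-up margin values; `merge_groups` and its parameters are hypothetical names.

```python
def merge_groups(n_clusters, pair_margins, threshold=0.8):
    """Group clusters whose pairwise margin falls below the threshold.

    pair_margins maps (i, j) index pairs to a margin in [0, 1].
    Merging is transitive (an assumption of this sketch): if i merges with j
    and j with k, all three end up in one group.
    """
    parent = list(range(n_clusters))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), margin in pair_margins.items():
        if margin < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_clusters):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical margins for four clusters; only the pair (1, 2) is poorly separated.
margins = {
    (0, 1): 0.95, (0, 2): 0.97, (0, 3): 0.90,
    (1, 2): 0.41, (1, 3): 0.92, (2, 3): 0.88,
}
print(merge_groups(4, margins))
```

Because the 0.8 cut-off is heuristic, it is worth rerunning such a rule at a few nearby thresholds to see whether the resulting grouping is stable.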

A plausible implication is that while absolute DSI offers a principled geometric margin and a strong internal validity index for unsupervised clustering, its empirical utility may be further enhanced by complementary use with compactness scores and model-based criteria for comprehensive validation.
