Hierarchical Clustering & Balanced Sampling Pipeline

Updated 10 April 2026

Hierarchical clustering and balanced sampling pipelines are techniques that combine multi-level clustering with strategic sampling to preserve both structural detail and statistical balance in datasets.
They integrate methods such as agglomerative node-pair sampling, exponentially twisted distributions, and fair post-processing to support scalable curation and robust representation learning.
These pipelines are applied in graph mining, self-supervised learning, and fairness-aware analysis, enhancing data diversity and mitigating biases in large-scale applications.

Hierarchical clustering and balanced sampling pipelines are foundational mechanisms in large-scale data curation, unsupervised representation learning, and scalable graph mining. These pipelines integrate multi-level partitioning algorithms with principled sampling strategies to achieve both structural fidelity and statistical balance, addressing the limitations of manual selection and naive clustering in domains from self-supervised learning to fairness-aware analysis.

1. Core Concepts and Objectives

Hierarchical clustering constructs a tree-structured decomposition (a dendrogram) of a dataset, successively grouping points into clusters at multiple scales and thereby capturing multi-level structural information. Balanced sampling ensures that the resulting clusters, or samples drawn from them, represent the diversity and inherent structure of the data rather than overemphasizing majority modes or high-density regions. Combining the two produces large, diverse, and conceptually balanced datasets suited to downstream training, evaluation, or statistical analysis (Vo et al., 2024).

The main objectives are:

- To recover the multiscale community structure of the data, preserving rare and common patterns alike.
- To curate subsets from massive data pools that are uniform over "concepts" (latent classes or distributions).
- To guarantee balance (by size, class, or other attribute) across all levels of the hierarchy.
- To ensure algorithmic scalability to hundreds of millions or billions of samples.

2. Hierarchical Clustering Algorithms

A range of hierarchical clustering algorithms have been developed, each with distinct theoretical properties and computational characteristics.

Agglomerative Node-Pair Sampling: Paris

The Paris algorithm (Bonald et al., 2018) operates on a weighted graph $G=(V,E)$ with adjacency matrix $A$. It defines a probability distribution over neighboring node pairs:

$p(i,j) = \frac{A_{ij}}{w}, \quad w = \sum_{i,j} A_{ij}$

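With node marginals $p(i) = \sum_j p(i,j)$, cluster weights $p(a) = \sum_{i \in a} p(i)$, and pair weights $p(a,b) = \sum_{i \in a,\, j \in b} p(i,j)$, the distance between clusters $a$ and $b$ is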

$d(a,b) = \frac{p(a)\,p(b)}{p(a,b)}$

with the key reducibility property:

$d(a \cup b, c) \geq \min\{d(a,c), d(b,c)\}$

This ensures regularity (no inversions) in the dendrogram and enables the nearest-neighbor-chain acceleration, reducing runtime to near-linear in $|E|$. The method is parameter-free, fully deterministic, and produces clusterings at all resolutions via dendrogram cuts, naturally balancing edge-based node sampling and multiscale community detection.
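The merge rule above can be illustrated with a deliberately naive sketch. It recomputes $d(a,b)$ for every connected cluster pair at each step and merges the closest pair, so it does not use the nearest-neighbor-chain acceleration and is not the reference implementation. The function name `paris_naive` and the toy graph are assumptions made for illustration only.

```python
def paris_naive(A):
    """Naive agglomerative node-pair clustering on a symmetric weight matrix A.

    Returns a list of merges as (left_members, right_members, distance).
    Illustrative only: every step scans all connected cluster pairs.
    """
    n = len(A)
    w = sum(sum(row) for row in A)                 # total weight
    members = {i: [i] for i in range(n)}
    weight = {i: sum(A[i]) for i in range(n)}      # equals w * p(a)
    adj = {i: {j: A[i][j] for j in range(n) if j != i and A[i][j] > 0}
           for i in range(n)}                      # equals w * p(a, b)
    merges, next_id = [], n
    while len(members) > 1:
        best = None
        for a in adj:
            for b, w_ab in adj[a].items():
                if a < b:
                    # d(a, b) = p(a) p(b) / p(a, b)
                    d = (weight[a] / w) * (weight[b] / w) / (w_ab / w)
                    if best is None or d < best[0]:
                        best = (d, a, b)
        if best is None:                           # remaining clusters are disconnected
            break
        d, a, b = best
        c, next_id = next_id, next_id + 1
        merges.append((members[a], members[b], d))
        members[c] = members.pop(a) + members.pop(b)
        weight[c] = weight.pop(a) + weight.pop(b)
        # Aggregate the edges of a and b into the new cluster c.
        nbrs = {}
        for old in (a, b):
            for x, w_x in adj.pop(old).items():
                if x != a and x != b:
                    nbrs[x] = nbrs.get(x, 0.0) + w_x
        for x, w_x in nbrs.items():
            adj[x].pop(a, None)
            adj[x].pop(b, None)
            adj[x][c] = w_x
        adj[c] = nbrs
    return merges


# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0
merges = paris_naive(A)
```

On this toy graph the last merge joins the two triangles, and the recorded distances never decrease from one merge to the next, which is the no-inversion behavior the reducibility property is meant to provide.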

Exponentially Twisted Sampling and Softmax/iPHD

Chang & Chang’s framework (Chang et al., 2017) generalizes sampling to semi-metric spaces using exponentially twisted distributions:

$p_\theta(i,j) = \frac{1}{Z(\theta)} \exp(-\theta d(x_i, x_j)), \quad Z(\theta) = \sum_{i,j} \exp(-\theta d(x_i, x_j))$

Softmax clustering maximizes an unnormalized modularity objective over soft cluster assignments, leading to a local optimum partition. Hard clusterings and embeddings are recovered in the high or low-temperature limits, and the pipeline alternates this soft partitioning with agglomerative merging (iPHD), tuning $\theta$ for resolution control and ensuring balance by design.
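As a small illustration of the twisted distribution $p_\theta(i,j)$ defined above, the following sketch evaluates it on a precomputed distance matrix. It assumes $\theta \ge 0$ and a finite point set; the function name and the three-point example are hypothetical, and the softmax clustering and iPHD steps are not implemented here.

```python
import math


def twisted_pair_distribution(D, theta):
    """Return p_theta(i, j) proportional to exp(-theta * D[i][j]).

    D is a square matrix of pairwise (semi-metric) distances. The sum runs
    over all ordered pairs (i, j), including i == j, as in the formula above.
    Assumes theta >= 0; the minimum distance is subtracted for numerical stability.
    """
    n = len(D)
    d_min = min(min(row) for row in D)
    weights = [[math.exp(-theta * (D[i][j] - d_min)) for j in range(n)]
               for i in range(n)]
    Z = sum(sum(row) for row in weights)           # normalizing constant Z(theta)
    return [[w_ij / Z for w_ij in row] for row in weights]


# Three points on a line: two close together, one far away.
xs = [0.0, 1.0, 10.0]
D = [[abs(a - b) for b in xs] for a in xs]
p_flat = twisted_pair_distribution(D, theta=0.0)   # uniform over pairs
p_sharp = twisted_pair_distribution(D, theta=1.0)  # favors nearby pairs
```

At $\theta = 0$ every pair is equally likely. Increasing $\theta$ shifts probability mass toward nearby pairs, which is how $\theta$ acts as a resolution control.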

Balanced Distributed Clustering: Matching Affinity Clustering

Matching Affinity Clustering (Hajiaghayi et al., 2021) targets massive graphs in the massively parallel communication (MPC) model. Rather than merging a single pair at a time, it finds maximum matchings among clusters and merges the matched pairs in bulk, guaranteeing that cluster sizes differ by at most a factor of two, or are exactly balanced when the number of input points $n$ is a power of two. The algorithm achieves constant-factor approximations to known objective functions (Moseley–Wang revenue and Cohen–Addad value) and scales in the MPC model with near-perfect balance.
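The effect of merging matched pairs in bulk on cluster sizes can be seen in a single-machine toy sketch. It substitutes a greedy matching on average inter-cluster similarity for the maximum matching of the actual algorithm and ignores the MPC setting entirely; `matching_merge_round` and the random similarity matrix are assumptions made for illustration.

```python
import random


def avg_similarity(S, a, b):
    """Average pairwise similarity between member lists a and b."""
    return sum(S[i][j] for i in a for j in b) / (len(a) * len(b))


def matching_merge_round(clusters, S):
    """Pair up clusters greedily by similarity and merge each matched pair.

    Greedy matching is a stand-in here, not the maximum matching used by
    Matching Affinity Clustering.
    """
    k = len(clusters)
    pairs = sorted(
        ((avg_similarity(S, clusters[a], clusters[b]), a, b)
         for a in range(k) for b in range(a + 1, k)),
        reverse=True,
    )
    used, merged = set(), []
    for _, a, b in pairs:
        if a not in used and b not in used:
            used.update((a, b))
            merged.append(clusters[a] + clusters[b])
    # With an odd number of clusters, one cluster stays unmatched this round.
    merged.extend(clusters[i] for i in range(k) if i not in used)
    return merged


# Random symmetric similarities on n = 8 points (a power of two).
rng = random.Random(0)
n = 8
S = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        S[i][j] = S[j][i] = rng.random()

clusters = [[i] for i in range(n)]
history = []
while len(clusters) > 1:
    clusters = matching_merge_round(clusters, S)
    history.append(sorted(len(c) for c in clusters))
```

With eight points, every round pairs up all clusters, so sizes go from 2 to 4 to 8 and stay exactly equal within each level.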

Fair and Balanced Dendrogram Conversion

Any pre-existing hierarchical clustering can be post-processed via the cost/fairness-preserving reductions of Knittel et al. (2022), enforcing strong balance (e.g., $\epsilon$-relative balance and multicolor fairness) at every split while provably bounding the increase in Dasgupta's cost to $O(n^{\delta}\,\mathrm{polylog}(n))$ for any constant $\delta \in (0,1)$. This makes balance and fairness tractable without sacrificing global cost objectives.

3. Hierarchical k-means and Balanced Sampling Pipelines

Recent large-scale data curation demands pipelines that extract balanced, diverse subsets from massive embedding pools. A typical architecture comprises:

Hierarchical k-means with Resampling
- For a dataset $X$ of $n$ points, recursively cluster via $T$ levels of k-means (with $k_t$ clusters at level $t$), each level operating either on the data or on the centroids from the previous level.
- At each level, "resampling-clustering" steps are applied: for each cluster, sample a fixed number of points (by proximity or random draw), rerun k-means on these, refine centroids, reassign the full input, and iterate.
- This prevents rapid input shrinkage, counteracts centroid degeneracy, and makes centroid distributions increasingly uniform over the dataset support (Vo et al., 2024).
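
Mathematical Foundation
- Each k-means step minimizes the within-cluster sum of squared distances
$\min_{c_1,\dots,c_k} \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2$
- Centroids at level $t$ approximate draws from a density proportional to $p^{(d/(d+2))^{t}}$, where $p$ is the data density and $d$ the embedding dimension, so repeated application pushes the centroid distribution toward uniformity.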
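The multi-level structure can be sketched in plain Python as follows. This is a minimal illustration, not the implementation of Vo et al. (2024): it uses basic Lloyd iterations with random initialization, omits the resampling-clustering refinement, and the names `kmeans`, `hierarchical_kmeans`, and `kmeans_objective` are hypothetical.

```python
import random


def sqdist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sqdist(x, centroids[j]))
            for x in points]


def kmeans_objective(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    return sum(min(sqdist(x, c) for c in centroids) for x in points)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations; empty clusters keep their previous centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def hierarchical_kmeans(points, ks, seed=0):
    """Level 1 clusters the data; each later level clusters the previous centroids."""
    levels, current = [], points
    for t, k in enumerate(ks):
        centroids, labels = kmeans(current, k, seed=seed + t)
        levels.append((centroids, labels))
        current = centroids
    return levels


# Toy data: three Gaussian blobs in 2D, clustered into 8 and then 3 groups.
rng = random.Random(1)
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(40)]
levels = hierarchical_kmeans(points, ks=[8, 3])
```

Each entry of `levels` holds the centroids of that level and the assignment of the previous level's items to them, which is enough to reconstruct the cluster tree from leaves to root.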

Balanced Hierarchical Sampling
- For a target global subset size $N$, recursively allocate sampling budgets from root to leaves of the cluster tree.
- At each cluster, decide local sample allocations via integer optimization (binary search) to approximate uniform representation while strictly enforcing the overall budget.
- Within the smallest clusters, sample points uniformly at random or by proximity to centroids (random methods demonstrate superior downstream performance).

These steps yield a hierarchically curated, concept-balanced subset of the original dataset, suitable for large-scale self-supervised pretraining or analytic tasks.
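One way to realize the budget allocation described above is a capped, even split: binary-search the largest per-cluster cap whose total stays within the budget, then hand out any remainder one sample at a time. The sketch below is an assumption about how such a step could look, not the procedure of Vo et al. (2024). The tree is represented as nested Python lists whose leaves are lists of point ids, and all function names are hypothetical.

```python
import random


def allocate(sizes, budget):
    """Split `budget` across clusters as evenly as possible, capped by cluster size."""
    budget = min(budget, sum(sizes))
    lo, hi = 0, max(sizes, default=0)
    while lo < hi:                      # largest cap with sum(min(s, cap)) <= budget
        mid = (lo + hi + 1) // 2
        if sum(min(s, mid) for s in sizes) <= budget:
            lo = mid
        else:
            hi = mid - 1
    alloc = [min(s, lo) for s in sizes]
    leftover = budget - sum(alloc)
    for i, s in enumerate(sizes):       # distribute the remainder, one sample each
        if leftover == 0:
            break
        if alloc[i] < s:
            alloc[i] += 1
            leftover -= 1
    return alloc


def is_leaf(node):
    return not node or not isinstance(node[0], list)


def count(node):
    return len(node) if is_leaf(node) else sum(count(child) for child in node)


def sample_tree(node, budget, rng):
    """Allocate the budget from the root down, sampling uniformly inside leaves."""
    if is_leaf(node):
        return rng.sample(node, min(budget, len(node)))
    sizes = [count(child) for child in node]
    picked = []
    for child, b in zip(node, allocate(sizes, budget)):
        picked.extend(sample_tree(child, b, rng))
    return picked


# Toy tree: two top-level clusters, each with one large and one small leaf.
tree = [
    [list(range(0, 100)), list(range(100, 105))],
    [list(range(200, 201)), list(range(300, 350))],
]
subset = sample_tree(tree, budget=20, rng=random.Random(0))
```

In this toy tree the small leaves are taken in full while the large leaves are subsampled, so rare clusters are not crowded out by frequent ones and the total still equals the requested budget.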

4. Balance, Fairness, and Resolution Tuning

Balanced sampling in these pipelines arises both explicitly and implicitly:

- Node-pair or exponentially twisted sampling ensures that neither high-degree nodes (in graphs) nor high-density regions (in continuous spaces) dominate merges or selections (Bonald et al., 2018, Chang et al., 2017).
- Matching-based and rebalancing algorithms enforce strict size parity at every merge, either by design (maximum matchings (Hajiaghayi et al., 2021)) or post hoc by tree transformations (Knittel et al., 2022).
- Exponentially twisted or marginal-weight-based formulations allow systematic control over clustering resolution: tuning parameters such as $\theta$ or dendrogram cut-heights yield fine-to-coarse partitions, permit selective biasing toward rare or abundant modes, and facilitate the formation of multi-resolution, balanced hierarchies.

For fairness constraints (e.g., in demographically labeled data), operations such as fold/abstract and level-wise subtree mixing provably approximate desired ratios at all levels with only polylogarithmic cost blowup (Knittel et al., 2022).

5. Computational Complexity and Practical Integration

All major algorithms covered address scalability and empirical efficiency:

Algorithm/Pipeline	Memory Complexity	Key Runtime Guarantees
Paris / Node-pair sampling (Bonald et al., 2018)	[expression missing]	Near-linear in $|E|$
Matching Affinity Clustering (Hajiaghayi et al., 2021)	[expression missing] per machine (MPC)	[expression missing] rounds in MPC
Hierarchical k-means + sampling (Vo et al., 2024)	[expression missing]	Runs over 743M points in days (GPU/cluster)
Fair/balanced reductions (Knittel et al., 2022)	[expression missing]	Polylog factor cost increase

Pipelines are typically implemented as separate, composable components (e.g., clustering, assignment, sampling, reduction). Choices such as the number of levels, clusters per level, resampling passes, and sampling strategy are driven by empirical validation, with observed downstream gains from multi-level resample–cluster structures and random, hierarchical sampling.

6. Applications and Empirical Evidence

These balanced hierarchical clustering pipelines support a spectrum of data analysis and machine learning applications:

- Self-supervised learning data curation: Balanced hierarchical sampling enables automatic dataset assembly for large-scale image, text, and remote sensing domains, outperforming uncurated and even expert-curated alternatives on long-tail and OOD evaluation metrics (Vo et al., 2024).
- Graph mining: Hierarchical algorithms such as Paris reveal multi-scale community structure on real-world graphs (road networks, Wikipedia, flight networks), correctly identifying stable splits and rare modules (Bonald et al., 2018).
- Fair analysis: Reductions guarantee fairness (across protected attributes or cluster sizes) for any input dendrogram, facilitating compliance and ethical analysis in modern ML pipelines (Knittel et al., 2022).
- Distributed and massive-scale clustering: MPC-based matching methods enable balanced clustering on edge-weighted graphs with billions of nodes, under precise theoretical guarantees (Hajiaghayi et al., 2021).

A central result is that hierarchical, resample–cluster–balanced sampling frameworks are empirically and theoretically superior to ad hoc or shallow methods, especially in maintaining robustness, diversity, and fairness under distributional shift or scale (Vo et al., 2024, Hajiaghayi et al., 2021).

7. Adaptation Guidelines and Parameter Selection

- Clustering levels and granularity: Increasing depth (number of cluster levels) improves downstream performance up to a saturation point (Vo et al., 2024).
- Cluster counts and resampling: Finer granularity (more clusters per level) improves long-tail capture, but with diminishing returns and increased cost.
- Sampling within leaves: Random selection is preferable to picking the points closest to or farthest from the centroid, yielding better coverage of both head and tail concepts.
- Initialization: k-means++ is essential for cluster diversity; random initialization results in significant performance drops.
- Fairness/balance parameters: For precise balance, set the relative-balance parameter as prescribed by Knittel et al. (2022). For color fairness, adjust the fairness parameters and choose subtree partition/fold batch sizes according to data scale.
- Downstream integration: Output cluster labels and sampled sets for subsequent tasks (SSL pretraining, visualization, stratified analysis). Modular implementation of clustering, resampling, and rebalancing subroutines is recommended.
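Since the guidelines above single out k-means++ initialization, a minimal sketch of the standard seeding rule may be useful: the first centroid is drawn uniformly, and each subsequent one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the toy data are assumptions; production code would normally rely on a library implementation.

```python
import random


def sqdist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: spread initial centroids out via D^2-weighted sampling."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    # Squared distance from every point to its nearest chosen centroid.
    d2 = [sqdist(x, centroids[0]) for x in points]
    while len(centroids) < k:
        if sum(d2) == 0:                # every point coincides with a centroid
            new = rng.choice(points)
        else:
            new = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(new)
        d2 = [min(old, sqdist(x, new)) for old, x in zip(d2, points)]
    return centroids


# Three tight, well-separated groups: the seeding picks one point from each.
points = [(0, 0)] * 10 + [(10, 10)] * 10 + [(20, 0)] * 10
seeds = kmeans_pp_init(points, k=3)
```

Because points that already coincide with a chosen centroid receive zero weight, the three seeds in this toy example always land in three different groups, whereas uniform random initialization could place several seeds in the same group.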

These pipelines collectively provide a principled, efficient, and extensible toolkit for constructing balanced multi-scale data representations and subsets across diverse modern large-scale data scenarios.