Clustering-Based Active Learning Strategy

Updated 14 August 2025
  • Clustering-based active learning is an approach that uses clustering techniques like fast mode-seeking to identify valuable samples for efficient labeling in large datasets.
  • FMS-AL employs multi-scale, density-adaptive clustering to propagate labels from modal representatives to entire clusters, achieving lower error rates compared to traditional methods.
  • The strategy offers subquadratic time complexity and scalability for massive data but requires re-clustering for out-of-sample predictions due to its non-inductive nature.

A clustering-based active learning strategy is an approach to sample selection in active learning that leverages the discovered structure of data via clustering algorithms. Rather than solely relying on model uncertainty or informativeness metrics, these strategies use cluster assignments, modal representatives, or cluster-derived similarity information to identify which data points are most valuable for labeling. This paradigm encompasses a range of methodologies, from mode-seeking and centroid-based schemes to density, diversity, and fairness-aware systems. The following sections organize the landscape of clustering-based active learning, detailing key mechanisms, complexity, empirical performance, theoretical underpinnings, limitations, and practical implications, focusing especially on methodologies and results documented in (Duin et al., 2017).

1. Fast Mode-Seeking Clustering as a Basis for Active Learning

The fast kNN mode seeking (FMS) algorithm provides a computationally feasible alternative to standard mean-shift or kNN-based mode-seeking clustering, particularly for high-dimensional or large datasets (Duin et al., 2017). The procedure is characterized by:

  • Partitioning via Overlapping Cells: The dataset $\mathcal{S}$ is partitioned by first selecting a subset $P$ (of size $m \approx \sqrt{cn}$ for complexity parameter $c$) and forming "P-cells" by nearest-neighbor assignments. These are expanded into "Q-cells" of overlapping neighborhoods by considering the $c$ nearest $P$ prototypes.
  • Local Density Estimation: For each data point $x_i$, local densities $f_i^k = 1/s_{ik}$ are estimated, where $s_{ik}$ is the distance to the $k$th nearest neighbor within the candidate Q-cell.
  • Pointer Clustering and Modal Object Assignment: Each $x_i$ points to its densest local neighbor and, by following these pointers, arrives at a unique mode. The resulting mode forms the cluster representative or "modal object" for that group.

Compared to mean-shift—which relies on kernel bandwidth selection and iterative updates over the entire space—FMS only requires local computation within Q-cells. It delivers a multi-scale clustering hierarchy in a single pass by varying $k$.
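As a concrete illustration of the density-estimation and pointer-clustering steps, the following is a minimal naive kNN mode-seeking sketch, not the cell-partitioned FMS implementation of (Duin et al., 2017), which restricts all computations to Q-cells; function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_mode_seeking(X, k):
    """Naive kNN mode seeking for illustration; FMS restricts these steps to Q-cells."""
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    density = 1.0 / (dist[:, -1] + 1e-12)   # f_i^k = 1 / s_ik: inverse distance to the farthest of the k returned neighbours

    # Pointer clustering: each point points to a strictly denser neighbour, if one exists.
    pointer = np.arange(len(X))
    for i in range(len(X)):
        j = idx[i][np.argmax(density[idx[i]])]
        if density[j] > density[i]:
            pointer[i] = j

    # Follow pointers until every path ends in a modal object (a fixed point).
    modes = pointer.copy()
    while not np.array_equal(pointer[modes], modes):
        modes = pointer[modes]
    return modes, density                    # modes[i] = index of x_i's modal object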

Key properties:

  • Time complexity $O(n^{1.5})$, with empirical scaling approaching $O(n^{1.4})$ for moderate to large $n$.
  • Space complexity is $O(n)$.
  • Scalability demonstrated up to $1.5 \times 10^6$ samples with computation times substantially lower than mean-shift or naive all-pairs strategies.

2. Clustering-Guided Label Propagation for Active Learning

In the FMS-based clustering active learning framework, cluster representatives—modal objects—are labeled by an oracle, and these labels are propagated to all other members of their respective clusters. This yields a form of "one-shot" label propagation, referred to as FMS-AL (Active Learning via Fast Mode-Seeking).

Several mechanisms further enhance this process:

  • Multi-Scale Propagation: By simultaneously generating clusterings at multiple $k$ values, one can propagate and combine class credibility/confidence across levels. The update rule

$$Q^{(i+1)} = A^{(i+1)} Q^{(i)}$$

recursively aggregates confidences, where $A^{(i+1)}$ is a normalization matrix at scale $i+1$.

  • Hierarchical Rejection: The system can reject low-confidence assignments, e.g., at uncertain boundaries where clusters at different resolutions disagree.
  • No Classifier Training Required: Classification is determined solely by the clustering structure—no classifier generalizing to the ambient space is learned.

This methodology is particularly well suited for applications where large datasets must be labeled efficiently and the acquisition of a model for out-of-sample extension is secondary.
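Building on the sketch above, the one-shot propagation and a simplified multi-scale combination might look as follows. The `oracle` callable (mapping a sample index to an integer class label), the class count, and the rejection threshold are assumptions for illustration, and the plain averaging of one-hot votes stands in for the recursive update $Q^{(i+1)} = A^{(i+1)} Q^{(i)}$.

```python
def fms_al_label(X, k, oracle):
    """One-shot propagation: query labels for modal objects only, copy them to cluster members."""
    modes, _ = knn_mode_seeking(X, k)
    modal_labels = {m: oracle(m) for m in np.unique(modes)}   # oracle queries = number of clusters
    return np.array([modal_labels[m] for m in modes])

def fms_alc_confidence(X, ks, oracle, n_classes, reject_below=0.8):
    """Simplified multi-scale combination: average one-hot votes over several k values
    and reject points whose aggregated confidence stays low (boundary disagreement)."""
    Q = np.zeros((len(X), n_classes))
    for k in ks:
        Q += np.eye(n_classes)[fms_al_label(X, k, oracle)]    # assumes integer labels 0..n_classes-1
    Q /= len(ks)
    return Q.argmax(axis=1), Q.max(axis=1) >= reject_below    # predicted class, accept mask
```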

3. Theoretical Complexity and Scaling

A distinguishing aspect of FMS is its subquadratic complexity. Unlike brute-force mode seeking or mean shift, which require $O(n^2)$ distance calculations, FMS partitions the distance computation into local neighborhoods of size approximately $\sqrt{n}$, producing a global order of $O(n\sqrt{n})$ operations.

  • For $n = 10^4$ objects, clustering is computed in seconds;
  • For $n = 10^5$, in minutes;
  • For $n = 10^6$, in less than an hour.

Space usage remains linear in nn, as only neighborhood assignments, pointers, and densities are stored, not complete distance matrices.
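A back-of-the-envelope comparison makes the gap concrete; constant factors and hardware are ignored, so these are order-of-magnitude savings rather than the timings reported above.

```python
# Rough operation counts: all-pairs O(n^2) vs. FMS-style O(n * sqrt(n)), constants ignored.
for n in (10**4, 10**5, 10**6):
    print(f"n = {n:>9,}:  n^2 = {n**2:.1e}   n*sqrt(n) = {n**1.5:.1e}   savings = {n**0.5:,.0f}x")
```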

4. Empirical Performance: Classification and Clustering Metrics

Performance assessment in (Duin et al., 2017) is conducted on MNIST, Block Letters, Cursive Letters, and a large-scale ALLR dataset.

Key metrics and findings:

  • Normalized Mutual Information (NMI): The NMI between clusters and true classes approaches 1 as the resolution increases (i.e., as cluster sizes shrink), demonstrating high cluster purity at sufficient granularity.
  • Learning curves: When plotting classification error versus the number of modal object labels, FMS-AL curves consistently achieve lower error rates than either 1NN classifiers or SVMs trained on equally sized, randomly sampled labeled sets (a sketch of such a comparison follows this list).
  • Computation efficiency: The FMS-AL approach enables supervised labeling at a scale (over a million instances) not tractable with standard clustering or classifier training.
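One way such a learning-curve point could be produced, assuming the routines sketched earlier and a simulated annotator that simply reveals the ground-truth label of a queried index; this harness is an assumption for illustration, not the paper's evaluation code.

```python
from sklearn.neighbors import KNeighborsClassifier

def learning_curve_point(X, y_true, k, rng):
    """Error of FMS-AL propagation vs. a 1NN classifier trained on an equal random budget."""
    y_fms = fms_al_label(X, k, oracle=lambda i: y_true[i])     # simulated annotator
    budget = len(np.unique(knn_mode_seeking(X, k)[0]))         # number of modal objects labelled

    sample = rng.choice(len(X), size=budget, replace=False)    # same budget, random selection
    y_rnd = KNeighborsClassifier(n_neighbors=1).fit(X[sample], y_true[sample]).predict(X)
    return budget, float(np.mean(y_fms != y_true)), float(np.mean(y_rnd != y_true))
```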

Formulas used:

  • Cluster-level confidence updates:

$$q_{(ij)} = \frac{1}{|C_{(ij)}|} \sum_{x_t \in C_{(ij)}} q(x_t),$$

where $q_{(ij)}$ is the class confidence for cluster $(i,j)$.

  • NMI:

$$I_n(\eta,\lambda) = \frac{I(\eta,\lambda)}{\min\{H(\eta),H(\lambda)\}}$$

to quantify agreement between label assignment $\eta$ and cluster assignment $\lambda$.
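The min-normalized NMI above can be computed directly from the label/cluster contingency table; a small numpy sketch (the helper name is illustrative), which should match scikit-learn's normalized_mutual_info_score with average_method='min':

```python
import numpy as np

def nmi_min(labels, clusters):
    """I(eta, lambda) / min(H(eta), H(lambda)) from the label/cluster contingency table."""
    eta = np.unique(labels, return_inverse=True)[1]
    lam = np.unique(clusters, return_inverse=True)[1]
    joint = np.zeros((eta.max() + 1, lam.max() + 1))
    np.add.at(joint, (eta, lam), 1)
    joint /= joint.sum()                                   # joint distribution p(eta, lambda)
    p_eta, p_lam = joint.sum(axis=1), joint.sum(axis=0)

    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_eta, p_lam)[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / min(entropy(p_eta), entropy(p_lam))
```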

5. Limitations and Applicability

The primary limitation observed for clustering-based active learning methods such as FMS-AL is the absence of a generalizable classifier. Since only the modal objects are labeled, no explicit function is constructed to assign labels to novel out-of-sample points not present in the original clustering. The method is thus non-inductive and cannot generalize to unseen data without re-running clustering.

Additional limitations:

  • For very large $k$ or small cell sizes, FMS may slightly misestimate the number of clusters due to cell-boundary effects.
  • The approach provides no direct mechanism for handling dynamics or drift in evolving datasets.

A plausible implication is that for applications requiring inductive classifiers, hybrid strategies (e.g., combining FMS with subsequent classifier training on modal or cluster-member objects) may be required.

6. Real-World Applications and Comparative Strengths

FMS-based clustering active learning has been demonstrated on datasets with varying density and complexity, including:

  • Normalized MNIST (64 features): 1NN error of $0.020$.
  • Block Letters (82,541 objects): 43 classes.
  • Cursive Letters (213,623 objects): 42 classes.
  • ALLR (1,464,656 objects): Demonstrated scalability while approximating the clustering accuracy of standard mean shift.

Notably, FMS-AL achieves classification error rates lower than randomly sampled 1NN and SVM baselines, and multi-scale combinations (MS-ALC/FMS-ALC) boost confidence where single-level assignments are ambiguous. The method’s unique capability to handle massive, high-dimensional datasets with modest computational resources makes it attractive for digit recognition, document classification, and any scenario requiring fast, large-scale labeling without model generalization demands.

7. Relation to Other Clustering-Based Active Learning Paradigms

Compared to cluster annotation approaches (e.g., cluster-based batch labeling with human inspection (Perez et al., 2018)) or batch-mode methods using K-means for informativeness/diversity (e.g., (Zhdanov, 2019)), mode-seeking clustering provides a deterministic, density-adaptive partitioning and propagates cluster-level labels without needing full batch annotation.

While centroid-based strategies may select samples closest to centroids or maximize diversity, mode-seeking approaches like FMS-AL intrinsically adapt to local data density, placing representatives in high-density regions, and are capable of revealing multi-scale structure. This provides a foundational complement to the broader clustering-based active learning literature, especially when computational efficiency and scalability are principal constraints.


In summary, clustering-based active learning strategies—exemplified by the FMS algorithm—yield computationally and label-efficient labeling schemes by exploiting the structure and density of high-dimensional data. They enable scalable annotation in large datasets and can significantly outperform classifier-trained baselines in transductive (in-sample) settings, though their non-inductive nature limits out-of-sample applicability. The integration of hierarchical, multi-scale clustering with active expert annotation provides a potent framework for efficient label acquisition in high-volume semi-supervised scenarios (Duin et al., 2017).
