Clustering-Based Active Learning Strategy

Updated 14 August 2025
  • Clustering-based active learning is an approach that uses clustering techniques like fast mode-seeking to identify valuable samples for efficient labeling in large datasets.
  • FMS-AL employs multi-scale, density-adaptive clustering to propagate labels from modal representatives to entire clusters, achieving lower error rates compared to traditional methods.
  • The strategy offers subquadratic time complexity and scalability for massive data but requires re-clustering for out-of-sample predictions due to its non-inductive nature.

A clustering-based active learning strategy is an approach to sample selection in active learning that leverages the discovered structure of data via clustering algorithms. Rather than solely relying on model uncertainty or informativeness metrics, these strategies use cluster assignments, modal representatives, or cluster-derived similarity information to identify which data points are most valuable for labeling. This paradigm encompasses a range of methodologies, from mode-seeking and centroid-based schemes to density, diversity, and fairness-aware systems. The following sections organize the landscape of clustering-based active learning, detailing key mechanisms, complexity, empirical performance, theoretical underpinnings, limitations, and practical implications, focusing especially on methodologies and results documented in (Duin et al., 2017).

1. Fast Mode-Seeking Clustering as a Basis for Active Learning

The fast kNN mode seeking (FMS) algorithm provides a computationally feasible alternative to standard mean-shift or kNN-based mode-seeking clustering, particularly for high-dimensional or large datasets (Duin et al., 2017). The procedure is characterized by:

  • Partitioning via Overlapping Cells: The dataset $\mathcal{S}$ is partitioned by first selecting a subset $P$ (of size $m \approx \sqrt{cn}$ for complexity parameter $c$) and forming "P-cells" by nearest-neighbor assignments. These are expanded into "Q-cells" of overlapping neighborhoods by considering the $c$ nearest $P$ prototypes.
  • Local Density Estimation: For each data point $x_i$, local densities $f_i^k = 1/s_{ik}$ are estimated, where $s_{ik}$ is the distance to the $k$th nearest neighbor within the candidate Q-cell.
  • Pointer Clustering and Modal Object Assignment: Each $x_i$ points to its densest local neighbor and, by following these pointers, arrives at a unique mode. The resulting mode forms the cluster representative or "modal object" for that group.

Compared to mean-shift—which relies on kernel bandwidth selection and iterative updates over the entire space—FMS only requires local computation within Q-cells. It delivers a multi-scale clustering hierarchy in a single pass by varying $k$.
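As a concrete illustration of the density-estimation and pointer-clustering steps, the following is a minimal naive kNN mode-seeking sketch, not the cell-partitioned FMS implementation of (Duin et al., 2017), which restricts all computations to Q-cells; function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_mode_seeking(X, k):
    """Naive kNN mode seeking for illustration; FMS restricts these steps to Q-cells."""
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    density = 1.0 / (dist[:, -1] + 1e-12)   # f_i^k = 1 / s_ik: inverse distance to the farthest of the k returned neighbours

    # Pointer clustering: each point points to a strictly denser neighbour, if one exists.
    pointer = np.arange(len(X))
    for i in range(len(X)):
        j = idx[i][np.argmax(density[idx[i]])]
        if density[j] > density[i]:
            pointer[i] = j

    # Follow pointers until every path ends in a modal object (a fixed point).
    modes = pointer.copy()
    while not np.array_equal(pointer[modes], modes):
        modes = pointer[modes]
    return modes, density                    # modes[i] = index of x_i's modal object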

Key properties:

  • Time complexity $O(n^{1.5})$, with empirical scaling approaching $O(n^{1.4})$ for moderate to large $n$.
  • Space complexity is $O(n)$.
  • Scalability demonstrated up to $1.5 \times 10^6$ samples with computation times substantially lower than mean-shift or naive all-pairs strategies.

2. Clustering-Guided Label Propagation for Active Learning

In the FMS-based clustering active learning framework, cluster representatives—modal objects—are labeled by an oracle, and these labels are propagated to all other members of their respective clusters. This yields a form of "one-shot" label propagation, referred to as FMS-AL (Active Learning via Fast Mode-Seeking).

Several mechanisms further enhance this process:

  • Multi-Scale Propagation: By simultaneously generating clusterings at multiple $k$ values, one can propagate and combine class credibility/confidence across levels. The update rule

$$Q^{(i+1)} = A^{(i+1)} Q^{(i)}$$

recursively aggregates confidences, where $A^{(i+1)}$ is a normalization matrix at scale $i+1$.

  • Hierarchical Rejection: The system can reject low-confidence assignments, e.g., at uncertain boundaries where clusters at different resolutions disagree.
  • No Classifier Training Required: Classification is determined solely by the clustering structure—no classifier generalizing to the ambient space is learned.

This methodology is particularly well suited for applications where large datasets must be labeled efficiently and the acquisition of a model for out-of-sample extension is secondary.
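Building on the sketch above, the one-shot propagation and a simplified multi-scale combination might look as follows. The `oracle` callable (mapping a sample index to an integer class label), the class count, and the rejection threshold are assumptions for illustration, and the plain averaging of one-hot votes stands in for the recursive update $Q^{(i+1)} = A^{(i+1)} Q^{(i)}$.

```python
def fms_al_label(X, k, oracle):
    """One-shot propagation: query labels for modal objects only, copy them to cluster members."""
    modes, _ = knn_mode_seeking(X, k)
    modal_labels = {m: oracle(m) for m in np.unique(modes)}   # oracle queries = number of clusters
    return np.array([modal_labels[m] for m in modes])

def fms_alc_confidence(X, ks, oracle, n_classes, reject_below=0.8):
    """Simplified multi-scale combination: average one-hot votes over several k values
    and reject points whose aggregated confidence stays low (boundary disagreement)."""
    Q = np.zeros((len(X), n_classes))
    for k in ks:
        Q += np.eye(n_classes)[fms_al_label(X, k, oracle)]    # assumes integer labels 0..n_classes-1
    Q /= len(ks)
    return Q.argmax(axis=1), Q.max(axis=1) >= reject_below    # predicted class, accept mask
```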

3. Theoretical Complexity and Scaling

A distinguishing aspect of FMS is its subquadratic complexity. Unlike brute-force mode seeking or mean shift, which require $O(n^2)$ distance calculations, FMS partitions the distance computation into local neighborhoods of size approximately $\sqrt{n}$, producing a global order of $O(n\sqrt{n})$ operations.

  • For $n = 10^4$ objects, clustering is computed in seconds;
  • For $n = 10^5$, in minutes;
  • For $n = 10^6$, in less than an hour.

Space usage remains linear in nn, as only neighborhood assignments, pointers, and densities are stored, not complete distance matrices.
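A back-of-the-envelope comparison makes the gap concrete; constant factors and hardware are ignored, so these are order-of-magnitude savings rather than the timings reported above.

```python
# Rough operation counts: all-pairs O(n^2) vs. FMS-style O(n * sqrt(n)), constants ignored.
for n in (10**4, 10**5, 10**6):
    print(f"n = {n:>9,}:  n^2 = {n**2:.1e}   n*sqrt(n) = {n**1.5:.1e}   savings = {n**0.5:,.0f}x")
```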

4. Empirical Performance: Classification and Clustering Metrics

Performance assessment in (Duin et al., 2017) is conducted on MNIST, Block Letters, Cursive Letters, and a large-scale ALLR dataset.

Key metrics and findings:

  • Normalized Mutual Information (NMI): The NMI between clusters and true classes approaches 1 as the resolution increases (i.e., as cluster sizes shrink), demonstrating high cluster purity at sufficient granularity.
  • Learning curves: When plotting classification error versus the number of modal object labels, FMS-AL curves consistently achieve lower error rates than either 1NN classifiers or SVMs trained on equally sized, randomly sampled labeled sets (a sketch of such a comparison follows this list).
  • Computation efficiency: The FMS-AL approach enables supervised labeling at a scale (over a million instances) not tractable with standard clustering or classifier training.
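One way such a learning-curve point could be produced, assuming the routines sketched earlier and a simulated annotator that simply reveals the ground-truth label of a queried index; this harness is an assumption for illustration, not the paper's evaluation code.

```python
from sklearn.neighbors import KNeighborsClassifier

def learning_curve_point(X, y_true, k, rng):
    """Error of FMS-AL propagation vs. a 1NN classifier trained on an equal random budget."""
    y_fms = fms_al_label(X, k, oracle=lambda i: y_true[i])     # simulated annotator
    budget = len(np.unique(knn_mode_seeking(X, k)[0]))         # number of modal objects labelled

    sample = rng.choice(len(X), size=budget, replace=False)    # same budget, random selection
    y_rnd = KNeighborsClassifier(n_neighbors=1).fit(X[sample], y_true[sample]).predict(X)
    return budget, float(np.mean(y_fms != y_true)), float(np.mean(y_rnd != y_true))
```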

Formulas used:

  • Cluster-level confidence updates:

$$q_{(ij)} = \frac{1}{|C_{(ij)}|} \sum_{x_t \in C_{(ij)}} q(x_t),$$

where $q_{(ij)}$ is the class confidence for cluster $(i,j)$.

  • NMI:

$$I_n(\eta,\lambda) = \frac{I(\eta,\lambda)}{\min\{H(\eta),H(\lambda)\}}$$

to quantify agreement between label assignment $\eta$ and cluster assignment $\lambda$.
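The min-normalized NMI above can be computed directly from the label/cluster contingency table; a small numpy sketch (the helper name is illustrative), which should match scikit-learn's normalized_mutual_info_score with average_method='min':

```python
import numpy as np

def nmi_min(labels, clusters):
    """I(eta, lambda) / min(H(eta), H(lambda)) from the label/cluster contingency table."""
    eta = np.unique(labels, return_inverse=True)[1]
    lam = np.unique(clusters, return_inverse=True)[1]
    joint = np.zeros((eta.max() + 1, lam.max() + 1))
    np.add.at(joint, (eta, lam), 1)
    joint /= joint.sum()                                   # joint distribution p(eta, lambda)
    p_eta, p_lam = joint.sum(axis=1), joint.sum(axis=0)

    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_eta, p_lam)[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return mi / min(entropy(p_eta), entropy(p_lam))
```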

5. Limitations and Applicability

The primary limitation observed for clustering-based active learning methods such as FMS-AL is the absence of a generalizable classifier. Since only the modal objects are labeled, no explicit function is constructed to assign labels to novel out-of-sample points not present in the original clustering. The method is thus non-inductive and cannot generalize to unseen data without re-running clustering.

Additional limitations:

  • For very large $k$ or small cell sizes, FMS may slightly misestimate the number of clusters due to cell-boundary effects.
  • The approach provides no direct mechanism for handling dynamics or drift in evolving datasets.

A plausible implication is that for applications requiring inductive classifiers, hybrid strategies (e.g., combining FMS with subsequent classifier training on modal or cluster-member objects) may be required.

6. Real-World Applications and Comparative Strengths

FMS-based clustering active learning has been demonstrated on datasets with varying density and complexity, including:

  • Normalized MNIST (64 features): 1NN error of $0.020$.
  • Block Letters (82,541 objects): 43 classes.
  • Cursive Letters (213,623 objects): 42 classes.
  • ALLR (1,464,656 objects): Demonstrated scalability while approximating the clustering accuracy of standard mean shift.

Notably, FMS-AL achieves classification error rates lower than randomly sampled 1NN and SVM baselines, and multi-scale combinations (MS-ALC/FMS-ALC) boost confidence where single-level assignments are ambiguous. The method’s unique capability to handle massive, high-dimensional datasets with modest computational resources makes it attractive for digit recognition, document classification, and any scenario requiring fast, large-scale labeling without model generalization demands.

7. Relation to Other Clustering-Based Active Learning Paradigms

Compared to cluster annotation approaches (e.g., cluster-based batch labeling with human inspection (Perez et al., 2018)) or batch-mode methods using K-means for informativeness/diversity (e.g., (Zhdanov, 2019)), mode-seeking clustering provides a deterministic, density-adaptive partitioning and propagates cluster-level labels without needing full batch annotation.

While centroid-based strategies may select samples closest to centroids or maximize diversity, mode-seeking approaches like FMS-AL intrinsically adapt to local data density, placing representatives in high-density regions, and are capable of revealing multi-scale structure. This provides a foundational complement to the broader clustering-based active learning literature, especially when computational efficiency and scalability are principal constraints.


In summary, clustering-based active learning strategies—exemplified by the FMS algorithm—yield computationally and label-efficient labeling schemes by exploiting the structure and density of high-dimensional data. They enable scalable annotation in large datasets and can significantly outperform classifier-trained baselines in transductive (in-sample) settings, though their non-inductive nature limits out-of-sample applicability. The integration of hierarchical, multi-scale clustering with active expert annotation provides a potent framework for efficient label acquisition in high-volume semi-supervised scenarios (Duin et al., 2017).
