
Dustbin Cluster in ML Pipelines

Updated 3 June 2026
  • Dustbin clusters are specialized constructs that automatically reject ambiguous or low-purity data points to enhance model clarity and generalization.
  • They are realized either through multi-clustering voting with purity filtering or through optimal transport with a dedicated 'dustbin' column that absorbs non-informative features.
  • Reported results in waste classification and visual place recognition show improvements in accuracy and recall when dustbin clusters are used.

A dustbin cluster is a specialized construct in clustering-based and optimal-transport-based machine learning pipelines that lets models automatically detect, isolate, or discard data points or features that are ambiguous, outlying, or non-informative for the grouping or assignment at hand. Dustbin clusters have been formalized in both unsupervised waste classification and visual place recognition as a means to improve labeling precision, enhance model generalization, and increase robustness to both domain shift and non-discriminative information (Huang et al., 4 Mar 2025; Izquierdo et al., 2023).

1. Formal Definitions and Conceptual Basis

The dustbin cluster serves as an automatic "rejection" or "outlier" bin for samples or features that do not fit confidently into any canonical category cluster. In unsupervised waste classification, such as in the DECMCV framework, dustbin clusters are populated by samples whose cluster assignments are ambiguous, exhibiting either label disagreement across clustering algorithms or low intra-cluster purity (e.g., purity < 80%) (Huang et al., 4 Mar 2025). In optimal-transport-based visual descriptor aggregation, the dustbin is modeled as an extra cluster into which features with low affinity for all canonical clusters are assigned, serving as a null-cluster that absorbs non-discriminative or uninformative local descriptors (e.g., sky patches, featureless backgrounds) (Izquierdo et al., 2023).

2. Mechanisms for Dustbin Cluster Assignment

2.1 Multi-Clustering Voting in Unsupervised Labeling

In DECMCV, each sample's embedding, produced via dual encoders (ConvNeXt and Vision Transformer), is input to three clustering algorithms (K-means, spectral, and agglomerative) with a fixed number of clusters (e.g., K=50). Cluster assignments per sample are aggregated by majority voting; if no cluster receives a decisive majority (i.e., no label assigned by at least two of three algorithms), the sample is assigned to a dustbin cluster. Subsequently, clusters are filtered by purity, with clusters failing to reach a threshold (e.g., ρ=0.8) having their constituent samples moved to the dustbin cluster. This guarantees that ambiguous or heterogeneous clusters are automatically rejected, mitigating erroneous or noisy labeling (Huang et al., 4 Mar 2025).
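The voting step described above can be sketched as follows. This is a minimal illustration, not the DECMCV implementation: it assumes the cluster ids from the three algorithms have already been aligned to a common numbering (the text does not say how this is done), and the `DUSTBIN` sentinel, the function name `majority_vote`, and the toy label arrays are all hypothetical.

```python
import numpy as np

DUSTBIN = -1  # hypothetical sentinel id for the dustbin cluster


def majority_vote(label_sets, min_votes=2):
    """Return one consensus label per sample, or DUSTBIN when no label
    collects at least `min_votes` votes.

    label_sets: array-like of shape (n_algorithms, n_samples). Cluster ids
    are assumed to be aligned across algorithms already (an assumption of
    this sketch, not something stated in the source).
    """
    votes = np.asarray(label_sets)
    n_samples = votes.shape[1]
    consensus = np.full(n_samples, DUSTBIN, dtype=int)
    for i in range(n_samples):
        ids, counts = np.unique(votes[:, i], return_counts=True)
        best = np.argmax(counts)
        if counts[best] >= min_votes:
            consensus[i] = ids[best]
    return consensus


# Toy labels standing in for K-means, spectral, and agglomerative outputs.
kmeans_ids = [0, 1, 2, 2]
spectral_ids = [0, 1, 0, 2]
agglo_ids = [0, 2, 1, 2]

consensus = majority_vote([kmeans_ids, spectral_ids, agglo_ids])
print(consensus.tolist())  # [0, 1, -1, 2]
```

The third sample receives three different labels, so no label reaches two votes and it falls into the dustbin; the other samples keep their majority label.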

2.2 Optimal Transport and Dustbin Columns

In the SALAD method for visual place recognition, the (soft) assignment matrix is augmented with an extra column representing the dustbin cluster. For $n$ local descriptors and $m$ clusters, the score matrix $S \in \mathbb{R}^{n \times m}$ is extended to $\bar{S} \in \mathbb{R}^{n \times (m+1)}$ by appending a dustbin column with a trainable cost $z$. The OT problem constrains the dustbin cluster's marginal sum to $n-m$, ensuring it can absorb exactly the excess mass of non-assigned features. The Sinkhorn algorithm solves for an entropy-regularized plan, and after convergence, the dustbin column absorbs features insufficiently matched to any canonical cluster. These features then do not contribute to final global descriptors, filtering uninformative information from aggregation (Izquierdo et al., 2023).
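A log-domain Sinkhorn solver with a dustbin column might look like the sketch below. It is an illustration under stated assumptions rather than the SALAD code: each descriptor and each canonical cluster is given unit mass, the dustbin is given mass n - m, and the function name, the regularization parameter `reg`, the iteration count, and the toy scores are all hypothetical. In SALAD the value `z` is learned; here it is simply a fixed argument.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def sinkhorn_with_dustbin(S, z=1.0, reg=1.0, num_iters=100):
    """Entropy-regularized assignment of n descriptors to m clusters plus a dustbin.

    S: (n, m) score matrix with n > m. Returns a plan of shape (n, m + 1)
    whose last column is the dustbin.
    """
    n, m = S.shape
    if n <= m:
        raise ValueError("this sketch assumes more descriptors than clusters")
    # Augment the scores with a constant dustbin column.
    S_bar = np.concatenate([S, np.full((n, 1), z)], axis=1)
    # Marginals (assumed): mass 1 per descriptor, 1 per cluster, n - m for the dustbin.
    log_a = np.zeros(n)
    log_b = np.concatenate([np.zeros(m), [np.log(n - m)]])
    K = S_bar / reg
    u = np.zeros(n)
    v = np.zeros(m + 1)
    for _ in range(num_iters):
        u = log_a - logsumexp(K + v[None, :], axis=1)  # match row marginals
        v = log_b - logsumexp(K + u[:, None], axis=0)  # match column marginals
    return np.exp(K + u[:, None] + v[None, :])


# Two descriptors match a cluster strongly; four have weak scores everywhere.
S = np.array([[2.0, 0.0], [0.0, 2.0], [0.1, 0.2],
              [0.0, 0.1], [0.2, 0.0], [0.1, 0.1]])
P = sinkhorn_with_dustbin(S)
print(P.sum(axis=0))  # column masses close to (1, 1, n - m = 4)
print(P[:, -1])       # per-descriptor mass sent to the dustbin
```

In this toy case the four weakly matched descriptors send most of their mass to the last column, which is the behavior the dustbin is meant to provide.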

3. Quantitative Impact and Empirical Results

Dustbin clusters, by enabling robust rejection of ambiguous or low-quality assignments, substantially improve both clustering purity and downstream classification or retrieval metrics.

  • In DECMCV, application of dustbin clusters yielded classification accuracies of 93.78% (TrashNet), 98.29% (Huawei Cloud), and 97.25% (real-world conveyor dataset), consistently outperforming supervised baselines in both accuracy and cluster purity. The fraction of samples assigned to dustbin clusters was substantial: 37.0% on TrashNet, 27.1% on Huawei Cloud, illustrating the prevalence of ambiguous data in real-world datasets (Huang et al., 4 Mar 2025).
  • In SALAD, the introduction of a dustbin column improved Recall@1 from 91.4% to 92.2% and Recall@10 from 96.2% to 97.0% on the MSLS Val dataset. The dustbin was shown to be the most influential SALAD component for boosting recall (Izquierdo et al., 2023).

| Method | Dataset | Discarded (%) | Accuracy/Recall Metrics |
|---|---|---|---|
| DECMCV | TrashNet | 37.0 | Accuracy: 93.78%, Purity: 88.2% |
| DECMCV | Huawei Cloud | 27.1 | Accuracy: 98.29%, Purity: 96.5% |
| SALAD (DINOv2) | MSLS Val | | R@1: 92.2% (with dustbin) |

4. Theoretical Properties and Implementation Details

For the DECMCV pipeline, the dustbin cluster is strictly an outcome of two mechanisms: (1) assignment by clustering-agreement majority vote; (2) filtering based on cluster purity. For the former, if no cluster label is shared by at least two of the three algorithms, the sample is directly assigned to the dustbin cluster. For the latter, even after consensus, a cluster whose dominant category accounts for less than a set purity threshold (typically 0.8) of its samples is fully demoted to the dustbin.
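The purity filter in mechanism (2) can be illustrated with the sketch below. It assumes a reference category is available for each sample when purity is measured (the text does not specify how purity is estimated in practice), and the names `purity_filter`, `DUSTBIN`, and the toy data are hypothetical.

```python
import numpy as np

DUSTBIN = -1  # hypothetical sentinel id for the dustbin cluster


def purity_filter(cluster_ids, category_ids, threshold=0.8):
    """Demote every cluster whose dominant-category share is below `threshold`.

    cluster_ids: consensus cluster id per sample (DUSTBIN entries are skipped).
    category_ids: reference category per sample (an assumption of this sketch).
    Returns the filtered ids and a dict mapping cluster id to purity.
    """
    cluster_ids = np.asarray(cluster_ids).copy()
    category_ids = np.asarray(category_ids)
    purities = {}
    for c in np.unique(cluster_ids):
        if c == DUSTBIN:
            continue
        members = cluster_ids == c
        _, counts = np.unique(category_ids[members], return_counts=True)
        purity = counts.max() / counts.sum()
        purities[int(c)] = float(purity)
        if purity < threshold:
            cluster_ids[members] = DUSTBIN  # the whole cluster is demoted
    return cluster_ids, purities


cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, DUSTBIN]
category_ids = ["glass"] * 5 + ["paper", "paper", "paper", "metal", "metal"] + ["glass"]

filtered, purities = purity_filter(cluster_ids, category_ids)
print(filtered.tolist())  # [0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1]
print(purities)
```

Cluster 0 is pure and survives, while cluster 1 has a dominant-category share of 3/5 and is moved to the dustbin in full, including the samples that belong to its dominant category.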

Within the OT paradigm in SALAD, the dustbin is a mathematically defined null-cluster with its own marginal constraint and a per-feature assignment cost parameterized by $z$. The Sinkhorn iterations are adapted by including the dustbin column and marginal, ensuring the total absorbed mass by the dustbin equals $n-m$. Dropout (0.3) is applied to the score-projection layer to prevent pathological overconfidence and retain dustbin activity (Izquierdo et al., 2023).
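One way to write out the constraints described above is given below. This is a sketch under assumptions, not the formulation quoted from the paper: each descriptor and each canonical cluster is assigned unit mass, $\varepsilon$ is a regularization weight and $H$ the entropy of the plan (both symbols introduced here), and $\mathbf{1}_k$ denotes the all-ones vector of length $k$.

```latex
\bar{S} = \begin{bmatrix} S & z\,\mathbf{1}_n \end{bmatrix} \in \mathbb{R}^{n \times (m+1)},
\qquad
\bar{P}^{\star} = \arg\max_{\bar{P} \ge 0} \;
  \langle \bar{P}, \bar{S} \rangle + \varepsilon\, H(\bar{P})
\quad \text{s.t.} \quad
\bar{P}\,\mathbf{1}_{m+1} = \mathbf{1}_n,
\qquad
\bar{P}^{\top}\mathbf{1}_n = \begin{bmatrix} \mathbf{1}_m \\ n-m \end{bmatrix}.
```

The two marginals carry the same total mass, since $n = m + (n-m)$, which is what makes the problem feasible and lets the dustbin column take up the mass that the $m$ canonical clusters cannot.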

Pseudocode and detailed assignment strategies are given in Huang et al. (4 Mar 2025) and Izquierdo et al. (2023); reproduction requires only standard implementations of contrastive learning, clustering, and majority filtering, or of OT with dustbin support.

5. Practical Applications and Broader Significance

Dustbin clusters enable semi- and unsupervised pipelines to function with minimal manual labeling. In large-scale waste classification, this reduces labor cost by isolating only ambiguous samples for downstream review, while the majority of samples are automatically labeled with high accuracy. DECMCV, for example, required only 50 labeled samples to accurately label thousands in a real-world dataset, improving overall accuracy by 29.85% relative to supervised approaches (Huang et al., 4 Mar 2025).

In feature aggregation for image retrieval, dustbin clusters mitigate the influence of non-informative regions (such as sky or empty road in outdoor imagery) on global image descriptors. This selective filtering is critical to maximizing discriminative power for place recognition and retrieval recall (Izquierdo et al., 2023).
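The effect of discarding dustbin mass during aggregation can be shown with a deliberately simplified sketch. It assumes a plain assignment-weighted sum of local features per cluster followed by L2 normalization; the actual SALAD aggregation has further components that are not reproduced here, and the function name and toy inputs are hypothetical.

```python
import numpy as np


def aggregate_without_dustbin(P, F):
    """Build a global descriptor from local features, ignoring dustbin mass.

    P: (n, m + 1) assignment matrix whose last column is the dustbin.
    F: (n, d) local features.
    Returns an L2-normalized vector of length m * d.
    """
    P = np.asarray(P, dtype=float)
    F = np.asarray(F, dtype=float)
    P_kept = P[:, :-1]        # drop the dustbin column
    V = P_kept.T @ F          # (m, d): assignment-weighted sum per cluster
    v = V.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


# Three descriptors, two clusters; the third descriptor sits entirely in the dustbin.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
F = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [100.0, 100.0]])

g = aggregate_without_dustbin(P, F)
print(g)
```

Because the third descriptor has all of its mass in the dustbin column, changing its feature vector leaves the global descriptor unchanged, which is the filtering effect described above for regions such as sky or empty road.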

Dustbin cluster assignment diverges from conventional outlier detection in that it is woven into the clustering, voting, or OT-assignment process as a first-class assignment target. Unlike classical rejection or thresholding heuristics, dustbin clusters are systematically derived either by statistical agreement failure (in voting-based pipelines) or by explicit absorption in the OT marginal (in optimal-transport-based pipelines).

A plausible implication is that as clustering frameworks and deep metric learning advance, dynamically-learned dustbin clusters may be further integrated as a standard mechanism for dataset cleaning, relabeling, and robust aggregation. The broad applicability of dustbin strategies, from unsupervised labeling to global descriptor construction, signals their increasing utility in self-supervised and weakly supervised settings.
