Decision-Ambiguous Samples in ML
- Decision-Ambiguous Samples are data instances with intrinsic labeling uncertainty due to feature overlap and contextual ambiguities.
- They are distinct from out-of-distribution and adversarial examples, requiring specialized methods like reject options and convex surrogate losses.
- Integrating DAS detection and calibration improves model reliability and safety across tasks such as classification, visual reasoning, and action recognition.
Decision-Ambiguous Samples (DAS) are data instances for which the ground-truth labeling is inherently uncertain due to feature overlap, contextual ambiguity, or persistent class confusability, even under expert human assessment or with access to the underlying data-generating process. In supervised learning, DAS pose unique challenges: they are fundamentally distinct from unlabeled data, out-of-distribution (OOD) samples, and adversarial examples. The explicit handling, generation, detection, and utilization of DAS have emerged as critical concerns for the robustness, interpretability, and safety of machine learning models across domains, including classification, visual reasoning, and action recognition.
1. Formal Definitions and Distinction from Other Uncertainty Sources
The canonical formalization of Decision-Ambiguous Samples is as follows: let $x \in \mathcal{X}$ be an input and $\mathcal{Y} = \{1, \dots, K\}$ the set of possible classes. The ground-truth conditional label distribution is $p(y \mid x)$. A sample $x$ is decision-ambiguous if there exist $y_1 \neq y_2 \in \mathcal{Y}$ such that $p(y_1 \mid x) > 0$ and $p(y_2 \mid x) > 0$, i.e., multiple classes possess non-zero assignment probability according to the oracle labeling mechanism. This is equivalently characterized by positive label entropy,

$$H[p(\cdot \mid x)] = -\sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x) > 0,$$

signifying irreducible (aleatoric) uncertainty (Weiss et al., 2022). In practice, proxies such as model probability margins or classifier posterior closeness to equiprobability are utilized for operationalization (Gomes et al., 2024, Dong et al., 31 Dec 2025).
This distinguishes DAS from:
- OOD inputs: Samples with high epistemic uncertainty owing to distributional novelty; may be unambiguous in true labeling.
- Adversarial examples: Minimal perturbations with altered model predictions but unchanged ground-truth class distributions.
- Noisy or mislabeled data: Erroneously labeled samples, which lack persistent, label-intrinsic ambiguity.
DAS embody samples where even a hypothetical Bayes-optimal classifier exhibits irreducible uncertainty due to intrinsic class overlap.
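As a concrete operationalization, the entropy and margin criteria above can be checked directly given a (proxy) posterior. The following is a minimal sketch; the `margin_tau` threshold is illustrative rather than a value taken from the cited works:

```python
import numpy as np

def label_entropy(p: np.ndarray) -> float:
    """Shannon entropy of a conditional label distribution p(y | x)."""
    p = p[p > 0]                      # zero-probability classes contribute nothing
    return float(-(p * np.log(p)).sum())

def is_decision_ambiguous(p: np.ndarray, margin_tau: float = 0.1) -> bool:
    """Proxy DAS check: flag x when the top-1/top-2 probability margin of the
    posterior is small, i.e., several classes carry comparable mass."""
    top2 = np.sort(p)[-2:]            # the two largest class probabilities
    return bool(top2[1] - top2[0] <= margin_tau)

# Example: a posterior torn between two classes (e.g., a blurred digit).
p = np.array([0.02, 0.46, 0.44, 0.08])
print(label_entropy(p), is_decision_ambiguous(p))   # high entropy, ambiguous
```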
2. DAS in Binary and Multiclass Classification: Frameworks and Loss Constructions
In binary classification, one approach models DAS via a three-way partition of the training set: positive ($P$), negative ($N$), and explicitly labeled ambiguous ($A$) samples (Otani et al., 2020). Rather than treating $A$ as a third class or ignoring it, the methodology extends binary classification with a reject option. The objective is to simultaneously learn:
- A classifier $f: \mathcal{X} \to \mathbb{R}$, with $\mathrm{sign}(f(x))$ yielding the label,
- A rejector $r: \mathcal{X} \to \mathbb{R}$, with $r(x) > 0$ signifying acceptance and $r(x) \le 0$ indicating rejection (abstention).
The losses employed are:
- 0-1-$c$ loss (binary, with reject cost $c$): for a labeled sample $(x, y)$ with $y \in \{\pm 1\}$,
$$\ell_{01c}(f, r; x, y) = \mathbb{1}[y f(x) \le 0]\,\mathbb{1}[r(x) > 0] + c\,\mathbb{1}[r(x) \le 0].$$
- 0-1-$c$-$d$ loss (with ambiguous samples $x \in A$ and ambiguity penalty $d$): the 0-1-$c$ loss on $P \cup N$, augmented for each $x \in A$ by
$$\ell_{01cd}(f, r; x) = d\,\mathbb{1}[r(x) > 0] + c\,\mathbb{1}[r(x) \le 0].$$

The penalty $d$ is imposed whenever ambiguous points are "forced" into classes rather than rejected (Otani et al., 2020).
Computational tractability is achieved using a convex surrogate loss, such as the max-hinge-ambiguous (MHA) loss, ensuring Bayes-consistency and convexity. Optimization proceeds via kernel or finite-basis expansions and convex quadratic programming.
The explicit incorporation of DAS, via the ambiguous set $A$ with a non-trivial penalty $d$, allows the model to learn rejection regions aligned with true ambiguous overlap, yielding improved test accuracy in scenarios where ambiguous zones coincide with class-boundary regions.
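For concreteness, a minimal empirical version of the 0-1-$c$-$d$ objective (as reconstructed above) can be written as follows. `f` and `r` are assumed to be vectorized score functions, and charging the reject cost $c$ to rejected ambiguous points follows the reconstruction rather than a verified detail of Otani et al. (2020):

```python
import numpy as np

def zero_one_cd_loss(f, r, X_pos, X_neg, X_amb, c=0.3, d=0.3):
    """Empirical 0-1-c-d loss over positives, negatives, and ambiguous points.
    f(X): real-valued classifier scores; r(X): rejector scores, accept iff > 0.
    c is the reject cost, d the penalty for classifying an ambiguous point."""
    def labeled_term(X, y):
        fx, rx = f(X), r(X)
        wrong_accept = ((y * fx <= 0) & (rx > 0)).astype(float)  # misclassified
        rejected = (rx <= 0).astype(float)                        # abstained
        return (wrong_accept + c * rejected).sum()

    rx_a = r(X_amb)
    amb_term = d * (rx_a > 0).sum() + c * (rx_a <= 0).sum()
    n = len(X_pos) + len(X_neg) + len(X_amb)
    return (labeled_term(X_pos, +1) + labeled_term(X_neg, -1) + amb_term) / n
```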
3. Generation and Quantification of DAS in Representation Learning
The need to systematically generate, label, and quantify DAS for evaluation or stress-testing purposes has motivated generative methods such as AmbiGuess (Weiss et al., 2022), which employs a regularized adversarial autoencoder (rAAE). The process involves:
- Training an rAAE on a pair of classes, yielding Gaussian clusters in latent space,
- Defining a confined latent space (CLS) spanning the gap between clusters,
- Grid-partitioning CLS and sampling in cells with high ambiguity (as measured by minimal class discrimination and local decoder Jacobian norms),
- Decoding ambiguous latent points and assigning probabilistic labels based on discriminator outputs.
Optimal ambiguous latent points are sought by maximizing the minimum classification confidence across classes.
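A simplified version of this search, assuming a vectorized `decoder` and a two-class `clf_proba` callable (both hypothetical interfaces, not AmbiGuess's actual API), might look like:

```python
import numpy as np

def mine_ambiguous_latents(decoder, clf_proba, z_lo, z_hi, grid=20, k=16):
    """Grid-partition the confined latent space between two class clusters and
    keep the k latent points whose decodings maximize the minimum class
    confidence, i.e., the samples the classifier finds hardest to assign."""
    axes = [np.linspace(lo, hi, grid) for lo, hi in zip(z_lo, z_hi)]
    mesh = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, len(z_lo))
    imgs = decoder(mesh)                 # decode every latent grid point
    proba = clf_proba(imgs)              # shape (n, 2): two-class posteriors
    ambiguity = proba.min(axis=1)        # min confidence across the two classes
    best = np.argsort(ambiguity)[-k:]    # the most ambiguous cells
    return mesh[best], imgs[best], proba[best]
```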
Metrics for DAS quality include top-1/top-2/top-pair accuracy (comparing model predictions to ground-truth ambiguity), and predictive entropy. For instance, MNIST DAS achieve mean predictive entropy ≈ 1.22, with top-1 accuracy ≈ 0.53 and top-2 ≈ 0.98, demonstrating significant ambiguity (Weiss et al., 2022).
4. Interpretability and Analysis of DAS at Model Decision Boundaries
DAS can be systematically mined from the "twilight zone" of model decision boundaries, characterized by small confidence margins in binary classification. Generative frameworks such as GASTeN target the generation of low-margin synthetic samples, followed by margin-based filtering (Gomes et al., 2024).
For interpretability and diagnostic analysis, clustering of DAS (after UMAP dimensionality reduction and GMM clustering) enables the selection of prototype ambiguous samples. Saliency analysis via GradientSHAP on these prototypes elucidates those features or input regions most responsible for the classifier’s indecision, thereby surfacing recurring structural or semantic patterns of ambiguity.
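A sketch of this diagnostic pipeline (UMAP reduction, GMM clustering, nearest-to-mean prototype selection) is given below; it assumes the `umap-learn` and `scikit-learn` packages, and all hyperparameters are illustrative:

```python
import numpy as np
import umap                                    # umap-learn package
from sklearn.mixture import GaussianMixture

def das_prototypes(das_embeddings, n_clusters=5, seed=0):
    """Cluster mined DAS in a 2-D UMAP space with a GMM and return one
    prototype index per cluster (the sample nearest the component mean), as
    candidates for downstream GradientSHAP saliency analysis."""
    z = umap.UMAP(n_components=2, random_state=seed).fit_transform(das_embeddings)
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(z)
    labels = gmm.predict(z)
    prototypes = []
    for k, mu in enumerate(gmm.means_):
        members = np.where(labels == k)[0]
        if len(members) == 0:
            continue                           # skip empty components
        dists = np.linalg.norm(z[members] - mu, axis=1)
        prototypes.append(members[np.argmin(dists)])
    return np.array(prototypes), labels
```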
Empirical findings confirm that DAS clusters (as measured by silhouette and Davies–Bouldin indices) are visually coherent and reveal model-specific sources of uncertainty (e.g., missing strokes in hand-written digits leading to confusion). These insights inform responsible model deployment and active dataset refinement.
5. DAS Identification and Calibration in Complex, Structured Tasks
Beyond canonical classification, DAS have been formally integrated into complex tasks such as micro-action recognition and visual question answering.
In micro-action recognition, ambiguous samples are discovered hierarchically—first at the body level, then action level—by constructing explicit sets of false negatives (FN) and false positives (FP). Calibration is achieved by contrastive refinement: pulling FN DAS embeddings toward their correct action/body prototype and pushing FP DAS away. A “prototypical diversity amplification loss” is added to prevent prototype collapse and strengthen capacity (Li et al., 2024). The overall objective combines cross-entropy, hierarchical tree loss, contrastive calibration, and diversity regularization.
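The pull/push calibration can be sketched as a prototype-anchored contrastive term; the hinge form and margin below are illustrative, not the exact loss of Li et al. (2024):

```python
import torch
import torch.nn.functional as F

def contrastive_calibration(emb_fn, emb_fp, prototype, margin=0.5):
    """Pull false-negative DAS embeddings toward their correct action/body
    prototype (similarity -> 1) and push false-positive DAS embeddings away
    (penalized whenever similarity exceeds the margin)."""
    proto = prototype.unsqueeze(0)                       # (1, d) for broadcasting
    sim_fn = F.cosine_similarity(emb_fn, proto, dim=-1)  # FN vs. prototype
    sim_fp = F.cosine_similarity(emb_fp, proto, dim=-1)  # FP vs. prototype
    pull = (1.0 - sim_fn).mean()
    push = F.relu(sim_fp - margin).mean()
    return pull + push
```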
Experimental evidence demonstrates that the explicit identification and calibration of DAS (notably those arising in inter-category confusion zones) catalyzes substantial performance gains, particularly in previously hard-to-classify subpopulations.
In change-detection visual question answering, DAS are mined as samples where the SFT-trained reference model assigns nearly equal probabilities to the ground-truth and the leading distractor answer. Formally, for sample $x$ with ground-truth answer $a^{*}$ and strongest distractor $a^{-}$,

$$m(x) = p_{\mathrm{ref}}(a^{*} \mid x) - p_{\mathrm{ref}}(a^{-} \mid x),$$

and $x$ is DAS iff $|m(x)| \le \tau$ for a small threshold $\tau$ (Dong et al., 31 Dec 2025).
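Operationally, mining these samples reduces to a margin test against the reference model. The sketch below assumes a hypothetical `ref_logprob(x, a)` interface returning $\log p_{\mathrm{ref}}(a \mid x)$, with an illustrative threshold:

```python
import torch

def mine_cdvqa_das(ref_logprob, dataset, tau=0.1):
    """Keep samples where the reference model's probabilities for the ground
    truth and the strongest distractor nearly tie, i.e., |m(x)| <= tau."""
    das = []
    for x, a_true, distractors in dataset:
        p_true = torch.exp(ref_logprob(x, a_true)).item()
        p_dist = max(torch.exp(ref_logprob(x, a)).item() for a in distractors)
        if abs(p_true - p_dist) <= tau:        # near-equal probabilities
            das.append((x, a_true))
    return das
```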
The DARFT approach (Decision-Ambiguity-guided Reinforcement Fine-Tuning) then prioritizes these samples in policy optimization, employing multi-sample decoding and intra-group relative advantages to push apart scores for correct answers and strong distractors, effectively sharpening the decision boundary. Results indicate pronounced gains (+6.05pp to +10.48pp in overall accuracy) in few-shot data regimes, especially in fine-grained categories with high initial ambiguity.
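The intra-group relative advantage can be illustrated with a generic group-standardization estimator (a GRPO-style sketch; DARFT's exact weighting may differ):

```python
import numpy as np

def intra_group_advantages(rewards, eps=1e-8):
    """Standardize rewards across the multiple decoded answers for one
    question, so that correct answers gain positive advantage relative to
    strong distractors within the same group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: four decodings of one question, exactly one of which is correct.
print(intra_group_advantages([1.0, 0.0, 0.0, 0.0]))
```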
6. Supervisor Detection Capabilities and Implications for Safety
The ability of DNN supervision techniques ("meta-prediction" methods) to detect DAS varies with the uncertainty type considered. Softmax-based supervisors (e.g., maximum softmax response, entropy, DeepGini) perform strongly in identifying AmbiGuess-generated DAS, but their performance on adversarial or OOD samples is modest. Conversely, feature-space anomaly detectors excel on OOD/adversarial faults, yet underperform on DAS (AUC-ROC 0.17–0.48) (Weiss et al., 2022).
This complementarity demonstrates that ensembles spanning both aleatoric (DAS) and epistemic (OOD, adversarial) uncertainty are required. Inclusion of DAS is essential in supervisor benchmarks to avoid overestimation of safety and robustness, especially in high-stakes applications such as medical diagnosis or autonomous driving, where overconfident misclassification on truly ambiguous samples may have grave consequences.
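A minimal ensemble along these lines combines a softmax-entropy signal (strong on DAS) with a feature-space anomaly score (strong on OOD/adversarial inputs); the min-max normalization and equal weighting below are assumptions, not settings from the cited study:

```python
import numpy as np

def combined_supervisor(softmax_probs, feat_anomaly_score, w=0.5):
    """Mix an aleatoric signal (predictive entropy over softmax outputs) with
    an epistemic one (any feature-space anomaly score), after min-max
    normalizing each so the two scales are comparable."""
    p = np.clip(softmax_probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)

    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)

    return w * norm(entropy) + (1 - w) * norm(feat_anomaly_score)
```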
7. Theoretical and Empirical Impact, Current Limitations, and Future Directions
The integration of explicit formalization, generation, and calibration of DAS has yielded theoretically justified and empirically validated improvements in classifier reliability, boundary sharpness, and interpretability across application domains. Key findings are presented in the table below:
| Method/Domain | DAS Operationalization | Empirical Highlight |
|---|---|---|
| Binary classification | Reject region via 0-1-c-d loss, expert-labeled A | Highest test accuracy in mixed-band regions (Otani et al., 2020) |
| Representation learning | rAAE-based AmbiGuess, entropy and margin metrics | MNIST DAS: Top-1 ≈ 0.53, Top-2 ≈ 0.98, entropy ≈ 1.22 (Weiss et al., 2022) |
| Action recognition | Hierarchical FN/FP mining, prototype contrast | +11.8pp F1 on most ambiguous micro-actions (Li et al., 2024) |
| Visual QA (CDVQA) | Margin-mined DAS, relative advantage RL | +6.05pp to +10.48pp OA improvement (Dong et al., 31 Dec 2025) |
Open directions include the extension of DAS formalism beyond classification to structured prediction and regression, richer human-in-the-loop ambiguity acceptance models, and principled integration of DAS into active learning, model auditing, and safety-critical deployment pipelines (Weiss et al., 2022). No single supervisor suffices; model developers are advised to systematically generate, benchmark, and calibrate against high-quality DAS for end-to-end reliability.