The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning (2508.21816v1)

Published 29 Aug 2025 in cs.CV and cs.AI

Abstract: Context recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR that is carefully designed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we futher develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3\% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.

Collections

Summary

The paper introduces a novel SPMLL framework that reframes verb classification as a multi-label problem to effectively handle inherent semantics overlaps.
It employs a Graph Enhanced Verb MLP with GCN and adversarial training, achieving over 3% improvement in mean average precision compared to single-label baselines.
Empirical evaluations using visual embeddings and t-SNE plots validate that the approach clusters semantically similar classes more effectively.

The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning

Introduction

This paper addresses a significant challenge in situation recognition (SR) within computer vision: the prevalent ambiguity in verb classification. Traditional SR frameworks classify events in images using single-label approaches that fail to capture the multi-faceted nature of visual events. This paper posits that verb classification in SR is inherently a multi-label task and introduces a novel approach to address this need by framing it as a Single Positive Multi-Label Learning (SPMLL) problem.

The Ambiguity Challenge in Verb Classification

Existing SR tasks divide into three main components: verb classification, semantic role labeling, and semantic role grounding. Verb classification acts as the keystone, influencing subsequent tasks. Current methods, however, simplify verb classification into a single-label task, ignoring the inherent overlap among verb categories (e.g., "crying" vs. "weeping") (Figure 1).

Figure 1: Illustration of Baseline model's Top-1 predictions on test set for random sampled 10 classes. Each bar shows the predicted label and its proportion, with the ground truth label on the x-axis. It can be seen that most wrong cases are semantically similar or even semantically overlapped with the ground truth class.

The paper provides empirical evidence through visual embeddings and t-SNE plots that demonstrate significant overlaps in class distributions. Given this ambiguity, a multi-label framework better suits verb classification tasks.

Reformulating with Single Positive Multi-label Learning

Given the prohibitive cost of fully annotating existing datasets with multiple labels, the authors advocate for an SPMLL approach. This perspective allows training robust multi-label models using only single positive labels per instance, thereby aligning training requirements with practical constraints.

A comprehensive multi-label evaluation benchmark was created using automated pre-annotation steps that leveraged LLMs, followed by manual verification to ensure high-quality annotations (Figure 2).

Figure 2: Three main steps for data collection: a mini-benchmark dataset to validate the problem's existence, pre-annotation using LLMs, and refinement through human verification.

Proposed Method: Graph Enhanced Verb MLP

The solution proposed is the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), integrating graph neural networks to encode label correlations and adversarial training to refine decision boundaries. A sparse semantic similarity matrix functions as a proxy for label correlation, guiding the model through a multi-layer GCN to enhance classification accuracy.

Figure 3: Usage of semantic relations to regularize relations between categories. The semantic similarity matrix is employed to construct a label correlation graph used in a multi-layer GCN.

Adversarial training with FGSM and PGD further mitigates ambiguities by smoothing decision boundaries, achieving a balance between robustness and classification accuracy. Experimental results demonstrated over 3% improvement in MAP compared to single-label baselines.

Experimental Evaluation

The paper benchmarks several existing models with the proposed GE-VerbMLP, showing considerable improvements in the multi-label field without sacrificing single-label performance. The qualitative analyses using embedding visualizations reveal improved clustering of semantically similar classes after incorporating GCN layers (Figure 4).

Figure 4: T-SNE visualization of different class center vectors. Semantically similar classes are closer after GCN layers, illustrating effective clustering.

Conclusion

The paper presents a valuable rethinking of verb classification in SR, addressing fundamental ambiguities through an SPMLL framing. By developing a robust multi-label dataset and leveraging advanced neural frameworks, the paper provides crucial insights and methodologies for SR improvements. The implications extend beyond merely SR to other complex classification tasks dealing with inherent label ambiguities. Future research can build on this foundation to refine verb classification techniques further and explore applications to related tasks in multimedia and beyond.