Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition (1703.08338v2)

Published 24 Mar 2017 in cs.CV

Abstract: This work deviates from easy-to-define class boundaries for object interactions. For the task of object interaction recognition, often captured using an egocentric view, we show that semantic ambiguities in verbs and recognising sub-interactions along with concurrent interactions result in legitimate class overlaps (Figure 1). We thus aim to model the mapping between observations and interaction classes, as well as class overlaps, towards a probabilistic multi-label classifier that emulates human annotators. Given a video segment containing an object interaction, we model the probability for a verb, out of a list of possible verbs, to be used to annotate that interaction. The probability is learnt from crowdsourced annotations, and is tested on two public datasets, comprising 1405 video sequences for which we provide annotations on 90 verbs. We outperform conventional single-label classification by 11% and 6% on the two datasets respectively, and show that learning from annotation probabilities outperforms majority voting and enables discovery of co-occurring labels.


Overview of "Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition"

The paper "Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition" addresses object interaction recognition, particularly in video captured from an egocentric view. The authors argue that conventional single-label classification depends on clearly defined class boundaries that do not hold for this task. They instead propose a probabilistic multi-label classifier that accommodates the semantic ambiguities and class overlaps that arise when annotating object interactions in video sequences.

Key Contributions and Methodology

The paper identifies three assumptions behind traditional single-label classification that it argues are incorrect: that classes are semantically distinct, that interactions are labelled at a single temporal granularity, and that annotators use verbs in a unified way. To counter these assumptions, the authors introduce a probabilistic model that assigns a probability to each of several candidate labels for a given video interaction. The model is trained on crowdsourced annotations to predict how likely each of 90 possible verbs is to be used to describe an interaction, so that its output emulates human annotator behavior.
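The step from crowdsourced annotations to per-verb probabilities can be illustrated with a minimal sketch. It assumes that each annotator selects a set of applicable verbs for a segment and that the probability of a verb is simply the fraction of annotators who chose it; the function names, the four-verb vocabulary, and the example annotations are hypothetical and are not taken from the paper.

```python
from collections import Counter


def annotation_probabilities(annotations, vocabulary):
    """Fraction of annotators who chose each verb for one video segment.

    annotations: list of sets, one set of chosen verbs per annotator
                 (assumed format, not the paper's data layout).
    vocabulary:  list of all candidate verbs.
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("need at least one annotator")
    # Count each verb at most once per annotator.
    counts = Counter(verb for chosen in annotations for verb in set(chosen))
    return {verb: counts[verb] / n for verb in vocabulary}


def majority_vote(probabilities):
    """Single hard label: the verb chosen by the most annotators."""
    return max(probabilities, key=probabilities.get)


# Hypothetical example: four annotators describe the same segment.
vocabulary = ["open", "pull", "turn", "pour"]
annotations = [{"open", "pull"}, {"open"}, {"open", "turn"}, {"pull"}]

soft = annotation_probabilities(annotations, vocabulary)
hard = majority_vote(soft)
print(soft)  # soft targets keep the overlap between "open" and "pull"
print(hard)  # majority voting keeps only one verb
```

Under these assumptions, the soft targets retain the information that half of the annotators also considered "pull" valid, which a majority-vote label discards.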

Contributions include:

Crowdsourced Annotations: The authors employed crowdsourcing to obtain a substantial amount of data, annotating 1,405 video sequences across two public datasets (CMU and GTEA+). This resulted in an informative semantic framework for the proposed multi-label recognition system.
Probabilistic Multi-Label Model: By reformulating interaction recognition as a probabilistic multi-label problem, the authors developed a more nuanced classifier using a two-stream convolutional neural network (CNN) architecture. This model improves upon traditional one-vs-all classification by capturing inter-label relationships and concurrent interactions.
Significant Improvement in Accuracy: Testing against conventional single-label classifiers yielded notable results: the proposed method outperformed its single-label counterparts by 11% and 6% on the CMU and GTEA+ datasets, respectively. This demonstrates the efficacy of probabilistic multi-labeling in capturing the subtleties of human annotations.
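One way to picture training against annotation probabilities is sketched below. This is not the authors' implementation: it assumes independent sigmoid outputs per verb, a binary cross-entropy loss against soft targets, and late fusion of the two streams by averaging their scores. All function names, the threshold, and the numbers are hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_streams(spatial_logits, temporal_logits):
    """Late fusion by averaging per-verb scores (an assumed, simple choice)."""
    return [(s + t) / 2.0 for s, t in zip(spatial_logits, temporal_logits)]


def soft_label_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy against annotation probabilities in [0, 1]."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Clamp to avoid log(0) for saturated predictions.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def co_occurring(logits, verbs, threshold=0.5):
    """Verbs whose predicted probability reaches the (assumed) threshold."""
    return [v for v, z in zip(verbs, logits) if sigmoid(z) >= threshold]


# Hypothetical scores from the two streams for one video segment.
verbs = ["open", "pull", "pour"]
targets = [0.75, 0.5, 0.0]  # annotation probabilities used as soft targets
fused = fuse_streams([2.0, 0.4, -3.0], [1.0, -0.2, -5.0])

loss = soft_label_loss(fused, targets)
likely = co_occurring(fused, verbs)
print(loss, likely)
```

Because each verb has its own sigmoid output, several verbs can receive high probability for the same segment, which is what allows co-occurring labels to surface; a softmax over verbs would instead force them to compete.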

Analysis and Implications

This research presents substantial theoretical and practical implications for the domain of action recognition. Theoretically, it challenges and extends the understanding of classification boundaries in complex object interactions, proposing a framework that inherently respects and models the variability in human language and perception. Practically, the introduction of a probabilistic approach paves the way for more robust and flexible AI systems capable of understanding complex human behaviors as seen through first-person cameras.

The work points to expanding the verb vocabulary used in egocentric settings so that it covers a wider range of everyday interactions and scenarios. Exploiting semantic relationships between verbs could also strengthen action recognition systems, with potential benefits for assistive technologies, autonomous agents, and interactive systems.

Future Directions

Future research could focus on enhancing the localization of multiple labels in both spatial and temporal contexts within video sequences, which would be particularly beneficial in distinguishing temporally concurrent actions. Furthermore, investigating the integration of this probabilistic multi-label framework into broader automated systems could improve real-time action recognition applications, such as robotic perception and human-computer interaction systems.

In summary, the paper provides a compelling argument and solution to the challenges present in object interaction recognition through a well-founded methodological innovation. This foundation serves as a stepping stone for future exploration into the development of AI systems that accurately interpret and categorize the multifaceted nature of human activities.


Authors (4)

Michael Wray
Davide Moltisanti
Walterio Mayol-Cuevas
Dima Damen

Citations (2)
