- The paper proposes two novel loss functions, a scene adversarial loss and a human mask confusion loss, that train action recognition models to learn scene-invariant features and to rely on the human actions themselves rather than the surrounding scene.
- Experimental results show that the debiasing method improves model generalization and performance on various action understanding tasks including classification, localization, and detection across multiple datasets.
- Mitigating scene bias enhances model robustness and versatility with significant applications in areas like surveillance and sports analysis, suggesting future work could explore other biases and stronger network backbones.
Mitigating Scene Bias in Action Recognition
The paper, "Why Can't I Dance in the Mall?: Learning to Mitigate Scene Bias in Action Recognition," presents a compelling approach to the problem of scene bias in video action recognition. Convolutional neural networks (CNNs) trained on large video datasets capture correlations between actions and the scenes in which they occur. This can introduce bias: models come to rely on scene cues rather than on the human actions themselves, which hinders generalization to novel actions and to tasks beyond the training datasets.
Methodology
To combat this representation bias, the authors propose a debiasing method for video representation learning. Two distinct loss functions augment the traditional cross-entropy loss employed in action classification:
- Scene Adversarial Loss: This loss encourages scene-invariant features by attaching an auxiliary scene classifier to the backbone through a gradient reversal layer. The classifier is trained to predict the scene, while the reversed gradients push the backbone toward features from which the scene cannot easily be predicted.
- Human Mask Confusion Loss: Here, videos are processed with the human actors masked out. The model is trained to be maximally uncertain (i.e., "confused") about the action class when no human is visible, so that confident predictions must come from actual human action cues rather than from scene context.
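The two losses above can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the gradient reversal layer is the standard construction from domain-adversarial training, and the confusion loss is implemented here as entropy maximization (minimizing negative entropy) on human-masked clips; all function and argument names are my own.

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient sign so the backbone un-learns scene cues.
        return -ctx.lambd * grad_output, None


def scene_adversarial_loss(features, scene_head, scene_labels, lambd=1.0):
    """Cross-entropy scene classification on gradient-reversed features.

    The scene head learns to predict the scene, while the reversed
    gradients push the backbone toward scene-invariant features.
    """
    reversed_feats = GradReverse.apply(features, lambd)
    logits = scene_head(reversed_feats)
    return F.cross_entropy(logits, scene_labels)


def human_mask_confusion_loss(masked_action_logits):
    """Negative entropy of action predictions on human-masked clips.

    Minimizing this pushes the predicted distribution toward uniform,
    i.e., the model stays 'confused' when no human is visible.
    """
    log_probs = F.log_softmax(masked_action_logits, dim=1)
    probs = log_probs.exp()
    return (probs * log_probs).sum(dim=1).mean()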
These methods aim to focus the model’s learning process on discernible actions rather than misleading scene-related cues that might dominate more traditional approaches.
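Concretely, the overall training objective is a weighted sum of the three terms; the weight symbols below are illustrative rather than the paper's exact notation:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{action}}
  \;+\; \lambda_{\text{scene}}\,\mathcal{L}_{\text{scene-adv}}
  \;+\; \lambda_{\text{mask}}\,\mathcal{L}_{\text{confusion}}
```

Here \(\mathcal{L}_{\text{action}}\) is the standard cross-entropy action loss, while the two debiasing terms act only on the backbone's representation as described above.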
Experimental Validation
The paper thoroughly evaluates the proposed method by examining its transferability to various action understanding tasks, including action classification, temporal action localization, and spatio-temporal action detection.
- Action Classification: The model was pre-trained on Mini-Kinetics-200 and transferred to UCF-101, HMDB-51, and Diving48. The debiasing method improved generalization, with the largest gains on datasets with low scene bias (e.g., Diving48), validating the effectiveness of the bias mitigation.
- Temporal Action Localization: In tests using the THUMOS-14 dataset, the debiased model demonstrated better mAP scores across different IoU thresholds compared to models without debiasing, indicating enhanced capability in identifying action boundaries accurately.
- Spatio-Temporal Action Detection: The debiasing technique also improved spatio-temporal action detection accuracy on the JHMDB dataset, where actions must be localized in both space and time. This suggests broader applicability to complex, real-world scenarios.
Implications and Future Work
The findings indicate that reducing scene bias enhances the versatility and robustness of action recognition models. This has significant applications in fields where precise action detection is crucial, such as video surveillance, sports analysis, and automated video editing.
Future research could expand upon this work by exploring debiasing along other dimensions of video representation, such as object and human biases. Additionally, applying these techniques to more powerful backbones such as I3D or SlowFast networks could further improve performance.
This paper highlights an under-investigated aspect of action recognition, paving the way for more resilient models that do not rely solely on contextual correlations. The use of adversarial training and strategic masking provides a promising avenue for refining video understanding, ultimately contributing to AI systems capable of more nuanced interaction with the real world.