Multi-Modal Domain Adaptation for Fine-Grained Action Recognition (2001.09691v2)

Published 27 Jan 2020 in cs.CV

Abstract: Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and target domains. However, these approaches have not explored the multi-modal nature of video within each domain. In this work we exploit the correspondence of modalities as a self-supervised alignment approach for UDA in addition to adversarial alignment. We test our approach on three kitchens from our large-scale dataset, EPIC-Kitchens, using two modalities commonly employed for action recognition: RGB and Optical Flow. We show that multi-modal self-supervision alone improves the performance over source-only training by 2.4% on average. We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.

Authors (2)
  1. Jonathan Munro (6 papers)
  2. Dima Damen (83 papers)
Citations (184)

Summary

Multi-Modal Domain Adaptation for Fine-Grained Action Recognition: A Technical Overview

The paper "Multi-Modal Domain Adaptation for Fine-Grained Action Recognition" addresses the problem of domain shift in fine-grained action recognition. It introduces an unsupervised domain adaptation (UDA) strategy that combines adversarial training with multi-modal self-supervised alignment, and demonstrates improved recognition accuracy on unlabelled target domains.

Core Contributions

The contributions of this research are twofold. First, it introduces multi-modal self-supervision as a domain adaptation technique, exploiting the correspondence between modalities such as RGB and Optical Flow. Second, it combines this self-supervision with adversarial domain alignment, using a separate domain discriminator per modality to encourage feature representations that are invariant to domain-specific characteristics.
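
Concretely, the two contributions can be read as terms of a single training objective: a supervised classification loss on labelled source clips, a self-supervised correspondence loss on both domains, and one adversarial discriminator loss per modality. The formulation below is a schematic summary in illustrative notation, not the paper's verbatim equations:

```latex
% Schematic MM-SADA objective (illustrative notation):
%   L_y   - action classification loss (labelled source clips)
%   L_c   - modality-correspondence loss (source and target clips, no action labels)
%   L_d^m - domain discriminator loss for modality m, trained adversarially
\[
  \mathcal{L} \;=\; \mathcal{L}_{y}
  \;+\; \lambda_{c}\,\mathcal{L}_{c}
  \;+\; \lambda_{d} \sum_{m \in \{\text{RGB},\,\text{Flow}\}} \mathcal{L}_{d}^{\,m}
\]
```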

Methodology

Problem Context

The paper addresses the problem of domain shift—a scenario where models trained in a source domain underperform when deployed in an unlabelled target domain. This is particularly prevalent in fine-grained action recognition datasets like EPIC-Kitchens, where environmental biases are significant.

Proposed Approach

The proposed Multi-Modal Self-Supervised Adversarial Domain Adaptation (MM-SADA) method implements the following components:

  • Adversarial Training: Domain-invariant features are learned via a separate domain discriminator for each modality. A gradient reversal layer negates the discriminator's gradients as they flow back into the feature extractor, so the extracted features progressively discard domain-specific information.
  • Multi-Modal Self-Supervision: The model learns modality correspondence without labels, drawing on data from both the source and target domains. It predicts whether sampled RGB and Flow clips belong to the same action, which encourages the source and target feature distributions to align (a minimal sketch of both components follows this list).
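
A minimal PyTorch sketch of these two components is given below. The feature dimensions, head depths, and the way correspondence pairs are formed are illustrative assumptions rather than the paper's exact configuration; `GradientReversal`, `DomainDiscriminator`, and `CorrespondenceHead` are names introduced here for clarity.

```python
# Minimal sketch of the two alignment components described above.
# Sizes and head architectures are illustrative assumptions.
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)

class DomainDiscriminator(nn.Module):
    """Per-modality binary classifier: does a feature come from the source or target domain?"""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, feat, lambd=1.0):
        # Reversed gradients push the feature extractor towards domain-invariant features.
        return self.net(grad_reverse(feat, lambd))

class CorrespondenceHead(nn.Module):
    """Self-supervised head: do the RGB and Flow features come from the same action clip?"""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),  # corresponding vs. non-corresponding pair
        )

    def forward(self, rgb_feat, flow_feat):
        return self.net(torch.cat([rgb_feat, flow_feat], dim=1))
```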

Implementation Details

The researchers use the I3D architecture as the feature extractor for both the RGB and Optical Flow modalities. The training schedule first optimizes for classification and self-supervision before incorporating adversarial alignment, with robustness supported by hyper-parameter tuning and batch normalization tailored to target domain statistics.
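
The staged schedule might be wired into a single training step as sketched below. The helper methods (`model.extract`, `model.classify`, `model.correspondence`, `model.disc_rgb`, `model.disc_flow`) and all hyper-parameters are hypothetical placeholders standing in for the I3D backbones and heads, not the paper's code.

```python
# Hedged sketch of the staged training schedule described above.
import torch
import torch.nn.functional as F

def training_step(batch, model, step, warmup_steps=4000, lambda_c=1.0, lambda_d=0.5):
    # Per-modality features for labelled source clips and unlabelled target clips.
    src_rgb, src_flow = model.extract(batch["source_rgb"]), model.extract(batch["source_flow"])
    tgt_rgb, tgt_flow = model.extract(batch["target_rgb"]), model.extract(batch["target_flow"])

    # (1) Supervised action classification, source domain only.
    loss = F.cross_entropy(model.classify(src_rgb, src_flow), batch["source_labels"])

    # (2) Self-supervised correspondence on both domains
    #     (binary labels: temporally paired vs. shuffled RGB/Flow clips).
    for rgb, flow, corr in [(src_rgb, src_flow, batch["source_corr"]),
                            (tgt_rgb, tgt_flow, batch["target_corr"])]:
        loss = loss + lambda_c * F.cross_entropy(model.correspondence(rgb, flow), corr)

    # (3) Adversarial domain alignment per modality, enabled only after a warm-up
    #     phase that uses classification and self-supervision alone.
    if step >= warmup_steps:
        for src_feat, tgt_feat, disc in [(src_rgb, tgt_rgb, model.disc_rgb),
                                         (src_flow, tgt_flow, model.disc_flow)]:
            feats = torch.cat([src_feat, tgt_feat], dim=0)
            domains = torch.cat([torch.zeros(len(src_feat)),
                                 torch.ones(len(tgt_feat))]).long().to(feats.device)
            loss = loss + lambda_d * F.cross_entropy(disc(feats), domains)

    return loss
```

Warming up classification and self-supervision before enabling the adversarial term mirrors the staged schedule described above; the switch point and loss weights are tuning choices rather than values taken from the paper.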

Empirical Evaluation

The paper conducts extensive evaluations on EPIC-Kitchens, focusing on pairs of domains drawn from different kitchen environments. The results indicate that MM-SADA outperforms established domain adaptation techniques, such as batch-normalization-based adaptation and maximum mean discrepancy (MMD), on several domain pairs. Specifically, it achieves an average improvement of 4.8% over source-only baselines and consistently aligns the feature spaces of the source and target domains.

The experiments further explore the contribution of each methodological component. Self-supervision alone is shown to be an effective alignment strategy, offering competitive performance against alternative methods. Additionally, separate evaluations of individual modalities reveal that RGB benefits significantly from the proposed multi-modal strategy, highlighting the robustness of MM-SADA in leveraging the strengths of multiple data streams.

Implications and Future Directions

The implications of this research extend to various AI applications involving video-based action recognition, particularly in scenarios where training and deployment environments differ. By providing a mechanism to enhance model performance across domains without requiring target domain labels, MM-SADA offers a scalable solution for real-world applications.

Looking forward, the authors suggest further investigations into additional modalities, such as audio, and alternative self-supervised tasks. The exploration of multi-task self-supervision for domain adaptation also presents a promising avenue for expanding the method's applicability.

In conclusion, the innovative integration of adversarial and self-supervised strategies within a multi-modal framework represents a significant advancement in domain adaptation for fine-grained action recognition. Through rigorous experimentation and analysis, this work sets a precedent for future research in building robust action recognition systems capable of generalizing across diverse environments.
