- The paper presents a novel Temporal Interlacing Network (TIN) that dynamically learns temporal offsets, achieving a 1-2% improvement over TSM and contributing to top-ranking performance.
- The authors combine 2D CNN-based models with 3D SlowFast architectures to capture both appearance and motion features in the trimmed video clips.
- Extensive experiments with multi-label loss functions and pre-training strategies underpin the method’s success, leading to a validation mAP of 67.22% and overall competition victory.
Top-1 Solution of Multi-Moments in Time Challenge 2019
The paper entitled "Top-1 Solution of Multi-Moments in Time Challenge 2019" details the efforts of the team 'Efficient' from SenseTime X-Lab and CUHK in achieving the leading position in the Multi-Moments in Time Challenge at ICCV 2019. The challenge consisted of recognizing multiple actions depicted in short, trimmed videos, i.e. a multi-label recognition task, on a large dataset with over one million video samples.
Methodological Overview
The authors pursued a dual approach, employing both image-based (2D) and 3D architectures. The image-based models included TSN, TRN, TSM, and the proposed Temporal Interlacing Network (TIN), each applying a shared 2D CNN backbone to sampled frames and aggregating the per-frame features to capture temporal information. These models are generally lighter than 3D methods but typically trail them in overall accuracy.
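To make the segment-based idea concrete, the snippet below is a minimal PyTorch sketch (not the authors' code) of how such a 2D-CNN model works: sample a few frames, run a shared 2D backbone on each, and average the per-frame logits. The backbone choice, segment count, and class count are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SegmentConsensusNet(nn.Module):
    """Sketch of a TSN-style segment-based 2D-CNN video classifier (illustrative only)."""

    def __init__(self, num_classes=313, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        backbone = models.resnet50(weights=None)                  # shared 2D CNN backbone (assumed)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, x):
        # x: (N, T, 3, H, W) -- one RGB frame per sampled segment
        n, t, c, h, w = x.shape
        logits = self.backbone(x.reshape(n * t, c, h, w))         # per-frame logits
        return logits.view(n, t, -1).mean(dim=1)                  # average consensus over segments

model = SegmentConsensusNet()
clip = torch.randn(2, 8, 3, 224, 224)                             # 2 videos, 8 segments each
probs = torch.sigmoid(model(clip))                                 # multi-label probabilities, (2, 313)
```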
The 3D-based models focused on the SlowFast network and its variants. SlowFast architectures employ a "slow" pathway that processes few frames and a "fast" pathway that processes many frames to capture appearance and motion, respectively. Several configurations were explored, differing in computational complexity and in the number and sampling rate of input frames.
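As a concrete illustration of the two-pathway idea (a sketch, not the paper's exact configuration), a densely sampled clip can be split into slow and fast inputs by temporal subsampling; the stride `alpha` and clip length below are assumed values.

```python
import torch

def make_slowfast_inputs(frames: torch.Tensor, alpha: int = 8):
    """Split one clip into SlowFast inputs (illustrative sketch).

    frames: (N, C, T, H, W) densely sampled clip.
    The fast pathway sees every frame; the slow pathway subsamples the
    temporal axis by a factor of `alpha`, focusing on appearance while
    the fast pathway captures motion.
    """
    fast = frames                                   # fine temporal resolution
    slow = frames[:, :, ::alpha, :, :]              # coarse temporal resolution
    return slow, fast

clip = torch.randn(1, 3, 32, 224, 224)              # 32-frame RGB clip (assumed length)
slow, fast = make_slowfast_inputs(clip)
print(slow.shape, fast.shape)                       # (1, 3, 4, 224, 224) (1, 3, 32, 224, 224)
```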
Temporal Interlacing Network
The paper introduces a novel model, the Temporal Interlacing Network (TIN), which enhances temporal information fusion by dynamically learning the shift distance of temporal offsets. TIN constructs a differentiable module that determines the optimal displacement along the temporal axis, yielding a 1-2% performance improvement over TSM under identical training and testing configurations. This result highlights TIN's favorable trade-off between model complexity and temporal recognition accuracy.
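The summary does not reproduce TIN's implementation, but the core idea of a learnable, differentiable temporal shift can be sketched as linear interpolation along the temporal axis, so gradients flow back to the offset values. The shapes, group count, and boundary handling below are assumptions for illustration, not the authors' exact module.

```python
import torch

def interlaced_temporal_shift(x: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Sketch of a differentiable temporal shift (the idea behind TIN).

    x:       (N, T, G, C, H, W) features split into G channel groups.
    offsets: (G,) learned real-valued shifts, in frames, one per group.
    Each group is displaced along the temporal axis by its offset using
    linear interpolation between neighbouring frames, keeping the
    operation differentiable with respect to the offsets.
    """
    n, t, g, c, h, w = x.shape
    base = torch.arange(t, dtype=x.dtype, device=x.device).view(1, t, 1)
    pos = base - offsets.view(1, 1, g)              # shifted sampling positions, (1, T, G)
    lo = pos.floor().clamp(0, t - 1)                # lower neighbouring frame (clamped at clip edges)
    hi = (lo + 1).clamp(0, t - 1)                   # upper neighbouring frame
    frac = (pos - lo).clamp(0, 1)                   # interpolation weight, carries the gradient

    def gather_frames(idx):                         # pick frames per (sample, time, group)
        idx = idx.long().expand(n, t, g).reshape(n, t, g, 1, 1, 1)
        return torch.gather(x, 1, idx.expand(n, t, g, c, h, w))

    frac = frac.view(1, t, g, 1, 1, 1)
    return (1 - frac) * gather_frames(lo) + frac * gather_frames(hi)

# Toy usage: 8 frames, 4 channel groups of 16 channels each.
feat = torch.randn(2, 8, 4, 16, 14, 14)
offsets = torch.tensor([-1.3, -0.5, 0.5, 1.3], requires_grad=True)
shifted = interlaced_temporal_shift(feat, offsets)
shifted.sum().backward()                            # gradients reach the learned offsets
print(offsets.grad)                                 # tensor of shape (4,)
```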
Performance and Results
The ensemble of all methods culminated in a validation set mAP of 67.22% and a test set mAP of 60.77%, securing 1st place on the competition leaderboard. The result is driven largely by the strategic ensembling of diverse models and by multi-crop testing across spatial scales.
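In a multi-label setting, score fusion is typically an average of per-crop, per-model sigmoid scores; the snippet below is a minimal sketch of such a fusion. The crop counts and class count are placeholders, not the authors' exact test-time protocol.

```python
import torch

def ensemble_scores(per_model_crop_logits):
    """Average multi-label scores over test crops, then over models (illustrative sketch).

    per_model_crop_logits: list of tensors, one per model, each of shape
    (num_crops, num_classes) for a single video.
    """
    per_model = [torch.sigmoid(l).mean(dim=0) for l in per_model_crop_logits]  # per-model average
    return torch.stack(per_model).mean(dim=0)                                  # (num_classes,)

# e.g. three models, each evaluated with 6 crops of one video
scores = ensemble_scores([torch.randn(6, 313) for _ in range(3)])
print(scores.shape)   # torch.Size([313])
```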
Technical Discussions
The paper discusses several factors that impacted model performance:
- Loss Functions: Experiments with several multi-label classification losses showed the best performance when upscaling the BCE loss (a sketch of such a scaled BCE objective follows this list). Class-rebalancing strategies were also evaluated, but training on the original, unmodified data yielded the best results.
- Pre-training Considerations: While pre-training on Kinetics improved image-based methods, it surprisingly degraded the performance of SlowFast variants, likely due to dataset domain differences.
- Training Specifications: For image-based models, spatial and temporal data augmentations were employed. For SlowFast models, a half-period cosine learning-rate schedule, batch normalization, and dropout helped the networks converge effectively.
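As referenced in the loss-function item above, an "upscaled" BCE objective can be sketched as standard multi-label binary cross-entropy multiplied by a constant factor; the scale value and class count below are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class ScaledBCELoss(nn.Module):
    """Sketch of an upscaled multi-label BCE loss (scale factor is an assumed value)."""

    def __init__(self, scale: float = 100.0):
        super().__init__()
        self.scale = scale
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits, targets):
        # logits, targets: (N, num_classes); targets are multi-hot {0, 1}
        return self.scale * self.bce(logits, targets.float())

criterion = ScaledBCELoss()
logits = torch.randn(4, 313)
targets = torch.zeros(4, 313)
targets[:, :3] = 1                      # each clip carries a few positive labels
loss = criterion(logits, targets)
```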
Conclusion and Future Work
The authors present a comprehensive approach to multi-label action recognition on the Multi-Moments in Time dataset, introducing the TIN model within a broader ensemble strategy. As future directions, they suggest integrating additional modalities, such as optical flow and audio, to exploit multimodal information further, and exploring more advanced loss formulations tailored to multi-label classification. The release of a unified code repository underlines the commitment to community engagement and methodological transparency.