- The paper presents a novel Temporal Interlacing Network (TIN) that dynamically learns temporal offsets, achieving a 1-2% improvement over TSM and contributing to top-ranking performance.
- The authors combine 2D CNN-based models with 3D SlowFast architectures to capture both appearance and motion features in the trimmed video clips.
- Extensive experiments with multi-label loss functions and pre-training strategies underpin the method’s success, leading to a validation mAP of 67.22% and overall competition victory.
Top-1 Solution of Multi-Moments in Time Challenge 2019
The paper entitled "Top-1 Solution of Multi-Moments in Time Challenge 2019" details the efforts of the team 'Efficient' from SenseTime X-Lab and CUHK in achieving the leading position in the Multi-Moments in Time Challenge at ICCV 2019. The challenge consisted of recognizing multiple actions depicted in short, trimmed videos, i.e. a multi-label recognition task, on a large dataset with over one million video samples.
Methodological Overview
The authors pursued a dual approach, employing both image-based (2D) and 3D architectures. The image-based models included TSN, TRN, TSM, and the proposed Temporal Interlacing Network (TIN), each applying a shared 2D CNN backbone to sampled frames and aggregating the per-frame features to capture temporal information. These models are generally lighter than 3D methods but typically trail them in overall accuracy.
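To make the segment-based idea concrete, the snippet below is a minimal PyTorch sketch (not the authors' code) of how such a 2D-CNN model works: sample a few frames, run a shared 2D backbone on each, and average the per-frame logits. The backbone choice, segment count, and class count are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SegmentConsensusNet(nn.Module):
    """Sketch of a TSN-style segment-based 2D-CNN video classifier (illustrative only)."""

    def __init__(self, num_classes=313, num_segments=8):
        super().__init__()
        self.num_segments = num_segments
        backbone = models.resnet50(weights=None)                  # shared 2D CNN backbone (assumed)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, x):
        # x: (N, T, 3, H, W) -- one RGB frame per sampled segment
        n, t, c, h, w = x.shape
        logits = self.backbone(x.reshape(n * t, c, h, w))         # per-frame logits
        return logits.view(n, t, -1).mean(dim=1)                  # average consensus over segments

model = SegmentConsensusNet()
clip = torch.randn(2, 8, 3, 224, 224)                             # 2 videos, 8 segments each
probs = torch.sigmoid(model(clip))                                 # multi-label probabilities, (2, 313)
```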
The 3D-based models focused on the SlowFast network and its variants. SlowFast architectures employ a "slow" pathway that processes few frames and a "fast" pathway that processes many frames to capture appearance and motion, respectively. Several configurations were explored, differing in computational complexity and in the number and sampling rate of input frames.
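As a concrete illustration of the two-pathway idea (a sketch, not the paper's exact configuration), a densely sampled clip can be split into slow and fast inputs by temporal subsampling; the stride `alpha` and clip length below are assumed values.

```python
import torch

def make_slowfast_inputs(frames: torch.Tensor, alpha: int = 8):
    """Split one clip into SlowFast inputs (illustrative sketch).

    frames: (N, C, T, H, W) densely sampled clip.
    The fast pathway sees every frame; the slow pathway subsamples the
    temporal axis by a factor of `alpha`, focusing on appearance while
    the fast pathway captures motion.
    """
    fast = frames                                   # fine temporal resolution
    slow = frames[:, :, ::alpha, :, :]              # coarse temporal resolution
    return slow, fast

clip = torch.randn(1, 3, 32, 224, 224)              # 32-frame RGB clip (assumed length)
slow, fast = make_slowfast_inputs(clip)
print(slow.shape, fast.shape)                       # (1, 3, 4, 224, 224) (1, 3, 32, 224, 224)
```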
Temporal Interlacing Network
The paper introduces a novel model, the Temporal Interlacing Network (TIN), which enhances temporal information fusion by dynamically learning the shift distance of temporal offsets. TIN constructs a differentiable module that determines the optimal displacement along the temporal axis, yielding a 1-2% performance improvement over TSM under identical training and testing configurations. This result highlights TIN's favorable trade-off between model complexity and temporal recognition accuracy.
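The summary does not reproduce TIN's implementation, but the core idea of a learnable, differentiable temporal shift can be sketched as linear interpolation along the temporal axis, so gradients flow back to the offset values. The shapes, group count, and boundary handling below are assumptions for illustration, not the authors' exact module.

```python
import torch

def interlaced_temporal_shift(x: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Sketch of a differentiable temporal shift (the idea behind TIN).

    x:       (N, T, G, C, H, W) features split into G channel groups.
    offsets: (G,) learned real-valued shifts, in frames, one per group.
    Each group is displaced along the temporal axis by its offset using
    linear interpolation between neighbouring frames, keeping the
    operation differentiable with respect to the offsets.
    """
    n, t, g, c, h, w = x.shape
    base = torch.arange(t, dtype=x.dtype, device=x.device).view(1, t, 1)
    pos = base - offsets.view(1, 1, g)              # shifted sampling positions, (1, T, G)
    lo = pos.floor().clamp(0, t - 1)                # lower neighbouring frame (clamped at clip edges)
    hi = (lo + 1).clamp(0, t - 1)                   # upper neighbouring frame
    frac = (pos - lo).clamp(0, 1)                   # interpolation weight, carries the gradient

    def gather_frames(idx):                         # pick frames per (sample, time, group)
        idx = idx.long().expand(n, t, g).reshape(n, t, g, 1, 1, 1)
        return torch.gather(x, 1, idx.expand(n, t, g, c, h, w))

    frac = frac.view(1, t, g, 1, 1, 1)
    return (1 - frac) * gather_frames(lo) + frac * gather_frames(hi)

# Toy usage: 8 frames, 4 channel groups of 16 channels each.
feat = torch.randn(2, 8, 4, 16, 14, 14)
offsets = torch.tensor([-1.3, -0.5, 0.5, 1.3], requires_grad=True)
shifted = interlaced_temporal_shift(feat, offsets)
shifted.sum().backward()                            # gradients reach the learned offsets
print(offsets.grad)                                 # tensor of shape (4,)
```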
Performance and Results
The ensemble of all methods culminated in a validation set mAP of 67.22% and a test set mAP of 60.77%, securing 1st place on the competition leaderboard. The result is driven largely by the strategic ensembling of diverse models and by multi-crop testing across spatial scales.
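In a multi-label setting, score fusion is typically an average of per-crop, per-model sigmoid scores; the snippet below is a minimal sketch of such a fusion. The crop counts and class count are placeholders, not the authors' exact test-time protocol.

```python
import torch

def ensemble_scores(per_model_crop_logits):
    """Average multi-label scores over test crops, then over models (illustrative sketch).

    per_model_crop_logits: list of tensors, one per model, each of shape
    (num_crops, num_classes) for a single video.
    """
    per_model = [torch.sigmoid(l).mean(dim=0) for l in per_model_crop_logits]  # per-model average
    return torch.stack(per_model).mean(dim=0)                                  # (num_classes,)

# e.g. three models, each evaluated with 6 crops of one video
scores = ensemble_scores([torch.randn(6, 313) for _ in range(3)])
print(scores.shape)   # torch.Size([313])
```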
Technical Discussions
The paper discusses several factors that impacted model performance:
- Loss Functions: Experiments with several multi-label classification losses showed the best performance when upscaling the BCE loss (a sketch of such a scaled BCE objective follows this list). Class-rebalancing strategies were also evaluated, but training on the original, unmodified data yielded the best results.
- Pre-training Considerations: While pre-training on Kinetics improved image-based methods, it surprisingly degraded the performance of SlowFast variants, likely due to dataset domain differences.
- Training Specifications: For image-based models, spatial and temporal data augmentations were employed. For SlowFast models, a half-period cosine learning-rate schedule, batch normalization, and dropout helped the networks converge effectively.
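As referenced in the loss-function item above, an "upscaled" BCE objective can be sketched as standard multi-label binary cross-entropy multiplied by a constant factor; the scale value and class count below are assumptions, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class ScaledBCELoss(nn.Module):
    """Sketch of an upscaled multi-label BCE loss (scale factor is an assumed value)."""

    def __init__(self, scale: float = 100.0):
        super().__init__()
        self.scale = scale
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits, targets):
        # logits, targets: (N, num_classes); targets are multi-hot {0, 1}
        return self.scale * self.bce(logits, targets.float())

criterion = ScaledBCELoss()
logits = torch.randn(4, 313)
targets = torch.zeros(4, 313)
targets[:, :3] = 1                      # each clip carries a few positive labels
loss = criterion(logits, targets)
```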
Conclusion and Future Work
The authors present a comprehensive approach to multi-label action recognition on the Multi-Moments in Time dataset, introducing the TIN model within a broader ensemble strategy. As future directions, they suggest integrating additional modalities, such as optical flow and audio, to exploit multimodal information further, and exploring more advanced loss formulations tailored to multi-label classification. The release of a unified code repository underlines the commitment to community engagement and methodological transparency.