
Momentum^2 Teacher: Momentum Teacher with Momentum Statistics for Self-Supervised Learning (2101.07525v1)

Published 19 Jan 2021 in cs.LG and cs.CV

Abstract: In this paper, we present a novel approach, Momentum^2 Teacher, for student-teacher based self-supervised learning. The approach performs momentum updates on both network weights and batch normalization (BN) statistics. The teacher's weights are a momentum update of the student's, and the teacher's BN statistics are a momentum update of those in history. The Momentum^2 Teacher is simple and efficient. It can achieve state-of-the-art results (74.5%) under the ImageNet linear evaluation protocol using a small batch size (e.g., 128), without requiring large-batch training on special hardware like TPUs or inefficient cross-GPU operations (e.g., shuffling BN, synced BN). Our implementation and pre-trained models will be given on GitHub (https://github.com/zengarden/momentum2-teacher).

Citations (16)

Summary

  • The paper presents a novel student-teacher framework that applies momentum updates to both network weights and batch normalization statistics.
  • It introduces Momentum BN to stabilize training, enabling small-batch training and achieving a top-1 accuracy of 74.5% on ImageNet.
  • The method enhances existing frameworks like MoCo and BYOL by offering a resource-efficient approach applicable to various deep learning architectures.

Analysis of Momentum^2 Teacher for Self-Supervised Learning

The paper "Momentum2^2 Teacher: Momentum Teacher with Momentum Statistics for Self-Supervised Learning" by Zeming Li, Songtao Liu, and Jian Sun introduces a novel approach to enhance the student-teacher paradigm in self-supervised learning. The core advancement in this work is the application of momentum updates not only to network weights but also to batch normalization (BN) statistics, termed as Momentum2^2 Teacher.

The student-teacher framework has been pivotal in advancing self-supervised learning, especially for visual representation tasks. It involves two networks with identical architectures: the student, trained via gradient back-propagation, and the teacher, whose weights are a momentum average of the student's. The success of methods like MoCo and BYOL underscores the significance of this approach. However, these methods rely heavily on stable batch normalization statistics, which typically necessitates large batch sizes and either specialized hardware like TPUs or inefficient cross-GPU operations (e.g., shuffling BN, synchronized BN).
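To make the weight-averaging step concrete, the sketch below shows a generic exponential-moving-average (EMA) update of the teacher from the student in PyTorch. It is a minimal illustration of the standard momentum-teacher mechanism, not the authors' released code; the momentum value 0.99 and the function name `update_teacher` are illustrative.

```python
import copy

import torch
import torch.nn as nn


def update_teacher(student: nn.Module, teacher: nn.Module, m: float = 0.99) -> None:
    """EMA update of the teacher weights: theta_t <- m * theta_t + (1 - m) * theta_s."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(m).add_(p_s, alpha=1.0 - m)


# The teacher starts as a copy of the student and receives no gradients;
# it is refreshed once per training step via the EMA update above.
student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

update_teacher(student, teacher, m=0.99)
```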

Momentum^2 Teacher addresses this challenge by introducing momentum updates to the BN statistics in the teacher network. This is achieved through a mechanism dubbed Momentum BN, which normalizes features using an exponential moving average of historical batch statistics rather than the statistics of the current mini-batch. This allows efficient training with batch sizes as small as 128, compared with the 4096 used by frameworks like BYOL. Notably, under the linear evaluation protocol on ImageNet, Momentum^2 Teacher achieves a top-1 accuracy of 74.5%, underscoring its efficacy.
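The sketch below illustrates this idea for 2D feature maps: the layer keeps EMA estimates of the per-channel mean and variance and normalizes with those momentum statistics instead of the current mini-batch statistics. It is a simplified reading of the mechanism described above, not the paper's implementation; the class name `MomentumBN2d`, the default momentum of 0.99, and the ordering of the EMA update relative to normalization are assumptions.

```python
import torch
import torch.nn as nn


class MomentumBN2d(nn.Module):
    """BN variant that normalizes with an EMA of historical batch statistics
    instead of the statistics of the current mini-batch."""

    def __init__(self, num_features: int, momentum: float = 0.99, eps: float = 1e-5):
        super().__init__()
        self.momentum = momentum
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Fold the current mini-batch statistics into the EMA buffers ...
            batch_mean = x.mean(dim=(0, 2, 3))
            batch_var = x.var(dim=(0, 2, 3), unbiased=False)
            with torch.no_grad():
                self.running_mean.mul_(self.momentum).add_(batch_mean, alpha=1 - self.momentum)
                self.running_var.mul_(self.momentum).add_(batch_var, alpha=1 - self.momentum)
        # ... but normalize with the momentum statistics, not the batch ones,
        # so small, noisy batches do not destabilize the teacher's outputs.
        mean = self.running_mean.view(1, -1, 1, 1)
        var = self.running_var.view(1, -1, 1, 1)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```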

Key Contributions

  1. Efficiency with Small Batches: The Momentum^2 Teacher approach facilitates small-batch training, maintaining competitive performance without needing hardware such as TPUs or the computational expense of synchronized or shuffling BN. This makes the approach more accessible to a broader range of researchers.
  2. Momentum BN: The introduction of Momentum BN, which conducts momentum updates on batch statistics, is pivotal. It stabilizes training without impacting the learning dynamics of the student-teacher framework. This advancement is not only theoretically interesting but also practically beneficial as it results in faster and more resource-efficient training.
  3. Integration with Existing Frameworks: The proposed treatment of BN statistics is widely applicable. The authors demonstrate improvements over existing methods such as MoCo and BYOL after integrating Momentum BN, emphasizing the general applicability of their method (see the sketch after this list for how such a swap might look).
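As a rough illustration of how Momentum BN could be dropped into an existing backbone, the sketch below recursively replaces `nn.BatchNorm2d` layers with the hypothetical `MomentumBN2d` module from the earlier sketch. The traversal pattern is standard PyTorch; which layers to replace, and with what momentum, would depend on the specific framework (e.g., MoCo or BYOL) being modified.

```python
import torch.nn as nn
import torchvision


def replace_bn_with_momentum_bn(module: nn.Module, momentum: float = 0.99) -> None:
    """Recursively swap nn.BatchNorm2d layers for the MomentumBN2d sketch above."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, MomentumBN2d(child.num_features, momentum=momentum))
        else:
            replace_bn_with_momentum_bn(child, momentum)


# Example: build a ResNet-50 teacher backbone and replace its BN layers.
teacher_backbone = torchvision.models.resnet50()
replace_bn_with_momentum_bn(teacher_backbone)
```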

Implications and Future Prospects

The implications of these findings are significant for the design of self-supervised learning systems, particularly where computational resources are limited. The paper suggests that a shift towards more efficient model training is both feasible and beneficial. The introduction of Momentum BN also opens new directions for optimizing batch normalization in deep learning architectures beyond self-supervised learning.

Looking ahead, several developments can be anticipated. Applying adaptive momentum techniques to other components of the neural-network training pipeline, as well as to different architectures such as Transformers, opens a fascinating avenue for exploration. Moreover, integrating these techniques with emerging hardware architectures could further enhance the scalability and efficiency of self-supervised frameworks.

In conclusion, by facilitating efficient and high-performance training with smaller batch sizes, Momentum^2 Teacher represents a strategic enhancement to the student-teacher paradigm, making robust self-supervised learning more accessible and practical. This work is an exemplary step towards improved algorithmic efficiency and wider applicability of self-supervised learning methodologies.
