OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning (2012.11552v2)

Published 21 Dec 2020 in cs.CV and cs.LG

Abstract: Learning image representations without human supervision is an important and active research field. Several recent approaches have successfully leveraged the idea of making such a representation invariant under different types of perturbations, especially via contrastive-based instance discrimination training. Although effective visual representations should indeed exhibit such invariances, there are other important characteristics, such as encoding contextual reasoning skills, for which alternative reconstruction-based approaches might be better suited. With this in mind, we propose a teacher-student scheme to learn representations by training a convolutional net to reconstruct a bag-of-visual-words (BoW) representation of an image, given as input a perturbed version of that same image. Our strategy performs an online training of both the teacher network (whose role is to generate the BoW targets) and the student network (whose role is to learn representations), along with an online update of the visual-words vocabulary (used for the BoW targets). This idea effectively enables fully online BoW-guided unsupervised learning. Extensive experiments demonstrate the interest of our BoW-based strategy which surpasses previous state-of-the-art methods (including contrastive-based ones) in several applications. For instance, in downstream tasks such as Pascal object detection, Pascal classification and Places205 classification, our method improves over all prior unsupervised approaches, thus establishing new state-of-the-art results that are also significantly better even than those of supervised pre-training. We provide the implementation code at https://github.com/valeoai/obow.

Authors (6)
  1. Spyros Gidaris (34 papers)
  2. Andrei Bursuc (55 papers)
  3. Gilles Puy (48 papers)
  4. Nikos Komodakis (37 papers)
  5. Matthieu Cord (129 papers)
  6. Patrick Pérez (90 papers)
Citations (68)

Summary

OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning

The paper presents OBoW (Online Bag-of-Visual-Words), a novel methodology for self-supervised learning that leverages convolutional neural networks (CNNs) to learn image representations by reconstructing Bag-of-Visual-Words (BoW) representations from perturbed versions of images. The method elegantly combines reconstruction and representation learning techniques within a teacher-student framework, facilitating the generation of robust image embeddings without requiring annotated datasets.

Core Contributions

  1. Fully Online Teacher-Student Scheme: The authors introduce an approach that continuously updates both the teacher and student networks. In contrast to prior methods that rely on a static or pre-trained teacher, OBoW maintains the teacher as an exponential moving average of the student, so the quality of the BoW targets improves steadily as training progresses (a minimal sketch of this update appears after this list).
  2. Dynamic BoW Prediction Module: The student predicts BoW distributions through a generation network whose prediction weights are produced from the online-updated vocabulary of visual words. This dynamic prediction head is crucial for stability, as it sidesteps the issues that arise when the vocabulary changes under a fixed predictor.
  3. Augmented Contextual Learning: By using multi-scale BoW targets and aggressive data perturbations, including various cropping strategies, OBoW enriches the contextual learning of the student network. Forced to infer the global image context from limited visible evidence, the student learns richer and more invariant features.
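
To make these pieces concrete, below is a minimal PyTorch sketch of one OBoW-style training step: the teacher produces soft-assignment BoW targets from the unperturbed image, the student predicts that distribution from a perturbed view through a prediction head generated from the vocabulary, and the teacher and vocabulary are then updated online. The tiny linear "backbones", the cosine-similarity assignment, the max-pool BoW reduction, and the vocabulary refresh rule are simplified assumptions for illustration, not the authors' exact design; the official implementation is at https://github.com/valeoai/obow.

```python
# Minimal, self-contained sketch of one OBoW-style training step (PyTorch).
# All hyperparameters and network choices are illustrative placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, num_words, batch = 128, 64, 8

# Stand-ins for the paper's convolutional backbones.
student = torch.nn.Sequential(torch.nn.Linear(feat_dim, feat_dim), torch.nn.ReLU(),
                              torch.nn.Linear(feat_dim, feat_dim))
teacher = torch.nn.Sequential(torch.nn.Linear(feat_dim, feat_dim), torch.nn.ReLU(),
                              torch.nn.Linear(feat_dim, feat_dim))
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)          # teacher is updated by EMA, not by gradients

vocab = F.normalize(torch.randn(num_words, feat_dim), dim=1)   # visual words V

# Generation network g(.) mapping the current vocabulary to the weights of the
# student's BoW-prediction layer (the "dynamic" prediction head).
gen = torch.nn.Linear(feat_dim, feat_dim)
opt = torch.optim.SGD(list(student.parameters()) + list(gen.parameters()), lr=0.1)

def bow_target(feats, vocab, temp=0.1):
    """Soft-assign each teacher feature to the visual words, then pool over
    spatial locations to obtain one BoW distribution per image."""
    sims = F.normalize(feats, dim=-1) @ vocab.t() / temp       # (B, L, K)
    assign = sims.softmax(dim=-1)
    bow = assign.max(dim=1).values                             # reduced BoW (max-pool)
    return bow / bow.sum(dim=-1, keepdim=True)

# --- one training step ---
image_feats = torch.randn(batch, 16, feat_dim)                 # stand-in feature maps
crop_feats = image_feats + 0.1 * torch.randn_like(image_feats) # "perturbed" view

with torch.no_grad():
    t_feats = teacher(image_feats)            # teacher sees the unperturbed image
    target = bow_target(t_feats, vocab)

s_embed = student(crop_feats).mean(dim=1)     # global student embedding
pred_w = gen(vocab)                           # dynamic prediction weights W = g(V)
logits = s_embed @ pred_w.t()
loss = -(target * logits.log_softmax(dim=-1)).sum(dim=-1).mean()  # CE to soft BoW

opt.zero_grad(); loss.backward(); opt.step()

# Online updates: EMA teacher, plus a toy vocabulary refresh (the paper keeps
# the vocabulary up to date online from teacher features; this running average
# over a slice of the batch is a simplification of that mechanism).
momentum = 0.99
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    vocab = F.normalize(0.99 * vocab +
                        0.01 * t_feats.reshape(-1, feat_dim)[:num_words], dim=1)

print(f"loss = {loss.item():.3f}")
```

Note that gradients flow only through the student and the generation network; the teacher and the vocabulary are updated outside the loss, which is what makes the scheme fully online.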

Empirical Validation

OBoW exhibits superior performance compared to previous state-of-the-art self-supervised approaches across a variety of computer vision tasks. Extensive experiments on datasets such as Pascal VOC and Places205 show OBoW outperforming even supervised pre-trained models in certain scenarios, such as object detection and classification. Quantitatively, the gains include a 0.7% increase in ImageNet top-1 accuracy over existing unsupervised techniques under comparable conditions.

Implications and Future Prospects

The implications of this work are considerable. By offering a framework that does not rely on labeled data, OBoW paves the way for practical applications in large-scale, real-world settings where data annotation is impractical or costly. Its integration of BoW generation with online learning strategies could inspire similar approaches in other domains of artificial intelligence, such as video analysis or 3D data interpretation.

Moving forward, the primary avenues for improvement concern scaling OBoW to more complex network architectures and more diverse datasets. Extending the BoW reconstruction paradigm to other learning settings, such as reinforcement learning or cross-modal learning, is another promising direction.

In conclusion, the OBoW framework represents a significant methodological advancement in the domain of self-supervised learning. It demonstrates the potential of combining traditional BoW techniques with cutting-edge neural network training paradigms, opening avenues for future research to build upon these foundational insights.

GitHub: https://github.com/valeoai/obow