Adversarial Masking for Self-Supervised Learning (2201.13100v3)

Published 31 Jan 2022 in cs.CV and cs.LG

Abstract: We propose ADIOS, a masked image model (MIM) framework for self-supervised learning, which simultaneously learns a masking function and an image encoder using an adversarial objective. The image encoder is trained to minimise the distance between representations of the original and that of a masked image. The masking function, conversely, aims at maximising this distance. ADIOS consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets -- including classification on ImageNet100 and STL10, transfer learning on CIFAR10/100, Flowers102 and iNaturalist, as well as robustness evaluated on the backgrounds challenge (Xiao et al., 2021) -- while generating semantically meaningful masks. Unlike modern MIM models such as MAE, BEiT and iBOT, ADIOS does not rely on the image-patch tokenisation construction of Vision Transformers, and can be implemented with convolutional backbones. We further demonstrate that the masks learned by ADIOS are more effective in improving representation learning of SSL methods than masking schemes used in popular MIM models. Code is available at https://github.com/YugeTen/adios.

Authors (4)
  1. Yuge Shi (11 papers)
  2. N. Siddharth (38 papers)
  3. Philip H. S. Torr (219 papers)
  4. Adam R. Kosiorek (15 papers)
Citations (71)

Summary

Adversarial Masking for Self-Supervised Learning

The paper introduces ADIOS (Adversarial Inference-Occlusion Self-supervision), a framework that improves self-supervised learning (SSL) through masked image modeling (MIM). Unlike conventional MIM methods, ADIOS jointly optimizes an image encoder and a masking function with an adversarial objective: the encoder is trained to produce similar representations for an image and its masked counterpart, while the masking function is trained to maximize the distance between those representations, continually challenging the encoder to extract robust features.
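
To make the min-max objective concrete, the PyTorch sketch below alternates the two updates described above. It assumes a toy single-mask occlusion model, a toy convolutional backbone, and a SimSiam/BYOL-style negative-cosine distance; the names (`Masker`, `ssl_distance`, `training_step`) are illustrative and not the authors' implementation, which is available at https://github.com/YugeTen/adios.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Masker(nn.Module):
    """Predicts a soft occlusion mask in [0, 1] per pixel (single-mask sketch)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def ssl_distance(z1, z2):
    """Negative cosine similarity, in the spirit of SimSiam/BYOL-style objectives."""
    return -F.cosine_similarity(z1, z2, dim=-1).mean()

# Toy stand-in backbone; ADIOS itself can use CNN backbones such as ResNets.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
masker = Masker()
opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.05)
opt_mask = torch.optim.SGD(masker.parameters(), lr=0.05)

def training_step(images):
    # Encoder step: minimise the distance between representations of the
    # original image and its masked version (mask treated as fixed here).
    mask = masker(images).detach()
    loss_enc = ssl_distance(encoder(images), encoder(images * (1 - mask)))
    opt_enc.zero_grad()  # also clears any stale encoder grads from the masker step
    loss_enc.backward()
    opt_enc.step()

    # Masker step: maximise that same distance, i.e. gradient ascent via the
    # negated loss; gradients reach the masker through the masked input.
    mask = masker(images)
    with torch.no_grad():
        z_orig = encoder(images)
    loss_mask = -ssl_distance(z_orig, encoder(images * (1 - mask)))
    opt_mask.zero_grad()
    loss_mask.backward()
    opt_mask.step()
    return loss_enc.item(), loss_mask.item()
```

A call like `training_step(torch.randn(8, 3, 64, 64))` runs one round of the adversarial game; in practice the encoder loss would be a full SSL objective (e.g. SimSiam or BYOL with projector and predictor heads) rather than this bare cosine distance.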

ADIOS differs from modern MIM models such as MAE, BEiT, and iBOT in a key respect: it does not depend on the image-patch tokenization of Vision Transformers (ViT) and can therefore be used with convolutional backbones, broadening its applicability across architectures. The paper's empirical evaluation shows improvements across a range of tasks, including image classification, transfer learning, and robustness to spurious background correlations.

Key Numerical Results

The authors report marked gains in representation learning from the ADIOS framework, surpassing several state-of-the-art SSL methods across multiple datasets, including ImageNet100, STL10, and CLEVR. On ImageNet100-S, BYOL combined with ADIOS improves linear evaluation accuracy by 5.1% to 61.4%, while on STL10, SimSiam paired with ADIOS reaches 86.4%. Transfer learning experiments further highlight ADIOS's efficacy, with superior classification accuracy on downstream tasks such as CIFAR10, CIFAR100, Flowers102, and iNaturalist. Robustness tests on the backgrounds challenge dataset show improved stability under variations in image backgrounds, indicating reduced reliance on potentially misleading background context.

Implications and Future Directions

The implications of ADIOS are substantial both theoretically and practically. By introducing a learnable, adversarial masking scheme that enhances representation learning through semantic occlusion, ADIOS underscores that the content of occlusion matters significantly. This perspective challenges prior models that often employ random masking, revealing a potential avenue for further research into more strategic or semantic-based augmentation techniques for both self-supervised and supervised learning contexts.

Future research could extend ADIOS by exploring its scalability with larger datasets and more complex architectures. One potential direction is optimizing computational efficiency by increasing the number of masking slots without inflating resource demands, as suggested with the lightweight variant, ADIOS-s. Another area ripe for exploration is improving mask generation sophistication to achieve finer granularity and potentially bridge the gap towards more effective semi-supervised or unsupervised scene understanding frameworks.
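
For context on the masking slots mentioned above, the sketch below shows one plausible way to produce several soft masks at once: a pixel-wise softmax over N output channels, so every pixel is softly assigned to one of N slots and the per-pixel mask weights sum to one. The layer sizes, slot count, and class name (`MultiSlotMasker`) are assumptions for illustration, not the ADIOS-s architecture.

```python
import torch
import torch.nn as nn

class MultiSlotMasker(nn.Module):
    """Emits N soft masks via a pixel-wise softmax over slots; masks sum to one per pixel."""
    def __init__(self, num_slots: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_slots, 3, padding=1),
        )

    def forward(self, x):
        logits = self.head(x)                # (B, N, H, W)
        return torch.softmax(logits, dim=1)  # soft assignment of each pixel to a slot

masks = MultiSlotMasker(num_slots=4)(torch.randn(2, 3, 64, 64))
print(masks.shape)              # torch.Size([2, 4, 64, 64])
print(masks.sum(dim=1).mean())  # ~1.0: per-pixel slot weights sum to one
```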

In conclusion, ADIOS offers a compelling re-evaluation of MIM strategies in self-supervised learning, presenting significant improvements while broadening the applicability of such methods across neural architectures. Its adversarial approach to learning occlusions demonstrates a powerful means to advance the robustness and utility of learned image representations, warranting additional research and development for expanded functionality and performance in diverse applications.
