
Robust and Generalizable Visual Representation Learning via Random Convolutions (2007.13003v3)

Published 25 Jul 2020 in cs.CV and cs.LG

Abstract: While successful for various computer vision tasks, deep neural networks have been shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. In this work, we show that the robustness of neural networks can be greatly improved through the use of random convolutions as data augmentation. Random convolutions are approximately shape-preserving and may distort local textures. Intuitively, randomized convolutions create an infinite number of new domains with similar global shapes but random local textures. Therefore, we explore using outputs of multi-scale random convolutions as new images or mixing them with the original images during training. When applying a network trained with our approach to unseen domains, our method consistently improves the performance on domain generalization benchmarks and is scalable to ImageNet. In particular, in the challenging scenario of generalizing to the sketch domain in PACS and to ImageNet-Sketch, our method outperforms state-of-the-art methods by a large margin. More interestingly, our method can benefit downstream tasks by providing a more robust pretrained visual representation.

Citations (165)

Summary

  • The paper proposes Random Convolutions (RandConv), a data augmentation method that improves visual representation learning by disrupting local textures while preserving global shape.
  • RandConv uses randomized filters, image mixing, and a consistency loss to train models that are robust to texture variations and domain shifts.
  • Experiments show RandConv significantly improves performance on domain generalization benchmarks like PACS and ImageNet-Sketch compared to existing methods.

Robust and Generalizable Visual Representation Learning via Random Convolutions

This paper investigates an approach to improving the robustness and generalizability of deep neural networks (DNNs) in visual representation learning, addressing the ongoing challenge of domain shifts and perturbations that degrade model performance. The authors propose a data augmentation technique based on random convolutions, termed 'RandConv', which disrupts local textures while preserving global shape information within images.

Problem Addressed

Current DNNs often rely heavily on local texture and other superficial cues when recognizing objects in images. As a consequence, these models typically falter under domain shifts, such as transitioning from synthetic data to real-world images or from natural photographs to sketches, because their generalization capability is limited. This sensitivity to texture also leaves them vulnerable to small perturbations and adversarial attacks. Humans, in contrast, rely more on global shape information than on texture, which motivates the proposed RandConv methodology.

Methodology: Random Convolutions for Robustness

RandConv randomizes local textures while preserving the essential shapes in an image by passing it through a convolution layer whose weights are re-sampled at random rather than learned. This randomness effectively generates a multitude of new domains with altered local textures but similar structural features. The approach has three components, illustrated by a minimal code sketch after the list:

  • Randomized Convolution: Filter weights are re-sampled at random for each application, producing shape-preserving yet texture-disrupting transformations. Sampling filters at multiple kernel sizes alters textures at varying spatial scales, and the scheme scales to large datasets such as ImageNet.
  • Mixing Strategy: Randomly convolved images are blended with the original images, providing a continuum of augmentation strength that retains some of the original texture characteristics while still exposing the model to drastically different textures.
  • Consistency Loss: A consistency loss enforces invariant predictions across different random-convolution views of the same image, promoting generalization under texture variations.
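
The following is a minimal PyTorch sketch of these three components, intended only as an illustration of the idea. The kernel-size set, Gaussian filter scale, number of augmented views, and loss weight are assumptions chosen for illustration rather than the authors' exact settings, and the consistency term uses a simple KL-to-the-mean formulation.

```python
# Minimal sketch of RandConv-style augmentation: multi-scale random filters,
# mixing with the original image, and a consistency loss across views.
# Hyperparameters below are illustrative assumptions, not the paper's exact values.
import random
import torch
import torch.nn.functional as F

def rand_conv(images, kernel_sizes=(1, 3, 5, 7), mix=True):
    """Apply a freshly sampled random convolution to a batch of (B, 3, H, W) images."""
    k = random.choice(kernel_sizes)  # multi-scale: pick a kernel size at random
    # Zero-mean Gaussian filters; the 1/sqrt(fan_in) scale keeps output magnitudes
    # comparable to the input (an assumption in the spirit of He initialization).
    weight = torch.randn(3, 3, k, k, device=images.device) / (3 * k * k) ** 0.5
    out = F.conv2d(images, weight, padding=k // 2)
    if mix:
        # Mixing strategy: blend the random-convolution output with the original image.
        alpha = torch.rand(1, device=images.device)
        out = alpha * images + (1 - alpha) * out
    return out

def consistency_loss(logits_list):
    """Average KL divergence between each view's prediction and the mean prediction."""
    probs = [F.softmax(l, dim=1) for l in logits_list]
    mean_p = torch.stack(probs).mean(dim=0)
    return sum(F.kl_div(mean_p.log(), p, reduction="batchmean") for p in probs) / len(probs)

def train_step(model, images, labels, optimizer, n_views=3, lambda_c=10.0):
    """One illustrative training step combining the task loss with the consistency term."""
    views = [images] + [rand_conv(images) for _ in range(n_views - 1)]
    logits = [model(v) for v in views]
    task_loss = sum(F.cross_entropy(l, labels) for l in logits) / len(logits)
    loss = task_loss + lambda_c * consistency_loss(logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the filter weights are re-sampled on every call, each training step effectively sees a new synthetic texture domain while the underlying shapes remain intact.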

Experimentation and Results

The methodology is validated through experiments on multiple datasets, including digit recognition tasks, PACS (a domain generalization benchmark spanning photos, art paintings, cartoons, and sketches), and generalization tests from ImageNet to ImageNet-Sketch:

  • Digit Recognition: RandConv delivers superior performance on a cross-dataset digit recognition benchmark, training on a single source domain and evaluating on unseen digit datasets, and outperforms existing single-domain generalization methods by a notable margin.
  • PACS Benchmark: Strong improvements are observed, particularly for the challenging Sketch domain within PACS, demonstrating RandConv’s robustness against texture shifts.
  • ImageNet to ImageNet-Sketch: Significant improvements over baselines reflect RandConv's effectiveness against severe texture and style shifts.

Implications and Future Directions

RandConv provides a promising avenue for crafting visual representations that are robust to texture variations without requiring access to multiple source domains during training. This methodology could be pivotal for adapting models to diverse and unforeseen operating conditions; one potential direction is its application to autonomous vehicles or robot vision systems, where environments vary drastically.

The paper also has broader implications for transfer learning practices in computer vision, especially for pretraining models with enhanced robustness and transferring those benefits to downstream tasks. The notion that shape-biased pretrained models may aid performance on semantic tasks opens substantial avenues for further exploration in disentangling and harnessing shape and texture information within neural networks.

Additionally, future research could explore integrating RandConv with adversarial training techniques to enhance robustness to both domain shifts and adversarial perturbations, providing dual utility for model performance on complex real-world data.