
On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location (2003.07064v2)

Published 16 Mar 2020 in cs.CV, cs.LG, and eess.IV

Abstract: In this paper we challenge the common assumption that convolutional layers in modern CNNs are translation invariant. We show that CNNs can and will exploit the absolute spatial location by learning filters that respond exclusively to particular absolute locations by exploiting image boundary effects. Because modern CNNs filters have a huge receptive field, these boundary effects operate even far from the image boundary, allowing the network to exploit absolute spatial location all over the image. We give a simple solution to remove spatial location encoding which improves translation invariance and thus gives a stronger visual inductive bias which particularly benefits small data sets. We broadly demonstrate these benefits on several architectures and various applications such as image classification, patch matching, and two video classification datasets.

Citations (222)

Summary

  • The paper reveals that standard CNN architectures exploit image boundary effects, compromising true translation invariance.
  • It employs empirical evaluations with different padding strategies to demonstrate how spatial encoding affects patch classification.
  • The study indicates that enforcing translation invariance improves model generalization and robustness, particularly in small-data scenarios.

Overview of Translation Invariance in CNNs

The paper "On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location" by Osman Semih Kayhan and Jan C. van Gemert critically examines translation invariance in convolutional neural networks (CNNs), a foundational property often assumed in the design and deployment of such models. The authors contend that convolutional layers in popular CNN architectures exploit absolute spatial location by leveraging image boundary effects, challenging the prevalent notion that these models are inherently translation invariant.

Core Insights

The authors argue that translation invariance, a property valued because weight sharing reduces the number of learnable parameters and introduces a strong inductive bias, is eroded by boundary effects in CNNs. These effects arise because images have finite support, so convolution operations require padding at the image edges. The predominant boundary-handling techniques, such as zero padding, inadvertently allow CNNs to encode absolute spatial information and thereby exploit specific image locations.
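As a minimal illustration of this boundary effect (a 1-D NumPy sketch, not code from the paper), consider convolving a constant signal with an averaging filter under zero padding: an ideally translation-invariant operation would respond identically everywhere, yet the border responses differ, so absolute position becomes detectable.

```python
import numpy as np

# Constant 1-D "image": a truly translation-invariant convolution of a
# constant signal would produce the same response at every position.
signal = np.ones(8)
kernel = np.ones(3) / 3.0  # simple averaging filter

padded = np.pad(signal, 1, mode="constant")           # zero padding ('same' size)
response = np.convolve(padded, kernel, mode="valid")

print(response)
# Interior responses are 1.0, but both border responses drop to ~0.667:
# the injected zeros make the output depend on absolute spatial location.
```

A filter can therefore fire (or stay silent) based purely on how close it sits to the image border, which is exactly the location cue the paper shows networks learn to exploit.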

Through empirical evidence, the paper demonstrates that even simple CNN architectures can perfectly classify images based on the absolute position of objects, a task traditionally believed to be unachievable by translation-invariant models. This is possible because boundary effects are especially pronounced in networks with large receptive fields, which let the influence of the image border reach far into the interior of the image.
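To see how the boundary effect can operate far from the image border, one can stack zero-padded convolutions and watch the contaminated region grow by one pixel per layer on each side, mirroring how the receptive field grows with depth (again a toy NumPy sketch under assumed parameters, not the paper's experiment):

```python
import numpy as np

x = np.ones(16)             # constant 1-D "image"
kernel = np.ones(3) / 3.0   # 3-tap averaging filter
depth = 5

for _ in range(depth):
    # zero padding at every layer, as in a standard 'same'-padded CNN
    x = np.convolve(np.pad(x, 1, mode="constant"), kernel, mode="valid")

# Count positions whose response deviates from the untouched interior value.
affected = int(np.sum(~np.isclose(x, x[len(x) // 2])))
print(affected)  # 10: the boundary effect has crept 5 pixels in from each side
```

With a few dozen layers, as in modern architectures, the border's influence covers the entire image, so position is readable everywhere.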

Experiments and Methodology

The methodology is robust, incorporating multiple traditional CNN architectures including ResNet and DenseNet variants. The researchers evaluated several convolution types, such as valid convolution (V-Conv), same convolution with zero padding (S-Conv), and full convolution with zero padding (F-Conv), highlighting how different padding strategies affect the encoding of spatial locations.
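The three convolution types differ only in how much padding is applied, which is easiest to see from the output sizes they produce. A sketch using NumPy's 1-D convolution modes (the V-Conv/S-Conv/F-Conv naming follows the paper; the code itself is an illustrative assumption):

```python
import numpy as np

x = np.ones(8)   # 1-D "image" of width 8
k = np.ones(3)   # filter of width 3

v = np.convolve(x, k, mode="valid")  # V-Conv: no padding, output shrinks
s = np.convolve(x, k, mode="same")   # S-Conv: zero padding keeps output size
f = np.convolve(x, k, mode="full")   # F-Conv: pad until every partial
                                     # filter/image overlap is included

print(len(v), len(s), len(f))  # 6 8 10
```

For an input of width n and a filter of width m, the output widths are n - m + 1 (valid), n (same), and n + m - 1 (full).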

Key experiments focus on how CNNs can classify patches based solely on their location within an image and how much these classifications are affected by different border handling strategies. The paper also explores improvements in translation invariance by opting for padding schemes like circular padding and demonstrates a marked increase in data efficiency for smaller datasets when spatial location encoding is removed.
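Circular padding replaces the constant zeros with wrapped-around image content, so the border carries no special value a filter can key on. A quick NumPy comparison (an illustrative sketch, not the paper's code):

```python
import numpy as np

row = np.array([1, 2, 3, 4, 5])

zero_pad = np.pad(row, 1, mode="constant")  # [0 1 2 3 4 5 0]: zeros mark the border
circular = np.pad(row, 1, mode="wrap")      # [5 1 2 3 4 5 1]: edges wrap around

# With circular padding, a constant signal again yields a constant response,
# so no absolute position can be read off the convolution output.
signal = np.ones(8)
kernel = np.ones(3) / 3.0
response = np.convolve(np.pad(signal, 1, mode="wrap"), kernel, mode="valid")
print(response)  # all 1.0
```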

Results and Implications

The results reveal substantial nuances in the translation invariance of CNNs:

  • CNNs can exploit image boundary effects significantly, allowing them to memorize specific spatial locations.
  • Translation invariance can be enforced more effectively with full convolutions, leading to improved generalization and increased accuracy, particularly in small-data scenarios.
  • Removing location-specific encoding fosters resilience to minor image perturbations, bolstering the model's robustness to shifts and transformations not seen during training.

Theoretical and Practical Implications

Theoretically, this research calls for reconsidering one of the fundamental properties attributed to CNNs and urges a re-evaluation of the assumptions underlying CNN design, particularly in applications requiring high data efficiency and robust generalization. Practically, the insight unlocks optimizations for training models in data-constrained settings: enforcing translation invariance pushes a CNN to learn meaningful hierarchical representations rather than shallow, location-dependent features.

Future Directions

Future work is likely to focus on refining convolutional architectures to truly embody translation equivariance. Researchers may develop novel padding techniques or convolution operations that mitigate boundary effects without the semantic distortion circular padding introduces by joining opposite image edges, setting the stage for more robust and scalable CNN models across application domains. Further exploration of the relationship between large receptive fields and absolute location encoding could yield optimized architectures that maintain high performance without depending on absolute spatial position.
