- The paper introduces non-local operations that capture long-range dependencies directly, rather than through the repeated local computations of convolutional and recurrent layers.
- Experiments on video classification (Kinetics, Charades) and static image recognition (COCO) show that adding non-local blocks improves accuracy, often at lower computational cost than deeper or 3D baselines.
- Comparative evaluations reveal that non-local blocks complement existing models such as I3D, pointing toward future architectures that integrate global context.
Non-local Neural Networks
Introduction
The paper "Non-local Neural Networks," by Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, introduces a new family of building blocks for neural networks that addresses the limitations of convolutional and recurrent operations in capturing long-range dependencies. These building blocks, called non-local operations, are inspired by the classical non-local means method in computer vision. The key idea is to compute the response at a given position as a weighted sum of the features at all positions, so that long-range dependencies across space, time, or spacetime are modeled directly and efficiently.
Non-local Operations
The paper proposes a general formulation for non-local operations in deep neural networks, defining them through a pairwise function f and a unary function g. A non-local operation computes the response at a given position by considering the features at all other positions, which differs fundamentally from convolutional and recurrent operations: those are inherently local and must be applied repeatedly to model long-range dependencies.
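Concretely, the paper defines the generic non-local operation as follows, where i indexes the output position, j ranges over all positions, and C(x) is a normalization factor:

```latex
y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)
```

Wrapped into a full non-local block, the output is combined with a residual connection, z_i = W_z y_i + x_i, which lets the block be inserted into any pretrained network without disturbing its initial behavior.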
Several instantiations of the pairwise function f are explored: Gaussian, embedded Gaussian, dot product, and concatenation. Each measures similarity between positions differently (the paper notes that the embedded Gaussian version recovers the self-attention form of Vaswani et al.), yet overall performance proves largely insensitive to this choice, indicating that the critical factor is the non-local behavior itself rather than the particular similarity function.
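To make the embedded Gaussian variant concrete, here is a minimal PyTorch sketch of a non-local block for 2D feature maps. The channel-halving bottleneck, the layer names, and the zero-initialized output projection are common implementation choices consistent with the paper's block design, but this is an illustrative rendition rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block for (N, C, H, W) feature maps."""

    def __init__(self, channels):
        super().__init__()
        inter = channels // 2  # bottleneck: halve channels inside the block
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # embeds x_i
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # embeds x_j
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # unary function g
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)    # restores channels
        nn.init.zeros_(self.w_z.weight)  # zero init: the block starts as an
        nn.init.zeros_(self.w_z.bias)    # identity, safe for pretrained nets

    def forward(self, x):
        n, _, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C')
        phi = self.phi(x).flatten(2)                      # (N, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)          # (N, HW, C')
        # f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)); dividing by
        # C(x) = sum_j f(x_i, x_j) is exactly a softmax over j.
        attn = F.softmax(theta @ phi, dim=-1)             # (N, HW, HW)
        y = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.w_z(y)  # residual connection: z = W_z y + x
```

A spacetime version for video follows the same pattern, replacing the 1x1 convolutions with 1x1x1 convolutions and attending over all T*H*W positions.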
Applications in Video Classification
The effectiveness of non-local operations is demonstrated through comprehensive experiments on video classification tasks using the Kinetics and Charades datasets. The authors construct non-local neural networks by integrating non-local blocks into baseline 2D ConvNet and inflated 3D ConvNet architectures.
The results indicate that non-local networks outperform their baseline counterparts across various metrics. For instance, a non-local network built from a ResNet-50 2D baseline with five non-local blocks achieves higher accuracy than both the 2D and the inflated 3D ConvNet baselines, while remaining more computationally efficient. The authors attribute this to the non-local operations' ability to capture long-range dependencies directly, enabling effective multi-hop communication between distant positions in the input data.
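As a purely illustrative sketch of that integration, the helper below interleaves the NonLocalBlock2D defined earlier into a torchvision ResNet-50; the paper's five-block recipe adds a block after every other residual block of the res3 and res4 stages (torchvision's layer2 and layer3), but the exact wiring here is an assumption, not the authors' released code.

```python
import torch
import torchvision

def add_non_local(stage):
    """Insert a NonLocalBlock2D after every other residual block."""
    layers = []
    for i, block in enumerate(stage):
        layers.append(block)
        if i % 2 == 1:  # after every second residual block
            channels = block.conv3.out_channels  # Bottleneck's output width
            layers.append(NonLocalBlock2D(channels))
    return torch.nn.Sequential(*layers)

resnet = torchvision.models.resnet50()
resnet.layer2 = add_non_local(resnet.layer2)  # res3: 4 blocks -> 2 non-local
resnet.layer3 = add_non_local(resnet.layer3)  # res4: 6 blocks -> 3 non-local
```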
Generalization to Image Recognition
Beyond video classification, the paper extends non-local networks to static image recognition: object detection, instance segmentation, and keypoint (pose) estimation on the COCO dataset. Incorporating non-local blocks into the Mask R-CNN backbone yields consistent improvements across all three tasks. Notably, a single non-local block increases AP (average precision) by roughly one point, a substantial gain given the baseline model's already strong performance.
Comparative Analysis
An important contribution of the paper is the comparative analysis between non-local networks and other state-of-the-art models. The comparison with I3D models shows that non-local blocks can achieve better accuracy at lower computational cost. Moreover, combining non-local blocks with I3D yields complementary gains, indicating that non-local operations enhance existing architectures rather than simply replace them.
Future Directions and Implications
The introduction of non-local operations opens new avenues for research in neural network architectures. The ability to model long-range dependencies directly and efficiently has both theoretical and practical implications, particularly for tasks that require integrating global context. One plausible direction for future work is to explore how non-local blocks interact with other architectural innovations, such as richer self-attention mechanisms and graph neural networks, to further improve model performance and efficiency.
Conclusion
In summary, the paper "Non-local Neural Networks" presents a compelling case for the integration of non-local operations into deep neural network architectures. By effectively capturing long-range dependencies, non-local networks demonstrate significant improvements over traditional architectures in both video and image recognition tasks. The proposed method's flexibility and efficacy suggest that non-local operations may become a fundamental component in the design of future neural network models.