Fast Algorithms for Convolutional Neural Networks
This paper presents a class of fast algorithms for Convolutional Neural Networks (CNNs) aimed at the small 3×3 filters commonly employed by state-of-the-art architectures such as VGG, for which existing fast convolution methods are poorly suited. The motivation is the pressure convolution places on available compute: training on large datasets, low-latency applications such as pedestrian detection, and image recognition on mobile devices. The algorithms exploit Winograd’s minimal filtering techniques to reduce arithmetic complexity, achieving up to a four-fold reduction in multiplications compared to direct convolution.
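Concretely, Winograd's minimal filtering result states that computing m outputs of an r-tap FIR filter, denoted F(m, r), requires only m + r − 1 multiplications; nesting the 1D algorithm with itself yields the 2D tile counts used throughout the paper:

```latex
\mu\bigl(F(m, r)\bigr) = m + r - 1,
\qquad
\mu\bigl(F(m \times m,\ r \times r)\bigr) = (m + r - 1)^2
```

For F(2×2, 3×3) this is 16 multiplications against 2·2·3·3 = 36 for direct computation of the same 2×2 output tile (a 2.25× reduction), and for F(4×4, 3×3) it is 36 against 144, the four-fold reduction cited above.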
Key Contributions
- Winograd Filtering Algorithms: The paper introduces Winograd minimal filtering algorithms for efficient convolution with 3×3 filters. The variants explored, F(2×2,3×3) and F(4×4,3×3), compute a 2×2 or 4×4 output tile with 16 or 36 multiplications respectively, versus 36 or 144 for direct convolution (a minimal worked sketch of F(2×2,3×3) follows this list).
- Implementation for GPUs: The practical feasibility of these algorithms is demonstrated through an implementation on NVIDIA Maxwell GPUs. The implementation achieves state-of-the-art throughput at batch sizes from 1 to 64 while using a workspace that never exceeds 16 MB.
- Comparison with FFT-based Methods: FFT-based convolution algorithms are inefficient for the small 3×3 filters used by modern networks, and their substantial arithmetic and memory overhead is hardest to amortize at small batch sizes. By contrast, the Winograd algorithms perform well even at small batch sizes.
- Empirical Evaluation: Benchmarking on the VGG network shows that the proposed algorithms outperform existing implementations based on direct convolution or FFT-based convolution. Across batch sizes from 1 to 64, the implementation sustains high throughput, translating the arithmetic complexity reduction into measured speedups without resorting to large, memory-intensive workspaces.
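To make the first contribution concrete, the sketch below computes a single F(2×2, 3×3) output tile in NumPy using the Bᵀ, G, and Aᵀ transform matrices given in the paper. The function name and the correctness check are illustrative only; the batching over channels, filters, and images that the GPU kernel performs (and its reformulation as matrix multiplications) is omitted.

```python
import numpy as np

# Transform matrices for F(2x2, 3x3), as listed in the paper.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)

AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 filter g.

    Y = A^T [ (G g G^T) .* (B^T d B) ] A, where .* is the elementwise product.
    The elementwise product uses 16 multiplications instead of the 36 needed
    by direct convolution for the same 2x2 output tile.
    """
    U = G @ g @ G.T       # filter transform (4x4)
    V = BT @ d @ BT.T     # data transform (4x4); BT.T is B
    M = U * V             # multiplication stage: 16 elementwise multiplies
    return AT @ M @ AT.T  # inverse transform (2x2 output)

# Sanity check against direct computation (the correlation used in CNN layers).
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

In the full algorithm, neighboring 4×4 input tiles overlap by two pixels, and the elementwise products are summed over input channels, which is what allows the multiplication stage to be recast as matrix multiplications and mapped efficiently onto the GPU.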
Numerical Results
- The proposed algorithm achieves a throughput of 9.57 TFLOPS at batch size 8 with fp16 data, significantly outperforming the cuDNN library. At larger batch sizes, such as 64, throughput reaches 10.28 TFLOPS.
- The arithmetic complexity of the multiplication stage is drastically reduced: α′ = 2.25 multiplications per output for F(4×4,3×3), versus 4.0 for F(2×2,3×3) and 9.0 for direct 3×3 convolution, which caps the speedup attributable to the multiplication stage at 4× (worked out below).
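Here α′ denotes the number of multiplications per output in the multiplication (elementwise product) stage; for an F(m×m, r×r) tile it is (m + r − 1)²/m², from which the cited values follow:

```latex
\alpha' = \frac{(m + r - 1)^2}{m^2}:
\qquad
F(2 \times 2,\ 3 \times 3):\ \frac{16}{4} = 4.0,
\qquad
F(4 \times 4,\ 3 \times 3):\ \frac{36}{16} = 2.25,
\qquad
\text{direct } 3 \times 3:\ 9.0
```

The multiplication stage alone therefore caps the speedup at 9.0 / 2.25 = 4×; approaching this bound in practice requires the data, filter, and inverse transforms to be cheap as well, which is why amortizing them over channels, filters, and tiles matters.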
Implications and Future Work
The practical implications of this research are broad. Training and running inference with CNNs at reduced computational cost benefits applications constrained by hardware, particularly mobile computing and real-time image processing. The reduction in memory requirements further suggests that smaller, more power-efficient hardware can be used without significant trade-offs in performance.
Future developments may include applying these fast algorithms to layers beyond convolution, such as normalization or pooling. Additionally, integrating them into large-scale distributed training pipelines, or, because the multiplication stage reduces to matrix multiplications over channels, combining them with fast matrix multiplication techniques such as Strassen's algorithm, could further reduce computational overhead. The work could also extend to other hardware accelerators, such as FPGAs, to assess broader applicability.
Conclusion
The paper "Fast Algorithms for Convolutional Neural Networks" provides substantive advancements in the computational efficiency of convolutional operations within CNNs leveraging Winograd’s minimal filtering techniques. This work is particularly crucial for improving the performance and feasibility of deploying deep learning models in resource-constrained environments. The demonstrated GPU implementation and extensive benchmarking affirm the viability of these algorithms, paving the way for broader adoption and further optimization in deep learning frameworks.