NUMA-aware FFT-based Convolution on ARMv8 Many-core CPUs (2109.12259v1)

Published 25 Sep 2021 in cs.DC

Abstract: Convolutional Neural Networks (CNNs), one of the most representative algorithms of deep learning, are widely used in various artificial intelligence applications. Convolution operations often account for most of the computational cost of CNNs. The FFT-based algorithm can improve the efficiency of convolution by reducing its algorithmic complexity, and there has been considerable work on high-performance implementations of FFT-based convolution on many-core CPUs. However, these implementations do not optimize for the non-uniform memory access (NUMA) characteristics of many-core CPUs. In this paper, we present a NUMA-aware FFT-based convolution implementation on ARMv8 many-core CPUs with NUMA architectures. The implementation reduces the number of remote memory accesses through data reordering of the FFT transformations and a three-level parallelization of the complex matrix multiplication. Experimental results on an ARMv8 many-core CPU with NUMA architectures demonstrate that our NUMA-aware implementation performs much better than the state-of-the-art work in most cases.
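
The core idea described in the abstract (placing each transformed frequency tile on a specific NUMA node and having only that node's threads run the complex matrix multiplications on it) can be illustrated with a small sketch. The code below is not from the paper: it is a minimal illustration assuming Linux with libnuma and OpenMP, where tile_cgemm() is a placeholder for an optimized complex micro-kernel, tile ownership is assigned round-robin across nodes, and threads are assumed to be pinned (e.g. OMP_PROC_BIND=close) so that numa_node_of_cpu() reflects a stable placement.

```c
/* Sketch only: NUMA-aware batched complex matrix multiplication, the step
 * that dominates FFT-based convolution. Tiles are allocated on the NUMA node
 * that will compute them, and each node's threads process only that node's
 * tiles, so the inner kernel reads local memory (fewer remote accesses). */
#define _GNU_SOURCE
#include <complex.h>
#include <numa.h>
#include <omp.h>
#include <sched.h>
#include <stdlib.h>

typedef float complex cfloat;

/* Placeholder complex GEMM on one frequency tile: C = A * B,
 * A is M x K, B is K x N, C is M x N, all row-major. A real
 * implementation would block and vectorize this (the third level). */
static void tile_cgemm(const cfloat *A, const cfloat *B, cfloat *C,
                       int M, int N, int K) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            cfloat acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* Level 1: tiles split across NUMA nodes (round-robin ownership).
 * Level 2: each node's tiles split among the threads running on that node.
 * Level 3: per-tile kernel (tile_cgemm) blocked per core.                  */
void numa_aware_batched_cgemm(int ntiles, int M, int N, int K) {
    if (numa_available() < 0) return;            /* no NUMA support */
    int nnodes = numa_num_configured_nodes();

    /* Allocate each tile's matrices on the node that will compute it. */
    cfloat **A = malloc(ntiles * sizeof(*A));
    cfloat **B = malloc(ntiles * sizeof(*B));
    cfloat **C = malloc(ntiles * sizeof(*C));
    for (int t = 0; t < ntiles; ++t) {
        int node = t % nnodes;
        A[t] = numa_alloc_onnode((size_t)M * K * sizeof(cfloat), node);
        B[t] = numa_alloc_onnode((size_t)K * N * sizeof(cfloat), node);
        C[t] = numa_alloc_onnode((size_t)M * N * sizeof(cfloat), node);
    }

    int *thread_node = malloc(omp_get_max_threads() * sizeof(int));
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        thread_node[tid] = numa_node_of_cpu(sched_getcpu());
        #pragma omp barrier

        /* Rank of this thread among the threads on its own node. */
        int my_node = thread_node[tid], local_rank = 0, local_count = 0;
        for (int i = 0; i < nthreads; ++i)
            if (thread_node[i] == my_node) {
                if (i < tid) local_rank++;
                local_count++;
            }

        /* Walk only the tiles owned by this node, split among its threads. */
        for (int t = my_node; t < ntiles; t += nnodes) {
            int idx = t / nnodes;             /* position in the node's list */
            if (idx % local_count == local_rank)
                tile_cgemm(A[t], B[t], C[t], M, N, K);
        }
    }
    free(thread_node);

    for (int t = 0; t < ntiles; ++t) {
        numa_free(A[t], (size_t)M * K * sizeof(cfloat));
        numa_free(B[t], (size_t)K * N * sizeof(cfloat));
        numa_free(C[t], (size_t)M * N * sizeof(cfloat));
    }
    free(A); free(B); free(C);
}
```

The point mirrored here is that each node's threads only ever read matrices resident in their own node's memory; this is what suppresses the remote memory accesses the paper targets, while the paper's actual data reordering of the FFT transformations and kernel-level optimizations go well beyond this sketch.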

Authors (6)
  1. Xiandong Huang (1 paper)
  2. Qinglin Wang (5 papers)
  3. Shuyu Lu (2 papers)
  4. Ruochen Hao (5 papers)
  5. Songzhu Mei (5 papers)
  6. Jie Liu (492 papers)
Citations (7)
