L3 Fusion: Fast Transformed Convolutions on CPUs (1912.02165v1)

Published 4 Dec 2019 in cs.DC and cs.LG

Abstract: Fast convolutions via transforms, either Winograd or FFT, have emerged as a preferred way of performing the computation of convolutional layers, as they greatly reduce the number of required operations. Recent work shows that, for many layer structures, a well-designed implementation of fast convolutions can achieve high utilization of modern CPUs, significantly reducing the compute time. However, the generous amount of shared L3 cache present on modern CPUs is often neglected, and the algorithms are optimized solely for the private L2 cache. In this paper we propose an efficient L3 Fusion algorithm that is specifically designed for CPUs with a significant amount of shared L3 cache. Using the hierarchical roofline model, we show that in many cases, especially for layers with fewer channels, the L3-fused approach can greatly outperform the standard three-stage one provided by major vendors such as Intel. We validate our theoretical findings by benchmarking our L3-fused implementation against publicly available state-of-the-art implementations.
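
To make the abstract's notion of a transformed convolution concrete, here is a minimal sketch (not from the paper) of the three stages it refers to: a forward transform of input and kernel, an element-wise product in the transform domain, and an inverse transform. It is a single-channel NumPy illustration with hypothetical function names; the paper's implementation operates on tiled, multi-channel Winograd/FFT transforms and fuses the stages to keep intermediates in the shared L3 cache, none of which is reproduced here.

```python
import numpy as np

def conv2d_direct(x, w):
    """Naive "valid" 2D convolution (cross-correlation, as in CNN
    frameworks), used as a correctness reference."""
    H, W = x.shape
    R, S = w.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + R, j:j + S] * w)
    return out

def conv2d_fft(x, w):
    """Three-stage transformed convolution:
    (1) forward FFT of input and zero-padded kernel,
    (2) element-wise multiply in the frequency domain,
    (3) inverse FFT, keeping only the "valid" region.
    The kernel is flipped so the result matches cross-correlation."""
    H, W = x.shape
    R, S = w.shape
    fx = np.fft.rfft2(x)
    fw = np.fft.rfft2(w[::-1, ::-1], s=x.shape)  # pad kernel to input size
    full = np.fft.irfft2(fx * fw, s=x.shape)
    return full[R - 1:H, S - 1:W]

x = np.random.rand(32, 32)
w = np.random.rand(3, 3)
assert np.allclose(conv2d_direct(x, w), conv2d_fft(x, w))
```

Roughly speaking, the L3 Fusion idea is to run these three stages on blocks sized so that the transformed tiles stay resident in the shared L3 cache, instead of writing the intermediate transform-domain data back to memory between stages as the standard three-stage pipeline does.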

Authors (3)
  1. Rati Gelashvili (21 papers)
  2. Nir Shavit (32 papers)
  3. Aleksandar Zlateski (11 papers)
Citations (2)
