On-FPGA Training with Ultra Memory Reduction: A Low-Precision Tensor Method (2104.03420v2)

Published 7 Apr 2021 in cs.AR

Abstract: Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves $59\times$ speedup and $123\times$ energy reduction compared to an embedded CPU, and $292\times$ memory reduction over standard full-size training.
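The abstract does not give implementation details, but the memory savings of tensorized models are easy to illustrate. The sketch below (assuming a tensor-train factorization, a common choice for tensorized neural networks; the specific layer shape and rank are hypothetical, not taken from the paper) compares the parameter count of a dense weight matrix against its low-rank tensor-train form.

```python
from math import prod

def tt_param_count(dims_in, dims_out, rank):
    """Parameter count of a tensor-train factorized weight.

    A full weight of shape (prod(dims_in), prod(dims_out)) is replaced
    by d small cores, core k having shape (r_k, dims_in[k], dims_out[k],
    r_{k+1}), with boundary ranks r_0 = r_d = 1 and all interior ranks
    set to `rank` in this simplified sketch.
    """
    d = len(dims_in)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[k] * dims_in[k] * dims_out[k] * ranks[k + 1]
               for k in range(d))

# Hypothetical fully connected layer: 1024 x 1024 weight matrix,
# with each mode factorized as 4 * 4 * 8 * 8.
dims_in = [4, 4, 8, 8]
dims_out = [4, 4, 8, 8]
dense = prod(dims_in) * prod(dims_out)        # 1,048,576 parameters
tt = tt_param_count(dims_in, dims_out, rank=8)  # 5,760 parameters
print(f"dense={dense}, tt={tt}, reduction={dense / tt:.0f}x")
```

With these hypothetical shapes the factorized layer stores roughly 182x fewer parameters than the dense one, which is the kind of compression that makes on-device training feasible; the paper's rank-adaptive scheme additionally learns the ranks during training rather than fixing them in advance.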

Authors (5)
  1. Kaiqi Zhang (19 papers)
  2. Cole Hawkins (15 papers)
  3. Xiyuan Zhang (31 papers)
  4. Cong Hao (51 papers)
  5. Zheng Zhang (488 papers)
Citations (10)