
High Accuracy Low Precision QR Factorization and Least Square Solver on GPU with TensorCore (1912.05508v1)

Published 11 Dec 2019 in cs.MS

Abstract: Driven by the insatiable need to process ever larger amounts of data with more complex models, modern processors and accelerators are beginning to offer half-precision floating-point arithmetic, along with highly optimized special units, such as NVIDIA TensorCore on GPUs and the Google Tensor Processing Unit (TPU), that perform half-precision matrix-matrix multiplication exceptionally efficiently. In this paper we present a large-scale mixed-precision linear least squares solver that achieves high accuracy using low-precision TensorCore GPUs. The mixed-precision system combines new algorithms and implementations. It is shown to be up to 14x faster than single-precision cuSOLVER at large-scale QR factorization, with slightly lower accuracy, and up to 10x faster than a double-precision direct QR least squares solver, with comparable accuracy.
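The abstract does not spell out how a low-precision factorization can still yield a high-accuracy solution. One generic way to see the idea, offered here only as an illustrative sketch and not as the authors' algorithm, is to factor the matrix in low precision and then correct the solution with residuals computed in double precision. In the NumPy sketch below, float32 stands in for half precision on TensorCore, and the function name `refine_least_squares` and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def refine_least_squares(A, b, iters=5):
    """Sketch: least squares via a low-precision QR plus float64 refinement.

    Assumption: float32 stands in for the half-precision TensorCore
    factorization; this is not the method from the paper.
    """
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)

    # Factor a low-precision copy of A and keep only the triangular factor R.
    R = np.linalg.qr(A64.astype(np.float32), mode="r").astype(np.float64)

    x = np.zeros(A64.shape[1])
    for _ in range(iters):
        # Normal-equations residual A^T (b - A x), computed in float64.
        g = A64.T @ (b64 - A64 @ x)
        # Apply (R^T R)^{-1}, the low-precision approximation of (A^T A)^{-1}.
        y = np.linalg.solve(R.T, g)
        x += np.linalg.solve(R, y)
    return x
```

For a well-conditioned matrix, each pass shrinks the error left by the low-precision factor, so a handful of iterations brings the result close to what a double-precision solver returns. Ill-conditioned problems would need a more robust scheme than this simple iteration.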

Authors (2)
  1. Shaoshuai Zhang (2 papers)
  2. Panruo Wu (5 papers)
Citations (1)
