Evaluating Sparse Matrix Multiplication on Intel Xeon Phi
- The paper demonstrates that optimized, vectorized kernels on the Xeon Phi can reach up to 22 GFlop/s for SpMV and nearly 128 GFlop/s for SpMM.
- The evaluation shows that sparse matrix density and memory access patterns strongly influence performance, largely through their effect on cache misses.
- The study suggests that future improvements in data locality and matrix storage formats could further raise the throughput achievable on the Xeon Phi architecture.
The paper "Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi" explores a comprehensive analysis of the Intel Xeon Phi coprocessor's ability to handle sparse matrix-vector and matrix-matrix operations, which are central to numerous scientific applications. The authors present a thorough examination of the device's architectural features, evaluating its potential against established processors and accelerators.
The Intel Xeon Phi offers 61 cores with 512-bit-wide SIMD registers, geared toward high-performance computation. The paper targets workloads such as iterative linear solvers and graph mining algorithms, which depend heavily on efficient sparse matrix operations. It focuses on two kernels: Sparse Matrix-Vector Multiplication (SpMV) and Sparse Matrix-Matrix Multiplication (SpMM), assessing their performance across matrices with varying nonzero densities and access patterns; a reference SpMV kernel is sketched below.
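For context, a minimal CSR-based SpMV in C, parallelized across rows with OpenMP, might look like the following. The rowptr/colidx/val layout is the standard CSR convention; all names here are illustrative rather than taken from the paper:

```c
#include <stdint.h>

/* y = A * x for an n-row sparse matrix A in CSR format.
 * rowptr[i]..rowptr[i+1] delimits the nonzeros of row i;
 * colidx[j] and val[j] give their column index and value. */
void spmv_csr(int n, const int64_t *rowptr, const int32_t *colidx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int64_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
            sum += val[j] * x[colidx[j]];
        y[i] = sum;
    }
}
```

Dynamic scheduling is one simple way to absorb the per-row load imbalance that the paper identifies as a key source of performance variability.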
Performance Investigation
For SpMV, the paper reports significant performance variability driven by how nonzeros are distributed across matrix rows. With high compiler optimization levels, the Intel Xeon Phi outperforms the other architectures tested, reaching up to 22 GFlop/s on favorable matrices, largely thanks to effective vectorization. The SIMD capabilities, particularly the vgatherd instruction, enable this by vectorizing the irregular accesses to the input vector and reducing cacheline misses.
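To illustrate what a gather buys here, below is a sketch of the SpMV inner loop rewritten with AVX-512 intrinsics; KNC's vgatherd is the coprocessor-generation analogue of the _mm512_i32gather_pd used in this sketch, and the function name and tail handling are illustrative:

```c
#include <immintrin.h>
#include <stdint.h>

/* Dot product of one CSR row with x, using a vector gather for the
 * irregular x[colidx[j]] accesses (8 doubles per iteration). */
double spmv_row_gather(const double *val, const int32_t *colidx,
                       int64_t begin, int64_t end, const double *x)
{
    __m512d acc = _mm512_setzero_pd();
    int64_t j = begin;
    for (; j + 8 <= end; j += 8) {
        __m256i cols = _mm256_loadu_si256((const __m256i *)&colidx[j]);
        __m512d xv = _mm512_i32gather_pd(cols, x, 8); /* x at 8 indices */
        acc = _mm512_fmadd_pd(_mm512_loadu_pd(&val[j]), xv, acc);
    }
    double sum = _mm512_reduce_add_pd(acc);
    for (; j < end; ++j)                              /* scalar tail */
        sum += val[j] * x[colidx[j]];
    return sum;
}
```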
The SpMM kernel benefits even more from the coprocessor's wide registers, reaching nearly 128 GFlop/s with manual vectorization and the enhanced memory-write instructions, namely Non-Globally Ordered writes paired with No-Read hints (NRNGO). This demonstrates the coprocessor's capacity for the higher computational intensity common in applications such as graph-based recommender systems.
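The SpMM kernel multiplies a sparse matrix by a block of dense vectors, which is where the wide registers and streaming stores pay off: the innermost loop runs over contiguous dense columns. A minimal portable sketch in plain C follows; the names are illustrative, and the NRNGO store itself is a KNC-specific instruction with no direct portable equivalent, so this version relies on the compiler for vectorization:

```c
#include <stdint.h>
#include <string.h>

/* C = A * B, where A is n x n in CSR and B, C are n x k dense,
 * row-major. The innermost loop is contiguous over the k dense
 * columns, so it vectorizes well; on KNC the paper additionally
 * streams finished rows of C out with NRNGO stores to avoid
 * reading C into cache before overwriting it. */
void spmm_csr(int n, int k, const int64_t *rowptr, const int32_t *colidx,
              const double *val, const double *B, double *C)
{
    for (int i = 0; i < n; ++i) {
        double *Ci = &C[(int64_t)i * k];
        memset(Ci, 0, (size_t)k * sizeof(double));
        for (int64_t j = rowptr[i]; j < rowptr[i + 1]; ++j) {
            const double a = val[j];
            const double *Bj = &B[(int64_t)colidx[j] * k];
            for (int c = 0; c < k; ++c)
                Ci[c] += a * Bj[c];
        }
    }
}
```

Because each nonzero of A is reused across all k dense columns, SpMM has much higher arithmetic intensity than SpMV, which is why it comes far closer to the device's peak throughput.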
Comparative Analysis
Compared with other contemporary architectures, the Intel Xeon Phi shows superior performance on both the SpMV and SpMM kernels. SpMM in particular sees a substantial boost on the Xeon Phi, which achieves greater efficiency than general-purpose processors and GPUs thanks to its wide vector units and memory bandwidth strategies.
Implications and Future Directions
The Xeon Phi's architecture, with its large SIMD registers and many cores, has significant implications for high-performance computing (HPC). It potentially sets a new benchmark for energy-efficient, high-throughput computation in scientific and enterprise applications, notably in parallel-processing contexts. Challenges remain, however: the paper identifies memory latency, rather than bandwidth, as the principal bottleneck for these kernels.
Future research may explore further improving data locality and reducing duplicated memory accesses across cores. Optimizations to matrix storage formats and partitioning strategies could prove crucial to maximizing the Xeon Phi's potential, and kernel implementations tailored to the device's architectural strengths remain an open avenue for exploration.
In essence, the Intel Xeon Phi coprocessor exhibits promising capabilities for intensive sparse matrix computations, leading in areas where traditional CPUs and GPUs may lag. The paper's analysis underscores the technological strides achieved with the Xeon Phi, while also hinting at the continued tuning required to fully harness its computational power.