Grid: A next generation data parallel C++ QCD library (1512.03487v1)

Published 10 Dec 2015 in hep-lat, cs.DC, and cs.MS

Abstract: In this proceedings we discuss the motivation, implementation details, and performance of a new physics code base called Grid. It is intended to be more performant, more general, but similar in spirit to QDP++ [QDP]. Our approach is to engineer the basic type system to be consistently fast, rather than bolt on a few optimised routines, and we attempt to write all our optimised routines directly in the Grid framework. It is hoped this will deliver best known practice performance across the next generation of supercomputers, which will provide programming challenges to traditional scalar codes. We illustrate the programming patterns used to implement our goals, and advances in productivity that have been enabled by using new features in C++11.

Summary

The paper introduces Grid, a C++ data parallel library that achieves a fivefold speedup over QDP++ in SU(3) matrix multiplication.
It details a methodology using modern C++11 features and intrinsic functions to abstract SIMD operations for broad hardware compatibility.
Grid employs overdecomposition and interleaving strategies to maximize vector processing efficiency, reaching 65% of theoretical peak on quad-core Haswell CPUs.

Grid: A Next Generation Data Parallel C++ QCD Library

The paper "Grid: A Next Generation Data Parallel C++ QCD Library" by Peter Boyle et al. describes a physics code library called Grid. Aimed at high performance on modern supercomputers, Grid seeks to improve on existing frameworks like QDP++ by using high-level data parallelism and modern C++11 features to exploit the multiple levels of parallelism in quantum chromodynamics (QCD) computations.

Motivation and Architectural Considerations

Supercomputer architectures are evolving towards many more parallel execution units and longer vector lengths, and traditional scalar code bases exploit them poorly. The authors argue that existing Lattice QCD codes need to be re-engineered to use wide SIMD instruction sets such as SSE, AVX, and AVX512. Grid's implementation is guided by portable performance across these architectures: it wraps platform-specific vector operations, written with compiler intrinsic functions, behind C++11 abstractions, so the same high-level code can be deployed efficiently on diverse hardware.
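The abstraction described above can be illustrated with a minimal sketch. It is not Grid's actual class hierarchy: the type VecF, its width of eight floats, and the helper axpy are hypothetical names chosen for illustration. The only architecture-specific code sits inside the two operators, with AVX intrinsics used when the compiler enables them and a plain loop otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical wrapper type (an assumption, not Grid's API): eight packed floats.
struct VecF {
    static constexpr std::size_t width = 8;
    alignas(32) std::array<float, width> lane;
};

// Lane-wise addition. Only this function body depends on the architecture.
inline VecF operator+(const VecF& a, const VecF& b) {
    VecF r;
#if defined(__AVX__)
    _mm256_storeu_ps(r.lane.data(),
                     _mm256_add_ps(_mm256_loadu_ps(a.lane.data()),
                                   _mm256_loadu_ps(b.lane.data())));
#else
    // Generic fallback for targets without AVX.
    for (std::size_t i = 0; i < VecF::width; ++i) r.lane[i] = a.lane[i] + b.lane[i];
#endif
    return r;
}

// Lane-wise multiplication, structured the same way as addition.
inline VecF operator*(const VecF& a, const VecF& b) {
    VecF r;
#if defined(__AVX__)
    _mm256_storeu_ps(r.lane.data(),
                     _mm256_mul_ps(_mm256_loadu_ps(a.lane.data()),
                                   _mm256_loadu_ps(b.lane.data())));
#else
    for (std::size_t i = 0; i < VecF::width; ++i) r.lane[i] = a.lane[i] * b.lane[i];
#endif
    return r;
}

// High-level code is written once and works for float and VecF alike.
template <class T>
T axpy(const T& a, const T& x, const T& y) {
    return a * x + y;
}
```

Porting such a design to another instruction set would, under these assumptions, mean rewriting only the operator bodies; axpy and anything built on top of it stay unchanged.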

Grid Design Patterns and Implementation

Grid uses C++ template programming and intrinsic functions to abstract SIMD vector operations, which keeps architecture-dependent code to a minimum; the authors expect this approach to carry over to upcoming HPC systems. The paper also details how Grid overdecomposes a Cartesian grid problem into smaller virtual nodes and maps them onto SIMD lanes, so that every lane carries an equal share of the work.

To optimize performance, Grid introduces C++ vector data type classes. Matrix-vector computations then perform the same operation on several sites at once, one per SIMD lane, so no horizontal summation across lanes is needed. By interleaving data elements from different virtual nodes in this way, the code base can reach 100% SIMD efficiency.
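The interleaving of virtual nodes can be made concrete with a one-dimensional sketch. Everything here is an illustrative assumption rather than Grid's own layout code: the lane count kLanes, the Pack type, and the functions locate, interleave and shiftForward are hypothetical. A lattice of L sites is split into kLanes virtual nodes of L / kLanes sites each, and site x is stored in lane x / (L / kLanes) of vector x % (L / kLanes).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;         // assumed SIMD width (hypothetical)
using Pack = std::array<double, kLanes>;  // stand-in for one SIMD vector

struct Coord {
    std::size_t outer;  // index of the vector holding the site
    std::size_t lane;   // virtual node, i.e. SIMD lane, owning the site
};

// Map global site x of a 1-D lattice of length L onto (outer, lane).
inline Coord locate(std::size_t x, std::size_t L) {
    assert(L % kLanes == 0);
    const std::size_t sub = L / kLanes;  // sites per virtual node
    return Coord{x % sub, x / sub};
}

// Pack a scalar field so that each vector holds one site from every virtual node.
inline std::vector<Pack> interleave(const std::vector<double>& field) {
    const std::size_t L = field.size();
    assert(L > 0 && L % kLanes == 0);
    std::vector<Pack> packed(L / kLanes);
    for (std::size_t x = 0; x < L; ++x) {
        const Coord c = locate(x, L);
        packed[c.outer][c.lane] = field[x];
    }
    return packed;
}

// Periodic nearest-neighbour shift: result(x) = field(x + 1).
inline std::vector<Pack> shiftForward(const std::vector<Pack>& in) {
    const std::size_t sub = in.size();
    assert(sub > 0);
    std::vector<Pack> out(sub);
    // Interior of each virtual node: neighbours share a lane, so whole vectors move.
    for (std::size_t o = 0; o + 1 < sub; ++o) out[o] = in[o + 1];
    // Virtual-node boundary: the neighbour lives in the next lane, so rotate lanes.
    for (std::size_t l = 0; l < kLanes; ++l) out[sub - 1][l] = in[0][(l + 1) % kLanes];
    return out;
}
```

In this layout a neighbour access moves whole vectors everywhere except at the virtual-node boundary, where a lane rotation is needed; all lanes do useful work at every step.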

Performance and Comparative Analysis

The paper provides empirical evidence of Grid's performance advantage over QDP++, attributing a fivefold speed increase in SU(3) matrix multiplication to its efficient handling of cache-resident data and intrinsic vector operations. The authors also document performance across various architectures, most notably a quad-core Haswell CPU, on which Grid achieves 65% of theoretical peak performance.
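As a rough illustration of the kind of kernel being benchmarked, the sketch below multiplies 3x3 complex matrices whose elements are either scalars or packs of lanes. It is a simplified stand-in under stated assumptions, not Grid's implementation and not the code that produced the reported timings: CPack, Mat3, multiply and the lane count kLanes are hypothetical names, and the lanes are emulated with a plain array rather than hardware vectors.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed number of SIMD lanes (hypothetical)
using Cplx = std::complex<double>;

// Hypothetical pack of complex numbers, one per lane (one lattice site per lane).
struct CPack {
    std::array<Cplx, kLanes> lane{};
};

inline CPack operator+(const CPack& a, const CPack& b) {
    CPack r;
    for (std::size_t l = 0; l < kLanes; ++l) r.lane[l] = a.lane[l] + b.lane[l];
    return r;
}

inline CPack operator*(const CPack& a, const CPack& b) {
    CPack r;
    for (std::size_t l = 0; l < kLanes; ++l) r.lane[l] = a.lane[l] * b.lane[l];
    return r;
}

// 3x3 matrix whose element type is either a scalar or a pack.
template <class T>
struct Mat3 {
    T e[3][3];
};

// The same source multiplies one matrix (T = Cplx) or kLanes matrices at once
// (T = CPack). The sum over k runs inside each lane, never across lanes,
// so no horizontal reduction is required.
template <class T>
Mat3<T> multiply(const Mat3<T>& a, const Mat3<T>& b) {
    Mat3<T> c{};
    for (std::size_t i = 0; i < 3; ++i) {
        for (std::size_t j = 0; j < 3; ++j) {
            T sum = a.e[i][0] * b.e[0][j];
            for (std::size_t k = 1; k < 3; ++k) sum = sum + a.e[i][k] * b.e[k][j];
            c.e[i][j] = sum;
        }
    }
    return c;
}
```

Calling multiply on Mat3<CPack> computes kLanes independent matrix products with the same instruction stream as a single scalar product, which is the property that lets a vectorised type system avoid cross-lane summation.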

The paper also contrasts Grid's optimizations with existing methods and argues that Grid has a significant edge in performance portability and computational efficiency, without giving up the generality needed to express modern field theories.

Future Directions and Theoretical Implications

Grid's stencil support and overdecomposition strategies are relevant to the efficient solution of PDEs more generally. The library's flexibility allows its algorithmic suite to be extended to multi-grid methods and complex boundary conditions, both important for advanced QCD simulations. Its adaptability to future architectures also suggests potential use in HPC applications beyond QCD that require similarly intensive computations.

In conclusion, Grid is an effort to align Lattice QCD computational strategies with the demands of modern supercomputers, combining practical performance gains with contributions to parallel computing methodology. The paper lays a foundation for future developments in data parallel physics computing.

GitHub

GitHub - paboyle/Grid: Data parallel C++ mathematical object library (162 stars)