Comparing Parallel Functional Array Languages: Programming and Performance (2505.08906v1)

Published 13 May 2025 in cs.PL, cs.DC, and cs.PF

Abstract: Parallel functional array languages are an emerging class of programming languages that promise to combine low-effort parallel programming with good performance and performance portability. We systematically compare the designs and implementations of five different functional array languages: Accelerate, APL, DaCe, Futhark, and SaC. We demonstrate the expressiveness of functional array programming by means of four challenging benchmarks, namely N-body simulation, MultiGrid, Quickhull, and Flash Attention. These benchmarks represent a range of application domains and parallel computational models. We argue that the functional array code is much shorter and more comprehensible than the hand-optimized baseline implementations because it omits architecture-specific aspects. Instead, the language implementations generate both multicore and GPU executables from a single source code base. Hence, we further argue that functional array code could more easily be ported to, and optimized for, new parallel architectures than conventional implementations of numerical kernels. We demonstrate this potential by reporting the performance of the five parallel functional array languages on a total of 39 instances of the four benchmarks on both a 32-core AMD EPYC 7313 multicore system and on an NVIDIA A30 GPU. We explore in-depth why each language performs well or not so well on each benchmark and architecture. We argue that the results demonstrate that mature functional array languages have the potential to deliver performance competitive with the best available conventional techniques.

Authors (15)
  1. David van Balen (2 papers)
  2. Tiziano De Matteis (13 papers)
  3. Clemens Grelck (7 papers)
  4. Troels Henriksen (2 papers)
  5. Aaron W. Hsu (1 paper)
  6. Gabriele K. Keller (1 paper)
  7. Thomas Koopman (4 papers)
  8. Trevor L. McDonell (3 papers)
  9. Cosmin Oancea (4 papers)
  10. Sven-Bodo Scholz (4 papers)
  11. Artjoms Sinkarovs (5 papers)
  12. Tom Smeding (5 papers)
  13. Phil Trinder (8 papers)
  14. Ivo Gabe de Wolff (1 paper)
  15. Alexandros Nikolaos Ziogas (16 papers)

Summary

This paper, "Comparing Parallel Functional Array Languages: Programming and Performance" (Balen et al., 2025), systematically compares five functional array languages—Accelerate, APL, DaCe, Futhark, and SaC—across their design, implementation, and performance characteristics using four challenging benchmarks: N-body simulation, MultiGrid, Quickhull, and Flash Attention. The core motivation is to explore whether these languages can combine low-effort parallel programming with good performance and portability across different architectures, specifically multicores and GPUs. The authors argue that functional array programming offers a promising approach to taming the complexities of parallel programming, particularly for the array-based computations common in scientific computing and machine learning, by providing a higher, architecture-agnostic level of abstraction.

Language Overview (Section 2)

The paper introduces each language using a naive N-body simulation example:

  • Futhark: A statically typed functional array language with explicit data parallelism via Second Order Array Combinators (SOACs) like map and reduce. It uses a size-dependent type system and models multidimensional arrays as nested arrays, though represented internally as Structure-of-Arrays (SoA). It restricts higher-order functions and recursion to enable efficient compilation, particularly for GPUs.
  • Accelerate: A deep Domain-Specific Language (DSL) embedded in Haskell. It offers rank-polymorphic parallel operations on multi-dimensional arrays. Computations are represented as Abstract Syntax Trees (ASTs) of type Acc for parallel computations or Exp for sequential scalar computations. Like Futhark, it restricts higher-order functions to facilitate defunctionalization and compilation. It automatically transforms Array-of-Structures (AoS) to SoA and employs fusion as a core optimization.
  • Single Assignment C (SaC): A functional array language with C-like syntax. Data is conceptually immutable, and assignments are syntactic sugar for let-expressions. It supports multi-dimensional, rectilinear arrays and is rank-polymorphic. Key constructs include tensor-comprehensions (which are syntactic sugar for with-loop) and reductions. It supports function overloading and automatically transforms AoS to SoA.
  • APL: A dynamically-typed functional language with primitives operating over N-dimensional arrays. It is known for its conciseness, achieved through single-symbol primitives and operators that work on arrays of arbitrary shape and depth (including jagged and nested arrays). Parallelism is largely implicit in these primitives and operators, and it supports nested parallelism.
  • DaCe: A framework rather than a standalone language, often embedded in languages like Python. It converts high-level programs into a graph-based Intermediate Representation (IR) called Stateful Dataflow multiGraphs (SDFG). SDFG is dataflow-based, making data movement explicit. Computations are represented by Tasklets, Maps (for parametric parallelism), Consumes (for streaming), and Library Nodes. It separates the high-level mathematical description from performance optimizations applied via graph transformations.
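The paper's running example, a naive N-body simulation, can be approximated in NumPy as a hypothetical sketch (function and variable names are illustrative, not taken from the paper): the all-pairs gravitational accelerations expressed purely as bulk array operations, the style that all five languages express natively and compile to parallel code.

```python
import numpy as np

def accelerations(pos, mass, eps=1e-9):
    """All-pairs gravitational accelerations as bulk array operations.
    pos: (n, 3) positions; mass: (n,) masses; eps softens self-interaction."""
    # Pairwise displacement vectors: d[i, j] = pos[j] - pos[i], shape (n, n, 3)
    d = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]
    # Softened squared distances and inverse-cube factors, shape (n, n)
    r2 = np.sum(d * d, axis=-1) + eps
    inv_r3 = r2 ** -1.5
    # a[i] = sum_j mass[j] * d[i, j] / r^3  (diagonal vanishes since d[i, i] = 0)
    return np.einsum('ij,j,ijk->ik', inv_r3, mass, d)
```

Like the functional array versions discussed in the paper, this formulation exposes the regular nested parallelism directly, leaving layout and scheduling decisions to the implementation.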

Language Design Comparison (Section 3)

The paper systematically compares the languages based on their type systems, array representations, and parallel computation paradigms:

  • Type Systems: APL is dynamically typed, while the others are statically typed. Accelerate, Futhark, and DaCe (via frontends) support parametric polymorphism. Futhark and SaC support dependent types (specifically, size-dependent types for array shapes).
  • Array Representation: All languages support rectilinear arrays. APL is unique in supporting jagged and heterogeneous arrays. Accelerate, APL, and SaC offer rank polymorphism, allowing functions to operate on arrays of different ranks. Futhark and SaC guarantee SoA representation for performance.
  • Parallel Computation Paradigms: All support bulk data parallelism. Accelerate and Futhark use explicit parallel combinators, while APL and SaC rely more on implicit parallelism from array operations/comprehensions. DaCe allows both explicit SDFG constructs and implicit parallelism from frontends. Futhark, APL, DaCe, and SaC support nested parallelism. Only APL and DaCe support task parallelism. Accelerate, Futhark, and SaC are deterministic in principle, while DaCe can inherit non-determinism from target backends (though SDFG itself is deterministic). Most languages restrict higher-order functions for compilation efficiency.
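To make the Array-of-Structures versus Structure-of-Arrays distinction concrete, here is a small NumPy illustration (my own sketch, not from the paper) contrasting the two layouts. In AoS, reading a single field strides over whole records, while SoA keeps each field densely packed, which is what makes the automatic AoS-to-SoA transformations in Futhark, Accelerate, and SaC worthwhile.

```python
import numpy as np

n = 4
# Array-of-Structures: each record holds (x, y, z, m) contiguously.
aos = np.zeros(n, dtype=[('x', 'f8'), ('y', 'f8'), ('z', 'f8'), ('m', 'f8')])
# Structure-of-Arrays: one densely packed array per field.
soa = {f: np.zeros(n) for f in ('x', 'y', 'z', 'm')}

# In AoS, consecutive 'x' values are 32 bytes apart (4 fields * 8 bytes),
# so a sweep over the x-coordinates touches 4x the memory of the SoA layout.
assert aos['x'].strides == (32,)
assert soa['x'].strides == (8,)
```

The stride difference is exactly why SoA vectorizes and caches better for field-wise array operations.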

Language Implementation Comparison (Section 4)

The implementation approaches vary:

  • Futhark and SaC: Conventional ahead-of-time compilers (written in Haskell and C, respectively). They perform standard and array-specific optimizations (like fusion, flattening/with-loop optimizations). Futhark generates C libraries, while SaC generates standalone C executables.
  • Accelerate: A deep EDSL in Haskell, compiled Just-In-Time (JIT) at runtime. It benefits from the host language's infrastructure but incurs some runtime overhead. It relies on fusion and flattening.
  • APL: Traditionally interpreted (Dyalog) but also has a new GPU compiler (Co-dfns, written in APL). The compiler relies heavily on runtime libraries for performance, calling specialized routines for primitives. Fusion is limited to combinations of known primitives. Dynamic typing complicates static specialization.
  • DaCe: A framework embedded in languages like Python. It translates code to SDFG, on which optimizations are applied via graph transformations, either manually or automatically. It generates code for various backends (C/C++, CUDA/HIP, HLS).

All implementations target multicores and NVIDIA GPUs. APL, DaCe, and SaC also target clusters, and DaCe targets FPGAs. Optimizations vary, with fusion being common across most. AoS/SoA transformation is handled by Futhark, Accelerate, and SaC.

Benchmark Rationale and Methodology (Section 5)

The four benchmarks were chosen because they are challenging and representative of different domains and parallel patterns:

  • N-body: Simple O(N²) all-pairs computation with regular nested parallelism, sensitive to data layout (AoS vs SoA) and cache/shared memory usage.
  • MultiGrid (NAS MG): Stencil computation with recursive V-cycle, involving regular nested parallelism and requiring temporal reuse optimization.
  • Quickhull (PBBS): Computational geometry with irregular nested parallelism (divide-and-conquer), exercising flattening, scan, reduce-by-index, and scatter.
  • Flash Attention: Deep learning kernel, memory-bound, requiring specific tiling/fusion (Online Softmax) to fit intermediate results in fast memory.
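The Online Softmax trick behind Flash Attention can be sketched in a few lines of NumPy (a simplified one-dimensional illustration under my own naming, not the paper's code): the softmax of a row of scores is computed in a single streaming pass over tiles, maintaining a running maximum and a rescaled running sum, so no full score row ever needs to be materialized in slow memory.

```python
import numpy as np

def online_softmax(scores, tile=4):
    """Numerically stable softmax in one streaming pass over tiles,
    keeping only a running max m and a rescaled running sum s."""
    m = -np.inf   # running maximum seen so far
    s = 0.0       # running sum of exp(score - m)
    for start in range(0, len(scores), tile):
        c = scores[start:start + tile]
        m_new = max(m, c.max())
        # Rescale the old sum to the new maximum, then add the new tile.
        s = s * np.exp(m - m_new) + np.exp(c - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s
```

Flash Attention applies the same recurrence per attention row, additionally rescaling a running weighted sum of values, which is the memory optimization the paper reports most functional implementations did not fully achieve.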

The evaluation used a Radboud server with a 32-core AMD EPYC CPU and an NVIDIA A30 GPU. For each benchmark, a single functional array program was written (except for DaCe's N-body and Flash Attention, which had separate CPU/GPU versions), which was then compiled and optimized for both platforms. This contrasts with the baselines, which consist of separate, hand-optimized OpenMP (CPU) and CUDA (GPU) implementations. Performance was measured as total runtime or compute rate, excluding initialization/transfer times for GPUs.

Benchmark Performance and Analysis (Sections 6-9)

  • N-body: Functional languages generally performed well on CPU, with SaC and DaCe competitive with the baseline. On GPU, Futhark and DaCe often outperformed the baseline on larger datasets, attributed to effective exploitation of nested parallelism and tiling. Accelerate and SaC showed lower GPU performance, limited by their backends' maturity for this benchmark.
  • MultiGrid: Functional languages generally lagged the baseline on CPU, with SaC and DaCe showing the best performance relative to the baseline. On GPU, Futhark and DaCe were competitive with or exceeded the baseline on larger datasets by effectively handling the stencil computations and recursion (or its flattened equivalent). Accelerate showed lower GPU performance. SaC's GPU implementation was not completed due to compiler issues.
  • Quickhull: This benchmark highlighted challenges with irregular nested parallelism and missing primitives. DaCe and SaC could not provide parallel implementations due to the lack of primitives like scan or scatter, or support for fork-join style parallelism. Futhark and Accelerate, using manually flattened implementations, achieved strong performance on GPU, outperforming the baseline. On CPU, the baseline (using fork-join) significantly outperformed the functional languages.
  • Flash Attention: This benchmark required a specific algorithmic optimization (the Flash Attention/Online Softmax tiling) to achieve high performance. Most functional language implementations resembled the less efficient "Custom Attention" variant. DaCe's CPU implementation of Flash Attention performed well against the baseline. On GPU, the DaCe and Futhark "Custom Attention" implementations came closest to the baseline but remained significantly slower, as they did not fully implement the Flash Attention algorithm's memory optimization.
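The flattening that Futhark and Accelerate apply to Quickhull reduces irregular divide-and-conquer to flat primitives like scan and scatter. As a minimal illustration (my own sketch, not the paper's code), stream compaction, the simplest such building block, can be expressed with an exclusive prefix sum that assigns each kept element its output slot, followed by a scatter:

```python
import numpy as np

def parallel_filter(xs, keep):
    """Stream compaction via exclusive scan + scatter: the flat-parallel
    building block used when flattening irregular nested parallelism."""
    flags = keep.astype(np.int64)
    # Exclusive prefix sum gives each kept element its output position.
    offsets = np.cumsum(flags) - flags
    out = np.empty(flags.sum(), dtype=xs.dtype)
    out[offsets[keep]] = xs[keep]   # scatter kept elements to their slots
    return out
```

Languages lacking scan or scatter as parallel primitives (the paper names DaCe and SaC for this benchmark) cannot express this pattern in parallel, which is why they provided no parallel Quickhull.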

Benchmarking Summary and Analysis (Section 10)

The paper summarizes the findings:

  • Expressiveness: Functional array languages enable more concise and comprehensible code than the low-level baselines. The total functional code base (683 SLOC) is significantly smaller than the baseline code base (7633 SLOC). The code often stays closer to the mathematical specification.
  • Performance Potential: While individual languages may not perform well on all benchmarks or architectures, the results show the potential. Across all 36 benchmark instances with baselines, at least one functional language matches or outperforms the baseline in 30% of instances and achieves over 80% of the baseline performance in 70% of instances. This suggests that mature functional array languages can be competitive.
  • Implementation Maturity: The variations in performance highlight the maturity of the language implementations. Futhark and DaCe generally demonstrated stronger GPU performance, while SaC and DaCe were stronger on CPU. Differences in supported optimizations (fusion, tiling, locality) and in the ability to handle different parallel patterns (regular/irregular nested parallelism) were the key performance differentiators. APL's conciseness came at the cost of performance, constrained by the limited optimization capabilities of its current implementations.

Conclusion (Section 12)

The paper concludes that functional array languages hold significant promise for high-performance parallel programming. They offer a high-level, architecture-agnostic specification, contributing to correctness and portability. While not every language performs optimally on every benchmark and architecture, the collective results demonstrate that the paradigm has the potential to deliver performance competitive with conventional hand-optimized methods. The primary challenge lies in the maturity of the language implementations and compilers, which need further engineering effort to consistently translate the high-level functional descriptions into efficient code for diverse hardware. The authors anticipate continued development in this area.