AcceleratedKernels.jl: Cross-Architecture Parallel Algorithms from a Unified, Transpiled Codebase (2507.16710v1)
Abstract: AcceleratedKernels.jl is introduced as a backend-agnostic library for parallel computing in Julia, natively targeting NVIDIA, AMD, Intel, and Apple accelerators via a unique transpilation architecture. Written in a unified, compact codebase, it enables productive parallel programming with minimised implementation and usage complexities. Benchmarks of arithmetic-heavy kernels show performance on par with C and OpenMP-multithreaded CPU implementations, with Julia sometimes offering more consistent and predictable numerical performance than conventional C compilers. Exceptional composability is highlighted: simultaneous CPU-GPU co-processing, such as CPU-GPU co-sorting, is achievable with transparent use of hardware-specialised MPI implementations. Tests on the Baskerville Tier 2 UK HPC cluster achieved world-class sorting throughputs of 538-855 GB/s using 200 NVIDIA A100 GPUs, comparable to the highest literature-reported figure of 900 GB/s achieved on 262,144 CPU cores. The use of direct NVLink GPU-to-GPU interconnects resulted in a 4.93x speedup on average; when normalised by combined capital, running, and environmental costs, communication-heavy HPC tasks only become economically viable on GPUs if GPUDirect interconnects are employed.
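To give a flavour of the unified-codebase idea, below is a minimal, hedged sketch of what backend-agnostic calls look like in practice. The function names (AK.foreachindex, AK.sort!, AK.reduce) and the init keyword follow the library's public documentation rather than the paper itself, so exact signatures should be checked against the current release.

```julia
# Minimal sketch: the same code runs on CPU arrays or, by swapping the array
# type (e.g. CUDA.CuArray, AMDGPU.ROCArray, oneAPI.oneArray, Metal.MtlArray),
# on the corresponding accelerator. Names below follow the library's README
# and are assumptions, not quotations from the paper.

import AcceleratedKernels as AK

x = rand(Float32, 1_000_000)   # plain CPU array; replace with a GPU array to offload
y = similar(x)

# Element-wise kernel written once; dispatched to the array's backend.
AK.foreachindex(x) do i
    y[i] = 2f0 * x[i] + 1f0
end

# Parallel sort and reduction from the same unified codebase.
AK.sort!(y)
total = AK.reduce(+, y; init = 0f0)

println(total)
```

Under this assumption, moving the computation onto an accelerator requires only changing the array type, which is the composability the abstract exploits for simultaneous CPU-GPU co-processing such as co-sorting over MPI.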