Distributing conversion work across heterogeneous systems
Determine optimal strategies for distributing the data conversion tasks induced by memory-centric annotations across heterogeneous computing systems: whether to offload conversions to GPUs or to external compute units, whether to perform conversions en bloc before offloading or to execute them lazily on streaming data, and how to leverage vector co-processors.
It is not clear how the conversion work should be distributed within heterogeneous systems: Should the conversions be deployed to the GPU if the computations run on the accelerator? Could they be delegated to external smart compute units or to the network once the data has been evicted from the local caches? Are the transformations en-bloc operations over all input data that precede the invocation of an offloaded compute kernel, or can they be triggered lazily on a stream while a compute unit already starts to process the data? Can we utilise vector units such as AVX, and so forth?
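To make the design space concrete, the sketch below contrasts two extremes for a GPU target: an en-bloc conversion on the host that precedes the kernel launch, and a lazy conversion that is fused into the compute kernel itself. This is a minimal CUDA sketch under assumptions that are not part of the question above: it posits a double-to-float narrowing as the conversion demanded by a memory-centric annotation, and the kernel names scale_converted and scale_lazy are invented purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Strategy A: the input has already been narrowed en bloc on the host;
// the kernel only sees converted data.
__global__ void scale_converted(const float* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i];
}

// Strategy B: the raw doubles are shipped as-is and the conversion is
// triggered lazily, fused into the compute while the accelerator streams
// through the data.
__global__ void scale_lazy(const double* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * static_cast<float>(x[i]);
}

int main() {
  const int n = 1 << 20;
  std::vector<double> host_x(n, 1.0);
  std::vector<float>  host_y(n, 0.0f);

  float* dev_y = nullptr;
  cudaMalloc(&dev_y, n * sizeof(float));

  // Strategy A: en-bloc conversion before the offload. This host-side loop
  // is the natural candidate for vectorisation with AVX-style vector units.
  std::vector<float> host_x_f(n);
  for (int i = 0; i < n; ++i) host_x_f[i] = static_cast<float>(host_x[i]);
  float* dev_x_f = nullptr;
  cudaMalloc(&dev_x_f, n * sizeof(float));
  cudaMemcpy(dev_x_f, host_x_f.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  scale_converted<<<(n + 255) / 256, 256>>>(dev_x_f, dev_y, 2.0f, n);

  // Strategy B: lazy conversion on the accelerator; twice the transfer
  // volume, but no host-side conversion pass.
  double* dev_x_d = nullptr;
  cudaMalloc(&dev_x_d, n * sizeof(double));
  cudaMemcpy(dev_x_d, host_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
  scale_lazy<<<(n + 255) / 256, 256>>>(dev_x_d, dev_y, 2.0f, n);

  cudaMemcpy(host_y.data(), dev_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  printf("y[0] = %f\n", host_y[0]);

  cudaFree(dev_x_f);
  cudaFree(dev_x_d);
  cudaFree(dev_y);
  return 0;
}
```

The en-bloc variant halves the volume moved over the interconnect but serialises conversion and offload and occupies the host, whereas the lazy variant ships more bytes yet lets the accelerator start immediately; choosing between these placements, and deciding where a host-side conversion loop should be vectorised, is exactly the trade-off the question above raises.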