Distributing conversion work across heterogeneous systems
Determine optimal strategies for distributing the data conversion tasks induced by memory-centric annotations across heterogeneous computing systems: whether to offload conversions to GPUs or to external compute units, whether to perform conversions en bloc before offloading or to execute them lazily on streaming data, and how to leverage vector co-processors.
It is not clear how the conversion work should be distributed within heterogeneous systems: Should the conversions be deployed to the GPU if the computations run on the accelerator? Could they be delegated to external smart compute units or to the network once the data has been evicted from the local caches? Are the transformations en-bloc operations over all input data that precede the invocation of an offloaded compute kernel, or can they be triggered lazily on a stream while a compute unit already starts to process the data? Can we utilise vector units such as AVX, and so forth?
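To make the design space concrete, the sketch below contrasts two extremes for a GPU target: an en-bloc conversion on the host that precedes the kernel launch, and a lazy conversion that is fused into the compute kernel itself. This is a minimal CUDA sketch under assumptions that are not part of the question above: it posits a double-to-float narrowing as the conversion demanded by a memory-centric annotation, and the kernel names scale_converted and scale_lazy are invented purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Strategy A: the input has already been narrowed en bloc on the host;
// the kernel only sees converted data.
__global__ void scale_converted(const float* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i];
}

// Strategy B: the raw doubles are shipped as-is and the conversion is
// triggered lazily, fused into the compute while the accelerator streams
// through the data.
__global__ void scale_lazy(const double* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * static_cast<float>(x[i]);
}

int main() {
  const int n = 1 << 20;
  std::vector<double> host_x(n, 1.0);
  std::vector<float>  host_y(n, 0.0f);

  float* dev_y = nullptr;
  cudaMalloc(&dev_y, n * sizeof(float));

  // Strategy A: en-bloc conversion before the offload. This host-side loop
  // is the natural candidate for vectorisation with AVX-style vector units.
  std::vector<float> host_x_f(n);
  for (int i = 0; i < n; ++i) host_x_f[i] = static_cast<float>(host_x[i]);
  float* dev_x_f = nullptr;
  cudaMalloc(&dev_x_f, n * sizeof(float));
  cudaMemcpy(dev_x_f, host_x_f.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  scale_converted<<<(n + 255) / 256, 256>>>(dev_x_f, dev_y, 2.0f, n);

  // Strategy B: lazy conversion on the accelerator; twice the transfer
  // volume, but no host-side conversion pass.
  double* dev_x_d = nullptr;
  cudaMalloc(&dev_x_d, n * sizeof(double));
  cudaMemcpy(dev_x_d, host_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
  scale_lazy<<<(n + 255) / 256, 256>>>(dev_x_d, dev_y, 2.0f, n);

  cudaMemcpy(host_y.data(), dev_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  printf("y[0] = %f\n", host_y[0]);

  cudaFree(dev_x_f);
  cudaFree(dev_x_d);
  cudaFree(dev_y);
  return 0;
}
```

The en-bloc variant halves the volume moved over the interconnect but serialises conversion and offload and occupies the host, whereas the lazy variant ships more bytes yet lets the accelerator start immediately; choosing between these placements, and deciding where a host-side conversion loop should be vectorised, is exactly the trade-off the question above raises.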