AoS-to-SoA Transformations
- AoS-to-SoA transformations are a data reorganization technique that converts structures with grouped attributes into contiguous arrays for each field.
- They leverage compiler-driven annotations and automated buffer management to expose stride-one memory access patterns and enhance SIMD/SIMT acceleration.
- This approach yields significant performance gains in HPC codes by improving vectorization and memory bandwidth utilization, and by enabling reduced-precision storage where per-kernel accuracy budgets allow.
An Array-of-Structures to Structure-of-Arrays (AoS-to-SoA) transformation is a systematic reorganization of data in memory from the Array-of-Structures layout, where all attributes of a simulation entity (e.g., a particle) are packed together, to the Structure-of-Arrays layout, in which each attribute is stored contiguously across all entities. This transformation is central to exposing memory access patterns that exploit wide SIMD/SIMT units on modern CPUs and GPUs, facilitating vectorization, memory bandwidth utilization, and kernel offloading in high-performance computing (HPC) codes. The challenging orchestration of these transformations—especially in the context of heterogeneous hardware and with domain codes originally designed for AoS—has motivated several recent lines of work on compiler support, annotation-driven strategies, and integration with reduced-precision numerics (Radtke et al., 5 Dec 2025, Radtke et al., 23 Feb 2025, Radtke et al., 21 May 2024, Homann et al., 2017).
1. AoS and SoA Memory Layouts: Definitions and Addressing
In AoS, a struct type (e.g., `struct Particle { float x[3], v[3]; float u, m, h, rho, P, cs, du, dt; ... };`) is instantiated $N$ times as an array. The $i$-th particle's $f$-th field is stored at

$$\mathrm{addr}(i,f) = B + i \cdot \mathrm{sizeof}(\mathtt{Particle}) + \delta_f,$$

where $B$ is the array's base address and $\delta_f$ is the compile-time offset of field $f$ inside the struct. This results in local data for each entity being contiguous, but the same attribute for all entities is strided, hindering vectorization.
In SoA, each field $f$ of type $T_f$ becomes its own array $A_f$, and attribute $f$ of entity $i$ is at

$$\mathrm{addr}(i,f) = B_f + i \cdot \mathrm{sizeof}(T_f),$$

which provides stride-one access to a given attribute across all entities, ideal for vector instructions.
Table: Memory layout formulas
| Layout | Address formula | Access pattern |
|---|---|---|
| AoS | $B + i \cdot \mathrm{sizeof}(\mathtt{Particle}) + \delta_f$ | Strided per field |
| SoA | $B_f + i \cdot \mathrm{sizeof}(T_f)$ | Contiguous per field |

where $B$ and $B_f$ are base addresses, $\delta_f$ is the offset of field $f$ within the struct, and $T_f$ is the type of the $f$-th field (Radtke et al., 5 Dec 2025, Homann et al., 2017).
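To make the two layouts concrete, the following minimal C++ sketch contrasts them for a cut-down particle type (the two-field `ParticleAoS`/`ParticlesSoA` types are illustrative, not taken from the papers):

```cpp
#include <cstddef>
#include <vector>

// AoS: one struct per particle; a particle's fields are adjacent, but a
// single field across particles is strided by sizeof(ParticleAoS).
struct ParticleAoS { float x; float m; };

// SoA: one contiguous array per field; field f of entity i lives at
// base_f + i * sizeof(float), i.e., stride-one per attribute.
struct ParticlesSoA {
  std::vector<float> x;
  std::vector<float> m;
};

int main() {
  const std::size_t N = 1024;
  std::vector<ParticleAoS> aos(N);
  ParticlesSoA soa{std::vector<float>(N), std::vector<float>(N)};

  // AoS -> SoA gather: the core data movement of the transformation.
  for (std::size_t i = 0; i < N; ++i) {
    soa.x[i] = aos[i].x;
    soa.m[i] = aos[i].m;
  }

  // Stride-one update on a single attribute, friendly to the vectorizer.
  for (std::size_t i = 0; i < N; ++i)
    soa.x[i] += 0.5f * soa.m[i];
}
```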
2. Compiler-Driven Transformation and Annotation Mechanisms
To facilitate AoS-to-SoA transitions without requiring global data structure rewrites, recent compiler extensions leverage C++11-style attributes ([[]]-annotations) that the compiler recognizes and expands into prologue/epilogue packing code, temporary buffer management, and loop body rewriting (Radtke et al., 5 Dec 2025, Radtke et al., 23 Feb 2025, Radtke et al., 21 May 2024). Key attributes include:
- `[[clang::soa_conversion_compute_offload]]`: triggers both data layout conversion and loop offloading (e.g., to a GPU).
- `[[clang::soa_conversion_handler(host|device)]]`: specifies the site (CPU or GPU) of the packing, compression, or decompression.
- `[[clang::soa_conversion_hoist(level)]]`: hoists conversion code outside nested loops to buffer across multiple operations.
- `[[clang::soa_conversion_inputs(...)]]`, `[[clang::soa_conversion_outputs(...)]]`, `[[clang::soa_conversion_target(...)]]`: allow semi-manual, per-loop conversion targeting precise field subsets (Radtke et al., 21 May 2024).
For example, annotating a kernel as

```cpp
[[clang::soa_conversion_compute_offload]]
[[clang::soa_conversion_handler(host)]]
for (Particle &p : particles)
  Kernel(p);
```

directs the compiler to convert the accessed fields to SoA on the host and offload the rewritten loop.
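Conceptually, the compiler expands such an annotated loop into a gather prologue, an SoA loop body, and a scatter epilogue. The following hand-written sketch shows the shape of such generated code (not actual compiler output; the two-field `Kernel` is hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct Particle { float x, v, rho; /* further fields elided */ };

// Hypothetical kernel reading v and updating x.
inline void Kernel(float& x, float v) { x += 0.1f * v; }

void run(std::vector<Particle>& particles) {
  const std::size_t N = particles.size();

  // Prologue: gather only the fields the kernel touches into temporary
  // SoA buffers (the narrowing step).
  std::vector<float> x(N), v(N);
  for (std::size_t i = 0; i < N; ++i) {
    x[i] = particles[i].x;
    v[i] = particles[i].v;
  }

  // Rewritten loop body: stride-one accesses, trivially vectorizable or
  // offloadable once the buffers are transferred to a device.
  for (std::size_t i = 0; i < N; ++i)
    Kernel(x[i], v[i]);

  // Epilogue: scatter only the written field back into AoS storage.
  for (std::size_t i = 0; i < N; ++i)
    particles[i].x = x[i];
}
```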
3. Transformation Formalism and Pipeline Steps
AoS-to-SoA transformations in these frameworks are modeled as compositions of the following operators (written here as $\mathrm{proj}$, $\mathrm{soa}$, $\mathrm{unpack}$/$\mathrm{pack}$, and $\mathrm{h2d}$/$\mathrm{d2h}$):
- $\mathrm{proj}$ / $\mathrm{proj}^{-1}$: narrow/widen to only the fields accessed/written by a kernel.
- $\mathrm{soa}$ / $\mathrm{soa}^{-1}$: out-of-place AoS-to-SoA rearrangement and its transpose.
- $\mathrm{unpack}$ / $\mathrm{pack}$: unpack/repack floating-point data for reduced-precision representations.
- $\mathrm{h2d}$ / $\mathrm{d2h}$: transfer between host and accelerator.
Sequence examples:
- CPU-side unpack and SoA transformation: $\mathrm{h2d} \circ \mathrm{soa} \circ \mathrm{unpack} \circ \mathrm{proj}$
- GPU-side conversion: $\mathrm{soa} \circ \mathrm{unpack} \circ \mathrm{h2d} \circ \mathrm{proj}$
Here, $\circ$ denotes function composition, applied right to left across the pipeline stages (Radtke et al., 5 Dec 2025).
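As a toy realization of this formalism, the two pipelines can be written directly as nested calls over opaque byte buffers (all names and the identity stub bodies below are illustrative, not the papers' API):

```cpp
#include <cstdint>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// Identity stubs standing in for the real pipeline stages, so the
// composition compiles and runs end to end.
Buffer proj(Buffer b)   { return b; } // narrow to the kernel's field subset
Buffer unpack(Buffer b) { return b; } // widen reduced-precision payloads
Buffer soa(Buffer b)    { return b; } // out-of-place AoS -> SoA transpose
Buffer h2d(Buffer b)    { return b; } // host -> accelerator transfer

// CPU-side conversion: convert on the host, ship ready-to-use SoA data.
Buffer cpu_side(Buffer aos) { return h2d(soa(unpack(proj(aos)))); }

// GPU-side conversion: ship the narrowed AoS data, convert on the device.
Buffer gpu_side(Buffer aos) { return soa(unpack(h2d(proj(aos)))); }

int main() {
  Buffer aos(64);
  return cpu_side(aos).size() == gpu_side(aos).size() ? 0 : 1;
}
```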
4. Reduced Precision Integration and Bitfield Packing
By permitting reduced mantissa bit-widths for floating-point members (e.g., via `[[clang::truncate_mantissa(N)]]`), these systems can pack multiple compressed floats into bitstreams, further lowering memory footprints. The $\mathrm{unpack}$ and $\mathrm{pack}$ steps perform the appropriate bit masking and shifting.
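As a sketch of the masking involved, truncating a `float`'s 23-bit mantissa to `keep` bits amounts to zeroing the discarded low bits; a dense bitstream packer would additionally splice the surviving sign/exponent/mantissa bits together (`truncate_mantissa` below is an illustration under these assumptions, not the attribute's actual implementation):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Zero the low (23 - keep) mantissa bits of an IEEE-754 float.
float truncate_mantissa(float x, int keep) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);   // type-pun via memcpy (no UB)
  const std::uint32_t drop = 23u - static_cast<std::uint32_t>(keep);
  bits &= ~((1u << drop) - 1u);          // mask off the discarded bits
  std::memcpy(&x, &bits, sizeof x);
  return x;
}

int main() {
  // 10 mantissa bits corresponds to half-precision accuracy (see table below).
  std::printf("%.9g -> %.9g\n", 3.14159274f, truncate_mantissa(3.14159274f, 10));
}
```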
Mapping of bit widths to IEEE formats:
| Total bits | Sign | Exponent | Mantissa | Format |
|---|---|---|---|---|
| 64 | 1 | 11 | 52 | double |
| 32 | 1 | 8 | 23 | float |
| 16 | 1 | 5 | 10 | half |
Empirically, for the force kernel, the relative RMSE against a 64-bit reference remains at machine precision for mantissa widths of 50 bits or more; below 32 bits the error grows rapidly, and symmetry breaks in long-running tests with excessive truncation. The chosen bit width should therefore reflect each kernel's error-budget requirements (Radtke et al., 5 Dec 2025).
5. Performance Characteristics and Hardware Observations
Benchmark campaigns covering SPH kernels on recent CPUs and GPUs highlight the variable payoff of AoS-to-SoA conversions:
- On NVIDIA GH200 platforms, GPU-side SoA conversions (all fields, 32-bit) run up to 35× faster than their CPU-side equivalents; 16-bit variants reach up to 500×.
- For compute-bound kernels (“force”), SoA yields speedups of 2–3×, while memory-bound “kick”/“drift” kernels can reach 8× if data layout is fully adapted.
- AMD platforms are less sensitive, often showing muted gains (~1.4×) and, at times, a 10–50% penalty for reduced precision on quadratic kernels.
- In-place data orchestration (keeping the packed buffer on the device for several kernels) amortizes packing overhead and yields higher speedups than per-kernel streaming.
Kernel speedup examples (GH200):
| Kernel | Memory characteristic | Max speedup |
|---|---|---|
| Kick/Drift | memory-bound | up to 8× |
| Force | compute-bound | 2–3× |
| Density | mixed | 4× (NVIDIA), 1.5× (AMD) |
(Radtke et al., 5 Dec 2025, Radtke et al., 21 May 2024, Homann et al., 2017)
6. Best Practices and Limitations
Recommended practices for leveraging compiler-driven AoS-to-SoA transforms include:
- Only annotate loops/kernels that are streaming-intensive or where the field set is amenable to SoA vectorization (e.g., dense/SPH pairwise kernels).
- Prefer in-place device-side conversion when using many fields on hardware with fast accelerator-host interconnects; use host-side streaming when kernels access few fields over slow PCIe links.
- Streaming conversions benefit only isolated kernels; in-place conversion suits pipelines with multiple dependent kernels (see the hoisting sketch after this list).
- Reduced-precision is effective for memory-bound, bandwidth-limited kernels on NVIDIA GPUs, but may degrade accuracy or performance elsewhere.
- Manual performance tuning remains necessary for best results, particularly when combining offloaded kernels or optimizing GPU occupancy.
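For the multi-kernel, in-place case, a plausible use of the hoisting attribute is sketched below (the attribute placement and level semantics are inferred from the description in Section 2; the time loop and `KickKernel` are illustrative):

```cpp
// Hoist the conversion one loop level up so the packed SoA buffers stay
// resident on the device across all time steps, amortizing packing cost.
for (int step = 0; step < nSteps; ++step) {
  [[clang::soa_conversion_compute_offload]]
  [[clang::soa_conversion_handler(device)]]
  [[clang::soa_conversion_hoist(1)]]
  for (Particle &p : particles)
    KickKernel(p);
}
```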
Limitations:
- All gather/scatter operations impose cost; for large field sets or small entity counts $N$, this overhead can dominate runtime.
- Requires full visibility of kernel code for field-use analysis (whole-TU or LTO).
- Nested conversions with field aliasing are not universally supported.
- On GPUs, explicit block/occupancy tuning and kernel fusion remain essential (Radtke et al., 23 Feb 2025, Radtke et al., 5 Dec 2025, Radtke et al., 21 May 2024).
7. Generalization, Research Directions, and Broader Impact
The annotation- and compiler-driven AoS-to-SoA approach fundamentally separates domain logic (natural AoS, object-style code) from performance engineering, allowing performance specialists to experiment with data layout, field subsets, and precision on a per-kernel basis without invasive source changes (Radtke et al., 5 Dec 2025, Radtke et al., 21 May 2024).
Key implications and future directions:
- These methods are domain-agnostic—potential targets include any code with tight, strided, SIMD/SIMT-sensitive kernels, not just SPH or particle problems.
- Automatic, possibly runtime-adaptive, selection of fields, precision, and conversion location (host or device) remains a promising direction.
- Integration with other memory layout transformations (e.g., alignment, persistent buffer pools), and tighter coupling with accelerator-specific hardware (e.g., shared memory for SoA) could further enhance throughput.
- A plausible implication is that as hardware and programming models evolve, tight collaboration between language/compiler infrastructure and application developers will become essential to sustain high performance (Radtke et al., 5 Dec 2025, Radtke et al., 23 Feb 2025).
In sum, AoS-to-SoA transformations—particularly those orchestrated by compiler-driven annotation—enable substantial kernel and pipeline performance improvements (2–8× kernel speedups, up to 500× packing speedups), with the flexibility to balance performance, accuracy, and productivity on heterogeneous architectures (Radtke et al., 5 Dec 2025, Radtke et al., 23 Feb 2025, Radtke et al., 21 May 2024, Homann et al., 2017).