
Adaptive Compute Acceleration Platform (ACAP)

Updated 19 March 2026
  • ACAP is a heterogeneous system-on-chip architecture that combines vectorized AI Engines, reconfigurable programmable logic, and ARM-based processing systems.
  • It partitions data-intensive workloads such as GEMM, CNNs, and transformers into multi-level tiles to maximize compute occupancy and energy efficiency.
  • Automated ML-guided design exploration and optimized buffer management boost overall performance and facilitate hardware/software co-design.

The Adaptive Compute Acceleration Platform (ACAP) is a heterogeneous system-on-chip architecture introduced by AMD/Xilinx, targeting energy- and performance-critical workloads in domains such as deep learning, scientific computing, signal processing, and high-performance data analytics. ACAP tightly integrates vectorized AI Engines (AIEs), reconfigurable programmable logic (PL), and a high-bandwidth network-on-chip (NoC) with embedded ARM-based processing systems (PS), providing multi-level memory hierarchies, rich on-chip communication, and hardware/software co-design opportunities. The design and optimization of computation on Versal ACAP devices require judicious mapping of workload tiles, buffer hierarchies, and dataflows to maximize compute utilization and energy efficiency while managing resource and bandwidth constraints.

1. Architecture and System Organization

ACAP devices combine three tightly integrated hardware domains:

  • AI Engines (AIEs): Arrays of VLIW+SIMD cores, each equipped with 32 KB local scratchpad. For example, VCK190 integrates a 50×8 mesh (400 AIEs), with each core executing compute-intensive kernels (e.g., fixed-size GEMMs) at frequencies up to 1.25–1.33 GHz. The mesh provides multi-terabit/s neighbor DMA and inter-core streaming bandwidth, connected via a high-speed NoC (Papalamprou et al., 10 Nov 2025, Taka et al., 2024).
  • Programmable Logic (PL): FPGA fabric with configurable LUT/FF/DSP resources, and block/ultra RAM (BRAM/URAM). PL implements custom datapaths, DMA engines for off-chip DDR/PL–AIE transfer, and on-chip multi-level tiling buffers (Papalamprou et al., 10 Nov 2025, Lei et al., 2024, Li et al., 13 Jun 2025). PL serves as a critical bridge for data movement, pipeline orchestration, and non-AIE compute.
  • Processing System (PS): Embedded ARM (typically Cortex-A72) CPUs running Linux, acting as the host controller—configuring kernels, orchestrating NoC, managing reconfiguration, and scheduling hardware accelerators (Papalamprou et al., 10 Nov 2025, Zhang et al., 2024).

The memory subsystem comprises external DDR4 DRAM, on-chip PL RAM (20+ MB of BRAM/URAM), per-tile AIE scratchpads (32 KB each, 12.8 MB aggregate on the VC1902), and a hierarchical routing fabric for tile-to-tile and PL–AIE communication. Off-chip and PL–AIE interfaces reach tens of gigabytes per second, while aggregate on-chip AIE memory bandwidth exceeds 15 TB/s (Taka et al., 2024).

Key architectural parameters (VCK5000/VC1902):

| Subsystem | Resource | Value |
|---|---|---|
| AI Engines | Tiles (AIEs) | 400 |
| AI Engines | Local scratchpad per tile | 32 KB |
| AI Engines | Peak INT8 / FP32 throughput | 145 TOPS / 6.4 TFLOPS |
| Programmable Logic | LUT, FF, DSP, BRAM/URAM | 600K LUT, 290K FF, 400 URAM+BRAM |
| Network-on-Chip | PL–AIE bandwidth / NoC aggregate | 1–1.3 TB/s / 23.5 TB/s |
| DDR4 | Off-chip bandwidth | 102.4 GB/s |
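
The headline numbers above imply a steep roofline: without on-chip reuse, most GEMMs are DDR-bound. The sketch below combines the table's figures into a simple roofline check; the device constants come from the table, while the no-reuse GEMM traffic model is an illustrative assumption.

```python
# Roofline-style sanity check built from the VCK5000/VC1902 numbers in the
# table above. The GEMM traffic model (no on-chip reuse) is an assumption.

PEAK_INT8_TOPS = 145.0       # peak INT8 throughput (table)
DDR_BW_GBS = 102.4           # off-chip DDR4 bandwidth (table)

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=1):
    """Ops per byte of off-chip traffic for C = A x B, assuming each
    operand and the result cross DDR exactly once (no on-chip reuse)."""
    ops = 2 * m * n * k                                # one multiply + one add per MAC
    traffic = (m * k + k * n + m * n) * bytes_per_elem
    return ops / traffic

def attainable_tops(ai):
    """Roofline: min(compute roof, bandwidth roof), in TOPS."""
    return min(PEAK_INT8_TOPS, DDR_BW_GBS * ai / 1000.0)

# Arithmetic intensity needed before DDR bandwidth stops being the limit:
ridge = PEAK_INT8_TOPS * 1000.0 / DDR_BW_GBS           # ~1416 ops/byte
```

Under this model even a 1024x1024x1024 INT8 GEMM reaches only ~683 ops/byte, well below the ~1416 ops/byte ridge point, which is why the multi-level tiling and on-chip buffering described throughout this article are essential to approach peak.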

2. Computational Mapping and Design Methodologies

Workload acceleration on ACAP leverages domain-specific frameworks (surveyed in Section 4) to partition computation across the AIE array and PL, exploit data locality through multi-level tiling, and maximize hardware utilization.
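
As a concrete picture of multi-level tiling, the sketch below partitions a GEMM across the memory levels named in Section 1: outer tiles model DDR-to-PL-buffer transfers, inner tiles model PL-to-AIE-scratchpad transfers feeding fixed-size AIE kernels. The tile sizes are hypothetical, not device-tuned values from any of the cited frameworks.

```python
import numpy as np

def tiled_gemm(A, B, TL1=64, TL2=16):
    """Two-level tiled GEMM. The outer TL1 loops model DDR -> PL buffer
    movement; the inner TL2 loops model PL -> AIE scratchpad movement,
    with each innermost block standing in for one fixed-size AIE kernel
    call. TL1/TL2 are illustrative, not device-tuned."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, TL1):                  # PL-level tiles
        for j0 in range(0, N, TL1):
            for k0 in range(0, K, TL1):
                for i1 in range(i0, min(i0 + TL1, M), TL2):   # AIE-level tiles
                    for j1 in range(j0, min(j0 + TL1, N), TL2):
                        for k1 in range(k0, min(k0 + TL1, K), TL2):
                            ii = slice(i1, min(i1 + TL2, M))
                            jj = slice(j1, min(j1 + TL2, N))
                            kk = slice(k1, min(k1 + TL2, K))
                            # innermost block = one AIE kernel invocation
                            C[ii, jj] += A[ii, kk] @ B[kk, jj]
    return C
```

In a real mapping, the outer tile footprint is bounded by PL BRAM/URAM capacity and the inner tile by the 32 KB AIE scratchpad, with double buffering overlapping transfers and compute.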

3. Performance, Energy Efficiency, and Scaling

Performance and energy efficiency on ACAP are dictated by sustaining high compute utilization, minimizing idle/stall cycles, and carefully managing memory movement:

  • Peak Throughput and Utilization: Single-board (VCK5000/VC1902) GEMM acceleration achieves up to 77 TOPS (int8), corresponding to 60% of theoretical peak, and matrix multiply throughput up to 4.15 TOPS (float) utilizing ≥95% of AIE cores (Taka et al., 2024, Dai et al., 2024).
  • Energy Efficiency: INT8 GEMM yields 0.94 TOPS/W at 82 W board-level power, with ~100% AIE and 88% PL RAM utilization (Taka et al., 2024). Across neural networks (CNN/Transformer), energy-efficiency gains of 8.6× over traditional FPGA DPUs and up to 7.8× over GPU baselines are observed (Li et al., 13 Jun 2025, Zhang et al., 2024).
  • Workload-Aware Trade-Offs: For low- and mid-size GEMMs (low-MAC demand), maximizing throughput can penalize energy efficiency significantly (up to 22% drop), motivating tiling and partitioning schemes that adapt hardware allocation to workload density (Papalamprou et al., 10 Nov 2025). High-FLOP GEMMs see convergence of throughput- and energy-optimal configurations; all AIEs are used (Papalamprou et al., 10 Nov 2025).
  • Communication-Avoiding Optimizations: Decoupling compute and data movement phases (as in EA4RCA) enables burst-mode AIE DMA, filling scratchpads while allowing uninterrupted SIMD computation, translating into up to 22.19× throughput improvement and 7.0× energy efficiency gain over previous SOTA (Zhang et al., 2024).
  • Scalability Bottlenecks: Performance scales quasi-linearly with AIE count up to ~200 AIEs, after which memory bandwidth (PLIO interfaces, buffer sizes) limits further scaling (Dai et al., 2024, Lei et al., 2024). Network-on-chip link allocation, PLIO placement, and DMA resource balancing remain critical for maintaining efficient multi-AIE scaling.
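
The scaling behaviour described above, quasi-linear in AIE count until interface bandwidth dominates, can be sketched as a two-roof model. Both constants are illustrative: the per-AIE rate is chosen to reproduce the 145 TOPS peak at 400 AIEs, and the cap echoes the 77 TOPS sustained figure quoted above rather than a measured interface limit.

```python
def sustained_tops(n_aies, tops_per_aie=0.3625, interface_cap_tops=77.0):
    """Two-roof scaling model: throughput grows linearly with AIE count
    until an interface/bandwidth cap dominates. 0.3625 TOPS/AIE gives
    145 TOPS at 400 AIEs; the 77 TOPS cap is taken from the sustained
    GEMM figure above. Both are illustrative, not measured limits."""
    return min(n_aies * tops_per_aie, interface_cap_tops)
```

With these constants the knee falls near 212 AIEs, consistent with the ~200-AIE saturation point reported above: below it, adding AIEs helps linearly; above it, only more interface bandwidth or larger buffers do.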

4. Frameworks and Application Domains

The ACAP architectural paradigm is leveraged for diverse classes of workloads and supported by a set of leading mapping and code-generation frameworks:

  • Dense Computation (GEMM, CNNs, Transformers): ML/DNN workloads benefit from frameworks such as CHARM (heterogeneous MM composition) (Zhuang et al., 2023), CAT (Accelerator for Transformers) (Zhang et al., 2024), and DPUV4E (CNN throughput optimizer) (Li et al., 13 Jun 2025). These systems target high-level throughput and energy goals by balancing AIE array usage, PL buffering, and memory scheduling.
  • Sparse and Structured Algorithms (Graph and FFT): H-GCN partitions sparse graph convolutions by density to AIE or PL, achieving up to 3.5 GFLOPS/AIE for dense tiles and 1.6–3.5 GFLOPS/AIE for sparse tiles (Zhang et al., 2022). FFT and structured CA computations exploit regular communication patterns and on-tile buffering as in EA4RCA (Zhang et al., 2024).
  • Arbitrary-Precision and Irregular Workloads: AIM maps variable-precision multiplications onto AIEs, balancing PL and AIE resources for cryptographic and large-integer workloads. Energy efficiency improvements reach 12.6× over CPUs and 2.1× over GPUs (Yang et al., 2023).
  • HPC and Uniform Recurrences: WideSA introduces polyhedral-based space–time transformations and routing-aware PLIO assignment for uniform recurrences on large AIE meshes, reaching ≥95% utilization and boosting energy efficiency over PL-only approaches (Dai et al., 2024).
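
The density-driven split used by H-GCN-style mappings can be illustrated with a simple tile classifier: dense tiles go to the AIE array, sparse ones to PL datapaths. The 8x8 tile size and 25% density threshold below are assumptions for illustration, not H-GCN's actual parameters.

```python
import numpy as np

def partition_tiles_by_density(adj, tile=8, threshold=0.25):
    """Classify square tiles of a sparse adjacency matrix by density.
    Dense tiles are routed to the AIE array, sparse ones to PL.
    Tile size and threshold are illustrative assumptions."""
    n = adj.shape[0]
    aie_tiles, pl_tiles = [], []
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            block = adj[i:i + tile, j:j + tile]
            density = np.count_nonzero(block) / block.size
            (aie_tiles if density >= threshold else pl_tiles).append((i, j))
    return aie_tiles, pl_tiles
```

The appeal of this split is that dense blocks amortize the AIEs' SIMD width, while irregular sparse blocks avoid wasting vector lanes by running on custom PL datapaths.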

5. Resource Constraints, Bottlenecks, and Design Guidelines

  • Resource Allocation: On-chip RAM (BRAM/URAM) must be precisely partitioned between buffering, instruction, and data scratchpads. Bitstream and hardware resource budgets guide the granularity of tiling and workload parallelism (Papalamprou et al., 10 Nov 2025, Taka et al., 2024, Li et al., 13 Jun 2025).
  • Bandwidth Management: PL↔AIE interface is a recurrent bottleneck (e.g., 78 PLIOs at 128 bits per cycle). Optimal mapping considers the matching of interface width, buffer size, and AIE compute tile size to saturate available BW (Taka et al., 2024, Brown, 2022). Under-provisioning stalls AIE arrays; over-provisioning exhausts PL memory.
  • Programmability Considerations: Vitis HLS and AIE SDK require explicit pipelining, buffer binding, and resource annotation to avoid overuse (e.g., HLS AUTO mapping can overrun URAM by >100%) (Taka et al., 2024). Routing heuristics for PLIO/NoC placement impact legal compilation and latency hiding (Dai et al., 2024).
  • Host Orchestration and Runtime: The PS must coordinate DMA scheduling, kernel launch, and bitstream reconfiguration. For multi-accelerator designs, a dependency-aware runtime (as in CHARM) balances idle states and job dispatch (Zhuang et al., 2023).
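
The budgeting concerns above can be made concrete with a small feasibility check. The 78 PLIOs, 32 KB scratchpad, and 400-AIE ceiling come from this article; the rules that A, B, and C tiles must double-buffer inside one scratchpad and that each AIE needs its own PLIO stream are simplifying assumptions for illustration.

```python
def check_design(tile_m, tile_n, tile_k, n_aies, bytes_per_elem=1,
                 plio_count=78, scratchpad_bytes=32 * 1024, max_aies=400):
    """Flag obvious over-provisioning for a double-buffered GEMM tile.
    Budget constants are from the article; the footprint and PLIO
    feeding rules are simplifying assumptions."""
    issues = []
    tile_bytes = (tile_m * tile_k + tile_k * tile_n
                  + tile_m * tile_n) * bytes_per_elem
    if 2 * tile_bytes > scratchpad_bytes:    # double buffering doubles the footprint
        issues.append("tile footprint exceeds the 32 KB AIE scratchpad")
    if n_aies > max_aies:
        issues.append("more AIEs requested than the 400 available")
    if n_aies > plio_count:
        issues.append("not enough PLIOs to feed every AIE with its own stream")
    return issues
```

A check like this catches both failure modes named above before compilation: under-provisioned interfaces that would stall the AIE array, and over-provisioned buffers that would exhaust PL memory.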

6. Impact, Lessons Learned, and Future Directions

The ACAP architecture has demonstrated significant performance and efficiency gains across a broad spectrum of workloads. Key takeaways include:

  • Heterogeneous Co-design: Decoupling bulk compute (AIEs) from data movement (PL/NoC) and control (PS) is crucial for sustaining high throughput (Papalamprou et al., 10 Nov 2025, Zhang et al., 2024). This heterogeneity enables workload- and resource-aware partitioning that is infeasible on homogeneous architectures.
  • Automated Design-Space Exploration: Employing ML-guided and polyhedral exploration methods yields better energy–performance Pareto fronts than traditional analytical modeling or hand-tuning (Papalamprou et al., 10 Nov 2025, Dai et al., 2024).
  • Communication-Avoidance: Explicit separation of communication and compute phases, especially for regular CA algorithms, is a driver of energy and throughput gains (Zhang et al., 2024).
  • Scaling and Bottlenecks: Physical interface and on-chip bandwidth will remain primary constraints. Innovations in NoC design, memory hierarchies, and PL–AIE co-scheduling are critical for further improvements.
  • Generalizability: The frameworks and methodologies established on VCK5000/VC1902 port directly to other Versal-class devices under parametric adjustment of AIE count, buffer capacity, and NoC bandwidth (Zhang et al., 2024).
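
The energy–performance Pareto fronts that the design-space exploration methods above target can be sketched as a simple dominance filter over candidate configurations. The candidate (TOPS, TOPS/W) points below are made-up illustrations, not measured design points.

```python
def pareto_front(points):
    """Keep only non-dominated (throughput, energy-efficiency) pairs: a
    candidate is dropped if some other candidate is at least as good in
    both metrics and is a different point."""
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# Hypothetical (TOPS, TOPS/W) design points; none are measured results.
candidates = [(77, 0.94), (60, 1.20), (50, 1.00), (40, 0.50)]
```

ML-guided DSE differs from this brute-force filter only in how candidates are generated: a learned model proposes promising tilings instead of enumerating them, but the final selection is still a dominance check like this one.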

ACAP represents a state-of-the-art platform for highly efficient, high-performance domain-specialized hardware acceleration. Ongoing research seeks to address programmability, scalable toolchains, and dynamic reconfiguration to fully unlock ACAP’s architectural potential.
