
IMAX3 System: High-Efficiency AI Accelerator

Updated 3 January 2026
  • IMAX3 is a general-purpose accelerator built on a scalable, systolic-optimized CGLA architecture for high-throughput AI computation.
  • It integrates a linear array of processing elements with local memory and custom SIMD instructions designed for efficient dense linear algebra operations.
  • Evaluations on FPGA and projected ASIC platforms demonstrate IMAX3's potential for energy-efficient, on-device AI inference with improved throughput.

IMAX3 is a general-purpose accelerator system based on a Coarse-Grained Linear Array (CGLA) architecture. Designed to support high-throughput, energy-efficient execution of key computational kernels in modern AI workloads, IMAX3 employs a scalable, systolic-optimized network of processing elements (PEs) tailored for dense linear algebra operations and fine-grained parallelism. An in-depth evaluation of IMAX3 using the stable-diffusion.cpp image generation framework, on both FPGA and projected ASIC implementations, demonstrates its potential for AI-specialized, on-device computing and establishes guidelines for further architectural refinement (Ando et al., 4 Nov 2025).

1. Architectural Organization

The core of the IMAX3 system is a linear array of PEs grouped into lanes. Each PE incorporates:

  • A small-scale ALU pipeline capable of integer and floating-point multiply–accumulate (MAC) operations and custom SIMD instructions.
  • A local Register File (RF) consisting of 32 × 32-bit registers for temporaries and loop indices.
  • A Local Memory Module (LMM) of 8 KB per PE (configurable to 32 KB), functioning as a software-managed scratchpad for storing activation tiles, weights, and partial sums.

PEs are arrayed in lanes, with 64 PEs per lane in the FPGA prototype and scalability up to 8 lanes (512 PEs total). Linear inter-PE connections in both directions enable nearest-neighbor communication, supporting systolic "shift-and-sum" execution for dot-product and convolution kernels. Each PE features four point-to-point links for east/west neighbor transfer, vertical communication to special REP registers, and downward configuration control.
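The systolic "shift-and-sum" pattern can be sketched in plain Python. This is a behavioral model only: the round-robin distribution, function names, and reduction order are illustrative, not the IMAX3 micro-program format.

```python
# Behavioral sketch of a shift-and-sum dot product on a linear PE array.
# Each PE performs local MACs; running sums are then shifted east along
# the nearest-neighbor links and accumulated.

def shift_and_sum_dot(a, b, n_pe):
    """Emulate n_pe linearly connected PEs computing dot(a, b)."""
    assert len(a) == len(b)
    partial = [0] * n_pe
    # Distribute element pairs over the PEs (round-robin, illustrative).
    for i, (x, w) in enumerate(zip(a, b)):
        partial[i % n_pe] += x * w            # local MAC in each PE
    # Systolic reduction: each PE passes its running sum to its east
    # neighbor, which adds it to its own.
    acc = 0
    for p in partial:                          # west-to-east shift chain
        acc += p
    return acc

print(shift_and_sum_dot([1, 2, 3, 4], [5, 6, 7, 8], n_pe=2))  # 70
```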

Configuration employs a lightweight command-stream protocol: host CPUs deliver configuration words (CONF) to an on-chip FIFO via DMA, specifying PE function units (e.g., OP_SML8, OP_AD24, OP_CVT53), address generators, and loop control. Once loaded, PEs execute aligned micro-programs autonomously, minimizing host interaction and associated control overhead (Ando et al., 4 Nov 2025).
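A host-side sketch of building such CONF words is shown below. The field layout, widths, and opcode numbers are assumptions for illustration only; the actual IMAX3 encoding is not specified in this summary.

```python
# Illustrative packing of a configuration (CONF) word for the command
# stream. Field layout [8b opcode | 16b base addr | 8b stride | 16b loop]
# and the opcode values are hypothetical, not the documented encoding.

OPCODES = {"OP_SML8": 0x01, "OP_AD24": 0x02, "OP_CVT53": 0x03}

def pack_conf(op, addr_base, stride, loop_count):
    """Pack one CONF word selecting a PE function unit, address
    generator parameters, and loop control."""
    return ((OPCODES[op] << 40)
            | (addr_base << 24)
            | (stride << 16)
            | loop_count)

conf = pack_conf("OP_SML8", addr_base=0x0100, stride=4, loop_count=64)
print(hex(conf))  # 0x10100040040
```

In a real flow, a sequence of such words would be pushed into the on-chip FIFO via DMA before the PEs run autonomously.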

2. Memory Hierarchy and Off-Chip Interfaces

Each PE's LMM (8 KB, expandable to 32 KB) is double-buffered to hide memory latency. LMM storage is software-managed, supporting all weight, activation, and partial sum fragments arising in matrix multiplication and convolution kernels. On-chip buffer per lane is 512 KB.
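The double buffering amounts to the usual ping-pong pattern: while a PE computes on one buffer, DMA refills the other. A minimal behavioral sketch, where `prefetch` and `compute` stand in for the DMA fill and PE execution (both names are hypothetical):

```python
# Ping-pong double buffering to hide LMM fill latency: the buffer not
# currently being computed on is refilled with the next tile.

def run_tiles(tiles, prefetch, compute):
    """Process tiles with two buffers so loads overlap compute."""
    buf = [None, None]
    buf[0] = prefetch(tiles[0])                 # prime buffer 0
    results = []
    for i, tile in enumerate(tiles):
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            buf[nxt] = prefetch(tiles[i + 1])   # fill the idle buffer
        results.append(compute(buf[i % 2]))     # compute on the ready buffer
    return results

out = run_tiles([1, 2, 3], prefetch=lambda t: t * 10, compute=lambda b: b + 1)
print(out)  # [11, 21, 31]
```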

External connectivity is via an AXI-4 interface to DDR4 DRAM (8 GB for OS, 4 GB for DMA). Measured off-chip bandwidth on the Versal platform is approximately 25 GB/s. Data and configuration traffic use dedicated DMA channels between host and programmable logic (PL) regions, providing parallel, low-latency command and data transmission (Ando et al., 4 Nov 2025).
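With these figures, a simple roofline-style check indicates whether a kernel is limited by the ≈25 GB/s link or by single-lane compute throughput. The traffic counts in the example are illustrative assumptions, not measurements from the paper.

```python
# Roofline-style check: a kernel is bandwidth-bound when the time to
# move its bytes over the 25 GB/s AXI-4 link exceeds its compute time
# at the 18.56 GMAC/s single-lane FPGA peak.

def is_bandwidth_bound(macs, bytes_moved, gmacs_peak=18.56, bw_gbs=25.0):
    t_compute = macs / (gmacs_peak * 1e9)
    t_memory = bytes_moved / (bw_gbs * 1e9)
    return t_memory > t_compute

# int8 GEMV-like case, streaming one weight byte and one activation
# byte per MAC (2 bytes/MAC, an illustrative assumption):
print(is_bandwidth_bound(macs=1e9, bytes_moved=2e9))  # True
```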

3. Performance Metrics and Key Parameters

Salient parameters and measurable outcomes for IMAX3 are summarized:

| Parameter | Value (FPGA Prototype) | Projected ASIC |
| --- | --- | --- |
| PEs per lane (N_PE) | 64 | 64 |
| Lanes per system (L) | up to 8 | up to 8 |
| LMM per PE (M_LMM) | 8 KB | 8 KB–32 KB |
| On-chip buffer per lane (M_buf) | 512 KB | 512 KB |
| Clock frequency (f_FPGA / f_ASIC) | 145 MHz | 840 MHz |
| Peak lane throughput (GMAC/s) | 18.56 | 107.52 |
| System throughput, 8 lanes (GMAC/s) | 148.5 | 860.2 |
| Off-chip bandwidth (B_DDR4) | 25 GB/s | projected higher |
| Energy per int8 MAC (E_MAC) | — | ≈0.36 pJ/MAC |
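The throughput rows are consistent with peak = N_PE × MACs/cycle × f, assuming 2 MACs per PE per cycle (a plausible reading of the 2-way SIMD OP_SML8, not a figure stated explicitly in the summary):

```python
# Peak MAC throughput reconstructed from the table's parameters.
# The factor of 2 MACs/PE/cycle is an inference from the 2-way
# SIMD OP_SML8 instruction.

def peak_gmacs(n_pe, f_mhz, macs_per_cycle=2, lanes=1):
    return n_pe * macs_per_cycle * f_mhz * lanes / 1000  # GMAC/s

print(peak_gmacs(64, 145))           # 18.56  (FPGA lane)
print(peak_gmacs(64, 840))           # 107.52 (ASIC lane)
print(peak_gmacs(64, 145, lanes=8))  # 148.48, matching the table's 148.5
```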

End-to-end latency and power-delay product (PDP) for one step of 512×512 SD-Turbo inference place the IMAX3 ASIC at 754.5 s and 39,853 J for Q3_K quantization, and at 558.0 s and 26,620 J for Q8_0. In kernel-only execution, a single-lane array outperforms the ARM Cortex-A72 host by ≈1.1× and is competitive with a CPU core at the same 840 MHz clock. Multi-lane scaling is near-ideal up to 2 lanes, after which the host's DMA bandwidth limits further performance gains (Ando et al., 4 Nov 2025).
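The PDP values follow directly as power × latency; small gaps against the quoted joule figures are consistent with rounding in the reported power numbers.

```python
# PDP (power-delay product) = power x latency, checked against the
# reported figures.

def pdp(power_w, latency_s):
    """Power-delay product in joules."""
    return power_w * latency_s

print(round(pdp(180.0, 790.3)))  # 142254 (IMAX3 FPGA, Q3_K)
print(round(pdp(47.7, 558.0)))   # 26617, close to the reported 26,620
```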

4. Workload Mapping, Quantization, and Instruction Fusion

IMAX3 achieves high arithmetic intensity and efficiency via domain-specific mapping of stable-diffusion.cpp dot-product and convolution kernels. Supported quantization schemes include:

  • Q8_0: 8-bit weight × 8-bit activation, yielding 24-bit partial sums and FP32 accumulations.
  • Q3_K: 3-bit weights with 6-bit scaling, restructured as fused 5-bit scale and 3-bit weights via OP_CVT53.
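A behavioral sketch of the Q8_0 arithmetic path described above, with block size and scale handling simplified (the per-operand scales are illustrative assumptions):

```python
# Minimal emulation of the Q8_0 path: int8 weights x int8 activations
# produce integer partial sums that fit in 24 bits per block, which
# are then scaled and accumulated in FP32-style floating point.

def q8_0_dot(w_int8, x_int8, scale_w, scale_x):
    partial = 0
    for w, x in zip(w_int8, x_int8):
        assert -128 <= w <= 127 and -128 <= x <= 127
        partial += w * x                          # 24-bit-safe partial sum
    return float(partial) * scale_w * scale_x     # FP32 accumulation

print(q8_0_dot([127, -64, 3], [2, 2, 2], scale_w=0.5, scale_x=1.0))  # 66.0
```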

Dot-product loops are tiled into fragments of T = 12 elements per tile; groups of 12 PEs cooperate synchronously per tile. For dot-product dimension D, the array processes ⌈D/T⌉ tiles, distributing them across ≈46 PEs (Q8_0) or ≈51 PEs (Q3_K), with multi-lane parallelism harnessed for independent dot-products (e.g., multiple output channels) (Ando et al., 4 Nov 2025).
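The tiling arithmetic is straightforward to state in code:

```python
import math

# Tile count for the T = 12 tiling: a dot product of dimension D is
# split into ceil(D / T) tiles, each handled by a group of 12
# cooperating PEs.

def tiles_needed(d, t=12):
    return math.ceil(d / t)

print(tiles_needed(768))  # 64 full tiles
print(tiles_needed(100))  # 9 tiles, the last one partially filled
```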

Kernel fusion is implemented through:

  • OP_SML8: 2-way SIMD 8-bit multiply-add.
  • OP_AD24: SIMD 24-bit addition.
  • OP_CVT53: fused operation for 5-bit scale and 3-bit weight multiply–accumulate.

By merging the original 3-stage process (load, multiply, convert) into a 2-stage pipeline (OP_SML8+OP_AD24, OP_CVT53), IMAX3 reduces LMM data movement by 25% and eliminates conversion-stage latency (Ando et al., 4 Nov 2025).
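The 25% saving can be seen from a simple access count. The per-stage LMM access figures below are illustrative assumptions chosen to be consistent with the stated reduction, not measured values:

```python
# Counting LMM traffic for the unfused 3-stage sequence versus the
# fused 2-stage pipeline. Per-element access counts (4 vs 3) are
# illustrative assumptions showing where a 25% saving arises.

def lmm_accesses(n_elems, fused):
    if fused:
        # OP_SML8+OP_AD24 then OP_CVT53: 3 LMM accesses per element
        return 3 * n_elems
    # separate load, multiply, convert stages: 4 accesses per element
    return 4 * n_elems

n = 1024
saving = 1 - lmm_accesses(n, fused=True) / lmm_accesses(n, fused=False)
print(f"{saving:.0%}")  # 25%
```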

5. Experimental Evaluation and Resource Utilization

FPGA prototyping confirms functional and quantitative viability of the architectural approach. A single lane (64 PEs) occupies ≈128 DSP48E2s, ≈16 K LUTs, and ≈4 MB URAM. The four-board FPGA system (up to 8 lanes) serves as an empirical baseline for ASIC projections.

Performance comparison across platforms for a single SD-Turbo step:

| Device | Q3_K Latency [s] | Q8_0 Latency [s] | Power [W] | PDP [J] |
| --- | --- | --- | --- | --- |
| ARM Cortex-A72 (host) | 809.7 | 625.1 | 1.5 | 1,214.6 |
| IMAX3 (FPGA) | 790.3 | 654.7 | 180 | 142,254 |
| IMAX3 (ASIC proj.) | 754.5 | 558.0 | 52.8 (Q3_K) / 47.7 (Q8_0) | 39,853 / 26,620 |
| Intel Xeon w5-2465X | 59.3 | — | 200 | 11,860 |
| NVIDIA GTX 1080 Ti | 16.2 | — | 250 | 4,050 |

The data show that the IMAX3 ASIC achieves an energy per int8 MAC of approximately 0.36 pJ, with notable kernel-level energy efficiency. However, for end-to-end workloads, GPUs remain superior due to more mature host–accelerator integration and higher off-chip bandwidth (Ando et al., 4 Nov 2025).

6. Design Insights and Future Architectural Directions

Guidelines for evolving the IMAX3 family include:

  • Increasing the offload ratio: implementing FP16/FP32 dot-product kernels to shift a greater share of inner-loop computation from the host, and fusing backward/activation kernels to better exploit PE pipeline locality.
  • Host–accelerator parallelism: adopting a manycore CPU or hardware-managed DMA to support multi-lane operation beyond two lanes without bandwidth saturation, alongside integration of a dedicated on-chip scheduler.
  • Memory hierarchy: expanding LMM per PE and adopting high-bandwidth memory (≥400 GB/s) to mitigate off-chip stalls, and enabling register-file broadcast for efficient sharing of common sub-expressions.
  • ISA and reconfigurability: pruning underutilized PE functional units, compressing the ISA through kernel profiling, and introducing programmable macro-operations for frequently encountered AI primitives such as 1×1 convolution and attention mechanisms.
  • Energy efficiency: integrating fine-grained power gating in idle PE lanes, enabling dynamic voltage/frequency scaling (DVFS) tuned by workload, and supporting mixed-precision accumulate chains (8, 16, or 32 bits per kernel) to minimize switching energy (Ando et al., 4 Nov 2025).

This suggests that while IMAX3 demonstrates robust kernel-level efficiency and architectural scalability, further system-level integration, richer on-chip memory, and improved host communication are required to close the end-to-end performance gap with contemporary GPUs. Continued refinement is projected to advance the platform as a foundation for next-generation, energy-efficient, on-device AI computation.
