IMAX3 System: High-Efficiency AI Accelerator
- IMAX3 is a general-purpose accelerator featuring a scalable, systolic-optimized CGLA architecture designed for high-throughput AI computation.
- It integrates a linear array of processing elements with local memory and custom SIMD instructions designed for efficient dense linear algebra operations.
- Evaluations on FPGA and projected ASIC platforms demonstrate IMAX3's potential for energy-efficient, on-device AI inference with improved throughput.
IMAX3 is a general-purpose accelerator system based on a Coarse-Grained Linear Array (CGLA) architecture. Designed to support high-throughput, energy-efficient execution of key computational kernels in modern AI workloads, IMAX3 employs a scalable, systolic-optimized network of processing elements (PEs) tailored for dense linear algebra operations and fine-grained parallelism. An in-depth evaluation of IMAX3 using the stable-diffusion.cpp image generation framework, on an FPGA prototype and in projected ASIC implementations, demonstrates its potential for AI-specialized, on-device computing and establishes guidelines for further architectural refinement (Ando et al., 4 Nov 2025).
1. Architectural Organization
The core of the IMAX3 system is a linear array of PEs grouped into lanes. Each PE incorporates:
- A small-scale ALU pipeline capable of integer and floating-point multiply–accumulate (MAC) operations and custom SIMD instructions.
- A local Register File (RF) consisting of 32 × 32-bit registers for temporaries and loop indices.
- A Local Memory Module (LMM) of 8 KB per PE (configurable to 32 KB), functioning as a software-managed scratchpad for storing activation tiles, weights, and partial sums.
PEs are arrayed in lanes, with 64 PEs per lane in the FPGA prototype and scalability up to 8 lanes (512 PEs total). Linear inter-PE connections in both directions enable nearest-neighbor communication, supporting systolic "shift-and-sum" execution for dot-product and convolution kernels. Each PE features four point-to-point links for east/west neighbor transfer, vertical communication to special REP registers, and downward configuration control.
Configuration employs a lightweight command-stream protocol: host CPUs deliver configuration words (CONF) to an on-chip FIFO via DMA, specifying PE function units (e.g., OP_SML8, OP_AD24, OP_CVT53), address generators, and loop control. Once loaded, PEs execute aligned micro-programs autonomously, minimizing host interaction and associated control overhead (Ando et al., 4 Nov 2025).
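The command-stream protocol above can be illustrated with a small encoder. The field widths and bit layout below are purely illustrative assumptions (the paper does not specify the CONF word format); only the opcode names (OP_SML8, OP_AD24, OP_CVT53) and the scale of the array come from the source.

```python
# Hypothetical CONF-word encoder for the IMAX3 command-stream protocol.
# Field widths and layout are illustrative assumptions, not the real format:
# [pe_id:9 | opcode:5 | addr_base:10 | loop_count:8]
OPCODES = {"OP_SML8": 0x1, "OP_AD24": 0x2, "OP_CVT53": 0x3}

def encode_conf(pe_id: int, opcode: str, addr_base: int, loop_count: int) -> int:
    """Pack one configuration word for a single PE."""
    assert 0 <= pe_id < 512              # up to 8 lanes x 64 PEs
    word = pe_id
    word = (word << 5) | OPCODES[opcode]
    word = (word << 10) | (addr_base & 0x3FF)
    word = (word << 8) | (loop_count & 0xFF)
    return word

# Example: configure PE 7 to run OP_SML8 over 64 loop iterations.
conf = encode_conf(7, "OP_SML8", addr_base=0x40, loop_count=64)
```

In the real system a stream of such words would be pushed through the on-chip FIFO via DMA, after which the PEs run autonomously.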
2. Memory Hierarchy and Off-Chip Interfaces
Each PE's LMM (8 KB, expandable to 32 KB) is double-buffered to hide memory latency. LMM storage is software-managed, supporting all weight, activation, and partial sum fragments arising in matrix multiplication and convolution kernels. On-chip buffer per lane is 512 KB.
External connectivity is via an AXI-4 interface to DDR4 DRAM (8 GB for OS, 4 GB for DMA). Measured off-chip bandwidth on the Versal platform is approximately 25 GB/s. Data and configuration traffic use dedicated DMA channels between host and programmable logic (PL) regions, providing parallel, low-latency command and data transmission (Ando et al., 4 Nov 2025).
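The double-buffered LMM discipline can be sketched as a ping-pong loop: while the PE computes on one half of the scratchpad, DMA fills the other. The function names below are stand-ins; the actual hardware sequences this inside the PE micro-program.

```python
# Sketch of double-buffered (ping-pong) scratchpad use, as in the per-PE LMM.
# dma_load and compute are illustrative stand-ins for an AXI/DMA transfer
# and the PE's MAC loop, respectively.
def dma_load(tile):
    return list(tile)          # stand-in: copy a tile into the scratchpad

def compute(buf):
    return sum(buf)            # stand-in: reduce the buffered tile

def process_tiles(tiles):
    results = []
    buffers = [None, None]               # two halves of the LMM
    buffers[0] = dma_load(tiles[0])      # prefetch the first tile
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            # Overlap: fetch the next tile while computing on the current one.
            buffers[nxt] = dma_load(tiles[i + 1])
        results.append(compute(buffers[cur]))
    return results
```

The overlap hides the latency of each `dma_load` behind the preceding `compute`, which is the point of the double-buffered design.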
3. Performance Metrics and Key Parameters
Salient parameters and measurable outcomes for IMAX3 are summarized:
| Parameter | Value (FPGA Prototype) | Projected ASIC |
|---|---|---|
| PEs per lane | 64 | 64 |
| Lanes per system | up to 8 | up to 8 |
| LMM per PE | 8 KB | 8 KB-32 KB |
| On-chip buffer per lane | 512 KB | 512 KB |
| Clock frequency | 145 MHz | 840 MHz |
| Peak lane throughput | 18.56 GOPS | 107.52 GOPS |
| System throughput (8 lanes) | 148.5 GOPS | 860.2 GOPS |
| Off-chip bandwidth | 25 GB/s | projected higher |
| Energy per int8 MAC (ASIC proj.) | — | — |
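The peak-throughput entries are consistent with each PE retiring one MAC (counted as 2 operations) per cycle, so the table values can be reproduced directly:

```python
# Peak throughput = n_PEs x 2 ops/MAC x clock. This reproduces the table's
# lane and system figures for both the FPGA prototype and the ASIC projection.
def peak_gops(n_pes: int, f_mhz: float) -> float:
    return n_pes * 2 * f_mhz / 1e3   # GOPS

fpga_lane = peak_gops(64, 145)       # 18.56 GOPS
asic_lane = peak_gops(64, 840)       # 107.52 GOPS
fpga_sys = 8 * fpga_lane             # 148.48 ~ 148.5 GOPS
asic_sys = 8 * asic_lane             # 860.16 ~ 860.2 GOPS
```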
End-to-end latency and power-delay product (PDP) for one step of 512×512 SD-Turbo inference place the projected IMAX3 ASIC at 754.5 s / 558.0 s and 39,853 J / 26,620 J (for Q3_K and Q8_0 quantization, respectively). In kernel-only execution, a single-lane array outperforms the ARM Cortex-A72 host by ≈1.1×, and at the projected 840 MHz clock it is competitive with a CPU core. Multi-lane scaling is near-ideal up to 2 lanes, after which the host's DMA bandwidth limits further performance gains (Ando et al., 4 Nov 2025).
4. Workload Mapping, Quantization, and Instruction Fusion
IMAX3 achieves high arithmetic intensity and efficiency via domain-specific mapping of stable-diffusion.cpp dot-product and convolution kernels. Supported quantization schemes include:
- Q8_0: 8-bit weight × 8-bit activation, yielding 24-bit partial sums and FP32 accumulations.
- Q3_K: 3-bit weights with 6-bit scaling, restructured as fused 5-bit scale and 3-bit weights via OP_CVT53.
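A minimal sketch of the Q8_0 format helps make the quantization scheme concrete. The block size and rounding follow the common ggml convention (32-element blocks, one floating-point scale per block, int8 values); the kernels described in the paper consume data in this family of formats, but the exact packing details here are an assumption.

```python
# Minimal sketch of Q8_0 block quantization (ggml-style convention, assumed):
# each block stores one FP scale plus int8 quantized values.
def quantize_q8_0(block):
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax else 1.0
    q = [max(-128, min(127, round(x / scale))) for x in block]
    return scale, q

def dequantize_q8_0(scale, q):
    return [scale * v for v in q]

scale, q = quantize_q8_0([0.5, -1.0, 0.25, 1.0])
approx = dequantize_q8_0(scale, q)   # close to the original values
```

On the accelerator, the 8-bit products of such blocks accumulate into 24-bit partial sums before the final FP32 accumulation, matching the Q8_0 datapath described above.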
Dot-product loops are tiled into fixed-size fragments, with groups of 12 PEs cooperating synchronously per tile. For a given dot-product dimension, the resulting tiles are distributed across 46 PEs (Q8_0) or 51 PEs (Q3_K), with multi-lane parallelism harnessed for independent dot-products (e.g., multiple output channels) (Ando et al., 4 Nov 2025).
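The tiled mapping can be sketched in scalar form: a long dot product is split into fixed-size tiles, each tile's partial sum is computed independently (one PE group per tile), and the partials are combined, which the hardware performs systolically via shift-and-sum. The tile size is a free parameter here, since the exact fragment length is not reproduced in this text.

```python
# Sketch of the tiled dot-product mapping. Each tile corresponds to the work
# of one PE group; the final sum models the systolic shift-and-sum reduction.
# The tile size is an illustrative parameter, not the hardware's actual value.
def tiled_dot(a, b, tile=12):
    assert len(a) == len(b)
    partials = []
    for start in range(0, len(a), tile):
        end = start + tile
        partials.append(sum(x * y for x, y in zip(a[start:end], b[start:end])))
    return sum(partials)   # shift-and-sum reduction across PE groups
```

Independent dot-products (e.g., different output channels) would run on different lanes, which is where the multi-lane parallelism enters.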
Kernel fusion is implemented through:
- OP_SML8: 2-way SIMD 8-bit multiply-add.
- OP_AD24: SIMD 24-bit addition.
- OP_CVT53: fused operation for 5-bit scale and 3-bit weight multiply–accumulate.
By merging the original 3-stage process (load, multiply, convert) into a 2-stage pipeline (OP_SML8+OP_AD24, OP_CVT53), IMAX3 reduces LMM data movement by 25% and eliminates conversion-stage latency (Ando et al., 4 Nov 2025).
5. Experimental Evaluation and Resource Utilization
FPGA prototyping confirms functional and quantitative viability of the architectural approach. A single lane (64 PEs) occupies ≈128 DSP48E2s, ≈16 K LUTs, and ≈4 MB URAM. The four-board FPGA system (up to 8 lanes) serves as an empirical baseline for ASIC projections.
Performance comparison across platforms for a single SD-Turbo step:
| Device | Q3_K Latency [s] | Q8_0 Latency [s] | Power [W] | PDP [J] |
|---|---|---|---|---|
| ARM Cortex-A72 (host) | 809.7 | 625.1 | 1.5 | 1,214.6 |
| IMAX3 (FPGA) | 790.3 | 654.7 | 180 | 142,254 |
| IMAX3 (ASIC proj.) | 754.5 | 558.0 | 52.8 (Q3_K) / 47.7 (Q8_0) | 39,853 / 26,620 |
| Intel Xeon w5-2465X | 59.3 | – | 200 | 11,860 |
| NVIDIA GTX 1080 Ti | 16.2 | – | 250 | 4,050 |
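The PDP column is simply latency multiplied by power, so the table rows can be cross-checked directly (using the Q3_K latencies, since those are listed for every device; the ASIC figures round to the reported values):

```python
# Cross-check of the PDP column: PDP = latency x power (joules).
def pdp(latency_s: float, power_w: float) -> float:
    return latency_s * power_w

a72 = pdp(809.7, 1.5)      # ~1,214.6 J
fpga = pdp(790.3, 180.0)   # ~142,254 J
xeon = pdp(59.3, 200.0)    # ~11,860 J
gpu = pdp(16.2, 250.0)     # ~4,050 J
```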
The data show that the projected IMAX3 ASIC achieves notable kernel-level energy efficiency per int8 MAC. However, for end-to-end workloads, GPUs remain superior due to more mature host–accelerator integration and higher off-chip bandwidth (Ando et al., 4 Nov 2025).
6. Design Insights and Future Architectural Directions
Guidelines for evolving the IMAX3 family include:
- Increasing the offload ratio: implementing FP16/FP32 dot-product kernels to shift a greater share of inner-loop computation from the host to the array, and fusing backward/activation kernels to better exploit PE pipeline locality.
- Host–accelerator parallelism: adopting a manycore CPU or hardware-managed DMA to support multi-lane operation beyond two lanes without bandwidth saturation, alongside integration of a dedicated on-chip scheduler.
- Memory hierarchy: expanding LMM per PE and adopting high-bandwidth memory (≥400 GB/s) to mitigate off-chip stalls, and enabling register-file broadcast for efficient sharing of common sub-expressions.
- ISA and reconfigurability: pruning underutilized PE functional units, compressing the ISA through kernel profiling, and introducing programmable macro-operations for frequently encountered AI primitives such as 1×1 convolution and attention mechanisms.
- Energy efficiency: integrating fine-grained power gating in idle PE lanes, enabling dynamic voltage/frequency scaling (DVFS) tuned by workload, and supporting mixed-precision accumulate chains (8, 16, or 32 bits per kernel) to minimize switching energy (Ando et al., 4 Nov 2025).
This suggests that while IMAX3 demonstrates robust kernel-level efficiency and architectural scalability, further system-level integration, richer on-chip memory, and improved host communication are required to close the end-to-end performance gap with contemporary GPUs. Continued refinement is projected to advance the platform as a foundation for next-generation, energy-efficient, on-device AI computation.