IMAX3: General-Purpose CGLA Accelerator

Updated 3 July 2026

IMAX3 is a general-purpose Coarse-Grained Linear Array (CGLA) accelerator with a linear PE-cache architecture, evaluated here on quantized AI kernels.
Its design integrates multi-lane processing and custom SIMD instructions to handle diverse kernels, though host orchestration remains a scaling bottleneck.
Evaluation using Stable Diffusion quantized kernels reveals robust kernel-level performance with modest end-to-end speedups due to limited offload ratios and data transfer overheads.

IMAX3 is a general-purpose coarse-grained reconfigurable accelerator described as a CGLA (Coarse-Grained Linear Array), presented as an “evolution” of a CGRA and evaluated through the implementation of the primary computational kernels from the stable-diffusion.cpp image generation framework (Ando et al., 4 Nov 2025). The term should be distinguished from IMaX, the Imaging Magnetograph eXperiment for the Sunrise balloon-borne solar observatory, which is unrelated in function and domain and does not mention “IMAX3” explicitly (Pillet et al., 2010). In the accelerator literature represented here, IMAX3 denotes a linear, programmable spatial-computing platform intended to retain flexibility across workloads such as sparse GEMM, FFT, CNNs, LLMs, and, in the cited evaluation, Stable Diffusion; it is therefore positioned between highly programmable CPUs/GPUs and highly specialized fixed-function AI ASICs (Ando et al., 4 Nov 2025).

1. Terminology and scope

The available arXiv evidence uses “IMAX3” in a specific accelerator context rather than as a continuation of the solar-physics instrument named IMaX. The 2010 IMaX paper is explicitly about the Imaging Magnetograph eXperiment, a spectropolarimeter for Sunrise, and the paper “does not mention the term ‘IMAX3’ explicitly, nor does it define any third-generation instrument, ‘IMaX-3,’ or later upgrade carrying that exact name” (Pillet et al., 2010). That source is therefore relevant only for disambiguation.

Within the accelerator context, IMAX3 is described as a general-purpose computational platform based on a CGLA, with a linear arrangement of processing resources and interleaved processing and memory structures (Ando et al., 4 Nov 2025). Its evaluation through Stable Diffusion is framed as a test of whether a general-purpose reconfigurable design can execute modern generative-AI kernels efficiently without being restricted to a single model family (Ando et al., 4 Nov 2025).

A common source of confusion is the orthographic similarity between IMaX and IMAX3. The former is a balloon-borne solar imaging spectropolarimeter using dual-LCVR modulation, dual-beam polarimetry, and a LiNbO$_3$ etalon (Pillet et al., 2010). The latter is a CGLA accelerator for quantized AI kernels (Ando et al., 4 Nov 2025). No naming lineage is established by the cited papers.

2. Architectural organization

IMAX3 is characterized by a linear arrangement of processing resources rather than a 2D mesh-style reconfigurable fabric (Ando et al., 4 Nov 2025). Its defining organization is the interleaving of Processing Elements (PEs) and cache memory along that linear structure, which is intended to absorb irregular memory-access latency, improve use of Local Memory Modules (LMMs), reduce memory-access bottlenecks, and preserve programmability across multiple kernel classes (Ando et al., 4 Nov 2025).

Each PE contains an ALU, local memory, and a pipelined structure to maintain throughput (Ando et al., 4 Nov 2025). The architecture supports logical placement of functional units and “logically aligned execution patterns,” which are presented as a design philosophy for improving both compilation efficiency and execution efficiency (Ando et al., 4 Nov 2025). The paper does not provide a full microarchitectural exposition, but it does identify IMAX3 as a spatial compute fabric rather than a purely scalar or vector coprocessor.

IMAX3 is also explicitly multi-lane. A lane appears as an independently operating linear PE array, and the system is reported in an 8-lane configuration on four AMD Versal VPK180 devices, while the Stable Diffusion evaluation on FPGA uses one lane at 145 MHz (Ando et al., 4 Nov 2025). Table I notes that IMAX3 has 64 cores per lane, where “core” means PE (Ando et al., 4 Nov 2025). This lane structure is important because later results show that the architecture’s scaling behavior is limited not simply by PE count but by host-orchestration capacity.

At the SoC level, IMAX3 is implemented with a Processing System (PS) containing an ARM Cortex-A72 and a Programmable Logic (PL) region containing the IMAX CGLA core, with PS and PL connected via NoC (Ando et al., 4 Nov 2025). The PS runs the OS and handles system control, whereas the PL hosts the spatial compute array (Ando et al., 4 Nov 2025). This host–accelerator coupling is central to the reported system behavior.

3. Memory system, execution model, and instruction support

The cited evaluation emphasizes that IMAX3’s performance cannot be understood from PE computation alone. The memory and control structures mentioned include cache memory interleaved with PEs, Local Memory Modules (LMM), main memory, a DMA buffer, and host-side DDR4 (Ando et al., 4 Nov 2025). For the FPGA platform, the listed capacities are 8 GB DDR4 for OS buffer and 4 GB DDR4 for DMA buffer; for projected ASIC power estimation, the paper mentions a 512 KB LMM configuration (Ando et al., 4 Nov 2025).

The execution-time decomposition is given in terms of LOAD, DRAIN, EXEC, CONF, and REGV (Ando et al., 4 Nov 2025). These denote, respectively, main-memory-to-cache transfer, cache-to-main-memory transfer, computation on the IMAX core, command transfer, and register initialization (Ando et al., 4 Nov 2025). The implied execution sequence is therefore a configured and host-managed heterogeneous workflow rather than a fully autonomous accelerator pipeline.
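The five-phase decomposition can be made concrete with a small bookkeeping sketch. It assumes the phases run back to back without overlap, which the paper does not state, and the class name `PhaseTimes` and the sample numbers are purely illustrative rather than measured values.

```python
from dataclasses import dataclass


@dataclass
class PhaseTimes:
    """Time (in seconds) spent in each phase named in the paper."""
    load: float   # LOAD: main memory -> cache
    drain: float  # DRAIN: cache -> main memory
    exec_: float  # EXEC: computation on the IMAX core
    conf: float   # CONF: command transfer
    regv: float   # REGV: register initialization

    def total(self) -> float:
        # Assumption: phases are serialized, so the total is a plain sum.
        return self.load + self.drain + self.exec_ + self.conf + self.regv

    def compute_fraction(self) -> float:
        # Share of the total that is actual PE computation.
        return self.exec_ / self.total()


# Hypothetical numbers, chosen only to illustrate the calculation.
sample = PhaseTimes(load=4.0, drain=1.0, exec_=3.0, conf=1.5, regv=0.5)
print(sample.total(), sample.compute_fraction())
```

Under this serial model, any phase other than EXEC is pure overhead, which is why transfer-heavy kernels can lose their compute advantage.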

Programmability is central to how the paper defines IMAX3. The design is repeatedly described as not being a model-specific accelerator, and the Stable Diffusion work reuses GGML-oriented quantized dot-product kernels previously developed for llama.cpp (Ando et al., 4 Nov 2025). That reuse is presented as evidence that the architecture supports kernels across disparate model families rather than only a single fixed workload.

To support GGML quantization formats efficiently, the implementation adds concrete custom instructions (Ando et al., 4 Nov 2025):

Instruction	Function
OP_SML8	2-way SIMD signed 8-bit integer multiply-add with sign-extended 24-bit output
OP_AD24	2-way 24-bit integer addition for accumulation
OP_CVT53	SIMD conversion specialized for Q3_K

These instructions show that IMAX3 is described as an extensible ISA-like accelerator rather than a completely fixed datapath (Ando et al., 4 Nov 2025). A plausible implication is that the architecture’s “general-purpose” character depends not only on reconfigurable PE placement but also on selectively introduced domain-specific primitives.
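As a rough illustration of what the first two instructions in the table compute, the following sketch emulates a 2-way multiply-add and a 2-way 24-bit add in software. The operand layout (how many 8-bit pairs feed each SIMD way) and the wrap-around overflow behavior are assumptions made for the example; the paper's table gives only the one-line descriptions above, and the function names are hypothetical.

```python
def wrap24(x: int) -> int:
    """Wrap an integer into the signed 24-bit two's-complement range."""
    x &= 0xFFFFFF
    return x - 0x1000000 if x & 0x800000 else x


def sml8_2way(a, b):
    """Emulate a 2-way signed 8-bit multiply-add (in the spirit of OP_SML8).

    `a` and `b` each hold two ways; every way is a sequence of signed
    8-bit values. Each way yields the sum of its pairwise products as a
    signed 24-bit value. The per-way operand count is an assumption.
    """
    return tuple(
        wrap24(sum(x * y for x, y in zip(way_a, way_b)))
        for way_a, way_b in zip(a, b)
    )


def ad24_2way(p, q):
    """Emulate a 2-way 24-bit integer add (in the spirit of OP_AD24)."""
    return tuple(wrap24(x + y) for x, y in zip(p, q))


partial = sml8_2way(((1, 2), (-3, 4)), ((5, 6), (7, -8)))
print(partial)                        # (17, -53)
print(ad24_2way(partial, (100, 100)))  # (117, 47)
```

A 24-bit accumulator leaves ample headroom here: a single signed 8-bit product has magnitude at most 16384, so many products can be summed before overflow becomes possible.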

4. Stable Diffusion implementation and supported operators

The evaluated software framework is stable-diffusion.cpp, a pure C/C++ implementation of Stable Diffusion v1.5 using GGML (Ando et al., 4 Nov 2025). The workload focus is not the entire model graph but the primary quantized dot-product kernels used under quantized models. This is crucial because the paper’s most significant limitations stem from partial rather than comprehensive offload.

The workload includes four data types: F32, F16, Q3_K, and Q8_0 (Ando et al., 4 Nov 2025). Of these, only Q3_K and Q8_0 are implemented on IMAX3; FP16 and FP32 remain on the host CPU (Ando et al., 4 Nov 2025). The targeted primitive is the dot product, identified as central to convolutions, attention, and other U-Net operations (Ando et al., 4 Nov 2025).

For the Q8_0 kernel, the IMAX dataflow performs 8-bit integer multiply and add, aggregates partial sums into a 24-bit integer across every 12 PEs, and then performs a final multiplication with a 32-bit single-precision floating-point value (Ando et al., 4 Nov 2025). The enabling custom instructions are primarily OP_SML8 and OP_AD24 (Ando et al., 4 Nov 2025).
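The integer-then-float structure of this dataflow can be sketched in plain Python. The sketch follows the general GGML Q8_0 idea of 32 signed 8-bit values sharing one scale per block; it keeps the scale as a Python float rather than half precision, and it does not model the 12-PE partial-sum grouping, so it illustrates the arithmetic only, not the IMAX mapping. The function names are hypothetical.

```python
QK8_0 = 32  # values per Q8_0 block


def quantize_q8_0(values):
    """Quantize floats into (scale, int8 list) blocks of 32 values."""
    assert len(values) % QK8_0 == 0
    blocks = []
    for i in range(0, len(values), QK8_0):
        chunk = values[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                 # per-block scale
        inv = 1.0 / d if d else 0.0
        qs = [round(v * inv) for v in chunk]  # each in [-127, 127]
        blocks.append((d, qs))
    return blocks


def dot_q8_0(a_blocks, b_blocks):
    """Dot product: integer multiply-accumulate per block, then one
    floating-point multiply by the combined scale."""
    total = 0.0
    for (da, qa), (db, qb) in zip(a_blocks, b_blocks):
        isum = sum(x * y for x, y in zip(qa, qb))  # integer-only part
        total += isum * (da * db)                  # final float multiply
    return total


x = [i / 10 for i in range(-16, 16)]
y = [1.0] * 32
approx = dot_q8_0(quantize_q8_0(x), quantize_q8_0(y))
exact = sum(p * q for p, q in zip(x, y))
print(approx, exact)
```

Keeping the inner loop in integers is what lets narrow SIMD multiply-add units do most of the work, with floating point needed only once per block.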

For Q3_K, GGML’s format contains 6-bit scale data and 2-bit and 1-bit quantized data (Ando et al., 4 Nov 2025). To fit IMAX’s SIMD model, the implementation converts 6-bit scale data to 5-bit and packs 2-bit and 1-bit segments into a unified 3-bit form (Ando et al., 4 Nov 2025). The paper states that this approximation has “almost no effect on the final calculation results” (Ando et al., 4 Nov 2025). This is one of the clearest instances of hardware–software co-design in the evaluation: the original quantization format is transformed to better match the accelerator’s SIMD execution structure.
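The repacking described above can be illustrated with a few bit operations. The paper does not spell out the bit order of the unified 3-bit field or how the 6-bit scale is reduced to 5 bits, so the choices below (high bit placed above the two low bits, scale reduced by dropping its least significant bit, scale treated as unsigned) are assumptions made only for the sketch, as are the function names.

```python
def pack3(low2: int, high1: int) -> int:
    """Merge a 2-bit field and a 1-bit field into one 3-bit value.

    Assumed layout: the 1-bit segment becomes the most significant bit.
    """
    assert 0 <= low2 <= 3 and high1 in (0, 1)
    return (high1 << 2) | low2


def unpack3(u: int):
    """Split a 3-bit value back into its (2-bit, 1-bit) parts."""
    return u & 0b11, (u >> 2) & 1


def scale6_to_5(s6: int) -> int:
    """Reduce a 6-bit scale to 5 bits by dropping the lowest bit (assumed)."""
    assert 0 <= s6 <= 63
    return s6 >> 1


def scale5_to_6(s5: int) -> int:
    """Expand a 5-bit scale back to the 6-bit range for comparison."""
    return s5 << 1


# The 3-bit packing is lossless; the scale reduction loses at most 1 unit.
worst_scale_error = max(abs(s - scale5_to_6(scale6_to_5(s))) for s in range(64))
print(pack3(0b10, 1), worst_scale_error)
```

Under these assumptions only the scale conversion is lossy, and its error is bounded by one unit of the 6-bit scale, which is consistent in spirit with the paper's statement that the approximation barely affects results.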

Spatial mapping is reported explicitly. The Q3_K kernel is mapped across 51 of 64 PEs, while the Q8_0 kernel uses 46 of 64 PEs, yielding effective spatial utilizations of 79.7% and 71.9%, respectively (Ando et al., 4 Nov 2025). The paper treats these values as evidence that preserving generality entails some unused resources even for hand-tailored kernels.
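The utilization figures follow directly from the PE counts, as this short check shows. The helper name is hypothetical; the PE counts are those reported above.

```python
PES_PER_LANE = 64


def spatial_utilization(used_pes: int, total_pes: int = PES_PER_LANE) -> float:
    """Percentage of PEs in a lane that a kernel mapping occupies."""
    return 100.0 * used_pes / total_pes


for kernel, used in (("Q3_K", 51), ("Q8_0", 46)):
    util = spatial_utilization(used)
    idle = PES_PER_LANE - used
    print(f"{kernel}: {util:.4f}% used, {idle} PEs idle")
```

The unrounded values are 79.6875% and 71.875%, so 13 and 18 PEs per lane are idle for Q3_K and Q8_0, respectively.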

5. Experimental platform and quantitative results

The benchmark consists of generating a 512 × 512 image with SD-Turbo, using 1 inference step and the prompt “a lovely cat” (Ando et al., 4 Nov 2025). The compared hardware comprises an ARM Cortex-A72 on the Versal host, the IMAX3 FPGA prototype, a projected IMAX3 ASIC (28 nm), an Intel Xeon w5-2465X, and an NVIDIA GTX 1080 Ti (Ando et al., 4 Nov 2025).

The hardware specifications given in the paper include the following (Ando et al., 4 Nov 2025):

Platform	Key parameters
ARM Cortex-A72 on Versal	2 cores, 7 nm, 1.4 GHz, 8 GB DDR4, 1.5 W
IMAX3 FPGA prototype	64 PEs per lane, 145 MHz, 8 + 4 GB DDR4, 180 W
Projected IMAX3 ASIC (28 nm)	64 PEs per lane, 14.6 mm$^2$, 800 MHz in table, 840 MHz maximum from STA
Intel Xeon w5-2465X	16 cores, 3.1 GHz, 512 GB DDR5, 200 W
NVIDIA GTX 1080 Ti	3584 CUDA cores, 471 mm$^2$, 1480 MHz, 11 GB GDDR5X, 250 W

The paper notes a small inconsistency between the table entry (800 MHz) and the text, which reports 840 MHz maximum operating frequency from static timing analysis and uses 840 MHz for projection (Ando et al., 4 Nov 2025).

The offload ratio is limited. The time breakdown of dot-product execution, excluding memory-copy overhead, shows that for the Q3_K model, F32 accounts for 30.7%, F16 for 59.0%, and Q3_K for 10.3%; for the Q8_0 model, F32 accounts for 21.8%, F16 for 62.0%, and Q8_0 for 16.3% (Ando et al., 4 Nov 2025). The paper states that the overall offload ratio is less than 20% (Ando et al., 4 Nov 2025). This directly explains why kernel acceleration does not translate into large end-to-end improvements.
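The link between offload ratio and end-to-end speedup can be illustrated with Amdahl's law, which is a standard model and not something the paper is quoted as using. The sketch treats the offloaded fraction as a share of total runtime, which is a simplification, since the percentages above refer to dot-product time excluding memory copies.

```python
def amdahl_speedup(offload_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when only `offload_fraction` of the runtime
    is accelerated by a factor of `kernel_speedup`."""
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / kernel_speedup)


def amdahl_limit(offload_fraction: float) -> float:
    """Upper bound on speedup as the accelerated part becomes free."""
    return 1.0 / (1.0 - offload_fraction)


# With at most 20% of the work offloaded, even an infinitely fast
# accelerator cannot exceed a 1.25x end-to-end speedup.
print(amdahl_limit(0.20))
print(amdahl_speedup(0.20, 5.8))
```

This bound ignores transfer and control overheads, which only push the achievable speedup lower.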

The reported end-to-end latencies are (Ando et al., 4 Nov 2025):

Model	Platform	Latency
Q3_K	IMAX3 FPGA	790.3 s
Q3_K	ARM Cortex-A72 standalone	809.7 s
Q3_K	IMAX3 ASIC projected	754.5 s
Q3_K	Intel Xeon	59.3 s
Q3_K	GTX 1080 Ti	16.2 s
Q8_0	IMAX3 FPGA	654.7 s
Q8_0	ARM Cortex-A72 standalone	625.1 s
Q8_0	IMAX3 ASIC projected	558.0 s

For the FPGA-to-ASIC compute projection, the paper uses the frequency ratio $\frac{840}{145} \approx 5.79$, stated as an approximate 5.8× reduction in IMAX computation time (Ando et al., 4 Nov 2025). Nevertheless, because most work remains on the host CPU, the end-to-end latency changes are much smaller than this compute-only factor would suggest.
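The gap between the compute-only factor and the end-to-end effect can be checked directly from the reported latencies. The sketch below uses only the numbers in the latency table, with the standalone ARM Cortex-A72 as the baseline; the dictionary layout and names are illustrative.

```python
# Reported end-to-end latencies in seconds.
latency_s = {
    "Q3_K": {"arm": 809.7, "fpga": 790.3, "asic": 754.5},
    "Q8_0": {"arm": 625.1, "fpga": 654.7, "asic": 558.0},
}

# Compute-only scaling factor used for the ASIC projection.
freq_ratio = 840 / 145


def speedup_vs_arm(model: str, platform: str) -> float:
    """End-to-end speedup relative to the standalone ARM host."""
    return latency_s[model]["arm"] / latency_s[model][platform]


print(f"frequency ratio: {freq_ratio:.2f}x")
for model in latency_s:
    for platform in ("fpga", "asic"):
        print(f"{model} {platform}: {speedup_vs_arm(model, platform):.3f}x")
```

The end-to-end speedups come out at roughly 1.02× and 1.07× for Q3_K and 0.95× and 1.12× for Q8_0, far below the 5.8× compute-only factor, and the Q8_0 FPGA run is slower than the host alone.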

6. Power, efficiency, scaling, and bottlenecks

The energy-efficiency metric used is Power-Delay Product (PDP), defined in the paper as

$PDP = \text{Execution Time} \times \text{Power}$

(Ando et al., 4 Nov 2025). The power figures used in the evaluation are 1.5 W for the ARM Cortex-A72, 180 W for the IMAX3 FPGA, 47.7 W for the projected IMAX3 ASIC in the Q8_0 configuration, 52.8 W for the projected IMAX3 ASIC in the Q3_K configuration, 200 W for the Xeon, and 250 W for the GTX 1080 Ti (Ando et al., 4 Nov 2025).

The paper’s qualitative conclusions are that the ARM Cortex-A72 has the lowest PDP overall, that the projected IMAX ASIC beats the Xeon in PDP for both Q3_K and Q8_0, and that for Q3_K, the projected IMAX ASIC also achieves lower PDP than the GPU under the paper’s phase-based accounting methodology (Ando et al., 4 Nov 2025). The text also notes that these PDP comparisons are based on “power consumption during each distinct execution phase,” so the exact plotted values are not reconstructible from a simple constant-power interpretation (Ando et al., 4 Nov 2025).
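The difference between a constant-power reading and phase-based accounting can be sketched as follows. The phase split and the wattages in the example are hypothetical and are not taken from the paper; they only show why the two methods can diverge sharply when a high-power device is active for a small share of the run.

```python
def pdp_constant(time_s: float, power_w: float) -> float:
    """PDP under a single constant power figure (joules)."""
    return time_s * power_w


def pdp_phased(phases) -> float:
    """PDP summed over (duration_s, power_w) execution phases (joules)."""
    return sum(duration * power for duration, power in phases)


# Hypothetical 100 s run: 90 s on a 1.5 W host, 10 s on a 50 W accelerator.
phases = [(90.0, 1.5), (10.0, 50.0)]
print(pdp_phased(phases))         # 635.0
print(pdp_constant(100.0, 50.0))  # 5000.0
```

In this made-up case the phase-based figure is almost eight times lower than the constant-power one, which is why the paper's plotted PDP values cannot be recovered by multiplying the latency table by the listed wattages.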

Kernel-level findings are more favorable than end-to-end results. The discussion reports that single-lane FPGA IMAX at 145 MHz is already faster than the host ARM CPU on kernel execution, that the projected 840 MHz ASIC is competitive with a high-performance Xeon CPU at kernel level, and that a significant gap remains to the GPU (Ando et al., 4 Nov 2025). Exact kernel-time numeric values are not provided in the text.

Scaling behavior is explicitly host-limited. Performance improves efficiently up to 2 lanes, but saturates for 3 or more lanes because the dual-core host CPU becomes the bottleneck for feeding and controlling the lanes (Ando et al., 4 Nov 2025). This is one of the paper’s central systems-level results: accelerator scaling is limited by orchestration and data supply rather than only by PE-array size.
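A toy model makes the saturation pattern easier to see. It assumes each host core can keep exactly one lane busy, so throughput is capped by the smaller of lane count and host core count; this one-core-per-lane assumption is a simplification introduced for illustration and is not a mechanism described in the paper.

```python
def relative_throughput(lanes: int, host_cores: int = 2) -> int:
    """Throughput relative to one lane, assuming one host core is needed
    to feed and control each active lane (illustrative assumption)."""
    return min(lanes, host_cores)


scaling = [relative_throughput(n) for n in range(1, 5)]
print(scaling)  # [1, 2, 2, 2]
```

In this model, adding a third or fourth lane changes nothing until the host gains more cores, which mirrors the reported behavior qualitatively but says nothing about the actual measured throughput.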

The paper’s own explanation for the small end-to-end speedups is summarized in six points: only quantized kernels are offloaded; FP16 and FP32 dot products dominate; the offloaded fraction is < 20%; the host ARM CPU executes the majority of work; host control and data movement overheads are substantial; and additional IMAX lanes are bottlenecked by the dual-core host (Ando et al., 4 Nov 2025). The weaker FPGA result for Q8_0 is attributed specifically to larger data transfer volume (Ando et al., 4 Nov 2025).

7. Interpretation, limitations, and relation to the broader IMAX name

The Stable Diffusion evaluation is explicitly a first implementation and baseline characterization, not a complete deployment of Stable Diffusion on IMAX3 (Ando et al., 4 Nov 2025). The major limitations stated in the paper are that only quantized dot-product kernels were implemented, FP16 and FP32 kernels were not offloaded, end-to-end acceleration therefore remained small, the dual-core host CPU limited multi-lane scaling, and data transfer overhead hurt Q8_0 especially (Ando et al., 4 Nov 2025). The evaluation also used single-step SD-Turbo at 512×512, not a more demanding multi-step diffusion workload or larger-image regime (Ando et al., 4 Nov 2025).

Within those limits, the paper argues that IMAX3 is a credible general-purpose AI acceleration substrate because it successfully reuses and executes GGML quantized kernels across both LLM and image generation workloads (Ando et al., 4 Nov 2025). This suggests that the architecture’s strongest validation lies in cross-domain kernel reuse rather than in immediate superiority on whole-application image-generation latency.

The cited design lessons for future IMAX generations are concrete: increase offload ratio by implementing FP16 and FP32 kernels, co-design the accelerator with a stronger multi-core host, improve data-movement efficiency, continue adding AI-oriented instructions while pruning unnecessary general-purpose features, and use IMAX3 as a baseline from which to derive more AI-specialized CGLA variants (Ando et al., 4 Nov 2025). A plausible implication is that IMAX3 should be understood less as a finished endpoint than as a transitional platform for studying how much specialization can be introduced into a reconfigurable linear array without sacrificing workload breadth.

The relationship to IMaX is only disambiguating. The solar-instrument paper provides a foundational technical reference for the original Imaging Magnetograph eXperiment, including dual-LCVR modulation, dual-beam spectropolarimetry, a LiNbO$_3$ etalon in double pass, and high-resolution observations of the Fe I 5250.2 Å line (Pillet et al., 2010). It does not define or document IMAX3, and any connection beyond name similarity would be inferential rather than established by the cited literature (Pillet et al., 2010).

References (2)

Implementation and Evaluation of Stable Diffusion on a General-Purpose CGLA Accelerator (2025)

The Imaging Magnetograph eXperiment (IMaX) for the Sunrise balloon-borne solar observatory (2010)
