
AI-Optimised Architectures

Updated 14 December 2025
  • AI-Optimised Architectures are integrated systems combining custom hardware, optimized software, and tailored algorithms for enhanced AI performance and energy efficiency.
  • They utilize diverse paradigms including GPUs, ASICs, FPGAs, and emerging analog/PIM devices, each engineered with specific dataflow and memory optimizations.
  • Co-design strategies such as design space exploration (DSE) and neural architecture search (NAS) streamline the exploration of vast design spaces, enabling optimized solutions for cloud, edge, and domain-specific AI applications.

AI-Optimised Architectures are integrated systems—spanning custom hardware, software stacks, and algorithmic methodologies—that are explicitly designed and tuned to maximize the computational efficiency, energy utilization, scalability, and task-adaptive performance of AI workloads. These architectures represent a core shift from general-purpose von Neumann paradigms to domain-specialized, co-designed platforms in which each architectural layer—from accelerator datapath and memory hierarchy to execution runtime and model graph—embodies constraints and affordances tailored to machine learning inference and training.

1. Architectural Paradigms for AI Acceleration

The landscape is dominated by three microarchitectural paradigms—GPUs, ASICs, and FPGAs—augmented by new classes such as analog/PIM and neuromorphic devices:

  • GPUs (Graphics Processing Units) leverage massive thread-level parallelism (10,000+ SIMT cores), deep multi-level queueing, and specialized mixed-precision Tensor Cores for high throughput in DNN operations. Modern GPUs deliver up to ∼60 TFLOPS FP16, with energy efficiencies of 100–200 GFLOPS/W when running AI kernels at high occupancy (Amin et al., 13 Nov 2025).
  • ASICs (Application-Specific Integrated Circuits), exemplified by TPUs, instantiate rigid compute arrays (e.g., 2D meshes of MACs) with tightly coupled on-chip buffers and weight- or output-stationary dataflows, achieving up to 4 TOPS/W and outperforming all other architectures per-unit power at the expense of generality and long silicon design cycles.
  • FPGAs (Field-Programmable Gate Arrays) offer bit-level reconfigurability: logic LUTs, block RAM, and DSP slices are synthesised per deployment into efficient pipelined dataflow engines tailored to specific DNN layers. Energy efficiency reaches 8–10 TOPS/W, and although aggregate throughput is lower than that of ASICs or GPUs for large models, streaming architectures excel for quantized or sparse models (Amin et al., 13 Nov 2025).
  • Hybrid/Emerging: Processing-in-Memory (PIM) and neuromorphic hybrids extend the stack, embedding arithmetic or SNN primitives near or within memory arrays to combat the “memory wall,” with analog/photonic PIM promising >10×–50× data-movement energy reduction over digital solutions (Yin et al., 10 Mar 2025, Klein et al., 2022). Neuromorphic chips (e.g., Loihi) operate at 1–10 pJ/synaptic event, ideal for always-on edge tasks. A back-of-envelope energy comparison across these platforms follows this list.
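
To give a rough sense of how the headline efficiency figures above translate into per-inference energy, the following back-of-envelope sketch compares the three mainstream platforms for a hypothetical 1 GFLOP-per-inference model; the workload size and the mid-range efficiency values are illustrative assumptions, not measurements from the cited papers.

```python
# Back-of-envelope energy-per-inference comparison using the headline
# efficiency figures quoted above (mid-range values assumed).
WORKLOAD_GFLOP = 1.0  # hypothetical compute cost per inference (assumption)

platform_gflops_per_watt = {
    "GPU  (100-200 GFLOPS/W)": 150,    # mid-range of the quoted 100-200 GFLOPS/W
    "ASIC/TPU   (4 TOPS/W)":   4_000,  # 4 TOPS/W = 4000 GOPS/W, treating OPS ~ FLOPS
    "FPGA    (8-10 TOPS/W)":   9_000,  # mid-range of the quoted 8-10 TOPS/W
}

for name, eff in platform_gflops_per_watt.items():
    # GFLOPS/W is equivalent to GFLOP/J, so energy [J] = work [GFLOP] / efficiency.
    energy_mj = WORKLOAD_GFLOP / eff * 1e3
    print(f"{name:26s} ~{energy_mj:5.2f} mJ per inference")
```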

2. Principles of Dataflow, Memory Hierarchy, and Resource Utilization

Maximizing hardware utilization and minimizing system-level energy are achieved through co-designed dataflows, memory, and mapping:

  • Dataflow Optimisations: Three canonical paradigms exist: weight-stationary (WS), output-stationary (OS), and row-stationary (RS). Each is characterized by the reuse factor $R = (\text{\# usages of a data item}) / (\text{\# DRAM fetches})$. For example, in a $3 \times 3$ convolution on a $P \times P$ systolic array, $R_w = P^2/9$ (Amin et al., 13 Nov 2025, Ahn, 2020).
  • Memory Hierarchy and Bandwidth: Modern accelerators deploy multi-tiered hierarchies—off-chip HBM ($B_\text{DRAM}$), large SRAM global buffers ($B_\text{SRAM}$), and PE-local registers ($B_\text{REG}$)—to satisfy bandwidth and locality needs. Buffer sizing $S_\text{buf}$ must at least match the working set of each dataflow tile (Amin et al., 13 Nov 2025).
  • On-Chip Pipelines and “Data-Flow Matching”: Near-optimal designs for CNNs fully pipeline the entire “neuron” computation (multiply→sum→activate→writeback) so the hardware graph exactly matches the DNN layer-graph, ensuring >97% multiplier utilization and minimal pipeline bubbles (Ahn, 2020).
  • Resource Breakdown and Efficiency Metrics: Efficiency is quantified by the utilization $U = \text{actual multiplies} / \text{peak multiplies}$ and the compute-area fraction $R_c = \text{multiplier area} / \text{total area}$; their product, $Eff_\text{arch} = U \times R_c$, reaches $\sim 0.54$ in empirical near-optimal systems and establishes a resource-theoretical ceiling (Ahn, 2020). A numeric sketch of these metrics follows this list.
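
The following sketch makes the reuse-factor, buffer-sizing, and efficiency bookkeeping above concrete. The array size, tile dimensions, data types, and area fraction are illustrative assumptions rather than figures from the cited works.

```python
# Illustrative dataflow/efficiency bookkeeping for the metrics above.
P = 16                      # systolic array is P x P (assumption)
K = 3                       # 3x3 convolution kernel

# Weight reuse factor for a KxK conv mapped onto a PxP array: R_w = P^2 / K^2.
R_w = P**2 / K**2
print(f"weight reuse factor R_w = {R_w:.1f}")

# Buffer sizing: the global buffer must hold at least one dataflow tile's
# working set (weights + input activations + partial sums), here in bytes.
tile_weights = K * K * 64 * 64 * 1        # 64-in/64-out channel tile, int8 (assumption)
tile_inputs  = (14 + K - 1)**2 * 64 * 1   # halo-extended inputs for a 14x14 output tile
tile_psums   = 14 * 14 * 64 * 4           # fp32 partial sums
S_buf_min = tile_weights + tile_inputs + tile_psums
print(f"minimum buffer size S_buf >= {S_buf_min / 1024:.1f} KiB")

# Architectural efficiency: utilization x multiplier-area fraction.
U   = 0.97    # multiplier utilization (>97% for fully pipelined designs, per the text)
R_c = 0.56    # multiplier area / total area (illustrative assumption)
Eff_arch = U * R_c
print(f"Eff_arch = U x R_c = {Eff_arch:.2f}")   # ~0.54, matching the figure quoted above
```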

3. Hardware–Software Co-Design and Design Space Exploration

Automated methodologies—DSE, NAS, and learning-based predictors—are essential due to the explosive, non-convex design space:

  • Learning-based DSE: AIRCHITECT and AIRCHITECT v2 frame per-layer hardware selection as a classification or transformer-based ordinal regression problem, encoding both workloads (e.g., GEMM dimensions, dataflows) and design points (PE, buffer size) via learned embeddings, contrastive losses, and unified ordinal vectors. This enables constant-time DSE, outperforming greedy/heuristic search by 15% in optimality and yielding 1.7× latency reduction on unseen LLMs (Seo et al., 17 Jan 2025, Samajdar et al., 2021).
  • Differentiable Architecture–Implementation Co-Search: EDD fuses DNN structural and hardware implementation variables into a single differentiable objective, $\mathcal{L}(A, I) = Acc_\text{loss}(A, I) \times Perf_\text{loss}(I) + \beta\, C^{RES(I) - RES_{ub}}$, optimized via Gumbel-Softmax and SGD. Co-searched models realize 1.40–1.45× speedup over the state of the art on both GPU and FPGA with no accuracy drop (Li et al., 2020); a schematic sketch of this objective follows this list.
  • Synthesis-in-the-Loop Optimization: For extreme power/area/delay-constrained edge ASICs, multi-objective Bayesian optimization tightly integrates quantization-aware NN training and a full-backend RTL–GDSII flow (OpenLANE/PDK), exploring a search space exceeding $2.2 \times 10^9$ options and delivering Pareto-optimal, in-pixel deployable solutions (Kharel et al., 18 Jul 2024).
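
To make the shape of the fused objective concrete, the sketch below plugs a Gumbel-Softmax-relaxed candidate selection into a loss of the form $Acc_\text{loss} \times Perf_\text{loss} + \beta\, C^{RES - RES_{ub}}$. The candidate proxies, constants, and the per-term expectation are invented simplifications, not the EDD formulation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable relaxation of a categorical architecture/implementation choice."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

# Three hypothetical implementation candidates with proxy scores (all values invented).
acc_loss  = np.array([0.10, 0.12, 0.20])  # accuracy-proxy loss per candidate
perf_loss = np.array([1.50, 1.00, 0.60])  # latency-proxy loss per candidate
resource  = np.array([0.80, 1.10, 1.60])  # normalized resource use per candidate
BETA, C, RES_UB = 0.5, 10.0, 1.0          # penalty weight, base, and resource upper bound

logits = np.zeros(3)                      # learnable selection logits (fixed here)
p = gumbel_softmax(logits, tau=0.5)       # soft one-hot selection over candidates

# Expected fused objective under the relaxed selection, mirroring
# L = Acc_loss * Perf_loss + beta * C^(RES - RES_ub) (expectations taken per term).
exp_acc, exp_perf, exp_res = p @ acc_loss, p @ perf_loss, p @ resource
L = exp_acc * exp_perf + BETA * C ** (exp_res - RES_UB)
print(f"selection probs = {np.round(p, 3)}, fused objective L = {L:.4f}")
```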

4. Heterogeneous and Hybrid AI-Optimised Architectures

Emerging demand for ultra-efficient AI is driving the development of tightly coupled heterogeneous platforms:

  • Electronic–Photonic Hybrid PIM: H³PIMAP orchestrates mapping of DNN workloads across SRAM/ReRAM (weight-static) and photonic tensor-core (dynamic, ultra-fast) tiers, optimizing the (energy, latency, accuracy) Pareto front via evolutionary NSGA-II followed by sensitivity-guided row remapping. On LLM/vision tasks, this delivers 2.74× energy and 3.47× latency reduction relative to homogeneous mappings, while the 40% of layer rows placed on photonic tiers account for 60% of the speed gain (Yin et al., 10 Mar 2025). A minimal Pareto-filtering sketch in the spirit of this mapping step follows this list.
  • Analog In-Memory Computing: Systems like ALPINE tightly couple analog MVM tiles to CPUs via new ISA extensions, enabling on-the-fly weight programming and constant-latency MVMs (<100 ns), supporting MLPs/CNNs/LSTMs. Empirical simulation yields up to 20.5× performance and 20.8× energy gains compared to SIMD ARM (Klein et al., 2022).
  • Edge–HPC Co-Optimization: Hardware-aware NAS systems that profile latency in situ on embedded edge devices (Jetson AGX Orin) while candidates are trained on HPC systems achieve 8.8× speedup and 1.35× accuracy gain over human-crafted baselines by closing the optimization loop on real deployment conditions (Aach et al., 26 May 2025).
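
As a minimal illustration of the Pareto-front selection that underlies such heterogeneous mapping, the toy filter below keeps only non-dominated (energy, latency) mapping candidates; the candidate names and scores are invented, and none of the NSGA-II or sensitivity-remapping machinery is modelled.

```python
# Toy Pareto filter over candidate layer-to-tier mappings, each scored by
# (energy, latency) relative to a notional baseline; both objectives are
# minimized and all numbers are invented for illustration.
candidates = {
    "all-SRAM":            (0.90, 1.40),
    "all-ReRAM":           (1.00, 1.30),
    "all-photonic":        (1.40, 0.50),
    "hybrid-40%-photonic": (0.60, 0.60),
}

def pareto_front(points):
    """Keep mappings not dominated by any other point (lower is better in both)."""
    front = {}
    for name, (e, l) in points.items():
        dominated = any(
            e2 <= e and l2 <= l and (e2, l2) != (e, l)
            for e2, l2 in points.values()
        )
        if not dominated:
            front[name] = (e, l)
    return front

print(pareto_front(candidates))   # -> the hybrid and all-photonic mappings survive
```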

5. Cloud-Native and Scalable AI Application Architectures

AI-optimized architectures transcend chips, encompassing scalable cloud-native patterns to support modern AI-driven workloads:

  • Cloud Database Integration: Performance-critical AI apps use vector stores (pgvector, FAISS), graph DBs (Neptune), and hybrid RAG (Retrieval-Augmented Generation) pipelines to fuse SQL/NoSQL with low-latency k-NN semantic search, multi-modal fusion, and adaptive caching. Real-world deployments hit sub-50 ms end-to-end SLAs, with performance models balancing network transfer, index search (e.g., $O(\log N)$ for HNSW), and NLP augmentation (Bhupathi, 26 Apr 2025). A minimal retrieval sketch follows this list.
  • Best Practices and Security: Architectures require auto-scaling, resource sharding, serverless backends, and compliance controls (in-flight/at-rest encryption, row/column-level security, provenance tracking, HIPAA/GDPR adherence).
  • Case Studies: Architectures are validated in domain-specific deployments—EHR-based diagnosis RAG, graph-RAG for fraud detection, hybrid SQL/vector search for e-commerce—demonstrating consistent gains in speed, cost-effectiveness, and accuracy (Bhupathi, 26 Apr 2025).
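
A minimal sketch of the low-latency k-NN retrieval step such pipelines rely on, using FAISS's HNSW index; the corpus, dimensionality, and parameters are placeholders, and the surrounding SQL/NoSQL fusion, caching, and NLP augmentation stages are omitted.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                              # embedding dimensionality (placeholder)
rng = np.random.default_rng(0)
corpus = rng.random((10_000, d), dtype=np.float32)   # stand-in document embeddings

index = faiss.IndexHNSWFlat(d, 32)                   # HNSW graph with 32 links per node
index.hnsw.efSearch = 64                             # search breadth: recall vs. latency knob
index.add(corpus)                                    # build the approximate-NN index

query = corpus[:3] + 0.01                            # pretend query embeddings
distances, ids = index.search(query, 5)              # k=5 nearest neighbours, ~O(log N) hops
print(ids)                                           # row i holds the top-5 doc ids for query i
```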

6. Reliability, Modularity, and Agentic AI System Composition

Agentic AI systems and intelligent agents introduce a new axis for reliability and adaptability. Architectural reliability is a property of disciplined componentization and interface design:

  • Componentized Blueprints: Reliable agents are decomposed into goal manager, planner, tool router, executor, memory (multi-tiered), verifiers, safety monitor, and telemetry, each with explicit type-checked interfaces, schema contracts, and transactional/permissioning semantics (Nowaczyk, 10 Dec 2025). A minimal sketch of this componentization discipline follows this list.
  • Patterns and Failure Modes: Categorization includes tool-using agents (with strong schema/idempotency capabilities), memory-augmented agents (with provenance and hygiene enforcement), planning/simulation agents (budget-aware), multi-agent systems (typed schema messaging, arbitration, consensus), and embodied/web agents (simulate-before-actuate) (Nowaczyk, 10 Dec 2025, Bansod, 2 Jun 2025).
  • Governance and Safeguards: Run-time governance enforces step/cost/time budgets, with simulate-before-actuate and verifiers at every action boundary. Schema-driven least-privilege permissioning and provenance/version control mitigate hallucination, drift, and error propagation.
  • Agentic AI in Next-Gen Networks: Agentic architecture is integral to next-gen 6G and cloud networking, where distributed AI agents control slice management, real-time QoS, and operate in energy/security-constrained environments using decentralized learning/pruning/adaptation (Dev et al., 3 Feb 2025).
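
A minimal sketch of the componentization discipline described above: a tool router that validates arguments against a declared schema contract and charges a step budget before any tool executes. The component names and schemas are illustrative, not a reference implementation from the cited works.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    name: str
    schema: dict            # argument name -> expected Python type (the schema contract)
    fn: Callable

@dataclass
class Budget:
    max_steps: int
    steps_used: int = 0
    def charge(self) -> None:
        if self.steps_used >= self.max_steps:
            raise RuntimeError("step budget exhausted")   # run-time governance boundary
        self.steps_used += 1

class ToolRouter:
    """Routes schema-validated calls to registered tools under a run-time budget."""
    def __init__(self, tools: list, budget: Budget):
        self.tools = {t.name: t for t in tools}
        self.budget = budget

    def call(self, name: str, **kwargs):
        tool = self.tools[name]
        for arg, typ in tool.schema.items():              # enforce the schema contract
            if not isinstance(kwargs.get(arg), typ):
                raise TypeError(f"{name}: argument '{arg}' must be {typ.__name__}")
        self.budget.charge()                              # enforce step/cost budget
        return tool.fn(**kwargs)

# Usage: one hypothetical 'search' tool under a two-step budget.
router = ToolRouter(
    tools=[ToolSpec("search", {"query": str}, lambda query: f"results for {query!r}")],
    budget=Budget(max_steps=2),
)
print(router.call("search", query="slice management"))
```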

7. Application-Driven, Domain-Specific, and Geometric AI Architectures

AI-optimized architectures extend beyond vanilla DNNs to domain-specialized and geometric settings:

  • Graph Neural Network Optimization: In cheminformatics assays, Bayesian-optimized GINs excel in data-abundant cases by leveraging strong topological expressiveness, while GATs are optimal for small datasets owing to their built-in attention inductive bias. Comprehensive Bayesian optimization across GCNs/GATs/GINs yields best-practice hyperparameter configurations for each regime, with fully reproducible code and metrics (Kalian et al., 22 Jul 2025).
  • Neuromorphic and Spiking Co-Design: Automated, massively parallel model-based AutoML loops search mixed integer/real/categorical neuromorphic SNN spaces, modeling surrogates incrementally and optimizing for on-chip learning accuracy under resource/latency constraints; such approaches scale to both shallow and “mushroom-body” deep SNNs (Yanguas-Gil et al., 2023).
  • Pruned/Quantized DNNs for Edge: Block-diagonal, permutation-aware pruning and quantization during training, tightly linked to hardware generator flows (RISC-V/Chisel SoC), achieve up to 36 TOPS/W in silicon, with <1% accuracy loss and static schedule generation that minimizes data motion and idle cycles (Naous et al., 2019). A minimal mask-construction sketch follows this list.
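
As a small illustration of the block-diagonal pruning pattern mentioned above, the sketch below builds a block-diagonal mask for a weight matrix and reports the kept fraction; the block count and matrix shape are arbitrary, and the permutation-aware training loop and quantization steps are omitted.

```python
import numpy as np

def block_diagonal_mask(rows, cols, n_blocks):
    """Binary mask that keeps only n_blocks blocks along the diagonal."""
    mask = np.zeros((rows, cols), dtype=np.float32)
    r_edges = np.linspace(0, rows, n_blocks + 1, dtype=int)
    c_edges = np.linspace(0, cols, n_blocks + 1, dtype=int)
    for b in range(n_blocks):
        mask[r_edges[b]:r_edges[b + 1], c_edges[b]:c_edges[b + 1]] = 1.0
    return mask

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)   # toy weight matrix
mask = block_diagonal_mask(*W.shape, n_blocks=8)
W_pruned = W * mask                        # applied at each training step in practice
print(f"kept fraction: {mask.mean():.3f}") # 1/8 of the weights remain for 8 blocks
```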

In summary, AI-optimised architectures constitute a rapidly diversifying field in which algorithmic, hardware, and system co-design is not a weak post-hoc tuning step but a first-class concern. The architecture—understood in its full breadth, from silicon and photonic circuits through cloud-native workflows and agentic schemas—is the fundamental enabler and governor of the reliability, scalability, and efficiency envelope for modern artificial intelligence (Amin et al., 13 Nov 2025, Nowaczyk, 10 Dec 2025, Seo et al., 17 Jan 2025).

