
Custom ML Accelerator Architecture

Updated 3 September 2025
  • Custom ML accelerator architecture is a specialized hardware design optimized for deep neural network computations using tailored scheduling, memory, and compute strategies.
  • It employs multidimensional optimization and design space exploration, leveraging analytical models and genetic algorithms to balance latency, area, and power constraints.
  • The architecture enables hardware/software co-design, facilitating efficient deployment across diverse ML workloads and adaptive performance tuning.

A custom ML accelerator architecture refers to specialized hardware systems and co-design methodologies engineered to efficiently execute computational workloads characteristic of modern ML models, particularly deep neural networks (DNNs). In contrast to general-purpose processors, these accelerators are tailored through architectural, scheduling, and memory-system optimizations to maximize throughput, minimize latency, and satisfy area and power constraints for target DNN applications. Recent research iteratively advances such architectures by integrating analytic modeling, design space exploration, hardware/compiler co-optimization, and ML-driven automation.

1. Application-Driven Design and Architectural Modeling

Custom ML accelerator architectures are increasingly derived from target DNN applications rather than relying on generic design templates. The architecture design process typically begins with parsing the DNN's computational graph—often output as a directed acyclic graph (DAG) from frameworks such as TensorFlow—into operation streams (e.g., 2D convolutions, matrix multiplications) and dynamic memory footprint estimates (Yu et al., 2019).
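To make the front-end concrete, the following Python sketch shows one way an operation stream might be represented after graph parsing. The `Op` fields and the `parse_operation_stream` helper are illustrative assumptions for exposition, not the paper's actual data structures.

```python
# Illustrative sketch: a flattened "operation stream" extracted from a DNN graph.
# A real front-end would walk a TensorFlow/ONNX DAG and also estimate dynamic
# memory footprints; the fields below are assumptions for exposition.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Op:
    kind: str                 # e.g. "conv2d", "depthwise_conv2d", "matmul"
    dims: Dict[str, int]      # DNN-defined dimensions: N_if, N_of, N_kx, N_ky, N_ox, N_oy, ...
    bytes_in: int = 0         # estimated input activation/weight traffic
    bytes_out: int = 0        # estimated output activation traffic

def parse_operation_stream(dag_nodes: List[dict]) -> List[Op]:
    """Flatten a topologically sorted DAG into a linear operation stream."""
    return [Op(kind=n["op"], dims=n["dims"],
               bytes_in=n.get("bytes_in", 0), bytes_out=n.get("bytes_out", 0))
            for n in dag_nodes]
```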

Operation-level analytical models play a crucial role, quantifying hardware execution costs (compute and memory latency) for given architectural parameters. For convolutional and similar layers, compute latency is modeled as the product of tiling and unrolling cycles; for instance, the inter-tiling cycle formula:

$$inter\_tiling\_cycle = \left\lceil \frac{N_{if}}{T_{if}} \right\rceil \left\lceil \frac{N_{kx}}{T_{kx}} \right\rceil \left\lceil \frac{N_{ky}}{T_{ky}} \right\rceil \left\lceil \frac{N_{ox}}{T_{ox}} \right\rceil \left\lceil \frac{N_{oy}}{T_{oy}} \right\rceil \left\lceil \frac{N_{of}}{T_{of}} \right\rceil$$

where $N_{*}$ denote DNN-defined dimensions and $T_{*}$ are tiling factors. Inner-loop unrolling, parameterized as:

$$inner\_tiling\_latency = \left\lceil \frac{T_{if}}{P_{if}} \right\rceil \cdots$$

models fine-grained pipelining and parallelism. Total layer latency is taken as the max of compute and memory transfer latencies, with memory bandwidth, buffer constraints, and data reuse (e.g., $weight\_reuse$, $input\_reuse$) explicitly represented.
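The latency model can be expressed directly in code. The following Python sketch follows the ceiling-of-ratios formulation above; the dictionary-based parameter names (`N`, `T`, `P`, `bytes_moved`, `bandwidth_bytes_per_cycle`) are illustrative assumptions rather than the paper's interface.

```python
# Sketch of the operation-level latency model, assuming the ceiling-of-ratios
# tiling/unrolling formulation above. Parameter names are illustrative.
from math import ceil

def inter_tiling_cycles(N: dict, T: dict) -> int:
    """Product of per-dimension tile counts, ceil(N_d / T_d), over all loop dims."""
    cycles = 1
    for d in ("if", "kx", "ky", "ox", "oy", "of"):
        cycles *= ceil(N[d] / T[d])
    return cycles

def inner_tiling_latency(T: dict, P: dict) -> int:
    """Cycles to process one tile given unrolling (parallelism) factors P_d."""
    cycles = 1
    for d, t in T.items():
        cycles *= ceil(t / P.get(d, 1))
    return cycles

def layer_latency(N, T, P, bytes_moved, bandwidth_bytes_per_cycle):
    """Total layer latency = max(compute latency, memory-transfer latency)."""
    compute = inter_tiling_cycles(N, T) * inner_tiling_latency(T, P)
    memory = ceil(bytes_moved / bandwidth_bytes_per_cycle)
    return max(compute, memory)
```

In this reading, the compute term is the number of tiles times the per-tile drain time, while the memory term coarsely models transfer cost; data-reuse factors such as $weight\_reuse$ would reduce `bytes_moved` before this calculation.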

2. Multidimensional Optimization and Design Space Exploration

The selection of optimal hardware configurations for a given DNN is formulated as a constrained multidimensional optimization problem. Variables include loop tiling/unrolling factors, processing-element (PE) group counts, MAC units/group, and buffer sizes, subject to resource and data dependency constraints:

| Constraint Type | Mathematical Formulation | Description |
| --- | --- | --- |
| Parallelism | $PE\_group \times MAC/group \geq P_{ox} \times P_{oy} \times \cdots$ | Minimum MACs required for the given parallelism |
| Buffer Capacity | $weight\_buffer \geq T_{kx} \cdots T_{of} \times bit\_width$ | Minimum buffer space for weights and activations |

Optimization is performed using population-based metaheuristics, especially Genetic Algorithms (GA), which iteratively select, crossover, and mutate candidate designs based on fitness (i.e., latency/area). Only configurations within the fixed area or power budget and satisfying all functional constraints are eligible for selection (Yu et al., 2019). This approach supports scalability across DNN types (CNNs, RNNs, etc.).
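A minimal GA-based exploration loop is sketched below. The gene set, bounds, constraint constants, and the toy fitness function (a stand-in for the analytical latency/area model) are illustrative assumptions only.

```python
# Hedged sketch of GA-based design space exploration with constraint filtering.
# Gene set, bounds, constants, and the toy fitness are illustrative assumptions.
import random

BOUNDS = {"T_if": (1, 16), "T_of": (1, 16), "T_kx": (1, 7), "T_ky": (1, 7),
          "pe_groups": (1, 32), "macs_per_group": (8, 64), "weight_buffer_kb": (8, 256)}

def feasible(c, area_budget=512):
    parallel_ok = c["pe_groups"] * c["macs_per_group"] >= 64                  # parallelism constraint
    buffer_ok = c["weight_buffer_kb"] * 1024 >= (                             # weight tile fits (8-bit weights)
        c["T_kx"] * c["T_ky"] * c["T_if"] * c["T_of"])
    area_ok = 0.1 * c["pe_groups"] * c["macs_per_group"] + c["weight_buffer_kb"] <= area_budget  # toy area proxy
    return parallel_ok and buffer_ok and area_ok

def fitness(c):
    """Lower is better; in practice this would call the layer-latency model above."""
    return 1e9 / (c["pe_groups"] * c["macs_per_group"] * c["T_if"] * c["T_of"])

def random_candidate():
    return {g: random.randint(lo, hi) for g, (lo, hi) in BOUNDS.items()}

def crossover(a, b):
    return {g: random.choice((a[g], b[g])) for g in BOUNDS}

def mutate(c, rate=0.1):
    return {g: (random.randint(*BOUNDS[g]) if random.random() < rate else v) for g, v in c.items()}

def run_ga(pop_size=64, generations=50):
    pop = [random_candidate() for _ in range(pop_size)]
    best = None
    for _ in range(generations):
        survivors = [c for c in pop if feasible(c)]
        while len(survivors) < 2:                      # refill if too few designs meet constraints
            c = random_candidate()
            if feasible(c):
                survivors.append(c)
        survivors.sort(key=fitness)
        if best is None or fitness(survivors[0]) < fitness(best):
            best = survivors[0]
        parents = survivors[: max(2, len(survivors) // 4)]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return best
```

In a real flow, the fitness function would wrap the analytical latency model of Section 1 and the area and power budgets would come from the deployment target.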

3. Joint Hardware/Model Co-Design and Multi-Context Optimization

Modern frameworks increasingly support simultaneous optimization of both the accelerator architecture and DNN model (Yu et al., 2019). By interleaving layers from different DNNs (e.g., combining memory-intensive and compute-intensive models), these approaches generate architectures that balance resource allocation, often reducing the global compute and memory footprint compared to isolated, application-specific designs.

Co-design extends to sensitivity analysis: modifications in DNN parameters (feature map sizes, depth, layer types) can be explored to assess their impact on optimal hardware parameters. Insights may inform iterative adjustments in either the DNN or hardware, with case studies (e.g., incremental addition of depthwise-separable convolutions or fully connected layers) demonstrating that significant alterations—such as addition of large matrix multiplications—often force corresponding shifts in PE group allocation and tiling sizes.
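As an illustration of such a sensitivity analysis, the sketch below perturbs one workload dimension at a time and reruns an exploration routine; `explore` is a hypothetical workload-aware DSE entry point (e.g., a wrapper that rebuilds the cost model for the perturbed dimensions and then calls the GA sketch above).

```python
# Illustrative co-design sensitivity sweep: perturb one DNN dimension at a time,
# rerun design space exploration, and record how the best configuration shifts.
from typing import Callable, Dict

def sensitivity_sweep(base_dims: Dict[str, int],
                      perturbations: Dict[str, int],
                      explore: Callable[[Dict[str, int]], dict]) -> Dict[str, dict]:
    results = {}
    for dim, new_value in perturbations.items():
        perturbed = dict(base_dims, **{dim: new_value})   # e.g. wider feature maps, larger FC layer
        results[dim] = explore(perturbed)                 # best hardware config for the variant
    return results
```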

4. Latency, Area, and Performance Metrics

A rigorous architectural cost model underpins the optimization process, yielding quantitative estimates of throughput (e.g., giga-operations per second, GOPS), latency, and area. For representative DNNs (e.g., Inception-v3, ResNet-50, VGG16, DeepLabv3), application-driven custom architectures demonstrate geometric-mean performance improvements of 12.0%–117.9% relative to baseline per-application-tuned designs (Yu et al., 2019).

Area and buffer sizes are computed directly from configuration parameters; e.g., buffer size as:

$$activation\_buffer \geq (T_{ix} \times T_{iy} \times T_{if} + T_{ox} \times T_{oy} \times T_{of}) \times bit\_width$$

This enables rigorous scaling and area utilization studies, critical for ASIC and FPGA deployment considerations.
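A sketch of how buffer sizes and a rough area proxy might be derived from a configuration, following the activation-buffer inequality above; the per-MAC and per-bit area coefficients are placeholder assumptions, not process data.

```python
# Sketch: deriving buffer sizes and a rough area proxy from a configuration,
# following the activation-buffer inequality above. Area coefficients are
# placeholder assumptions.
def activation_buffer_bits(T: dict, bit_width: int = 8) -> int:
    """Minimum activation buffer: input tile plus output tile, in bits."""
    in_tile = T["ix"] * T["iy"] * T["if"]
    out_tile = T["ox"] * T["oy"] * T["of"]
    return (in_tile + out_tile) * bit_width

def area_estimate(cfg: dict, mac_area_um2: float = 500.0, sram_um2_per_bit: float = 0.3) -> float:
    """Rough ASIC area proxy: MAC array plus on-chip buffers."""
    mac_area = cfg["pe_groups"] * cfg["macs_per_group"] * mac_area_um2
    buffer_bits = cfg["weight_buffer_bits"] + activation_buffer_bits(cfg["tiling"], cfg["bit_width"])
    return mac_area + buffer_bits * sram_um2_per_bit
```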

5. Support for Diverse ML Workloads

The application-driven methodology supports a wide variety of networks, including convolutional, recurrent, and recommendation models. The cost model’s operation-level abstraction allows the optimization system to handle complex graphs encompassing 2D/3D convolution, depthwise-separable convolution, and large matrix-multiplication layers, as well as networks with irregular resource profiles. The design is also extensible to future DNN paradigms as long as operation-level hardware models are supplied.
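One way to realize this operation-level extensibility is a registry that maps each layer type to its own latency estimator, so supporting a new DNN paradigm only requires registering one function. The sketch below is illustrative: the cost expressions are rough proxies rather than the paper's models, and `op.kind`/`op.dims` follow the `Op` sketch from Section 1.

```python
# Illustrative operation-level cost-model registry. Placeholder cost expressions
# stand in for the detailed tiling/unrolling models; op.kind / op.dims follow
# the Op sketch in Section 1.
COST_MODELS = {}

def register(op_kind):
    def wrap(fn):
        COST_MODELS[op_kind] = fn
        return fn
    return wrap

@register("conv2d")
def conv2d_latency(dims, config):
    # Placeholder proxy: total MACs divided by available MAC units.
    macs = (dims["N_if"] * dims["N_of"] * dims["N_kx"] * dims["N_ky"]
            * dims["N_ox"] * dims["N_oy"])
    return macs // (config["pe_groups"] * config["macs_per_group"])

@register("matmul")
def matmul_latency(dims, config):
    # Placeholder proxy for an M x K x N matrix multiplication.
    return (dims["M"] * dims["K"] * dims["N"]) // (config["pe_groups"] * config["macs_per_group"])

def workload_latency(op_stream, config):
    """Sum per-operation latency estimates over a parsed operation stream."""
    return sum(COST_MODELS[op.kind](op.dims, config) for op in op_stream)
```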

6. Implications for Deployment and Future Co-Design Strategies

Rigorous application-aware custom ML accelerator architectures facilitate efficient mapping of large, heterogeneous DNNs without over-provisioning for non-essential cases. Practical deployment is enabled by the automation of both design space exploration and configuration selection, reducing hardware-software integration time and enabling hardware–software co-evolution.

The methodology provides a pathway for both forward design—adapting hardware to evolving DNNs—and backward adaptation, where DNNs are adjusted to maximize the efficiency of a given hardware accelerator. This co-design philosophy, with demonstrated substantial performance improvement over rigidly optimized per-application designs, points to a future in which adaptive, application-driven custom ML accelerator architectures are integral to state-of-the-art ML deployment (Yu et al., 2019).

