
Deep Learning Acceleration Stack (DLAS)

Updated 1 February 2026
  • Deep Learning Acceleration Stack (DLAS) is a comprehensive framework offering full-stack, tunable workflows for hardware-accelerated DNN inference and training.
  • It leverages ML-guided design-space exploration and predictive models like GCN-based regressors to accurately optimize power, performance, and area metrics.
  • DLAS supports diverse hardware platforms—from ASICs and GPUs to multi-node distributed systems—delivering significant speedup and energy-efficiency improvements.

The Deep Learning Acceleration Stack (DLAS) is an integrative framework implementing full-stack, performance-portable, and highly tunable workflows for hardware-accelerated deep neural network (DNN) and ML inference and training. DLAS structures development around unified backend–frontend modeling, ML-guided design-space exploration (DSE), physical-design-driven prediction, compact code generation, and cross-stack co-design, supporting both ASIC-class accelerators and multi-node distributed platforms (Esmaeilzadeh et al., 2023, Gibson et al., 2023). The stack spans the spectrum from push-button accelerator generation and system-level scheduling to advanced architectural template parameterization, providing actionable guidance for hardware–software co-design and optimization.

1. DLAS Conceptual Structure and Layering

DLAS is framed as a six-layer, cross-stack pipeline:

  1. Datasets & Problem Spaces: Image classification (CIFAR-10, ImageNet), language tasks (GLUE), robotics, biomedical, and domain-specific problems.
  2. Models & Neural Architectures: Popular DNN architecture types including CNNs (VGG, ResNet, MobileNet), DenseNets, Transformers, GANs, and diffusion models.
  3. Model Optimizations: Pruning (unstructured/structured), quantization (float16, int8), knowledge distillation, and low-rank/blocked sparsity approaches.
  4. Algorithms & Data Formats: Direct, GEMM-based, Winograd, spatial-pack convolution; dense vs. sparse formats; common tensor layouts (NCHW, NHWC).
  5. Systems Software: Tensor compilers (TVM, IREE), vendor libraries (cuDNN, oneDNN), auto-tuning frameworks (Ansor, AutoTVM), and runtime integration.
  6. Hardware Platforms: CPUs (x86, ARM), GPUs (Nvidia, ARM Mali), FPGAs, ASICs (TPU, MAERI, NVDLA), and associated memory, interconnect, and SoC components (Gibson et al., 2023, Genc et al., 2019).

This layer-oriented abstraction enables systematic analysis of interactions and dependencies, facilitating vertical perturbation experiments and tightly coupled co-optimization.
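To make the layering concrete, the six layers can be viewed as coordinates of a single cross-stack configuration point. A minimal sketch in Python follows; all field names and option strings are illustrative, not part of any DLAS API:

```python
from dataclasses import dataclass

# Illustrative only: one point in the six-layer DLAS design space.
# Field names and option strings are hypothetical, not DLAS code.
@dataclass
class StackConfig:
    dataset: str        # layer 1, e.g. "ImageNet"
    model: str          # layer 2, e.g. "ResNet-50"
    optimization: str   # layer 3, e.g. "int8-quantization"
    algorithm: str      # layer 4, e.g. "winograd"
    layout: str         # layer 4, e.g. "NHWC"
    compiler: str       # layer 5, e.g. "TVM"
    hardware: str       # layer 6, e.g. "x86-cpu"

# A vertical perturbation experiment changes one layer while holding
# the others fixed, then measures end-to-end latency and energy.
baseline = StackConfig("ImageNet", "ResNet-50", "none",
                       "gemm", "NCHW", "TVM", "x86-cpu")
perturbed = StackConfig("ImageNet", "ResNet-50", "int8-quantization",
                        "gemm", "NCHW", "TVM", "x86-cpu")
```

Holding five layers fixed while perturbing one isolates that layer's contribution to end-to-end metrics.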

2. Stack Architectures and Workflow Integration

DLAS architecturally joins backend physical-design flow with frontend performance simulation:

  • Backend PPA Predictor: Models synthesis and place-and-route outcomes, predicting power ($P$), frequency ($f$), and area ($A$) from an architectural knob vector $\mathbf x$, target clock period ($\tau$), floorplan utilization ($u$), and a compact logical hierarchy graph (LHG) extracted from the RTL netlist.
  • Frontend Performance Simulator: Combines ML workload descriptors (DNN type, algorithm) with the backend PPA outputs to estimate runtime ($T$) and energy ($E$) under realistic hardware timings (Esmaeilzadeh et al., 2023).

The stack is instantiated for accelerator generators such as VTA and VeriGOOD-ML; design parameters are automatically swept across architecture and backend space.

Example Model Flow

$(\hat P, \hat f, \hat A) = \text{BackendPredictor}(\mathbf x, \tau, u, \text{LHG})$

$(\hat T, \hat E) = \text{PerfSimulator}(\mathbf x, \hat f, \hat P)$

Unified ML-based prediction brings end-to-end error to ≲7% for PPA and system metrics across major DL platforms in commercial/research nodes (12 nm, 45 nm), dramatically shortening design–silicon cycles from months to hours (Esmaeilzadeh et al., 2023).
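The two-stage modeling flow above can be sketched as a composition of the two predictors. The analytic formulas below are hypothetical placeholders standing in for DLAS's trained ML regressors (GBDT/GCN models fitted to synthesis and simulation data):

```python
# Sketch of the backend-predictor -> performance-simulator composition.
# The cost formulas are invented stand-ins, not the DLAS models.
def backend_predictor(knobs, clock_ns, utilization):
    # Placeholder: power scales with parallelism and utilization,
    # achievable frequency falls as the array grows.
    pe_count = knobs["pe_count"]
    power_mw = 5.0 * pe_count * utilization
    freq_mhz = min(1000.0 / clock_ns, 2000.0 / (1 + 0.001 * pe_count))
    area_mm2 = 0.02 * pe_count
    return power_mw, freq_mhz, area_mm2

def perf_simulator(knobs, freq_mhz, power_mw, workload_macs):
    # Runtime = cycles / frequency; energy = power * time.
    cycles = workload_macs / knobs["pe_count"]
    runtime_s = cycles / (freq_mhz * 1e6)
    energy_j = (power_mw * 1e-3) * runtime_s
    return runtime_s, energy_j

knobs = {"pe_count": 256}
p, f, a = backend_predictor(knobs, clock_ns=1.0, utilization=0.7)
t, e = perf_simulator(knobs, f, p, workload_macs=4e9)
```

The key structural point is that the frontend consumes the backend's predicted $\hat f$ and $\hat P$, so physical-design effects propagate into runtime and energy estimates.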

3. Learning-Based Models and Design-Space Exploration

Prediction of PPA and system metrics relies on ensembles of:

  • GBDT, Random Forest, ANN, stacked-ensemble, and Graph Convolutional Network (GCN) regressors using joint knob and LHG features.
  • Multi-Objective Tree-Structured Parzen Estimator (MOTPE): A Bayesian optimizer maximizing the density ratio $p(\mathrm{good})/p(\mathrm{bad})$, converging in hundreds of queries (milliseconds/query).

DSE is formalized as minimization of a weighted sum:

$\min_{\mathbf x, \tau, u} \left( \alpha \hat E + \beta \hat T + \gamma \hat A \right)$

with runtime and power constraints. Pareto-optimal accelerator configurations, confirmed by full SP&R, yield <7% error for metrics (energy, area, runtime) (Esmaeilzadeh et al., 2023).

GCN models consuming LHGs outperform pure vector models by 10–20% error on unseen architectures, especially in low-data regimes.
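A toy version of this DSE loop, using random search as a stand-in for MOTPE and a hypothetical cost model in place of the learned predictors, might look like:

```python
import random

# Toy DSE over the weighted objective alpha*E + beta*T + gamma*A.
# Random search stands in for MOTPE; the cost model is invented.
random.seed(0)

def evaluate(pe_count, clock_ns):
    # Hypothetical placeholder for the trained PPA/perf predictors.
    freq_mhz = min(1000.0 / clock_ns, 2000.0 / (1 + 0.001 * pe_count))
    runtime = (4e9 / pe_count) / (freq_mhz * 1e6)   # seconds
    power_w = 0.005 * pe_count
    energy = power_w * runtime                       # joules
    area = 0.02 * pe_count                           # mm^2
    return energy, runtime, area

alpha, beta, gamma = 1.0, 10.0, 0.01
best = None
for _ in range(500):
    pe = random.choice([64, 128, 256, 512, 1024])
    clk = random.uniform(0.5, 2.0)
    E, T, A = evaluate(pe, clk)
    cost = alpha * E + beta * T + gamma * A
    if best is None or cost < best[0]:
        best = (cost, pe, clk)
```

Because each query is a model evaluation rather than a full synthesis run, thousands of configurations can be screened before any expensive SP&R confirmation.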

4. System-Level Integration and Programming Stacks

DLAS incorporates systematic integration for accelerator generators and SoC platforms:

  • Hardware Templates: Parameterizable accelerator arrays (Gemmini), custom microarchitectures (fully-pipelined, combinational, hybrid), flexible data representation, and memory topologies.
  • Programming Stack: Compiler passes (ONNX→TVM→IR→hardware), DMA-tiled data movement, runtime blocking and scheduling heuristics, and push-button flows for automatic binary generation (Genc et al., 2019).
  • SoC Effects: Shared caches, page-table walker, OS scheduling, and context-switch overheads are integrated into DLAS evaluation, exposing real-world resource contention and utilization bottlenecks.

Empirical benchmarks on Gemmini report $10^2$–$10^3\times$ speedup over CPU-only execution, with simulated energy efficiency of 0.32 pJ/MAC matching leading commercial designs.
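A quick back-of-envelope check of what the 0.32 pJ/MAC figure implies per inference; the workload size (4 GMACs, roughly ResNet-50 scale) is an illustrative assumption, not a number from the Gemmini evaluation:

```python
# Energy-per-MAC arithmetic: total MAC energy per inference.
# 4 GMACs/inference is an assumed, illustrative workload size.
PJ_PER_MAC = 0.32
macs_per_inference = 4e9
energy_j = macs_per_inference * PJ_PER_MAC * 1e-12
# roughly 1.28 mJ of MAC energy per inference
```

Note this counts only MAC energy; memory traffic and control typically dominate the remainder of the budget.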

5. Cross-Stack Perturbation, Co-Design, and Empirical Insights

DLAS research highlights the non-trivial coupling between layers:

  • Model Size, Accuracy, Latency Are Not Highly Correlated: Smaller networks may not yield faster inference; model compression does not guarantee speedup—only 10–30% of theoretical speedup realized for aggressive weight pruning in many cases.
  • Algorithmic–Hardware Dependency: Best algorithms and data formats for a DNN are highly platform-dependent; auto-tuning (Ansor) can invert primitive choices depending on hardware and schedule (Gibson et al., 2023).
  • Quantization and Pruning: Int8 quantization yields 1.2–9.2× speedup on CPU (SIMD integer math), 1.3–2× on GPU; float16 is disadvantageous on CPU (emulation), modestly beneficial on GPU.
  • Auto-Tuning Yield and Cost: 2–5× speedups over untuned schedules on server platforms, but search times prohibitive on edge devices.
  • Framework Assumptions: Rapid evolution of architectures (e.g., EfficientNetB0 violating TVM quantization pipeline assumptions) necessitates robust and maintainable integration (Gibson et al., 2023).

DLAS thus provides a reference checklist for co-optimization—jointly tuning model structure, algorithmic primitive, compiler schedule, and hardware configuration.
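As a concrete example of a layer-3 optimization interacting with layer 6, symmetric max-abs int8 post-training quantization can be sketched in a few lines. Per-tensor scaling here is illustrative; production toolchains calibrate scales per layer or per channel:

```python
# Minimal sketch of symmetric (max-abs) int8 post-training quantization.
# Per-tensor scale; real pipelines (e.g. TVM's) use calibration data.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [int(max(-127, min(127, round(w / scale)))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.51, -1.27, 0.003, 0.9, -0.42]
q, scale = quantize_int8(weights)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, w_hat))  # <= scale/2
```

Whether the resulting int8 arithmetic actually delivers the 1.2–9.2x CPU speedup quoted above depends on SIMD integer support in the target backend, which is exactly the cross-layer coupling DLAS measures.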

6. Distributed and Overlay Stack Extensions

DLAS generalizes to distributed and overlay contexts:

  • Distributed Multi-GPU/Node Training: PowerAI DDL multi-ring all-reduce, scalable to hundreds of GPUs; topology-aware collective algorithms matched to network hierarchy, yielding 84–92% scaling efficiency (ResNet-50/101), slashing training times from days to hours (Cho et al., 2017).
  • Compute–Communication Overlap: Endpoint collective engines (ACE) offload All-Reduce computation, reducing HBM bandwidth demand by 3.5×, boosting effective network bandwidth by 1.44× on average and iteration speedup by 1.12–1.41× over optimized baselines (Rashidi et al., 2020).
  • FPGA Overlays: DLA overlays programmable via VLIW networks and a domain-specific graph compiler, supporting CNN and LSTM subgraphs with only ~1% area overhead, reporting up to 900 fps inference on GoogLeNet and 12× LSTM speedup (Abdelfattah et al., 2018).
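The ring all-reduce collective underlying such multi-ring schemes can be simulated in a single process. This is a didactic model (reduce-scatter followed by all-gather over a logical ring), not the PowerAI DDL implementation:

```python
# Single-process simulation of ring all-reduce over n workers.
# After 2*(n-1) steps every worker holds the elementwise sum.
def ring_allreduce(vectors):
    n = len(vectors)
    dim = len(vectors[0])
    assert dim % n == 0, "vector must split evenly into n chunks"
    size = dim // n
    buf = [list(v) for v in vectors]

    def chunk(worker, c):
        return buf[worker][c * size:(c + 1) * size]  # copy (snapshot)

    # Reduce-scatter: after n-1 steps, worker i owns the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunk(i, (i - step) % n))
                 for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            for k in range(size):
                buf[dst][c * size + k] += data[k]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunk(i, (i + 1 - step) % n))
                 for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            buf[dst][c * size:(c + 1) * size] = data
    return buf
```

Each worker sends and receives only $2(n-1)/n$ of the vector in total, which is why the pattern is bandwidth-optimal and maps well onto hierarchical network topologies.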

7. Future Directions, Lessons, and Standardization

Key future and actionable recommendations include:

  • Wider auto-tuning toolchain support for sparse and quantized ops across HW platforms.
  • Holistic co-optimization frameworks: Joint exploration of compression ratio, algorithm format, and hardware-schedule knobs, optionally informed by chip floorplanning feedback (Gibson et al., 2023, Esmaeilzadeh et al., 2023).
  • Hardware primitives: Native support for mixed precision, block-sparse MAC arrays, flexible indexing, tensor core-style operations.
  • Framework maintainability: Resilient codegen pipelines bridging rapid DNN architectural evolution with hardware differentiation.

DLAS stands as both a unifying reference framework and a practical toolset for academic and industrial researchers seeking systematic, validated, and composable acceleration approaches for deep learning (Esmaeilzadeh et al., 2023, Gibson et al., 2023, Genc et al., 2019, Cho et al., 2017, Rashidi et al., 2020, Abdelfattah et al., 2018, Georganas et al., 2019).
