Hardware–Software Co-Design
- Hardware–software co-design is a holistic approach that jointly optimizes hardware and software components, enhancing performance, energy efficiency, and cost-effectiveness.
- It employs unified intermediate representations, algorithm partitioning, and hierarchical optimization techniques to navigate multi-dimensional design spaces.
- Applications in AI accelerators, cryptography, and quantum computing demonstrate significant speedups, area savings, and energy improvements through integrated co-design.
Hardware–software co-design is the techno-economic practice of jointly architecting and optimizing hardware and software systems, leveraging their mutual interdependencies to maximize performance, energy efficiency, cost-effectiveness, and functionality beyond what hardware-only or software-only methodologies permit. In modern digital systems—from neural network accelerators to NVMe storage, post-quantum cryptography, neuromorphic spiking networks, and distributed quantum computers—co-design strategies are fundamental for advancing the Pareto frontier of achievable metrics. Unlike approaches that treat hardware and software as separate or sequentially layered artifacts, co-design exploits cross-layer feedback, shared intermediate representations, and holistic optimization strategies, especially as Moore’s Law’s traditional guarantees decline and “intelligence” workloads (e.g., AI/ML) demand aggressive end-to-end throughput scaling (Yazdanbakhsh, 9 Apr 2025).
1. Origins, Evolution, and Systemic Imperative
Historically, hardware–software co-design has evolved through several “epochs” marked by the degree of integration and coupling between hardware and software artifacts. Early systems such as the MIT Tagged-Token Dataflow and Manchester Dataflow machines lacked a formal ISA abstraction; compiler and hardware were developed hand-in-hand. The 1980s brought VLIW and dataflow ISAs (Monsoon, ELI-512), pushing complexity into the compiler and scheduler. The 1990s, dominated by Moore’s Law, shifted the landscape toward rigid abstraction boundaries—co-design receded to niche domains as general-purpose designs leveraged transistor scaling. The fragmentation of “dark silicon” and the AI accelerator era reignited co-design: specialized systems (Google TPU, Eyeriss, Anton) found order-of-magnitude efficiency gains through domain-specific hardware tightly married to software (compilers, dataflow DSLs, kernel fusion). In the 2020s, the “redshift” of generative AI has collapsed the boundary entirely; full-system intelligence, algorithm–hardware co-evolution, and software-defined hardware are first-class design axes (Yazdanbakhsh, 9 Apr 2025).
Key drivers for renewed co-design urgency include:
- Moore’s Law slow-down, which eliminates guaranteed per-transistor performance scaling.
- The “hardware lottery” effect, where algorithmic adoption is critically bottlenecked by hardware support for novel computational models.
- The exponential growth in computational and memory requirements for modern AI/ML, streaming, and cryptographic workloads.
Co-design is thus a structural requirement for contemporary ICT platforms spanning cloud-to-edge and classical-to-quantum domains.
2. Methodological Frameworks and Search Strategies
Hardware–software co-design workflows generally depart from the fixed, sequential hardware-then-software implementation flow. Instead, they explore multidimensional design spaces by:
- Unifying intermediate representations (IRs) that permit coupling of hardware and algorithmic transformations (e.g., tensor syntax trees for deep learning, Bristol netlists for garbled circuits, SDF graphs for multimedia systems) (Mo et al., 2022, Xiao et al., 2021, Sebai et al., 2010).
- Partitioning algorithms into hardware and software modules along quantifiable objective axes such as latency, area, energy, or “performance-density” (speedup per LUT or per BRAM) (Montanaro et al., 2022).
- Employing hierarchical or semi-decoupled joint optimization engines, e.g., bi-level neural architecture and hardware search (NAS × accelerator parameterization), often leveraging rank correlation to drastically shrink the search space without optimality loss (Lu et al., 2022, Jiang et al., 2019).
- Integrating software scheduling and resource allocation logic that exploits static workload determinism for hardware simplification and resource reduction, as in METRO’s globally scheduled NoC for DNN accelerators (Wang et al., 2021), or HAAC’s streaming scratchpad (Mo et al., 2022).
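The rank-correlation pruning idea above can be sketched concretely. The following is a minimal illustration, not taken from any cited framework: a cheap proxy metric ranks candidate (network, accelerator) pairs, and if its ranking agrees with the expensive metric on a small calibration subset (measured by Spearman correlation), the bulk of the space is pruned by proxy rank alone. All names, thresholds, and the toy design space are invented for illustration.

```python
# Semi-decoupled co-search sketch: prune a design space with a cheap proxy
# whose ranking is validated against the true metric on a calibration set.

def spearman_rank_corr(xs, ys):
    """Spearman correlation between two score lists (no ties assumed)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def prune_by_proxy(candidates, proxy, true_metric, calib_size=8, keep=4,
                   min_corr=0.8):
    """Keep the top-`keep` candidates by proxy score, provided the proxy
    ranking agrees with the true metric on a small calibration subset."""
    calib = candidates[:calib_size]
    corr = spearman_rank_corr([proxy(c) for c in calib],
                              [true_metric(c) for c in calib])
    if corr < min_corr:
        return candidates          # proxy untrustworthy: do not prune
    return sorted(candidates, key=proxy, reverse=True)[:keep]

# Toy design space: candidate = (network depth, PE-array width).
candidates = [(d, w) for d in range(2, 8) for w in (8, 16, 32)]
proxy = lambda c: c[0] * 0.1 + c[1] * 0.01          # cheap estimate
true_metric = lambda c: c[0] * 0.11 + c[1] * 0.009  # "measured" quality
survivors = prune_by_proxy(candidates, proxy, true_metric)
print(len(survivors))  # 4 survivors go on to full evaluation
```

Only the survivors then receive full (expensive) hardware evaluation, which is the mechanism by which rank-preserving proxies shrink the joint search space without optimality loss.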
Table: Schematic overview of major co-design search strategies.
| Approach | Key Feature | Reference |
|---|---|---|
| Semi-decoupled NAS/HW search | Proxy ranking, monotonicity | (Lu et al., 2022) |
| RL-based two-level co-exploration | Fast perf. pruning, slow RL | (Jiang et al., 2019) |
| Compiler-guided streaming & scratchpad | Data-oblivious programs, statically scheduled memory | (Mo et al., 2022) |
| Bayesian/heuristic multi-objective opt | Layered design space decomposition | (Xiao et al., 2021) |
A common mathematical formulation is constrained multi-objective optimization: maximize accuracy or throughput while bounding latency, energy, or area costs.
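As a schematic instance (symbols illustrative, not from any one cited paper), a joint search over a network architecture α and a hardware configuration h can be written as:

```latex
\begin{aligned}
\max_{\alpha \in \mathcal{A},\; h \in \mathcal{H}} \quad
  & \mathrm{Accuracy}(\alpha, h) \\
\text{subject to} \quad
  & \mathrm{Latency}(\alpha, h) \le L_{\max}, \\
  & \mathrm{Energy}(\alpha, h) \le E_{\max}, \\
  & \mathrm{Area}(h) \le A_{\max}.
\end{aligned}
```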
3. Embedded and Domain-Specific Co-Design Implementations
Co-design has proven decisive in numerous sharply defined applications:
- Post-quantum cryptography: BIKE KEM partitioned across ARM/FPGA hybrids achieves up to 2.78× speedup over software-only by selectively synthesizing hardware accelerators for KeyGen/Decaps (but not Encaps), guided by “performance-density” per resource (Montanaro et al., 2022).
- AGILE cryptography hardware: Finesse demonstrates end-to-end IR–ISA–hardware parameterization for pairing-based cryptography, with an agile co-optimization loop yielding up to 34× throughput and 6.2× area-efficiency over prior frameworks (Pan et al., 12 Sep 2025).
- Event-driven spiking networks: SNN frameworks deploy a single reference artifact, used by both software simulators and event-driven FPGAs, ensuring bit-exact determinism and scope-aware benchmarking (energy/latency per scope) (Lee et al., 24 Apr 2026).
- Sparse SNN accelerators: Hardware, dataflow, and training are co-designed to exploit both static (synaptic) and dynamic (spike) sparsity using compressed on-chip representations and event-driven pipelines, often tuned by DSE (design space exploration) engines (Aliyev et al., 2024).
- Automotive CPS: Architectural models (AADL) link system-level algorithm simulations (Simulink, CarSim) with static analysis (schedulability, utilization) and real-time hardware-software partitioning, enabling rapid design iteration and early hardware selection (Zhou et al., 2016).
These cases illustrate a spectrum of co-design—from ultra-tight coupling of fixed toolchains to modular, rapid-recompilation frameworks capable of instantaneous adaptation to new algorithms or parameters.
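To make the sparse-SNN co-design point concrete: in an event-driven pipeline, synaptic work scales with spike count rather than layer size, which is exactly the quantity tuned via the leak factor β and threshold θ. The sketch below is a minimal illustrative LIF layer, not the implementation of any cited framework; the weights and parameters are invented.

```python
# Minimal event-driven LIF layer sketch: only presynaptic spike events
# trigger weight accumulation, so synaptic-op count tracks spike sparsity.

def lif_step(v, spikes_in, weights, beta=0.9, theta=1.0):
    """One timestep of a leaky integrate-and-fire layer.

    v         : list of membrane potentials (one per postsynaptic neuron)
    spikes_in : indices of presynaptic neurons that fired this step
    weights   : weights[pre][post] synaptic matrix
    beta      : leak factor; theta : firing threshold
    Returns (new potentials, output spike indices, synaptic-op count).
    """
    v = [beta * x for x in v]                 # leak (dense but cheap)
    ops = 0
    for pre in spikes_in:                     # event-driven: sparse work
        for post, w in enumerate(weights[pre]):
            v[post] += w
            ops += 1
    out = [i for i, x in enumerate(v) if x >= theta]
    for i in out:
        v[i] = 0.0                            # reset fired neurons
    return v, out, ops

# 4 presynaptic x 3 postsynaptic toy layer; only presynaptic neuron 1 fires.
W = [[0.2, 0.0, 0.5],
     [1.2, 0.3, 0.0],
     [0.0, 0.9, 0.1],
     [0.4, 0.0, 0.7]]
v, out, ops = lif_step([0.0, 0.0, 0.0], spikes_in=[1], weights=W)
print(out, ops)  # neuron 0 crosses theta; 3 synaptic ops instead of 12
```

Raising θ or lowering β suppresses spikes and hence `ops`, at some accuracy cost, which is the interpretable trade-off curve noted in the metrics discussion below.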
4. Hardware–Software Partitioning, Scheduling, and Streaming
Efficient co-design architectures typically depend on nontrivial partitioning and scheduling strategies:
- Hardware–software partitioning is often driven by module-level profiling (e.g., offloading modules with highest “latency reduction per area” to FPGA, as with BIKE) (Montanaro et al., 2022).
- Memory–compute decoupling is achieved via compiler-driven streaming, e.g., HAAC’s separation of instruction, constant, and “out-of-range” wire queues enables in-order compute cores to be fully utilized regardless of volatile memory latencies (Mo et al., 2022), while METRO’s time-division flow control systematically overlays contention-free multi-flow scheduling atop minimal hardware routers (Wang et al., 2021).
- SDF graph models allow the explicit quantification of the impact of software-to-hardware task migration, encompassing execution-time scaling, TDMA slot reclamation, and synchronization-induced overheads (Sebai et al., 2010).
- For emerging computation models (e.g., distributed quantum computing), co-design includes adaptive runtime gate scheduling that leverages noisy, buffered entanglement to reduce circuit depth and improve fidelity over purely synchronous or statically scheduled baselines (Liu et al., 24 Mar 2025).
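The profiling-driven partitioning strategy in the first bullet can be sketched as a greedy knapsack-style selection: rank modules by “performance density” (cycles saved per unit of FPGA area) and offload until the area budget is exhausted. This is a hedged illustration of the general idea, not the BIKE toolflow itself; the module names and profile numbers are invented.

```python
# Profile-driven HW/SW partitioning sketch: greedily offload the modules
# with the highest latency reduction per unit area under an area budget.

def partition(modules, area_budget):
    """modules: list of (name, cycles_saved, area_cost) tuples.
    Returns the list of module names chosen for hardware offload."""
    ranked = sorted(modules, key=lambda m: m[1] / m[2], reverse=True)
    chosen, used = [], 0
    for name, saved, area in ranked:
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen

profile = [
    ("keygen", 900_000, 3000),   # 300 cycles saved per LUT
    ("decaps", 500_000, 2500),   # 200 cycles saved per LUT
    ("encaps", 100_000, 2000),   # 50 cycles saved per LUT
]
print(partition(profile, area_budget=6000))  # ['keygen', 'decaps']
```

With this toy profile the low-density module stays in software, mirroring the BIKE outcome in which KeyGen and Decaps are synthesized to hardware while Encaps remains a software routine.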
A critical attribute is the ability to expose fine-grained hardware configuration choices (pipeline depth, buffer size, dataflow patterns) as first-class parameters for joint software–hardware optimization.
5. Performance Metrics and Quantitative Impact
Hardware–software co-design is evaluated using formal and empirical metrics:
- Speedup: $S = T_{\mathrm{SW}} / T_{\mathrm{CO}}$, where $T_{\mathrm{SW}}$ and $T_{\mathrm{CO}}$ denote the execution times of the software-only and co-designed systems.
- Energy efficiency: $\eta = E_{\mathrm{SW}} / E_{\mathrm{CO}}$, the analogous ratio of per-task energy consumption.
- Throughput: Number of operations per second, often normalized to per-area or per-energy units (e.g., GSOP/W, kops/slice).
- Area-delay and energy-delay products: $\mathrm{ADP} = A \cdot T$ and $\mathrm{EDP} = E \cdot T$.
- Pareto-optimality: Co-design frameworks empirically push the Pareto frontier outward (e.g., accuracy vs. utilization, throughput vs. area).
- Trade-off curves: SNN co-design studies quantify the direct and indirect effects of parameter tuning (e.g., spike-count reduction vs. accuracy drop when varying LIF β and θ, revealing interpretable cost functions) (Aliyev et al., 2024).
- Hardware–software streaming bandwidths: HAAC, for garbled circuits, achieves a 2.7×10³ speedup on HBM2 over a CPU, with only 4.3 mm² of silicon for the IP core (Mo et al., 2022); SNN FPGA co-designs demonstrate sub-microsecond inference latency and nJ-level energy per inference at functional parity with CPU/GPU (Lee et al., 24 Apr 2026).
- Distributed quantum: Over 3× circuit depth reductions vs. baseline synchronous scheduling, and up to 2× circuit fidelity gains via co-designed buffering and adaptive scheduling (Liu et al., 24 Mar 2025).
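A toy computation of the ratio metrics defined above; all numbers are invented for illustration and do not correspond to any cited system.

```python
# Compute speedup, energy-efficiency gain, and area/energy-delay products
# for a hypothetical co-designed accelerator vs. a software-only baseline.

t_sw, t_co = 12.0, 3.0   # seconds per task: SW-only vs. co-designed
e_sw, e_co = 40.0, 8.0   # joules per task
area = 4.0               # mm^2 of accelerator silicon (illustrative)

speedup = t_sw / t_co    # S = T_SW / T_CO
energy_gain = e_sw / e_co  # eta = E_SW / E_CO
adp = area * t_co        # area-delay product (mm^2 * s)
edp = e_co * t_co        # energy-delay product (J * s)

print(speedup, energy_gain, adp, edp)  # 4.0 5.0 12.0 24.0
```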
Systematic reporting of these metrics enables the rational comparison of co-designed versus monolithic or sequentially developed systems across diverse domains.
6. Design Lessons, Trade-offs, and Generalization
Across the literature, several general principles and trade-offs emerge:
- Co-design effectiveness varies inversely with hardware “headroom”: the less generational scaling hardware provides, the greater the systemic gains from co-design (Yazdanbakhsh, 9 Apr 2025).
- Compiler–hardware splitting must be justified by workload determinism. Fixed-program or data-oblivious classes (e.g., garbled circuits, DNN recognition pipelines) benefit maximally from offline software scheduling and hardware runtime simplification.
- Modular abstractions (e.g., property-rich IRs, streaming APIs, software-defined scratchpads rather than hardware-managed caches) facilitate both performance and adaptability to new domains (see xNVMe for storage and SUSHI for DNN inference (Lund et al., 2024, Behnam et al., 2023)).
- The “hardware lottery” can be partially mitigated via generality-preserving design: exposing reconfiguration hooks, maintaining portability (across OS boundaries, technology nodes), and supporting rapid design iteration (e.g., Finesse’s minutes-long compile→synthesize cycle (Pan et al., 12 Sep 2025)).
- Co-design cost is not monotonic with domain specificity; care must be taken to avoid overfitting hardware to transient algorithms without provision for sustainable future generality (Yazdanbakhsh, 9 Apr 2025).
- For real-time and embedded systems, early integration of analytic and simulation feedback loops allows the rapid pruning of infeasible design choices and supports multi-dimensional metrics, including safety, reliability, and energy, in addition to raw performance (Zhou et al., 2016, Zhou et al., 2024).
The cumulative effect, as observed in diverse applications, is a systematic raising of efficiency and performance benchmarks, the ability to navigate multi-objective design spaces, and an accelerated cycle of innovation to meet both current and unanticipated workload demands.