
Hardware-Software Co-Design Overview

Updated 24 December 2025
  • Hardware-software co-design is an integrated design approach that synchronizes hardware architectures and software systems to meet strict performance, latency, and timing constraints.
  • It leverages multi-level formal models and reconfigurable platforms such as FPGAs and ASICs to emulate and optimize metrics like latency, bandwidth, and error rates in diverse environments.
  • Dynamic reconfiguration and rigorous validation techniques enable precise runtime tuning and scalable prototyping for applications including URLLC, TSN, and distributed cloud/edge systems.

Hardware-software co-design refers to the integrated approach for jointly developing hardware architectures and software systems to achieve optimized overall functionality, performance, and fidelity—particularly when performance bottlenecks, correctness constraints, or end-to-end timing requirements preclude traditional hardware- or software-centric engineering. This methodology spans diverse domains, including memory hierarchy prototyping, latency-critical radio-frequency (RF) testbeds, wireless network emulators, storage systems, network edge emulation, and cloud/edge distributed computing. Its prominence arises from the increasing complexity of modern systems and the need to close the semantic gap between microarchitectural realities and system-level behavior.

1. Formal Foundations and Models

Hardware-software co-design is anchored in multi-level formal models that map the execution semantics and timing characteristics of software abstractions onto hardware-implemented or hardware-emulated primitives. Mathematically, these systems often define explicit mappings between a function $f_{SW}$, implemented in software, and its hardware realization $f_{HW}$, with a joint optimization problem:

$$\min_{f_{HW},\, f_{SW}} \; \mathcal{L}(f_{HW}, f_{SW}; \Theta)$$

where $\mathcal{L}$ quantifies loss (e.g., latency, error, power, correctness) under parameter set $\Theta$. In memory emulation, for example, the total emulated latency $L_{\text{access}}$ might be given by:

$$L_{\text{access}} = L_{\text{native}} + N_{\text{stall}} \cdot T_{\text{clk}}$$

for hardware-induced delays, or $L_{\text{emulated}} = L_{\text{local}} + \Delta L$ for software-injected NUMA/CXL penalties (Wen et al., 2020, Gond et al., 12 Apr 2024).
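
The two latency models compose directly; below is a minimal sketch, where the 80 ns DRAM latency and 300 MHz datapath clock are illustrative assumptions, not values from the cited papers.

```python
# Minimal sketch of the latency models above; device figures are illustrative
# placeholders, not values from the cited papers.

def hw_emulated_latency(l_native_ns: float, n_stall: int, t_clk_ns: float) -> float:
    """Hardware-side model: native access latency plus injected pipeline stalls."""
    return l_native_ns + n_stall * t_clk_ns

def sw_emulated_latency(l_local_ns: float, delta_l_ns: float) -> float:
    """Software-side model: local access latency plus an injected NUMA/CXL penalty."""
    return l_local_ns + delta_l_ns

# Emulate ~350 ns NVM-class reads on top of ~80 ns DRAM with a 300 MHz datapath:
t_clk_ns = 1e9 / 300e6                      # ~3.33 ns per FPGA clock cycle
n_stall = round((350 - 80) / t_clk_ns)      # stall count needed to hit the target
print(f"N_stall={n_stall}, L_access={hw_emulated_latency(80, n_stall, t_clk_ns):.1f} ns")
```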

In RF channel emulation, the baseband channel impulse response is parameterized as:

$$h(t, \tau) = \sum_{l=1}^{L(t)} a_l(t)\, \delta(\tau - \tau_l(t))$$

with mapping to finite-tap filter arrays, ensuring physical-layer fidelity at nanosecond-level delays (Villa et al., 2022).
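
A sketch of this tap-mapping step follows, assuming a 1 ns tap spacing and a 1024-tap array; both figures are illustrative, not CaST's actual parameters.

```python
import numpy as np

# Illustrative sketch: map a multipath profile h(t, tau) = sum_l a_l * delta(tau - tau_l)
# onto a finite FIR tap array at the emulator's sample period. Tap count and
# spacing are assumptions, not parameters from the cited toolchain.

def paths_to_fir_taps(gains, delays_ns, tap_spacing_ns=1.0, n_taps=1024):
    """Round each path delay to the nearest tap; accumulate complex gains per tap."""
    taps = np.zeros(n_taps, dtype=complex)
    for a, tau in zip(gains, delays_ns):
        idx = int(round(tau / tap_spacing_ns))   # quantization step = tap granularity
        if idx < n_taps:
            taps[idx] += a                        # coincident paths add coherently
    return taps

taps = paths_to_fir_taps(gains=[1.0, 0.4 * np.exp(1j * 0.7)], delays_ns=[103.4, 517.9])
print(np.flatnonzero(taps))   # tap indices 103 and 518 after rounding
```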

2. Architectures and Toolchains

Co-designed systems exhibit heterogeneous or hybrid architectures in which portions of the functional stack are instantiated in reconfigurable logic (FPGAs), custom ASICs, or high-speed interconnect logic, while the control, configuration, and validation logic reside in software frameworks or system-level libraries. Notable examples include:

  • FPGA-based Hybrid Memory Emulation: Hardware implements a memory management datapath (e.g., Hybrid Memory Management Unit, HMMU) that processes PCIe request streams, injects delays for emulating NVM, and exposes performance/migration counters to the OS. Software integrates with kernel drivers and can reconfigure policies dynamically (Wen et al., 2020).
  • Main Memory Emulators for System Software: On Xilinx SoCs (e.g., ZCU104), rate controllers within the FPGA programmable logic inject per-region latencies, bandwidth caps, and bit-flip errors while privileged/unprivileged software transparently interacts with the underlying hardware. Application-level APIs allow run-time reconfiguration (Hirofuchi et al., 2023); a register-map sketch of this software-side pattern follows the list.
  • Channel Emulation and Sounder Toolchains: Emulation toolchains like CaST include a scenario generator that transforms ray-tracing output into sequences of finite impulse response (FIR) filter taps, which are programmed into FPGAs, and SDR-based sounders for pass-through measurement, validating the accuracy of emulated delays/gains (Villa et al., 2022).
  • User-Space Libraries for Disaggregated Memory: Solutions such as emucxl bridge QEMU-driven NUMA emulation with userspace APIs, enabling software to allocate memory either on the "local" or emulated "CXL" node, controlling per-access injected delay or bandwidth limitation at the application level (Gond et al., 12 Apr 2024).
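
A recurring software-side pattern across these systems is run-time reconfiguration of memory-mapped hardware registers. The sketch below shows that generic pattern only; the device path, base address, and register layout are hypothetical, not taken from the HMMU or any of the cited emulators.

```python
import mmap
import os
import struct

# Hypothetical register map for an FPGA-based memory emulator exposed over
# memory-mapped I/O; offsets and the base address are illustrative only.
CTRL_BASE = 0xA0000000          # assumed AXI-Lite base address of the emulator
REG_LATENCY_NS = 0x00           # write: injected latency for region 0
REG_READ_COUNT = 0x10           # read: hardware read counter for region 0

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)   # requires root
regs = mmap.mmap(fd, 4096, offset=CTRL_BASE)

def write_reg(offset: int, value: int) -> None:
    """Write a 32-bit little-endian control register."""
    regs[offset:offset + 4] = struct.pack("<I", value)

def read_reg(offset: int) -> int:
    """Read a 32-bit little-endian status register."""
    return struct.unpack("<I", regs[offset:offset + 4])[0]

write_reg(REG_LATENCY_NS, 350)              # emulate ~350 ns NVM-class reads
print("reads so far:", read_reg(REG_READ_COUNT))
```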

3. Latency and Fidelity Control

Emulating or controlling latency with high accuracy is a central challenge for hardware-software co-design in testbed and prototype development. Approaches span the full spectrum:

  • Nanosecond-Accurate RF Path Emulation: Direct mapping of path delays (rounded to FPGA filter tap granularity) achieves ≤20 ns error, supporting sub-microsecond latency fidelity for 5G/URLLC protocol testing. The signal chain includes ray-tracing models → tap file generation → FPGA reconfiguration, with cross-correlation using SDR sounders for end-to-end validation (Villa et al., 2022).
  • Hardware Delay Insertion: Memory system emulators introduce programmable pipeline stalls at the memory-controller datapath level to emulate NVM/DRAM asymmetry. The stall count $N_{\text{stall}}$ is configured to match device-level specifications, allowing rapid exploration of different technology profiles without hardware resynthesis (Wen et al., 2020).
  • User-Space Delay/Throttling: Software-based emulators inject precise delays into memory accesses or communication calls by timing-aware signal handling, busy-waiting, or explicit thread stalls. This allows arbitrary, per-region, or per-access emulated behavior, with accuracy routinely at or below 1–5% error versus hardware (Koshiba et al., 2019, Gond et al., 12 Apr 2024); see the busy-wait sketch after this list.
  • TSN and Real-Time Networks: Architectural integration of time-aware shapers in the Linux kernel, with patched Mininet and per-veth interface qdisc, enables microsecond-level scheduling and jitter control, facilitating validation of deterministic networking algorithms (Gracia et al., 2 Jun 2025).
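
A minimal sketch of the busy-wait variant of user-space delay injection follows; the 270 ns penalty is an illustrative figure, and real emulators typically hook the delay into signal handlers or allocator shims rather than explicit call sites.

```python
import time

# Sketch of user-space delay injection: a busy-wait calibrated in nanoseconds,
# wrapped around an emulated "remote" access. The penalty value is illustrative.

def inject_delay_ns(delay_ns: int) -> None:
    """Busy-wait for delay_ns; spinning avoids scheduler noise from sleep()."""
    deadline = time.perf_counter_ns() + delay_ns
    while time.perf_counter_ns() < deadline:
        pass

def emulated_remote_read(buf: bytearray, i: int, extra_latency_ns: int = 270) -> int:
    inject_delay_ns(extra_latency_ns)   # delta-L: emulated CXL/NUMA penalty
    return buf[i]

buf = bytearray(4096)
print(emulated_remote_read(buf, 42))
```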

4. Dynamic Reconfiguration and Extensibility

A core requirement for co-design testbeds is runtime reconfigurability and support for dynamic workloads. This is realized through:

  • Dynamic Scenario Updates: RF and wireless emulators (CaST, near-memory accelerators) support scenario configuration packets (SCPs) that reprogram propagation delays, reflection coefficients, and mobility models at millisecond scale without pipeline stalling (Villa et al., 2022, Mukherjee et al., 13 Jun 2024).
  • Flexible Memory Policies: FPGA-based hybrid and main memory emulators allow software-in-the-loop modification of placement and migration policy logic. For example, users can swap in hot/cold page trackers or streaming-vs-random access detectors by modifying LUTs and resetting the pipelined controller, supporting iterative design (Wen et al., 2020, Hirofuchi et al., 2023).
  • Programmable Network Impairments: Network emulation platforms integrate Linux tc/netem, eBPF-EDT, or similar tools to permit online reconfiguration of latency, jitter, bandwidth, and loss per virtual or physical link, as in FLEET and Kollaps (Hamdan et al., 30 Aug 2025, Gouveia et al., 2020, Becker et al., 2022); a tc/netem sketch follows the list.
  • Automation and Infrastructure-as-Code: System-wide TOML/YAML-based configuration layers, deployment generators, and orchestration agents automate the pipeline from topology to parameterization, enabling reproducible and extensible hardware-software co-design experimentation (Hamdan et al., 30 Aug 2025, Gouveia et al., 2020).
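
Below is a sketch of online impairment reconfiguration using standard tc/netem commands. The interface name and impairment values are illustrative, and this shows only the generic Linux mechanism, not the FLEET or Kollaps control plane.

```python
import subprocess

# Sketch of online network-impairment reconfiguration via Linux tc/netem.
# Requires root; `qdisc replace` updates the impairment in place without
# tearing down the link.

def set_impairment(dev: str, delay_ms: float, jitter_ms: float,
                   loss_pct: float, rate_mbit: float) -> None:
    """Apply delay/jitter/loss/rate to a link using a netem root qdisc."""
    cmd = ["tc", "qdisc", "replace", "dev", dev, "root", "netem",
           "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
           "loss", f"{loss_pct}%",
           "rate", f"{rate_mbit}mbit"]
    subprocess.run(cmd, check=True)

# e.g., degrade a veth link mid-experiment to model a congested WAN hop
set_impairment("veth0", delay_ms=40, jitter_ms=5, loss_pct=0.5, rate_mbit=100)
```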

5. Verification, Validation, and Measurement Infrastructure

Ensuring that the jointly designed hardware-software system delivers ground-truth physical timing and behavior is critical. Techniques include:

  • End-to-End Sounding: Cross-layer measurement, such as transmitting known PN sequences through the emulation fabric and correlating received IQ samples, quantifies actual delay/gain errors against theoretical values (Villa et al., 2022); see the correlation sketch after this list.
  • Performance Counters and Benchmarks: Built-in read/write counters, region-wise bandwidth/error counters, and integration with real-world workloads (SPEC CPU, YCSB, BabelStream, etc.) assess fidelity relative to reference hardware (Wen et al., 2020, Hirofuchi et al., 2023).
  • Timestamping and Scheduling Audits: TSN emulators use low-overhead timestamping (syscall, socket timestamps, BPF/XDP helpers) at all path points, with median and variance corrections, to characterize latency/jitter at sub-microsecond scale (Gracia et al., 2 Jun 2025).
  • Configuration-Accurate Modeling: Emulators such as emucxl expose tunable API knobs (delay/jitter/bandwidth), with microbenchmark validation (pointer chasing, STREAM) to quantify deviation between emulated and reference measurements (Gond et al., 12 Apr 2024).
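
A sketch of the correlation step follows, assuming a 100 MS/s sample rate and a synthetic received trace standing in for real IQ capture.

```python
import numpy as np

# Sketch of end-to-end sounding: correlate a known PN sequence against received
# samples to estimate the delay actually applied by the emulation fabric.
# Sample rate, sequence length, and the applied delay are illustrative.

rng = np.random.default_rng(0)
fs = 100e6                                   # 100 MS/s -> 10 ns per sample
pn = rng.choice([-1.0, 1.0], size=1023)      # known probe sequence

true_delay_samples = 37                      # delay the emulator was programmed to apply
rx = np.concatenate([np.zeros(true_delay_samples), pn])
rx = rx + 0.05 * rng.standard_normal(rx.size)   # measurement noise

corr = np.correlate(rx, pn, mode="full")
est = int(np.argmax(corr)) - (len(pn) - 1)   # lag of the correlation peak
print(f"estimated delay: {est} samples = {est / fs * 1e9:.0f} ns")
```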

6. Limitations and Trade-offs

Hardware-software co-design introduces trade-offs between accuracy, reusability, and complexity:

  • Granularity: Hardware-level emulation is often limited by minimum step size (e.g., tap delay quantization, minimum attainable latency of roughly 400 ns in small FPGAs) or page-level allocation (NUMA, CXL), precluding cache-line or sub-nanosecond emulation (Hirofuchi et al., 2023, Gond et al., 12 Apr 2024); the sketch after this list illustrates the quantization bound.
  • Model Simplification: Coarse delay/stall insertion does not reproduce all vendor-specific state-machine or signal-integrity effects (e.g., per-command DDR4 timings, real PCIe bus backpressure) (Wen et al., 2020, Gond et al., 12 Apr 2024).
  • Scalability: While eBPF-based network emulation achieves constant-time lookups per packet, hardware memory resources or qdisc/namespace instantiation can be limiting at scale (Becker et al., 2022, Gouveia et al., 2020).
  • Validation Overheads: Accurate validation (e.g., end-to-end IQ correlation, timestamping with BPF) can impose measurement overheads or require corrections for system-call or context-switching noise (Gracia et al., 2 Jun 2025).
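
A small sketch of the quantization bound, assuming a 10 ns tap spacing (an illustrative figure): with spacing $T_s$, any requested delay is rounded to a multiple of $T_s$, so the worst-case error is $T_s / 2$.

```python
# Sketch of the tap-granularity limit: requested delays are rounded to the
# nearest multiple of the tap spacing. The 10 ns spacing is assumed.

T_S_NS = 10.0

def applied_delay_ns(requested_ns: float) -> float:
    """Delay the emulator can actually realize: nearest tap multiple."""
    return round(requested_ns / T_S_NS) * T_S_NS

for req in (103.4, 517.9, 4.9):
    app = applied_delay_ns(req)
    print(f"requested {req:6.1f} ns -> applied {app:6.1f} ns (error {abs(app - req):.1f} ns)")
```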

7. Best Practices and Representative Use Cases

Effective hardware-software co-design mandates synchronization between architectural choices, configuration tools, and measurement methodology. Established practices include:

  • Pre-experiment Calibration: Always perform sounder-based or timestamp-based validation before running protocol or application-level benchmarks, so that drift or aging is detected early (Villa et al., 2022, Gracia et al., 2 Jun 2025); see the calibration sketch after this list.
  • Matching Update Rates to Dynamics: Synchronize reconfiguration frequency with channel or workload dynamics (e.g., mobility in wireless, time-varying memory hotness, background traffic in networks) (Villa et al., 2022, Hamdan et al., 30 Aug 2025).
  • Transparent Software Integration: Run the full software stack against the actual hardware emulation logic, rather than a model-only simulation, to capture real OS, cache, TLB, NUMA, and coherence effects (Wen et al., 2020, Hirofuchi et al., 2023).
  • Layered Modular Design: Architect modular blocks for traffic generation, emulation, and instrumentation, expanding experimental reach and reuse—in cloud or edge, this allows scaling from small in-lab setups to thousands of VMs and endpoints (Pluzhnik et al., 2014, Becker et al., 2022).
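
A sketch of a timestamp-overhead calibration pass follows, using perf_counter-based probes; real TSN audits rely on socket timestamps or BPF helpers, but the median-subtraction idea is the same.

```python
import statistics
import time

# Sketch of pre-experiment timestamp calibration: measure the fixed cost of the
# timestamping path itself on an idle system, then subtract its median from
# later latency measurements. The sample count is arbitrary.

def timestamp_overhead_ns(samples: int = 10_000) -> tuple[float, float]:
    """Return (median, stdev) of back-to-back timestamp deltas in nanoseconds."""
    deltas = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        deltas.append(t1 - t0)
    return statistics.median(deltas), statistics.pstdev(deltas)

median_ns, jitter_ns = timestamp_overhead_ns()
print(f"subtract {median_ns:.0f} ns per probe; jitter ~{jitter_ns:.0f} ns")
# Later measurements are corrected as: corrected = raw_delta - median_ns
```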

In practice, these principles enable research and prototyping of systems for URLLC, TSN, federated learning, hybrid memory, and disaggregated data center architectures, ensuring that hardware constraints and software behavior are co-optimized for end-to-end fidelity and experimental tractability.
