Data-Algorithm Co-Design Engine
- Data-Algorithm Co-Design Engine is a framework that jointly optimizes data representation, algorithmic transformation, and hardware mapping through integrated design processes.
- It uses workload-aware techniques and surrogate models to concurrently minimize energy, latency, and computational overhead while improving system performance.
- The engine employs multi-stage design flows and runtime control policies, enabling efficient deployments in areas like continuous vision, LLM inference, and genomic computations.
The “Data-Algorithm Co-Design Engine” (Editor’s term) denotes a class of specification-driven, workload-aware frameworks that jointly optimize data or workload representation, algorithmic transformation, and hardware or system realization, rather than fixing a model first and adapting hardware afterward. In the cited literature, this pattern appears under several neighboring names (algorithm-hardware co-design, algorithm-system co-design, algorithm-SoC co-design, and hardware-software co-design) and is instantiated in domains including aerial robotics, mobile continuous vision, long-context LLM inference, dynamic programming in memory, mixed-signal PIM, approximate arithmetic, and FPGA deployment of sequence models (Krishnan et al., 2021, Zhu et al., 2018, Liang et al., 7 Apr 2025, Lu et al., 27 Feb 2026, Behnam et al., 2022, Khan et al., 26 Oct 2025, Lyu et al., 3 May 2026, Yi et al., 18 Nov 2025). Grouping these systems under one umbrella is an editorial synthesis, not a taxonomy supplied uniformly by the authors.
1. Problem formulation and conceptual basis
A recurring premise of this literature is that conventional top-down deployment treats model design and hardware mapping as separate stages, which can produce hardware-unfriendly models, excessive manual tuning, and poor system-level trade-offs (Yildirim et al., 18 Jun 2026). This concern is not limited to neural architecture search. In continuous vision, the waste arises because each video frame is treated as an independent image even though changes in pixel data between consecutive frames encode visual motion (Zhu et al., 2018). In long-context LLM inference, the bottleneck shifts from arithmetic to the storage and loading of KVCache, which grows linearly with sequence length and batch size and can dominate both memory usage and data transfer overhead (Yi et al., 18 Nov 2025). In dynamic programming workloads such as APSP and genomics, the decisive cost is repeatedly moving and staging large matrices or lookup tables through the memory hierarchy, not merely performing the recurrence itself (Lu et al., 27 Feb 2026).
This suggests a broader definition of co-design in which “data” means more than a dataset identifier. Depending on the system, it can mean environment generation and task specification in RL pipelines, motion vectors in a mobile camera stack, tensor outlier structure in quantization, KVCache placement and reuse patterns, or tier-aware mapping of PTR/CAL tables and DP tiles in monolithic 3D DRAM (Krishnan et al., 2021, Zhu et al., 2018, Lyu et al., 3 May 2026, Yi et al., 18 Nov 2025, Lu et al., 27 Feb 2026). A plausible implication is that these engines optimize not just operators and hardware blocks, but also the representation, locality, and lifecycle of the state that the algorithm consumes.
Several papers make this joint dependency explicit in mathematical or procedural form. One reconfigurable CNN co-design framework scalarizes its objective, combining algorithmic quality and deployment cost in a single search target (Fan et al., 2021). Euphrates, by contrast, formalizes a runtime update rule for motion-based extrapolation, which shows how data-derived motion estimates directly replace repeated CNN inference in E-frames (Zhu et al., 2018). CLO uses a head-wise cache-hit criterion, so that approximate reuse and transfer scheduling are driven by the geometry of adjacent query vectors rather than by exact block-level cache state (Yi et al., 18 Nov 2025).
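As a rough illustration of what scalarization means here, the sketch below folds a task loss, a latency estimate, and an energy estimate into one score and picks the cheapest candidate. The weighted-sum form, the weights, the `Candidate` type, and all numeric values are assumptions made for illustration; the exact objective of the cited framework is not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical design point: one architecture-hardware pairing."""
    name: str
    loss: float        # task loss, lower is better
    latency_ms: float  # estimated latency
    energy_mj: float   # estimated energy


def scalarized_cost(c, w_loss=1.0, w_latency=0.01, w_energy=0.005):
    """Weighted sum of quality and deployment cost (assumed form)."""
    return w_loss * c.loss + w_latency * c.latency_ms + w_energy * c.energy_mj


def best_candidate(candidates, **weights):
    """Return the candidate with the lowest scalarized cost."""
    return min(candidates, key=lambda c: scalarized_cost(c, **weights))


# Illustrative values only; they do not come from any cited paper.
candidates = [
    Candidate("small", loss=0.90, latency_ms=5.0, energy_mj=10.0),
    Candidate("medium", loss=0.70, latency_ms=12.0, energy_mj=30.0),
    Candidate("large", loss=0.65, latency_ms=40.0, energy_mj=120.0),
]
```

With the default weights the search settles on the middle design, while setting the latency and energy weights to zero selects the largest model. That shift is the practical meaning of a single search target: the weights encode the deployment trade-off.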
2. Engine structure and recurring workflow
Although the implementations differ, a common structural pattern is visible. One branch of the literature uses an explicit multi-stage workflow. AutoSoC begins from user-defined task information, platform constraints, and domain-specific optimization targets; launches multiple Air Learning training instances in parallel; prunes policies below a success-rate threshold; configures the surviving networks into the FlexACL accelerator template; checks hardware/software numerical consistency; generates RTL with Catapult HLS; and continues to place-and-route and floorplanned ASIC layout generation (Krishnan et al., 2021). A3C3 describes a related three-stage flow of bundle-library characterization, joint search over architecture and implementation, and automated co-generation of final model and hardware artifacts (Yildirim et al., 18 Jun 2026). The reconfigurable CNN co-design framework of 2021 uses “Specify and Train,” “Modeling,” and “Exploration,” with a once-for-all supernet, Gaussian-process surrogates for CE loss, latency, and energy, and a genetic search over architecture and hardware configurations (Fan et al., 2021).
Other systems replace explicit search loops with structured runtime control. Euphrates alternates between I-frames and E-frames, with the motion controller deciding when to invoke the CNN engine and when to propagate prior ROIs through motion vectors (Zhu et al., 2018). CLO partitions KV heads between persistent HBM residency and CPU-offloaded storage, then performs one-layer-ahead speculative sparse prefetching, head-wise approximate reuse, zero-copy transfer, and GPU-centric synchronization (Yi et al., 18 Nov 2025). GenDRAM similarly integrates placement, scheduling, and execution mode selection: Search PUs serve seeding and filtering, Compute PUs execute semiring-style DP updates, and the controller switches between heterogeneous producer-consumer mode for genomics and homogeneous blocked-Floyd–Warshall mode for APSP (Lu et al., 27 Feb 2026).
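The APSP mode mentioned above rests on the min-plus semiring, in which "multiplication" is addition and "addition" is a minimum, so no multiplier is needed. A minimal sketch of the textbook (unblocked) Floyd–Warshall update follows; it is a software illustration of the recurrence only, not GenDRAM's blocked, in-memory schedule, and the example graph is made up.

```python
import math


def floyd_warshall_min_plus(dist):
    """All-pairs shortest paths using only additions and comparisons.

    `dist` is a square list of lists; math.inf marks a missing edge.
    Returns a new matrix and leaves the input untouched.
    """
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            if dik == math.inf:
                continue  # i cannot reach k, so k cannot improve row i
            for j in range(n):
                # Min-plus update: combine with +, reduce with min.
                candidate = dik + d[k][j]
                if candidate < d[i][j]:
                    d[i][j] = candidate
    return d


INF = math.inf
graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shortest = floyd_warshall_min_plus(graph)
```

A blocked variant applies the same update tile by tile, which is what makes tile placement and striping a meaningful design variable.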
| Framework | Data/workload lever | Joint realization |
|---|---|---|
| AutoSoC (Krishnan et al., 2021) | randomized environment generator, success-rate threshold | RL policy training, FlexACL generation, ASIC layout |
| Euphrates (Zhu et al., 2018) | ISP motion vectors, I-frame/E-frame partition | ROI extrapolation, motion controller, autonomous SoC pipeline |
| GenDRAM (Lu et al., 27 Feb 2026) | tier-aware PTR/CAL placement, tile striping | Search PUs, Compute PUs, M3D DRAM mapping |
| ViM-Q (Lyu et al., 3 May 2026) | per-token, per-channel, per-block tensor statistics | hardware-aware quantization, LUT linear engine, pipelined SSM engine |
| CLO (Yi et al., 18 Nov 2025) | head-wise KV reuse difficulty and transfer hideability | approximate caching, zero-copy transfer, GPU-centric synchronization |
| AccLLM (Liang et al., 7 Apr 2025) | Λ-shaped attention and KV4 cache compression | pruning-aware FPGA accelerator, MM/VM reconfiguration |
This suggests that a co-design engine is less a single algorithm than a systems pattern: define the coupled spaces, expose task- or data-dependent structure, amortize or surrogate expensive evaluations, and materialize the result either through search or through runtime orchestration.
3. Data and workload as optimization variables
The most distinctive feature of the data-algorithm interpretation is that the “data side” is itself parameterized. In mobile continuous vision, Euphrates exploits motion vectors already generated inside the ISP’s temporal denoising stage. Rather than adding a new sensor or recomputing optical flow in software, it exports these vectors through the metadata section of the frame buffer, double-buffers the relevant SRAM so DMA does not stall the ISP, and uses them to replace many CNN inferences with ROI extrapolation on E-frames (Zhu et al., 2018). In this case, the key optimization variable is neither the dataset nor the CNN topology alone, but the reuse of latent motion metadata already present in the imaging pipeline.
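The idea of propagating an ROI with motion vectors instead of rerunning the CNN can be sketched as follows. This is a simplified illustration under stated assumptions: the box is shifted by the plain mean of the motion vectors of the blocks it overlaps, blocks are 16 pixels wide, and every `window`-th frame is an I-frame. The function names, the block size, and the averaging rule are hypothetical and are not taken from the Euphrates paper.

```python
def is_inference_frame(frame_index, window=4):
    """Assumed policy: run the CNN on every `window`-th frame (I-frame)."""
    return frame_index % window == 0


def extrapolate_roi(roi, motion_vectors, block=16):
    """Shift an ROI by the mean motion vector of the blocks it overlaps.

    roi: (x, y, w, h) in pixels.
    motion_vectors: dict mapping (block_col, block_row) -> (dx, dy) in pixels.
    """
    x, y, w, h = roi
    cols = range(x // block, (x + w - 1) // block + 1)
    rows = range(y // block, (y + h - 1) // block + 1)
    vectors = [motion_vectors[(c, r)] for c in cols for r in rows
               if (c, r) in motion_vectors]
    if not vectors:
        return roi  # no motion data: keep the previous box
    mean_dx = sum(v[0] for v in vectors) / len(vectors)
    mean_dy = sum(v[1] for v in vectors) / len(vectors)
    return (x + round(mean_dx), y + round(mean_dy), w, h)


# Made-up motion field covering the four blocks under the ROI.
mvs = {(1, 1): (4, 0), (2, 1): (4, 2), (1, 2): (4, 0), (2, 2): (4, 2)}
moved = extrapolate_roi((16, 16, 32, 32), mvs)
```

On an E-frame the per-frame cost is a handful of additions per block rather than a full CNN pass, which is where the energy savings in this style of design come from.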
AutoSoC exposes a different notion of “data”: the simulator-generated environment distribution. Air Learning varies obstacle count, seed, and goal position, uses curriculum learning over multiple zones, and gates hardware generation by success rate in randomized environments (Krishnan et al., 2021). This makes environment generation and task validation part of the design loop, even though the paper does not formulate a formal data-search objective. A plausible implication is that, for cyber-physical systems, the data generator itself becomes a co-design variable because it determines what policy complexity is needed and which accelerator trade-offs are meaningful.
ViM-Q makes tensor statistics explicit design inputs. The paper argues that Vision Mamba linear layers exhibit both persistent channel-wise outliers and dynamic per-token outliers, so it combines per-channel smoothing with dynamic per-token activation quantization and 4-bit per-block APoT weight quantization (Lyu et al., 3 May 2026). The important point is not merely low precision; it is the choice of different granularity for different tensors because the observed error modes differ across tokens, channels, and blocks. AccLLM makes an analogous distinction for long-context LLMs: weights, activations, and KVCache are quantized differently in W2A8KV4 because they play different roles in model size, decode bandwidth, and sequence-length scaling (Liang et al., 7 Apr 2025).
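To make the granularity argument concrete, the sketch below combines a per-channel smoothing step with dynamic per-token activation quantization. The smoothing rule is the SmoothQuant-style formula, swapped in here as an assumption since the text does not give ViM-Q's exact rule; APoT weight quantization is not modeled, and all names and values are illustrative.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative inputs."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smoothing_scales(x, w, alpha=0.5, eps=1e-8):
    """One scale per input channel (SmoothQuant-style rule, an assumption)."""
    scales = []
    for j in range(len(w)):
        x_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        scales.append(max(x_max, eps) ** alpha / max(w_max, eps) ** (1 - alpha))
    return scales


def smooth(x, w, scales):
    """Divide activations and multiply weights so that x @ w is unchanged."""
    x_s = [[v / scales[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s


def quantize_per_token(x, bits=8):
    """Symmetric quantization with one dynamic scale per token (row)."""
    qmax = 2 ** (bits - 1) - 1
    q, scales = [], []
    for row in x:
        scale = max(abs(v) for v in row) / qmax or 1.0
        q.append([round(v / scale) for v in row])
        scales.append(scale)
    return q, scales


x = [[0.5, 40.0], [-1.0, 20.0]]   # channel 1 is a persistent outlier
w = [[0.2, -0.4], [0.01, 0.02]]   # w[j] holds the weights of input channel j
s = smoothing_scales(x, w)
x_s, w_s = smooth(x, w, s)
q, token_scales = quantize_per_token(x_s)
```

Smoothing moves the channel-wise outlier from the activations into the weights without changing the product, and the per-token scale then absorbs whatever dynamic range remains in each row.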
GenDRAM and CLO push this logic into memory hierarchy management. GenDRAM places the genomics PTR and CAL tables, totaling about 17 GB, in Tier 0 of M3D DRAM because their accesses are random, dependent, and latency-critical, while APSP and alignment tiles are striped across channels and bank-groups for bandwidth and conflict reduction (Lu et al., 27 Feb 2026). CLO partitions KV heads by reuse difficulty and transfer hideability, retaining some heads persistently in GPU HBM and offloading others to CPU DRAM with speculative prefetching (Yi et al., 18 Nov 2025). These systems treat data placement, not just compute scheduling, as the primary optimization axis.
RAMAN and HybridAC show yet another form of data-aware design: weight or arithmetic sensitivity. RAMAN couples posit(8,2), approximate multiplication, and approximation-aware training so that hardware savings are judged at application level rather than only at circuit level, with a quality-of-results threshold of 96.5% for acceptability in the co-design workflow (Khan et al., 26 Oct 2025). HybridAC identifies variation-sensitive input channels using a Hessian-based sensitivity estimate and routes those channels to digital acceleration while leaving the remaining computation in analog PIM (Behnam et al., 2022). In both cases, data sensitivity—numerical in one paper, channel-wise variation sensitivity in the other—is the mechanism that decides which computations deserve robust or precise treatment.
4. Search, surrogate modeling, and control policies
Many co-design engines make the joint space tractable through surrogates. The reconfigurable CNN co-design framework of 2021 trains a once-for-all supernet, then fits Gaussian-process regressors for CE loss, latency, and energy, using a Matérn-$3/2$ kernel for loss and Matérn-$5/2$ kernels for latency and energy, with a genetic algorithm for exploration (Fan et al., 2021). The reported search cost is only $0.1$ GPU hour per optimized design, which the paper contrasts with tens to hundreds of GPU hours for prior approaches. CODEBench builds a related but larger benchmark-driven engine: CNNBench defines the model space, AccelBench performs cycle-accurate accelerator simulation, and BOSHCODE uses BOSHNAS to train a neural heteroscedastic surrogate model and search in a continuous embedding space before projecting back to valid CNN-accelerator pairs (Tuli et al., 2022).
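For reference, the two kernel families named above have simple closed forms. The sketch below implements the standard Matérn-3/2 and Matérn-5/2 kernels as functions of the distance `r` between two design points; the unit variance and the `length_scale` default are assumptions, and no fitting or search is shown.

```python
import math


def matern32(r, length_scale=1.0):
    """Matern kernel with smoothness 3/2 (once-differentiable samples)."""
    a = math.sqrt(3.0) * abs(r) / length_scale
    return (1.0 + a) * math.exp(-a)


def matern52(r, length_scale=1.0):
    """Matern kernel with smoothness 5/2 (twice-differentiable samples)."""
    a = math.sqrt(5.0) * abs(r) / length_scale
    return (1.0 + a + a * a / 3.0) * math.exp(-a)
```

Both kernels equal 1 at zero distance and decay with `r`. The 3/2 variant models rougher response surfaces than the 5/2 variant, which is one plausible reason to pair them with different targets.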
A3C3 generalizes this structure at a methodological level. It presents bundle-library characterization, Pareto frontier extraction, PSO in SkyNet, and differentiable architecture-and-implementation co-search in EDD, with joint objectives over accuracy loss, performance loss, and resource penalties (Yildirim et al., 18 Jun 2026). The framework is explicit that its strongest reusable ideas are the joint parameterization of model and implementation spaces, hardware-aware objective functions, and automatic co-generation of deployable artifacts. At the same time, it also states that A3C3 is primarily an algorithm-hardware co-design methodology with growing memory/workload awareness rather than a complete data-centric engine (Yildirim et al., 18 Jun 2026).
Not all engines rely on offline surrogate search. CLO uses runtime control laws instead. It computes a head-specific cache threshold from each head's importance, an upper-bound similarity threshold, and a shaping parameter (Yi et al., 18 Nov 2025). It then estimates how many heads can be prefetched and allocates persistent heads according to the number of difficult heads in each layer (Yi et al., 18 Nov 2025). This is not a static design-space exploration loop; it is an online controller that decides data placement and transfer timing from profiled similarity and bandwidth budgets.
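The shape of such an online controller can be sketched as below. Everything specific here is hypothetical: cosine similarity as the reuse test, the power-law mapping from head importance to threshold, and the rule that difficult heads exceeding the prefetch capacity stay resident in HBM. The sketch only illustrates the kind of decision being made and does not reproduce CLO's formulas.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two query vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def head_threshold(importance, tau_max=0.99, gamma=0.5):
    """Hypothetical mapping: more important heads demand higher similarity."""
    return tau_max * (importance ** gamma)


def cache_hits(prev_queries, curr_queries, importance):
    """Per-head decision: reuse the cached selection if queries are similar."""
    return [
        cosine_similarity(p, c) >= head_threshold(imp)
        for p, c, imp in zip(prev_queries, curr_queries, importance)
    ]


def persistent_heads_per_layer(difficult_heads, prefetchable):
    """Assumed rule: pin in HBM the difficult heads that prefetch cannot hide."""
    return [max(0, d - prefetchable) for d in difficult_heads]


# Made-up two-head example; importance is normalized to [0, 1].
prev_q = [[1.0, 0.0], [1.0, 0.0]]
curr_q = [[0.99, 0.05], [0.0, 1.0]]
hits = cache_hits(prev_q, curr_q, importance=[0.25, 1.0])
pinned = persistent_heads_per_layer(difficult_heads=[3, 6], prefetchable=4)
```

In this toy case the first head reuses its cached selection, the second misses, and only the second layer needs heads pinned in HBM.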
This suggests two distinct engine modalities. One is compile-time: define the coupled search space, build surrogates, and return a model-hardware pair. The other is runtime: define a coarse semantic unit of reuse, attach low-cost metadata to it, and let a controller map that unit across memory tiers or execution modes on the fly.
5. Hardware and system embodiments
The hardware side of these engines is highly heterogeneous, but several recurring strategies appear. One strategy is to build dedicated compute substrates that directly reflect the transformed workload. GenDRAM uses 8 Search PUs and 24 Compute PUs on a 32 GB, 1024-layer M3D DRAM substrate, with Search PUs tailored to PTR/CAL lookups and filtering, and multiplier-less Compute PUs supporting semiring-style dynamic programming (Lu et al., 27 Feb 2026). ViM-Q deploys a runtime-parameterizable FPGA accelerator with a LUT-based linear engine that replaces multiplications by shift-add operations for APoT weights, and a three-stage SSM engine that preserves sequential recurrence while parallelizing the state dimension (Lyu et al., 3 May 2026). AccLLM similarly uses an FPGA-based reconfigurable computing engine that switches between MM and VM modes to match prefill and decode in Llama-2-7B, while exploiting 2:4 pruning, Λ-shaped attention, and W2A8KV4 (Liang et al., 7 Apr 2025).
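To illustrate how shift-add can stand in for multiplication when a weight is a sum of powers of two, as in the ViM-Q linear engine described above, here is a minimal integer sketch. The encoding (a sign plus a tuple of shift amounts) and the function names are hypothetical; no lookup table, fixed-point scaling, or hardware pipelining is modeled.

```python
def shift_add_multiply(x, sign, shifts):
    """Multiply integer x by sign * sum(2**s for s in shifts) using shifts only."""
    total = 0
    for s in shifts:
        total += x << s  # x * 2**s without a multiplier
    return total if sign >= 0 else -total


def shift_add_dot(xs, encoded_weights):
    """Dot product where each weight is given as (sign, shifts)."""
    return sum(shift_add_multiply(x, sign, shifts)
               for x, (sign, shifts) in zip(xs, encoded_weights))


# Weight 10 = 2**3 + 2**1 and weight -4 = -(2**2), in integer units.
weights = [(+1, (3, 1)), (-1, (2,))]
result = shift_add_dot([7, -2], weights)
```

Each weight costs as many shifted additions as it has power-of-two terms, so restricting weights to a small number of terms bounds the hardware cost per multiply.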
A second strategy is heterogeneous robustness-aware execution. HybridAC partitions a convolution along its input channels, mapping sensitive input channels to digital hardware and less sensitive channels to analog PIM (Behnam et al., 2022). This structured split then enables lower ADC precision, peripheral reduction, and hybrid quantization. RAMAN organizes a related but numerically different stack: REAP MAC uses approximate posit(8,2) multiplication, the VEU provides reusable vector execution across ANN, CNN, and transformer-style workloads, and approximation-aware training estimates application-level accuracy under the induced arithmetic behavior (Khan et al., 26 Oct 2025).
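Because a convolution is a sum over input channels, it can be split into two partial sums over disjoint channel sets, in notation assumed here $Y = \sum_{c \in S} W_c * X_c + \sum_{c \notin S} W_c * X_c$, with $S$ the sensitive set. The sketch below shows this for a single pointwise output. The sensitivity scores are supplied as plain numbers rather than computed from a Hessian, and `analog_gain` is a crude stand-in for analog variation; both are assumptions for illustration.

```python
def split_channels(sensitivity, digital_fraction=0.25):
    """Send the most sensitive channels to the digital path (assumed policy)."""
    k = max(1, round(len(sensitivity) * digital_fraction))
    ranked = sorted(range(len(sensitivity)),
                    key=lambda c: sensitivity[c], reverse=True)
    return sorted(ranked[:k]), sorted(ranked[k:])


def partial_sum(x, w, channels, gain=1.0):
    """Accumulate x[c] * w[c] over the given channels, scaled by a path gain."""
    return gain * sum(x[c] * w[c] for c in channels)


def hybrid_output(x, w, sensitivity, analog_gain=1.0, digital_fraction=0.25):
    """Exact digital partial sum plus a (possibly perturbed) analog partial sum."""
    digital, analog = split_channels(sensitivity, digital_fraction)
    return (partial_sum(x, w, digital)
            + partial_sum(x, w, analog, gain=analog_gain))


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0, 2.0, 0.25]
sensitivity = [0.1, 0.9, 0.3, 0.2]  # made-up scores; channel 1 is most sensitive
full = sum(xi * wi for xi, wi in zip(x, w))
hybrid = hybrid_output(x, w, sensitivity)
```

With an ideal analog path the hybrid result matches the unsplit sum, so the partition changes only where each term is computed and therefore which terms are exposed to variation.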
A third strategy is to let physical storage constraints shape the algorithm itself. DDC-PIM transforms adjacent convolution filters into biased-complementary pairs using FCC-aware pre-training and FCC-aware QAT, then maps one member of the pair onto the 6T SRAM cell so that the cell's complementary state implicitly represents the other. During reconstruction, the ARU adds back the mean term (Duan et al., 2023). This is a particularly explicit case of data-algorithm-hardware isomorphism: the model is trained to satisfy a relation that the bitcell physics already enforces.
A fourth strategy is cross-IP integration at system level. Euphrates adds a motion controller beside a CNN accelerator, exports motion vectors from the ISP through frame-buffer metadata, and keeps the entire continuous-vision loop autonomous so the CPU need not wake on every frame (Zhu et al., 2018). CLO, on a different stack, adds a zero-copy transfer engine built on GDRCopy, persistent and transient HBM regions for KV heads, and GPU-centric synchronization so the GPU waits at the actual data dependency rather than the CPU waiting before launching later kernels (Yi et al., 18 Nov 2025). AutoSoC extends this system-level scope all the way to ASIC back end: Air Learning trains and validates policies, FlexACL maps them onto an accelerator template, and the flow continues through HLS, memory implementation choices, place-and-route, and floorplanned layout (Krishnan et al., 2021).
6. Empirical effects, limitations, and unresolved issues
The reported gains across these papers are substantial but heterogeneous. Euphrates achieves up to 66% SoC-level energy savings, or $4\times$ for the vision computations, with only 1% accuracy loss in continuous mobile vision (Zhu et al., 2018). GenDRAM reports over 68x speedup on APSP and over 22x on the end-to-end genomics pipeline versus state-of-the-art GPU systems (Lu et al., 27 Feb 2026). ViM-Q reports an average 4.96x speedup and 59.8x energy efficiency gain over a quantized RTX 3090 baseline for low-batch ViM-t inference, while AccLLM reports 4.07x energy efficiency and 2.98x throughput over FlightLLM on the Xilinx Alveo U280 (Lyu et al., 3 May 2026, Liang et al., 7 Apr 2025). CLO reports 9.3%–66.6% decoding throughput improvement over prior KV offloading systems while preserving comparable accuracy (Yi et al., 18 Nov 2025). HybridAC reduces accuracy degradation from 60–90% without protection to 1–2% under variation as high as 50%, while also improving execution time, energy, area, and power relative to ISAAC and SRE (Behnam et al., 2022). DDC-PIM reports speedups on MobileNetV2 and EfficientNet-B0 with negligible accuracy loss compared with its PIM baseline, along with improvements in weight density and area efficiency over prior SRAM-based PIM macros (Duan et al., 2023).
At the same time, the literature is explicit about several limitations. Many frameworks remain only partially data-centric. A3C3 states that it is primarily an algorithm-hardware co-design methodology with substantial workload/memory awareness rather than a full data-pipeline co-design system (Yildirim et al., 18 Jun 2026). AutoSoC does not provide a formal NAS algorithm and reports only partial exploration of all listed hardware design knobs because of time constraints (Krishnan et al., 2021). ViM-Q is manual rather than fully automated in several crucial choices, including the smoothing hyperparameter, the APoT basis, and the decision to keep the SSM in high precision, while already using 80.81% of BRAM resources on ZCU102 (Lyu et al., 3 May 2026). AccLLM is centered on Llama-2-7B and does not present a complete automated exploration of pruning, quantization, attention policy, and hardware together (Liang et al., 7 Apr 2025). CLO assumes a PCIe-connected GPU with GDRCopy-like zero-copy support and enough CPU DRAM to hold offloaded KVCache, so its design is tied to a particular host-device memory model (Yi et al., 18 Nov 2025).
There are also more specific caveats. RAMAN contains an internal inconsistency: its abstract and conclusion state 31.28% power reduction, whereas the raw table values imply a much larger reduction; the paper description explicitly treats this as a paper inconsistency rather than a derived property of the design (Khan et al., 26 Oct 2025). The supplied record for GCoD, despite its title and abstract, is identified as an IEEE conference template with no technical GCN content, so it cannot be used as evidence for graph irregularity mitigation or accelerator architecture (You et al., 2021). More broadly, the surveyed literature suggests that “Data-Algorithm Co-Design Engine” is not yet a settled field label. It is best understood as an umbrella description for systems that make data or workload structure a first-class optimization variable alongside algorithm and hardware, with different papers occupying different points on that spectrum (Yildirim et al., 18 Jun 2026).
Taken together, these works suggest that the most durable contribution of the paradigm is methodological. Efficient deployment emerges when the engine jointly decides what state to keep, what state to approximate, how to encode it, where to place it, and which hardware path should execute it. In some cases the decisive signal is motion metadata already computed in the ISP; in others it is head-wise KV reuse difficulty, tensor outlier structure, semiring-specific dataflow, tiered DRAM latency, or channel-wise sensitivity to analog variation (Zhu et al., 2018, Yi et al., 18 Nov 2025, Lyu et al., 3 May 2026, Lu et al., 27 Feb 2026, Behnam et al., 2022). A plausible synthesis is that future co-design engines will be distinguished less by whether they search over models or hardware, and more by whether they expose data representation and memory behavior as explicit search and control variables.