In-HBM Relay-Race Inference
- In-HBM relay-race inference is a dataflow architecture that uses high-bandwidth memory as both storage and state relay to enable concurrent, pipelined neural computations.
- It employs optimized scheduling, precise memory management, and compiler algorithms like knapsack offloading to balance compute load and latency.
- Empirical results in CNN and generative recommendation pipelines show significant speedups and improved tail-latency, validating the design's scalability.
In-HBM relay-race inference refers to a class of high-throughput inference system designs that leverage High-Bandwidth Memory (HBM) not only for storage or parameter transfer, but as a critical medium for state "handoff" across multi-stage or deeply pipelined neural computations. The approach draws on the metaphor of a relay-race: intermediate computational state—such as CNN activations, key/value caches, or precomputed model prefixes—is represented as a "baton" passed efficiently from one processing component to the next, with HBM acting as the interconnect and buffer to enable full pipeline concurrency and strict tail-latency guarantees. The paradigm has been realized in both hardware-accelerated convolutional neural networks (CNNs) and cross-stage production generative-recommendation (GR) pipelines (Doumet et al., 2024, Wang et al., 5 Jan 2026).
1. Conceptual Foundation: Relay-Race Pipeline Across Computation Layers and Stages
The relay-race pipeline is fundamentally a dataflow architecture where processing elements (PEs) corresponding to neural network layers or serving-pipeline stages are instantiated in a chain. Each PE computes as soon as its minimal data dependencies are satisfied; as soon as intermediate outputs are available, they are forwarded (baton-passed) to the next PE via FIFO buffers underpinned by HBM or on-chip memory. No stage waits for complete upstream computation before commencing, enabling deep overlap and maximizing hardware utilization.
- In CNN accelerators, each layer is mapped to a distinct PE, with outputs line-buffered and immediately consumed by the next PE. All layers operate concurrently, supported by activations and weights mapped to HBM or on-chip buffers (Doumet et al., 2024).
- In GR recommender pipelines, long-term user-behavior prefixes are pre-inferred and stored as key/value caches in HBM across retrieval, pre-processing, and ranking stages. The prefix is only extended for the final ranking, minimizing critical-path compute (Wang et al., 5 Jan 2026).
This architectural pattern is independent of model type and is characterized by fully pipelined, resource-efficient execution where HBM serves a dual role as high-bandwidth storage and as the state relay mechanism.
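The pattern can be illustrated with a minimal, framework-agnostic sketch (not taken from either paper): each processing element runs concurrently and consumes from a bounded FIFO, which stands in for the HBM-backed handoff buffer between stages.

```python
# Minimal sketch of relay-race pipelining: each "PE" starts computing as soon as its
# input FIFO yields a tile, so all stages run concurrently. The bounded queue plays
# the role of the HBM/on-chip FIFO that hands the baton to the next stage.
import threading
import queue

def make_pe(name, transform, in_fifo, out_fifo):
    """A processing element: consume tiles from in_fifo, emit results to out_fifo."""
    def run():
        while True:
            tile = in_fifo.get()
            if tile is None:              # end-of-stream marker propagates downstream
                out_fifo.put(None)
                return
            out_fifo.put(transform(tile))
    return threading.Thread(target=run, name=name)

# Hypothetical 3-stage pipeline; FIFO depth bounds the in-flight "baton" state.
fifos = [queue.Queue(maxsize=4) for _ in range(4)]
stages = [
    make_pe("conv1", lambda t: t + 1, fifos[0], fifos[1]),
    make_pe("conv2", lambda t: t * 2, fifos[1], fifos[2]),
    make_pe("fc",    lambda t: t - 3, fifos[2], fifos[3]),
]
for s in stages:
    s.start()

for tile in range(8):                     # feed input tiles; stages overlap immediately
    fifos[0].put(tile)
fifos[0].put(None)                        # signal end of stream

results = []
while (r := fifos[-1].get()) is not None:
    results.append(r)
print(results)                            # [(t + 1) * 2 - 3 for t in range(8)]
```

The bounded FIFO depth is the key tuning knob: deep enough to hide producer latency, shallow enough to cap the memory held by in-flight state.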
2. Technical Mechanisms: Compiler Algorithms, Bandwidth Models, and Memory Management
Modern implementations of in-HBM relay-race inference entail intricate algorithms for layer/resource allocation, precise modeling of memory bandwidth and latency profiles, and tiered cache management. The following elements are central:
CNN Accelerator Pipeline (Doumet et al., 2024)
- Per-Layer PE Microarchitecture: Each layer's PE is customized to that layer's kernel size, channel dimensions, and parallelism factors.
- Offloading via Knapsack Algorithm: A binary decision variable selects, for each layer, whether its weights are mapped to HBM (1) or kept on-chip (0). The knapsack-style greedy selection ranks layers by a heuristic score that maximizes on-chip memory savings per unit of HBM bandwidth cost, subject to HBM pseudo-channel bandwidth constraints (see the sketch after this list).
- HBM Throughput and FIFO Sizing: Empirical profiling yields an HBM read-efficiency curve as a function of burst length, which determines the effective bandwidth available to each offloaded layer. FIFO depths are sized to hide HBM access latency so that weight and activation fetches never stall the compute pipeline.
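A minimal sketch of these two compile-time decisions follows, under a deliberately simplified cost model; the layer names, capacities, and scoring heuristic are illustrative assumptions rather than the actual H2PIPE compiler pass.

```python
# Sketch of knapsack-style greedy offloading and FIFO sizing. All numbers and the
# cost model are illustrative assumptions, not the H2PIPE implementation.
import math
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    onchip_bytes: int       # on-chip memory freed if this layer's weights move to HBM
    hbm_bw_cost: float      # HBM bandwidth (GB/s) consumed if offloaded
    offload: bool = False   # binary decision: True = weights in HBM, False = on-chip

def greedy_offload(layers, onchip_budget_bytes, pc_bandwidth, num_pcs):
    """Offload layers (best memory-saved-per-bandwidth first) until the on-chip
    budget is met, without exceeding aggregate pseudo-channel bandwidth."""
    onchip_used = sum(l.onchip_bytes for l in layers)   # start with everything on-chip
    bw_left = pc_bandwidth * num_pcs
    for layer in sorted(layers, key=lambda l: l.onchip_bytes / l.hbm_bw_cost, reverse=True):
        if onchip_used <= onchip_budget_bytes:
            break
        if layer.hbm_bw_cost <= bw_left:
            layer.offload = True
            onchip_used -= layer.onchip_bytes
            bw_left -= layer.hbm_bw_cost
    return [l.name for l in layers if l.offload]

def fifo_depth(hbm_latency_cycles, words_consumed_per_cycle):
    """Size a FIFO deep enough that the PE keeps consuming while an HBM access is in flight."""
    return math.ceil(hbm_latency_cycles * words_consumed_per_cycle)

layers = [Layer("conv1", 2 << 20, 3.0), Layer("conv2", 8 << 20, 5.0), Layer("fc", 16 << 20, 2.0)]
print(greedy_offload(layers, onchip_budget_bytes=10 << 20, pc_bandwidth=12.0, num_pcs=16))
print(fifo_depth(hbm_latency_cycles=120, words_consumed_per_cycle=0.5))  # -> 60
```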
GR Inference Pipeline (Wang et al., 5 Jan 2026)
- Sequence-Aware Trigger: Admission control ensures that only requests at risk of SLO violation are pre-inferred and cached, bounding the total live HBM footprint and the pre-inference load per device and across the cluster (a minimal trigger sketch follows this list).
- Affinity-Aware Router and Memory-Aware Expander: Routing based on consistent hashing ensures that both prefix-caching and ranking RPCs resolve to the same serving instance, precluding remote fetches (a routing sketch also follows this list). A DRAM tier with at-most-once reload extends prefix-cache lifetime and absorbs bursty access without cross-server HBM traffic.
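The sequence-aware trigger can be sketched as a simple budgeted admission test; the thresholds, byte accounting, and class interface below are illustrative assumptions, not RelayGR's production logic.

```python
# Hedged sketch of a sequence-aware admission trigger: only requests whose long
# prefixes put the ranking stage at risk of missing its SLO are pre-inferred, and
# admissions stop once the per-device HBM cache budget is exhausted.
class SequenceAwareTrigger:
    def __init__(self, hbm_cache_budget_bytes, risk_seq_len, bytes_per_token):
        self.budget = hbm_cache_budget_bytes
        self.live_bytes = 0                      # footprint of admitted, unconsumed prefixes
        self.risk_seq_len = risk_seq_len
        self.bytes_per_token = bytes_per_token

    def should_preinfer(self, user_seq_len):
        """Admit only SLO-risky sequences that still fit in the live HBM budget."""
        if user_seq_len < self.risk_seq_len:     # short prefixes rank in time without caching
            return False
        needed = user_seq_len * self.bytes_per_token
        if self.live_bytes + needed > self.budget:
            return False                         # bound the live HBM footprint per device
        self.live_bytes += needed
        return True

    def release(self, user_seq_len):
        """Called when the ranking stage consumes (or expires) a cached prefix."""
        self.live_bytes -= user_seq_len * self.bytes_per_token
```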
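The affinity-aware router can likewise be illustrated with a standard consistent-hashing ring; instance names and the virtual-node count are hypothetical, and the hash function is only an example.

```python
# Consistent-hashing router sketch: hashing the user key onto a ring of virtual
# nodes makes the pre-inference (cache-producing) RPC and the later ranking
# (cache-consuming) RPC resolve to the same instance, so the prefix KV cache is
# always local to the consumer.
import bisect
import hashlib

class AffinityRouter:
    def __init__(self, instances, vnodes=64):
        self.ring = sorted(
            (self._h(f"{inst}#{v}"), inst)
            for inst in instances for v in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_id):
        """Both the prefix-caching and the ranking RPC call this with the same key."""
        i = bisect.bisect(self.keys, self._h(user_id)) % len(self.ring)
        return self.ring[i][1]

router = AffinityRouter(["npu-0", "npu-1", "npu-2"])
assert router.route("user-42") == router.route("user-42")   # producer and consumer co-locate
```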
3. Cross-Domain Realizations: Hardware and System Implementations
Distinct implementations of in-HBM relay-race inference have been developed for high-throughput CNN hardware and production GR systems:
| Domain | Pipeline Stages | HBM Role | Key Resource Controls |
|---|---|---|---|
| CNN FPGA Accelerators | L hardware PEs (layers) | Storage for weights/activations, relay between PEs | Knapsack offloading, pseudo-channel assignment |
| Generative Recommendation | Retrieval, Preproc, Rank | Prefix (KV) cache handoff, lifecycle cache | Admission bounds, affinity routing, DRAM expander |
- CNN accelerators (e.g., H2PIPE) instantiate a chain of layer PEs, with HBM providing both storage for large weights and the conduit for pipelined activation/weight exchange (Doumet et al., 2024).
- RelayGR, deployed on Huawei Ascend 910C NPUs (32 GB HBM), manages inference state as per-user caches in HBM, applying load balancer affinity and a multi-tier DRAM/HBM cache hierarchy (Wang et al., 5 Jan 2026).
4. End-to-End Scheduling and Buffer Allocation
Automated compilation and runtime scheduling are central to exploiting relay-race concurrency; they are responsible for:
- Balancing pipeline latencies by tuning per-layer parallelism so that each layer's latency per output is roughly constant across the pipeline (see the sketch after this list).
- Allocating burst lengths and FIFO depths to match both compute and dataflow pressure, with burst length increased for bandwidth-bound layers.
- Assigning offloaded layers to physical pseudo-channels in round-robin (PC 0→15, then 31→16) to maximize controller utilization and minimize stalling.
- Integrating cache logic (credit-counted prefetch, multi-stage FIFOs) to synchronize HBM readiness with compute demand.
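A compile-time sketch of two of these scheduling decisions follows, under a simplified workload model; the op counts, latency target, and layer names are hypothetical placeholders.

```python
# Sketch of the scheduling pass: equalize per-layer latency by scaling parallelism,
# then assign offloaded layers to HBM pseudo-channels in the stated round-robin
# order (PC 0..15, then 31..16). The workload model is deliberately simplified.
import math

PC_ORDER = list(range(0, 16)) + list(range(31, 15, -1))   # 0→15, then 31→16

def balance_parallelism(layer_ops, target_latency_cycles):
    """Choose per-layer parallelism so each layer's latency is roughly the target."""
    return {name: max(1, math.ceil(ops / target_latency_cycles))
            for name, ops in layer_ops.items()}

def assign_pseudo_channels(offloaded_layers):
    """Round-robin offloaded layers across pseudo-channels to spread controller load."""
    return {layer: PC_ORDER[i % len(PC_ORDER)]
            for i, layer in enumerate(offloaded_layers)}

layer_ops = {"conv1": 1.2e6, "conv2": 2.4e6, "fc": 0.3e6}   # hypothetical op counts
parallelism = balance_parallelism(layer_ops, target_latency_cycles=10_000)
pcs = assign_pseudo_channels(["conv2", "fc"])
print(parallelism, pcs)
```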
This regime enables near-peak utilization of both compute and memory fabric—ResNet-18 achieved 75% of the theoretical single-HBM-stack bandwidth under this design (Doumet et al., 2024).
5. Measured Performance and Scalability
Relay-race inference systems demonstrate significant throughput, SLO compliance, and scaling improvements across both domains:
| Model / Pipeline | Max Sequence / Batch-1 Throughput | SLO-Compliance (P99 tail) | Throughput Speedup vs Baseline |
|---|---|---|---|
| CNN (ResNet-18, all-HBM) | 1,811 img/s | — | 19.4× (vs. best prior FPGA) |
| CNN (ResNet-18, hybrid) | 4,174 img/s | — | 2.3× (vs. all-HBM); 2.2× the single-HBM-stack theoretical bound |
| RelayGR (Baseline) | 4K tokens, ~100 QPS | P99 > 135 ms at ~100 concurrency | — |
| RelayGR (In-HBM, 0% DRAM) | 6K tokens (1.5×), ~260 QPS | P99 < 135 ms at ~200 concurrency | 2.6× (QPS gain at SLO) |
| RelayGR (+DRAM, 10% hit) | 8K tokens (2×), ~360 QPS | P99 < 135 ms at ~300 concurrency | 3.6× (QPS gain at SLO) |
For deep CNNs, hybrid HBM+on-chip scheduling provides >19× speed-up over previous FPGAs and can outperform a single HBM-stack’s theoretical maximum via optimal bottleneck buffering. For GR inference, in-HBM relay-race pipelines unlock up to 1.5–2× longer user-history support and 3.6× higher throughput under strict ranking-stage SLOs—enabled by aggressive pipelining, cache affinity, and HBM/DRAM memory hierarchy (Doumet et al., 2024, Wang et al., 5 Jan 2026).
6. Systemic Constraints, Scaling, and Future Directions
Critical to relay-race inference viability are explicit memory, bandwidth, and compute budget enforcement:
- Admission control ties live HBM cache population to hard device limits, preventing overload and ensuring all admitted state persists until consumed.
- Late-binding placement enforces instance affinity for cache consumers/producers, avoiding tail-latency penalties of remote memory access.
- DRAM expanders rate-limit DRAM-to-HBM reloads to avert new critical-path contention, supporting repeated reuse of user caches across server localities without cross-node traffic (sketched after this list).
- Autoscaling and tiered deployment strategies match special instance pools and memory reservations to dynamic workload fractions and sequence-length distributions.
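A sketch of the DRAM-expander behavior under these constraints is shown below; the capacities, token-bucket rate limit, and interface are illustrative assumptions rather than production settings.

```python
# Hedged sketch of a memory-aware expander: evicted prefixes spill to a DRAM tier,
# each prefix may be reloaded into HBM at most once, and reloads sit behind a
# token-bucket rate limit so they never contend with critical-path ranking traffic.
import time

class DramExpander:
    def __init__(self, reloads_per_sec):
        self.dram = {}                  # user_id -> spilled prefix KV bytes
        self.reloaded_once = set()      # enforce at-most-once reload per prefix
        self.rate = reloads_per_sec
        self.tokens = reloads_per_sec
        self.last = time.monotonic()

    def spill(self, user_id, prefix_blob):
        self.dram[user_id] = prefix_blob

    def _take_token(self):
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def reload(self, user_id):
        """Return the prefix for HBM re-insertion, or None if not allowed right now."""
        if user_id in self.reloaded_once or user_id not in self.dram:
            return None
        if not self._take_token():      # rate limit keeps reloads off the critical path
            return None
        self.reloaded_once.add(user_id)
        return self.dram.pop(user_id)
```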
A plausible implication is that relay-race inference may generalize to other domains requiring multi-stage, low-latency ML inference over stateful pipelines, provided sufficient HBM or equivalent high-bandwidth memory fabric is available.
7. Conclusion
In-HBM relay-race inference, as exemplified by H2PIPE for deeply pipelined CNNs (Doumet et al., 2024) and RelayGR for production-scale generative recommendation (Wang et al., 5 Jan 2026), represents a mature architectural approach for maximizing inference throughput and concurrency while meeting stringent tail-latency requirements. Through careful orchestration of memory, computation, and dataflow backed by empirical profiling and algorithmic allocation, this paradigm effectively decouples heavy compute stages from critical-path serving latency by leveraging high-bandwidth HBM as both data storage and relay medium. This approach supports higher model complexity, longer sequences, and deeper pipelines under real-world systems constraints.