
In-HBM Relay-Race Inference

Updated 12 January 2026
  • In-HBM relay-race inference is a dataflow architecture that uses high-bandwidth memory as both storage and state relay to enable concurrent, pipelined neural computations.
  • It employs optimized scheduling, precise memory management, and compiler algorithms like knapsack offloading to balance compute load and latency.
  • Empirical results in CNN and generative-recommendation pipelines show significant speedups and improved tail latency, validating the design's scalability.

In-HBM relay-race inference refers to a class of high-throughput inference system designs that leverage High-Bandwidth Memory (HBM) not only for storage or parameter transfer, but as a critical medium for state "handoff" across multi-stage or deeply pipelined neural computations. The approach draws on the metaphor of a relay-race: intermediate computational state—such as CNN activations, key/value caches, or precomputed model prefixes—is represented as a "baton" passed efficiently from one processing component to the next, with HBM acting as the interconnect and buffer to enable full pipeline concurrency and strict tail-latency guarantees. The paradigm has been realized in both hardware-accelerated convolutional neural networks (CNNs) and cross-stage production generative-recommendation (GR) pipelines (Doumet et al., 2024, Wang et al., 5 Jan 2026).

1. Conceptual Foundation: Relay-Race Pipeline Across Computation Layers and Stages

The relay-race pipeline is fundamentally a dataflow architecture where processing elements (PEs) corresponding to neural network layers or serving-pipeline stages are instantiated in a chain. Each PE computes as soon as its minimal data dependencies are satisfied; as soon as intermediate outputs are available, they are forwarded (baton-passed) to the next PE via FIFO buffers underpinned by HBM or on-chip memory. No stage waits for complete upstream computation before commencing, enabling deep overlap and maximizing hardware utilization.

  • In CNN accelerators, each layer is mapped to a distinct PE, with outputs line-buffered and immediately consumed by the next PE. All layers operate concurrently, supported by activations and weights mapped to HBM or on-chip buffers (Doumet et al., 2024).
  • In GR recommender pipelines, long-term user-behavior prefixes are pre-inferred and stored as key/value caches in HBM across retrieval, pre-processing, and ranking stages. The prefix is only extended for the final ranking, minimizing critical-path compute (Wang et al., 5 Jan 2026).

This architectural pattern is independent of model type and is characterized by fully pipelined, resource-efficient execution where HBM serves a dual role as high-bandwidth storage and as the state relay mechanism.
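The baton-passing pattern can be sketched in software as a chain of concurrent stages connected by bounded FIFOs. Below is a minimal Python sketch in which each stage begins work as soon as a single item arrives; the bounded queues stand in for the HBM-backed FIFOs. The stage functions and queue depths are illustrative placeholders, not code from either cited system.

```python
import threading
import queue

def make_stage(fn, q_in, q_out):
    """Run fn on each item as soon as it arrives and pass the result on.

    A bounded queue stands in for an HBM-backed FIFO: each stage starts as
    soon as its minimal input (one tile) is available, so all stages run
    concurrently instead of waiting for full upstream completion.
    """
    def worker():
        while True:
            item = q_in.get()
            if item is None:          # sentinel: propagate shutdown downstream
                q_out.put(None)
                return
            q_out.put(fn(item))
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

# Three toy "layers" acting on integer tiles (placeholders for PE kernels).
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
queues = [queue.Queue(maxsize=4) for _ in range(len(stages) + 1)]
threads = [make_stage(fn, queues[i], queues[i + 1])
           for i, fn in enumerate(stages)]

for tile in [1, 2, 3]:
    queues[0].put(tile)
queues[0].put(None)

results = []
while (out := queues[-1].get()) is not None:
    results.append(out)
print(results)  # each tile flows through all stages: ((x+1)*2)-3 -> [1, 3, 5]
```

Because every inter-stage queue is bounded, backpressure propagates upstream automatically, which is the same role FIFO depth plays in the hardware pipelines described above.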

2. Technical Mechanisms: Compiler Algorithms, Bandwidth Models, and Memory Management

Modern implementations of in-HBM relay-race inference entail intricate algorithms for layer/resource allocation, precise modeling of memory bandwidth and latency profiles, and tiered cache management. The following elements are central:

  • Per-Layer PE Microarchitecture: Each layer's PE is customized for its kernel size, channel factors, and parallelism $(p_i^l, p_o^l)$.
  • Offloading via Knapsack Algorithm: A binary variable $o_l$ selects for each layer whether weights are mapped to HBM (1) or on-chip (0). The knapsack-style greedy selection uses heuristic scores $s_l$ to maximize on-chip memory savings per unit of HBM bandwidth cost, subject to HBM pseudo-channel constraints:

$$s_l = \frac{\left(\left\lceil W_l/20{,}480\right\rceil - 2\right)\times\left\lceil \mathrm{outW}_l/18\right\rceil}{p_i^l\,p_o^l\,80}$$

$$\max\;\sum_l s_l\,o_l \quad \text{s.t.} \quad \sum_l (p_i^l\,p_o^l)\,o_l \le N_{\mathrm{pc}}\times 3$$

  • HBM Throughput and FIFO Sizing: Empirical profiling yields read efficiency $\eta_{\mathrm{rd}}(B)$ for burst length $B$, with $\eta_{\mathrm{rd}}(8) \approx 83\%$ and $\eta_{\mathrm{rd}}(32) \approx 93\%$. FIFO depth $D$ is set to hide HBM latency such that $D \geq \ell_{\max} \times R_d$.
  • Sequence-Aware Trigger: Admission control ensures that only requests at risk of SLO violation are pre-inferred and cached, bounding total live HBM footprint and pre-inference load per device and cluster:

$$L \cdot \mathrm{kv}_{99} \leq r_1 \cdot \mathrm{HBM}$$

$$Q_{\mathrm{admit}} \leq Q_m \cdot M$$

  • Affinity-Aware Router and Memory-Aware Expander: Routing based on consistent hashing ensures both prefix caching and ranking RPCs resolve to the same instance, precluding remote fetches. A DRAM tier with at-most-once reload extends prefix cache lifetime and absorbs bursty access without cross-server HBM traffic.
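A minimal sketch of the knapsack-style offloading pass, following the score and budget formulas above. The field names (`W`, `outW`, `p_i`, `p_o`) and the simple greedy order are assumptions for illustration; the actual compiler pass in H2PIPE may differ.

```python
import math

def knapsack_offload(layers, n_pc=16):
    """Greedy knapsack-style selection of layers to offload to HBM.

    layers: list of dicts with weight count W, output width outW, and
    parallelism factors p_i, p_o (illustrative names, mirroring the score
    formula in the text). Returns the set of offloaded layer indices.
    """
    budget = n_pc * 3  # pseudo-channel constraint: sum(p_i*p_o) <= N_pc * 3
    scored = []
    for idx, l in enumerate(layers):
        cost = l["p_i"] * l["p_o"]
        s = ((math.ceil(l["W"] / 20480) - 2)
             * math.ceil(l["outW"] / 18)) / (cost * 80)
        scored.append((s, cost, idx))
    # Highest on-chip memory savings per unit of HBM bandwidth cost first.
    offloaded, used = set(), 0
    for s, cost, idx in sorted(scored, reverse=True):
        if s > 0 and used + cost <= budget:
            offloaded.add(idx)
            used += cost
    return offloaded
```

Layers whose score is non-positive (weights too small to be worth an HBM round trip) are never offloaded, and the pseudo-channel budget caps the aggregate bandwidth cost of those that are.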

3. Cross-Domain Realizations: Hardware and System Implementations

Distinct implementations of in-HBM relay-race inference have been developed for high-throughput CNN hardware and production GR systems:

| Domain | Pipeline Stages | HBM Role | Key Resource Controls |
|---|---|---|---|
| CNN FPGA accelerators | L hardware PEs (one per layer) | Storage for weights/activations; relay between PEs | Knapsack offloading, pseudo-channel assignment |
| Generative recommendation | Retrieval, preprocessing, ranking | Prefix (KV) cache handoff, lifecycle cache | Admission bounds, affinity routing, DRAM expander |
  • CNN accelerators (e.g., H2PIPE) instantiate a chain of layer PEs, with HBM providing both storage for large weights and the conduit for pipelined activation/weight exchange (Doumet et al., 2024).
  • RelayGR, deployed on Huawei Ascend 910C NPUs (32 GB HBM), manages inference state as per-user caches in HBM, applying load balancer affinity and a multi-tier DRAM/HBM cache hierarchy (Wang et al., 5 Jan 2026).
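The affinity-aware routing used on the GR side can be sketched with a standard consistent-hash ring. This is an assumed structure for illustration, not RelayGR's published code: the point is that the prefix-cache write and the later ranking RPC hash the same user key and therefore resolve to the same serving instance, so no remote HBM fetch is needed.

```python
import bisect
import hashlib

class AffinityRouter:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, instances, vnodes=64):
        # Each instance owns `vnodes` points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{inst}#{v}"), inst)
            for inst in instances for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_id):
        # First ring point clockwise of the key's hash, wrapping around.
        i = bisect.bisect(self.keys, self._hash(user_id)) % len(self.ring)
        return self.ring[i][1]

router = AffinityRouter(["npu-0", "npu-1", "npu-2"])
# Cache write and later ranking RPC for the same user land on one instance.
assert router.route("user-42") == router.route("user-42")
```

Consistent hashing also keeps most keys stable when an instance joins or leaves the pool, which limits cache invalidation during scaling events.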

4. End-to-End Scheduling and Buffer Allocation

Automated compilation and runtime scheduling are central to exploiting relay-race concurrency and are responsible for:

  • Balancing pipeline latencies by tuning per-layer parallelism so that $\frac{k_h^l k_w^l c_i^l c_o^l}{p_i^l p_o^l}$ is roughly constant.
  • Allocating burst lengths BB and FIFO depths DlD_l to match both compute and dataflow pressure, with burst length increased for bandwidth-bound layers.
  • Assigning offloaded layers to physical pseudo-channels in round-robin (PC 0→15, then 31→16) to maximize controller utilization and minimize stalling.
  • Integrating cache logic (credit-counted prefetch, multi-stage FIFOs) to synchronize HBM readiness with compute demand.

This regime enables near-peak utilization of both compute and memory fabric—ResNet-18 achieved 75% of the theoretical single-HBM-stack bandwidth under this design (Doumet et al., 2024).
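Two of the sizing rules above (balancing per-layer parallelism so that work per parallel lane is roughly constant, and setting FIFO depth to hide HBM latency) can be sketched as a simple planning pass. The input field names and profiling inputs (`hbm_latency`, `r_d`) are assumptions for illustration.

```python
import math

def balance_pipeline(layers, target_ii):
    """Pick per-layer parallelism for a common initiation interval, then
    size each FIFO to hide HBM latency (illustrative sketch).

    layers: dicts with kernel dims k_h, k_w, channel counts c_i, c_o, and
    per-layer profiling inputs hbm_latency (cycles) and r_d (reads/cycle).
    target_ii: desired cycles per output, shared by all stages so that
    work / parallelism is roughly constant across the pipeline.
    """
    plans = []
    for l in layers:
        work = l["k_h"] * l["k_w"] * l["c_i"] * l["c_o"]
        par = max(1, math.ceil(work / target_ii))       # p_i * p_o product
        depth = math.ceil(l["hbm_latency"] * l["r_d"])  # D >= l_max * R_d
        plans.append({"parallelism": par, "fifo_depth": depth})
    return plans
```

Because every stage targets the same initiation interval, no single layer becomes the pipeline bottleneck, which is the condition under which the relay pipeline approaches peak utilization.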

5. Measured Performance and Scalability

Relay-race inference systems demonstrate significant throughput, SLO compliance, and scaling improvements across both domains:

| Model / Pipeline | Max Sequence / Batch-1 Throughput | SLO Compliance (P99 tail) | Throughput Speedup vs. Baseline |
|---|---|---|---|
| CNN (ResNet-18, all-HBM) | 1,811 img/s | n/a | 19.4× (vs. best prior FPGA) |
| CNN (ResNet-18, hybrid) | 4,174 img/s | n/a | 2.3× (vs. all-HBM), 2.2× over theoretical bound |
| RelayGR (baseline) | 4K tokens, ~100 QPS | P99 > 135 ms at ~100 concurrency | 1× (reference) |
| RelayGR (in-HBM, 0% DRAM) | 6K tokens (1.5×), ~260 QPS | P99 < 135 ms at ~200 concurrency | 2.6× (QPS gain at SLO) |
| RelayGR (+DRAM, 10% hit) | 8K tokens (2×), ~360 QPS | P99 < 135 ms at ~300 concurrency | 3.6× (QPS gain at SLO) |

For deep CNNs, hybrid HBM+on-chip scheduling provides >19× speed-up over previous FPGAs and can outperform a single HBM-stack’s theoretical maximum via optimal bottleneck buffering. For GR inference, in-HBM relay-race pipelines unlock up to 1.5–2× longer user-history support and 3.6× higher throughput under strict ranking-stage SLOs—enabled by aggressive pipelining, cache affinity, and HBM/DRAM memory hierarchy (Doumet et al., 2024, Wang et al., 5 Jan 2026).

6. Systemic Constraints, Scaling, and Future Directions

Critical to relay-race inference viability are explicit memory, bandwidth, and compute budget enforcement:

  • Admission control ties live HBM cache population to hard device limits, preventing overload and ensuring all admitted state persists until consumed.
  • Late-binding placement enforces instance affinity for cache consumers/producers, avoiding tail-latency penalties of remote memory access.
  • DRAM expanders rate-limit DRAM-to-HBM reloads to avert new critical-path contention, supporting repeated reuse of user caches across server localities without cross-node traffic.
  • Autoscaling and tiered deployment strategies match special instance pools and memory reservations to dynamic workload fractions and sequence-length distributions.
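The admission-control bounds from Section 2 can be expressed as a single predicate over the live cache population and the admitted pre-inference load. A minimal sketch, with parameter names that are assumptions rather than RelayGR's API:

```python
def admit(live_entries, kv99_bytes, hbm_bytes, r1,
          admitted_qps, per_device_qps_cap, num_devices):
    """Admission check for pre-inference, following the two bounds in the
    text (illustrative parameter names):

      L * kv99      <= r1 * HBM   (live HBM cache footprint bound)
      Q_admit       <= Q_m * M    (cluster-wide pre-inference load bound)

    Returns True only if admitting one more request keeps both bounds.
    """
    footprint_ok = (live_entries + 1) * kv99_bytes <= r1 * hbm_bytes
    load_ok = admitted_qps + 1 <= per_device_qps_cap * num_devices
    return footprint_ok and load_ok
```

Sizing the footprint bound with the P99 KV-cache size (`kv99_bytes`) rather than the mean is what makes the guarantee conservative: nearly all admitted state fits even under adversarial sequence-length mixes.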

A plausible implication is that relay-race inference may generalize to other domains requiring multi-stage, low-latency ML inference over stateful pipelines, provided sufficient HBM or equivalent high-bandwidth memory fabric is available.

7. Conclusion

In-HBM relay-race inference, as exemplified by H2PIPE for deeply pipelined CNNs (Doumet et al., 2024) and RelayGR for production-scale generative recommendation (Wang et al., 5 Jan 2026), represents a mature architectural approach for maximizing inference throughput and concurrency while meeting stringent tail-latency requirements. Through careful orchestration of memory, computation, and dataflow backed by empirical profiling and algorithmic allocation, this paradigm effectively decouples heavy compute stages from critical-path serving latency by leveraging high-bandwidth HBM as both data storage and relay medium. This approach supports higher model complexity, longer sequences, and deeper pipelines under real-world systems constraints.
