
In-HBM Relay-Race Inference

Updated 12 January 2026
  • In-HBM relay-race inference is a dataflow architecture that uses high-bandwidth memory as both storage and state relay to enable concurrent, pipelined neural computations.
  • It employs optimized scheduling, precise memory management, and compiler algorithms like knapsack offloading to balance compute load and latency.
  • Empirical results in CNN and generative-recommendation pipelines show significant speedups and improved tail latency, validating the design's scalability.

In-HBM relay-race inference refers to a class of high-throughput inference system designs that leverage High-Bandwidth Memory (HBM) not only for storage or parameter transfer, but as a critical medium for state "handoff" across multi-stage or deeply pipelined neural computations. The approach draws on the metaphor of a relay-race: intermediate computational state—such as CNN activations, key/value caches, or precomputed model prefixes—is represented as a "baton" passed efficiently from one processing component to the next, with HBM acting as the interconnect and buffer to enable full pipeline concurrency and strict tail-latency guarantees. The paradigm has been realized in both hardware-accelerated convolutional neural networks (CNNs) and cross-stage production generative-recommendation (GR) pipelines (Doumet et al., 2024, Wang et al., 5 Jan 2026).

1. Conceptual Foundation: Relay-Race Pipeline Across Computation Layers and Stages

The relay-race pipeline is fundamentally a dataflow architecture where processing elements (PEs) corresponding to neural network layers or serving-pipeline stages are instantiated in a chain. Each PE computes as soon as its minimal data dependencies are satisfied; as soon as intermediate outputs are available, they are forwarded (baton-passed) to the next PE via FIFO buffers underpinned by HBM or on-chip memory. No stage waits for complete upstream computation before commencing, enabling deep overlap and maximizing hardware utilization.

  • In CNN accelerators, each layer is mapped to a distinct PE, with outputs line-buffered and immediately consumed by the next PE. All layers operate concurrently, supported by activations and weights mapped to HBM or on-chip buffers (Doumet et al., 2024).
  • In GR recommender pipelines, long-term user-behavior prefixes are pre-inferred and stored as key/value caches in HBM across retrieval, pre-processing, and ranking stages. The prefix is only extended for the final ranking, minimizing critical-path compute (Wang et al., 5 Jan 2026).

This architectural pattern is independent of model type and is characterized by fully pipelined, resource-efficient execution where HBM serves a dual role as high-bandwidth storage and as the state relay mechanism.
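The baton-passing pattern can be sketched in software as a chain of concurrent stages connected by bounded FIFOs. Below is a minimal Python sketch in which each stage begins work as soon as a single item arrives; the bounded queues stand in for the HBM-backed FIFOs. The stage functions and queue depths are illustrative placeholders, not code from either cited system.

```python
import threading
import queue

def make_stage(fn, q_in, q_out):
    """Run fn on each item as soon as it arrives and pass the result on.

    A bounded queue stands in for an HBM-backed FIFO: each stage starts as
    soon as its minimal input (one tile) is available, so all stages run
    concurrently instead of waiting for full upstream completion.
    """
    def worker():
        while True:
            item = q_in.get()
            if item is None:          # sentinel: propagate shutdown downstream
                q_out.put(None)
                return
            q_out.put(fn(item))
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

# Three toy "layers" acting on integer tiles (placeholders for PE kernels).
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
queues = [queue.Queue(maxsize=4) for _ in range(len(stages) + 1)]
threads = [make_stage(fn, queues[i], queues[i + 1])
           for i, fn in enumerate(stages)]

for tile in [1, 2, 3]:
    queues[0].put(tile)
queues[0].put(None)

results = []
while (out := queues[-1].get()) is not None:
    results.append(out)
print(results)  # each tile flows through all stages: ((x+1)*2)-3 -> [1, 3, 5]
```

Because every inter-stage queue is bounded, backpressure propagates upstream automatically, which is the same role FIFO depth plays in the hardware pipelines described above.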

2. Technical Mechanisms: Compiler Algorithms, Bandwidth Models, and Memory Management

Modern implementations of in-HBM relay-race inference entail intricate algorithms for layer/resource allocation, precise modeling of memory bandwidth and latency profiles, and tiered cache management. The following elements are central:

  • Per-Layer PE Microarchitecture: Each layer's PE is customized for its kernel size, channel factors, and parallelism $(p_i^l, p_o^l)$.
  • Offloading via Knapsack Algorithm: A binary variable $o_l$ selects for each layer whether weights are mapped to HBM (1) or on-chip (0). The knapsack-style greedy selection uses heuristic scores $s_l$ to maximize on-chip memory savings per unit of HBM bandwidth cost, subject to HBM pseudo-channel constraints:

$$s_l = \frac{\left(\left\lceil W_l/20{,}480\right\rceil - 2\right)\times\left\lceil \mathrm{outW}_l/18\right\rceil}{p_i^l\,p_o^l\,80}$$

$$\max\;\sum_l s_l\,o_l \quad \text{s.t.} \quad \sum_l (p_i^l\,p_o^l)\,o_l \le N_{\mathrm{pc}}\times 3$$

  • HBM Throughput and FIFO Sizing: Empirical profiling yields read efficiency $\eta_{\mathrm{rd}}(B)$ for burst length $B$, with $\eta_{\mathrm{rd}}(8) \approx 83\%$ and $\eta_{\mathrm{rd}}(32) \approx 93\%$. FIFO depth $D$ is set to hide HBM latency such that $D \geq \ell_{\max} \times R_d$.
  • Sequence-Aware Trigger: Admission control ensures that only requests at risk of SLO violation are pre-inferred and cached, bounding total live HBM footprint and pre-inference load per device and cluster:

$$L \cdot \mathrm{kv}_{99} \leq r_1 \cdot \mathrm{HBM}$$

$$Q_{\mathrm{admit}} \leq Q_m \cdot M$$

  • Affinity-Aware Router and Memory-Aware Expander: Routing based on consistent hashing ensures both prefix caching and ranking RPCs resolve to the same instance, precluding remote fetches. A DRAM tier with at-most-once reload extends prefix cache lifetime and absorbs bursty access without cross-server HBM traffic.
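A minimal sketch of the knapsack-style offloading pass, following the score and budget formulas above. The field names (`W`, `outW`, `p_i`, `p_o`) and the simple greedy order are assumptions for illustration; the actual compiler pass in H2PIPE may differ.

```python
import math

def knapsack_offload(layers, n_pc=16):
    """Greedy knapsack-style selection of layers to offload to HBM.

    layers: list of dicts with weight count W, output width outW, and
    parallelism factors p_i, p_o (illustrative names, mirroring the score
    formula in the text). Returns the set of offloaded layer indices.
    """
    budget = n_pc * 3  # pseudo-channel constraint: sum(p_i*p_o) <= N_pc * 3
    scored = []
    for idx, l in enumerate(layers):
        cost = l["p_i"] * l["p_o"]
        s = ((math.ceil(l["W"] / 20480) - 2)
             * math.ceil(l["outW"] / 18)) / (cost * 80)
        scored.append((s, cost, idx))
    # Highest on-chip memory savings per unit of HBM bandwidth cost first.
    offloaded, used = set(), 0
    for s, cost, idx in sorted(scored, reverse=True):
        if s > 0 and used + cost <= budget:
            offloaded.add(idx)
            used += cost
    return offloaded
```

Layers whose score is non-positive (weights too small to be worth an HBM round trip) are never offloaded, and the pseudo-channel budget caps the aggregate bandwidth cost of those that are.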

3. Cross-Domain Realizations: Hardware and System Implementations

Distinct implementations of in-HBM relay-race inference have been developed for high-throughput CNN hardware and production GR systems:

| Domain | Pipeline Stages | HBM Role | Key Resource Controls |
|---|---|---|---|
| CNN FPGA accelerators | L hardware PEs (one per layer) | Storage for weights/activations; relay between PEs | Knapsack offloading, pseudo-channel assignment |
| Generative recommendation | Retrieval, preprocessing, ranking | Prefix (KV) cache handoff, lifecycle cache | Admission bounds, affinity routing, DRAM expander |
  • CNN accelerators (e.g., H2PIPE) instantiate a chain of layer PEs, with HBM providing both storage for large weights and the conduit for pipelined activation/weight exchange (Doumet et al., 2024).
  • RelayGR, deployed on Huawei Ascend 910C NPUs (32 GB HBM), manages inference state as per-user caches in HBM, applying load balancer affinity and a multi-tier DRAM/HBM cache hierarchy (Wang et al., 5 Jan 2026).
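The affinity-aware routing used on the GR side can be sketched with a standard consistent-hash ring. This is an assumed structure for illustration, not RelayGR's published code: the point is that the prefix-cache write and the later ranking RPC hash the same user key and therefore resolve to the same serving instance, so no remote HBM fetch is needed.

```python
import bisect
import hashlib

class AffinityRouter:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, instances, vnodes=64):
        # Each instance owns `vnodes` points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{inst}#{v}"), inst)
            for inst in instances for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_id):
        # First ring point clockwise of the key's hash, wrapping around.
        i = bisect.bisect(self.keys, self._hash(user_id)) % len(self.ring)
        return self.ring[i][1]

router = AffinityRouter(["npu-0", "npu-1", "npu-2"])
# Cache write and later ranking RPC for the same user land on one instance.
assert router.route("user-42") == router.route("user-42")
```

Consistent hashing also keeps most keys stable when an instance joins or leaves the pool, which limits cache invalidation during scaling events.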

4. End-to-End Scheduling and Buffer Allocation

Automated compilation and runtime scheduling are central to exploiting relay-race concurrency and are responsible for:

  • Balancing pipeline latencies by tuning per-layer parallelism so that $\frac{k_h^l k_w^l c_i^l c_o^l}{p_i^l p_o^l}$ is roughly constant.
  • Allocating burst lengths BB and FIFO depths DlD_l to match both compute and dataflow pressure, with burst length increased for bandwidth-bound layers.
  • Assigning offloaded layers to physical pseudo-channels in round-robin (PC 0→15, then 31→16) to maximize controller utilization and minimize stalling.
  • Integrating cache logic (credit-counted prefetch, multi-stage FIFOs) to synchronize HBM readiness with compute demand.

This regime enables near-peak utilization of both compute and memory fabric—ResNet-18 achieved 75% of the theoretical single-HBM-stack bandwidth under this design (Doumet et al., 2024).
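Two of the sizing rules above (balancing per-layer parallelism so that work per parallel lane is roughly constant, and setting FIFO depth to hide HBM latency) can be sketched as a simple planning pass. The input field names and profiling inputs (`hbm_latency`, `r_d`) are assumptions for illustration.

```python
import math

def balance_pipeline(layers, target_ii):
    """Pick per-layer parallelism for a common initiation interval, then
    size each FIFO to hide HBM latency (illustrative sketch).

    layers: dicts with kernel dims k_h, k_w, channel counts c_i, c_o, and
    per-layer profiling inputs hbm_latency (cycles) and r_d (reads/cycle).
    target_ii: desired cycles per output, shared by all stages so that
    work / parallelism is roughly constant across the pipeline.
    """
    plans = []
    for l in layers:
        work = l["k_h"] * l["k_w"] * l["c_i"] * l["c_o"]
        par = max(1, math.ceil(work / target_ii))       # p_i * p_o product
        depth = math.ceil(l["hbm_latency"] * l["r_d"])  # D >= l_max * R_d
        plans.append({"parallelism": par, "fifo_depth": depth})
    return plans
```

Because every stage targets the same initiation interval, no single layer becomes the pipeline bottleneck, which is the condition under which the relay pipeline approaches peak utilization.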

5. Measured Performance and Scalability

Relay-race inference systems demonstrate significant throughput, SLO compliance, and scaling improvements across both domains:

| Model / Pipeline | Max Sequence / Batch-1 Throughput | SLO Compliance (P99 tail) | Throughput Speedup vs. Baseline |
|---|---|---|---|
| CNN (ResNet-18, all-HBM) | 1,811 img/s | n/a | 19.4× (vs. best prior FPGA) |
| CNN (ResNet-18, hybrid) | 4,174 img/s | n/a | 2.3× (vs. all-HBM), 2.2× over theoretical bound |
| RelayGR (baseline) | 4K tokens, ~100 QPS | P99 > 135 ms at ~100 concurrency | 1× (reference) |
| RelayGR (in-HBM, 0% DRAM) | 6K tokens (1.5×), ~260 QPS | P99 < 135 ms at ~200 concurrency | 2.6× (QPS gain at SLO) |
| RelayGR (+DRAM, 10% hit) | 8K tokens (2×), ~360 QPS | P99 < 135 ms at ~300 concurrency | 3.6× (QPS gain at SLO) |

For deep CNNs, hybrid HBM+on-chip scheduling provides >19× speed-up over previous FPGAs and can outperform a single HBM-stack’s theoretical maximum via optimal bottleneck buffering. For GR inference, in-HBM relay-race pipelines unlock up to 1.5–2× longer user-history support and 3.6× higher throughput under strict ranking-stage SLOs—enabled by aggressive pipelining, cache affinity, and HBM/DRAM memory hierarchy (Doumet et al., 2024, Wang et al., 5 Jan 2026).

6. Systemic Constraints, Scaling, and Future Directions

Critical to relay-race inference viability are explicit memory, bandwidth, and compute budget enforcement:

  • Admission control ties live HBM cache population to hard device limits, preventing overload and ensuring all admitted state persists until consumed.
  • Late-binding placement enforces instance affinity for cache consumers/producers, avoiding tail-latency penalties of remote memory access.
  • DRAM expanders rate-limit DRAM-to-HBM reloads to avert new critical-path contention, supporting repeated reuse of user caches across server localities without cross-node traffic.
  • Autoscaling and tiered deployment strategies match special instance pools and memory reservations to dynamic workload fractions and sequence-length distributions.
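The admission-control bounds from Section 2 can be expressed as a single predicate over the live cache population and the admitted pre-inference load. A minimal sketch, with parameter names that are assumptions rather than RelayGR's API:

```python
def admit(live_entries, kv99_bytes, hbm_bytes, r1,
          admitted_qps, per_device_qps_cap, num_devices):
    """Admission check for pre-inference, following the two bounds in the
    text (illustrative parameter names):

      L * kv99      <= r1 * HBM   (live HBM cache footprint bound)
      Q_admit       <= Q_m * M    (cluster-wide pre-inference load bound)

    Returns True only if admitting one more request keeps both bounds.
    """
    footprint_ok = (live_entries + 1) * kv99_bytes <= r1 * hbm_bytes
    load_ok = admitted_qps + 1 <= per_device_qps_cap * num_devices
    return footprint_ok and load_ok
```

Sizing the footprint bound with the P99 KV-cache size (`kv99_bytes`) rather than the mean is what makes the guarantee conservative: nearly all admitted state fits even under adversarial sequence-length mixes.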

A plausible implication is that relay-race inference may generalize to other domains requiring multi-stage, low-latency ML inference over stateful pipelines, provided sufficient HBM or equivalent high-bandwidth memory fabric is available.

7. Conclusion

In-HBM relay-race inference, as exemplified by H2PIPE for deeply pipelined CNNs (Doumet et al., 2024) and RelayGR for production-scale generative recommendation (Wang et al., 5 Jan 2026), represents a mature architectural approach for maximizing inference throughput and concurrency while meeting stringent tail-latency requirements. Through careful orchestration of memory, computation, and dataflow backed by empirical profiling and algorithmic allocation, this paradigm effectively decouples heavy compute stages from critical-path serving latency by leveraging high-bandwidth HBM as both data storage and relay medium. This approach supports higher model complexity, longer sequences, and deeper pipelines under real-world systems constraints.
