
RelayGR: Scalable Real-Time Recommendation

Updated 12 January 2026
  • RelayGR is a production system that enables scalable, real-time inference in generative recommendation models by pre-inferencing and caching user-behavior prefixes.
  • It leverages high-bandwidth memory and server-local DRAM to manage cache lifecycle and ensure low tail-latency even under high QPS conditions.
  • Empirical results demonstrate that RelayGR increases feasible sequence lengths and throughput by up to 3.6× while maintaining strict service-level objectives.

RelayGR is a production system designed to enable scalable, real-time inference for long-sequence generative recommendation (GR) models under strict tail-latency service-level objectives (SLOs). It exploits the observation that the majority of compute in contemporary Transformer-style recommenders is expended encoding user-behavior prefixes that are independent of candidate items. RelayGR achieves significant throughput and sequence-length improvements by pre-inferring, caching, and reusing these user-specific prefix states across the multi-stage recommender pipeline, operating within bounded high-bandwidth memory (HBM) and server-local DRAM constraints, and safely managing cache lifecycle under high query per second (QPS) conditions (Wang et al., 5 Jan 2026).

1. System Motivation and Architectural Overview

Modern industrial recommenders process queries in a three-stage cascade: retrieval, pre-processing (coarse ranking), and fine-grained ranking, with ranking typically constrained to P99 deadlines on the order of tens of milliseconds. Generative recommendation models benefit from ingesting user histories of up to several thousand tokens, but their practical online sequence length is tightly bounded by SLO budgets in the ranking stage.

RelayGR addresses the core efficiency bottleneck by pre-inferencing the user-behavior prefix—i.e., the per-user, candidate-independent key/value (KV) cache for the Transformer backbone—upstream in the pipeline. This prefix cache is kept in HBM, enabling the final ranking phase to reuse the computed state without recomputing or memory swapping. The approach transforms an otherwise intractable problem of universal caching at billion-user scale into a feasible, bounded-lifecycle cache design that always meets pipeline correctness constraints.

Key architectural challenges include:

  • Guaranteeing cross-stage survivability of the prefix cache across pipeline phases under late binding of requests.
  • Bounding the per-user cache footprint, which may reach multiple MBs per user, to avoid exceeding device HBM capacity.
  • Preventing pre-inference overload at high QPS and ensuring all admitted caches persist for the full lifecycle window without incurring queuing tail amplification.

RelayGR enforces two critical invariants:

  1. No remote fetch on the ranking critical path: All requests either find the prefix cache locally (in HBM or local DRAM) or fall back to full inference.
  2. Bounded cache survivability: The system limits both the number of live caches and pre-inference rate such that caches persist through the request lifecycle.
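The first invariant reduces to a purely local lookup chain on the ranking path. The sketch below is a hypothetical rendering of that decision (helper names like full_prefix_inference are illustrative, not the production API): check HBM, then local DRAM, and on a double miss recompute rather than fetch remotely.

```python
# Invariant 1 as code (illustrative): the ranking critical path consults only
# server-local tiers; a miss degrades to full inference, never a remote fetch.
def rank(user_id, candidates, hbm_cache, dram_cache):
    prefix = hbm_cache.get(user_id)          # tier 1: device HBM
    source = "hbm"
    if prefix is None:
        prefix = dram_cache.get(user_id)     # tier 2: server-local DRAM
        source = "dram"
    if prefix is None:
        prefix = full_prefix_inference(user_id)  # fallback: recompute locally
        source = "recomputed"
    return score(prefix, candidates), source

def full_prefix_inference(user_id):
    # Stand-in for running the Transformer backbone over the user history.
    return f"kv({user_id})"

def score(prefix, candidates):
    # Stand-in for candidate scoring against the cached prefix state.
    return [(c, hash((prefix, c)) % 100) for c in candidates]
```

Every path yields a valid score; the cache tiers only change the cost, which is why misrouting or eviction can never corrupt results.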

2. Core Algorithmic Components

RelayGR decomposes the cross-stage KV caching challenge into three principal modules operating in concert:

2.1 Sequence-Aware Trigger (Admission)

This component admits only those requests whose inline inference would risk exceeding the ranking P99 budget. During retrieval, lightweight metadata (actual prefix length, feature dimension, etc.) is used to estimate cost:

if cost_estimate(ℓ) < ranking_P99_budget:
    mark_request_normal()              # short prefix: inline inference meets the SLO
else:
    if admission_controller.allow():   # budgets in Table 1 permit a new live cache
        send_preinfer_signal(userID)   # kick off prefix pre-inference upstream
    mark_request_special(userID)       # route to a "special" instance either way

Admission logic enforces:

  • HBM footprint budget: $L \times kv_{p99} \leq r_1 \times \mathrm{HBM}$, where $L$ is the live cache count, $kv_{p99}$ is the P99 cache size, and $r_1$ is the reserved HBM fraction.
  • Pre-infer workload cap: $Q_{\mathrm{admit}} \leq Q_m \cdot M$, where $Q_m$ is the per-model-slot throughput and $M$ is the number of model slots per “special” instance.

Table 1: Admission Policy Parameters

Parameter            | Meaning                  | Constraint
$L$                  | Live cache count         | $L \leq r_1 \times \mathrm{HBM} / kv_{p99}$
$Q_{\mathrm{admit}}$ | Pre-infer admission rate | $Q_{\mathrm{admit}} \leq Q_m M$
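The two admission budgets in Table 1 can be sketched as a small controller. This is an illustrative reconstruction, not the production code: the class name, method names, and the per-second rate window are assumptions; only the two inequalities come from the text.

```python
# Hypothetical admission controller enforcing Table 1's two budgets:
#   L <= r1 * HBM / kv_p99   (live-cache footprint)
#   Q_admit <= Q_m * M       (pre-infer admission rate)
class AdmissionController:
    def __init__(self, hbm_bytes, r1, kv_p99_bytes, q_m, m_slots):
        self.max_live = int(r1 * hbm_bytes / kv_p99_bytes)  # cache-count cap
        self.max_rate = q_m * m_slots                       # admissions per window
        self.live = 0                 # currently live prefix caches
        self.admitted_this_sec = 0    # admissions in the current rate window

    def allow(self):
        # Reject if either budget would be exceeded; otherwise admit.
        if self.live >= self.max_live or self.admitted_this_sec >= self.max_rate:
            return False
        self.live += 1
        self.admitted_this_sec += 1
        return True

    def tick(self):
        # Called once per second (assumed window) to reset the rate counter.
        self.admitted_this_sec = 0

    def release(self):
        # A cache reached the end of its lifecycle and was reclaimed.
        self.live -= 1
```

With the paper's 32 MB P99 cache size, half of a hypothetical 64 GiB HBM admits at most 1024 live caches.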

2.2 Affinity-Aware Router (Placement)

Requests marked “special” are routed, via consistent hashing on the userID, to the same instance for both pre-infer and subsequent ranking. This is achieved by injecting a consistency_hash_key in the request header, which passes unchanged through the load balancer and gateway layers.

Within a “special” instance, an in-memory hash table maps userID to the HBM-resident prefix cache $\psi(u)$. In the rare event of misrouting due to churn, the system falls back to full inference, never violating correctness but potentially losing reuse.
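The affinity property (pre-infer and ranking for the same userID land on the same instance) can be illustrated with a standard consistent-hash ring. The class below is a generic sketch, not RelayGR's router; in production the hashing is done by the load balancer on the consistency_hash_key header.

```python
import bisect
import hashlib

# Generic consistent-hash ring (illustrative): the same userID always maps
# to the same instance, and most mappings survive instance churn.
class ConsistentHashRouter:
    def __init__(self, instances, vnodes=64):
        # Place vnodes virtual points per instance on the hash ring.
        self.ring = sorted(
            (self._h(f"{inst}#{v}"), inst)
            for inst in instances for v in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_id):
        # Walk clockwise to the first virtual node at or after the key's hash.
        i = bisect.bisect(self.keys, self._h(user_id)) % len(self.ring)
        return self.ring[i][1]
```

Because both the pre-infer signal and the later ranking request hash the same key, they reach the same backend, which is what makes the local hash-table lookup of $\psi(u)$ possible.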

2.3 Memory-Aware Expander (Local Capacity Extension)

To extend usability beyond a single request lifecycle, RelayGR opportunistically spills used prefix caches $\psi(u)$ into server-local DRAM. Upon a ranking request:

  1. Run a pseudo-pre-infer check: first HBM, then DRAM.
  2. On DRAM hit, trigger a single DRAM→HBM reload (H2D); lock and single-flight semantics prevent redundant reloads.
  3. After ranking, spill $\psi(u)$ into DRAM if not already present.

DRAM/HBM load and spill operations are rate-limited to prevent resource contention, and all operations remain server-local, eliminating cross-server traffic.
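The lock-and-single-flight reload described above can be sketched as follows. This is a simplified model under assumed names (the real system manages device memory, not Python dicts): concurrent requests for the same user trigger at most one DRAM→HBM copy, and a miss in both tiers returns None so the caller falls back to full inference.

```python
import threading

# Simplified two-tier cache with single-flight DRAM->HBM reload (hypothetical
# structure; dicts stand in for HBM blobs and host DRAM blobs).
class Expander:
    def __init__(self):
        self.hbm, self.dram = {}, {}
        self.inflight = {}            # user_id -> Event for a pending reload
        self.lock = threading.Lock()

    def fetch(self, user_id):
        with self.lock:
            if user_id in self.hbm:
                return self.hbm[user_id]      # HBM hit
            if user_id not in self.dram:
                return None                   # miss both tiers -> full inference
            ev = self.inflight.get(user_id)
            leader = ev is None
            if leader:                        # we own the single flight
                ev = self.inflight[user_id] = threading.Event()
        if leader:
            self.hbm[user_id] = self.dram[user_id]   # the one H2D copy
            with self.lock:
                del self.inflight[user_id]
            ev.set()
        else:
            ev.wait()                         # piggyback on the leader's copy
        return self.hbm[user_id]

    def spill(self, user_id):
        # After ranking, keep a DRAM copy for a later request on this server.
        if user_id in self.hbm:
            self.dram.setdefault(user_id, self.hbm[user_id])
```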

3. Implementation on Huawei Ascend NPUs

RelayGR is realized in a production environment utilizing Ascend 910C NPUs:

  • Service partitioning: Two containerized services per host (“rank-normal” for short sequences, “rank-special” for long sequences) each pinned to a single Ascend device. Limitations on special instances per server (typically 1–2) mitigate PCIe contention.
  • HBM KV cache layout: Per-layer, per-user KV blobs (e.g., 8 layers × 256-dim embeddings, fp32) require ~32MB for a 2K-token prefix, with ~50% of HBM reserved for cache.
  • Concurrent inference: Each instance schedules up to $M$ parallel model slots (either pre-infer or ranking) with CPU-side C++ concurrency for admission logic/embedding lookup, orchestrated with the CANN runtime.
  • DRAM expander: Host-side hash map stores DRAM blobs and lock flags; batch I/O in 4MB pages with a token-bucket rate limiter manages concurrent H2D transfers.
  • Affinity routing: Istio/Gateway passes the consistency_hash_key header from client to backend, enabling consistent metadata-driven placement.
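The token-bucket rate limiter mentioned for H2D transfers works as below. This is a textbook token bucket, shown here as a minimal sketch; the rate, burst size, and the idea of metering in 4 MB pages are taken from the text, while the class and its API are assumptions.

```python
# Minimal token-bucket limiter for H2D transfer bandwidth (illustrative).
# Time is passed in explicitly so behavior is deterministic and testable.
class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s     # sustained transfer budget
        self.capacity = burst_bytes      # maximum burst
        self.tokens = burst_bytes        # start full
        self.last = 0.0                  # timestamp of last refill

    def try_acquire(self, nbytes, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True                  # this H2D batch may proceed
        return False                     # defer: would exceed the PCIe budget
```

Metering each 4 MB page through such a bucket bounds how much DRAM→HBM traffic competes with inference for PCIe bandwidth at any instant.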

4. System Performance and Empirical Results

Extensive testing under mirrored production load, with genuine query patterns and load balancing, demonstrates significant improvements:

  • Sequence length: RelayGR increases the maximum feasible prefix from 4K to 6K tokens at P99 pipeline ≤ 135 ms and ≥ 99.9% success (1.5× increase).
  • Throughput: At a 6K-token fixed length and P99 ranking ≤ 50 ms, RelayGR achieves 2.5×–3.6× the throughput of baseline inline processing, depending on DRAM hit rate.

Table 2: Performance Comparison for 2K-Token HSTU Model

Configuration           | Max Seq Len | SLO-Compliant Throughput | Improvement
Baseline (inline GR)    | 4K          | 30 QPS                   | 1×
RelayGR (0% DRAM hit)   | 6K          | 75 QPS                   | 2.5×
RelayGR (10% DRAM hit)  | 6K          | 92 QPS                   | 3.1×
RelayGR (100% DRAM hit) | 6K          | 108 QPS                  | 3.6×

Detailed breakdowns at 80 in-flight queries show that pre-infer occurs off the critical path (~35 ms), DRAM→HBM loads take ~8 ms, and final ranking on cached KV requires ~7 ms, versus baseline exceeding 50 ms at ~50 QPS. RelayGR also sustains significant throughput when scaling to longer sequences (up to 15K tokens) and deeper/wider backbone models.

5. Formal Model and Correctness

The formal guarantee underpinning RelayGR is:

$$\psi = f([\mathcal{U}, \mathcal{S}_\ell, \emptyset, \emptyset]), \qquad \left| f([\mathcal{U}, \mathcal{S}_\ell, \tilde{\mathcal{S}}, \mathcal{I}]) - f([\emptyset, \emptyset, \tilde{\mathcal{S}}, \mathcal{I}];\, \psi) \right| \leq \varepsilon$$

This ensures that scoring with reused user prefix ψ\psi is equivalent, up to an error ε\varepsilon, to full end-to-end inference. The system enforces:

  • Live-cache budget:

$$L = Q_{\mathrm{admit}} \times T_{\mathrm{life}}, \qquad L \times kv_{p99} \leq r_1 \cdot \mathrm{HBM}$$

  • Pre-infer load cap:

$$Q_{\mathrm{admit}} \leq Q_m M; \qquad Q_{\max} \leq (Q_m M)(r_2 N)$$

  • Fallback correctness: Any request failing to hit a cache in HBM/DRAM triggers full inference, preserving output integrity.

In practice, these constraints transform a fully unbounded caching problem into a tractable, lifecycle-bounded policy with no correctness or availability trade-offs.
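A worked instance of these budgets makes the arithmetic concrete. The 32 MB per-prefix size and the ~50% HBM reservation come from the implementation section; the 64 GiB HBM size and 10 s cache lifetime below are illustrative assumptions, not reported figures.

```python
# Worked example of the lifecycle-bounded budgets (assumed HBM size and
# T_life; kv_p99 and r1 follow the figures reported in the text).
GiB, MiB = 2**30, 2**20

hbm    = 64 * GiB   # assumed device HBM capacity
r1     = 0.5        # fraction of HBM reserved for prefix caches
kv_p99 = 32 * MiB   # P99 KV blob for a 2K-token prefix
t_life = 10.0       # assumed seconds a cache must survive the pipeline

max_live  = int(r1 * hbm / kv_p99)   # L <= r1 * HBM / kv_p99
max_admit = max_live / t_life        # from L = Q_admit * T_life

print(max_live)    # 1024 live caches fit in the reserved HBM
print(max_admit)   # up to 102.4 admissions/sec at this lifetime
```

Under these numbers the admission rate, not raw HBM capacity, is the binding constraint once lifetimes stretch past a few seconds.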

6. Broader Context, Significance, and Relationship to Prior Work

RelayGR targets scaling bottlenecks particular to generative recommendation under SLO-driven (latency-tight) industrial serving conditions. By shifting compute and caching to the prefix (user) space and tightly coupling routing/admission with memory and compute-aware policies, it achieves performance unattainable by naive caching or recomputation strategies. These design principles are fundamentally distinct from classical channel relay or classical feedback optimization as studied in e.g., Grassmannian feedback MIMO relaying (0710.5758) or lattice-based decode-and-forward relay architectures (Chaaban et al., 2012), and instead are rooted in production inference engineering for foundation models.

A plausible implication is that similar cross-stage relay-race cache methods may be generalized to other domains where long-context models and real-time constraints interact—the core tension between resource-saturated model inference and stringent SLOs is prevalent across at-scale neural IR, dialogue, and search pipelines.

7. Limitations and Future Directions

RelayGR’s efficacy derives from the independence of user prefix encoding from candidate item context, a property holding for most Transformer-based recommendation pipelines but potentially less so for item-interactive modeling regimes. The system relies on server-local DRAM to extend cache lifetime, and further efficiency gains may be possible by fusing model compression, block-sparse cache layouts, or distributed backplane DRAM/HBM tiers. Integration with other accelerator families, and dynamic cost estimation beyond static/fixed regressor models, remain areas of future development.

Further analytical work could address: how cache eviction policies interact with varied retrieval recall sets, impact at extreme QPS regimes (>10k) and multi-tenant accelerator deployments, and extensions to federated or privacy-preserving model-serving paradigms. The core architectural strategies—admission gating based on fine-grained cost modeling, consistency-key-driven routing, and server-local cache expansion—establish a template for scaling long-context neural inference under severe latency constraints (Wang et al., 5 Jan 2026).
