Sequence-Aware Trigger in Recommender Systems
- Sequence-aware trigger is an admission control mechanism that uses lightweight metadata to decide which long user sequences should undergo pre-inference caching.
- It evaluates metadata features such as prefix length and feature dimension to predict SLO breaches, ensuring compute and memory resources are efficiently managed.
- Integration within relay-caching pipelines enhances throughput and supports longer sequences, achieving up to 1.5× performance improvement over baseline methods.
A sequence-aware trigger is an admission control mechanism designed to optimize pre-inference and caching for long user-behavior sequences in real-time generative recommender systems operating under strict tail-latency service-level objectives (SLOs). It arises in the context of in-HBM relay-race inference, where the generative recommendation (GR) model’s candidate-independent user-behavior prefix is pre-inferred and cached ahead of the latency-critical ranking stage (Wang et al., 5 Jan 2026). The sequence-aware trigger decides, as early as the retrieval phase, which requests are “at-risk” of exceeding the ranking P99 budget and thus should benefit from pre-inference and caching, while controlling for resource constraints on HBM footprint and pre-inference throughput.
1. Motivation and Function
Generative recommenders can leverage extensive user sequence histories, but online inference is constrained by P99 SLOs, capping sequence length and limiting ranking quality. Profiling reveals that most long user prefixes are independent of the per-candidate item list, making them amenable to pre-inference and reuse. Indiscriminate pre-inferencing across a petabyte-scale user population is infeasible due to HBM and compute limitations, especially under high-QPS, multi-stage serving workloads. The sequence-aware trigger’s function is to:
- Identify, using lightweight metadata, which requests are at risk of breaching SLOs if processed with baseline (no relay-caching) inference.
- Bound live in-HBM cache and pre-inference rates to never exceed reserved device resources.
- Emit auxiliary pre-infer signals only for admitted, “at-risk” requests, thus conserving resources and maintaining SLO compliance (Wang et al., 5 Jan 2026).
2. Metadata-Driven Risk Testing
The triggering logic initiates at the retrieval stage, where only metadata—namely the prefix length $L$ and feature dimension $d$—is available. The sequence-aware trigger evaluates whether, given $(L, d)$, the request’s projected ranking P99 latency under baseline inference will exceed the SLO threshold. If the request is deemed “at-risk,” it is marked for pre-inference; otherwise, it passes through the pipeline without relay caching. This lightweight filtering avoids unnecessary computation and memory use for trivially short or low-dimensional prefixes.
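The metadata-only risk test can be sketched as follows. The latency model and all thresholds here (`SLO_MS`, `base_ms`, `per_token_ms`) are illustrative assumptions, not the paper's actual predictor:

```python
# Hypothetical metadata-only risk test: estimate baseline ranking P99 from
# prefix length and feature dimension, then compare against the SLO budget.

SLO_MS = 50.0  # assumed ranking P99 budget in milliseconds


def predict_p99_ms(prefix_len: int, dim: int) -> float:
    """Toy latency model: fixed overhead plus a term that grows with the
    work over the candidate-independent user-behavior prefix."""
    base_ms = 8.0          # assumed per-request overhead
    per_unit_ms = 0.004    # assumed cost per (token x dim/256) unit
    return base_ms + per_unit_ms * prefix_len * dim / 256.0


def risk_test(prefix_len: int, dim: int, slo_ms: float = SLO_MS) -> bool:
    """Mark the request 'at-risk' if projected baseline P99 exceeds the SLO."""
    return predict_p99_ms(prefix_len, dim) > slo_ms
```

Short or narrow prefixes fall below the threshold and skip relay caching entirely, which is the point of the filter.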
3. Admission Control: Bounding Cache Footprint and Pre-Infer Rate
The sequence-aware trigger enforces formal resource constraints:
- HBM live-cache constraint:
$$N_{\text{live}} \cdot S_{p99} \le \rho \cdot C_{\text{HBM}},$$
where $N_{\text{live}}$ is the number of concurrent live caches, $S_{p99}$ the P99 cache size, and $\rho$ the reserved fraction of total HBM capacity $C_{\text{HBM}}$.
- Compute capacity constraint:
$$q_{\text{pre}} \le m \cdot \mu$$
for each “special” instance, keeping its pre-infer QPS $q_{\text{pre}}$ within the sustainable limit of its $m$ model slots (per-slot rate $\mu$).
- System-wide constraint:
$$Q_{\text{pre}} \le K \cdot m \cdot \mu,$$
where $K$ is the number of special instances (Wang et al., 5 Jan 2026).
Only requests meeting all constraints are admitted; otherwise, they follow baseline inference.
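A minimal sketch of the three-way admission check follows; all capacity figures and symbol names (`n_live`, `s_p99_gb`, `rho`, and so on) are illustrative assumptions rather than values from the paper:

```python
# Illustrative admission check combining the HBM footprint, per-instance
# compute, and system-wide rate constraints described above.

def hbm_ok(n_live: int, s_p99_gb: float, rho: float, hbm_gb: float) -> bool:
    """Live-cache footprint must stay within the reserved HBM fraction."""
    return n_live * s_p99_gb <= rho * hbm_gb


def instance_ok(preinfer_qps: float, slots: int, per_slot_qps: float) -> bool:
    """Per-instance pre-infer rate must be sustainable by its model slots."""
    return preinfer_qps <= slots * per_slot_qps


def system_ok(total_qps: float, k_instances: int, per_instance_qps: float) -> bool:
    """System-wide pre-infer rate is bounded by the pool of special instances."""
    return total_qps <= k_instances * per_instance_qps


def admissible(n_live, s_p99_gb, rho, hbm_gb,
               preinfer_qps, slots, per_slot_qps,
               total_qps, k_instances):
    """A request is admitted only when every constraint holds."""
    return (hbm_ok(n_live, s_p99_gb, rho, hbm_gb)
            and instance_ok(preinfer_qps, slots, per_slot_qps)
            and system_ok(total_qps, k_instances, slots * per_slot_qps))
```

Any single violated constraint rejects the request, which then follows baseline inference without relay caching.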
4. Auxiliary Pre-infer Signal and Workflow Integration
For each admitted “at-risk” request $u$, the trigger emits a response-free remote procedure call (RPC) ahead of the main ranking traffic. This RPC is formatted as:
```
header { service: "special_ranking", consistency_hash_key: userID }
body   { user_id: userID, stage: "pre-infer" }
```
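Constructing such a message might look like the sketch below. The field names follow the message format shown above, but the choice of MD5 for the consistency-hash key is an assumption for illustration:

```python
# Sketch of building the response-free pre-infer RPC payload. A stable hash
# of the user ID lets the router co-locate cache production (this pre-infer
# call) with later cache consumption (the ranking call for the same user).
import hashlib


def preinfer_rpc(user_id: str) -> dict:
    """Build a fire-and-forget pre-infer message keyed by user ID."""
    return {
        "header": {
            "service": "special_ranking",
            # deterministic key: the same user always hashes identically
            "consistency_hash_key": hashlib.md5(user_id.encode()).hexdigest(),
        },
        "body": {"user_id": user_id, "stage": "pre-infer"},
    }
```

Because the call is response-free, the retrieval stage never blocks on it; the payload only needs enough information for the special ranking instance to locate the user's prefix and warm the cache.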
5. Scheduling Logic and Pseudocode Representation
The operational logic is encapsulated in a lightweight, retrieval-stage hook:
```
function retrieval_hook(u, metadata):
    # metadata: prefix length and feature dimension, known at retrieval time
    if risk_test(metadata):
        if admissible_rate_and_footprint():
            rpc_fire_and_forget(preinfer_RPC(u))
```
risk_test evaluates metadata against pre-tuned SLO thresholds, while admissible_rate_and_footprint() checks live cache and QPS constraints. This architecture ensures that auxiliary pre-infer load and state remain within controllable device and system budgets, stabilizing QPS and preventing SLO violations under high-concurrency workloads.
6. Role within Cross-Stage Relay-Caching Pipelines
The sequence-aware trigger is the first component in the three-stage RelayGR design, acting in concert with the affinity-aware router (which ensures placement co-location of cache production and consumption using consistent hashing) and the memory-aware expander (which controls short-term reuse via two-tier HBM/DRAM caching). This joint design enables provable bounds on cache correctness and resource usage, with the trigger orchestrating selective cache admission based on real-time system and workload characteristics (Wang et al., 5 Jan 2026).
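The consistent-hashing placement used by the affinity-aware router can be illustrated with a standard hash ring. This is a generic sketch of the technique, not the paper's implementation; the instance names and virtual-node count are made up:

```python
# Illustrative consistent-hash ring: routing by user ID guarantees that the
# pre-infer producer and the later ranking consumer for the same user land
# on the same special instance, so the cached prefix is found locally.
import bisect
import hashlib


def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, instances, vnodes=64):
        # virtual nodes smooth load across a small instance pool
        points = sorted(
            (_h(f"{inst}#{v}"), inst)
            for inst in instances
            for v in range(vnodes)
        )
        self.keys = [p[0] for p in points]
        self.insts = [p[1] for p in points]

    def route(self, user_id: str) -> str:
        """Return the instance owning this user's position on the ring."""
        i = bisect.bisect(self.keys, _h(user_id)) % len(self.keys)
        return self.insts[i]
```

Because routing depends only on the user ID, the trigger's pre-infer RPC and the subsequent ranking request deterministically reach the same instance, which is what makes in-HBM cache reuse across stages possible.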
7. Practical Impact and Evaluation Results
Deployment of the sequence-aware trigger within RelayGR enables markedly longer sequences and higher SLO-compliant throughput (up to 1.5× over the baseline) under identical P99 SLOs. RelayGR, with the sequence-aware trigger as its gatekeeper, sustains QPS and acceptable tail latency as sequence length, model depth, and embedding width scale. Admission thresholds provided by the trigger allow adaptive tuning of the trade-off between latency headroom and overall compute or transfer cost, maintaining system stability under highly concurrent production scenarios (Wang et al., 5 Jan 2026).