
SilverTorch: Unified GPU-based DLRM Serving

Updated 21 November 2025
  • SilverTorch is a unified, model-based system that integrates the entire deep learning recommendation pipeline into a single GPU-executable tensor graph, replacing multiple CPU-based services.
  • It leverages GPU-optimized Bloom filters and fused Int8 ANN search to enhance throughput, reduce latency, and lower costs in large-scale recommendation serving.
  • The co-design of filtering, ANN indexes, and multi-task OverArch scoring delivers substantial production gains, achieving up to 23.7× throughput improvement and 11.4× lower latency.

SilverTorch is a unified, model-based system designed for serving large-scale deep learning recommendation models (DLRM) entirely on GPUs, replacing traditional multi-service pipelines dependent on CPU-based approximate nearest neighbor (ANN) and filtering systems. SilverTorch encodes the entire retrieval and ranking stack as a single tensorized model graph, compiled for GPU execution, thereby enabling significant efficiency, scalability, and modeling improvements in production recommendation systems (Xue et al., 18 Nov 2025).

1. Model-Based Recommendation Serving Architecture

SilverTorch formalizes a model-based serving paradigm in which all retrieval, filtering, multi-task score aggregation, and early-stage ranking operations are expressed as tensor operators within a single PyTorch-derived computation graph. This approach eliminates the need for separately managed microservices for indexing, ANN search, and filter logic. The system consolidates the following model components:

  • User Tower: Computes user embedding.
  • Bloom-index Layer: Performs feature-level filtering using a GPU-optimized Bloom filter.
  • Tensor-native Int8 ANN Layer: Executes approximate kNN search via quantized inner products.
  • OverArch Scoring Layer: Captures complex user-item interactions and supports multi-task outputs.
  • Value Model: Aggregates multi-task outputs into single scores through in-model logic.
  • Embedding Cache: Stores precomputed item embeddings for efficient early-stage ranking (ESR).

End-to-end, the pipeline involves offline training and quantization of multi-tower models, construction of IVF+Int8 ANN and Bloom-filter indexes as GPU-resident tensors, TorchScript model artifact baking, and a serving-time “single forward pass” executing all retrieval, filtering, scoring, and aggregation within one model invocation (Xue et al., 18 Nov 2025).

2. GPU Bloom Index Algorithm for Feature Filtering

SilverTorch utilizes a GPU-optimized Bloom index for pre-retrieval feature filtering. Each item $i$ is assigned an $M$-bit signature $VB_i$. Features are hashed via $k$ independent hash functions; the query’s features yield an analogous query vector $QB$. An item passes the filter if

$$(QB \wedge VB_i) = QB$$

The false-positive probability for $n$ distinct query features, where $m = M$ is the signature length in bits, is

$$P_{fp} = (1 - e^{-kn/m})^k$$

Typical deployment parameters are $m=1024$ bits and $k=5$, yielding $P_{fp} \approx 0.07\%$. Item signatures are laid out in transposed u64-aligned memory for warp-efficient bitwise computation. Insertions are $O(N_{items}\cdot k)$, and queries are $O((M/64)\cdot(N_{items}/64))$ bit-operations. The Bloom mask’s compact representation enables contiguous loading and masking across large candidate sets in a fully parallelized fashion (Xue et al., 18 Nov 2025).
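The containment test and the false-positive formula above can be illustrated with a minimal CPU-side sketch. This is not the GPU kernel described in the text: signatures are held as Python integers, the $k$ hash functions are simulated by salting BLAKE2b (an assumption), and the feature strings are hypothetical. The text does not state $n$; the value $n = 55$ used in the test below is only an illustrative choice that lands near the quoted 0.07%.

```python
import hashlib
import math

M_BITS = 1024   # signature length in bits (m in the formula)
K_HASHES = 5    # number of hash functions (k)


def signature(features, m=M_BITS, k=K_HASHES):
    """Build an m-bit Bloom signature, stored as a Python int, for a set of feature strings."""
    sig = 0
    for feat in features:
        for seed in range(k):
            # Assumption: k independent hashes are approximated by salted BLAKE2b digests.
            digest = hashlib.blake2b(
                feat.encode("utf-8"), digest_size=8, salt=bytes([seed])
            ).digest()
            sig |= 1 << (int.from_bytes(digest, "little") % m)
    return sig


def passes(query_sig, item_sig):
    """An item passes when every bit set in the query is also set in the item: (QB & VB_i) == QB."""
    return (query_sig & item_sig) == query_sig


def false_positive_rate(n, m=M_BITS, k=K_HASHES):
    """Evaluate P_fp = (1 - exp(-k*n/m))**k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Hypothetical items and query features, for illustration only.
items = {
    "item_a": {"lang:en", "country:US", "type:video"},
    "item_b": {"lang:fr", "country:FR", "type:photo"},
}
item_sigs = {name: signature(feats) for name, feats in items.items()}
query_sig = signature({"lang:en", "type:video"})
matches = [name for name, sig in item_sigs.items() if passes(query_sig, sig)]
```

An item whose features are a superset of the query's features always passes, because its signature contains every bit the query sets. An unrelated item passes only on a false positive.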

3. Fused Int8 ANN Search: Indexing and Query Pipeline

Index construction leverages KMeans++ partitioning into $C$ centroids in $\mathbb{R}^D$, quantizing all item and centroid embeddings to int8 values:

$$x_i^{(8)} = \left\lfloor 128 \cdot \frac{x_i - x_{min}}{x_{max} - x_{min}} \right\rceil \in [-128,127]$$
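As a rough illustration of min-max int8 quantization, the sketch below maps a list of floats onto the signed range $[-128, 127]$. It is a variant under stated assumptions, not the exact formula above: read literally, the printed expression yields values in $[0, 128]$, so the sketch uses an affine mapping with a factor of 255 and an offset of $-128$ to cover the full signed range. Taking $x_{min}$ and $x_{max}$ per vector is also an assumption.

```python
import math


def quantize_int8(values):
    """Affine min-max quantization of floats to integers in [-128, 127].

    Assumption: min and max are taken over the given vector; the scale factor 255
    and offset -128 are illustrative choices, not taken from the source.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # degenerate range: every value maps to one code
    scale = 255.0 / (hi - lo)
    codes = []
    for v in values:
        # floor(x + 0.5) rounds to nearest; Python's round() uses banker's rounding.
        q = math.floor((v - lo) * scale + 0.5) - 128
        codes.append(max(-128, min(127, q)))
    return codes


codes = quantize_int8([0.0, 0.5, 1.0])  # -> [-128, 0, 127]
```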

At query time, user embeddings are similarly quantized; a fused batched matmul computes $B \times C$ int8 inner products via the dp4a instruction:

$$S_{b,c} = \sum_{d=1}^D q^{(8)}_{b,d} \times c^{(8)}_{c,d}$$

The top $P$ centroids are probed; cluster items’ int8 embeddings are streamed for further dp4a-based scoring with the quantized user embedding. No intermediate gather buffer is required, maximizing tensor throughput. The global top-$K_0$ heap is maintained in registers/shared memory. Overall complexity per query is $O(P\cdot|\text{cluster}| \cdot D / 32)$ dp4a ops. Int8 storage uses one byte per embedding dimension, a quarter of float32 and half of float16, which shrinks the ANN memory footprint accordingly (Xue et al., 18 Nov 2025).
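The probe-then-score flow described above can be sketched on the CPU as follows. This is a sequential illustration under stated assumptions, not the fused GPU kernel: plain integer dot products stand in for dp4a, a `heapq` min-heap stands in for the register/shared-memory top-$K_0$ heap, and the function and variable names are hypothetical.

```python
import heapq


def dot(a, b):
    """Integer inner product; stands in for the dp4a-based int8 accumulation."""
    return sum(x * y for x, y in zip(a, b))


def ivf_search(query, centroids, clusters, n_probe, top_k):
    """Score centroids, probe the best n_probe clusters, and keep the top_k items.

    query:     list of int8 codes for one user embedding
    centroids: list of int8 code lists, one per cluster
    clusters:  clusters[c] is a list of (item_id, int8 code list) pairs
    """
    centroid_scores = [(dot(query, c), idx) for idx, c in enumerate(centroids)]
    probed = heapq.nlargest(n_probe, centroid_scores)

    heap = []  # min-heap of (score, item_id), never larger than top_k
    for _, cidx in probed:
        for item_id, vec in clusters[cidx]:
            score = dot(query, vec)
            if len(heap) < top_k:
                heapq.heappush(heap, (score, item_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, item_id))
    return sorted(heap, reverse=True)


# Tiny hypothetical index: 3 clusters, 2-dimensional int8 codes.
centroids = [[100, 0], [0, 100], [-100, 0]]
clusters = [
    [(0, [120, 5]), (1, [90, -10])],
    [(2, [5, 110]), (3, [-8, 95])],
    [(4, [-120, 3])],
]
top_items = ivf_search([127, 20], centroids, clusters, n_probe=2, top_k=2)
# -> [(15340, 0), (11230, 1)]
```

Items are scored as they are visited, so no buffer of gathered candidates is materialized, which mirrors the streaming behaviour the text describes.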

4. Co-Design of Filtering and ANN Indexes

In SilverTorch, Bloom and ANN indexes are co-designed as GPU tensors, permitting direct application of the Bloom bitmask within the ANN matmul/gather operation. This eliminates the need to instantiate a full per-item boolean mask:

  • Memory requirement for masking reduces from $N_{items}$ bytes (bool8) to $N_{items}/8$ bytes (bit-packed).
  • The effective scanned-item count after Bloom filtering is proportional to the filter’s selectivity $\alpha$: if $\alpha \approx 0.1$, the number of items scored by ANN is reduced by approximately 90%, saving dp4a operations and bandwidth.
  • Zeroing out scores for non-matching items is accomplished in-place by direct bitmask application during the fused computation (Xue et al., 18 Nov 2025).
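The memory and compute savings listed above can be made concrete with a small sketch of a bit-packed mask consulted during scoring. It is a sequential illustration, not the fused kernel: the helper names are hypothetical, and filtered items are skipped rather than zeroed, which is an assumption made to show the reduced scoring work.

```python
def pack_mask(flags):
    """Pack booleans into a bytearray, 8 items per byte (N/8 bytes instead of N)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, keep in enumerate(flags):
        if keep:
            packed[i >> 3] |= 1 << (i & 7)
    return packed


def is_set(packed, i):
    """Read bit i of the packed mask."""
    return ((packed[i >> 3] >> (i & 7)) & 1) == 1


def masked_scores(query, item_vecs, packed):
    """Score only the items whose mask bit is set; filtered items cost no multiply-adds."""
    scores = {}
    for i, vec in enumerate(item_vecs):
        if not is_set(packed, i):
            continue
        scores[i] = sum(q * x for q, x in zip(query, vec))
    return scores


# Hypothetical 10-item catalogue where 3 items survive the filter.
flags = [True, False, False, True, False, False, False, False, False, True]
packed = pack_mask(flags)              # 2 bytes, versus 10 bytes for one bool per item
item_vecs = [[i, 1] for i in range(10)]
scores = masked_scores([2, 3], item_vecs, packed)  # -> {0: 3, 3: 9, 9: 21}
```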

5. Multi-Task OverArch Layer and Value Model Aggregation

Post-ANN, SilverTorch retrieves cached float16 item embeddings $E \in \mathbb{R}^{K_0 \times D}$, concatenates user, item, and cross features, and applies the OverArch block, either a shallow MLP or Mixture-of-Logits (MoL), to produce $T$ logits for $T$ distinct tasks. The Value Model aggregates these via a weighted sum or specified logic:

$$s_i = \sum_{t=1}^T \alpha_t f_\theta^{(t)}(u,i)$$

A plausible implication is that arbitrary business logic or external JSON-driven rules can be encoded in the Value Model. In training, a multi-task objective is used:

$$\mathcal{L} = \sum_{t=1}^T \alpha_t \mathcal{L}_t(f_\theta^{(t)}(u,i), y_t)$$

OverArch and Value Model are trained offline within the main two-tower setup. After aggregation, the top-$K_1$ items are passed to downstream ESR scoring, again leveraging embedding caching (Xue et al., 18 Nov 2025).
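The weighted-sum aggregation and the top-$K_1$ selection can be sketched in a few lines. The task weights, logits, and item IDs below are made-up illustrative values, and the function names are hypothetical; the OverArch block that would produce the per-task logits is not modelled here.

```python
def value_scores(task_logits, weights):
    """Compute s_i = sum_t alpha_t * f_t(u, i) for every candidate item."""
    return [sum(w * f for w, f in zip(weights, logits)) for logits in task_logits]


def top_k_ids(item_ids, scores, k):
    """Return the IDs of the k highest-scoring items, best first."""
    ranked = sorted(zip(scores, item_ids), reverse=True)
    return [item_id for _, item_id in ranked[:k]]


# Hypothetical example: T = 3 tasks, 3 candidate items.
weights = [0.5, 0.3, 0.2]            # alpha_t
task_logits = [
    [1.0, 0.0, 0.0],                 # item 10
    [0.0, 2.0, 1.0],                 # item 11
    [0.2, 0.2, 0.2],                 # item 12
]
item_ids = [10, 11, 12]
scores = value_scores(task_logits, weights)   # approximately [0.5, 0.8, 0.2]
selected = top_k_ids(item_ids, scores, k=2)   # -> [11, 10]
```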

6. Embedding Caching for Early-Stage Ranking Acceleration

For Early-Stage Ranking (ESR), all item embeddings $E_i$ are precomputed and cached at publish time and stored as a contiguous $(N_{items} \times D)$ GPU tensor, which enables batched gathers at inference:

$$E_{ESR} = \mathsf{embedding\_cache}[\text{top-}K_0\ \text{IDs}]$$

This obviates per-item recomputation and GPU–GPU fetch overhead, resulting in up to $10\times$ fewer GPU cycles and $10\times$ higher queries-per-second (QPS) rates in ESR. The throughput advantage scales linearly with $K_0$ (Xue et al., 18 Nov 2025).
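The gather from a contiguous cache amounts to row indexing into a flat buffer. The sketch below shows this with a CPU-side `array` as a stand-in for the GPU tensor; the sizes, contents, and the `gather` helper are hypothetical, and float32 storage is used only because Python's `array` module has no float16 type code.

```python
from array import array

D = 4          # embedding dimension (illustrative)
N_ITEMS = 6    # catalogue size (illustrative)

# Contiguous (N_ITEMS x D) cache in row-major order; row i holds item i's embedding.
embedding_cache = array(
    "f", [float(i * D + j) for i in range(N_ITEMS) for j in range(D)]
)


def gather(cache, dim, ids):
    """Return the cached rows for the given item IDs, in the order requested."""
    return [list(cache[i * dim:(i + 1) * dim]) for i in ids]


# Top-K_0 IDs as they might come out of the ANN stage (hypothetical values).
e_esr = gather(embedding_cache, D, [4, 1])
# -> [[16.0, 17.0, 18.0, 19.0], [4.0, 5.0, 6.0, 7.0]]
```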

7. Empirical Evaluation and Deployment Outcomes

Evaluation on two datasets (10M and 80M items, $D=128$, A100-40GB GPU; 5,000 request replay) demonstrates substantial improvements:

| Configuration | Throughput (QPS) | p99 Latency (ms) | Cost per 1k requests (USD) | Recall@200 (E-task) |
|---|---|---|---|---|
| CPU-ANN + filter | 51 | ~160 | 0.158 | 0.291 |
| GPU-ANN + filter (Faiss) | 340 | ~40 | 0.0272 | - |
| ST-Retrieval | 1,210 | ~14 | 0.0077 | 0.291 |
| ST-Retrieval+OverArch | 771 | ~28 | 0.012 | 0.331 |

SilverTorch achieves up to $23.7\times$ greater throughput and $11.4\times$ lower p99 latency than CPU baselines; the cost-efficiency improvement is $13.35\times$. OverArch+ValueModel delivers recall gains (“E-task” Recall@200 increases from 0.291 to 0.331, or $+14.6\%$). SilverTorch is deployed across hundreds of retrieval and ESR models in production, recommending content to billions of daily active users (Xue et al., 18 Nov 2025).
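The headline throughput and latency ratios can be recomputed from the table entries with a few lines of arithmetic. The dictionary below simply transcribes the table; the latency values are the approximate figures shown there. Because the table entries are rounded, ratios derived from them (for example for cost or recall) need not reproduce the quoted figures exactly.

```python
# Table values transcribed as-is; p99 latencies are the approximate figures shown.
results = {
    "CPU-ANN + filter":         {"qps": 51,   "p99_ms": 160, "cost_usd": 0.158},
    "GPU-ANN + filter (Faiss)": {"qps": 340,  "p99_ms": 40,  "cost_usd": 0.0272},
    "ST-Retrieval":             {"qps": 1210, "p99_ms": 14,  "cost_usd": 0.0077},
    "ST-Retrieval+OverArch":    {"qps": 771,  "p99_ms": 28,  "cost_usd": 0.012},
}


def ratios_vs_cpu(name):
    """Throughput, latency, and cost ratios of a configuration relative to the CPU baseline."""
    base, row = results["CPU-ANN + filter"], results[name]
    return {
        "throughput_x": row["qps"] / base["qps"],
        "latency_x": base["p99_ms"] / row["p99_ms"],
        "cost_x": base["cost_usd"] / row["cost_usd"],
    }


st = ratios_vs_cpu("ST-Retrieval")
# round(st["throughput_x"], 1) -> 23.7 and round(st["latency_x"], 1) -> 11.4
```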
