SilverTorch: Unified GPU-based DLRM Serving

Updated 21 November 2025

SilverTorch is a unified, model-based system that integrates the entire deep learning recommendation pipeline into a single GPU-executable tensor graph, replacing multiple CPU-based services.
It leverages GPU-optimized Bloom filters and fused Int8 ANN search to enhance throughput, reduce latency, and lower costs in large-scale recommendation serving.
The co-design of filtering, ANN indexes, and multi-task OverArch scoring delivers substantial production gains, achieving up to 23.7× throughput improvement and 11.4× lower latency.

SilverTorch is a unified, model-based system designed for serving large-scale deep learning recommendation models (DLRM) entirely on GPUs, replacing traditional multi-service pipelines dependent on CPU-based approximate nearest neighbor (ANN) and filtering systems. SilverTorch encodes the entire retrieval and ranking stack as a single tensorized model graph, compiled for GPU execution, thereby enabling significant efficiency, scalability, and modeling improvements in production recommendation systems (Xue et al., 18 Nov 2025).

1. Model-Based Recommendation Serving Architecture

SilverTorch formalizes a model-based serving paradigm in which all retrieval, filtering, multi-task score aggregation, and early-stage ranking operations are expressed as tensor operators within a single PyTorch-derived computation graph. This approach eliminates the need for separately managed microservices for indexing, ANN search, and filter logic. The system consolidates the following model components:

User Tower: Computes user embedding.
Bloom-index Layer: Performs feature-level filtering using a GPU-optimized Bloom filter.
Tensor-native Int8 ANN Layer: Executes approximate kNN search via quantized inner products.
OverArch Scoring Layer: Captures complex user-item interactions and supports multi-task outputs.
Value Model: Aggregates multi-task outputs into single scores through in-model logic.
Embedding Cache: Stores precomputed item embeddings for efficient early-stage ranking (ESR).

End-to-end, the pipeline involves offline training and quantization of multi-tower models, construction of IVF+Int8 ANN and Bloom-filter indexes as GPU-resident tensors, TorchScript model artifact baking, and a serving-time “single forward pass” executing all retrieval, filtering, scoring, and aggregation within one model invocation (Xue et al., 18 Nov 2025).
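The "single forward pass" idea can be illustrated with a minimal Python sketch. It shows only how the stages compose inside one call; the `UnifiedModel` class, its field names, and the toy lambdas are hypothetical placeholders chosen for illustration, not SilverTorch's actual modules or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]

@dataclass
class UnifiedModel:
    """Hypothetical container: each field stands in for one in-graph layer."""
    user_tower: Callable[[dict], Vector]
    bloom_filter: Callable[[dict], List[int]]                    # -> surviving item IDs
    ann_search: Callable[[Vector, List[int]], List[int]]         # -> top-K0 item IDs
    over_arch: Callable[[Vector, List[int]], List[List[float]]]  # -> per-task logits
    value_model: Callable[[List[float]], float]                  # -> single score

    def forward(self, request: dict, k1: int) -> List[Tuple[int, float]]:
        """One invocation runs filtering, retrieval, scoring, and aggregation."""
        user = self.user_tower(request)
        allowed = self.bloom_filter(request)
        candidates = self.ann_search(user, allowed)
        logits = self.over_arch(user, candidates)
        scored = [(item, self.value_model(l)) for item, l in zip(candidates, logits)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k1]

# Toy stand-ins so the sketch runs end to end (all values are made up).
model = UnifiedModel(
    user_tower=lambda req: [1.0, 0.0],
    bloom_filter=lambda req: [0, 1, 2],
    ann_search=lambda user, allowed: allowed[:2],
    over_arch=lambda user, items: [[0.1 * i, 0.5] for i in items],
    value_model=lambda logits: sum(logits),
)
result = model.forward({"user_id": 7}, k1=1)
```

In the real system these stages are tensor operators compiled into one graph rather than Python callables; the sketch is meant only to convey that no stage leaves the model boundary.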

2. GPU Bloom Index Algorithm for Feature Filtering

SilverTorch uses a GPU-optimized Bloom index for pre-retrieval feature filtering. Each item $i$ is assigned an $m$-bit signature $VB_i$. Each feature is hashed by $k$ independent hash functions and the corresponding bits are set; the query’s features yield an analogous query vector $QB$. An item passes the filter if every bit set in $QB$ is also set in $VB_i$, that is, if

$(QB \wedge VB_i) = QB$

The false-positive probability for $n$ distinct query features is

$P_{fp} = (1 - e^{-kn/m})^k$

Typical deployment parameters are $m=1024$ bits and $k=5$, yielding $P_{fp} \approx 0.07\%$. Item signatures are laid out in transposed, $u64$-aligned memory for warp-efficient bitwise computation. Insertions cost $O(N_{items}\cdot k)$, and queries cost $O((m/64)\cdot(N_{items}/64))$ bit-operations. The Bloom mask’s compact representation enables contiguous loading and masking across large candidate sets in a fully parallelized fashion (Xue et al., 18 Nov 2025).
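A minimal CPU-side sketch of the filter logic follows, using Python integers as bit vectors. The salted SHA-256 hashing, the feature strings, and the reading of "transposed" as a bit-sliced layout (one bitmap over items per signature bit) are assumptions made for illustration; the source does not specify them. The last helper evaluates the false-positive formula above; with $m=1024$ and $k=5$ it gives roughly 0.07% at about $n=55$.

```python
import hashlib
import math

M_BITS = 1024   # signature width m (value quoted in the text)
K_HASHES = 5    # number of hash functions k (value quoted in the text)

def bit_positions(feature):
    """Derive k bit positions from salted SHA-256 (hash choice is an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{salt}:{feature}".encode()).digest()[:8], "little") % M_BITS
        for salt in range(K_HASHES)
    ]

def signature(features):
    """Build an m-bit signature as a Python int (VB_i for items, QB for queries)."""
    sig = 0
    for feature in features:
        for pos in bit_positions(feature):
            sig |= 1 << pos
    return sig

def passes(qb, vb):
    """The filter test from the text: (QB AND VB_i) == QB."""
    return (qb & vb) == qb

def transpose(signatures):
    """Bit-sliced layout: columns[j] has bit i set iff item i has signature bit j."""
    columns = [0] * M_BITS
    for i, sig in enumerate(signatures):
        for j in range(M_BITS):
            if (sig >> j) & 1:
                columns[j] |= 1 << i
    return columns

def query_transposed(qb, columns, n_items):
    """AND the columns of every bit set in QB; the result is a per-item bitmask."""
    mask = (1 << n_items) - 1
    for j in range(M_BITS):
        if (qb >> j) & 1:
            mask &= columns[j]
    return mask

def false_positive_rate(m, k, n):
    """P_fp = (1 - exp(-k n / m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Hypothetical item features; items 0 and 2 carry the queried feature.
items = [{"lang:en", "country:US"}, {"lang:fr", "country:FR"}, {"lang:en", "country:CA"}]
vbs = [signature(f) for f in items]
qb = signature({"lang:en"})
mask = query_transposed(qb, transpose(vbs), len(vbs))
```

The transposed query touches one column per set query bit and handles all items in a single wide AND, which is the property a GPU kernel would exploit with 64-bit words; items that truly contain the queried feature always pass, since a Bloom filter has no false negatives.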

3. Fused Int8 ANN Search: Indexing and Query Pipeline

Index construction uses KMeans++ to partition the items into $C$ clusters with centroids in $\mathbb{R}^D$, then quantizes all item and centroid embeddings to int8 values:

$x_i^{(8)} = \left\lfloor 128 \cdot \frac{x_i - x_{min}}{x_{max} - x_{min}} \right\rceil \in [-128,127]$

At query time, user embeddings are similarly quantized; a fused batched matmul computes $B \times C$ int8 inner products via the dp4a instruction:

$S_{b,c} = \sum_{d=1}^D q^{(8)}_{b,d} \times c^{(8)}_{c,d}$

The top $P$ centroids are probed; cluster items’ int8 embeddings are streamed for further dp4a-based scoring with the quantized user embedding. No intermediate gather buffer is required, maximizing tensor throughput. The global top-$K_0$ heap is maintained in registers/shared memory. Overall complexity per query is $O(P\cdot|\text{cluster}| \cdot D / 32)$ dp4a ops. Int8 storage takes one byte per dimension, a quarter of the float32 footprint and half of float16 (Xue et al., 18 Nov 2025).
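The two-stage query (score centroids, probe the best $P$, keep a running top-$K_0$) can be sketched in plain Python. This is an illustration under stated assumptions, not the fused GPU kernel: `dp4a` is emulated as a 4-way integer multiply-accumulate, the heap lives in ordinary memory, and the quantizer uses symmetric scaling (an assumed scheme, chosen so int8 inner products stay proportional to the float ones) rather than the min-max formula above. All function names and the toy data are hypothetical.

```python
import heapq

def quantize_int8(vec, max_abs):
    """Symmetric int8 quantization (assumed scheme): scale by 127/max_abs and clamp."""
    scale = 127.0 / max_abs
    return [max(-128, min(127, round(x * scale))) for x in vec]

def dp4a(a4, b4, acc):
    """Emulate one dp4a step: 4-way int8 dot product added to an integer accumulator."""
    return acc + a4[0] * b4[0] + a4[1] * b4[1] + a4[2] * b4[2] + a4[3] * b4[3]

def int8_dot(q8, x8):
    """Inner product over D dimensions in D/4 dp4a steps (D must be a multiple of 4)."""
    acc = 0
    for d in range(0, len(q8), 4):
        acc = dp4a(q8[d:d + 4], x8[d:d + 4], acc)
    return acc

def ivf_search(q8, centroids8, clusters, n_probe, k0):
    """Probe the n_probe best clusters and return the top-k0 (score, item_id) pairs."""
    probed = heapq.nlargest(n_probe, range(len(centroids8)),
                            key=lambda c: int8_dot(q8, centroids8[c]))
    heap = []  # min-heap holding the current best k0 candidates
    for c in probed:
        for item_id, x8 in clusters[c]:  # items are streamed, never gathered into a buffer
            score = int8_dot(q8, x8)
            if len(heap) < k0:
                heapq.heappush(heap, (score, item_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, item_id))
    return sorted(heap, reverse=True)

# Toy index: D=4, C=2 clusters, embeddings bounded by 1.0 in magnitude.
centroids8 = [quantize_int8(c, 1.0) for c in ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])]
clusters = [
    [(0, quantize_int8([0.9, 0.1, 0.0, 0.0], 1.0)), (1, quantize_int8([0.4, 0.0, 0.0, 0.0], 1.0))],
    [(2, quantize_int8([0.0, 0.8, 0.0, 0.0], 1.0))],
]
q8 = quantize_int8([1.0, 0.0, 0.0, 0.0], 1.0)
top = ivf_search(q8, centroids8, clusters, n_probe=1, k0=2)
```

With `n_probe=1` only the first cluster is scanned, so item 2 is never scored; raising `n_probe` trades extra dp4a work for recall.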

4. Co-Design of Filtering and ANN Indexes

In SilverTorch, Bloom and ANN indexes are co-designed as GPU tensors, permitting direct application of the Bloom bitmask within the ANN matmul/gather operation. This eliminates the need to instantiate a full per-item boolean mask:

Memory requirement for masking reduces from $N_{items}$ bytes (bool8) to $N_{items}/8$ bytes (bit-packed).
The effective scanned-item count after Bloom filtering is proportional to the filter’s selectivity $\alpha$ : if $\alpha \approx 0.1$ , the number of items scored by ANN is reduced by approximately 90%, saving dp4a operations and bandwidth.
Zeroing out scores for non-matching items is accomplished in-place by direct bitmask application during the fused computation (Xue et al., 18 Nov 2025).
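The memory and work savings listed above can be made concrete with a small sketch. It packs a per-item pass/fail flag into one bit and consults that bit inside the scoring loop. The helper names and toy data are hypothetical, and the sketch skips filtered items outright rather than zeroing their scores inside a fused kernel, so it models the effect and not the GPU implementation.

```python
def pack_mask(flags):
    """Pack per-item booleans into bytes, 8 items per byte (bit i % 8 of byte i // 8)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, keep in enumerate(flags):
        if keep:
            packed[i >> 3] |= 1 << (i & 7)
    return packed

def is_allowed(packed, i):
    """Read the mask bit of item i."""
    return bool((packed[i >> 3] >> (i & 7)) & 1)

def masked_scan(query, items, packed):
    """Score only items whose mask bit is set; returns (scores, number of dot products)."""
    scores, n_scored = {}, 0
    for i, vec in enumerate(items):
        if not is_allowed(packed, i):
            continue  # filtered item: no dot product is computed
        scores[i] = sum(q * x for q, x in zip(query, vec))
        n_scored += 1
    return scores, n_scored

n_items = 80
flags = [i % 10 == 0 for i in range(n_items)]   # selectivity alpha = 0.1
packed = pack_mask(flags)                        # n_items / 8 bytes instead of n_items
items = [[i, 1] for i in range(n_items)]
scores, n_scored = masked_scan([1, 0], items, packed)
```

Here 80 one-byte flags shrink to 10 bytes, and only 8 of the 80 items reach the dot product, matching the roughly 90% reduction described for $\alpha \approx 0.1$.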

5. Multi-Task OverArch Layer and Value Model Aggregation

Post-ANN, SilverTorch retrieves cached float16 item embeddings $E \in \mathbb{R}^{K_0 \times D}$, concatenates user, item, and cross features, and applies the OverArch block, either a shallow MLP or Mixture-of-Logits (MoL), to produce $T$ logits for $T$ distinct tasks. The Value Model aggregates these via a weighted sum or specified logic:

$s_i = \sum_{t=1}^T \alpha_t f_\theta^{(t)}(u,i)$

A plausible implication is that arbitrary business logic or external JSON-driven rules can be encoded in the Value Model. In training, a multi-task objective is used:

$\mathcal{L} = \sum_{t=1}^T \alpha_t \mathcal{L}_t(f_\theta^{(t)}(u,i), y_t)$

OverArch and Value Model are trained offline within the main two-tower setup. After aggregation, the top-$K_1$ items are passed to downstream ESR scoring, again leveraging embedding caching (Xue et al., 18 Nov 2025).
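The weighted-sum aggregation and the subsequent top-$K_1$ cut can be sketched as follows. The task weights, logits, and item IDs are made-up illustrative values, and the function names are hypothetical; the source only specifies the formula $s_i = \sum_t \alpha_t f_\theta^{(t)}(u,i)$.

```python
def value_model(task_logits, alphas):
    """s_i = sum_t alpha_t * f_t(u, i) for a single item."""
    return sum(a * f for a, f in zip(alphas, task_logits))

def top_k1(item_ids, logits_per_item, alphas, k1):
    """Aggregate per-task logits into one score per item and keep the best k1 items."""
    scored = [(value_model(logits, alphas), item)
              for item, logits in zip(item_ids, logits_per_item)]
    scored.sort(reverse=True)
    return [item for _, item in scored[:k1]]

# T = 2 hypothetical tasks with weights alpha = (1.0, 2.0).
alphas = [1.0, 2.0]
item_ids = [10, 11, 12]
logits_per_item = [[0.2, 0.1], [0.5, 0.0], [0.1, 0.4]]  # scores: 0.4, 0.5, 0.9
ranked = top_k1(item_ids, logits_per_item, alphas, k1=2)
```

Changing `alphas` re-ranks the same candidates without touching the OverArch outputs, which is what makes the aggregation step a natural place for task-weighting decisions.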

6. Embedding Caching for Early-Stage Ranking Acceleration

For Early-Stage Ranking (ESR), all item embeddings $E_i$ are precomputed at publish time and stored as a contiguous $(N_{items} \times D)$ GPU tensor, so inference needs only a batched gather:

$E_{ESR} = \mathsf{embedding\_cache}[\text{top-}K_0\ \text{IDs}]$

This obviates per-item recomputation and GPU–GPU fetch overhead, resulting in up to $10\times$ fewer GPU cycles and $10\times$ higher query-per-second (QPS) rates in ESR. The throughput advantage scales linearly with $K_0$ (Xue et al., 18 Nov 2025).
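A small sketch of the caching pattern: the item tower runs once per item when the cache is built, and serving reduces to row lookups. `EmbeddingCache`, `toy_item_tower`, and the call counter are hypothetical names used only to make the "no recomputation at inference" point observable; a Python list stands in for the contiguous GPU tensor.

```python
class EmbeddingCache:
    """Hypothetical publish-time cache: one embedding row per item ID, filled once."""

    def __init__(self, item_tower, n_items):
        # Publish time: run the item tower exactly once per item.
        self.rows = [item_tower(i) for i in range(n_items)]

    def gather(self, ids):
        """Serving time: pure row lookup, no tower evaluation."""
        return [self.rows[i] for i in ids]

calls = {"n": 0}

def toy_item_tower(item_id):
    """Stand-in for an expensive item tower; counts how often it runs."""
    calls["n"] += 1
    return [float(item_id), float(item_id) * 0.5]

cache = EmbeddingCache(toy_item_tower, n_items=6)
e_esr = cache.gather([4, 1])      # embeddings for the retrieved IDs
e_esr_again = cache.gather([4, 1])
```

After both gathers the tower has still run only six times, once per item, regardless of how many requests are served.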

7. Empirical Evaluation and Deployment Outcomes

Evaluation on two datasets (10M and 80M items, $D=128$, A100-40GB GPU; 5,000-request replay) demonstrates substantial improvements:

Configuration	Throughput (QPS)	p99 Latency (ms)	Cost/1k req	Recall@200 (E-task)
CPU-ANN + filter	51	~160	$0.158	0.291
GPU-ANN + filter (Faiss)	340	~40	$0.0272	-
ST-Retrieval	1,210	~14	$0.0077	0.291
ST-Retrieval+OverArch	771	~28	$0.012	0.331

SilverTorch achieves up to $23.7\times$ greater throughput and $11.4\times$ lower p99 latency than CPU baselines; cost-efficiency improvement is $13.35\times$. OverArch+ValueModel delivers recall gains (“E-task” Recall@200 increases from 0.291 to 0.331, or $+14.6\%$). SilverTorch is deployed across hundreds of retrieval and ESR models in production, recommending content to billions of daily active users (Xue et al., 18 Nov 2025).
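The throughput and latency ratios follow from the CPU-ANN + filter and ST-Retrieval rows of the table (the p99 latencies are approximate values):

$\frac{1{,}210}{51} \approx 23.7, \qquad \frac{160}{14} \approx 11.4$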

References

Xue et al. (18 Nov 2025). SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs.
