DeepRecInfra: Optimizing Neural Recommendation Inference

Updated 16 June 2026

DeepRecInfra is an end-to-end framework that models and benchmarks eight state-of-the-art neural recommendation models for datacenter-scale inference.
It employs DeepRecSched, a dynamic scheduling algorithm that tunes CPU and GPU batch sizes and offload thresholds to meet strict p95 latency SLOs.
Empirical evaluations demonstrate up to 5.8× throughput improvements and significant latency reductions, enhancing power efficiency and scalability.

DeepRecInfra is an end-to-end modeling, scheduling, and evaluation framework designed to optimize neural recommendation inference at datacenter scale. Developed with a co-design methodology, DeepRecInfra integrates a suite of industry-representative models, a dynamic workload generator modeled after real-world traffic, a scheduler (DeepRecSched) for hardware- and workload-aware batch and accelerator decision-making, and implementations tailored for modern CPU and GPU server nodes. The platform targets high-throughput, low-latency serving of large-scale deep learning recommendation models (DLRMs), facilitating rapid exploration and deployment of optimizations in production environments (Gupta et al., 2020).

1. Architecture and Modeling Scope

DeepRecInfra explicitly targets the diverse, stringent requirements of production recommendation systems. It is structured around three core pillars:

Model Suite: Implements and benchmarks eight state-of-the-art neural recommendation models: NCF, Wide & Deep (WnD), MT-WnD, Facebook DLRM-RMC1/2/3, and Alibaba DIN and DIEN. The models are implemented in Caffe2 with CPU (Intel MKL) and GPU (cuDNN) backends and cover a spectrum of real-world topologies.
Latency Service-Level Objectives (SLOs): Incorporates strict tail-latency (p95) constraints, with real service SLOs collected for each model. Latency targets are swept across three ranges (Low, Medium, High), with Low and High set 50% below and above Medium.
Workload Generator: Employs a Poisson process for inter-arrival times, with query working-sets extracted to match the heavy-tailed distributions observed in production datacenters.

This explicit modeling of both the model and operational context allows DeepRecInfra to match the per-query behavior and statistical risk factors critical to large-scale commercial deployments (Gupta et al., 2020).
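
The workload generator described above can be illustrated with a minimal sketch. It assumes exponential inter-arrival gaps (which yield Poisson arrivals) and uses a truncated Pareto distribution as a stand-in for the production query-size traces, which are not reproduced here. The function name `generate_queries` and all parameter values are hypothetical and not taken from the DeepRecInfra release.

```python
import random


def generate_queries(rate_qps, duration_s, seed=0,
                     min_size=10, max_size=1000, alpha=1.5):
    """Return a list of (arrival_time_s, query_size) pairs.

    Arrivals: exponential gaps with mean 1/rate_qps, i.e. a Poisson process.
    Sizes: truncated Pareto, an ASSUMED placeholder for the measured
    heavy-tailed production distribution.
    """
    rng = random.Random(seed)
    now = 0.0
    queries = []
    while True:
        now += rng.expovariate(rate_qps)  # gap until the next query arrives
        if now >= duration_s:
            break
        # paretovariate() returns values >= 1, so sizes start at min_size.
        size = min(max_size, int(min_size * rng.paretovariate(alpha)))
        queries.append((now, size))
    return queries


if __name__ == "__main__":
    trace = generate_queries(rate_qps=100.0, duration_s=10.0, seed=1)
    sizes = sorted(size for _, size in trace)
    print(len(trace), "queries; median size", sizes[len(sizes) // 2],
          "; max size", sizes[-1])
```

Fixing the seed keeps a trace reproducible, which matters when comparing two scheduler configurations against the same load.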

2. Dynamic Scheduling: DeepRecSched Formalism

At the heart of DeepRecInfra is DeepRecSched, a query-batch scheduling and accelerator offload algorithm designed to maximize throughput under SLO constraints. The scheduling objective is formalized as:

$\begin{aligned} & \text{Maximize}\quad QPS(b,s) \\ & \text{Subject to:}\quad \text{Latency}_{p95}(b, s) \leq L_{SLO} \\ &\quad 1 \leq b \leq C_{max} \\ &\quad 0 \leq s \leq Q_{max} \end{aligned}$

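where $b$ is the per-request batch size, $s$ is the query-size threshold for GPU offload, $QPS(b,s)$ is the achieved throughput in queries per second, $L_{SLO}$ is the p95 latency bound, and $C_{max}$ and $Q_{max}$ are the upper bounds on $b$ and $s$.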
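
As an illustration of the latency constraint, the sketch below checks whether a set of measured latencies satisfies a p95 bound. The nearest-rank percentile definition is an assumption (the source does not say which estimator is used), and the helper names `percentile`, `meets_slo`, and `slo_targets` are hypothetical. `slo_targets` reads the Low and High regimes as 50% below and above a Medium target.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, slo_ms, p=95):
    """True if the p-th percentile latency is within the SLO bound."""
    return percentile(latencies_ms, p) <= slo_ms


def slo_targets(medium_ms):
    """Low/Medium/High latency targets, assuming a +/-50% sweep."""
    return {"low": 0.5 * medium_ms, "medium": medium_ms, "high": 1.5 * medium_ms}


if __name__ == "__main__":
    latencies = list(range(1, 101))  # synthetic latencies of 1..100 ms
    for name, bound in slo_targets(100).items():
        print(name, bound, meets_slo(latencies, bound))
```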

The optimization is performed in two phases: (1) CPU batch-size tuning for small queries, and (2) offload threshold tuning to route large queries to GPU. The system models queries as a mixture drawn from measured heavy-tailed production traces, and the scheduler leverages a hill climbing search to select $(b, s)$ maximizing throughput while remaining within latency SLO bounds. Hardware heterogeneity, such as SIMD width and memory architecture, is incorporated via benchmarking and modeling of per-device performance curves (Gupta et al., 2020).
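
The two-phase search can be sketched as follows. This is a simplified illustration rather than the published scheduler: the geometric step size, the starting points, and the stop-at-first-non-improvement rule are assumptions, and `toy_measure` is a synthetic stand-in for running the load generator against real hardware. The names `hill_climb` and `tune` are hypothetical.

```python
import math


def hill_climb(evaluate, start, upper, step=2):
    """Grow one knob geometrically while throughput improves and the SLO holds.

    evaluate(x) must return (qps, slo_ok). Returns (best_x, best_qps);
    best_x is None if even the starting point violates the SLO.
    """
    best_x, best_qps = None, float("-inf")
    x = start
    while x <= upper:
        qps, slo_ok = evaluate(x)
        if not slo_ok or qps <= best_qps:
            break  # local optimum reached or SLO violated
        best_x, best_qps = x, qps
        x *= step
    return best_x, best_qps


def tune(measure, max_batch, max_query):
    """Phase 1 tunes the CPU batch size b; phase 2 tunes the offload threshold s."""
    cpu_only = max_query  # no query is larger than this, so nothing is offloaded
    b, cpu_qps = hill_climb(lambda x: measure(x, cpu_only), 1, max_batch)
    if b is None:
        raise RuntimeError("no batch size meets the SLO")
    s, gpu_qps = hill_climb(lambda x: measure(b, x), 1, max_query)
    if s is None or gpu_qps <= cpu_qps:
        return b, cpu_only, cpu_qps  # offloading did not help
    return b, s, gpu_qps


def toy_measure(b, s):
    """Synthetic (qps, slo_ok): throughput peaks at b=64, s=256; SLO holds for b<=256."""
    qps = 1000 - 10 * (math.log2(b) - 6) ** 2 - 5 * (math.log2(s) - 8) ** 2
    return qps, b <= 256


if __name__ == "__main__":
    print(tune(toy_measure, max_batch=512, max_query=1024))
```

Because each phase stops at the first step that fails to improve throughput, the search needs only a logarithmic number of measurements per knob, at the cost of possibly settling on a local optimum.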

3. Hardware-Aware Model Implementations

DeepRecInfra supports both CPU and GPU execution, with model kernels engineered for hardware efficiency:

CPU Optimizations: MLPs are vectorized using Intel MKL, exploiting 256-bit AVX2 (Broadwell) and AVX-512 (Skylake) instructions; embedding lookups use table reads and pooling to minimize indirection and saturate DRAM bandwidth.
GPU Kernels: Models leverage cuDNN and custom data-copy streams, explicitly modeling and measuring PCIe/GPU memory transfer bottlenecks. The GPU path is invoked for compute-heavy, large queries exceeding a device-specific threshold.

The system dynamically maps small queries to the CPU (to minimize copy overhead and exploit thread-level parallelism) and large queries to the GPU (to maximize compute and memory bandwidth utility), with empirical batch size selection per device and workload. For embedding-heavy models, large batch sizes are favored on the CPU, while compute-bound models benefit from aggressive GPU offload (Gupta et al., 2020).
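
A minimal sketch of this size-based routing is shown below. It assumes that a CPU-bound query is split into requests of at most `cpu_batch` items (consistent with a per-request batch size) and that an offloaded query runs on the GPU as a single batch; both choices, along with the names `Plan` and `plan_query`, are illustrative assumptions rather than the framework's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    device: str      # "cpu" or "gpu"
    batches: tuple   # sizes of the requests the query is split into


def plan_query(query_size, cpu_batch, offload_threshold):
    """Route one query: sizes above the threshold go to the GPU,
    everything else is split into CPU requests of at most cpu_batch items."""
    if query_size > offload_threshold:
        return Plan("gpu", (query_size,))
    full, rest = divmod(query_size, cpu_batch)
    batches = (cpu_batch,) * full + ((rest,) if rest else ())
    return Plan("cpu", batches)


if __name__ == "__main__":
    for size in (100, 256, 500):
        print(size, plan_query(size, cpu_batch=32, offload_threshold=256))
```

Splitting on the CPU lets the resulting requests run on separate cores in parallel, while keeping a large query whole on the GPU avoids paying the transfer overhead more than once.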

4. Empirical Evaluation and Datacenter Deployment

DeepRecInfra has been validated on dual-socket Broadwell and Skylake machines (28/40 cores, AVX2/AVX-512) and a GPU emulator (GTX 1080Ti). The system uses the eight-model suite with measured production traffic patterns and SLO targets. Key experimental findings include:

Throughput: DeepRecSched-CPU yields 1.7×–2.7× QPS improvement over a static scheduler; DeepRecSched-GPU further raises throughput to 4.0×–5.8× due to effective offload of large queries.
Power Efficiency: QPS/W increases by 1.7–2.7× for the CPU scheduler; incorporating GPU yields 2.0–2.9× improvements, though only compute-dominated models consistently benefit.
Latency: Production deployment across hundreds of datacenter servers reduced p95 latency by 1.39× and p99 latency by 1.31×, which can be traded for higher QPS per server or for fewer servers at the same load.
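
Reading the latency factors as ratios of old to new tail latency (an interpretation, since the convention is not stated), the corresponding relative reductions work out as:

$\begin{aligned} \frac{L_{p95}^{\text{new}}}{L_{p95}^{\text{old}}} &= \frac{1}{1.39} \approx 0.72 \quad (\text{about } 28\% \text{ lower}) \\ \frac{L_{p99}^{\text{new}}}{L_{p99}^{\text{old}}} &= \frac{1}{1.31} \approx 0.76 \quad (\text{about } 24\% \text{ lower}) \end{aligned}$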

A notable design finding is that a small cluster can replicate datacenter tail-latency statistics within ∼10%, significantly accelerating prototyping without risking production SLOs (Gupta et al., 2020).

5. Integration with Emerging Storage and Computation Infrastructures

DeepRecInfra has become a foundational suite for evaluating and co-designing scheduling and hardware solutions for recommendation models. For instance:

RecSSD: Employs DeepRecInfra’s benchmarks to demonstrate that near-data processing at the SSD level yields up to 2× end-to-end inference latency improvement versus commodity SSDs, with fully NVMe-compatible host integration (Wilkening et al., 2021).
SCRec: Measures 10.4×–55.8× higher throughput and up to 13.35× better energy efficiency for DLRM inference with SmartSSD+FPGA computational storage, using DeepRecInfra’s Criteo/Meta-MELS test suite (Yang et al., 1 Apr 2025).

This demonstrates that DeepRecInfra is not limited to software runtime scheduling but extends to full-stack co-design and hardware-in-the-loop benchmarking.

6. Practical Observations, Limitations, and Future Directions

Observations from DeepRecInfra’s design and deployment include:

Query Size Distributions: Production query sizes have a heavier tail than a log-normal distribution; assuming a classical distribution underestimates the impact of the tail and lowers attainable throughput by up to 1.7×.
Batching Strategies: The optimal batch size is jointly determined by model bottleneck (memory vs. compute), SLO regime, and hardware microarchitecture.
GPU Offloading: Offload is not universally optimal; gains occur when queries are both large and compute-heavy, and SLOs are sufficiently tight to amortize copy costs.
Proxy Fleets: Mini-fleet experimentation enables routine exploration of scheduling/design improvements without production risk.
Open Source Availability: DeepRecInfra’s release with load generators, model implementations, hardware models, and scheduler enables reproducibility and community extension.

Limitations include a focus on batch size, SLO, and power analysis, with fine-grained energy modeling and SLA-adaptive mechanisms left for future work. The GPU “accelerator” is a simulation/emulation model; adapting the approach to next-generation AI ASICs and adding online, feedback-based adaptation to diurnal load shifts remain open directions (Gupta et al., 2020).

The DeepRecInfra approach, emphasizing algorithm-system co-design and explicit SLO-awareness, contrasts with single-device or static scheduling frameworks. Integration with hardware-aware deduplication (e.g., RecD for end-to-end feature de-duplication during DLRM training (Zhao et al., 2022)), near-data embedding aggregation (RecSSD (Wilkening et al., 2021)), and compute-in-storage paradigms (SCRec (Yang et al., 1 Apr 2025)) shows that DeepRecInfra constitutes both a benchmarking suite and a methodology driving state-of-the-art efficiency in datacenter-scale recommendation systems.

System/Project	Role in DeepRecInfra Ecosystem	Key Contribution
DeepRecInfra	Scheduling, model suite, load generation	End-to-end framework for datacenter-scale neural recommendation inference
RecSSD	Storage-layer NDP, relies on DeepRecInfra benchmarks	Up to 2× speedup on embedding-dominated inference
SCRec	Compute/storage co-design on SmartSSD, tested on DeepRecInfra workloads	>10× throughput, >6× energy efficiency over GPUs
RecD	Deduplication for DLRM training	2×–3× training throughput; batch-level feature compression

These results collectively demonstrate that DeepRecInfra provides a rigorous, extensible analytic foundation for both software-centric and hardware-centric optimization in large-scale recommendation inference.