DLRM: Deep Learning Recommendation Model
- DLRM is a neural network architecture designed for large-scale personalized recommendations, integrating dense feature processing with embedding-based representations.
- It computes explicit pairwise feature interactions via dot products between the bottom-MLP output and embedding-table vectors, enhancing predictive accuracy without combinatorial growth in parameters.
- DLRM employs hybrid parallelism with model and data parallel strategies to efficiently scale complex embedding operations and MLP computations.
Deep Learning Recommendation Model (DLRM) is a neural network architecture designed to efficiently process large-scale personalization and recommendation tasks characterized by heterogeneous feature types—especially the need to handle both dense numerical data and high-cardinality categorical inputs. Engineered for industrial relevance, DLRM integrates embedding-based representations, feature interaction modeling, and multi-layer perceptrons within a distinct compositional workflow, and is equipped with tailored parallelization schemes to meet the memory and compute demands of modern recommenders (Naumov et al., 2019). It serves not only as a highly performant modeling solution but also as a foundational benchmark for algorithmic and systems co-design efforts in recommendation systems.
1. Mathematical Architecture and Feature Handling
DLRM’s architecture is fundamentally defined by its explicit support for both dense and sparse (categorical) input modalities. Categorical features are transformed via embedding tables: for a categorical variable indexed by $i$ (typically represented as a one-hot vector $e_i$), the corresponding embedding is obtained by

$$w_i^T = e_i^T W,$$

where $W \in \mathbb{R}^{m \times d}$ is the embedding matrix for $m$ categories, each mapped to a $d$-dimensional dense vector. For batch processing or multi-hot features, the lookup generalizes to

$$S = A^T W,$$

with $A = [a_1, \ldots, a_t]$ aggregating several (multi-)hot vectors $a_j$.
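As an illustration of this lookup algebra, the following minimal sketch (shapes and indices are illustrative assumptions, not the reference implementation) shows that gathering and summing rows of the embedding table is equivalent to the product $A^T W$ for a multi-hot input:

```python
import torch

m, d = 10, 4                      # m categories, d-dimensional embeddings
W = torch.randn(m, d)             # embedding table W in R^{m x d}

# Multi-hot vector selecting categories 2 and 7
a = torch.zeros(m)
a[[2, 7]] = 1.0

# Dense formulation: s = a^T W
s_dense = a @ W

# Equivalent sparse formulation: gather the selected rows and sum them
# (this is what an EmbeddingBag-style lookup computes without materializing a)
s_sparse = W[[2, 7]].sum(dim=0)

assert torch.allclose(s_dense, s_sparse)
```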
Numerical (dense) features are passed into a bottom MLP, producing a dense feature vector. Subsequently, DLRM computes explicit pairwise dot-product interactions between all dense and embedded features (akin to the second-order terms in matrix factorization and Factorization Machine models): stacking the bottom-MLP output $z_0$ and the embedding vectors $z_1, \ldots, z_k$ into a matrix $Z$, the interaction terms are

$$p_{ij} = z_i^T z_j, \qquad i < j,$$

i.e. the strictly upper-triangular elements of $Z Z^T$, with the “upper” operator selecting those elements.
All original dense features (output by the bottom MLP) and computed interaction terms are concatenated and input into a top MLP, followed by a sigmoid activation to output probabilities (typically for binary classification).
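The interaction-and-concatenation step can be sketched as follows; the function name, batch dimension, and shapes are illustrative assumptions rather than the reference implementation:

```python
import torch

def interact(dense_out: torch.Tensor, embeddings: list[torch.Tensor]) -> torch.Tensor:
    """Pairwise dot-product interactions followed by concatenation.

    dense_out:  (batch, d)   output of the bottom MLP
    embeddings: list of (batch, d) embedded categorical features
    returns:    (batch, d + k*(k+1)//2), where k+1 is the number of interacting vectors
    """
    # Stack bottom-MLP output and embeddings: Z has shape (batch, k+1, d)
    Z = torch.stack([dense_out] + embeddings, dim=1)
    # All pairwise dot products: (batch, k+1, k+1) Gram matrix Z Z^T
    gram = torch.bmm(Z, Z.transpose(1, 2))
    # Keep only the strictly upper-triangular entries (the "upper" operator)
    n = gram.shape[1]
    rows, cols = torch.triu_indices(n, n, offset=1)
    interactions = gram[:, rows, cols]
    # Concatenate dense features with interaction terms as input to the top MLP
    return torch.cat([dense_out, interactions], dim=1)

# Example: batch of 2, embedding dimension 4, three categorical features
dense = torch.randn(2, 4)
embs = [torch.randn(2, 4) for _ in range(3)]
print(interact(dense, embs).shape)  # torch.Size([2, 10])
```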
The end-to-end workflow is as follows:
| Input | Transformation | Output |
|---|---|---|
| Categorical features | Embedding lookups | Dense embedding vectors |
| Numerical features | Bottom MLP | Dense feature vector |
| Dense + embedded vectors | Pairwise dot products (interaction) | Concatenated feature/interaction vector |
| Concatenated vector | Top MLP + sigmoid | Prediction (e.g., click probability) |
This structure generalizes and extends the expressive power of conventional Factorization Machine (FM) and Wide & Deep models while retaining computational tractability.
2. Implementation in PyTorch and Caffe2 Frameworks
DLRM is implemented both in PyTorch and Caffe2, mapping core architectural components onto framework-specific constructs:
- PyTorch:
  - Embeddings: `nn.EmbeddingBag` supports both one-hot and multi-hot categories.
  - MLPs: `nn.Linear` for individual layers, using `addmm` for efficient batched matrix multiplication.
  - Loss: `nn.CrossEntropyLoss` for binary classification.
  - Data parallelism is natively supported (`nn.DataParallel` or `nn.DistributedDataParallel`), efficiently replicating MLP parameters and distributing workloads across devices.
  - Model parallelism (across embedding tables) requires custom embedding sharding; no out-of-the-box operator exists, so practitioners partition large tables manually for distributed memory placement.
- Caffe2:
  - Embeddings: `SparseLengthsSum` operator.
  - MLPs: `FC` layers.
  - Feature interactions: `BatchMatMul`.
  - Loss: dedicated cross-entropy operator.
  - Caffe2 also lacks native support for hybrid parallelism, necessitating manual communication primitives when fusing replicated MLPs and sharded embeddings.
Both frameworks highlight that while dense neural operations map cleanly onto native primitives, real-world systems require careful orchestration for large embedding tables.
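To make the PyTorch mapping concrete, here is a minimal, self-contained sketch of a DLRM-style module built from `nn.EmbeddingBag` and `nn.Linear`; the class name, layer widths, and table sizes are illustrative assumptions and not the authors' open-source implementation:

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Illustrative DLRM-style model: embeddings + bottom/top MLPs + dot interactions."""

    def __init__(self, num_dense: int, table_sizes: list[int], emb_dim: int = 16):
        super().__init__()
        # One EmbeddingBag per categorical feature (handles one-hot and multi-hot inputs)
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(n, emb_dim, mode="sum") for n in table_sizes]
        )
        # Bottom MLP maps dense features to the embedding dimension
        self.bottom = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, emb_dim), nn.ReLU()
        )
        # Top MLP consumes the dense output plus all pairwise interaction terms
        k = len(table_sizes) + 1                      # number of interacting vectors
        top_in = emb_dim + k * (k - 1) // 2
        self.top = nn.Sequential(nn.Linear(top_in, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dense: torch.Tensor, sparse: list[torch.Tensor]) -> torch.Tensor:
        z0 = self.bottom(dense)                        # (batch, emb_dim)
        embs = [t(idx) for t, idx in zip(self.tables, sparse)]
        Z = torch.stack([z0] + embs, dim=1)            # (batch, k, emb_dim)
        gram = torch.bmm(Z, Z.transpose(1, 2))         # pairwise dot products
        r, c = torch.triu_indices(Z.shape[1], Z.shape[1], offset=1)
        x = torch.cat([z0, gram[:, r, c]], dim=1)      # concat dense + interactions
        return torch.sigmoid(self.top(x)).squeeze(1)   # predicted probability

# Example usage: batch of 4, 13 dense features, 3 categorical features
model = TinyDLRM(num_dense=13, table_sizes=[100, 50, 200])
dense = torch.randn(4, 13)
sparse = [torch.randint(0, n, (4, 1)) for n in (100, 50, 200)]  # one index per feature
print(model(dense, sparse).shape)  # torch.Size([4])
```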
3. Parallelization Strategies for Scale
DLRM adopts a hybrid parallelization scheme to address the memory and compute demands of massive recommendation models:
- Model parallelism is applied to embedding tables: each device or worker node manages a disjoint partition of the full embedding parameter set. This enables scaling to multi-gigabyte or terabyte-scale models without redundant memory copies.
- Data parallelism is employed for MLPs: compute layers are replicated across devices, each replica processing separate mini-batch partitions. Gradients are aggregated using collective operations (e.g., allreduce via NCCL or Gloo).
- Personalized all-to-all communication bridges the two regimes before the feature-interaction stage that feeds the top MLP: a “butterfly shuffle” partitions and exchanges embedding outputs so that each device receives precisely the subset of embedded vectors corresponding to its mini-batch partition.
No mainstream deep learning framework (as of the referenced work) natively supports this fine-grained combination of model- and data-parallelism with application-optimized collectives. Explicit scheduling of computation and communication is required.
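The following single-process sketch simulates the butterfly shuffle to illustrate the data movement; the device count, shapes, and the use of `torch.chunk`/`torch.cat` in place of a real collective (e.g. `torch.distributed.all_to_all`) are illustrative assumptions:

```python
import torch

devices, tables_per_device, batch, d = 4, 2, 8, 4
local_batch = batch // devices

# Model parallelism: each "device" holds its own embedding tables and has already
# produced lookups for the FULL mini-batch on those tables:
# per_device[i] has shape (tables_per_device, batch, d).
per_device = [torch.randn(tables_per_device, batch, d) for _ in range(devices)]

# Butterfly shuffle: every device splits its lookups along the batch dimension
# and sends slice j to device j (in a real system this is an all-to-all collective).
sent = [list(torch.chunk(x, devices, dim=1)) for x in per_device]   # sent[i][j] goes to device j

# After the exchange, device j owns ALL tables' embeddings, but only for its own
# slice of the mini-batch, ready for data-parallel interaction + top MLP.
received = [torch.cat([sent[i][j] for i in range(devices)], dim=0) for j in range(devices)]

for r in received:
    assert r.shape == (devices * tables_per_device, local_batch, d)
print("each device now holds", received[0].shape, "= (all tables, local batch, dim)")
```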
4. Empirical Evaluation and Profiling
On the Big Basin AI platform (dual Xeon 6138 CPUs, eight V100 16GB GPUs), DLRM’s performance was benchmarked against existing models, including the Deep & Cross Network (DCN), primarily on the Criteo Ad Kaggle dataset:
- With both SGD and Adagrad optimizers, DLRM achieved slightly higher training and validation accuracy than DCN, without extensive hyperparameter search.
- Run-time profiling for a representative model (with eight categorical features and 512 continuous features) demonstrated that:
- On CPUs, fully connected (MLP) layers dominate compute time.
- On GPUs, the bottleneck shifts to embedding lookup and cross-device communication.
- Single-CPU execution: ~256 s; GPU: ~62 s.
- DLRM’s explicit (rather than implicit) second-order interaction modeling provides computational efficiency without accuracy loss compared to polynomial expansions or higher-order cross methods.
The architecture emphasizes efficient memory access, manageable embedding table communication, and suitability as a systems benchmark.
5. System Benchmarks and Co-design Implications
DLRM’s open-source implementations and systematic structure position it as a benchmark for joint algorithm-system research:
- The explicit architectural separation between embedding lookup, interaction, and dense computation layers clarifies bottlenecks for both algorithmic (optimization, compression) and system-level (memory, communication, parallelism) innovation.
- Hybrid parallelism requirements have inspired the development of more advanced data/model parallel runtimes.
- DLRM underlines the importance of optimizing both higher-order feature interactions and the logistics of embedding table storage, access, and exchange.
- The architecture motivates future deep learning systems to add native support for mixed parallelism and efficient, flexible communication collectives.
Emerging work in the years following DLRM (not included in the defining paper) leverages this benchmark to compare performance optimizations, hardware co-designs, compression schemes, and new training/inference paradigms. The DLRM workflow thus catalyzes both the improvement of underlying computational infrastructure and the advancement of new modeling techniques.
6. Research Insights and Future Directions
DLRM’s design isolates the two central challenges in learning-based recommendation:
- Efficient handling of categorical data through embedding tables, which dominate memory and communication.
- Explicit and efficient modeling of feature interactions to maximize predictive power without combinatorial explosion.
The findings from DLRM reinforce that
- Advanced interaction modeling can be combined with scalable embedding lookup to achieve state-of-the-art recommendation quality.
- System and algorithm designers must carefully balance the capacity, bandwidth, and synchronization costs incurred by large embedding tables and nonlinear interaction modeling.
The DLRM paper concludes by positioning its design as a foundation for continued research on scalable, accurate, and efficient personalized recommendation systems, particularly in the areas of memory-centric and hybrid architectures, compressed representations, and parallel system support (Naumov et al., 2019).