
DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems (2008.13535v2)

Published 19 Aug 2020 in cs.IR, cs.LG, and stat.ML

Abstract: Learning effective feature crosses is the key behind building recommender systems. However, the sparse and large feature space requires exhaustive search to identify effective crosses. Deep & Cross Network (DCN) was proposed to automatically and efficiently learn bounded-degree predictive feature interactions. Unfortunately, in models that serve web-scale traffic with billions of training examples, DCN showed limited expressiveness in its cross network at learning more predictive feature interactions. Despite significant research progress made, many deep learning models in production still rely on traditional feed-forward neural networks to learn feature crosses inefficiently. In light of the pros/cons of DCN and existing feature interaction learning approaches, we propose an improved framework DCN-V2 to make DCN more practical in large-scale industrial settings. In a comprehensive experimental study with extensive hyper-parameter search and model tuning, we observed that DCN-V2 approaches outperform all the state-of-the-art algorithms on popular benchmark datasets. The improved DCN-V2 is more expressive yet remains cost efficient at feature interaction learning, especially when coupled with a mixture of low-rank architecture. DCN-V2 is simple, can be easily adopted as building blocks, and has delivered significant offline accuracy and online business metrics gains across many web-scale learning to rank systems at Google.

Citations (401)

Summary

  • The paper presents an improved cross network that models bounded-degree feature interactions by combining explicit cross layers with a deep neural network.
  • It introduces a mixture-of-low-rank experts strategy that balances prediction accuracy and computational cost in large-scale ranking systems.
  • Empirical evaluations and theoretical analyses confirm that DCN-V2 achieves lower log loss and higher AUC, offering actionable insights for production deployments.

The paper introduces DCN‑V2, a new architecture designed for learning high‑order feature interactions in a web‑scale learning‑to‑rank setting. Its central innovation is an improved cross network that explicitly and efficiently models bounded‑degree feature crosses while remaining amenable to production constraints such as low latency and limited memory. The work not only presents a more expressive variant than the original Deep & Cross Network (DCN) but also shares practical lessons gathered after deploying the method in real‑world ranking systems.

Key ideas and contributions include:

  • Explicit and Implicit Interaction Modeling
    • DCN‑V2 begins with an embedding layer that converts categorical and dense features into lower‑dimensional representations. From this shared embedding, the model applies a sequence of cross layers that generate explicit feature interaction terms: each cross layer computes x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, where ⊙ is the element‑wise product and x_0 is the base embedding, so every layer crosses its input with the base input to form higher‑order polynomial features (see the PyTorch sketch after this list). In parallel or in stacked form, these explicit interactions are combined with a deep neural network (DNN) that models complementary implicit interactions.
  • Improved Expressiveness with Bounded Degree
    • Unlike typical DNNs (e.g., those based on ReLU activations) that learn interactions implicitly, DCN‑V2 explicitly models bounded‑order feature crosses. The authors show theoretically that an l‑layer cross network can reproduce all interaction terms up to order l+1. This explicit formulation allows the model to efficiently capture combinatorial interactions without the need for manual feature engineering.
  • Mixture of Low‑Rank Experts
    • Noticing that the learned weight matrices in the cross layers are numerically low‑rank, the paper proposes a low‑rank approximation strategy. The authors extend this idea with a mixture‑of‑experts formulation in which several low‑rank “experts” learn interactions in different subspaces and a dynamic gating mechanism combines their outputs (a sketch of this mixture layer also follows the list). This approach improves the trade‑off between prediction quality and computational cost, making the model more attractive for deployment in production systems where resources are constrained.
  • Theoretical Analysis
    • The paper provides detailed proofs that the cross network captures not only element‑wise (bit‑wise) interactions but also, when the input is viewed as a concatenation of feature embeddings, feature‑wise interactions. The analysis explains how the architecture parameterizes polynomial functions up to a specific degree; a full cross layer uses O(d²) parameters for a d‑dimensional input, and the low‑rank factorization reduces this to O(dr) for rank r ≪ d.
  • Empirical Evaluations
    • A broad experimental study is conducted on several public benchmark datasets (Criteo and MovieLens‑1M) as well as a web‑scale production dataset. Extensive hyper‑parameter search and ablation studies confirm that (a) explicit high‑order feature crossing is beneficial over standard DNN layers, (b) DCN‑V2 achieves lower log loss and higher AUC than alternatives such as DeepFM, xDeepFM, AutoInt, and even standalone large DNNs, and (c) the mixture of low‑rank experts (DCN‑Mix) maintains accuracy while reducing model size and latency.
  • Production Lessons
    • The authors describe practical strategies adopted during productionization at Google. For example, inserting one or two cross layers near the input (before the hidden layers) provides a good balance, and both the stacked and the parallel (concatenated) combinations of cross and deep layers work well (a stacked sketch follows this list). They also report that replacing standard ReLU layers with cross layers can yield improvements when memory is a limiting factor and when the target interaction structure is polynomial in nature. These insights illustrate that careful architectural design and system optimization are essential when scaling to billions of training examples in real‑world ranking applications.
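To make the cross operation concrete, here is a minimal PyTorch sketch of a single DCN‑V2 cross layer. The class name, shapes, and hyper‑parameters are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""

    def __init__(self, dim: int):
        super().__init__()
        # Full d-by-d weight matrix with bias: O(d^2) parameters per layer.
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # The element-wise product with the base input x0 raises the
        # polynomial degree by one; the residual term xl preserves all
        # lower-degree interaction terms.
        return x0 * self.linear(xl) + xl

# Stacking l layers on the base embedding x0 yields interaction terms up to
# degree l + 1, matching the expressiveness result stated in the paper.
x0 = torch.randn(32, 64)                     # (batch, d) concatenated embeddings
cross = nn.ModuleList(CrossLayerV2(64) for _ in range(2))
x = x0
for layer in cross:
    x = layer(x0, x)                         # after 2 layers: crosses up to degree 3
```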
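The mixture‑of‑low‑rank‑experts layer (DCN‑Mix) can be sketched in the same style. Each expert below factorizes the d×d matrix as U_i V_iᵀ with rank r ≪ d, and a softmax gate conditioned on the layer input weights the experts; the paper's full formulation also allows nonlinear activations and an extra projection inside each expert, which this simplified sketch omits:

```python
class CrossLayerMix(nn.Module):
    """Mixture-of-low-rank-experts cross layer (simplified DCN-Mix sketch)."""

    def __init__(self, dim: int, rank: int, num_experts: int):
        super().__init__()
        # Each expert replaces the d-by-d matrix with U_i @ V_i^T: O(d * r) params.
        self.V = nn.ModuleList(nn.Linear(dim, rank, bias=False)
                               for _ in range(num_experts))
        self.U = nn.ModuleList(nn.Linear(rank, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)   # dynamic gating on x_l

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        scores = torch.softmax(self.gate(xl), dim=-1)              # (batch, K)
        experts = torch.stack([u(v(xl)) for u, v in zip(self.U, self.V)],
                              dim=-1)                              # (batch, d, K)
        mixed = (experts * scores.unsqueeze(1)).sum(dim=-1)        # (batch, d)
        return x0 * mixed + xl
```

Because the element‑wise product distributes over the gated sum, applying x0 after mixing is equivalent to gating the per‑expert cross outputs.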
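Finally, a hedged sketch of the stacked arrangement described in the production lessons: one or two cross layers near the input feeding a small feed‑forward tower (the layer widths are placeholders). A parallel variant would instead run the cross and deep branches side by side and concatenate their outputs before the final logit:

```python
class DCNV2Stacked(nn.Module):
    """Embeddings -> cross layers -> DNN -> logit (stacked combination)."""

    def __init__(self, dim: int, num_cross: int = 2, hidden: int = 128):
        super().__init__()
        self.cross = nn.ModuleList(CrossLayerV2(dim) for _ in range(num_cross))
        self.deep = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        x = x0
        for layer in self.cross:
            x = layer(x0, x)              # explicit bounded-degree crosses first
        return self.head(self.deep(x))    # implicit interactions on top
```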

In summary, DCN‑V2 represents a significant advance in designing networks that explicitly learn feature interactions through simple and effective cross layers, enhanced by low‑rank approximations and expert mixtures. This design not only offers theoretical guarantees regarding polynomial representation but also delivers practical gains in deployment, making it an attractive option for learning‑to‑rank systems and other recommendation tasks in large‑scale industrial settings.