
CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU (2204.06240v3)

Published 13 Apr 2022 in cs.LG and cs.IR

Abstract: The click-through rate (CTR) prediction task is to predict whether a user will click on the recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical to ensuring an up-to-date model and reducing the training cost. One approach to increase the training speed is to apply large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch easily suffers from the loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first theoretically show that different frequencies of ids make it challenging to scale hyperparameters when scaling the batch size. To stabilize the training process in a large batch size setting, we develop the adaptive Column-wise Clipping (CowClip). It enables an easy and effective scaling rule for the embeddings, which keeps the learning rate unchanged and scales the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scaled 128 times the original batch size without accuracy loss. In particular, for CTR prediction model DeepFM training on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU. Our code locates at https://github.com/bytedance/LargeBatchCTR.

Citations (7)

Summary

  • The paper introduces the CowClip method that drastically shortens CTR model training from 12 hours to 10 minutes on one GPU.
  • It employs adaptive column-wise gradient clipping, whose per-ID thresholds account for categorical feature frequency, to keep large-batch training stable.
  • Empirical results on Criteo and Avazu datasets show improved AUC and efficient resource utilization without sacrificing accuracy.

Essay on "CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU"

The paper "CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU" presents an innovative method for accelerating the training of Click-Through Rate (CTR) prediction models by leveraging a novel optimization strategy called adaptive Column-wise Clipping (CowClip). The primary focus of the research is to address the challenges associated with scaling up the batch size of such models to significantly reduce training time without compromising accuracy.

CTR prediction is a critical component of modern recommendation systems, where timely model updates are essential to adapt to rapidly changing user behavior and content. Training CTR prediction models has traditionally meant processing vast amounts of data over many hours, often across multiple GPUs. The proposed CowClip methodology instead enables large-batch training of CTR models on a single GPU, cutting training time from 12 hours to 10 minutes.

The paper addresses a key problem observed with large-batch training in prior work: the loss of accuracy caused by improper scaling of hyperparameters such as the learning rate and the L2-regularization weight. The authors show that conventional scaling rules do not account for the frequency imbalance of the categorical feature IDs used in CTR tasks. Unlike image or text data, these IDs vary enormously in occurrence frequency, which renders traditional hyperparameter scaling rules ineffective.
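For context, the scaling rules the paper argues break down are the standard large-batch heuristics from vision and NLP, shown below, where B is the batch size, k the batch-size multiplier, and η the learning rate. CowClip's rule for the embedding layer instead keeps η unchanged and rescales the L2 penalty, as stated in the abstract. The notation here is illustrative and not taken from the paper.

```latex
\text{linear scaling: } B \to kB \;\Rightarrow\; \eta \to k\eta,
\qquad
\text{square-root scaling: } B \to kB \;\Rightarrow\; \eta \to \sqrt{k}\,\eta
```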

The authors propose CowClip, which pairs a modified scaling rule, tailored to the sparse and frequency-imbalanced nature of CTR datasets, with adaptive column-wise gradient clipping. The clipping stabilizes training by setting an individualized threshold for each ID based on the norm of its embedding vector, further scaled by the ID's occurrence count in the batch, so that infrequent features are neither unfairly penalized nor overemphasized during training.
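To make the clipping step concrete, here is a minimal PyTorch-style sketch. It assumes the per-ID threshold is the embedding norm (floored by a small constant so near-zero embeddings still learn), multiplied by a clipping ratio and by the ID's in-batch count; the function and variable names are illustrative and are not taken from the authors' repository.

```python
import torch

def cowclip_embedding_grad(weights, grads, batch_ids, ratio=1.0, zero_thresh=1e-3):
    """Per-ID (column-wise) gradient clipping for an embedding table (sketch).

    weights:     (num_ids, dim) embedding table
    grads:       (num_ids, dim) gradient of the table accumulated over the batch
    batch_ids:   1-D LongTensor of the IDs appearing in the batch (with repeats)
    ratio:       clipping ratio relating the threshold to the embedding norm
    zero_thresh: floor on the embedding norm so rarely seen, near-zero
                 embeddings still receive a usable gradient
    """
    clipped = grads.clone()
    ids, counts = torch.unique(batch_ids, return_counts=True)

    w_norm = weights[ids].norm(dim=1).clamp_min(zero_thresh)  # ||w_i|| with floor
    g_norm = clipped[ids].norm(dim=1)                         # ||g_i|| per ID

    # Threshold grows with the ID's in-batch count: a frequent ID's gradient
    # is a sum over many examples, so it gets a proportionally larger budget.
    threshold = ratio * w_norm * counts.to(w_norm.dtype)
    scale = (threshold / g_norm.clamp_min(1e-12)).clamp(max=1.0)

    clipped[ids] = clipped[ids] * scale.unsqueeze(1)
    return clipped
```

Scaling the threshold by the in-batch count matters because, in a large batch, a frequent ID accumulates gradient contributions from many examples; without that factor, its aggregated gradient would be clipped far more aggressively than a rare ID's.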

The paper provides empirical evidence to support these claims. CowClip is evaluated on the well-known Criteo and Avazu datasets, where batch sizes are scaled up to 128 times the baseline with large speedups and no loss of accuracy. In particular, an AUC improvement of over 0.1% was observed for the DeepFM model trained on the Criteo dataset.

The theoretical implications of the research revolve around better understanding and addressing the unique challenges posed by heterogeneous feature frequencies in large-scale CTR prediction tasks. Practically, adopting CowClip in recommendation systems could lead to more efficient use of computational resources and faster iterations of model updates, directly impacting the economics of online advertising and recommendation services.

In terms of future developments, CowClip opens avenues for exploring large batch training techniques across other machine learning domains that involve large embedding tables, such as NLP. Additionally, since the strategy can stabilize training without system-specific optimizations, it potentially offers robust scalability benefits in distributed multi-GPU settings.

Overall, the proposed approach offers a significant contribution to the field of CTR prediction model training, particularly in reducing computational resources and enhancing operational efficiency in real-world applications.
