
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation (2502.17494v7)

Published 20 Feb 2025 in cs.IR, cs.AI, and cs.LG

Abstract: Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling-up and advanced design of the recommendation model can bring significant performance improvement. However, with a larger model scale, such prior studies have a significantly increasing gap from industry as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets are restricted for the model to be served, exceeding which may incur latency and impair user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address the overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher in a way like a foundation model (FM) that can serve multiple students as vertical models (VMs) to amortize its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gain by ExFM.


Summary

  • The paper introduces the ExFM framework that efficiently serves trillions-parameter models for online ads recommendation by leveraging external distillation and data augmentation.
  • The methodology addresses strict latency constraints and dynamic streaming data challenges by amortizing a foundation model over multiple vertical models.
  • Experimental evaluation demonstrates significant performance gains, including inference NE improvements from a 1000X-scale (3.2T-parameter) foundation model across multiple service stages.

Efficient Serving of Trillions-Parameter Models in Online Ads Recommendation

Introduction

The paper "External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation" discusses the challenges and solutions in deploying large foundation models for ads recommendation systems. The focus is on overcoming restrictions in training and inference budgets and addressing data distribution shifts due to streaming data. Large models have shown significant performance improvements, yet their practical deployment in industrial settings is hindered by latency constraints and dynamic data environments.

Core Concept: ExFM Framework

The proposed framework, External Large Foundation Model (ExFM), combines external distillation with a Data Augmentation Service (DAS) to maintain high model performance while keeping training and inference costs within budget. ExFM builds a teacher akin to a foundation model (FM) that serves multiple vertical models (VMs), amortizing the cost of building and maintaining the teacher across its students.

Figure 1: The proposed ExFM framework that enables trillions-parameter model serving with a designed data augmentation system (DAS) and external distillation.
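
To make external distillation concrete, below is a minimal PyTorch-style sketch, not the paper's implementation: the FM is never co-trained or co-served with the VM; its predictions arrive as logged data and act as soft labels next to the ground truth. `VerticalModel`, the tensor shapes, and the 0.5 distillation weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerticalModel(nn.Module):
    """A small student (VM) that predicts click probability from dense features."""
    def __init__(self, in_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # logits

vm = VerticalModel()
opt = torch.optim.Adam(vm.parameters(), lr=1e-3)

# One streaming mini-batch: ground-truth labels plus FM predictions that were
# logged earlier (the FM itself is external and never runs here).
x = torch.randn(256, 32)                  # features
y = torch.randint(0, 2, (256,)).float()   # ground-truth labels
fm_pred = torch.rand(256)                 # logged FM probabilities (soft labels)

logits = vm(x)
loss_label = F.binary_cross_entropy_with_logits(logits, y)
loss_distill = F.binary_cross_entropy_with_logits(logits, fm_pred)
loss = loss_label + 0.5 * loss_distill    # 0.5 is an illustrative weight

opt.zero_grad()
loss.backward()
opt.step()
```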

Challenges in Industrial-Scale Applications

Two primary challenges addressed in the paper include:

  • C1: Restricted Training and Inference Latency: Ensuring models meet latency requirements is critical for user experience during serving.
  • C2: Streaming Data With Shifting Distributions: Data arrives continuously and its distribution shifts as users and ads join or leave the system, so models must adapt in near real time to maintain predictive accuracy.

Components of the ExFM Framework

1. Data Augmentation Service (DAS)

DAS logs the FM's supervision (its predictions) at serving time and efficiently joins it with the VMs' training data in a distributed setting.

Figure 2: Data Augmentation Service (DAS) strategically enhances training data preparation for VMs.
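
Conceptually, DAS behaves like a streaming join between FM prediction logs and the VMs' training examples. The toy sketch below illustrates that idea only; the record layout, the (request_id, ad_id) key, and all values are assumptions, not the paper's schema.

```python
from typing import Dict, Iterator, Tuple

def augment_training_stream(
    examples: Iterator[dict],
    fm_log: Dict[Tuple[str, str], float],
) -> Iterator[dict]:
    """Attach the logged FM score to each example; drop rows the FM never scored."""
    for ex in examples:
        key = (ex["request_id"], ex["ad_id"])
        if key in fm_log:
            ex["fm_pred"] = fm_log[key]  # external supervision for distillation
            yield ex

# Usage with two fake records: FM scores were logged at serving time.
fm_log = {("req1", "ad9"): 0.72, ("req2", "ad3"): 0.11}
stream = iter([
    {"request_id": "req1", "ad_id": "ad9", "label": 1},
    {"request_id": "req2", "ad_id": "ad3", "label": 0},
])
for row in augment_training_stream(stream, fm_log):
    print(row)  # each row now carries both the label and the FM's soft label
```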

2. Auxiliary Head (AH) and Student Adapter (SA)

These components are designed to mitigate the data distribution gap between FM and VMs:

  • Auxiliary Head (AH): Routes FM supervision through a separate task head rather than the VM's main serving head; the paper argues theoretically that this effectively reduces the bias the FM transfers to the VM.

    Figure 3: Integration of the Auxiliary Head (AH) and Student Adapter (SA) into ExFM to mitigate data distribution issues.

  • Student Adapter (SA): Learns a lightweight transformation of FM predictions before they are consumed by the VMs, reducing the freshness gap caused by FM staleness on streaming data (see the sketch after this list).
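
The following sketch shows only the structural idea of both components, assuming a PyTorch-style VM: FM supervision flows through a separate auxiliary head, so its bias does not directly shape the serving head, and a small adapter re-calibrates the possibly stale FM scores first. How the paper actually trains and wires these pieces may differ; module names, sizes, and loss weights are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VMWithAuxHead(nn.Module):
    def __init__(self, in_dim: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.main_head = nn.Linear(64, 1)   # serving head, trained on real labels
        self.aux_head = nn.Linear(64, 1)    # consumes FM supervision only
        # Student Adapter: a tiny module that re-calibrates stale FM scores.
        self.adapter = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.main_head(h).squeeze(-1), self.aux_head(h).squeeze(-1)

model = VMWithAuxHead()
x = torch.randn(256, 32)
y = torch.randint(0, 2, (256,)).float()
fm_logit = torch.randn(256, 1)                 # logged (possibly stale) FM logits

main_logit, aux_logit = model(x)
adapted = model.adapter(fm_logit).squeeze(-1)  # SA: re-calibrate FM scores
loss_main = F.binary_cross_entropy_with_logits(main_logit, y)
loss_adapter = F.binary_cross_entropy_with_logits(adapted, y)  # SA learns from fresh labels
loss_aux = F.binary_cross_entropy_with_logits(
    aux_logit, torch.sigmoid(adapted.detach()))  # AH consumes adapted FM supervision
loss = loss_main + 0.5 * loss_aux + 0.5 * loss_adapter  # weights are illustrative
loss.backward()
```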

Performance and Experimental Evaluation

The paper evaluates ExFM on industrial-scale internal datasets and on public datasets, demonstrating substantial performance gains across models and application contexts.

Figure 4: Inference NE gains of the 1000X-scale (3.2T-parameter) FM on cross-stage VMs, showing the impact of large-scale foundation models.

  • Effectiveness across Multiple VMs: ExFM enhances VM performance across various service stages and domains, highlighting versatility in real-world applications.

Key Experimental Insights and Hyper-Parameter Impacts

The paper identifies hyperparameters such as Gradient Scaling (GS), Label Scaling (LS), and Loss Weighting (LW) as critical factors affecting ExFM performance.

Figure 5: Joint impact of LW, LS, and GS showing their influence on VM performance.

Adjusting these parameters appropriately can optimize the transfer of FM benefits to VMs.
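
One plausible reading of the three knobs, sketched below: LW weights the distillation term in the total loss, LS scales the FM soft labels, and GS damps the gradient that the distillation term sends back into the VM. The `scale_grad` trick and all constant values are assumptions for illustration, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def scale_grad(t: torch.Tensor, s: float) -> torch.Tensor:
    """Identity in the forward pass; multiplies the gradient by s in backward."""
    return t * s + t.detach() * (1.0 - s)

logits = torch.randn(256, requires_grad=True)  # VM logits on one mini-batch
y = torch.randint(0, 2, (256,)).float()        # ground-truth labels
fm_pred = torch.rand(256)                      # logged FM probabilities

lw, ls, gs = 0.5, 0.9, 0.1                     # illustrative values

soft_label = fm_pred * ls                      # LS: scale the FM soft labels
distill_logits = scale_grad(logits, gs)        # GS: damp distillation gradients
loss = (F.binary_cross_entropy_with_logits(logits, y)
        + lw * F.binary_cross_entropy_with_logits(distill_logits, soft_label))  # LW
loss.backward()
```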

Conclusion

ExFM offers a robust solution for deploying large foundation models in online ads recommendation systems, achieving efficient model serving and significant performance improvements. Its design principles and experimental validation provide a promising pathway for scaling AI systems in dynamic industrial environments, showing that models at LLM scale can be served without violating industrial latency constraints. Future research might refine the distillation and adaptation components further and extend ExFM to other domains.
