Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws (2410.11820v1)

Published 15 Oct 2024 in cs.LG

Abstract: The composition of pretraining data is a key determinant of foundation models' performance, but there is no standard guideline for allocating a limited computational budget across different data sources. Most current approaches either rely on extensive experiments with smaller models or dynamic data adjustments that also require proxy models, both of which significantly increase the workflow complexity and computational overhead. In this paper, we introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrent with model training. Unlike existing techniques, ADO does not require external knowledge, proxy models, or modifications to the model update. Instead, ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly, making it more scalable and easier to integrate. Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs. Beyond its practical benefits, ADO also provides a new perspective on data collection strategies via scaling laws.

References (75)
  1. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.
  2. Ranking via sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
  3. Efficient online data mixing for language model pre-training. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
  4. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.
  5. Discrimination transfer along a pitch continuum. Journal of Experimental Psychology, 48(4):241, 1954.
  6. Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems, 34:10876–10889, 2021.
  7. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp.  41–48, 2009.
  8. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
  9. Piqa: Reasoning about physical commonsense in natural language. In AAAI Conference on Artificial Intelligence, 2019. URL https://api.semanticscholar.org/CorpusID:208290939.
  10. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
  11. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.
  12. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  13. Skill-it! a data-driven skills framework for understanding and training language models. Advances in Neural Information Processing Systems, 36, 2024.
  14. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018. URL https://api.semanticscholar.org/CorpusID:3922816.
  15. The DeepMind JAX Ecosystem, 2020. URL http://github.com/google-deepmind.
  16. The road less scheduled. arXiv preprint arXiv:2405.15682, 2024.
  17. Dsdm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926, 2024.
  18. Bad students make great teachers: Active learning accelerates large-scale visual understanding. arXiv preprint arXiv:2312.05328, 2023.
  19. Doge: Domain reweighting with generalization estimation. ArXiv, abs/2310.15393, 2023. URL https://api.semanticscholar.org/CorpusID:264439382.
  20. Reverse curriculum generation for reinforcement learning. ArXiv, abs/1707.05300, 2017. URL https://api.semanticscholar.org/CorpusID:19181872.
  21. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
  22. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  23. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.
  24. Data mixing made efficient: A bivariate scaling law for language model pretraining. arXiv preprint arXiv:2405.14908, 2024.
  25. Charles Albert Eric Goodhart. Monetary control—the british experience. In Monetary Conditions for Economic Recovery, pp.  59–84. Springer, 1985.
  26. Scaling laws for data filtering–data curation cannot be compute agnostic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22702–22711, 2024.
  27. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320. PMLR, 2017.
  28. Towards optimal learning of language models. arXiv preprint arXiv:2402.17759, 2024.
  29. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint arXiv:2405.18392, 2024.
  30. Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487, 2022.
  31. Training compute-optimal large language models. ArXiv, abs/2203.15556, 2022. URL https://api.semanticscholar.org/CorpusID:247778764.
  32. Diversified batch selection for training acceleration. arXiv preprint arXiv:2406.04872, 2024.
  33. Marcus Hutter. Learning curve theory. arXiv preprint arXiv:2102.04074, 2021.
  34. CASED: curriculum adaptive sampling for extreme data imbalance. CoRR, abs/1807.10819, 2018. URL http://arxiv.org/abs/1807.10819.
  35. No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models, July 2023. URL http://arxiv.org/abs/2307.06440. arXiv:2307.06440 [cs].
  36. Sgd on neural networks learns functions of increasing complexity. Advances in neural information processing systems, 32, 2019.
  37. Get more for less: Principled data selection for warming up fine-tuning in llms. In ICLR, 2024. URL https://openreview.net/forum?id=QmYNBVukex.
  38. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  39. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. Differentiable Programming workshop at Neural Information Processing Systems 2021, 2021.
  40. Douglas H Lawrence. The transfer of a discrimination along a continuum. Journal of Comparative and Physiological Psychology, 45(6):511, 1952.
  41. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
  42. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023. URL https://api.semanticscholar.org/CorpusID:259515154.
  43. I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  44. At which training stage does code data help LLMs reasoning? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KIPJKST4gw.
  45. Tilting the odds at the lottery: the interplay of overparameterisation and curricula in neural networks. arXiv preprint arXiv:2406.01589, 2024.
  46. Learning latent permutations with gumbel-sinkhorn networks. arXiv preprint arXiv:1802.08665, 2018.
  47. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pp. 15630–15649. PMLR, 2022.
  48. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024.
  49. The lambada dataset: Word prediction requiring a broad discourse context. ArXiv, abs/1606.06031, 2016. URL https://api.semanticscholar.org/CorpusID:2381275.
  50. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
  51. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
  52. Infobatch: Lossless training speed up by unbiased dynamic data pruning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=C61sk5LsK6.
  53. An adversarial winograd schema challenge at scale. 2019. URL https://api.semanticscholar.org/CorpusID:199370376.
  54. The cost of training NLP models: A concise overview. CoRR, abs/2004.08900, 2020. URL https://arxiv.org/abs/2004.08900.
  55. Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  56. Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics, 35(2):876–879, 1964.
  57. Burrhus F Skinner. Reinforcement today. American Psychologist, 13(3):94, 1958.
  58. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
  59. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022.
  60. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
  61. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. ArXiv, abs/1807.07532, 2018. URL https://api.semanticscholar.org/CorpusID:49882848.
  62. Scaling law with learning rate annealing, 2024. URL https://arxiv.org/abs/2408.11029.
  63. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  64. Attention is all you need. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:13756489.
  65. Rethinking data shapley for data selection tasks: Misleads and merits. arXiv preprint arXiv:2405.03875, 2024.
  66. Crowdsourcing multiple choice science questions. ArXiv, abs/1707.06209, 2017. URL https://api.semanticscholar.org/CorpusID:1553193.
  67. When do curricula work? In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=tW4QEInpni.
  68. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=09iOdaeOzp.
  69. LESS: Selecting influential data for targeted instruction tuning. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=PG5fV50maR.
  70. Reparameterizable subset sampling via continuous relaxations, 2021. URL https://arxiv.org/abs/1901.10517.
  71. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024.
  72. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952, 2024.
  73. Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:159041722.
  74. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  75. midGPT: a simple and hackable repository for llm pretraining. 2023. URL https://github.com/AllanYangZhou/midGPT.

Summary

  • The paper introduces ADO, a dynamic algorithm that optimizes data sampling in real time by leveraging per-domain scaling laws.
  • It employs domain-specific scaling laws to predict loss trajectories, emphasizing data with higher learning potential during training.
  • Empirical tests on up to 1.3B parameter models show ADO enhances zero-shot accuracy and achieves performance on par with or better than existing methods.

Overview of Adaptive Data Optimization in Foundation Model Training

The paper presents Adaptive Data Optimization (ADO), an approach to choosing the data mixture used to pretrain foundation models. Because foundation models are large neural networks pretrained on extensive datasets, the composition of their pretraining data is a key determinant of performance. The proposed method sidesteps the computational overhead of existing data selection strategies, which typically rely on proxy models or multi-stage training.

Key Concepts and Methodology

The primary innovation of this work is ADO, which optimizes the data distribution online, concurrently with model training. Unlike conventional methods that require extensive preliminary experiments with smaller models or depend on proxy models, ADO uses per-domain scaling laws to estimate the learning potential of each data domain in real time. The algorithm is streamlined and scalable, integrating into existing training workflows without requiring any change to the model update.
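To make the integration point concrete, the sketch below shows a hypothetical training loop in which an ADO-style controller only decides which domain the next batch is drawn from and receives the resulting loss; the optimizer step itself is untouched. The DummyModel, stream callables, and hook names are illustrative stand-ins, not the paper's actual interfaces.

```python
# Minimal sketch (assumed interfaces, not the paper's code): the data-mixture
# controller sits entirely outside the model update.
import random

class DummyModel:
    """Stand-in for a real model; train_step would normally run forward/backward/optimizer."""
    def train_step(self, batch):
        return random.random()  # pretend loss

def training_loop(model, domain_streams, sample_domain, record_loss, num_steps):
    """domain_streams: dict name -> callable returning the next batch.
    sample_domain / record_loss: hooks an ADO-style controller would supply."""
    for step in range(num_steps):
        k = sample_domain()                           # pick a domain under the current mixture
        loss = model.train_step(domain_streams[k]())  # ordinary update, unchanged by the sampler
        record_loss(k, loss, step)                    # feed the per-domain loss back to the controller

# Toy usage: a fixed uniform choice stands in for the adaptive policy.
streams = {"web": lambda: "web batch", "code": lambda: "code batch"}
training_loop(DummyModel(), streams,
              sample_domain=lambda: random.choice(list(streams)),
              record_loss=lambda k, loss, step: None,
              num_steps=5)
```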

Domain Scaling Laws: ADO fits a domain-specific scaling law, in power-law form, to predict the model's loss trajectory on each domain. These fits estimate how much each domain can still contribute to learning, accounting for its remaining reducible loss and the speed at which that loss is decreasing.
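As a rough illustration of what such a per-domain fit might look like, the sketch below assumes a power law of the form L_k(t) = eps_k + beta_k * t^(-alpha_k) (irreducible loss plus a decaying term) and uses its time derivative as a proxy for remaining learning potential. The parameterization, fitting routine, and toy data are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: fit a per-domain power law to a loss history and read off its slope.
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, eps, beta, alpha):
    """Assumed per-domain scaling law: irreducible loss eps plus a decaying power-law term."""
    return eps + beta * np.power(t, -alpha)

def fit_domain_scaling_law(steps, losses):
    """Fit (eps, beta, alpha) for one domain from its observed loss curve."""
    p0 = (losses.min(), losses[0] - losses.min() + 1e-3, 0.5)  # rough initial guess
    params, _ = curve_fit(power_law, steps, losses, p0=p0, maxfev=10_000)
    return params

def learning_potential(t, eps, beta, alpha):
    """Negative time-derivative of the fitted law: how fast the loss is still falling."""
    return alpha * beta * np.power(t, -(alpha + 1.0))

# Toy usage: synthetic loss curve for one domain.
steps = np.arange(1, 2001, dtype=float)
losses = 2.0 + 3.0 * steps**-0.4 + 0.01 * np.random.default_rng(0).standard_normal(len(steps))
eps, beta, alpha = fit_domain_scaling_law(steps, losses)
print(eps, beta, alpha, learning_potential(steps[-1], eps, beta, alpha))
```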

Adaptive Distribution: The method continuously refines the data sampling distribution, placing more emphasis on domains with greater learning potential, that is, higher learning speed and more remaining reducible loss. The approach is akin to curriculum learning, but the curriculum is constructed online and automatically throughout training.
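A simplified sketch of this reweighting step is shown below: per-domain learning-potential estimates are combined with a prior mixture and smoothed so that no domain is starved. The specific weighting, temperature, and smoothing scheme are illustrative choices, not the paper's exact update rule.

```python
# Hedged sketch: turn learning-potential estimates into a sampling distribution.
import numpy as np

def mixture_from_potentials(potentials, prior, temperature=1.0, smoothing=0.1):
    """potentials: array of learning-potential estimates, one per domain.
    prior: the natural/original token distribution over domains."""
    potentials = np.asarray(potentials, dtype=float)
    prior = np.asarray(prior, dtype=float)
    scores = prior * np.power(np.clip(potentials, 1e-12, None), 1.0 / temperature)
    policy = scores / scores.sum()
    # Interpolate with the prior so no domain's probability collapses to zero.
    return (1.0 - smoothing) * policy + smoothing * prior

# Toy usage with three domains; potentials could come from the fitted scaling laws above.
prior = np.array([0.5, 0.3, 0.2])
potentials = np.array([1e-4, 5e-4, 2e-4])
print(mixture_from_potentials(potentials, prior))
```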

Empirical Results

ADO was evaluated on the Pile dataset with LLMs of up to 1.3 billion parameters. Across various benchmarks, it achieved performance comparable or superior to prior techniques such as DoReMi and ODM, and it did so with minimal additional computational time, remaining efficient and scalable across model sizes.

ADO's ability to enhance model performance was apparent in multiple dimensions:

  • Validation Loss: While ADO slightly underperforms on the Pile validation set, it improves the validation loss on SlimPajama and FineWeb subsets, suggesting a tilt towards higher-quality data selection even without explicit curation.
  • Zero-shot Performance: ADO's dynamic data mixture significantly improves zero-shot accuracy across diverse benchmarks, indicating better generalization capabilities.

Implications and Future Directions

This work has both practical and theoretical implications. Practically, ADO integrates cleanly into existing training pipelines, offering an accessible mechanism for improving model performance without substantial additional compute. Theoretically, it highlights the potential of online adaptive data selection and raises new questions about its application across broader AI domains.

Future research could extend ADO to larger models and datasets, testing its adaptability and efficacy in more complex settings. Incorporating richer scaling laws that account for inter-domain interactions, or for adaptive learning-rate schedules, could further refine the data selection. Overall, ADO is a notable step toward automating and optimizing data selection for large-scale pretraining.