
Scalability of meta-training DataRater with fully dense inner updates at extreme model scales

Determine whether meta-training the DataRater model via meta-gradients can scale to extremely large foundation models when the inner model updates are fully dense, and, if necessary, develop scalable bilevel optimisation methods that enable such meta-training at these scales.


Background

DataRater is a meta-learned data valuation model trained with meta-gradients obtained by back-propagating through multiple inner updates of an LLM. This bilevel optimisation setup requires computing second-order derivatives, which is computationally and memory intensive.
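To make the bilevel setup concrete, the following is a minimal, hypothetical JAX sketch of computing a meta-gradient by back-propagating through a few fully dense inner updates. It is not the DataRater implementation: a tiny linear model stands in for the LLM, and the names and choices (`inner_loss`, `rater_params`, the sigmoid scoring head, the learning rate and step count) are illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def inner_loss(inner_params, rater_params, x, y):
    # Per-example losses, reweighted by the meta-learned data-scoring model.
    preds = x @ inner_params["w"]
    per_example = (preds - y) ** 2
    weights = jax.nn.sigmoid(x @ rater_params["v"])  # rater assigns a weight per example
    return jnp.mean(weights * per_example)


def outer_loss(inner_params, x_val, y_val):
    # Unweighted held-out loss: the meta objective the rater is trained to improve.
    preds = x_val @ inner_params["w"]
    return jnp.mean((preds - y_val) ** 2)


def meta_loss(rater_params, inner_params, batches, val_batch, lr=1e-2):
    # Fully dense inner updates: every inner parameter is updated and later
    # differentiated through, which is what becomes expensive at LLM scale.
    for x, y in batches:
        grads = jax.grad(inner_loss)(inner_params, rater_params, x, y)
        inner_params = jax.tree_util.tree_map(lambda p, g: p - lr * g,
                                              inner_params, grads)
    x_val, y_val = val_batch
    return outer_loss(inner_params, x_val, y_val)


# Differentiating the unrolled computation w.r.t. the rater parameters yields the
# meta-gradient; this involves second-order derivatives of the inner loss.
meta_grad_fn = jax.jit(jax.grad(meta_loss, argnums=0))

d, n = 8, 16
keys = jax.random.split(jax.random.PRNGKey(0), 6)
inner_params = {"w": jnp.zeros((d,))}
rater_params = {"v": jnp.zeros((d,))}
batches = [(jax.random.normal(keys[i], (n, d)), jax.random.normal(keys[i + 1], (n,)))
           for i in range(0, 4, 2)]
val_batch = (jax.random.normal(keys[4], (n, d)), jax.random.normal(keys[5], (n,)))

meta_grads = meta_grad_fn(rater_params, inner_params, batches, val_batch)
```

Because the outer gradient flows through the inner gradients, the cost of this computation grows with both the inner model size and the number of unrolled steps, which is the scaling obstacle the question targets.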

The paper demonstrates feasibility for populations of inner models with up to 400M parameters (plus limited experiments on larger models using low-rank updates such as LoRA for supervised fine-tuning), aided by memory-reduction techniques such as MixFlow-MG. However, extending meta-training to extremely large foundation models with fully dense inner updates remains uncertain and may require new algorithmic advances in scalable bilevel optimisation.
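One concrete bottleneck is the memory needed to store activations across the unrolled inner steps. MixFlow-MG itself is not reproduced here, but the generic idea of trading compute for memory can be sketched, continuing the example above (and reusing its `inner_loss` and `outer_loss`), by rematerialising each inner step with `jax.checkpoint`:

```python
import jax


def make_checkpointed_step(lr=1e-2):
    # Rematerialise each inner step during the backward pass instead of storing
    # its intermediates; a generic memory/compute trade-off, not MixFlow-MG.
    @jax.checkpoint
    def step(inner_params, rater_params, x, y):
        grads = jax.grad(inner_loss)(inner_params, rater_params, x, y)
        return jax.tree_util.tree_map(lambda p, g: p - lr * g, inner_params, grads)
    return step


def meta_loss_remat(rater_params, inner_params, batches, val_batch):
    step = make_checkpointed_step()
    for x, y in batches:
        inner_params = step(inner_params, rater_params, x, y)
    x_val, y_val = val_batch
    return outer_loss(inner_params, x_val, y_val)


# As before, jax.grad(meta_loss_remat, argnums=0) gives the meta-gradient, now
# with lower peak memory at the cost of recomputing inner steps during backprop.
```

Even with such rematerialisation, the compute and memory of differentiating through fully dense updates of an extremely large model remain the open obstacle the question asks about.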

References

However, the scalability of meta-training DataRater models for extremely large foundation models with fully dense inner updates remains an open question, and may require further algorithmic advancements in scalable bilevel optimisation.

DataRater: Meta-Learned Dataset Curation (2505.17895 - Calian et al., 23 May 2025) in Appendix: Limitations