DavIR: Data Selection via Implicit Reward for Large Language Models (2310.13008v2)

Published 16 Oct 2023 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce DavIR, a model-based data selection method for post-training LLMs. DavIR generalizes Reducible Holdout Loss to the core-set selection problem of causal language modeling, and quantifies the learnability of a given datum with respect to a pre-trained LLM based on the relative reduction in loss during fine-tuning, a metric we show to be closely related to the implicit reward model described in Direct Preference Optimization (DPO). We show that 6% of the Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model families to produce superior performance compared to the same models trained on the full 52K dataset. We also show that the Alpaca dataset compressed with DavIR can be combined with the GSM8K dataset to effectively balance open-domain freeform QA and mathematical reasoning capabilities. Finally, we apply the DavIR objective to DPO and develop a normalized DavIR-DPO objective, which improves the alignment performance of the Zephyr-7B-SFT model by 8% (relative) on AlpacaEval compared against training with the vanilla DPO objective.
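The abstract does not spell out DavIR's scoring rule, but its core idea, scoring each training example by how much its loss drops under fine-tuning relative to its loss under the pre-trained reference model, can be sketched in a few lines. The following is a minimal illustration assuming a relative-reduction score of the form (L_ref - L_ft) / L_ref and a top-fraction selection rule; the function names, the normalization, and the toy losses are assumptions, not the paper's definitions.

    from typing import Sequence

    def davir_scores(ref_losses: Sequence[float],
                     ft_losses: Sequence[float],
                     eps: float = 1e-8) -> list[float]:
        # Relative loss reduction per example: (L_ref - L_ft) / (L_ref + eps).
        # Assumed form of the "learnability" score; not the paper's exact formula.
        return [(lr - lf) / (lr + eps) for lr, lf in zip(ref_losses, ft_losses)]

    def select_top_fraction(examples: Sequence[str],
                            scores: Sequence[float],
                            fraction: float = 0.06) -> list[str]:
        # Keep the top `fraction` of examples by score (6% echoes the abstract).
        k = max(1, int(len(examples) * fraction))
        ranked = sorted(zip(scores, examples), key=lambda p: p[0], reverse=True)
        return [ex for _, ex in ranked[:k]]

    # Toy usage with made-up per-example losses; in practice these would come
    # from evaluating each example under the reference and fine-tuned checkpoints.
    examples = [f"instruction_{i}" for i in range(100)]
    ref_losses = [2.0 + 0.01 * i for i in range(100)]
    ft_losses = [1.0 + 0.015 * i for i in range(100)]
    subset = select_top_fraction(examples, davir_scores(ref_losses, ft_losses))
    print(len(subset), subset[:3])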

Authors (8)
  1. Haotian Zhou (8 papers)
  2. Tingkai Liu (9 papers)
  3. Qianli Ma (77 papers)
  4. Jianbo Yuan (33 papers)
  5. Pengfei Liu (191 papers)
  6. Yang You (173 papers)
  7. Hongxia Yang (130 papers)
  8. Yufeng Zhang (67 papers)
Citations (6)