COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework
Abstract: In LLM alignment and many other ML applications, one often faces the Multi-Objective Fine-Tuning (MOFT) problem, i.e., fine-tuning an existing model on datasets labeled with respect to several objectives simultaneously. To address this challenge, we propose a Conditioned One-Shot fine-tuning framework (COS-DPO) that extends the Direct Preference Optimization (DPO) technique, originally developed for efficient LLM alignment with preference data, to the MOFT setting. By directly conditioning on the weights assigned to the auxiliary objectives, our Weight-COS-DPO method profiles the Pareto front in a single one-shot training run and achieves comprehensive trade-off solutions even in the post-training stage. Building on our theoretical findings on the linear transformation properties of the loss function, we further propose the Temperature-COS-DPO method, which augments the model input with the temperature parameter, enhancing the flexibility of post-training control over the trade-off between the main and auxiliary objectives. We demonstrate the effectiveness and efficiency of the COS-DPO framework on various tasks, including Learning-to-Rank (LTR) and LLM alignment, highlighting its viability for large-scale ML deployments.
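For reference, the standard DPO objective for a policy \pi_\theta with a frozen reference model \pi_{\mathrm{ref}} and temperature \beta is

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].

As a minimal sketch of the weight conditioning described above, assuming one preference dataset \mathcal{D}_k per objective (k = 0 for the main objective, k = 1, \dots, K for the auxiliary ones) and a policy \pi_\theta(\cdot \mid \cdot, w) that takes the weight vector w as an additional input (the paper's exact parameterization may differ), a Weight-COS-DPO-style objective could take the form

\mathcal{L}_{\mathrm{W\text{-}COS\text{-}DPO}}(\theta) = \mathbb{E}_{w \sim \Delta^K} \left[ \sum_{k=0}^{K} w_k \, \mathcal{L}^{(k)}_{\mathrm{DPO}}\!\big(\pi_\theta(\cdot \mid \cdot, w)\big) \right],

where \mathcal{L}^{(k)}_{\mathrm{DPO}} denotes the DPO loss computed on \mathcal{D}_k and \Delta^K is the probability simplex over the K+1 objectives. Under this sketch, sampling w during training is what would let a single one-shot run profile the Pareto front, with trade-offs selected post-training by varying w at inference time.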