Sample Design Engineering: An Empirical Study of What Makes Good Downstream Fine-Tuning Samples for LLMs (2404.13033v1)

Published 19 Apr 2024 in cs.CL

Abstract: In the burgeoning field of LLMs like ChatGPT and LLaMA, Prompt Engineering (PE) is renowned for boosting zero-shot or in-context learning (ICL) through prompt modifications. Yet the realm of sample design for downstream fine-tuning, crucial for task-specific LLM adaptation, is largely unexplored. This paper introduces Sample Design Engineering (SDE), a methodical approach to enhancing LLMs' post-tuning performance by refining input, output, and reasoning designs. We conduct a series of in-domain (ID) and out-of-domain (OOD) experiments to assess the impact of various design options on LLMs' downstream performance, revealing several intriguing patterns that hold consistently across different LLMs. Based on these insights, we propose an integrated SDE strategy, combining the most effective options, and validate its consistent superiority over heuristic sample designs in complex downstream tasks like multi-aspect sentiment analysis, event extraction, and nested entity recognition. Additionally, analyses of LLMs' inherent prompt/output perplexity, zero-shot, and ICL abilities illustrate that good PE strategies may not always translate to good SDE strategies. Code available at https://github.com/beyondguo/LLM-Tuning.
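
To make the notion of "sample design" concrete, below is a minimal, hypothetical Python sketch of how a single fine-tuning sample for multi-aspect sentiment analysis could be assembled under a few alternative design options: where the instruction sits in the input, structured (JSON) versus line-based output, and an optional reasoning prefix. The function name build_sample, the option flags, and the output formats are illustrative assumptions for this page, not the paper's exact specifications or its released code.

    # Hypothetical sketch: assembling one fine-tuning sample under different
    # "sample design" options (input, output, and reasoning designs).
    import json

    ASPECTS = ["food", "service", "price"]

    def build_sample(review: str, labels: dict, inst_first: bool = True,
                     json_output: bool = True, add_reasoning: bool = False) -> dict:
        """Return one {prompt, response} pair for supervised fine-tuning."""
        instruction = (
            "Rate the sentiment (positive/negative/neutral) of the review "
            f"for these aspects: {', '.join(ASPECTS)}."
        )
        # Input design: instruction placed before or after the raw text.
        if inst_first:
            prompt = f"{instruction}\n\nReview: {review}"
        else:
            prompt = f"Review: {review}\n\n{instruction}"

        # Output design: structured JSON vs. plain "aspect: label" lines,
        # with unmentioned aspects filled explicitly.
        aspect_labels = {a: labels.get(a, "not mentioned") for a in ASPECTS}
        if json_output:
            target = json.dumps(aspect_labels, ensure_ascii=False)
        else:
            target = "\n".join(f"{a}: {v}" for a, v in aspect_labels.items())

        # Reasoning design: optionally prepend a brief rationale before the answer.
        if add_reasoning:
            target = "Reasoning: consider each aspect mentioned in the review.\n" + target

        return {"prompt": prompt, "response": target}

    if __name__ == "__main__":
        sample = build_sample(
            "Great noodles but the waiter was rude.",
            {"food": "positive", "service": "negative"},
            inst_first=True, json_output=True,
        )
        print(json.dumps(sample, indent=2, ensure_ascii=False))

In the spirit of the abstract, an SDE study would hold the task and training data fixed and vary only such design options, then compare post-tuning ID and OOD performance across the resulting sample formats.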
