
On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning (2312.13772v2)

Published 21 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Following the standard supervised fine-tuning (SFT) paradigm, in-context learning (ICL) has become an efficient approach propelled by recent advancements in LLMs, yielding promising performance across various tasks in few-shot data setups. However, both paradigms are prone to overconfidence (i.e., miscalibration), especially in such limited-data setups. In this work, we deliver an in-depth analysis of how different choices of learning method behave in terms of both task performance and calibration, as well as their interplay. Through extensive controlled experiments, we find that simultaneous gains in task performance and calibration are difficult to achieve, and that miscalibration persists across all learning methods in low-resource scenarios. To address this challenging trade-off between performance and calibration, we then investigate the potential of self-ensembling techniques applied at different modeling stages (e.g., variations of in-context examples, variations in prompts, or different ensembling strategies). We demonstrate the feasibility of self-ensembling for SFT in addition to ICL, making predictions better calibrated while achieving comparable or even better performance. Our work sheds light on which learning paradigm to choose and how to enhance both the task performance and calibration of LLMs.
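
To make the two notions paired in the abstract more concrete, the sketch below is a minimal illustration (not the paper's actual implementation) of self-ensembling a classifier over several prompt or in-context-example variations by averaging their class probabilities, and of measuring miscalibration with the standard expected calibration error (ECE). The function names, binning scheme, and toy data are assumptions made purely for illustration.

    # Minimal sketch: prediction-level self-ensembling and ECE.
    # All names and data here are hypothetical, for illustration only.
    import numpy as np

    def self_ensemble(prob_sets: np.ndarray) -> np.ndarray:
        """Average class probabilities over prompt/in-context variants.

        prob_sets: shape (n_variants, n_examples, n_classes), one
        probability distribution per example per variant.
        """
        return prob_sets.mean(axis=0)

    def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                                   n_bins: int = 10) -> float:
        """ECE: bin-weighted gap between confidence and accuracy."""
        confidences = probs.max(axis=1)
        predictions = probs.argmax(axis=1)
        accuracies = (predictions == labels).astype(float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                ece += mask.mean() * abs(accuracies[mask].mean()
                                         - confidences[mask].mean())
        return ece

    # Toy usage: three prompt variants, four examples, two classes.
    rng = np.random.default_rng(0)
    variant_probs = rng.dirichlet(alpha=[2.0, 2.0], size=(3, 4))  # (3, 4, 2)
    labels = np.array([0, 1, 1, 0])
    ensembled = self_ensemble(variant_probs)
    print("ensembled probs:\n", ensembled)
    print("ECE:", expected_calibration_error(ensembled, labels, n_bins=5))

In the paper's setting, the variants would come from different in-context example selections, prompt formulations, or ensembling strategies, with the ensembled predictions compared against single-run SFT and ICL baselines on both task performance and calibration.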

Authors (5)
  1. Chengzu Li (11 papers)
  2. Han Zhou (72 papers)
  3. Goran Glavaš (82 papers)
  4. Anna Korhonen (90 papers)
  5. Ivan Vulić (130 papers)
Citations (8)

