Supervised Knowledge Makes Large Language Models Better In-context Learners (2312.15918v2)

Published 26 Dec 2023 in cs.CL and cs.AI

Abstract: LLMs exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-specific fine-tuned language models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions regarding generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. The code and data are released at: https://github.com/YangLinyi/Supervised-Knowledge-Makes-Large-Language-Models-Better-In-context-Learners. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.

Introduction

LLMs such as ChatGPT and Llama 2 have shown adeptness across a broad spectrum of natural language processing tasks. Their in-context learning (ICL) capabilities have been particularly noteworthy, allowing them to carry out tasks with minimal additional training. Despite these advances, LLMs continue to face challenges with generalization and with avoiding factually incorrect output, an issue commonly referred to as "hallucination." Prior ICL research has focused mainly on prompt engineering to direct model behavior and has not substantially explored how task-specific fine-tuned language models (SLMs) might improve in-context learning performance at inference time. This paper introduces SuperContext, a framework that leverages supervised knowledge from SLMs to enhance the in-context learning capabilities of LLMs.

Methodology

The methodology section outlines the baseline in-context learning setup and introduces the SuperContext strategy. Traditional ICL places a few examples in the prompt to steer the LLM's responses. SuperContext augments this by inserting a "receipt" of supervised knowledge from an SLM, consisting of the SLM's prediction and confidence score, into the LLM prompt. This injects task-specific knowledge that helps the LLM make more reliable decisions, particularly on out-of-distribution (OOD) data. The paper also gives the formulation underpinning this design, positing that the predictions of an LLM prompted with SLM receipts are invariant to the specific choice of few-shot examples. In addition, the few-shot examples are resampled three times to mitigate the effect of example selection and ordering on model performance.
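
To make the prompt format concrete, the following Python sketch shows one way such a receipt could be spliced into an in-context learning prompt. The receipt wording and field layout are illustrative assumptions, not the paper's exact template.

def build_supercontext_prompt(task_instruction: str,
                              few_shot_examples: list[str],
                              slm_prediction: str,
                              slm_confidence: float,
                              query: str) -> str:
    """Compose an ICL prompt that carries the SLM's prediction and confidence."""
    demo_block = "\n".join(few_shot_examples)
    # Hypothetical phrasing of the "receipt"; the paper's template may differ.
    receipt = (f"A task-specific fine-tuned model predicts '{slm_prediction}' "
               f"with confidence {slm_confidence:.2f}.")
    return (f"{task_instruction}\n\n"
            f"{demo_block}\n\n"
            f"{receipt}\n"
            f"Input: {query}\n"
            "Answer:")

# Example usage with toy sentiment data.
prompt = build_supercontext_prompt(
    task_instruction="Classify the sentiment of the input as positive or negative.",
    few_shot_examples=["Input: A delightful film. Answer: positive",
                       "Input: A tedious mess. Answer: negative"],
    slm_prediction="positive",
    slm_confidence=0.93,
    query="The plot drags, but the performances are wonderful.")
print(prompt)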

Experiments and Results

Experiments were conducted across a range of natural language understanding (NLU) and question answering (QA) tasks. ELECTRA-large and RoBERTa-large served as the SLMs, while ChatGPT and Llama 2 served as the backbone LLMs. On NLU tasks, SuperContext consistently outperformed both traditional ICL methods and the standalone LLMs, indicating its effectiveness for OOD generalization and factual accuracy. On QA, SuperContext was particularly effective at reducing hallucinations, showing substantial improvements over existing methods in both zero-shot and few-shot settings.
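
As a concrete illustration of the SLM side of this pipeline, the sketch below uses the Hugging Face transformers library to obtain a predicted label and softmax confidence from a fine-tuned discriminative classifier. The checkpoint path is a placeholder and the helper name is hypothetical; substitute the task-specific ELECTRA or RoBERTa checkpoint actually used.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path: substitute the task-specific fine-tuned ELECTRA or RoBERTa checkpoint.
SLM_CHECKPOINT = "path/to/fine-tuned-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(SLM_CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(SLM_CHECKPOINT)
model.eval()

def slm_receipt(text: str) -> tuple[str, float]:
    """Return the SLM's predicted label and its softmax confidence for one input."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    pred_id = int(torch.argmax(probs))
    return model.config.id2label[pred_id], float(probs[pred_id])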

Analysis and Discussion

The analysis offers insight into how integrating SLM outputs alters the LLMs' decision-making. Notably, in a small but significant proportion of cases the SLM's prediction led the LLM to reverse its initial answer, improving accuracy. A post-hoc analysis further suggests that SuperContext makes better use of in-context examples and produces rationales that are better aligned with human expectations. The paper also observes that LLM performance correlates positively with the SLM's confidence, supporting the validity of the approach. Finally, while practical and effective, the framework has limitations, such as its reliance on the complementarity of SLMs and LLMs and the need for broader evaluation across different LLMs.
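
The reported positive correlation between SLM confidence and LLM performance suggests a simple post-hoc check: bucket test instances by the SLM's confidence and compute LLM accuracy per bucket. The sketch below assumes an illustrative record format and is not the paper's released evaluation code.

from collections import defaultdict

def accuracy_by_confidence(records, num_bins: int = 5):
    """records: iterable of (slm_confidence, llm_prediction, gold_label) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for conf, pred, gold in records:
        b = min(int(conf * num_bins), num_bins - 1)  # clamp conf == 1.0 into the top bin
        totals[b] += 1
        hits[b] += int(pred == gold)
    # Map each confidence interval to the LLM accuracy observed inside it.
    return {(b / num_bins, (b + 1) / num_bins): hits[b] / totals[b] for b in sorted(totals)}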

Conclusion

In summary, SuperContext demonstrates the potential of task-specific SLMs to significantly improve the reliability of LLMs in in-context learning tasks, especially in out-of-distribution scenarios. It provides a new dimension to the interaction between SLMs and LLMs, suggesting a promising avenue for enhancing the performance of generative LLMs without the need for extensive fine-tuning or reliance on large, potentially unwieldy external knowledge bases. Future work may expand the application of SuperContext to other text generation tasks and explore its utility in real-world applications.

Authors (11)
  1. Linyi Yang
  2. Shuibai Zhang
  3. Zhuohao Yu
  4. Guangsheng Bao
  5. Yidong Wang
  6. Jindong Wang
  7. Ruochen Xu
  8. Wei Ye
  9. Xing Xie
  10. Weizhu Chen
  11. Yue Zhang