
Contrastive Instruction Tuning (2402.11138v2)

Published 17 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Instruction tuning has been used as a promising approach to improve the performance of LLMs on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. This behavior indicates LLMs' lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning (CoIN), which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy. Code is available at https://github.com/luka-group/CoIN.

Enhancing Robustness to Instruction Variations in LLMs with Contrastive Instruction Tuning (CoIN)

Introduction

LLMs have made significant strides in understanding and executing diverse human instructions through instruction tuning. However, their application in real-world scenarios is hindered by their sensitivity to slight variations in instruction phrasing or style, which often leads to inconsistent outputs. To address this challenge, we present Contrastive Instruction Tuning (CoIN), which improves LLMs' robustness and output consistency across different instruction variations without sacrificing accuracy.

The Core of CoIN

CoIN introduces a method to enhance the consistency and robustness of LLMs to variations in textual instructions. The approach leverages contrastive learning and focuses on two main aspects, with an illustrative formulation of the resulting objective sketched after the list:

  • Positive Sample Augmentation: By paraphrasing task instructions, CoIN creates positive instruction-instance pairs that, although textually different, are semantically equivalent.
  • Hard Negative Sampling: CoIN identifies hard negative samples by pairing the same instruction with different instance inputs and outputs, providing a stronger learning signal that encourages the model to discern finer semantic differences.
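
The paper's exact loss is not reproduced in this summary; as a rough guide, an InfoNCE-style objective consistent with the description above is sketched below. The notation (h_i for the hidden representation of an instruction-instance pair, h_i^+ for its paraphrased counterpart, N(i) for its negatives, sim for a similarity such as cosine, and τ for a temperature) is illustrative, not the paper's verbatim definition.

```latex
% Illustrative InfoNCE-style contrastive objective (a sketch, not the paper's
% verbatim definition):
%   h_i    : hidden representation of instruction-instance pair i
%   h_i^+  : representation of its paraphrased, semantically equivalent counterpart
%   N(i)   : negatives for i, including hard negatives that reuse the instruction
%            with different instance inputs and outputs
%   sim    : similarity function (e.g., cosine);  tau : temperature
\mathcal{L}_{\mathrm{con}}
  = -\sum_{i} \log
    \frac{\exp\!\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}
         {\exp\!\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)
          + \sum_{h^{-} \in \mathcal{N}(i)} \exp\!\left(\mathrm{sim}(h_i, h^{-})/\tau\right)}
```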

Key Contributions and Findings

The implementation and evaluation of CoIN reveal several critical insights and advancements for instruction-tuned LLMs. The contributions and findings include:

  • Improved Robustness: CoIN consistently enhances LLMs' robustness against unseen instructions with variations at multiple levels, demonstrating an average accuracy improvement of +2.5% on the PromptBench benchmark.
  • Augmented Instruction Dataset: The FLAN collection is expanded with paraphrased instructions, adding 52k entries and 104k instructions as a resource for future research on instruction robustness (a sketch of how such pairs might be assembled follows this list).
  • Efficacy Across Tasks: Experimental results show significant improvements, especially on tasks such as paraphrase identification and grammar correctness, illustrating CoIN's ability to sharpen model sensitivity to semantic nuances.
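
As a concrete illustration of the augmented-data idea referenced above, the sketch below builds one positive and one hard-negative example per record. The field names ("instruction", "input", "output") and the paraphrase helper are hypothetical placeholders, not the released dataset's actual schema or tooling.

```python
# Hedged sketch: assembling contrastive pairs from FLAN-style records.
# Field names and the paraphrase() callable are illustrative assumptions.
import random
from typing import Callable, Dict, List


def build_contrastive_pairs(
    records: List[Dict[str, str]],
    paraphrase: Callable[[str], str],
) -> List[Dict[str, Dict[str, str]]]:
    """For each record, create a positive (paraphrased instruction, same instance)
    and a hard negative (same instruction, different instance)."""
    assert len(records) >= 2, "need at least two records to draw a different instance"
    pairs = []
    for i, rec in enumerate(records):
        # Positive: semantically equivalent instruction, identical instance.
        positive = {
            "instruction": paraphrase(rec["instruction"]),
            "input": rec["input"],
            "output": rec["output"],
        }
        # Hard negative: identical instruction text, different instance content,
        # so the model must attend to the instance rather than surface wording.
        other = records[(i + random.randrange(1, len(records))) % len(records)]
        hard_negative = {
            "instruction": rec["instruction"],
            "input": other["input"],
            "output": other["output"],
        }
        pairs.append({"anchor": rec, "positive": positive, "hard_negative": hard_negative})
    return pairs
```

The swap above mirrors the hard-negative construction described earlier: the instruction text is held fixed while the instance input and output change, forcing the representation to reflect the instance rather than the instruction's surface form.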

Empirical Insights

Through extensive experiments, several empirical insights have emerged:

  • Alignment of Hidden Representations: Visualizing hidden representations with UMAP shows that CoIN clusters semantically equivalent instructions closer together, reducing the impact of textual variation.
  • Significant Gains on Specific Tasks: CoIN notably improves performance on tasks that benefit from refined semantic understanding, such as paraphrase identification and grammar correctness, by +5.4% and +6.3%, respectively.
  • Optimized Loss Weighting: The paper finds an optimal contrastive loss weight of λ=1,000, ensuring that the contrastive term neither dominates nor vanishes within the total loss, balancing model performance and robustness (a sketch of this weighting appears below).
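
To make the loss-weighting point concrete, the sketch below shows one plausible way a language-modeling loss and a contrastive term could be combined. The cosine similarity, temperature, tensor shapes, and function names are illustrative assumptions; only the λ weighting follows the value reported above.

```python
# Hedged sketch: weighting a contrastive term against the LM loss.
# The InfoNCE details here are assumptions, not the paper's exact implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negatives: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over hidden representations.

    anchor, positive: (batch, dim); negatives: (batch, n_neg, dim).
    """
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / temperature                # (batch,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (batch, n_neg)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)                           # positive at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)


def total_loss(lm_loss: torch.Tensor,
               anchor: torch.Tensor,
               positive: torch.Tensor,
               negatives: torch.Tensor,
               lam: float = 1000.0) -> torch.Tensor:
    """Total objective: LM loss plus a lambda-weighted contrastive term."""
    return lm_loss + lam * contrastive_loss(anchor, positive, negatives)
```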

Future Directions and Limitations

While CoIN represents a significant step forward, it also has limitations and opens several avenues for future research. For instance, CoIN's reliance on paraphrasing for positive sample creation could be broadened to other semantic-invariant augmentation methods. Extending the evaluation to more instruction-tuned models, datasets, and downstream tasks could further validate CoIN's effectiveness, and exploring alternative evaluation metrics and conditions could offer a more comprehensive view of its impact.

Conclusion

The introduction of Contrastive Instruction Tuning (CoIN) marks a notable advancement in the quest to enhance LLMs' robustness and reliability in interpreting and executing varied human instructions. By leveraging contrastive learning, CoIN not only achieves significant improvements in model performance across a range of tasks but also contributes valuable resources and insights to the field. As LLMs continue to play a pivotal role in AI-based applications, methods like CoIN that enhance robustness and consistency will be crucial to realizing their full potential in real-world scenarios.

References (56)
  1. Enhancing Logical Reasoning of Large Language Models through Logic-Driven Data Augmentation. ArXiv:2305.12599 [cs].
  2. The second PASCAL recognising textual entailment challenge.
  3. The Fifth PASCAL Recognizing Textual Entailment Challenge.
  4. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
  5. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
  6. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, Lecture Notes in Computer Science, pages 177–190, Berlin, Heidelberg. Springer.
  7. William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  8. Exploring the Limits of Out-of-Distribution Detection. In Advances in Neural Information Processing Systems, volume 34, pages 7068–7081. Curran Associates, Inc.
  9. Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. ArXiv:1801.04354 [cs].
  10. SimCSE: Simple Contrastive Learning of Sentence Embeddings. ArXiv:2104.08821 [cs].
  11. The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague. Association for Computational Linguistics.
  12. Twitter Sentiment Classification using Distant Supervision.
  13. Robustness of Learning from Task Instructions. ArXiv:2212.03813 [cs].
  14. Toward Semantics-Based Answer Pinpointing. In Proceedings of the First International Conference on Human Language Technology Research.
  15. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. ArXiv:1907.11932 [cs].
  16. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.
  17. Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks. ArXiv:2311.00288 [cs].
  18. ContrastNER: Contrastive-based Prompt Tuning for Few-shot NER. In 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), pages 241–249. ArXiv:2305.17951 [cs].
  19. The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, pages 552–561, Rome, Italy. AAAI Press.
  20. Xin Li and Dan Roth. 2002. Learning Question Classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
  21. Exploring Format Consistency for Instruction Tuning. ArXiv:2307.15504 [cs].
  22. How Good Are Large Language Models at Out-of-Distribution Detection? ArXiv:2308.10261 [cs].
  23. BRIO: Bringing Order to Abstractive Summarization. ArXiv:2203.16804 [cs].
  24. Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models. ArXiv:2308.07847 [cs].
  25. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. ArXiv:2301.13688 [cs].
  26. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
  27. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23(2):107–124. Number: 2.
  28. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv:1802.03426 [cs, stat].
  29. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
  30. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
  31. Learning to Generalize for Cross-domain QA. ArXiv:2305.08208 [cs].
  32. Training language models to follow instructions with human feedback. ArXiv:2203.02155 [cs].
  33. Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.
  34. Controllable Natural Language Generation with Contrastive Prefixes. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2912–2924, Dublin, Ireland. Association for Computational Linguistics.
  35. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  36. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
  37. Multitask Prompted Training Enables Zero-Shot Task Generalization. ArXiv:2110.08207 [cs].
  38. Proximal Policy Optimization Algorithms. ArXiv:1707.06347 [cs].
  39. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  40. Evaluating the Zero-shot Robustness of Instruction-tuned Language Models. ArXiv:2306.11270 [cs].
  41. Stanford Alpaca: An Instruction-following LLaMA Model. Original-date: 2023-03-10T23:33:09Z.
  42. LLaMA: Open and Efficient Foundation Language Models. ArXiv:2302.13971 [cs].
  43. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  44. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models.
  45. On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective.
  46. How far can camels go? exploring the state of instruction tuning on open resources. In Advances in Neural Information Processing Systems.
  47. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. ArXiv:2204.07705 [cs].
  48. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7:625–641. Place: Cambridge, MA Publisher: MIT Press.
  49. Finetuned Language Models Are Zero-Shot Learners. ArXiv:2109.01652 [cs].
  50. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  51. Contrastive Training for Improved Out-of-Distribution Detection. ArXiv:2007.05566 [cs, stat].
  52. ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization. ArXiv:2201.06910 [cs].
  53. Instruction Tuning for Large Language Models: A Survey. ArXiv:2308.10792 [cs].
  54. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  55. PAWS: Paraphrase Adversaries from Word Scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.
  56. PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. ArXiv:2306.04528 [cs].
Authors (8)
  1. Fei Wang (573 papers)
  2. James Y. Huang (11 papers)
  3. Wenxuan Zhou (61 papers)
  4. Fan Yin (34 papers)
  5. Aram Galstyan (142 papers)
  6. Wenpeng Yin (69 papers)
  7. Muhao Chen (159 papers)
  8. Tianyi Lorena Yan (3 papers)
Citations (1)