Large Language Model Instruction Following: A Survey of Progresses and Challenges (2303.10475v8)

Published 18 Mar 2023 in cs.CL

Abstract: Task semantics can be expressed by a set of input-output examples or a piece of textual instruction. Conventional machine learning approaches for NLP mainly rely on the availability of large-scale sets of task-specific examples. Two issues arise: first, collecting task-specific labeled examples does not apply to scenarios where tasks may be too complicated or costly to annotate, or the system is required to handle a new task immediately; second, this is not user-friendly since end-users are probably more willing to provide task description rather than a set of examples before using the system. Therefore, the community is paying increasing interest in a new supervision-seeking paradigm for NLP: learning to follow task instructions, i.e., instruction following. Despite its impressive progress, there are some common issues that the community struggles with. This survey paper tries to summarize and provide insights to the current research on instruction following, particularly, by answering the following questions: (i) What is task instruction, and what instruction types exist? (ii) How to model instructions? (iii) What are popular instruction following datasets and evaluation metrics? (iv) What factors influence and explain the instructions' performance? (v) What challenges remain in instruction following? To our knowledge, this is the first comprehensive survey about instruction following.

Authors (3)
  1. Renze Lou
  2. Kai Zhang
  3. Wenpeng Yin
Citations (17)

Summary

A Comprehensive Survey on Instruction Following

The paper "Large Language Model Instruction Following: A Survey of Progresses and Challenges" presents an extensive exploration of instruction-following paradigms in NLP. The survey focuses on the shift from traditional example-based supervision toward learning from task instructions, and examines the main challenges this emerging paradigm raises.

Overview of Instruction Following

The authors categorize task instructions into three primary types: NLI-oriented, LLM-oriented, and human-oriented instructions. Each type supplies indirect supervision in a distinct way:

  • NLI-oriented instructions recast target NLP problems as natural language inference tasks, so that existing NLI datasets provide indirect supervision.
  • LLM-oriented instructions, i.e., prompts, reformat inputs to match the pretraining objectives of LLMs, optimizing zero-shot and few-shot performance.
  • Human-oriented instructions are the verbose, human-readable task descriptions originally written for crowdsourcing annotators; their length and complexity make them harder for models to encode and follow.
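As a concrete illustration (not drawn from the paper), the same sentiment-analysis input can be rendered in each of the three instruction styles; the templates below are hypothetical examples of each style, not datasets cited by the survey:

```python
# Hypothetical templates for the three instruction styles described above.

def nli_oriented(text: str, label: str) -> tuple[str, str]:
    """Recast classification as NLI: the input becomes the premise and
    each candidate label becomes a hypothesis to verify."""
    premise = text
    hypothesis = f"This review expresses a {label} sentiment."
    return premise, hypothesis

def llm_oriented(text: str) -> str:
    """A short prompt shaped to match an LLM's pretraining objective
    (completion-style continuation)."""
    return f"Review: {text}\nSentiment (positive or negative):"

def human_oriented(text: str) -> str:
    """A verbose, crowdsourcing-style task description of the kind
    originally written for human annotators."""
    return (
        "Definition: In this task, you are given a product review. "
        "Decide whether the overall sentiment of the review is positive "
        "or negative, and answer with one word.\n"
        f"Input: {text}\nOutput:"
    )

review = "The battery dies within an hour."
print(nli_oriented(review, "negative")[1])
print(llm_oriented(review))
```

The point of the contrast: the underlying task is identical, but each framing targets a different source of supervision (NLI data, pretraining objectives, or human annotation guidelines).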

Key Modeling Strategies

For modeling these instructions, the paper outlines several strategies. Semantic parsing converts instructions into executable logical forms but is limited to narrow domains. The flatten-and-concatenate approach, which simply prepends the instruction to each input, is straightforward but inefficient and heavily dependent on the scale of training data. HyperNetwork-based methods offer a more structured alternative by mapping instructions directly into task-specific model parameters. Finally, reinforcement learning from human feedback (RLHF) is recognized for aligning LLM outputs with human preferences, albeit at a significant cost in human labor.
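A minimal sketch of the hypernetwork idea, assuming a toy linear setup (the embedding sizes, the single linear hypernetwork, and the classifier head are all illustrative, not the architecture of any system cited in the survey):

```python
import numpy as np

# Toy hypernetwork: a linear map turns an instruction embedding into the
# weight matrix of a task-specific classifier, so one shared model can
# generate per-task parameters directly from the instruction.

rng = np.random.default_rng(0)
d_instr, d_feat, n_classes = 16, 8, 2

# Hypernetwork parameters: instruction embedding -> flattened classifier weights.
H = rng.normal(0, 0.1, size=(d_instr, d_feat * n_classes))

def generate_classifier(instr_emb: np.ndarray) -> np.ndarray:
    """Map an instruction embedding to a (d_feat, n_classes) weight matrix."""
    return (instr_emb @ H).reshape(d_feat, n_classes)

def predict(instr_emb: np.ndarray, x: np.ndarray) -> int:
    """Classify input x with the weights generated for this instruction."""
    W = generate_classifier(instr_emb)
    logits = x @ W
    return int(np.argmax(logits))

instr = rng.normal(size=d_instr)  # stand-in for an encoded instruction
x = rng.normal(size=d_feat)       # stand-in for an encoded input
print(predict(instr, x))          # 0 or 1
```

In practice the hypernetwork is itself a trained neural network and the generated parameters are adapters or full layers, but the data flow is the same: instruction in, task parameters out.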

Datasets and Evaluation

The paper distinguishes between human-annotated and LLM-synthesized datasets for instruction tuning. Human-annotated datasets are high-quality but limited in diversity, while LLM-synthesized datasets offer diversity at a potential cost in accuracy. Evaluation techniques divide into task-centric approaches, which score outputs against gold references with automatic metrics, and human-centric approaches, which judge how well outputs satisfy user intent. Both face trade-offs stemming from the subjective nature of instruction effectiveness and alignment with human expectations.
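Task-centric evaluation typically reduces to automatic, reference-based metrics. Below is a sketch of two common ones, exact match and token-level F1, in their generic formulations (these are standard definitions, not any particular benchmark's official scorer):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff prediction and reference match after light normalization."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-overlap precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Positive", "positive"))                 # 1.0
print(round(token_f1("a negative review", "negative"), 2)) # 0.5
```

Human-centric evaluation has no such closed-form scorer, which is exactly why the survey flags it as the harder, more subjective side of the trade-off.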

Influencing Factors

Performance of instruction following is influenced by several dimensions, including model scale, instruction diversity, and task scale. Notably, scaling model size and scaling task diversity reinforce each other, a synergy the survey refers to as dual-track scaling, which underpins the strong few- and zero-shot generalization of instruction-tuned models.

Challenges and Future Directions

Critical challenges remain in negation handling, vulnerability to adversarial instruction attacks, and the interpretability-versus-performance trade-off in instruction engineering. Future research directions include enhancing LLMs' robustness to negated information, defending against adversarial attacks, and balancing human and model alignment in instruction design.
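The negation challenge can be probed with paired instructions that differ only in a negation. The probe below is a hypothetical sketch (the templates and the `answer_fn` interface are illustrative assumptions, not an evaluation protocol from the survey):

```python
# Hypothetical minimal-pair probe: if a model returns the same answer for
# an instruction and its negated counterpart, it likely ignored the negation.

TEMPLATES = [
    ("List three animals that can fly.",
     "List three animals that cannot fly."),
    ("Write a sentence that contains the word 'ocean'.",
     "Write a sentence that does not contain the word 'ocean'."),
]

def negation_probe(answer_fn):
    """Run a model callable on each pair; collect pairs whose answers
    are identical, a symptom of negation being ignored."""
    failures = []
    for pos, neg in TEMPLATES:
        if answer_fn(pos).strip() == answer_fn(neg).strip():
            failures.append((pos, neg))
    return failures

# A degenerate "model" that ignores its instruction fails every pair:
print(len(negation_probe(lambda prompt: "sparrow, eagle, bat")))  # 2
```

Real studies of negated prompts follow the same minimal-pair logic, just with larger template sets and model-specific answer scoring.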

Conclusion

By tracing the field's development from early machine learning approaches to modern LLMs, this survey provides a well-rounded perspective on instruction following. As the first comprehensive survey of its kind, it offers valuable guidance for researchers aiming to push the boundaries of cross-task generalization with instruction-tuned systems, and its outlined challenges and research directions chart a clear pathway for future work.