Fine-tuning Large Language Models with Sequential Instructions (2403.07794v3)

Published 12 Mar 2024 in cs.CL

Abstract: Despite the success of existing instruction-tuned models, we find that they usually struggle to respond to queries with multiple instructions. This impairs their performance in complex problems whose solution consists of multiple intermediate tasks. Thus, we contend that part of the fine-tuning data mixture should be sequential--containing a chain of interrelated tasks. We first approach sequential instruction tuning from a task-driven perspective, manually creating interpretable intermediate tasks for multilingual and visual question answering: namely "translate then predict" and "caption then answer". Next, we automate this process by turning instructions in existing datasets (e.g., Alpaca and FlanCoT) into diverse and complex sequential instructions, making our method general-purpose. Models that underwent our sequential instruction tuning show improved results in coding, maths, and open-ended generation. Moreover, we put forward a new benchmark named SeqEval to evaluate a model's ability to follow all the instructions in a sequence, which further corroborates the benefits of our fine-tuning method. We hope that our endeavours will open new research avenues on instruction tuning for complex tasks.

Fine-Tuning LLMs with Sequential Instructions: A Comprehensive Overview

The paper "Fine-Tuning LLMs with Sequential Instructions" addresses a significant challenge in the capabilities of LLMs—the ability to follow and process a sequence of instructions in a single query. Traditional instruction datasets often contain straightforward, singular tasks, which limit models from navigating multi-step interactions effectively. The authors introduce a novel methodology termed Sequential Instruction Tuning (SIT), designed to enhance the models' competence in executing multiple tasks sequentially, a critical need for complex downstream tasks involving reasoning, multilingual, and multimodal scenarios.

Sequential Instruction Tuning and Its Implications

The central contribution of this research is the SIT paradigm, which broadens the scope of instruction tuning to cover the execution of sequential sub-tasks. The method augments existing instruction datasets by inserting intermediate steps into each instruction without requiring additional human annotation, a significant advantage for scaling up data creation. For instance, intermediate tasks such as translation or image captioning give the model an explicit step-by-step scaffold, improving LLM performance on cross-lingual and cross-modal tasks.
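
As a rough illustration of this augmentation, the sketch below turns an Alpaca-style record into a sequential-instruction example by prepending an intermediate task and concatenating its output with the original answer. The helper, field names, and templates are hypothetical; in the paper's automated setting the intermediate outputs would be generated rather than written by hand.

```python
# Hypothetical sketch of sequential augmentation for an Alpaca-style record.
def to_sequential(record: dict, intermediate_task: str, intermediate_output: str) -> dict:
    """Compose a sequential-instruction training example.

    record: dict with "instruction", "input", "output" (Alpaca-style).
    intermediate_task: the step to perform first, e.g. a translation request.
    intermediate_output: the expected result of that step (assumed to be
        produced automatically, e.g. by an existing model).
    """
    instruction = (
        f"{intermediate_task} "
        f"Then, {record['instruction'].rstrip('.').lower()}."
    )
    # The training target chains the intermediate output with the original answer.
    output = f"{intermediate_output}\n{record['output']}"
    return {"instruction": instruction, "input": record["input"], "output": output}


example = {
    "instruction": "Answer the question.",
    "input": "Quelle est la capitale de la France ?",
    "output": "Paris.",
}
print(to_sequential(
    example,
    "First, translate the input into English.",
    "What is the capital of France?",
))
```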

The paper details numerical results demonstrating SIT's superior performance over conventional instruction tuning. Noteworthy improvements are observed across various benchmarks: a +6% improvement on CommonsenseQA, a +17% boost for the XQuAD multilingual task, and a +2.1% enhancement in visual question answering tasks such as VQA and GQA. These outcomes underscore SIT's efficacy in enhancing both the instruction-following capabilities of LLMs and their downstream task performance, further confirmed by the paper's qualitative analyses.

Methodological Extensions and Evaluation

The SIT approach is experimentally validated using prominent LLMs, including LLaMA-2 70B and Mixtral-8×7B, fine-tuned on diversified datasets containing both genuine and synthetic intermediate tasks. The authors extend existing datasets (e.g., Alpaca) by concatenating additional tasks to each instruction and amending the targets with the corresponding intermediate outputs. This procedure supports a broader array of tasks, such as reasoning and cross-lingual processing, even for task types unseen during training.
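
A minimal fine-tuning sketch on such augmented records is shown below. It assumes the Hugging Face transformers and datasets libraries and a generic causal LM checkpoint; the checkpoint name, data format, and hyperparameters are placeholders and do not reproduce the paper's actual training recipe.

```python
# Minimal supervised fine-tuning sketch on sequential-instruction records.
# Checkpoint, data, and hyperparameters are placeholders, not the paper's setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

records = [
    {
        "prompt": ("First, translate the input into English. Then, answer the question.\n"
                   "Input: Quelle est la capitale de la France ?\nResponse: "),
        "target": "What is the capital of France?\nParis.",
    },
]

model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    # Train on prompt + target; the paper may instead mask prompt tokens from the loss.
    text = example["prompt"] + example["target"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = Dataset.from_list(records).map(tokenize, remove_columns=["prompt", "target"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sit-sft", per_device_train_batch_size=1,
                           num_train_epochs=3, learning_rate=2e-5),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```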

As part of their comprehensive evaluation, the authors demonstrate the robustness of SIT models when prompted with unseen templates and varying input lengths. The SIT models maintain high sequential task accuracy even when intermediate task steps are varied or when additional tasks are introduced during testing. These adaptability features indicate that SIT models generalize beyond their trained settings, confirming their utility in real-world applications requiring flexible and complex task executions.
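
In that spirit, a deliberately simplified check for sequential instruction following is sketched below; it only verifies that each expected step's output appears in order, and does not reproduce the paper's actual SeqEval protocol.

```python
# Simplified, hypothetical sequence-following check (not the paper's SeqEval metric).
def follows_sequence(model_output: str, expected_steps: list[str]) -> bool:
    """Return True if every expected step's output occurs in order in the response."""
    cursor = 0
    for step in expected_steps:
        idx = model_output.find(step, cursor)
        if idx == -1:
            return False
        cursor = idx + len(step)
    return True


# Example: a "translate then predict" response should show the translation first,
# then the final answer.
response = "Translation: What is the capital of France?\nAnswer: Paris"
print(follows_sequence(response, ["What is the capital of France?", "Paris"]))  # True
```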

Theoretical and Practical Implications

Theoretically, the introduction of SIT extends the conceptual framework and interpretation of instruction tuning by highlighting the importance of task order and intermediate processing steps in multi-step reasoning. It suggests a new dimension in LLM training that involves task chaining, which could be pivotal for future explorations into cognitive aspects of LLM behavior.

Practically, SIT enhances the applicability of LLMs in settings that demand complex, sequential decision-making, such as virtual assistants, autonomous systems, and multilingual conversational agents. The improved instruction-handling capability can reduce the need for human intervention, enabling more autonomous execution of multi-step requests, an ability fostered by the structured, curriculum-like nature of the training data.

Future Directions

Given its promising results, future work could explore further diversification of intermediate tasks beyond those demonstrated in the paper, such as more intricate dummy tasks or context-specific intermediate sub-tasks. Additionally, integrating SIT with multilingual and multimodal datasets could provide transformative insights into scalable LLM applications.

In summary, this research advances the LLM field by proposing a targeted mechanism for sequential instruction execution, heralding a step forward in model efficiency, generalization, and applicability in increasingly complex computational tasks.

References (45)
  1. Unifying cross-lingual transfer across scenarios of resource scarcity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  2. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
  3. Revisiting machine translation for cross-lingual classification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  4. Language models are few-shot learners. Advances in neural information processing systems, 2020.
  5. Monolingual or multilingual instruction tuning: Which makes a better Alpaca. arXiv preprint arXiv:2309.08958, 2023.
  6. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. Online blog, 2023.
  7. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
  8. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM. Online blog, 2023.
  9. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  10. Do multilingual language models think better in English? arXiv preprint arXiv:2308.01223, 2023.
  11. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2023.
  12. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Conference on Computer Vision and Pattern Recognition, 2017.
  13. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  14. Not all languages are created equal in LLMs: Improving multilingual capability by cross-lingual-thought prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
  15. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019.
  16. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  17. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022.
  18. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
  19. LAVIS: A one-stop library for language-vision intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023a.
  20. Bactrian-X: A multilingual replicable instruction-following model with low-rank adaptation. arXiv preprint arXiv:2305.15011, 2023b.
  21. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
  22. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
  23. The Flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, 2023.
  24. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022.
  25. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023.
  26. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
  27. Modular deep learning. Transactions on Machine Learning Research, 2023.
  28. Modelling latent translations for cross-lingual transfer. arXiv preprint arXiv:2107.11353, 2021.
  29. Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing. Computational Linguistics, 2019.
  30. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  31. Language models are unsupervised multitask learners. Online blog, 2019.
  32. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.
  33. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
  34. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023.
  35. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
  36. Stanford Alpaca: An instruction-following LLaMA model. Github repository, 2023.
  37. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  38. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  39. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023.
  40. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a.
  41. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022b.
  42. AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022.
  43. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.
  44. PLUG: Leveraging pivot language in cross-lingual instruction tuning. arXiv preprint arXiv:2311.08711, 2023.
  45. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023.
Authors (4)
  1. Hanxu Hu (9 papers)
  2. Pinzhen Chen (27 papers)
  3. Edoardo M. Ponti (24 papers)
  4. Simon Yu (14 papers)
Citations (10)