
Chain-of-Thought Tuning: Masked Language Models can also Think Step By Step in Natural Language Understanding (2310.11721v1)

Published 18 Oct 2023 in cs.CL and cs.LG

Abstract: Chain-of-Thought (CoT) is a technique that guides LLMs to decompose complex tasks into multi-step reasoning through intermediate steps in natural language form. Briefly, CoT enables LLMs to think step by step. However, although many Natural Language Understanding (NLU) tasks also require thinking step by step, LLMs perform less well on them than small-scale Masked Language Models (MLMs). To migrate CoT from LLMs to MLMs, we propose Chain-of-Thought Tuning (CoTT), a two-step reasoning framework based on prompt tuning, to implement step-by-step thinking for MLMs on NLU tasks. From the perspective of CoT, CoTT's two-step framework enables MLMs to implement task decomposition, and CoTT's prompt tuning allows intermediate steps to be used in natural language form. Thereby, the success of CoT can be extended to NLU tasks through MLMs. To verify the effectiveness of CoTT, we conduct experiments on two NLU tasks: hierarchical classification and relation extraction. The results show that CoTT outperforms baselines and achieves state-of-the-art performance.
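To make the two-step idea concrete, below is a minimal, hedged sketch of CoTT-style cloze prompting with an MLM: the model first fills a mask with an intermediate step in natural language, which is then inserted into a second prompt that produces the final prediction. The model name (`roberta-base`), templates, and label words here are illustrative assumptions, and the prompt-tuning (training) stage described in the paper is omitted; this is not the authors' exact code.

```python
# Sketch of two-step cloze prompting with a masked language model (assumed setup).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def fill_mask(prompt, candidates):
    """Score candidate label words at the mask position and return the best one."""
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # shape: (1, vocab_size)
    # Map each candidate label word to its first sub-token id (illustrative verbalizer).
    cand_ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + c)[0])
                for c in candidates]
    best = torch.stack([logits[0, i] for i in cand_ids]).argmax().item()
    return candidates[best]

sentence = "Steve Jobs co-founded Apple in 1976."
# Step 1: predict an intermediate step (here, a coarse relation type) in natural language.
step1 = fill_mask(f"{sentence} The relation type is {tokenizer.mask_token}.",
                  ["membership", "location", "family"])
# Step 2: condition the final prediction on the intermediate step from step 1.
step2 = fill_mask(f"{sentence} Given that the relation type is {step1}, "
                  f"the relation is {tokenizer.mask_token}.",
                  ["founder", "employee", "member"])
print(step1, step2)
```

In CoTT the templates are learned via prompt tuning rather than fixed by hand, but the sketch shows the key mechanism: the intermediate step is expressed as natural-language text and fed back into the second cloze prompt.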

Authors (6)
  1. Caoyun Fan (8 papers)
  2. Jidong Tian (13 papers)
  3. Yitian Li (9 papers)
  4. Wenqing Chen (16 papers)
  5. Hao He (99 papers)
  6. Yaohui Jin (40 papers)
Citations (2)
