Overview of "Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners"
The paper presents a novel approach, DifferentiAble pRompT (DART), which improves the few-shot learning ability of pre-trained language models (PLMs) by optimizing both the prompt templates and the label tokens in a differentiable manner. The method is designed to sidestep the limitations that very large models such as GPT-3 face in few-shot learning, namely their scale, the difficulty of deploying them, and their reliance on manually engineered prompts.
Contributions and Methodology
The key contributions of this research are twofold: the introduction of a method for differentiable prompt optimization and the demonstration of its effectiveness across a broad range of NLP tasks.
- Differentiable Template and Label Optimization: The approach reformulates NLP tasks as cloze-style language modeling problems and optimizes the prompt in a continuous space rather than searching over discrete tokens. Concretely, unused tokens in the PLM's vocabulary are repurposed as template and label tokens, and their embeddings are trained through gradient descent (see the first sketch after this list). This reformulation allows for more expressive templates and an optimized set of label token embeddings, substantially reducing the amount of labeled data needed and making the method applicable to varied domains.
- Fluency Constraint Objective: An auxiliary fluency constraint is introduced to preserve the language-modeling ability of the PLM, enforcing association among the prompt embeddings so that the template remains contextually coherent (see the second sketch after this list). The constraint adds no extra model parameters, preserving the parameter efficiency of DART.
- Evaluation Across Multiple NLP Tasks: The authors conduct extensive experiments on 15 NLP tasks, including sentiment analysis, natural language inference, and other classification tasks, as well as more domain-specific tasks such as relation and event extraction. Notably, DART improves over conventional fine-tuning and achieves results comparable to approaches such as LM-BFF that rely on an additional generation model (T5) to produce templates.
- Robust Application Beyond Specific Models: DART can be plugged into both BERT-style and GPT-style architectures, showing its versatility and its potential to turn smaller language models into competent few-shot learners.
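
To make the first contribution concrete, the following is a minimal PyTorch sketch of DART-style differentiable template and label optimization. It is not the authors' implementation: the backbone (`bert-base-uncased`), the choice of `[unused*]` tokens, the template layout, and the toy training loop are illustrative assumptions. The key point is that the template and label tokens live in the input embedding matrix, so ordinary backpropagation through the masked-language-modeling head optimizes them.

```python
# Minimal sketch of DART-style differentiable prompts (not the authors' code).
# Backbone, token choices, template layout, and training loop are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Repurpose unused vocabulary tokens as trainable template and label tokens.
prompt_tokens = [f"[unused{i}]" for i in range(1, 4)]      # template slots
label_tokens = {0: "[unused10]", 1: "[unused11]"}          # one token per class
prompt_ids = tokenizer.convert_tokens_to_ids(prompt_tokens)
label_ids = torch.tensor(
    [tokenizer.convert_tokens_to_ids(t) for t in label_tokens.values()])

def encode(sentence: str):
    """Build: [CLS] sentence [T1][T2][T3] [MASK] [SEP] as tensors."""
    sent_ids = tokenizer(sentence, add_special_tokens=False)["input_ids"]
    ids = ([tokenizer.cls_token_id] + sent_ids + prompt_ids
           + [tokenizer.mask_token_id, tokenizer.sep_token_id])
    input_ids = torch.tensor([ids])
    return {"input_ids": input_ids,
            "attention_mask": torch.ones_like(input_ids)}

def class_logits(batch):
    """Score only the trainable label tokens at the [MASK] position."""
    logits = model(**batch).logits                         # (1, seq_len, |V|)
    mask_pos = (batch["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return logits[0, mask_pos, label_ids].unsqueeze(0)     # (1, num_classes)

# Template and label embeddings are rows of the input embedding matrix, so a
# standard optimizer updates them (here jointly with the rest of the PLM).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
few_shot = [("a gripping, beautifully shot film", 1),
            ("tedious and overlong", 0)]
for sentence, y in few_shot:
    loss = F.cross_entropy(class_logits(encode(sentence)), torch.tensor([y]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because classification reduces to scoring label tokens at the [MASK] position, no task-specific head is introduced, which is what keeps the approach parameter-efficient.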
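The fluency constraint can be read as an auxiliary masked-token-recovery objective on the original input, added to the classification loss so the PLM retains its language-modeling ability. The sketch below follows that reading and reuses `tokenizer`, `model`, `prompt_ids`, `encode`, and `class_logits` from above; the masking rate and the weight `lam` are illustrative assumptions, not values from the paper.

```python
# Sketch of an auxiliary fluency (masked-token recovery) loss, assuming the
# constraint is realized as an MLM objective on the original input tokens.
def fluency_loss(batch, mask_prob: float = 0.15):
    input_ids = batch["input_ids"].clone()
    labels = input_ids.clone()
    # Never corrupt special tokens ([CLS], [SEP], the classification [MASK])
    # or the trainable template slots.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True),
        dtype=torch.bool)
    maskable = ~special & ~torch.isin(input_ids[0], torch.tensor(prompt_ids))
    corrupt = maskable & (torch.rand(input_ids.shape[1]) < mask_prob)
    input_ids[0, corrupt] = tokenizer.mask_token_id
    labels[0, ~corrupt] = -100          # ignore positions that were not masked
    out = model(input_ids=input_ids,
                attention_mask=batch["attention_mask"], labels=labels)
    # For brevity, the edge case where no token gets masked is not handled.
    return out.loss

# Joint objective: classification loss plus the weighted fluency term.
lam = 1.0                               # illustrative weight, not from the paper
batch = encode("a gripping, beautifully shot film")
loss = F.cross_entropy(class_logits(batch), torch.tensor([1])) \
       + lam * fluency_loss(batch)
```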
Strong Results and Future Directions
A highlight of the findings is the large performance gain on tasks with complex label spaces, such as relation extraction, with reported improvements of up to 23.28% over conventional techniques in few-shot settings. Moreover, in fully supervised scenarios, DART still yields modest gains, supporting its use even in data-abundant settings.
The paper suggests several avenues for future work, including intermediate training to address domain shifts in the corpus distribution and extending differentiable prompting to other language tasks such as dialogue systems and cross-lingual applications. It also invites exploration of adaptive, continuously optimized prompt representations that align more closely with human intuition in real-world applications.
In conclusion, DART presents a methodological advance in prompt engineering, offering an efficient, model-agnostic strategy for improving the adaptability of pre-trained language models. The work has practical implications for deploying smaller, more economical models across diverse applications without sacrificing performance, making AI easier to adopt in resource-constrained environments.