
Train No Evil: Selective Masking for Task-Guided Pre-Training (2004.09733v2)

Published 21 Apr 2020 in cs.CL

Abstract: Recently, pre-trained language models mostly follow the pre-train-then-fine-tune paradigm and have achieved great performance on various downstream tasks. However, since the pre-training stage is typically task-agnostic and the fine-tuning stage usually suffers from insufficient supervised data, the models cannot always well capture the domain-specific and task-specific patterns. In this paper, we propose a three-stage framework by adding a task-guided pre-training stage with selective masking between general pre-training and fine-tuning. In this stage, the model is trained by masked language modeling on in-domain unsupervised data to learn domain-specific patterns and we propose a novel selective masking strategy to learn task-specific patterns. Specifically, we design a method to measure the importance of each token in sequences and selectively mask the important tokens. Experimental results on two sentiment analysis tasks show that our method can achieve comparable or even better performance with less than 50% of computation cost, which indicates our method is both effective and efficient. The source code of this paper can be obtained from https://github.com/thunlp/SelectiveMasking.

Authors (5)
  1. Yuxian Gu (21 papers)
  2. Zhengyan Zhang (46 papers)
  3. Xiaozhi Wang (51 papers)
  4. Zhiyuan Liu (433 papers)
  5. Maosong Sun (337 papers)
Citations (54)

Summary

Overview of "Train No Evil: Selective Masking for Task-Guided Pre-Training"

The paper "Train No Evil: Selective Masking for Task-Guided Pre-Training" introduces a novel three-stage framework for improving the efficiency and performance of pre-trained LLMs (PLMs) on downstream tasks. The authors highlight the limitations of the conventional pre-train-then-fine-tune paradigm, noting its task-agnostic nature during pre-training and the challenges posed by insufficient supervised data during fine-tuning. The proposed framework introduces a task-guided pre-training stage, incorporating a selective masking strategy to enhance the model's capability in capturing domain-specific and task-specific patterns on in-domain unsupervised data.

Methodological Advancements

The paper's central innovation is the selective masking strategy deployed during task-guided pre-training. Each token in a sequence is scored by its contribution to the downstream task, and the tokens deemed most important are selectively masked. The hypothesis is that this targeted approach lets the model learn task-specific patterns more effectively than the random masking typically used in masked language modeling (MLM).
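A minimal sketch of this idea is shown below, assuming token importance is approximated by how much the downstream classifier's confidence drops when a token is hidden. The names selective_mask, score_fn, MASK_TOKEN, and mask_ratio are illustrative and not part of the authors' released code; the paper's actual scoring procedure differs in its details.

```python
from typing import Callable, List

MASK_TOKEN = "[MASK]"

def selective_mask(tokens: List[str],
                   score_fn: Callable[[List[str]], float],
                   mask_ratio: float = 0.15) -> List[str]:
    """Mask the tokens whose removal most reduces the task classifier's confidence."""
    base = score_fn(tokens)  # confidence on the intact sequence
    drops = []
    for i in range(len(tokens)):
        hidden = tokens[:i] + [MASK_TOKEN] + tokens[i + 1:]
        drops.append(base - score_fn(hidden))  # larger drop => more task-relevant token
    k = max(1, int(len(tokens) * mask_ratio))
    top = set(sorted(range(len(tokens)), key=lambda i: drops[i], reverse=True)[:k])
    return [MASK_TOKEN if i in top else tok for i, tok in enumerate(tokens)]
```

Masking the high-importance tokens forces the MLM objective to reconstruct exactly the words that matter for the task, which is the intuition behind task-guided pre-training.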

The methodological process comprises three stages (a schematic sketch follows the list):

  • General Pre-Training (GenePT): Standard pre-training on large, general-domain corpora akin to BERT's methodology.
  • Task-Guided Pre-Training (TaskPT): An intermediary stage using in-domain unsupervised data, focusing on selectively masking tokens based on their importance for the downstream task.
  • Fine-Tuning: Adaptation of the model to specific downstream tasks, following conventional practices.
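The following sketch wires the three stages together, reusing the hypothetical selective_mask function from the earlier snippet. The stage functions (general_pretrain, mlm_train, fine_tune, task_confidence) are placeholders passed in as callables; they are assumptions for illustration, not the repository's API.

```python
from typing import Callable, Iterable, List

def run_three_stage(general_corpus: Iterable[List[str]],
                    in_domain_corpus: Iterable[List[str]],
                    task_dataset,
                    general_pretrain: Callable,
                    mlm_train: Callable,
                    fine_tune: Callable,
                    task_confidence: Callable):
    # Stage 1 (GenePT): standard random-masking MLM pre-training on general-domain text.
    model = general_pretrain(general_corpus)
    # Stage 2 (TaskPT): selectively mask in-domain sequences using a task-derived
    # confidence score, then continue MLM training on the masked data.
    masked = [selective_mask(seq, lambda toks: task_confidence(model, toks))
              for seq in in_domain_corpus]
    model = mlm_train(model, masked)
    # Stage 3: conventional supervised fine-tuning on the downstream task.
    return fine_tune(model, task_dataset)
```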

Experimental Findings

Empirical results from sentiment analysis tasks demonstrate that the proposed framework is both effective and efficient, achieving comparable or superior performance with less than 50% of the computation cost typically required for conventional PLMs. Notably, selective masking consistently outperformed random masking strategies across different task settings, underscoring its effectiveness in capturing task-specific language patterns.

The experiments combine downstream datasets (MR and SemEval14) with in-domain datasets (Yelp and Amazon reviews). Results indicate that the similarity between the in-domain data and the downstream task's domain significantly influences performance, with greater similarity yielding better outcomes.

Practical and Theoretical Implications

The practical implications of this research are significant, especially in scenarios where computational resources are constrained and domain-specific data is abundant. By reducing pre-training costs and improving task adaptation, this method offers a viable path toward more efficient deployment of PLMs in specialized applications.

Theoretically, the findings motivate further exploration of domain adaptation strategies and of token importance metrics that go beyond basic classification confidence. Future research could develop alternative token scoring mechanisms that do not rely solely on downstream task outcomes.

Speculations on Future Developments

This work lays groundwork for several potential advancements in AI and NLP:

  • Refinement of token importance metrics to improve the efficiency of selective masking strategies.
  • Exploration of similar task-guided pre-training approaches in domains where labeled data is scarce.
  • Developments in unsupervised learning techniques to better capture the nuances of task-specific patterns.

Overall, the paper does not merely present a marginal improvement to PLMs but rather offers a systematic approach to bridging the gap between general language pre-training and domain-specific demands. This contribution paves the way for more resource-effective models tailored to particular applications without compromising performance.