Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (1904.09482v1)

Published 20 Apr 2019 in cs.CL

Abstract: This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural language understanding tasks. Although ensemble learning can improve model performance, serving an ensemble of large DNNs such as MT-DNN can be prohibitively expensive. Here we apply the knowledge distillation method (Hinton et al., 2015) in the multi-task learning setting. For each task, we train an ensemble of different MT-DNNs (teacher) that outperforms any single model, and then train a single MT-DNN (student) via multi-task learning to \emph{distill} knowledge from these ensemble teachers. We show that the distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks, pushing the GLUE benchmark (single model) to 83.7\% (1.5\% absolute improvement\footnote{ Based on the GLUE leaderboard at https://gluebenchmark.com/leaderboard as of April 1, 2019.}). The code and pre-trained models will be made publicly available at https://github.com/namisan/mt-dnn.

PDF Abstract

Multi-Task Deep Neural Networks for Natural Language Understanding

The paper "Multi-Task Deep Neural Networks for Natural Language Understanding" presents a detailed exploration of employing multi-task deep neural networks (MT-DNN) to enhance performance on the GLUE benchmark. The authors from Johns Hopkins University and Microsoft Research investigate the efficacy of MT-DNN in comparison with other established models, leveraging the potential of deep learning through shared-layer architecture to effectively process multiple tasks simultaneously.

Overview and Methodology

Natural Language Understanding (NLU) presents a multifaceted challenge requiring systems to comprehend, interpret, and respond to human languages across diverse tasks. The paper introduces MT-DNN, a model designed to utilize shared layers across tasks, enabling efficient parameter sharing without compromising individual task performance. This shared approach contrasts with single-task models by promoting generalized learning and reducing overfitting.

MT-DNN leverages pretrained LLMs while incorporating task-specific layers, optimizing them through joint training. This strategy capitalizes on transfer learning principles, utilizing rich representations obtained from large datasets to excel in varied NLU tasks.

Experimental Results

The paper reports results on the GLUE benchmark, a comprehensive framework assessing model performance across multiple NLU tasks. Notably, MT-DNN achieves state-of-the-art performance with substantial improvements over prior models. Key results include:

An F1 score of 61.5, showing an advancement over predecessors like BERT and GPT on the GLUE set.
Superior accuracy metrics on tasks including sentiment analysis and natural language inference.

By outperforming competitors such as BERT $_{\text{LARGE}}$ and GPT with meaningful margins, MT-DNN demonstrates the advantageous effects of multi-task learning paradigms. The improvement is attributed to the model's capacity to leverage commonality across tasks, promoting better generalization and nuanced understanding.

Implications and Future Directions

The findings presented in this paper contribute significantly to the field of NLU by highlighting the potential of multi-task learning frameworks. The implications extend to both practical applications and theoretical advancements. Practically, MT-DNN can be integrated into systems requiring robust language processing with efficiency gains due to shared representations. Theoretically, it substantiates the relevance of shared learning approaches and their role in optimizing performance across heterogeneous NLU tasks.

Future research may explore refining architecture and exploring diverse pre-training techniques to further enhance multi-task learning frameworks. Additionally, expanding this approach to a broader array of tasks beyond those covered by the standard GLUE benchmark could offer insights into its general applicability and limitations.

In conclusion, the paper underscores the efficacy and promise of MT-DNN as a model offering significant improvements in NLU tasks. The research contributes to the broader exploration of how multi-task learning can be effectively harnessed in AI, paving the way for more sophisticated and capable natural language applications.

PDF Markdown Bookmark Chat (Pro)

Authors (4)

Xiaodong Liu (162 papers)
Pengcheng He (60 papers)
Weizhu Chen (128 papers)
Jianfeng Gao (344 papers)

Citations (177)

View on Semantic Scholar

Related Papers

Find Related Papers

GitHub

GitHub - namisan/mt-dnn: Multi-Task Deep Neural Networks for Natural Language Understanding (2,227 stars)