Lifelong Language Knowledge Distillation
The paper, "Lifelong Language Knowledge Distillation", by Chuang, Su, and Chen, presents a novel approach aimed at addressing the persistent issue of catastrophic forgetting in lifelong language learning (LLL). The proposed methodology, Lifelong Language Knowledge Distillation (L2KD), effectively leverages knowledge distillation to enhance performance retention across sequential task learning without necessitating increased model complexity or computational costs associated with multi-task learning strategies.
Key Contributions
The primary innovation is to train a teacher model on each new task and distill its knowledge into the lifelong learning model. This differs from traditional approaches that mitigate catastrophic forgetting by jointly training all tasks or by maintaining a memory buffer for replay. The teacher is trained only on the new task and discarded once distillation is complete, keeping the memory footprint constant, which suits real-world NLP systems that must be updated as language usage evolves.
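The sketch below illustrates this train-teacher, distill, discard cycle under simplifying assumptions: both the teacher and the lifelong (student) model are decoder-style language models sharing a vocabulary, the word-level distillation loss is a temperature-scaled KL divergence, and LAMOL-style pseudo-sample replay for earlier tasks is omitted. The names `word_kd_loss`, `learn_new_task`, and `make_model` are illustrative placeholders, not the authors' released code.

```python
# Minimal L2KD-style sketch (assumed PyTorch decoder LMs returning [B, T, V] logits).
# Illustrates the train-teacher -> distill -> discard cycle only; the paper's
# LAMOL-style pseudo-replay and language-modeling losses are omitted for brevity.
import torch
import torch.nn.functional as F


def word_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Word-level KD: match the student's per-token distribution to the
    teacher's temperature-softened distribution via KL divergence."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t


def learn_new_task(lll_model, make_model, task_loader, epochs=1, lr=1e-4):
    """When a new task arrives: (1) train a fresh teacher on that task only,
    (2) distill the teacher into the lifelong model, (3) discard the teacher."""
    teacher = make_model()  # fresh single-task teacher
    opt_t = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):  # step 1: ordinary supervised training of the teacher
        for inputs, targets in task_loader:
            loss = F.cross_entropy(teacher(inputs).transpose(1, 2), targets)
            opt_t.zero_grad()
            loss.backward()
            opt_t.step()

    teacher.eval()
    opt_s = torch.optim.Adam(lll_model.parameters(), lr=lr)
    for _ in range(epochs):  # step 2: distill soft targets into the lifelong model
        for inputs, _ in task_loader:
            with torch.no_grad():
                teacher_logits = teacher(inputs)
            loss = word_kd_loss(lll_model(inputs), teacher_logits)
            opt_s.zero_grad()
            loss.backward()
            opt_s.step()

    del teacher  # step 3: the teacher is not kept, so memory use stays constant
    return lll_model
```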
Experiments demonstrate that L2KD consistently outperforms both straightforward fine-tuning and the state-of-the-art LAMOL framework on task sequences that mix sequence generation and text classification. Notably, L2KD considerably narrows the performance gap with multi-task models, positioning it as a viable alternative to multi-task training.
Results and Findings
The experiments cover diverse datasets spanning sequence generation tasks from different domains as well as text classification benchmarks. Key findings include:
- Word-KD and Seq-KD both prove effective for tasks with relatively simple, structured outputs, such as MultiWOZ and WikiSQL, where the teacher's soft targets help the LLL model adapt to the new task distribution; Seq-KD in particular excels on noisy datasets such as CNN/DailyMail, where training on the teacher's decoded outputs significantly simplifies the summarization targets (see the Seq-KD sketch after this list).
- Order robustness: L2KD exhibits lower standard deviations across task-order permutations than LAMOL, suggesting greater robustness to the order in which tasks are presented, an important consideration for practical LLL deployments.
- Text classification: Although L2KD primarily targets sequence generation, it also shows substantial promise on classification tasks, approaching multi-task learning performance while remaining efficient.
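To make the Word-KD/Seq-KD distinction above concrete, the sketch below shows how Seq-KD builds its training signal: the teacher first decodes a complete output sequence, and the student is then trained on that sequence with ordinary cross-entropy, in contrast to the per-token soft targets of Word-KD. The greedy decoder, function names, and tensor shapes are assumptions made for illustration, not the paper's exact decoding setup.

```python
# Assumed setup: teacher and student are decoder LMs returning [B, T, V] logits.
# In practice only the generated answer tokens (not the prompt) would be scored;
# that masking is omitted here to keep the sketch short.
import torch
import torch.nn.functional as F


@torch.no_grad()
def greedy_decode(teacher, prompt_ids, max_new_tokens=64, eos_id=None):
    """The teacher autoregressively produces a hard target sequence (Seq-KD)."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        next_id = teacher(ids)[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if eos_id is not None and bool((next_id == eos_id).all()):
            break
    return ids


def seq_kd_loss(student, teacher, prompt_ids):
    """Seq-KD: cross-entropy of the student on the teacher's decoded output,
    which simplifies noisy targets such as CNN/DailyMail summaries."""
    pseudo_targets = greedy_decode(teacher, prompt_ids)
    logits = student(pseudo_targets[:, :-1])  # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        pseudo_targets[:, 1:].reshape(-1),
    )
```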
Implications and Future Directions
L2KD reframes lifelong learning by training a disposable teacher for each new task and distilling its knowledge into a single lifelong model, mitigating forgetting without enlarging the architecture. While demonstrated primarily in NLP, the idea shows promise for broader applications, particularly in settings with tight resource constraints or frequent update requirements.
The paper hints at future directions such as hybrid approaches that combine memory replay with distillation, potentially yielding further robustness in lifelong learning. Extending distillation-based techniques to reinforcement learning tasks is another intriguing avenue, given the success seen with sequence modeling.
In conclusion, the paper effectively bridges the gap between lifelong and multi-task learning, showcasing a scalable model capable not only of handling continuous language evolution but also of adapting across a variety of complex tasks. Such advances underscore the potential for AI systems to operate resiliently in dynamic, real-world environments.