Latent Multi-task Architecture Learning (1705.08142v3)

Published 23 May 2017 in stat.ML, cs.AI, cs.CL, cs.LG, and cs.NE

Abstract: Multi-task learning (MTL) allows deep neural networks to learn from related tasks by sharing parameters with other networks. In practice, however, MTL involves searching an enormous space of possible parameter sharing architectures to find (a) the layers or subspaces that benefit from sharing, (b) the appropriate amount of sharing, and (c) the appropriate relative weights of the different task losses. Recent work has addressed each of the above problems in isolation. In this work we present an approach that learns a latent multi-task architecture that jointly addresses (a)--(c). We present experiments on synthetic data and data from OntoNotes 5.0, including four different tasks and seven different domains. Our extension consistently outperforms previous approaches to learning latent architectures for multi-task problems and achieves up to 15% average error reductions over common approaches to MTL.

Latent Multi-task Architecture Learning: An Overview

The paper "Latent Multi-task Architecture Learning" by Sebastian Ruder et al. presents a novel approach to multi-task learning (MTL) in deep neural networks that simultaneously addresses three integral components: parameter sharing, optimal amount of sharing, and determining the relative weightings of task losses. Rather than approaching these elements in isolation, the paper integrates them into a latent architecture learning framework, enhancing both computational efficiency and learning outcomes in comparison to traditional methods.

Key Contributions and Methodology

The authors introduce a meta-architecture dubbed the "sluice network," which generalizes multiple existing architectures for both MTL and transfer learning. This design is particularly useful for sequence tagging tasks across varying domains. The sluice network architecture incorporates two core features:

  1. Subspaces: Each layer of the deep recurrent network is partitioned into subspaces, allowing the model to represent both task-specific and shared features. Trainable α parameters govern the linear combinations of these subspaces, so the model can learn where and how much to share across tasks.
  2. Mixture of Experts via Skip Connections: Inspired by residual learning, the architecture adds skip connections controlled by β parameters, which aggregate the outputs of different layers into a mixture model for task-specific predictions (see the sketch after this list).
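
The following NumPy sketch (a minimal illustration, not the authors' implementation; all shapes and names are assumptions) shows the two mechanisms in miniature: a trainable α matrix that linearly recombines the subspaces of all tasks at one layer, and per-task β weights that mix the outputs of several layers through skip connections.

```python
# Minimal NumPy sketch of the two sluice-network mechanisms described above.
# Not the authors' code: shapes, names, and random values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n_tasks, n_subspaces, hidden = 2, 2, 8           # 2 tasks, 2 subspaces each
n_units = n_tasks * n_subspaces                  # 4 subspace outputs per layer

# Outputs of one recurrent layer, split into task subspaces: (n_units, hidden)
subspace_out = rng.normal(size=(n_units, hidden))

# Alpha: trainable matrix mixing all subspaces across tasks (random stand-in).
# Each row builds the next-layer input of one subspace as a linear combination
# of every subspace output, which is how selective sharing is learned.
alpha = rng.normal(size=(n_units, n_units))
next_inputs = alpha @ subspace_out               # (n_units, hidden)

# Beta: per-task weights over the outputs of several layers (skip connections),
# forming a mixture model that feeds the task-specific predictor.
num_layers = 3
layer_outputs = [rng.normal(size=(hidden,)) for _ in range(num_layers)]
beta = rng.normal(size=(num_layers,))
beta = np.exp(beta) / np.exp(beta).sum()         # normalize into mixture weights
task_repr = sum(b * h for b, h in zip(beta, layer_outputs))

print(next_inputs.shape, task_repr.shape)        # (4, 8) (8,)
```

In the full model, these α and β values would be trained jointly with the network weights by gradient descent rather than sampled at random.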

The sluice network thus offers a flexible, configurable approach to parameter sharing at different levels of granularity across tasks. The choice of these parameters is cast as a matrix regularization problem, in which sparsity and orthogonality constraints guide the division of labor among subspaces.
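
A minimal sketch of such an orthogonality penalty appears below: the squared Frobenius norm of the product of two subspace activation matrices, a standard way to encourage subspaces to capture complementary information. The function and variable names are hypothetical, not taken from the paper's code.

```python
# Hedged sketch of a subspace orthogonality penalty; names are illustrative.
import numpy as np

def orthogonality_penalty(subspace_a: np.ndarray, subspace_b: np.ndarray) -> float:
    """Squared Frobenius norm ||A^T B||_F^2 of two (batch, dim) activation matrices."""
    return float(np.linalg.norm(subspace_a.T @ subspace_b, ord="fro") ** 2)

rng = np.random.default_rng(0)
a = rng.normal(size=(32, 8))   # subspace 1 activations for a batch of 32
b = rng.normal(size=(32, 8))   # subspace 2 activations for the same batch
penalty = orthogonality_penalty(a, b)
# In training, this term would be added to the weighted task losses with its
# own coefficient, pushing the two subspaces toward non-redundant features.
print(penalty)
```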

Experimental Results

The empirical evaluation conducted on both synthetic data and real-world datasets, specifically OntoNotes 5.0, demonstrates the superiority of sluice networks across several tasks—chunking, named entity recognition (NER), semantic role labeling (SRL), and part-of-speech (POS) tagging. Key findings include:

  • Error Reduction: The sluice network achieves up to 15% average error reductions over conventional MTL approaches, reflecting its robust capacity to learn optimal sharing configurations.
  • Out-of-Domain Generalization: The architecture enhances model performance significantly in out-of-domain settings, indicating better generalization capabilities than standard MTL methods.
  • Performance Across Tasks: Evaluations show that sluice networks often outperform single-task models and other MTL architectures, proving particularly effective in syntactic tasks where low-level sharing is beneficial.

Theoretical and Practical Implications

The implications of latent MTL architecture learning are profound both in theory and practice:

  • Theoretical Contributions: The paper marks a significant advancement by formalizing a broad, parameter-learnable approach to multi-task architecture learning. It consolidates various strategies into a cohesive framework capable of dynamically optimizing task sharing.
  • Practical Applicability: The architecture holds promise for numerous NLP applications requiring efficient handling of syntactic and semantic tasks. It demonstrates potential for improving transfer learning scenarios, where auxiliary task information can be leveraged more effectively for enhanced primary task performance.

Future Directions

The innovative design of sluice networks opens pathways for further exploration in AI research. Future work may explore optimizing subspace configurations or investigating how model performance scales with larger numbers of tasks or more diverse datasets. Exploration of other architectural variants and regularization techniques may also yield further improvements in learning efficiency and transferability.

In summary, latent multi-task architecture learning as presented in this paper extends the frontier of MTL and paves the way for more adaptive and resource-efficient neural network designs. The depth and rigor of this research provide a robust foundation for subsequent advancements in the field.

Authors (4)
  1. Sebastian Ruder (93 papers)
  2. Joachim Bingel (4 papers)
  3. Isabelle Augenstein (131 papers)
  4. Anders Søgaard (120 papers)
Citations (168)