Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners (2204.07689v1)

Published 16 Apr 2022 in cs.LG and cs.CL

Abstract: Traditional multi-task learning (MTL) methods rely on dense networks that apply the same set of shared weights across several different tasks. This often creates interference where two or more tasks compete to pull model parameters in different directions. In this work, we study whether sparsely activated Mixture-of-Experts (MoE) improve multi-task learning by specializing some weights for learning shared representations and using the others for learning task-specific information. To this end, we devise task-aware gating functions to route examples from different tasks to specialized experts which share subsets of network weights conditioned on the task. This results in a sparsely activated multi-task model with a large number of parameters, but with the same computational cost as that of a dense model. We demonstrate that such sparse networks improve multi-task learning along three key dimensions: (i) transfer to low-resource tasks from related tasks in the training mixture; (ii) sample-efficient generalization to tasks not seen during training by making use of task-aware routing from seen related tasks; (iii) robustness to the addition of unrelated tasks by avoiding catastrophic forgetting of existing tasks.
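The abstract's central idea is a gate that conditions routing on the task as well as the input, so experts can specialize per task while keeping per-example compute equal to a dense layer. Below is a minimal PyTorch-style sketch of that idea, not the authors' released code: the class name `TaskAwareMoELayer`, the concatenation-based gate, the top-k value, and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch of task-aware top-k MoE routing (assumed design, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskAwareMoELayer(nn.Module):
    def __init__(self, d_model, num_experts=8, k=2, num_tasks=4, task_emb_dim=32):
        super().__init__()
        self.k = k
        self.task_embedding = nn.Embedding(num_tasks, task_emb_dim)
        # The gate sees the token representation concatenated with a task embedding,
        # so routing is conditioned on the task (one assumed way to make the gate "task-aware").
        self.gate = nn.Linear(d_model + task_emb_dim, num_experts)
        # Each expert is a small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x, task_id):
        # x: (batch, seq, d_model); task_id: (batch,) integer task labels.
        b, s, _ = x.shape
        t = self.task_embedding(task_id).unsqueeze(1).expand(b, s, -1)
        logits = self.gate(torch.cat([x, t], dim=-1))        # (b, s, num_experts)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)      # keep k experts per token
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize the kept weights

        # For clarity this sketch evaluates every expert on the full batch and masks
        # the contributions; a sparse implementation would dispatch only the routed tokens.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                           # (b, s, k): routed to expert e?
            if mask.any():
                w = (topk_w * mask).sum(dim=-1, keepdim=True)
                out = out + w * expert(x)
        return out
```

With k experts active per token, the activated parameter count per example stays close to a dense feed-forward layer even as the total number of experts grows, which matches the abstract's claim of a large-parameter model at dense-model computational cost.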

Authors (7)
  1. Shashank Gupta (57 papers)
  2. Subhabrata Mukherjee (59 papers)
  3. Krishan Subudhi (2 papers)
  4. Eduardo Gonzalez (7 papers)
  5. Damien Jose (7 papers)
  6. Ahmed H. Awadallah (7 papers)
  7. Jianfeng Gao (344 papers)
Citations (42)
