
Memory-Efficient Differentiable Transformer Architecture Search (2105.14669v1)

Published 31 May 2021 in cs.LG and cs.CL

Abstract: Differentiable architecture search (DARTS) has been successfully applied to many vision tasks. However, directly using DARTS for Transformers is memory-intensive, which renders the search process infeasible. To this end, we propose a multi-split reversible network and combine it with DARTS. Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs. By relieving the memory burden of DARTS, this allows us to search with a larger hidden size and more candidate operations. We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech. Experimental results show that our network consistently outperforms standard Transformers across these tasks. Moreover, our method compares favorably with the big-size Evolved Transformer, reducing search computation by an order of magnitude.
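The memory saving rests on reversibility: each layer's inputs can be recomputed exactly from its outputs, so only the final layer's activations are stored and earlier ones are reconstructed on the fly during backpropagation. Below is a minimal sketch of that idea in PyTorch using a generic two-split additive coupling; the sub-layer choices (plain Linear modules), the dimension, and the class name are illustrative assumptions, not the paper's actual multi-split architecture or search space.

```python
# Minimal sketch (not the authors' code): a two-split reversible coupling block.
# Because the block's inputs can be reconstructed exactly from its outputs,
# intermediate activations need not be cached during the forward pass; a
# backpropagation-with-reconstruction scheme can recompute them layer by layer.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, dim, f=None, g=None):
        super().__init__()
        # f and g stand in for arbitrary sub-layers (e.g. attention / FFN);
        # the Linear defaults here are an assumption for illustration only.
        self.f = f or nn.Linear(dim, dim)
        self.g = g or nn.Linear(dim, dim)

    def forward(self, x1, x2):
        # Additive coupling: outputs determine inputs uniquely.
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def reconstruct(self, y1, y2):
        # Invert the coupling: recover the block's inputs from its outputs,
        # so activations can be recomputed during the backward pass instead
        # of being stored.
        with torch.no_grad():
            x2 = y2 - self.g(y1)
            x1 = y1 - self.f(x2)
        return x1, x2

if __name__ == "__main__":
    torch.manual_seed(0)
    block = ReversibleBlock(dim=16)
    x1, x2 = torch.randn(2, 16), torch.randn(2, 16)
    y1, y2 = block(x1, x2)
    r1, r2 = block.reconstruct(y1, y2)
    # Inputs are recovered (up to floating-point error) from the outputs alone.
    print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))
```

In a stack of such blocks, only the outputs of the last layer need to be kept; each preceding layer's inputs are reconstructed when its gradients are computed, which is what makes differentiable search over larger hidden sizes and more candidate operations feasible.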

Authors (6)
  1. Yuekai Zhao (1 paper)
  2. Li Dong (154 papers)
  3. Yelong Shen (83 papers)
  4. Zhihua Zhang (118 papers)
  5. Furu Wei (291 papers)
  6. Weizhu Chen (128 papers)
Citations (15)
