Multi-Path Transformer is Better: A Case Study on Neural Machine Translation (2305.05948v1)

Published 10 May 2023 in cs.CL and cs.AI

Abstract: For years, model performance in machine learning has obeyed a power-law relationship with model size. For parameter efficiency, recent studies have focused on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer through a parameter-efficient multi-path structure. To better fuse the features extracted from different paths, we add three operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighting mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve performance similar to or better than that of the deeper model. This suggests that the multi-path structure deserves more attention, and that model depth and width should be balanced to train a better large-scale Transformer.
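
The three per-sublayer operations described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a hypothetical illustration, not the authors' implementation: the class name `MultiPathSublayer`, the use of self-attention as each path's transform, and the single linear map standing in for the "cheap operation" are all assumptions made for clarity.

```python
import torch
import torch.nn as nn


class MultiPathSublayer(nn.Module):
    """Hypothetical multi-path sublayer: parallel paths whose outputs are
    normalized per path, expanded by a cheap operation, and fused with
    learnable weights before the residual connection."""

    def __init__(self, d_model: int = 256, n_paths: int = 2, n_heads: int = 4):
        super().__init__()
        # Parallel paths; using self-attention for each path is an assumption,
        # and the same idea would apply to feed-forward sublayers.
        self.paths = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_paths)]
        )
        # (1) Normalization at the end of each path.
        self.path_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_paths)])
        # (2) A cheap operation deriving extra features from each path's output;
        #     a single linear map is an illustrative stand-in.
        self.cheap_ops = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_paths)])
        # (3) Learnable weights to fuse all original and derived features.
        self.fuse_weights = nn.Parameter(torch.zeros(2 * n_paths))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for attn, norm, cheap in zip(self.paths, self.path_norms, self.cheap_ops):
            out, _ = attn(x, x, x)
            out = norm(out)           # per-path normalization
            feats.append(out)
            feats.append(cheap(out))  # extra features from the cheap operation
        weights = torch.softmax(self.fuse_weights, dim=0)
        fused = sum(w * f for w, f in zip(weights, feats))
        return x + fused              # residual connection


# Example: one sublayer applied to a (batch, sequence, d_model) tensor.
x = torch.randn(8, 16, 256)
y = MultiPathSublayer(d_model=256, n_paths=2)(x)  # y has the same shape as x
```

Stacking fewer such layers while widening each of them with more paths is the trade-off the paper evaluates against deeper single-path baselines.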

Authors (6)
  1. Ye Lin (20 papers)
  2. Shuhan Zhou (8 papers)
  3. Yanyang Li (22 papers)
  4. Anxiang Ma (4 papers)
  5. Tong Xiao (119 papers)
  6. Jingbo Zhu (79 papers)
