MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning (1911.09483v1)

Published 17 Nov 2019 in cs.CL and cs.LG

Abstract: In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention can model extremely long dependencies, the attention in deep layers tends to overconcentrate on a single token, leading to insufficient use of local information and difficulty in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures. To this end, we propose the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple contains the basic idea of parallel multi-scale sequence representation learning, and it encodes the sequence in parallel, in terms of different scales with the help from self-attention, and pointwise transformation. MUSE builds on MUSE-simple and explores combining convolution and self-attention for learning sequence representations from more different scales. We focus on machine translation and the proposed approach achieves substantial performance improvements over Transformer, especially on long sequences. More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on unified semantic space. Under common setting, the proposed model achieves substantial performance and outperforms all previous models on three main machine translation tasks. In addition, MUSE has potential for accelerating inference due to its parallelism. Code will be available at https://github.com/lancopku/MUSE

Citations (48)

Summary

  • The paper introduces the MUSE model, which combines self-attention with depthwise separable convolution in parallel to capture both local and global dependencies.
  • It achieves state-of-the-art results, including a BLEU score of 43.5 on WMT14 En-Fr and 31% faster inference compared to standard Transformer models.
  • The approach leverages parallel multi-scale operations to enhance sequence representation, promising broader applications in natural language and related tasks.

MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning

The paper "MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning" introduces a novel approach to addressing the limitations found in employing self-attention mechanisms for sequence-to-sequence tasks, specifically in machine translation. Despite the proven efficacy of self-attention in capturing long-range dependencies, the authors identify a critical drawback: its tendency to over-concentrate attention on individual tokens in deeper layers. This phenomenon results in insufficient utilization of local information and challenges in representing extensive sequences accurately.

In response, the authors propose the Parallel Multi-Scale Attention framework in two variants: MUSE-simple, which captures the core idea, and the full MUSE model. MUSE-simple encodes a sequence by applying self-attention and a pointwise transformation in parallel at different scales; MUSE additionally blends in depthwise separable convolution, so that dependencies at several scales are captured and the resulting sequence representation is more balanced and comprehensive.
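As an illustration of the idea, the following is a minimal PyTorch sketch of a MUSE-style block, assuming a pre-norm residual layout and simple sum-based fusion; the official implementation at https://github.com/lancopku/MUSE differs in details such as gating, the shared projection discussed later, and dynamic kernel selection. The point it shows is the three parallel branches, each operating at a different scale, whose outputs are combined.

```python
import torch
import torch.nn as nn


class ParallelMultiScaleBlock(nn.Module):
    """Hypothetical MUSE-style block: attention, convolution, and FFN in parallel."""

    def __init__(self, d_model: int, n_heads: int, kernel_size: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Depthwise separable convolution: per-channel conv followed by a pointwise mix.
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)
        # Position-wise feed-forward branch.
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)                    # global scale
        conv_out = self.pointwise(
            self.depthwise(h.transpose(1, 2))).transpose(1, 2)  # local scale
        ffn_out = self.ffn(h)                               # token scale
        return x + attn_out + conv_out + ffn_out            # parallel fusion
```

Because the three branches are independent given the normalized input, they can be launched concurrently at inference time, which is the source of the parallelism the paper exploits.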

The model is evaluated on major machine translation tasks and delivers substantial performance improvements over the standard Transformer, especially on longer sequences. MUSE achieves state-of-the-art results, outperforming existing models under standard evaluation settings on three primary machine translation benchmarks: WMT14 En-Fr, WMT14 En-De, and IWSLT De-En.

Numerically, MUSE achieves a BLEU score of 43.5 on the WMT14 En-Fr test set, a notable improvement over other leading models such as DynamicConv and the Transformer with relative position encoding, which score lower. On the smaller IWSLT De-En and IWSLT En-Vi datasets, MUSE likewise leads with BLEU scores of 36.3 and 31.3, respectively.

Beyond translation quality, the paper demonstrates practical efficiency gains: MUSE achieves roughly 31% faster inference than the standard Transformer architecture on GPUs. The speedup follows from the model structure itself, since the self-attention and convolution branches can execute in parallel and therefore make better use of modern hardware.

The paper also carefully examines the configurations behind the model's success, underscoring that fusing self-attention and convolution only works well when both operate within a unified semantic space via a shared projection. The design additionally selects convolution kernel sizes dynamically, refining local and global feature extraction.
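A rough sketch of the shared-projection idea follows; the names, shapes, and single-head attention are assumptions for illustration, not the authors' code. Both branches read values from one shared projection rather than from separate, branch-specific value spaces.

```python
import torch
import torch.nn as nn


class SharedProjectionFusion(nn.Module):
    """Hypothetical fusion where attention and convolution share one value projection."""

    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.value_proj = nn.Linear(d_model, d_model)   # shared value space
        self.query_proj = nn.Linear(d_model, d_model)
        self.key_proj = nn.Linear(d_model, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        v = self.value_proj(x)                           # unified semantic space
        # Attention branch reads values from the shared space.
        scores = self.query_proj(x) @ self.key_proj(x).transpose(1, 2)
        attn = torch.softmax(scores / x.size(-1) ** 0.5, dim=-1) @ v
        # Convolution branch operates on the same shared values.
        conv = self.conv(v.transpose(1, 2)).transpose(1, 2)
        return attn + conv
```

Keeping a single value projection is what the paper refers to as building the multi-scale attention on a unified semantic space; giving each branch its own value space would let the two sets of features drift apart.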

The MUSE architecture emerges not only as a performance improvement for sequence-to-sequence learning but also as an idea that could inspire further advances in multi-scale attention models, with potential spillover into related fields such as speech recognition and computer vision. Its conceptual generality and parallel execution potential bode well for extending parallel multi-scale attention frameworks beyond natural language processing.

In conclusion, while the research advances state-of-the-art machine translation capabilities and introduces significant methodological contributions, its adaptability to broader AI applications remains an exciting avenue for future exploration and development.