The Evolved Transformer (1901.11117v4)

Published 30 Jan 2019 in cs.LG, cs.CL, cs.NE, and stat.ML

Abstract: Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters.

Overview of "The Evolved Transformer"

The paper "The Evolved Transformer" presents an exploration of using Neural Architecture Search (NAS) to improve upon the existing Transformer model for sequence-to-sequence tasks, specifically in NLP. The authors focus on leveraging evolutionary algorithms to discover superior architectures by refining the Transformer, seeded initially as a strong baseline model.

Methodology

The researchers utilized a robust combination of techniques to facilitate their search:

  1. Neural Architecture Search: Evolutionary NAS was used to navigate a large search space informed by recent advances in feed-forward sequence models, with the flexibility to design encoder and decoder structures akin to the Transformer's.
  2. Progressive Dynamic Hurdles (PDH): The authors introduced PDH, a method that allocates more computational resources to promising models by assessing their performance partway through training. Candidates that fail to clear a fitness hurdle are discarded early, improving search efficiency; a schematic sketch of this loop appears after the list.
  3. Warm Starting: The search was initialized with the Transformer itself, anchoring the process to a known effective architecture and improving search quality and convergence speed.
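
Below is a minimal, schematic sketch of how warm-started tournament-selection evolution could be combined with Progressive Dynamic Hurdles. It is not the paper's implementation: the architecture encoding, mutation operator, fitness function, hurdle rule (here, the mean fitness of the current population), and all training budgets are simplified placeholders.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    encoding: list                  # architecture encoding (search-space genes)
    steps_trained: int = 0          # training steps this candidate has received
    fitness: float = float("-inf")  # dev-set fitness proxy (e.g., negative perplexity)


def evolve_with_pdh(
    seed_encoding: list,
    mutate: Callable[[list], list],          # returns a mutated copy of an encoding
    train_and_eval: Callable[[list, int], float],  # trains to a step budget, returns fitness
    population_size: int = 100,
    num_rounds: int = 3,
    steps_per_round: int = 10_000,
    tournament_size: int = 10,
    generations_per_round: int = 50,
) -> Candidate:
    """Schematic warm-started evolution with Progressive Dynamic Hurdles (simplified)."""
    # Warm start: seed the initial population with the Transformer and mutations of it.
    population = [Candidate(seed_encoding)] + [
        Candidate(mutate(seed_encoding)) for _ in range(population_size - 1)
    ]

    budget = steps_per_round
    for _ in range(num_rounds):
        # Bring every survivor up to this round's training budget.
        for cand in population:
            if cand.steps_trained < budget:
                cand.fitness = train_and_eval(cand.encoding, budget)
                cand.steps_trained = budget

        # Hurdle for this round: mean fitness of the current population.
        hurdle = sum(c.fitness for c in population) / len(population)

        for _ in range(generations_per_round):
            # Tournament selection: the fittest of a random sample becomes the parent.
            parent = max(random.sample(population, tournament_size),
                         key=lambda c: c.fitness)
            child = Candidate(mutate(parent.encoding))
            child.fitness = train_and_eval(child.encoding, steps_per_round)
            child.steps_trained = steps_per_round

            # Only children that clear the hurdle earn the larger training budget.
            if child.fitness > hurdle and budget > steps_per_round:
                child.fitness = train_and_eval(child.encoding, budget)
                child.steps_trained = budget

            # Replace the weakest member with the child.
            population.remove(min(population, key=lambda c: c.fitness))
            population.append(child)

        budget += steps_per_round  # raise the training budget for the next round

    return max(population, key=lambda c: c.fitness)
```

The idea PDH captures is that a child model only earns a larger training budget after its early-training fitness clears a hurdle, so compute concentrates on the more promising architectures rather than being spent equally on every candidate.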

Results

The outcome of the NAS was the "Evolved Transformer" (ET), which demonstrated consistent enhancements over the original Transformer on multiple NLP tasks. Key performance improvements include:

  • Achieving a BLEU score of 29.8 on the WMT 2014 English-German translation task, setting a new state-of-the-art.
  • Matching the original "big" Transformer's quality using 37.6% fewer parameters, and even outperforming it at smaller model sizes, notably achieving a 0.7 BLEU increase at a compact ~7M-parameter size.

Analysis

The architecture of ET incorporates several refinements:

  • Wide depth-wise separable convolutions and Gated Linear Units, which improve computational efficiency and translation quality.
  • Branched structures and Swish activations, which add capacity for complex sequence representation; a toy sketch combining these ingredients follows.
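
As a concrete illustration only, the PyTorch block below combines the ingredients listed above (a GLU-gated branch, a wide depthwise-separable convolution, and a Swish/SiLU activation) inside one branched residual block. The layer widths, kernel size, and branch layout are hypothetical choices for the sketch and do not reproduce the exact Evolved Transformer cell found by the search.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchedConvBlock(nn.Module):
    """Illustrative branched block (not the exact ET cell): a GLU-gated linear
    branch plus a wide depthwise-separable convolution branch with Swish."""

    def __init__(self, d_model: int = 512, kernel_size: int = 9, d_wide: int = 2048):
        super().__init__()
        # Branch 1: GLU gating -- a projection to 2*d_model, split into value and gate.
        self.glu_proj = nn.Linear(d_model, 2 * d_model)
        # Branch 2: depthwise-separable convolution over the sequence dimension.
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_wide, kernel_size=1)
        self.project_back = nn.Conv1d(d_wide, d_model, kernel_size=1)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        residual = x
        x = self.norm(x)

        # GLU branch: value * sigmoid(gate).
        glu_out = F.glu(self.glu_proj(x), dim=-1)

        # Depthwise-separable conv branch with a Swish (SiLU) nonlinearity.
        conv_in = x.transpose(1, 2)                        # (batch, d_model, seq_len)
        conv_out = F.silu(self.pointwise(self.depthwise(conv_in)))
        conv_out = self.project_back(conv_out).transpose(1, 2)

        # The branches are summed and added back to the residual stream.
        return residual + glu_out + conv_out


# Example usage with dummy data:
# block = BranchedConvBlock()
# y = block(torch.randn(2, 10, 512))   # -> shape (2, 10, 512)
```

In a decoder, the convolution branch would additionally need causal padding; the sketch omits this for brevity.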

Implications

The findings underscore the efficacy of NAS, and of evolutionary methods in particular, in discovering architectures that improve on human-designed models. The results show how automated search can innovate within an established framework to yield better performance on core NLP tasks such as translation and language modeling.

Future Directions

Continued exploration can focus on:

  • Extending the NAS methodology to other domains beyond NLP and evaluating its effectiveness on diverse datasets and tasks.
  • Integrating data augmentation and hyperparameter tuning strategies to further elevate performance at larger model sizes where ET's advantages begin to saturate.

The paper demonstrates the potential of NAS to refine and enhance machine learning models, providing a path forward for using similar techniques to transform other aspects of artificial intelligence and deep learning.

Authors (3)
  1. David R. So (11 papers)
  2. Chen Liang (140 papers)
  3. Quoc V. Le (128 papers)
Citations (448)