
A Transformer-based Neural Architecture Search Method (2505.01314v1)

Published 2 May 2025 in cs.CL and cs.NE

Abstract: This paper presents a neural architecture search method based on the Transformer architecture, searching over multi-head attention computation schemes for different combinations of encoder and decoder blocks. To find neural network structures with better translation results, perplexity is used as an auxiliary evaluation metric alongside the BLEU score, and each individual neural network in the population is iteratively improved by a multi-objective genetic algorithm. Experimental results show that the neural network structures found by the algorithm outperform all baseline models, and that introducing the auxiliary evaluation metric yields better models than considering the BLEU score alone.

Summary

Transformer-Based Neural Architecture Search Method

The research presented in this paper outlines a Transformer-based Neural Architecture Search (NAS) approach tailored to neural machine translation. The authors apply a genetic algorithm to the architecture search, in a method denoted MO-Trans. Rather than fixing the number and composition of encoder and decoder blocks as in the standard Transformer, the method optimizes their configuration dynamically to improve translation quality.

Methodological Overview

The approach utilizes a multi-objective evolutionary algorithm, MOEA/D, to facilitate decomposed optimization over two key metrics: BLEU score and perplexity. BLEU score functions as the primary evaluation parameter while perplexity offers auxiliary insight into model prediction capabilities. Through a variable-length genetic coding strategy akin to EvoCNN, this method enables the exploration of disparate configurations of encoder and decoder blocks, multihead attention layers, and feed-forward network (FFN) dimensions.
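To make the encoding concrete, below is a minimal sketch of such a variable-length genome and its two-objective evaluation. The gene fields, value ranges, and helper names (random_genome, build_and_train) are illustrative assumptions, not details taken from the paper.

```python
import random

# Illustrative search-space choices (assumed values, not the paper's).
HEAD_CHOICES = [2, 4, 8]
FFN_CHOICES = [512, 1024, 2048]

def random_genome(max_blocks=6):
    """Variable-length genome, EvoCNN-style: one gene per encoder/decoder block."""
    genome = []
    for block_type in ("encoder", "decoder"):
        for _ in range(random.randint(1, max_blocks)):
            genome.append({
                "type": block_type,                        # which stack the block belongs to
                "num_heads": random.choice(HEAD_CHOICES),  # multi-head attention width
                "ffn_dim": random.choice(FFN_CHOICES),     # feed-forward hidden size
            })
    return genome

def evaluate(genome, build_and_train, dev_set):
    """Two objectives per individual: BLEU (maximize) and perplexity (minimize).

    `build_and_train` is a placeholder that would construct and train the
    Transformer described by `genome` before scoring it on `dev_set`.
    """
    model = build_and_train(genome)
    return model.bleu(dev_set), model.perplexity(dev_set)
```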

Genetic operations such as crossover and mutation are implemented to introduce variability in the population of individuals representing potential network architectures. The crossover operation is executed at the block level between two parent architectures, carrying forward the trait diversity across the evolving population. Meanwhile, mutation serves to introduce additional variability, altering block types or the specifics of FFN dimensions and MHA layers.
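Continuing the sketch above, block-level crossover and per-gene mutation might look roughly as follows; the single cut point, mutation probability, and equal choice among mutation targets are assumptions made for illustration.

```python
import random

def crossover(genome_a, genome_b):
    """One-point crossover at block granularity between two variable-length genomes.
    Simplified: does not enforce ordering constraints between encoder and decoder blocks."""
    cut_a = random.randint(1, len(genome_a) - 1)
    cut_b = random.randint(1, len(genome_b) - 1)
    child_a = genome_a[:cut_a] + genome_b[cut_b:]
    child_b = genome_b[:cut_b] + genome_a[cut_a:]
    return child_a, child_b

def mutate(genome, p=0.1):
    """With probability p per block, resample its attention heads or FFN width,
    or flip the block type (encoder <-> decoder)."""
    for gene in genome:
        if random.random() < p:
            choice = random.random()
            if choice < 1 / 3:
                gene["num_heads"] = random.choice(HEAD_CHOICES)
            elif choice < 2 / 3:
                gene["ffn_dim"] = random.choice(FFN_CHOICES)
            else:
                gene["type"] = "decoder" if gene["type"] == "encoder" else "encoder"
    return genome
```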

Experimental Results

The empirical evaluation focuses on translation tasks across English-German and German-English language pairs using the Multi30k dataset. The algorithm demonstrated superior performance relative to baseline Transformer configurations, effectively optimizing architectures for enhanced BLEU scores. Specifically, configurations found at k = 0.5 and k = 0.75 demonstrated noteworthy improvements, indicating the benefit of perplexity as a secondary evaluation metric.
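The summary does not spell out what k denotes; one plausible reading, given MOEA/D's decomposition scheme, is a weight trading off the BLEU objective against perplexity. Under that assumption, a scalarized sub-problem could look like the following sketch (the normalization constants and signs are illustrative, not the paper's).

```python
def scalarized_fitness(bleu, perplexity, k, bleu_ref=100.0, ppl_ref=10.0):
    """Weighted-sum decomposition in the style of MOEA/D.

    Assumption: k weights the normalized BLEU objective and (1 - k) weights
    negative normalized perplexity; the paper's exact definition of k may differ.
    """
    return k * (bleu / bleu_ref) - (1 - k) * (perplexity / ppl_ref)
```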

Analysis of the searched architectures shows that reductions in perplexity correlated with improved translation quality, offering insight into a model's ability to predict sequential language properties. These findings suggest that integrating a secondary metric can substantiate and refine architecture selection beyond conventional BLEU-centric criteria.

Implications and Future Directions

The research proposes a meaningful extension of NAS methodologies to Transformer models, demonstrating the tangible benefits of multi-metric optimization in language translation tasks. Techniques that evolve and augment Transformer configurations in this way may prove instrumental in advancing machine translation, contributing toward more nuanced and accurate neural network designs.

Moving forward, future work could extend genetic-algorithm-based NAS beyond translation, exploring applications in domains such as image recognition, natural language understanding, or more complex sequential prediction tasks. Incorporating further auxiliary metrics could also enhance the search process, offering deeper insight into architecture dynamics, multi-head attention mechanisms, and encoder-decoder block interdependencies.

The paper thus underscores the importance of NAS in evolving neural network capabilities, establishing a path toward more efficient control over model architectures via tailored evolutionary algorithms.
