Evolved Transformer Model
- The paper introduces an architecture discovered via neural architecture search guided by Progressive Dynamic Hurdles (PDH); the resulting model uses wide depthwise separable convolutions and gated linear units to improve sequence modeling.
- Empirical results show that the Evolved Transformer delivers higher BLEU scores and faster convergence on benchmarks like WMT’14 compared to the baseline Transformer.
- The model evolution concept extends to EgoTR, which applies self-cross attention and learnable positional embeddings to tackle cross-view geo-localization challenges.
The Evolved Transformer Model encompasses two distinct research directions in neural network architecture: (1) the Evolved Transformer (ET), a sequence-modeling architecture discovered through neural architecture search targeting improved efficiency and accuracy (So et al., 2019); and (2) the Evolving Geo-localization Transformer (EgoTR), an architecture specifically designed for cross-view geo-localization in computer vision, leveraging evolving representations and novel attention mechanisms to address extreme domain shifts (Yang et al., 2021). Both models extend the foundational Transformer paradigm, introducing learned architectural motifs and mechanisms that enhance parameter efficiency, convergence stability, and generalization across tasks and modalities.
1. Neural Architecture Search and the Evolved Transformer (ET)
The Evolved Transformer (So et al., 2019) was derived via large-scale neural architecture search (NAS) using an evolutionary algorithm combined with the Progressive Dynamic Hurdles (PDH) resource allocation strategy. The search space comprised cell-based encodings, where each cell consists of blocks parameterizing branch structure, normalization, layer type (including depthwise separable convolutions, lightweight convolutions, multi-head self-attention, Gated Linear Units (GLU), skip connections), output dimension ratios, and activation functions (swish, ReLU, leaky ReLU, none). Encoder and decoder cells, each containing multiple such blocks, were evolved under constraints of residual path connectivity, minimum encoder-attention in the decoder, and a parameter-count budget comparable to the base Transformer.
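As an illustration, the cell-based encoding described above might be represented as follows. This is a simplified stand-in for the paper's actual genotype format: the field names, value sets, and the uniform sampler are ours, not So et al.'s exact encoding.

```python
import random
from dataclasses import dataclass

# Simplified vocabularies drawn from the search space described above;
# the paper's actual option lists are larger and more detailed.
LAYER_TYPES = [
    "sep_conv_3x1", "sep_conv_11x1", "lightweight_conv",
    "self_attention", "glu", "identity",
]
ACTIVATIONS = ["swish", "relu", "leaky_relu", "none"]

@dataclass
class Block:
    left_layer: str      # layer type of the left branch
    right_layer: str     # layer type of the right branch
    activation: str      # activation applied within the block
    combiner: str        # how branch outputs are merged
    output_ratio: float  # relative output dimension

@dataclass
class Cell:
    blocks: list  # an encoder or decoder cell is a sequence of Blocks

def random_block(rng):
    """Sample one block uniformly from this simplified search space."""
    return Block(
        left_layer=rng.choice(LAYER_TYPES),
        right_layer=rng.choice(LAYER_TYPES),
        activation=rng.choice(ACTIVATIONS),
        combiner=rng.choice(["add", "concat"]),
        output_ratio=rng.choice([0.5, 1.0, 2.0, 4.0]),
    )

cell = Cell(blocks=[random_block(random.Random(0)) for _ in range(6)])
print(len(cell.blocks))  # 6
```

An evolutionary search would mutate and recombine such genotypes, subject to the connectivity and parameter-budget constraints noted above.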
The PDH method enabled computationally efficient search by dynamically assigning additional training steps to candidates surpassing running population mean performance at defined step increments, truncating poorly performing runs early, and thus optimizing search over the expensive WMT'14 English-German translation benchmark.
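The hurdle mechanism can be sketched as a truncation loop; this is a minimal illustration of the idea only, with a toy fitness function standing in for actual training on WMT'14.

```python
def progressive_dynamic_hurdles(population, evaluate, hurdles):
    """Sketch of PDH: candidates train in increments, and only those whose
    fitness beats the running population mean at each hurdle receive the
    next (larger) training budget. `evaluate(cand, steps)` stands in for
    training a candidate for `steps` steps and returning its fitness."""
    survivors = list(population)
    steps_so_far = 0
    for extra_steps in hurdles:
        steps_so_far += extra_steps
        scored = [(evaluate(c, steps_so_far), c) for c in survivors]
        mean_fitness = sum(s for s, _ in scored) / len(scored)
        # Candidates below the running mean are truncated early and never
        # consume the remaining, most expensive training steps.
        survivors = [c for s, c in scored if s >= mean_fitness]
    return survivors

# Toy run: candidates are integers and "fitness" is the candidate itself.
winners = progressive_dynamic_hurdles(
    population=range(1, 9),
    evaluate=lambda cand, steps: cand,
    hurdles=[1000, 1000],
)
print(winners)  # [7, 8]
```

The key economy is that the full step budget is spent only on candidates that survive every hurdle.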
2. Core Architectural Innovations in Evolved Transformer
The final Evolved Transformer architecture incorporates four distinct enhancements over the canonical Transformer:
- Wide Depthwise Separable Convolutions: Large kernel (up to 11×1) depthwise-separable convolutions in early blocks extend the contextual window efficiently, with computational cost scaling as O(k·d) for kernel size k and channel count d, compared to O(k·d²) for standard convolutions. These operations precede self-attention layers, enabling preliminary local feature extraction before global mixing.
- Gated Linear Units (GLU): Certain feed-forward sublayers employ a gating mechanism in which the input is projected twice and one projection gates the other: GLU(x) = (xW + b) ⊙ σ(xV + c). This enables dynamic channel-wise information flow, reportedly enhancing expressivity while controlling parameter growth.
- Branched and Parallel Substructures: Blocks may employ parallel branches (e.g., conv/GLU) whose outputs are combined via addition or concatenation, providing multiple receptive field paths. In the ET encoder cell, for instance, one branch may execute a deep convolution with swish activation while another is pruned, or both can process features in parallel.
- Swish Activation: Early blocks adopt the smooth, non-monotonic swish activation function swish(x) = x·σ(x), empirically yielding faster convergence and improved generalization relative to ReLU.
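The convolutional and gating motifs above can be sketched in NumPy. This is an illustrative single-example implementation under our own simplifications (biases omitted, 'same' zero padding, single 1D sequence), not the model's actual code.

```python
import numpy as np

def swish(x):
    # swish(x) = x * sigmoid(x): smooth and non-monotonic, unlike ReLU.
    return x / (1.0 + np.exp(-x))

def glu(x, W, V):
    # GLU(x) = (x @ W) * sigmoid(x @ V): the sigmoid branch acts as a
    # learned per-channel gate on the linear branch (biases omitted).
    return (x @ W) / (1.0 + np.exp(-(x @ V)))

def depthwise_separable_conv1d(x, depth_kernels, point_weights):
    """Depthwise step: each of the d channels is convolved with its own
    length-k kernel ('same' zero padding), costing O(k*d) per position.
    Pointwise step: a 1x1 mixing matrix across channels. A standard
    convolution would instead cost O(k*d^2) per position.
    x: (length, d); depth_kernels: (d, k); point_weights: (d, d_out)."""
    length, channels = x.shape
    k = depth_kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    depth_out = np.empty_like(x)
    for c in range(channels):
        depth_out[:, c] = np.convolve(
            xp[:, c], depth_kernels[c], mode="valid"
        )[:length]
    return depth_out @ point_weights
```

Chaining a wide `depthwise_separable_conv1d` before an attention layer mirrors the local-then-global mixing order described above.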
These motifs are compositional, with ablation studies indicating that no single modification solely accounts for observed performance gains; rather, the synergy among them improves modeling capacity and efficiency (So et al., 2019).
3. Empirical Results and Efficiency Analysis
The Evolved Transformer demonstrates consistent improvements over the baseline Transformer across several benchmarks:
- On WMT'14 English-German, ET attains 29.8 BLEU at 218M parameters, outperforming previous state-of-the-art models with comparable budgets.
- At smaller model sizes (e.g., ~7M parameters), ET surpasses the Transformer by +0.7 BLEU with only ≈2% greater parameter count.
- On additional tasks (WMT’14 English-French, WMT’14 English-Czech, and LM1B), ET achieves lower perplexity and, where relevant, higher BLEU scores at comparable or slightly increased parameter counts.
- FLOPs-versus-BLEU analysis reveals only modest compute overhead for ET relative to Transformer, primarily due to scalable depthwise-separable convolution operations.
A summary table of model sizes and test BLEU on WMT’14 English-German:
| Model | Parameters | Test BLEU |
|---|---|---|
| Transformer | 7.0M | 21.3 |
| ET | 7.2M | 22.0 |
| Transformer | 61.1M | 27.7 |
| ET | 64.1M | 28.2 |
| Transformer | 210.4M | 28.8 |
| ET | 221.7M | 29.0 |
ET’s gains arise from superior local feature encoding, dynamic channel gating, multiple receptive field paths, and improved nonlinearities (So et al., 2019).
4. The Evolving Geo-localization Transformer (EgoTR)
EgoTR (Yang et al., 2021) extends Transformer architectures to cross-view geo-localization, matching street-level query images to geo-tagged aerial images where visual appearance and geometry differ substantially. The model incorporates two independent ViT-style branches (ground and aerial), each comprising:
- ResNet Backbone generating dense feature maps, flattened into patch tokens.
- Learnable 1D Positional Embeddings added to input patch and class embeddings, optimized during task learning rather than fixed.
- Self-Cross Attention Layers, a core innovation that, at each layer, computes queries from current-layer outputs and keys and values from the previous layer’s normalized features, yielding a cross-attention map that models representation evolution across depth. This shortcut attention, alongside usual horizontal attention, improves stability and discourages layer redundancy.
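The cross-layer attention pattern can be sketched as follows. This is a single-head, NumPy-level illustration under our own naming conventions; EgoTR's actual implementation is multi-head and integrated into a full Transformer block.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_cross_attention(h_curr, h_prev, Wq, Wk, Wv):
    """Queries come from the current layer's token features while keys and
    values come from the previous layer's features, so the attention map
    relates representations across depth rather than within one layer."""
    q = h_curr @ Wq  # queries from layer l
    k = h_prev @ Wk  # keys from layer l-1
    v = h_prev @ Wv  # values from layer l-1
    scale = np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T / scale)  # cross-layer attention map
    return attn @ v
```

Passing `h_prev = h_curr` recovers ordinary ("horizontal") self-attention, which makes the shortcut nature of the cross-layer variant explicit.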
The network is trained via a weighted soft-margin triplet loss using L2-normalized descriptors from the class token positions of both branches, enabling invariant matching between disparate viewpoints.
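A common form of the weighted soft-margin triplet loss on L2-normalized descriptors is L = log(1 + exp(α·(d_pos − d_neg))); the sketch below assumes this form, and the scaling factor α = 10 is a typical choice in cross-view retrieval work rather than a value quoted from the paper.

```python
import numpy as np

def weighted_soft_margin_triplet(anchor, positive, negative, alpha=10.0):
    """L = log(1 + exp(alpha * (d_pos - d_neg))) on L2-normalized
    descriptors: pushes the matching ground/aerial pair closer together
    than the mismatched pair. alpha (assumed 10 here) sharpens the
    penalty on hard triplets."""
    a = anchor / np.linalg.norm(anchor)
    p = positive / np.linalg.norm(positive)
    n = negative / np.linalg.norm(negative)
    d_pos = np.linalg.norm(a - p)  # anchor-to-match distance
    d_neg = np.linalg.norm(a - n)  # anchor-to-mismatch distance
    return float(np.log1p(np.exp(alpha * (d_pos - d_neg))))
```

The loss approaches zero once the matching pair is much closer than the mismatched pair, and grows roughly linearly in the margin violation otherwise.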
5. Mechanisms and Empirical Outcomes in EgoTR
EgoTR’s self-cross attention enables explicit modeling of cross-layer dependencies, enhancing information flow and mitigating representational collapse. Its learnable positional embeddings eschew fixed geometric priors, instead discovering relative and view-specific encodings—allowing flexible spatial correspondence.
Empirical analysis reveals:
- Self-cross attention accelerates convergence and stabilizes training, as evidenced by recall@1 dynamics.
- Cross-layer feature similarity decays more rapidly with depth, indicating genuine representational evolution.
- Relative position embeddings manifest as nearest-neighbor locality in the learned token grid, supporting robustness to orientation and geometric distortions.
- EgoTR establishes new state-of-the-art recall@1 on benchmarks such as CVUSA/CVACT, surpassing previous architectures by 10–13% in cross-dataset transfer, and yields up to 1.5% gain in recall@1 in few-shot regimes compared to vanilla attention ViT (Yang et al., 2021).
6. Comparative Analysis and Implications
The Evolved Transformer and EgoTR share a paradigm of model “evolution” but differ in methodology and application. ET employs automated NAS over a broad, multi-modal search space to discover architectural motifs for sequence tasks, emphasizing language modeling efficiency. EgoTR customizes the Transformer block to vision by designing attention mechanisms that address domain-specific representational challenges in geo-localization.
Both approaches converge on architectural innovations—noncanonical attention flows, branch parallelism, and parameter-efficient nonlinearities—that collectively extend the Transformer’s core capabilities. The resulting performance improvements across language and vision tasks empirically validate the model evolution strategies, supporting a broader thesis that systematic architectural exploration and targeted mechanistic modification can yield robust, generalizable Transformer variants.