Three-Way Weight Tying (TWWT) in NMT
- Three-Way Weight Tying (TWWT) is a parameter-sharing strategy that integrates input embeddings, output classifiers, and decoder-context projections into a joint embedding space.
- It enhances translation performance, especially for morphologically rich languages, by enabling flexible control over the output-layer capacity.
- Empirical evaluations demonstrate consistent BLEU improvements over baseline models while ensuring robustness across different network depths and vocabulary sizes.
Three-way weight tying (TWWT) is a parameter-sharing strategy introduced for neural machine translation (NMT) models, specifically enhancing the standard attention-based encoder–decoder architecture. TWWT generalizes conventional weight tying (where input embeddings and output classifiers share parameters) by introducing a joint input–output embedding. This mechanism ties not only the input embeddings and output classifiers but also the decoder-context projection into a shared parameter space. The resulting structure-aware output layer enables explicit control over model capacity and demonstrates superior empirical performance in translation tasks, notably for morphologically rich languages (Pappas et al., 2018).
1. Standard Attention-Based NMT and Weight Tying
In the baseline NMT setup, the encoder maps source word indices to embedding vectors via an embedding matrix $E_s \in \mathbb{R}^{|V_s| \times d}$, producing encoder hidden states $h_1, \dots, h_n$ using LSTMs or bi-LSTMs. At each decoding step $t$, the decoder LSTM state $s_t$ is computed from the embedding of the previous target token $e(y_{t-1})$ and the prior attention context $c_{t-1}$. The attention mechanism computes context vectors as weighted sums over encoder states, with weights determined by

$$\alpha_{ti} = \frac{\exp(\mathrm{score}(s_t, h_i))}{\sum_{i'=1}^{n} \exp(\mathrm{score}(s_t, h_{i'}))}$$

and

$$c_t = \sum_{i=1}^{n} \alpha_{ti} h_i.$$

The decoder output is projected through a softmax-linear layer,

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W s_t + b),$$

with $W \in \mathbb{R}^{|V| \times d_h}$ and $b \in \mathbb{R}^{|V|}$. In conventional weight tying, setting $W = E$ (the target embedding matrix, which requires $d = d_h$) enforces equality between input embeddings and output classifiers, improving parameter efficiency and often translation quality.
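As a concrete illustration, a conventional weight-tied output layer can be sketched in a few lines of NumPy. The dimensions, initialization, and the helper `output_distribution` are illustrative assumptions, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64          # vocabulary size, shared embedding/hidden dimension

# Input embedding matrix E (V x d); rows are word embeddings.
E = rng.standard_normal((V, d)) * 0.1
b = np.zeros(V)          # output bias

def softmax(z):
    z = z - z.max()      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Conventional weight tying: the output classifier reuses E, so each logit
# is the inner product between the decoder state and a row of E.
def output_distribution(s):          # s: decoder state, shape (d,)
    logits = E @ s + b               # W = E (tied), shape (V,)
    return softmax(logits)

p = output_distribution(rng.standard_normal(d))
```

Tying halves the output-side parameter count relative to a separate classifier matrix, since no second $|V| \times d$ matrix is stored.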
2. Joint Input–Output Embedding Formulation
TWWT replaces the linear output projection with a nonlinear joint-embedding approach. Each target word embedding and each decoder hidden state is nonlinearly projected into a shared $d_j$-dimensional joint space. Specifically, for embedding $e_j$ (the $j$-th row of $E$) and decoder state $s_t$,

$$e_j' = g(U e_j + b_U), \qquad s_t' = g(V s_t + b_V),$$

with $U \in \mathbb{R}^{d_j \times d}$, $V \in \mathbb{R}^{d_j \times d_h}$, $b_U \in \mathbb{R}^{d_j}$, and $b_V \in \mathbb{R}^{d_j}$; $g$ is a nonlinearity (tanh in the empirical studies). The score for candidate $j$ at position $t$ is

$$o_{tj} = (e_j')^{\top} s_t' + b_j,$$

where $b \in \mathbb{R}^{|V|}$. Thus, prediction probabilities become

$$p(y_t = j \mid y_{<t}, x) = \frac{\exp(o_{tj})}{\sum_{j'=1}^{|V|} \exp(o_{tj'})}.$$

In matrix notation, letting $E' = g(E U^{\top} + \mathbf{1} b_U^{\top})$ and $s_t' = g(V s_t + b_V)$, the output is $p(y_t \mid y_{<t}, x) = \mathrm{softmax}(E' s_t' + b)$.
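The joint scoring step above can be sketched in NumPy. Dimensions and initialization are illustrative; `Vm` stands in for the context-projection matrix $V$ (renamed to avoid clashing with the vocabulary size):

```python
import numpy as np

rng = np.random.default_rng(1)
Vocab, d, dh, dj = 1000, 64, 96, 128   # vocab, embedding, decoder-state, joint dims

E   = rng.standard_normal((Vocab, d)) * 0.1   # target embedding matrix
U   = rng.standard_normal((dj, d)) * 0.1      # embedding -> joint space
Vm  = rng.standard_normal((dj, dh)) * 0.1     # decoder state -> joint space
b_u = np.zeros(dj)
b_v = np.zeros(dj)
b   = np.zeros(Vocab)                          # per-word bias

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def joint_output_distribution(s_t):
    """Score every word against the decoder state in the shared joint space."""
    E_joint = np.tanh(E @ U.T + b_u)   # (Vocab, dj): projected word embeddings
    s_joint = np.tanh(Vm @ s_t + b_v)  # (dj,): projected decoder state
    return softmax(E_joint @ s_joint + b)

p = joint_output_distribution(rng.standard_normal(dh))
```

Note that the expensive `E_joint` projection is independent of $t$, so in practice it can be computed once per batch rather than per decoding step.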
3. Parameter Tying in TWWT
TWWT introduces a three-way tying among:
- The input embedding parameters $E$
- The output classifier projection $U$
- The decoder-context projection $V$

These can be tied to a single shared matrix $W$ and bias $b_W$, i.e. $U = V = W$ and $b_U = b_V = b_W$ (requiring $d = d_h$), while $E$ is reused as the output embedding, yielding

$$e_j' = g(W e_j + b_W), \qquad s_t' = g(W s_t + b_W).$$

This structured parameter sharing governs not only the input and output mappings but also the transformation of the decoder's context. Additionally, optional residual or gating components with learned gates can be added to relax the hard tie.
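A minimal sketch of the tied projection follows, including a hypothetical residual gate: the exact gating form is not specified above, so the per-dimension convex combination used here is an assumption, and it requires $d_j = d = d_h$:

```python
import numpy as np

rng = np.random.default_rng(2)
Vocab = 1000
d = dh = dj = 64      # tying U = V (and the residual path) needs matching dims

E   = rng.standard_normal((Vocab, d)) * 0.1   # shared embedding matrix
W   = rng.standard_normal((dj, d)) * 0.1      # single shared projection: U = V = W
b_w = np.zeros(dj)                             # single shared bias
b   = np.zeros(Vocab)

# Hypothetical residual gate (assumed form): a learned per-dimension
# convex combination of the tied projection and the identity path.
gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(dj)))

def project(x):
    """One shared projection handles both embeddings and decoder states."""
    tied = np.tanh(W @ x + b_w)
    return gate * tied + (1.0 - gate) * x

def score(j, s_t):
    """Joint-space score of candidate word j given decoder state s_t."""
    return project(E[j]) @ project(s_t) + b[j]

val = score(3, rng.standard_normal(dh))
```

With `gate` near 1 the hard tie dominates; near 0 the raw vectors pass through, relaxing the constraint.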
4. Control of Output-Layer Capacity
TWWT allows flexible capacity adjustment through the joint-space dimensionality $d_j$, interpolating from low-capacity, tightly regularized models to high-capacity models akin to an unrestricted softmax layer. The number of parameters in the joint output layer is

$$d_j (d + d_h) + 2 d_j + |V|,$$

where $|V|$ is the vocabulary size (counting $U$, $V$, their biases, and the per-word bias $b$). Varying $d_j$ thus trades off model compactness against expressivity, with capacity growing through the regimes $d_j < d$, $d_j = d$, and $d_j > d$. Notably, adjusting $d_j$ does not require re-architecting the overall network.
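The trade-off can be made concrete with a small parameter-count helper; the 32k vocabulary and 512-dimensional sizes below are illustrative choices, not values from the paper:

```python
def joint_output_params(vocab, d, dh, dj, tied=False):
    """Parameters of the joint output layer: U (dj x d), V (dj x dh),
    biases b_U, b_V (dj each), and the per-word bias b (vocab).
    With tying (U = V, shared bias), only one projection remains."""
    if tied:
        assert d == dh, "tying U = V requires matching input dimensions"
        return dj * d + dj + vocab
    return dj * d + dj * dh + 2 * dj + vocab

def plain_softmax_params(vocab, dh):
    # Unrestricted softmax layer: W (vocab x dh) plus bias b (vocab).
    return dh * vocab + vocab

vocab, d, dh = 32_000, 512, 512        # illustrative sizes
small = joint_output_params(vocab, d, dh, dj=128)
large = joint_output_params(vocab, d, dh, dj=2048)
full  = plain_softmax_params(vocab, dh)
```

Because the joint-layer count scales with $d_j(d + d_h)$ rather than $d_h|V|$, even a large $d_j$ stays far below the unrestricted softmax for realistic vocabularies.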
5. Empirical Evaluation and Implementation
Experiments evaluate TWWT on English–Finnish and English–German translation, in both directions. The empirical setup includes:
- Subword vocabularies constructed via BPE.
- A fixed embedding size and decoder LSTM hidden size across all compared models.
- Joint-space dimension $d_j$ selected on development sets.
- Stacked LSTMs: a 2-layer baseline, with additional experiments at 1, 4, and 8 layers.
- Dropout after LSTM layers, the Adam optimizer, and negative sampling on large vocabularies (sampling rates of 25% and 75%, depending on vocabulary size).
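The negative-sampling step can be sketched as a generic sampled-softmax approximation; uniform sampling of negatives is an assumption here, and the paper's exact estimator may differ:

```python
import numpy as np

rng = np.random.default_rng(3)
Vocab, dh = 50_000, 64
W_out = rng.standard_normal((Vocab, dh)) * 0.1   # output classifier rows
b = np.zeros(Vocab)

def sampled_nll(s_t, target, sample_frac=0.25):
    """Negative log-likelihood over the target plus a uniformly sampled
    fraction of the vocabulary (a sketch, not the paper's exact estimator)."""
    k = int(Vocab * sample_frac)
    negatives = rng.choice(Vocab, size=k, replace=False)
    idx = np.unique(np.append(negatives, target))  # sorted, target included
    logits = W_out[idx] @ s_t + b[idx]
    logits -= logits.max()                         # numerical stability
    log_z = np.log(np.exp(logits).sum())
    pos = int(np.searchsorted(idx, target))        # locate target in sorted idx
    return log_z - logits[pos]

loss = sampled_nll(rng.standard_normal(dh), target=42)
```

Restricting the normalization to a sampled subset cuts the cost of each training step roughly in proportion to the sampling rate.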
Translation quality (BLEU) results with BPE vocabularies are summarized below:
| Task | Baseline | Weight-Tied | Joint/TWWT |
|---|---|---|---|
| English→Finnish | 12.68 | 12.58 | 13.03* |
| Finnish→English | 9.42 | 9.59 | 10.19* |
| English→German | 18.46 | 18.48 | 19.79* |
| German→English | 15.85 | 16.51* | 18.11*** |
(Asterisks denote increasing levels of statistical significance.) BLEU gains with TWWT consistently exceed those of conventional tying and the baseline, with up to 2 BLEU improvement for the morphologically rich language pairs. Increasing the vocabulary size preserves TWWT's empirical benefit (up to 1 BLEU). Training throughput (5–6k tokens/sec) is comparable to the baselines, and larger $d_j$ introduces only modest slowdowns, largely mitigated by negative sampling. TWWT also remains robust across network depths and word-frequency bands.
6. Significance and Model Implications
TWWT ensures that parameter sharing captures richer semantic structure and preserves prior knowledge in both input representations and translation contexts. The explicit low-rank nonlinear joint space facilitates smooth interpolation between strongly regularized and fully expressive output layers, without architectural disruption. This not only leads to superior performance on several language pairs but also offers robustness across network depths and vocabulary settings. The approach demonstrates that three-way parameter sharing—across embeddings, output classifiers, and context projections—provides a powerful regularization tool and improves translation quality and model robustness, particularly for morphologically complex tasks (Pappas et al., 2018).