Three-Way Weight Tying (TWWT) in NMT

Updated 27 January 2026
  • Three-Way Weight Tying (TWWT) is a parameter-sharing strategy that integrates input embeddings, output classifiers, and decoder-context projections into a joint embedding space.
  • It enhances translation performance, especially for morphologically rich languages, by enabling flexible control over the output-layer capacity.
  • Empirical evaluations demonstrate consistent BLEU improvements over baseline models while ensuring robustness across different network depths and vocabulary sizes.

Three-way weight tying (TWWT) is a parameter-sharing strategy introduced for neural machine translation (NMT) models, specifically enhancing the standard attention-based encoder–decoder architecture. TWWT generalizes conventional weight tying (where input embeddings and output classifiers share parameters) by introducing a joint input–output embedding: it ties not only the input embeddings and output classifiers but also the decoder-context projection into a shared parameter space. The resulting structure-aware output layer enables explicit control over model capacity and demonstrates superior empirical performance in translation tasks, notably for morphologically rich languages (Pappas et al., 2018).

1. Standard Attention-Based NMT and Weight Tying

In the baseline NMT setup, the encoder transforms source word indices $x_1 \dots x_m$ into embedding vectors via $E \in \mathbb{R}^{|V| \times d}$, producing encoder hidden states $h^e_1, \dots, h^e_m$ using LSTMs or bi-LSTMs. At each decoding step $t$, the decoder LSTM state $h_t \in \mathbb{R}^{d_h}$ is computed using the embedding of the previous target token and the prior attention context $c_{t-1}$. The attention mechanism computes context vectors as weighted sums over encoder states, with weights determined by

$$\alpha_{t i} = \mathrm{softmax}_i(h_t^\top W_a h^e_i)$$

and

$$c_t = \sum_i \alpha_{t i} h^e_i.$$

The decoder output $h_t$ is projected through a softmax-linear layer,

$$p(y_t \mid y_{1:t-1}, X) \propto \exp(W^\top h_t + b),$$

with $W \in \mathbb{R}^{d_h \times |V|}$ and $b \in \mathbb{R}^{|V|}$. In conventional weight tying, $W = E^\top$ enforces equality between input embeddings and output classifiers, improving efficiency and often translation quality.
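The baseline attention step and tied softmax above can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation; all sizes and random values are placeholders.

```python
import numpy as np

# Sketch (toy sizes, random weights) of the baseline attention computation
# and a tied softmax output layer whose classifier reuses the embeddings E.
rng = np.random.default_rng(0)
Vsz, d = 100, 8                      # toy vocabulary and embedding/hidden size
m = 5                                # source sentence length
E = rng.normal(size=(Vsz, d)) * 0.1  # input embedding matrix, one row per word
W_a = rng.normal(size=(d, d)) * 0.1  # bilinear attention form
b = np.zeros(Vsz)                    # output bias

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

H_enc = rng.normal(size=(m, d))      # stand-in encoder states h^e_1..h^e_m
h_t = rng.normal(size=d)             # stand-in decoder LSTM state

alpha = softmax(H_enc @ W_a.T @ h_t) # alpha_ti = softmax_i(h_t^T W_a h^e_i)
c_t = alpha @ H_enc                  # c_t = sum_i alpha_ti h^e_i
p = softmax(E @ h_t + b)             # tied output: W = E^T, so logits = E h_t + b
```

With tying, the output layer adds no parameters beyond the bias $b$, which is the efficiency gain the text refers to.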

2. Joint Input–Output Embedding Formulation

TWWT replaces the linear output projection with a nonlinear joint embedding approach. Each target word embedding and decoder hidden state is nonlinearly projected into a shared $d_j$-dimensional joint space. Specifically, for embedding $e_j$ and decoder state $h_t$,

$$e'_j = g_{\mathrm{out}}(e_j) = \sigma(U e_j + b_u) \in \mathbb{R}^{d_j}$$

$$h'_t = g_{\mathrm{inp}}(h_t) = \sigma(V h_t + b_v) \in \mathbb{R}^{d_j}$$

with $U \in \mathbb{R}^{d_j \times d}$, $b_u \in \mathbb{R}^{d_j}$, $V \in \mathbb{R}^{d_j \times d_h}$, and $b_v \in \mathbb{R}^{d_j}$; $\sigma(\cdot)$ is a nonlinearity (tanh in the empirical studies). The score for candidate $j$ at position $t$ is

$$z_{t j} = e'_j{}^\top h'_t + b_j$$

where $b \in \mathbb{R}^{|V|}$. Prediction probabilities are then

$$p(y_t = j \mid \cdot) = \frac{\exp(z_{t j})}{\sum_k \exp(z_{t k})}.$$

In matrix notation, stacking the rows $e'_j{}^\top$ gives $E' = \sigma(E U^\top + \mathbf{1} b_u^\top) \in \mathbb{R}^{|V| \times d_j}$; with $h'_t = \sigma(V h_t + b_v)$, the unnormalized output is $\exp(E' h'_t + b)$.
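A minimal NumPy sketch of this joint input-output scoring (untied $U$, $V$ variant), following the definitions above; matrix names mirror the text ($V$ is written `Vproj` to avoid clashing with the vocabulary size), and all sizes are toy values.

```python
import numpy as np

# Sketch of the joint input-output embedding output layer (toy sizes).
rng = np.random.default_rng(1)
Vsz, d, d_h, d_j = 100, 8, 8, 16
E = rng.normal(size=(Vsz, d)) * 0.1       # input embedding matrix
U = rng.normal(size=(d_j, d)) * 0.1       # output-embedding projection U
Vproj = rng.normal(size=(d_j, d_h)) * 0.1 # decoder-context projection V
bu, bv, b = np.zeros(d_j), np.zeros(d_j), np.zeros(Vsz)

def joint_distribution(h_t):
    Eprime = np.tanh(E @ U.T + bu)    # rows e'_j = tanh(U e_j + b_u), |V| x d_j
    hprime = np.tanh(Vproj @ h_t + bv)  # h'_t in the d_j joint space
    z = Eprime @ hprime + b           # scores z_tj = e'_j . h'_t + b_j
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()                # softmax over the vocabulary

p = joint_distribution(rng.normal(size=d_h))
```

Because both words and decoder states land in the same $d_j$-dimensional space, the final score is just a dot product, exactly as in the bilinear form above.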

3. Parameter Tying in TWWT

TWWT introduces a three-way tying among:

  • The input embedding parameters $E$
  • The output classifier projection $U$
  • The decoder-context projection $V$

These can be tied to a single shared matrix $W^* \in \mathbb{R}^{d_j \times d^*}$ (with $d^* = d = d_h$) and bias $b^*$:

$$U = V = W^*, \quad b_u = b_v = b^*$$

yielding

$$e'_j = \sigma(W^* e_j + b^*), \quad h'_t = \sigma(W^* h_t + b^*), \quad z_{t j} = e'_j{}^\top h'_t + b_j.$$

This structured parameter sharing governs not only the input and output mappings but also the transformation of the decoder's context. Additionally, optional residual or gating components can be added to relax the hard tie, for example

$$e'_j = \sigma(W^* e_j + b^*) \odot g_e, \quad h'_t = \sigma(W^* h_t + b^*) \odot g_h$$

with learned gates $g_e, g_h \in \mathbb{R}^{d_j}$.
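The fully tied case can be sketched by reusing one shared matrix for both projections. This toy NumPy version (random weights, gates initialized open) assumes $d = d_h$ so a single $W^*$ fits both inputs.

```python
import numpy as np

# Sketch of three-way tying: one shared projection W* (and bias b*) maps both
# word embeddings and decoder states into the joint space; gates g_e, g_h
# optionally relax the hard tie. Toy sizes, random weights.
rng = np.random.default_rng(2)
Vsz, d, d_j = 100, 8, 16          # assumes d = d_h so W* fits both inputs
E = rng.normal(size=(Vsz, d)) * 0.1
Wstar = rng.normal(size=(d_j, d)) * 0.1  # the single shared matrix W*
bstar = np.zeros(d_j)                    # shared bias b*
b = np.zeros(Vsz)                        # per-class output bias
g_e = np.ones(d_j)                       # gates start fully open (no effect)
g_h = np.ones(d_j)

def twwt_scores(h_t):
    Eprime = np.tanh(E @ Wstar.T + bstar) * g_e  # e'_j = sigma(W* e_j + b*) . g_e
    hprime = np.tanh(Wstar @ h_t + bstar) * g_h  # h'_t = sigma(W* h_t + b*) . g_h
    return Eprime @ hprime + b                   # z_tj = e'_j . h'_t + b_j

z = twwt_scores(rng.normal(size=d))
```

Compared with the untied variant, the only trainable pieces here are $E$, $W^*$, $b^*$, $b$, and the gates, which is where the parameter savings come from.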

4. Control of Output-Layer Capacity

TWWT allows flexible capacity adjustment through the joint-space dimensionality $d_j$, interpolating from low-capacity, tightly regularized models to high-capacity models akin to an unrestricted softmax layer. The number of parameters in the joint output layer is

$$|\Theta_{\text{joint}}| = d \cdot d_j + d_j \cdot d_h + |V|$$

where $|V|$ is the vocabulary size. Varying $d_j$ thus trades off model compactness against expressivity, with the capacity regimes ordered as $C_{\text{tied}} < C_{\text{bilinear}} \leq C_{\text{joint}} \leq C_{\text{base}}$. Notably, adjusting $d_j$ does not require re-architecting the overall network.
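Plugging in the sizes used later in the experiments ($d = d_h = 512$, $|V| = 32\text{K}$) makes the trade-off concrete. A quick arithmetic check:

```python
# Parameter count of the joint output layer, |Theta_joint| = d*d_j + d_j*d_h + |V|,
# compared against an untied softmax layer with d_h*|V| + |V| parameters.
def joint_output_params(d, d_h, d_j, vocab):
    return d * d_j + d_j * d_h + vocab

d, d_h, vocab = 512, 512, 32_000
untied = d_h * vocab + vocab              # 16_416_000 for these sizes
counts = {d_j: joint_output_params(d, d_h, d_j, vocab)
          for d_j in (512, 2048, 4096)}
# d_j = 512  -> 556_288 parameters (about 3% of the untied layer)
# d_j = 4096 -> 4_226_304 parameters (still about a quarter of it)
```

Even the largest joint space tried stays well below the untied softmax layer's size, which is why $d_j$ can be raised for capacity without losing the regularization benefit entirely.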

5. Empirical Evaluation and Implementation

Experiments evaluate TWWT in English–Finnish and English–German translation. Empirical setup includes:

  • Vocabulary $|V| \in \{32\text{K}, 64\text{K}, 128\text{K}\}$ via BPE.
  • Embedding size $d = 512$, decoder LSTM hidden size $d_h = 512$.
  • Joint-space dimensions $d_j \in \{512, 2048, 4096\}$, selected on development sets.
  • Stacked LSTMs: 2 layers baseline, with additional experiments at 1, 4, and 8 layers.
  • Dropout $p = 0.3$ after LSTM layers, Adam optimizer ($\mathrm{lr} = 0.001$), negative sampling on large vocabularies (25% of classes for $|V| \leq 64\text{K}$, 75% for $128\text{K}$).
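The negative-sampling step in the setup above can be sketched as a softmax restricted to the target class plus a sampled fraction of the vocabulary. The exact sampling scheme is not specified here, so this hypothetical NumPy version simply draws negatives uniformly; names and sizes are toy assumptions.

```python
import numpy as np

# Hypothetical sketch of a sampled softmax loss: score only the true class
# plus a uniformly sampled fraction of the vocabulary (here 25%), as a cheap
# stand-in for the full |V|-way softmax on large vocabularies.
rng = np.random.default_rng(3)
Vsz, d_j = 1000, 16
Eprime = rng.normal(size=(Vsz, d_j))   # precomputed joint-space class embeddings
hprime = rng.normal(size=d_j)          # joint-space decoder state

def sampled_softmax_loss(target, frac=0.25):
    k = int(frac * Vsz)
    neg = rng.choice(Vsz, size=k, replace=False)          # sampled classes
    idx = np.unique(np.concatenate(([target], neg)))      # ensure target is in
    z = Eprime[idx] @ hprime                              # scores on the subset
    z -= z.max()
    logp = z - np.log(np.exp(z).sum())                    # log-softmax on subset
    return -logp[np.nonzero(idx == target)[0][0]]         # NLL of the true class

loss = sampled_softmax_loss(target=7)
```

At inference time the full vocabulary is scored as usual; sampling only reduces the per-step training cost of the output layer.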

Translation quality (BLEU) improvements are summarized below for $|V| = 32\,000$ BPE:

Task             Baseline  Weight-Tied  Joint/TWWT
English→Finnish  12.68     12.58        13.03*
Finnish→English   9.42      9.59        10.19*
English→German   18.46     18.48        19.79*
German→English   15.85     16.51*       18.11***

(* $p<.05$, ** $p<.001$). BLEU gains with TWWT consistently exceed those of conventional tying and baseline models, with up to 2 BLEU improvement for morphologically rich languages. The empirical benefit persists as vocabulary size grows (up to 1 BLEU). Training throughput (5–6k tokens/sec for $d_j = 512$) is comparable to baselines, and higher $d_j$ introduces only modest slowdowns, largely mitigated by negative sampling. Robustness across network depths and word-frequency bands is also superior.

6. Significance and Model Implications

TWWT ensures that parameter sharing captures richer semantic structure and preserves prior knowledge in both input representations and translation contexts. The explicit low-rank nonlinear joint space facilitates smooth interpolation between strongly regularized and fully expressive output layers, without architectural disruption. This not only leads to superior performance on several language pairs but also offers robustness across network depths and vocabulary settings. The approach demonstrates that three-way parameter sharing—across embeddings, output classifiers, and context projections—provides a powerful regularization tool and improves translation quality and model robustness, particularly for morphologically complex tasks (Pappas et al., 2018).

References (1)