Tied Word Embedding
- Tied word embedding is a method that uses a single matrix for both input representations and output predictions in neural sequence models.
- It cuts parameter counts significantly, e.g., nearly 50% reduction in embedding parameters on Penn Treebank, leading to overall model size savings of 20–30%.
- The approach acts as a regularizer by enforcing symmetric gradient updates and shared semantic spaces, improving perplexity and BLEU scores in language tasks.
Tied word embedding refers to the practice of sharing parameters between the input word embedding matrix and the output word classifier (projection) matrix in neural sequence models, notably recurrent neural network language models (RNNLMs) and neural machine translation systems. Instead of learning independent matrices for encoding words and predicting their probabilities, tied embedding enforces a single matrix to perform both roles, motivated by theoretical, computational, and empirical insights. This approach offers substantial reductions in model parameters, imposes a shared semantic space for inputs and outputs, and—empirically—can improve perplexity and BLEU scores relative to untied baselines and prior state of the art (Inan et al., 2016; Press et al., 2016; Pappas et al., 2018).
1. Motivation and Theoretical Foundations
The conventional one-hot classification framework for language modeling represents a word in two places: (1) as an input embedding and (2) as an output classifier. In these models—such as standard RNNLMs or NMT decoders—at each time step $t$, the input word $w_t$ is embedded via $x_t = L^\top e_{w_t}$, with $L \in \mathbb{R}^{|V| \times d}$ and $e_{w_t}$ a one-hot vector, and the output logits are computed as $z_t = W h_t + b$ with $W \in \mathbb{R}^{|V| \times d}$ and $b \in \mathbb{R}^{|V|}$. The model is trained using cross-entropy loss against a one-hot target: $J_t = -\log \hat{y}_t(w_t^*)$, where $\hat{y}_t = \mathrm{softmax}(z_t)$ and $w_t^*$ is the target word.
This decoupled learning produces two inefficiencies:
- There is no metric among output classes; all targets are Dirac deltas, so word similarity is not leveraged.
- The input embedding and output classifier are learned independently, resulting in parameter redundancy and underutilization of shared statistical structure (Inan et al., 2016).
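The decoupled setup can be made concrete with a minimal NumPy sketch (toy sizes; `untied_step` and all names here are illustrative, not from the cited papers), in which the input embedding `L` and the output classifier `W` are entirely separate parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                                 # toy vocabulary and embedding sizes

L = rng.normal(scale=0.1, size=(V, d))      # input embedding matrix
W = rng.normal(scale=0.1, size=(V, d))      # independent output classifier
b = np.zeros(V)
U = rng.normal(scale=0.1, size=(d, 2 * d))  # toy recurrence weights

def untied_step(word_id, h_prev):
    """One step: embed the input word with L, score outputs with W."""
    x = L[word_id]                          # input role: row lookup in L
    h = np.tanh(U @ np.concatenate([x, h_prev]))
    z = W @ h + b                           # output role: separate matrix W
    p = np.exp(z - z.max())
    return h, p / p.sum()

h, p = untied_step(3, np.zeros(d))
n_params_embed = L.size + W.size            # 2*V*d parameters for the two matrices
```

Note that a row of `L` is updated only when its word appears as an input, and a row of `W` only through the output softmax; the two never share statistics.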
Tied embedding is motivated by the theoretical insight that, under certain idealized loss formulations, the column spaces of $L$ and $W$ are pressured to align. In particular, an augmented loss including a KL term between the model's output distribution and soft targets determined by the embedding space induces, in the high-temperature and zero-training-loss limit, the equivalence $W \propto L$ (equality up to a temperature-dependent scale factor). Imposing the constraint $W = L$, with $d_L = d_h$—termed "tying"—renders this relationship explicit (Inan et al., 2016).
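Schematically, the augmented objective of Inan et al. (2016) can be written as follows (the mixing coefficient $\alpha$ and temperature $\tau$ are hyperparameters; the exact weighting is a modeling choice):

```latex
% Soft target: similarity of every word embedding to the true word's embedding
\tilde{y}_t = \mathrm{softmax}\!\left(\frac{L\, l_{w_t^*}}{\tau}\right),
\qquad l_{w_t^*} = L^\top e_{w_t^*}
% Augmented objective: cross-entropy plus a KL term toward the soft target
J_t^{\mathrm{aug}} = J_t + \alpha\, D_{\mathrm{KL}}\!\left(\tilde{y}_t \,\middle\|\, \hat{y}_t\right)
```

Because $\tilde{y}_t$ is built from embedding similarities, minimizing the KL term forces the classifier rows to respect the geometry of the embedding space.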
2. Formalization of Tied Embedding
In the tied setting, a single matrix (denoted $L$ or $E$) plays the dual role of input embedding and output classifier. Formally, for vocabulary size $|V|$ and embedding dimension $d$, with $E \in \mathbb{R}^{|V| \times d}$:
- Input embedding: $x_t = E^\top e_t$, where $e_t \in \{0,1\}^{|V|}$ is a one-hot vector.
- Hidden representation: $h_t = f(x_t, h_{t-1})$.
- Output logits: $z_t = E h_t + b$.
- Output probabilities: $\hat{y}_t = \mathrm{softmax}(z_t)$.
- Loss: $J_t = -\log \hat{y}_t(w_t^*)$, for target word index $w_t^*$ (Press et al., 2016).
Gradient updates in the tied model are the sum of input and output embedding gradients for each matrix row. In contrast to the untied case, where each row is updated either as an input embedding or output classifier only, the tied scheme ensures every row receives both types of updates, thus evolving more similarly to the output embedding in the untied baseline.
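The summed update can be checked numerically. In this hedged NumPy sketch (a one-step toy model, not any paper's exact architecture), the analytic gradient of the tied matrix `E` is an output-side term touching all rows plus an input-side term touching only the input word's row, and a finite-difference probe confirms the sum:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
E = rng.normal(scale=0.1, size=(V, d))    # single tied embedding/classifier
W = rng.normal(scale=0.1, size=(d, d))    # toy recurrence weights
b = np.zeros(V)

def loss_and_grad(E, word_in, word_out):
    """One-step tied model: E embeds the input AND scores the output."""
    x = E[word_in]                         # input role of E
    h = np.tanh(W @ x)
    z = E @ h + b                          # output role of E
    p = np.exp(z - z.max()); p /= p.sum()
    loss = -np.log(p[word_out])

    dz = p.copy(); dz[word_out] -= 1.0     # d loss / d logits
    dE = np.outer(dz, h)                   # output-side gradient (all rows)
    dh = E.T @ dz
    dE[word_in] += W.T @ (dh * (1.0 - h**2))   # input-side gradient (one row)
    return loss, dE

loss, dE = loss_and_grad(E, word_in=2, word_out=5)

eps = 1e-6                                 # finite-difference check of the sum
E_pert = E.copy(); E_pert[2, 0] += eps
num_grad = (loss_and_grad(E_pert, 2, 5)[0] - loss) / eps
```

Dropping the `dE[word_in] += ...` line recovers the gradient the output classifier alone would see in the untied case.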
When input and output embedding dimensions differ ($d_L \neq d_h$), an explicit linear adapter $P \in \mathbb{R}^{d_L \times d_h}$ can be introduced, yielding output logits $z_t = E P h_t + b$ (Inan et al., 2016).
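The adapter variant can be sketched as follows (NumPy; `P` is the adapter, all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d_h, d_e = 6, 3, 5                        # hidden size != embedding size

E = rng.normal(scale=0.1, size=(V, d_e))     # shared embedding matrix
P = rng.normal(scale=0.1, size=(d_e, d_h))   # linear adapter: maps d_h -> d_e
b = np.zeros(V)

h = rng.normal(size=d_h)                     # decoder hidden state
logits = E @ (P @ h) + b                     # tied output through the adapter

# The adapter costs d_e * d_h parameters, still far below a separate
# |V| x d_h output classifier once |V| is realistically large.
adapter_params = P.size
separate_classifier_params = V * d_h
```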
3. Parameter Efficiency and Model Capacity
Tied embedding provides major parameter savings. When both input and output embeddings are of size $|V| \times d$, the untied model requires $2|V|d$ parameters for these matrices. Tying reduces this to $|V|d$. For Penn Treebank configurations with $|V| = 10\text{K}$ and $d = 650$, this equates to a reduction from 13M to 6.5M parameters in these two matrices—almost a 50% decrease. For large vocabularies and hidden sizes, this constitutes a dominant portion of total model parameters, cutting total model size by 20–30% in practice (Inan et al., 2016). Neural machine translation models also benefit: for instance, decoder parameter counts are reduced by more than half without harming BLEU performance (Press et al., 2016).
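The arithmetic behind these figures is easy to reproduce (the non-embedding parameter count below is an illustrative assumption, used only to show how the ~50% matrix-level cut maps to the 20–30% overall band):

```python
V, d = 10_000, 650          # Penn Treebank "large"-style configuration

untied = 2 * V * d          # separate input + output matrices: 13.0M
tied = V * d                # one shared matrix: 6.5M, a 50% cut here

# Illustrative assumption: if the rest of the model (recurrent weights,
# biases, etc.) holds ~13M parameters, the overall reduction is 25%.
rest = 13_000_000
overall_cut = (untied - tied) / (untied + rest)
```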
This parameter efficiency does not merely reduce storage and training time but acts as a form of regularization, restricting redundant degrees of freedom and encouraging the model to embed semantic relationships symmetrically for input and output.
4. Empirical Evaluation
The empirical impact of tied embedding has been evaluated across language modeling and translation tasks.
Language Modeling
In Penn Treebank experiments with a 2-layer LSTM and variational dropout:
- For the "large" setting, untied VD-LSTM achieves test perplexity of 72.6.
- Tying embeddings alone ("RE only") reduces test perplexity to 69.0.
- Using the augmented loss (AL) and tying together ("AL + RE") yields 68.5, improving substantially on Zaremba et al.'s 78.4 and competitive with the best comparable models (66.0) (Inan et al., 2016).
- On WikiText-2 and other corpora, similar relative gains and parameter reductions are observed (Inan et al., 2016, Press et al., 2016).
Neural Machine Translation
- On English→French and English→German tasks, BLEU scores with tied embedding (decoder or full three-way) match or improve upon untied models, while reducing model size from 168M to 122M and then to 80M parameters (Press et al., 2016).
- For morphologically rich targets such as English→Finnish and English→German, tied or joint (generalized) embeddings yield BLEU improvements up to +2 points and maintain or improve training speed under moderate joint space dimensioning (Pappas et al., 2018).
5. Limitations and Variants
While tied word embedding offers several advantages, it is subject to certain limitations:
- It requires that input and output embedding dimensions match ($d_L = d_h$); otherwise a linear adapter is necessary, which reduces but does not eliminate parameter savings (Inan et al., 2016).
- Hard tying enforces exact equality between input and output classifiers, which may be suboptimal; it prevents decoupling embedding capacity from decoder capacity and cannot explicitly capture structure among output types (Pappas et al., 2018).
- In small/no-dropout models, tied embedding may overfit; inserting and regularizing a projection matrix before the output can mitigate this effect. This is less relevant in high-dropout or large models (Press et al., 2016).
Generalizations of weight tying have been proposed. Learning a joint input–output embedding via nonlinear projections into a joint space decouples embedding dimension from decoder dimension and flexibly interpolates between the expressivity of the untied and tied cases. This approach—termed "structure-aware output layer"—explicitly encodes relationships among output types, improves BLEU on morphologically rich languages, and is robust to negative sampling on large vocabularies (Pappas et al., 2018). The joint model recovers classical weight tying as a special linear case.
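A minimal sketch of one plausible joint-space parameterization follows (hedged: the names and the specific `tanh` nonlinearity are illustrative, not the exact layer of Pappas et al., 2018):

```python
import numpy as np

rng = np.random.default_rng(3)
V, d_h, d_j = 8, 4, 6          # vocab, decoder hidden size, joint space size

E_j = rng.normal(scale=0.1, size=(V, d_j))   # word representations in joint space
U = rng.normal(scale=0.1, size=(d_j, d_h))   # projects hidden states into joint space
b = np.zeros(V)

def joint_logits(h):
    """Score words by similarity in a shared nonlinear joint space."""
    g = np.tanh(U @ h)         # nonlinear projection of the decoder state
    return E_j @ g + b

h = rng.normal(size=d_h)
z = joint_logits(h)
```

With $d_j = d_h$, `U` the identity, and a linear activation, this reduces to classical weight tying, which is the "special linear case" noted above; choosing $d_j$ independently is what decouples embedding capacity from decoder capacity.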
6. Theoretical and Practical Implications
Tied embedding operates as a theory-motivated regularizer, enforcing a shared semantic space and enabling gradients to flow symmetrically between input and output. The approach is agnostic to the choice of core architecture—for example, it applies equally to LSTM, GRU, and RHN—as well as other architectural techniques such as dropout, attention, and pointer mechanisms (Inan et al., 2016).
Empirically, tying delivers most of the perplexity reduction and stability gains in large models or data-rich scenarios, whereas the augmented loss component is especially beneficial in smaller or data-scarce regimes (Inan et al., 2016).
7. Extensions and Research Directions
Tied word embedding remains an active area of investigation. Potential extensions and open directions include:
- Adaptive or learned temperatures in the augmented loss formulation.
- Use of alternative distances—e.g., Wasserstein metrics—in place of KL divergence.
- Application to subword-level or character-based vocabularies in large-vocabulary settings (e.g., machine translation, speech).
- Combining tied embeddings with large pretrained representations (ELMo, BERT) for improved initialization or continual learning (Inan et al., 2016).
Joint input–output embedding approaches allow for direct parametric control over capacity, thus enabling practitioners to fine-tune the tradeoff between efficiency and expressiveness to match task demands (Pappas et al., 2018).
Table: Parameter Counts in Output Layer Variants
| Model Type | Parameter Count | Key Condition |
|---|---|---|
| Untied | $2\lvert V\rvert d$ | Classical (no sharing) |
| Tied (hard) | $\lvert V\rvert d$ | Requires $d_{\text{in}} = d_{\text{out}}$ |
| Joint (nonlinear) | $\approx \lvert V\rvert d_j$ plus projection terms | $d_j$ tunable (joint embedding dim) |
This quantifies the parameter savings and flexibility introduced by tying and joint embedding approaches (Inan et al., 2016, Press et al., 2016, Pappas et al., 2018).