Tied Word Embedding
- Tied word embedding is a method that uses a single matrix for both input representations and output predictions in neural sequence models.
- It cuts parameter counts significantly, e.g., nearly 50% reduction in embedding parameters on Penn Treebank, leading to overall model size savings of 20–30%.
- The approach acts as a regularizer by enforcing symmetric gradient updates and shared semantic spaces, improving perplexity and BLEU scores in language tasks.
Tied word embedding refers to the practice of sharing parameters between the input word embedding matrix and the output word classifier (projection) matrix in neural sequence models, notably recurrent neural network language models (RNNLMs) and neural machine translation systems. Instead of learning independent matrices for encoding words and predicting their probabilities, tied embedding enforces a single matrix to perform both roles, motivated by theoretical, computational, and empirical insights. This approach offers substantial reductions in model parameters, imposes a shared semantic space for inputs and outputs, and—empirically—can improve perplexity and BLEU scores relative to untied baselines and prior state of the art (Inan et al., 2016; Press et al., 2016; Pappas et al., 2018).
1. Motivation and Theoretical Foundations
The conventional one-hot classification framework for language modeling represents a word in two places: (1) as an input embedding and (2) as an output classifier. In these models—such as standard RNNLMs or NMT decoders—at each time step $t$, the input word $w_t$ is embedded via $x_t = L^\top e_{w_t}$, with $L \in \mathbb{R}^{|V| \times d}$ and $e_{w_t}$ a one-hot vector, and the output logits are computed as $z_t = W h_t + b$ with $W \in \mathbb{R}^{|V| \times d}$ and $b \in \mathbb{R}^{|V|}$. The model is trained using cross-entropy loss against a one-hot target: $J_t = -\log \hat{y}_t(w_t^*)$, where $\hat{y}_t = \mathrm{softmax}(z_t)$ and $w_t^*$ is the target word.
This decoupled learning produces two inefficiencies:
- There is no metric among output classes; all targets are Dirac deltas, so word similarity is not leveraged.
- The input embedding and output classifier are learned independently, resulting in parameter redundancy and underutilization of shared statistical structure (Inan et al., 2016).
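The decoupled setup can be made concrete with a minimal NumPy sketch (toy sizes; `untied_step` and all names here are illustrative, not from the cited papers), in which the input embedding `L` and the output classifier `W` are entirely separate parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                                 # toy vocabulary and embedding sizes

L = rng.normal(scale=0.1, size=(V, d))      # input embedding matrix
W = rng.normal(scale=0.1, size=(V, d))      # independent output classifier
b = np.zeros(V)
U = rng.normal(scale=0.1, size=(d, 2 * d))  # toy recurrence weights

def untied_step(word_id, h_prev):
    """One step: embed the input word with L, score outputs with W."""
    x = L[word_id]                          # input role: row lookup in L
    h = np.tanh(U @ np.concatenate([x, h_prev]))
    z = W @ h + b                           # output role: separate matrix W
    p = np.exp(z - z.max())
    return h, p / p.sum()

h, p = untied_step(3, np.zeros(d))
n_params_embed = L.size + W.size            # 2*V*d parameters for the two matrices
```

Note that a row of `L` is updated only when its word appears as an input, and a row of `W` only through the output softmax; the two never share statistics.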
Tied embedding is motivated by the theoretical insight that, under certain idealized loss formulations, the column spaces of $L$ and $W$ are pressured to align. In particular, an augmented loss including a KL term between the model's output distribution and soft targets determined by the embedding space induces, in the high-temperature and zero-training-loss limit, the equivalence $W \propto L$ (equality up to a temperature-dependent scale factor). Imposing the constraint $W = L$, with $d_L = d_h$—termed "tying"—renders this relationship explicit (Inan et al., 2016).
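Schematically, the augmented objective of Inan et al. (2016) can be written as follows (the mixing coefficient $\alpha$ and temperature $\tau$ are hyperparameters; the exact weighting is a modeling choice):

```latex
% Soft target: similarity of every word embedding to the true word's embedding
\tilde{y}_t = \mathrm{softmax}\!\left(\frac{L\, l_{w_t^*}}{\tau}\right),
\qquad l_{w_t^*} = L^\top e_{w_t^*}
% Augmented objective: cross-entropy plus a KL term toward the soft target
J_t^{\mathrm{aug}} = J_t + \alpha\, D_{\mathrm{KL}}\!\left(\tilde{y}_t \,\middle\|\, \hat{y}_t\right)
```

Because $\tilde{y}_t$ is built from embedding similarities, minimizing the KL term forces the classifier rows to respect the geometry of the embedding space.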
2. Formalization of Tied Embedding
In the tied setting, a single matrix (denoted $L$ or $E$) plays the dual role of input embedding and output classifier. Formally, for vocabulary size $|V|$ and embedding dimension $d$, with $E \in \mathbb{R}^{|V| \times d}$:
- Input embedding: $x_t = E^\top e_t$, where $e_t \in \{0,1\}^{|V|}$ is a one-hot vector.
- Hidden representation: $h_t = f(x_t, h_{t-1})$.
- Output logits: $z_t = E h_t + b$.
- Output probabilities: $\hat{y}_t = \mathrm{softmax}(z_t)$.
- Loss: $J_t = -\log \hat{y}_t(w_t^*)$, for target word index $w_t^*$ (Press et al., 2016).
Gradient updates in the tied model are the sum of input and output embedding gradients for each matrix row. In contrast to the untied case, where each row is updated either as an input embedding or output classifier only, the tied scheme ensures every row receives both types of updates, thus evolving more similarly to the output embedding in the untied baseline.
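The summed update can be checked numerically. In this hedged NumPy sketch (a one-step toy model, not any paper's exact architecture), the analytic gradient of the tied matrix `E` is an output-side term touching all rows plus an input-side term touching only the input word's row, and a finite-difference probe confirms the sum:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
E = rng.normal(scale=0.1, size=(V, d))    # single tied embedding/classifier
W = rng.normal(scale=0.1, size=(d, d))    # toy recurrence weights
b = np.zeros(V)

def loss_and_grad(E, word_in, word_out):
    """One-step tied model: E embeds the input AND scores the output."""
    x = E[word_in]                         # input role of E
    h = np.tanh(W @ x)
    z = E @ h + b                          # output role of E
    p = np.exp(z - z.max()); p /= p.sum()
    loss = -np.log(p[word_out])

    dz = p.copy(); dz[word_out] -= 1.0     # d loss / d logits
    dE = np.outer(dz, h)                   # output-side gradient (all rows)
    dh = E.T @ dz
    dE[word_in] += W.T @ (dh * (1.0 - h**2))   # input-side gradient (one row)
    return loss, dE

loss, dE = loss_and_grad(E, word_in=2, word_out=5)

eps = 1e-6                                 # finite-difference check of the sum
E_pert = E.copy(); E_pert[2, 0] += eps
num_grad = (loss_and_grad(E_pert, 2, 5)[0] - loss) / eps
```

Dropping the `dE[word_in] += ...` line recovers the gradient the output classifier alone would see in the untied case.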
When input and output embedding dimensions differ ($d_L \neq d_h$), an explicit linear adapter $P \in \mathbb{R}^{d_L \times d_h}$ can be introduced, yielding output logits $z_t = E P h_t + b$ (Inan et al., 2016).
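The adapter variant can be sketched as follows (NumPy; `P` is the adapter, all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d_h, d_e = 6, 3, 5                        # hidden size != embedding size

E = rng.normal(scale=0.1, size=(V, d_e))     # shared embedding matrix
P = rng.normal(scale=0.1, size=(d_e, d_h))   # linear adapter: maps d_h -> d_e
b = np.zeros(V)

h = rng.normal(size=d_h)                     # decoder hidden state
logits = E @ (P @ h) + b                     # tied output through the adapter

# The adapter costs d_e * d_h parameters, still far below a separate
# |V| x d_h output classifier once |V| is realistically large.
adapter_params = P.size
separate_classifier_params = V * d_h
```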
3. Parameter Efficiency and Model Capacity
Tied embedding provides major parameter savings. When both input and output embeddings are of size $|V| \times d$, the untied model requires $2|V|d$ parameters for these matrices. Tying reduces this to $|V|d$. For Penn Treebank configurations with $|V| = 10\text{K}$ and $d = 650$, this equates to a reduction from 13M to 6.5M parameters in these two matrices—almost a 50% decrease. For large vocabularies and hidden sizes, this constitutes a dominant portion of total model parameters, cutting total model size by 20–30% in practice (Inan et al., 2016). Neural machine translation models also benefit: for instance, decoder parameter counts are reduced by more than half without harming BLEU performance (Press et al., 2016).
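The arithmetic behind these figures is easy to reproduce (the non-embedding parameter count below is an illustrative assumption, used only to show how the ~50% matrix-level cut maps to the 20–30% overall band):

```python
V, d = 10_000, 650          # Penn Treebank "large"-style configuration

untied = 2 * V * d          # separate input + output matrices: 13.0M
tied = V * d                # one shared matrix: 6.5M, a 50% cut here

# Illustrative assumption: if the rest of the model (recurrent weights,
# biases, etc.) holds ~13M parameters, the overall reduction is 25%.
rest = 13_000_000
overall_cut = (untied - tied) / (untied + rest)
```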
This parameter efficiency does not merely reduce storage and training time but acts as a form of regularization, restricting redundant degrees of freedom and encouraging the model to embed semantic relationships symmetrically for input and output.
4. Empirical Evaluation
The empirical impact of tied embedding has been evaluated across language modeling and translation tasks.
Language Modeling
In Penn Treebank experiments with a 2-layer LSTM and variational dropout:
- For the "large" setting, untied VD-LSTM achieves test perplexity of 72.6.
- Tying embeddings alone ("RE only") reduces test perplexity to 69.0.
- Using the augmented loss (AL) and tying together ("AL + RE") yields 68.5, improving substantially on Zaremba et al.'s 78.4 and competitive with the best comparable models (66.0) (Inan et al., 2016).
- On WikiText-2 and other corpora, similar relative gains and parameter reductions are observed (Inan et al., 2016, Press et al., 2016).
Neural Machine Translation
- On English→French and English→German tasks, BLEU scores with tied embedding (decoder or full three-way) match or improve upon untied models, while reducing model size from 168M to 122M and then to 80M parameters (Press et al., 2016).
- For morphologically rich targets such as English→Finnish and English→German, tied or joint (generalized) embeddings yield BLEU improvements up to +2 points and maintain or improve training speed under moderate joint space dimensioning (Pappas et al., 2018).
5. Limitations and Variants
While tied word embedding offers several advantages, it is subject to certain limitations:
- It requires that input and output embedding dimensions match ($d_L = d_h$); otherwise a linear adapter is necessary, which reduces but does not eliminate parameter savings (Inan et al., 2016).
- Hard tying enforces exact equality between input and output classifiers, which may be suboptimal; it prevents decoupling embedding capacity from decoder capacity and cannot explicitly capture structure among output types (Pappas et al., 2018).
- In small/no-dropout models, tied embedding may overfit; inserting and regularizing a projection matrix before the output can mitigate this effect. This is less relevant in high-dropout or large models (Press et al., 2016).
Generalizations of weight tying have been proposed. Learning a joint input–output embedding via nonlinear projections into a joint space decouples embedding dimension from decoder dimension and flexibly interpolates between the expressivity of the untied and tied cases. This approach—termed "structure-aware output layer"—explicitly encodes relationships among output types, improves BLEU on morphologically rich languages, and is robust to negative sampling on large vocabularies (Pappas et al., 2018). The joint model recovers classical weight tying as a special linear case.
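A minimal sketch of one plausible joint-space parameterization follows (hedged: the names and the specific `tanh` nonlinearity are illustrative, not the exact layer of Pappas et al., 2018):

```python
import numpy as np

rng = np.random.default_rng(3)
V, d_h, d_j = 8, 4, 6          # vocab, decoder hidden size, joint space size

E_j = rng.normal(scale=0.1, size=(V, d_j))   # word representations in joint space
U = rng.normal(scale=0.1, size=(d_j, d_h))   # projects hidden states into joint space
b = np.zeros(V)

def joint_logits(h):
    """Score words by similarity in a shared nonlinear joint space."""
    g = np.tanh(U @ h)         # nonlinear projection of the decoder state
    return E_j @ g + b

h = rng.normal(size=d_h)
z = joint_logits(h)
```

With $d_j = d_h$, `U` the identity, and a linear activation, this reduces to classical weight tying, which is the "special linear case" noted above; choosing $d_j$ independently is what decouples embedding capacity from decoder capacity.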
6. Theoretical and Practical Implications
Tied embedding operates as a theory-motivated regularizer, enforcing a shared semantic space and enabling gradients to flow symmetrically between input and output. The approach is agnostic to the choice of core architecture—for example, it applies equally to LSTM, GRU, and RHN—as well as other architectural techniques such as dropout, attention, and pointer mechanisms (Inan et al., 2016).
Empirically, tying delivers most of the perplexity reduction and stability gains in large models or data-rich scenarios, whereas the augmented loss component is especially beneficial in smaller or data-scarce regimes (Inan et al., 2016).
7. Extensions and Research Directions
Tied word embedding remains an active area of investigation. Potential extensions and open directions include:
- Adaptive or learned temperatures in the augmented loss formulation.
- Use of alternative distances—e.g., Wasserstein metrics—in place of KL divergence.
- Application to subword-level or character-based vocabularies in large-vocabulary settings (e.g., machine translation, speech).
- Combining tied embeddings with large pretrained representations (ELMo, BERT) for improved initialization or continual learning (Inan et al., 2016).
Joint input–output embedding approaches allow for direct parametric control over capacity, thus enabling practitioners to fine-tune the tradeoff between efficiency and expressiveness to match task demands (Pappas et al., 2018).
Table: Parameter Counts in Output Layer Variants
| Model Type | Parameter Count | Key Condition |
|---|---|---|
| Untied | $2\lvert V\rvert d$ | Classical (no sharing) |
| Tied (hard) | $\lvert V\rvert d$ | Requires $d_{\text{in}} = d_{\text{out}}$ |
| Joint (nonlinear) | $\approx \lvert V\rvert d_j$ plus projection terms | $d_j$ tunable (joint embedding dim) |
This quantifies the parameter savings and flexibility introduced by tying and joint embedding approaches (Inan et al., 2016, Press et al., 2016, Pappas et al., 2018).