Bidirectional Language Models (biLMs)

Updated 10 November 2025
  • Bidirectional language models (biLMs) are neural architectures that learn contextualized representations by conditioning on both left and right contexts.
  • They can be implemented using LSTM, Gated CNN, or Transformer architectures, each offering distinct trade-offs in speed, accuracy, and computational cost.
  • Layerwise analyses reveal that lower layers capture local syntactic features while upper layers effectively model long-range semantic dependencies.

A bidirectional language model (biLM) is a neural language modeling framework in which token representations are learned by conditioning simultaneously on both past (left) and future (right) context within a sentence or sequence. This contrasts with unidirectional LMs, which condition only on preceding (forward) or succeeding (backward) tokens. The introduction of biLMs marked a fundamental advance in the construction of contextualized word representations, enabling models to capture syntactic and semantic information that depends crucially on bidirectional context. The technical and methodological landscape of biLMs spans architectural considerations, layerwise representational analysis, downstream applications, and studies of their linguistic and cognitive fidelity.

1. Formal Definition and Training Objectives

A bidirectional language model (biLM) is trained to maximize the sum of the log-likelihoods of predicting each token conditioned on its left and right contexts. Let a sequence of tokens be $(w_1, \dots, w_T)$, and let $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ be the parameters of the forward and backward LMs, respectively. The biLM objective is:

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \Bigl[ \log p(w_t \mid w_{<t};\, \overrightarrow{\theta}) + \log p(w_t \mid w_{>t};\, \overleftarrow{\theta}) \Bigr].$$

In practice, both directions produce stacks of $L$ contextualized representations. At position $t$, the context-independent embedding is $\mathbf{x}_t$. For each contextual layer $l = 1, \dots, L$:

$$\overrightarrow{\mathbf{h}}^{(l)}_t = f_{\mathrm{forward}}\bigl(\overrightarrow{\mathbf{h}}^{(l-1)}_{<t}, \mathbf{x}_t\bigr), \qquad \overleftarrow{\mathbf{h}}^{(l)}_t = f_{\mathrm{backward}}\bigl(\overleftarrow{\mathbf{h}}^{(l-1)}_{>t}, \mathbf{x}_t\bigr),$$

with layer-specific functions determined by the model architecture (LSTM, Gated CNN, Transformer). The final context-sensitive representation at layer $l$ is the concatenation:

$$\mathbf{h}^{(l)}_t = \bigl[\overrightarrow{\mathbf{h}}^{(l)}_t;\; \overleftarrow{\mathbf{h}}^{(l)}_t\bigr].$$

Training may share the token and softmax embeddings across directions while maintaining independent contextual layers.
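
As a minimal sketch of this objective, the module below runs independent forward and backward LSTMs over a shared token embedding and sums the two directional cross-entropy losses. All names and sizes (BiLM, emb_dim, hidden_dim) are illustrative and not taken from any cited implementation.

```python
import torch
import torch.nn as nn


class BiLM(nn.Module):
    """Minimal biLM sketch: independent forward/backward LSTMs with a
    shared token embedding and a shared softmax projection."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # shared x_t
        self.fwd = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)             # shared softmax
        self.loss = nn.CrossEntropyLoss()

    def forward(self, tokens):                                    # tokens: (B, T)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd(x)                                    # conditions on w_{<=t}
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))              # run right-to-left
        h_bwd = torch.flip(h_bwd, dims=[1])                       # re-align to positions

        # Forward LM: the state at position t predicts w_{t+1} given w_{<t+1}.
        fwd_loss = self.loss(self.proj(h_fwd[:, :-1]).flatten(0, 1),
                             tokens[:, 1:].flatten())
        # Backward LM: the state at position t predicts w_{t-1} given w_{>t-1}.
        bwd_loss = self.loss(self.proj(h_bwd[:, 1:]).flatten(0, 1),
                             tokens[:, :-1].flatten())
        return fwd_loss + bwd_loss                                # summed objective
```

Using two separate LSTMs rather than a single bidirectional one keeps the two predictors independent, matching the objective above, in which only the token and softmax embeddings are shared across directions.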

2. Architectures for biLMs

Empirical studies systematically compare three principal neural architectures for biLMs: LSTM, Gated CNN, and Transformer (Peters et al., 2018). Each architecture instantiates the biLM objective differently:

  • LSTM biLM: Each direction is parameterized by a multilayer LSTM; at each position, representations are updated using learned input, forget, and output gates and a memory cell. Empirically, 2–4 layers are standard.
  • Gated CNN biLM: Contextualization is achieved via stacked gated 1D convolutions. Each “GLU” layer applies elementwise gating over convolutions and includes direct residual connections to stabilize training (see the sketch at the end of this section).
  • Transformer biLM: Stacked self-attention and feedforward sublayers provide contextualization; bidirectionality is obtained either from separate forward and backward Transformer LMs with causal masking or from masked-token objectives (e.g., BERT's [MASK] prediction) that avoid label leakage from both sides.

Speed and memory trade-offs are architecture-dependent: LSTM biLMs (especially deep ones, e.g., 4-layer) yield the highest accuracy but are 3–5× slower than Transformer or Gated CNN biLMs, which offer better inference efficiency, especially with batched inputs.

| Model | Layers | Perplexity | Params (M) | Inference time (1 sentence) | Inference time (64-sentence batch) |
|---|---|---|---|---|---|
| LSTM | 2 | 39.7 | 76 | 44 ms | 66 ms |
| LSTM | 4 | 37.5 | 151 | 85 ms | 102 ms |
| Transformer | 6 | 40.7 | 38 | 12 ms | 22 ms |
| Gated CNN | 16 | 44.5 | 67 | 9 ms | 29 ms |

All architectures substantially outperform GloVe static baselines across semantic tasks (Peters et al., 2018).
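
As a concrete illustration of the gated-convolution variant above, the sketch below implements one causal GLU layer with a residual connection; the kernel size, channel width, and module names are illustrative choices, not the cited configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalGLULayer(nn.Module):
    """One gated 1D-convolution (GLU) layer with a residual connection.
    Left-only padding keeps the convolution causal, so the layer sees only
    the left context; the backward direction mirrors this on reversed input."""

    def __init__(self, channels, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Produce 2*channels so that half of the outputs gate the other half.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x):                        # x: (B, channels, T)
        pad = self.kernel_size - 1
        h = self.conv(F.pad(x, (pad, 0)))        # pad on the left only -> causal
        h = F.glu(h, dim=1)                      # elementwise gating over the conv output
        return x + h                             # residual connection


# Example: y = CausalGLULayer(channels=64)(torch.randn(2, 64, 50))  # -> (2, 64, 50)
```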

3. Layerwise Representation and Linguistic Analysis

The representations learned in biLMs stratify linguistically with network depth (Peters et al., 2018, Liu et al., 2020):

  • Input Embedding Layer: Character-level encodings and static word embeddings capture mostly morphological features (e.g., syntactic word analogies); semantic analogies are not well separated at this level.
  • Lower Contextual Layers: Local syntax—part-of-speech and small-scale constituent information—peaks here. For a 4-layer LSTM biLM, POS tagging accuracy reaches 97.4% at the first layer versus 88.6% with non-contextual embeddings.
  • Middle Contextual Layers: Phrase/chunk syntax and basic compositionality are encoded robustly. Span-representation probes extract phrase-structural information effectively in these layers for all major architectures.
  • Upper Layers: Long-range semantic dependencies and discourse-level features are most salient—these layers provide the strongest coreference, semantic role labeling, and NLI performance.

Empirical probing demonstrates that most tasks benefit from a weighted mixture of layers (e.g., ELMo's softmax-normalized layer weights), often with task-optimal weights peaking two to four layers below the top of the network.
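
A minimal sketch of such a softmax-normalized layer mixture (an ELMo-style scalar mix) is shown below; the module is an illustrative reimplementation, not the reference code.

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Task-specific weighted combination of biLM layers:
    mix_t = gamma * sum_l softmax(s)_l * h_t^{(l)}."""

    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))   # s_l, learned per task
        self.gamma = nn.Parameter(torch.ones(1))                # global scale

    def forward(self, layer_states):
        # layer_states: list of (B, T, D) tensors, one per biLM layer.
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed


# Example with a hypothetical 3-layer biLM whose per-layer outputs are h0, h1, h2:
# representation = ScalarMix(num_layers=3)([h0, h1, h2])
```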

4. Applications and Practical Recommendations

biLM-derived contextualized embeddings have yielded state-of-the-art results across a wide range of NLP tasks (Liu et al., 2020):

  • Token-Level Tasks: POS tagging, chunking, named entity recognition, dependency parsing, where biLM contextualization dramatically increases accuracy, especially for ambiguous tokens and low-resource settings (Tu et al., 2017, Peters et al., 2018).
  • Phrase/Sentence-Level Tasks: Natural Language Inference, coreference resolution, semantic role labeling, which benefit from the long-range context encoded in upper layers (Peters et al., 2018).
  • Domain Adaptation and Cross-Lingual Transfer: Multilingual pretraining (e.g., mBERT, XLM) relies on biLM architectures to capture transferably aligned representations across 100+ languages (Liu et al., 2020).

Guidelines for selection (a layer-extraction sketch follows this list):

  • Syntax-focused tasks: Prefer lower/middle layers for extraction.
  • Semantics/discourse: Use upper Transformer or LSTM layers.
  • Deployment: Transformer/CNN biLMs may be preferable where speed dominates, though very large contextual models introduce significant computational and memory costs (Peters et al., 2018, Liu et al., 2020).
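
The sketch below illustrates these guidelines with the Hugging Face transformers API, using a masked-LM encoder as a stand-in for a biLM that exposes per-layer hidden states; the checkpoint name and the specific layer indices are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any bidirectional encoder exposing hidden states works.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

inputs = tokenizer("The old bridge spans the river.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (embedding layer + one tensor per contextual layer),
# each of shape (batch, seq_len, hidden_dim).
hidden_states = outputs.hidden_states

syntax_features = hidden_states[3]     # lower/middle layer, per the guidelines above
semantic_features = hidden_states[-2]  # upper layer for semantics/discourse tasks
print(syntax_features.shape, semantic_features.shape)
```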

5. Implementation and Integration into Downstream Models

biLMs typically provide representations in one of two settings (a minimal sketch of both appears at the end of this section):

  • Feature-based: Extracting contextual embeddings from pretrained, frozen biLMs and using them as input features to downstream models (e.g., as in ELMo-based pipelines). This approach is computationally efficient and enables direct layer selection (Peters et al., 2018).
  • Fine-tuning: Updating biLM parameters during supervised downstream training (as in BERT/Transformer-based models). This commonly yields improved task performance when sufficient labeled data are available.

For resource-limited or high-efficiency scenarios, biLMs can be distilled into lightweight, static, or compressed representations via knowledge distillation, pruning, or quantization (Liu et al., 2020).
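
A minimal sketch of the two integration settings, again using the transformers API; the checkpoint name, the linear task head, and the learning rates are illustrative placeholders.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")  # illustrative checkpoint
task_head = nn.Linear(encoder.config.hidden_size, 2)      # hypothetical classifier

# Feature-based: freeze the biLM and train only the downstream head.
for p in encoder.parameters():
    p.requires_grad = False
feature_optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)

# Fine-tuning: update biLM and head jointly, typically with a small learning rate.
for p in encoder.parameters():
    p.requires_grad = True
finetune_optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(task_head.parameters()), lr=2e-5)
```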

6. Linguistic and Cognitive Properties: Polysemy, Bias, Robustness

biLMs encode fine-grained word sense distinctions, capturing graded relatedness among senses and distinctions between polysemy and homonymy (Nair et al., 2020). Human judgments of sense relatedness are moderately correlated with distances in biLM embedding space (ρ ≈ 0.57 overall), and homonymous sense pairs are reliably more distant than polysemous pairs, in alignment with psychological data. However, biLMs are especially effective at separating unrelated senses, with nearly perfect discrimination for homonymous senses (F1 ≈ 0.99), but lower sensitivity (F1 ≈ 0.75) for closely related polysemous senses.
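
The kind of measurement behind these findings can be sketched as follows: embed the same word form in two sense-distinct contexts and compare the resulting contextual vectors with cosine similarity. The checkpoint, example sentences, and subword pooling below are illustrative assumptions rather than the protocol of the cited study.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"                       # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()


def word_embedding(sentence, word):
    """Mean of the contextual vectors of the subword pieces that make up `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]               # (seq_len, dim)
    pieces = tokenizer.tokenize(word)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    for i in range(len(tokens) - len(pieces) + 1):               # first occurrence
        if tokens[i:i + len(pieces)] == pieces:
            return hidden[i:i + len(pieces)].mean(dim=0)
    raise ValueError(f"{word!r} not found in tokenized sentence")


# Homonymy: unrelated senses of "bank" tend to be farther apart ...
homonymous = torch.cosine_similarity(
    word_embedding("She deposited the cash at the bank.", "bank"),
    word_embedding("They fished from the muddy bank of the river.", "bank"), dim=0)
# ... than closely related polysemous senses of "paper".
polysemous = torch.cosine_similarity(
    word_embedding("The paper was rejected by the reviewers.", "paper"),
    word_embedding("The paper was printed on recycled pulp.", "paper"), dim=0)
print(float(homonymous), float(polysemous))
```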

Layer selection, tokenization artifacts, and architecture influence both the representation and variance of senses in embedding space (Wang et al., 2022, Matthews et al., 2024). Position bias (especially for first or highly frequent tokens), polysemy, sentence length, and context window size all modulate representational consistency.

Empirical analyses also highlight sensitivity to orthographic noise: contextual representations generated by biLMs are highly sensitive to single-character changes, especially for words represented by a few subword tokens. For applications requiring robustness to typos or non-canonical orthography, character-level or tokenizer-free models should be considered (Matthews et al., 2024).
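
One tokenization-level source of this fragility can be seen directly: a single-character change can alter the subword segmentation, so the model contextualizes a different sequence of pieces. The checkpoint and example word are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative

# A one-character misspelling typically changes the subword segmentation,
# so the biLM receives (and contextualizes) a different piece sequence.
print(tokenizer.tokenize("definitely"))   # likely a single vocabulary piece
print(tokenizer.tokenize("definately"))   # likely split into several word pieces
```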

In sum, biLMs implement a flexible, architecture-agnostic paradigm for contextual representation learning. They have transformed core NLP tasks, enabled rich linguistic modeling, and provided a foundation for both practical systems and theoretical analysis of language understanding. Ongoing research addresses efficiency, interpretability, the incorporation of extralinguistic context, and the mitigation of social and cognitive biases to further improve the fidelity and utility of these models.
