
BiLSTM-CRF: Neural Sequence Tagging

Updated 7 March 2026
  • BiLSTM-CRF is a neural sequence labeling model that merges bidirectional LSTM encoding with CRF decoding to capture both context and structured label dependencies.
  • It supports diverse input features including word embeddings, character-level representations, and syntactic cues for robust performance on tasks such as NER, POS tagging, and chunking.
  • Empirical results across multiple languages reveal consistent F1 improvements, highlighting its effectiveness in handling structured prediction challenges.

A Bidirectional Long Short-Term Memory network with Conditional Random Fields (BiLSTM-CRF) is a neural sequence labeling architecture that integrates context-sensitive encoding from bidirectional LSTM layers with structured prediction via a linear-chain CRF output layer. This model class is state-of-the-art or close to state-of-the-art for named entity recognition (NER), part-of-speech (POS) tagging, chunking, and other sequence tagging tasks across numerous languages and evaluation settings (Huang et al., 2015, Kocoń et al., 2019, Ganesh et al., 13 Oct 2025, Tang et al., 2020, Hoesen et al., 2020, Ni et al., 2021, Yepes, 2018).

1. Architectural Components and Mathematical Formulation

The BiLSTM-CRF model consists of two principal stages: a bidirectional LSTM encoder and a linear-chain CRF decoder. The input sequence $(x_1, x_2, \dots, x_T)$ is typically embedded via pre-trained or learned word vectors, optionally augmented with additional character- or POS-based features.

Bidirectional LSTM Encoder:

At each time step $t$, an LSTM cell computes its gates and states as:

$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\ f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\ \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$

Two copies of the LSTM process the sequence in the forward and backward directions, producing $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, which are concatenated as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ to capture context from both directions (Huang et al., 2015, Kocoń et al., 2019, Ganesh et al., 13 Oct 2025, Ni et al., 2021, Yepes, 2018, Tang et al., 2020, Hoesen et al., 2020).
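The recurrence above can be sketched directly in NumPy. This is an illustrative, unoptimized implementation (random toy dimensions, no batching); the gate-stacking layout and helper names are choices of this sketch, not part of any cited system.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,).
    Gate order in the stacked matrices: input, forget, output, candidate."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sigmoid(z[0*H:1*H])            # input gate  i_t
    f = sigmoid(z[1*H:2*H])            # forget gate f_t
    o = sigmoid(z[2*H:3*H])            # output gate o_t
    c_tilde = np.tanh(z[3*H:4*H])      # candidate cell state
    c = f * c_prev + i * c_tilde       # c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h = o * np.tanh(c)                 # h_t = o_t ⊙ tanh(c_t)
    return h, c

def bilstm_encode(xs, params_fwd, params_bwd, H):
    """Run forward and backward LSTMs; concatenate per-step hidden states."""
    def run(seq, params):
        h, c, hs = np.zeros(H), np.zeros(H), []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, *params)
            hs.append(h)
        return hs
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]   # reverse back to input order
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
```

Each output vector has dimension $2H$, since the forward and backward states are concatenated.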

CRF Output Layer:

A linear "emission" projection generates per-label scores $P_t$ from the BiLSTM output at each time step. The CRF models the joint probability of the tag sequence $y = (y_1, \dots, y_T)$ via the score

$$s(x, y) = \sum_{t=1}^T \left( A_{y_{t-1}, y_t} + P_{t, y_t} \right)$$

where $A$ is a learned transition score matrix. The conditional probability is

$$P(y \mid x) = \frac{\exp(s(x, y))}{\sum_{y'} \exp(s(x, y'))} = \frac{\exp(s(x, y))}{Z(x)}$$

Training minimizes the negative log-likelihood of the gold tag sequence $y^*$:

$$\mathcal{L}(x, y^*) = -s(x, y^*) + \log Z(x)$$

Decoding is performed via the Viterbi algorithm to find the highest-scoring tag sequence; both the forward–backward and Viterbi algorithms run in $O(TK^2)$, where $T$ is the sequence length and $K$ the tag vocabulary size (Huang et al., 2015, Kocoń et al., 2019, Ni et al., 2021, Ganesh et al., 13 Oct 2025, Yepes, 2018, Tang et al., 2020, Hoesen et al., 2020).
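The score, partition function, and loss above can be sketched as follows, assuming a precomputed emission matrix `P` of shape `(T, K)` and transition matrix `A` of shape `(K, K)` (this minimal sketch omits the special start/stop states many implementations add):

```python
import numpy as np

def sequence_score(P, A, y):
    """s(x, y) = sum_t (A[y_{t-1}, y_t] + P[t, y_t]); the t = 1 term
    has no transition because this sketch uses no start state."""
    s = P[0, y[0]]
    for t in range(1, len(y)):
        s += A[y[t - 1], y[t]] + P[t, y[t]]
    return s

def log_partition(P, A):
    """log Z(x) via the forward algorithm in log space, O(T K^2)."""
    alpha = P[0].copy()                      # log-scores over tags at t = 0
    for t in range(1, P.shape[0]):
        scores = alpha[:, None] + A          # (K, K): previous tag -> next tag
        m = scores.max(axis=0)               # max-shift for numerical stability
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + P[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def neg_log_likelihood(P, A, y_star):
    """Training loss: L(x, y*) = -s(x, y*) + log Z(x)."""
    return -sequence_score(P, A, y_star) + log_partition(P, A)
```

Because $\log Z(x)$ sums over all $K^T$ tag sequences via dynamic programming, the loss is exact, not approximated, at $O(TK^2)$ cost.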

2. Feature Representations and Input Layer Choices

The BiLSTM-CRF framework admits a variety of input representations:

  • Word embeddings: Standard approaches include pretrained Word2Vec, GloVe, FastText, or domain-specific embeddings. For Polish, KGR10 300D FastText skip-gram vectors incorporating subword n-grams yield superior results, particularly for morphologically rich or low-resource languages (Kocoń et al., 2019).
  • Character-level features: Many variants employ a secondary character-level feature extractor, either bidirectional LSTM or CNN, feeding its output to the word-level encoder for capturing morphological or orthographic properties (Ganesh et al., 13 Oct 2025, Hoesen et al., 2020, Yepes, 2018).
  • POS-tag embeddings and syntactic features: Augmenting the input with learned POS-tag vectors (e.g., 25D for Indonesian) boosts performance, especially in languages with explicit syntactic marking (Hoesen et al., 2020).
  • Contextualized embeddings: Alternatives include ELMo, BERT, or other transformer-based representations, greatly increasing the input dimensionality $d$ (up to 768 or 1024) (Ni et al., 2021).

Empirical ablations demonstrate that the addition of subword or syntactic embeddings consistently improves recall and F₁, especially for rare words or boundary cases (Kocoń et al., 2019, Ganesh et al., 13 Oct 2025, Hoesen et al., 2020).
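A common way to combine these input representations is simple concatenation per token. The sketch below is purely illustrative: the dimensions (300-d word, 30-d character, 25-d POS) follow the figures cited above, but the vocabulary, the hash-based `char_features` stand-in (replacing a real char-BiLSTM/CNN), and all vectors are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, pos_tags = ["the", "Warsaw"], ["DET", "PROPN"]
word_emb = {w: rng.normal(size=300) for w in vocab}    # stand-in for pretrained vectors
pos_emb = {p: rng.normal(size=25) for p in pos_tags}   # learned POS-tag vectors

def char_features(word, dim=30):
    """Toy stand-in for a char-BiLSTM/CNN: a fixed-size bag of character hashes."""
    v = np.zeros(dim)
    for i, ch in enumerate(word):
        v[(i + ord(ch)) % dim] += 1.0
    return v

def input_vector(word, pos):
    """Concatenate word, character, and POS features into one model input."""
    return np.concatenate([word_emb[word], char_features(word), pos_emb[pos]])

x = input_vector("Warsaw", "PROPN")
assert x.shape == (355,)   # 300 + 30 + 25
```

The BiLSTM encoder then consumes these concatenated vectors; the character and POS components are what give the model traction on rare or morphologically complex words.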

3. Training Procedures, Regularization, and Optimization

BiLSTM-CRF models are trained end-to-end using minibatch stochastic gradient descent (SGD) or variants such as Adam, with backpropagation through time (BPTT) for the recurrent layers and gradient flow through the CRF using the forward–backward algorithm.

  • Hyperparameters: Reported best practices include BiLSTM hidden size per direction $H = 100$–$300$, dropout rates of 0.33–0.5, batch sizes between 10 and 50, and up to 300 training epochs depending on data scale and task (Huang et al., 2015, Ganesh et al., 13 Oct 2025, Tang et al., 2020, Hoesen et al., 2020, Kocoń et al., 2019, Ni et al., 2021).
  • Optimizers: Adam generally outperforms SGD, notably for NER in morphologically rich or low-resource settings (Tang et al., 2020). Learning rates and momentum settings vary, e.g., $\eta = 0.01$–$0.015$ for SGD, with gradient clipping at norm 5.0 (Ganesh et al., 13 Oct 2025).
  • Regularization: Confidence penalty (entropy regularization), annealed Gaussian gradient noise, and zoneout have been shown to further increase F₁ and generalization, especially on challenging datasets. For example, applying a confidence penalty and zoneout in Spanish NER raises F₁ to 87.18 (Yepes, 2018).
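The confidence penalty mentioned above rewards higher-entropy (less peaked) output distributions. A minimal sketch, with `beta` as an assumed penalty weight (the exact weighting in Yepes, 2018 may differ):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def confidence_penalty(logits, beta=0.1):
    """Entropy bonus added to the loss: loss += -beta * H(p).
    Negative because higher entropy (lower confidence) is rewarded."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum()
    return -beta * entropy

# Uniform logits give maximal entropy log(K), hence the most negative penalty.
K = 5
assert np.isclose(confidence_penalty(np.zeros(K), beta=1.0), -np.log(K))
```

In training, this term is summed with the CRF negative log-likelihood, discouraging the model from becoming overconfident on noisy or ambiguous tokens.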

4. Structured Decoding via Linear-Chain CRF

The conditional random field output layer enables global sequence-level scoring, enforcing valid transitions (e.g., discouraging illegal IOB/IOBES tag transitions) and leveraging sentence-level dependencies. The CRF layer is parameterized by the transition matrix $A$, and both the score function and partition function are efficiently computed as

$$s(x, y) = \sum_{t=1}^T \left( A_{y_{t-1}, y_t} + P_{t, y_t} \right), \qquad Z(x) = \sum_{y'} \exp(s(x, y'))$$

Inference is performed via Viterbi decoding, and all derivatives required for learning are available via dynamic programming (Huang et al., 2015, Ni et al., 2021, Ganesh et al., 13 Oct 2025, Kocoń et al., 2019, Yepes, 2018, Tang et al., 2020, Hoesen et al., 2020).
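Viterbi decoding over the same score function can be sketched as below; the toy example shows how a strongly negative transition score (a hand-set assumption here, standing in for a learned or constrained entry of $A$) overrides the per-token emission argmax:

```python
import numpy as np

def viterbi(P, A):
    """Highest-scoring tag path under s(x, y) = sum_t (A[y_{t-1}, y_t] + P[t, y_t]).
    P: (T, K) emission scores, A: (K, K) transition scores. O(T K^2)."""
    T, K = P.shape
    delta = P[0].copy()                  # best score ending in each tag at t = 0
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + A      # (K, K): previous tag -> current tag
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + P[t]
    y = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Tags: 0 = O, 1 = I-PER. Transition O -> I-PER effectively disallowed.
P = np.array([[2.0, 0.0],
              [0.0, 3.0]])              # per-token argmax would give [0, 1]
A = np.array([[0.0, -1e4],
              [0.0,  0.0]])
assert viterbi(P, A) == [1, 1]          # CRF revises the first tag to stay legal
```

This is the mechanism behind the empirical CRF-over-softmax gains: a greedy per-token decoder would emit the illegal pair `O, I-PER`, while the structured decoder trades a little emission score for a globally valid sequence.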

A key empirical result is the consistent improvement of the CRF output relative to softmax classifiers, especially for tasks where the label structure is highly constrained—notably, NER and chunking (Huang et al., 2015, Ni et al., 2021, Ganesh et al., 13 Oct 2025).

5. Empirical Performance and Applications

BiLSTM-CRF models outperform or match earlier neural and feature-based sequence taggers across POS tagging, chunking, and NER benchmarks.

Selected empirical results:

  • English CoNLL-2003 NER: BiLSTM-CRF F₁ = 90.10 (with SENNA embeddings and gazetteers); BiLSTM-CNN-CRF F₁ = 91.18 (GloVe + char-CNN, 2025 reproduction) (Huang et al., 2015, Ganesh et al., 13 Oct 2025).
  • Indonesian NER: Adding POS-tag embeddings yields +4–6 F₁ over the baseline; character-level and softmax/CRF comparisons confirm marginal but consistent gains for the CRF (Hoesen et al., 2020).
  • Chinese Judicial NER: BiLSTM-CRF with Adam produces F₁ = 0.855 vs. 0.813 (RMSProp) and 0.688 (GD) (Tang et al., 2020).
  • Polish Timex Tagging: KGR10 embeddings + BiLSTM-CRF achieve strict F₁ = 92.36%, a 3–5 point improvement over non-CRF or non-specialized embeddings (Kocoń et al., 2019).

These results confirm the robustness, cross-lingual applicability, and utility of BiLSTM-CRF for varied languages and entity types.

6. Extensions and Variations

The foundational BiLSTM-CRF architecture admits a number of extensions:

  • Deeper/stacked encoders and residual connections: Multi-layered and skip-connected architectures, as in (Yepes, 2018), further boost accuracy.
  • Character-level CNNs or LSTMs: Additional input features such as char-CNN, char-LSTM, or even language-model embeddings (Ganesh et al., 13 Oct 2025, Yepes, 2018, Hoesen et al., 2020).
  • Advanced regularizers: Confidence-penalty, annealed Gaussian noise, and zoneout regularize the model and avoid overfitting (Yepes, 2018).
  • New tagging schemes: Modifications for tasks like open relation extraction enable improved performance on overlapping labels by restructuring the label space (Ni et al., 2021).
  • Integration with contextualized embeddings: Incorporation of BERT, ELMo, or similar representations, significantly increases representational capacity for state-of-the-art sequence tagging (Ni et al., 2021).
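As a concrete instance of restructuring the label space, the IOB2-to-IOBES conversion used by several of the cited taggers can be sketched as below (this illustrates scheme conversion in general, not the specific open-relation-extraction scheme of Ni et al., 2021):

```python
def iob_to_iobes(tags):
    """Convert an IOB2 tag sequence to IOBES:
    S- for single-token entities, E- for entity-final tokens."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label          # entity keeps going?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

assert iob_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]) == \
       ["B-PER", "E-PER", "O", "S-LOC"]
```

The richer scheme gives the CRF transition matrix more structure to exploit: entity boundaries become explicit labels rather than implicit in tag changes.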

7. Technical Summary Table

| Study / Setting | Input Features | Hidden Size / Dropout | Optimizer / LR | Output | SOTA F₁ / Accuracy |
|---|---|---|---|---|---|
| (Huang et al., 2015) English POS/NER/Chunking | SENNA / random | 300, ~0 | SGD, η = 0.1 | Linear-CRF | 97.55 / 90.10 / 94.46 |
| (Ganesh et al., 13 Oct 2025) CoNLL NER | GloVe + char-CNN | 100/200, 0.5 | SGD, η = 0.015 | Linear-CRF | 91.18 (F₁) |
| (Kocoń et al., 2019) Polish Timex | KGR10 FastText | ~200, 0.5 (PolDeepNer) | Adam | Linear-CRF | 92.36 (strict F₁) |
| (Tang et al., 2020) Chinese Judicial NER | random / char-emb | 200, 0.5 | Adam, 0.001 | Linear-CRF | 0.855 (F₁) |
| (Hoesen et al., 2020) Indonesian NER w/ POS | W2V + char-LSTM + POS | 100, tanh layer | not specified | Linear-CRF | softmax+POS: ~83; CRF+POS: ~80 (F₁) |
| (Yepes, 2018) Spanish NER, regularized | char-LSTM + word-emb | 100 × 3, zoneout | SGD+momentum, 0.005 | Linear-CRF | 87.18 (F₁) |

References

  • (Huang et al., 2015) Huang et al., "Bidirectional LSTM-CRF Models for Sequence Tagging"
  • (Ganesh et al., 13 Oct 2025) Ganesh & Reddy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF"
  • (Kocoń et al., 2019) Kocoń & Gawor, "Evaluating KGR10 Polish word embeddings in the recognition of temporal expressions using BiLSTM-CRF"
  • (Tang et al., 2020) Li et al., "Recognizing Chinese Judicial Named Entity using BiLSTM-CRF"
  • (Hoesen et al., 2020) Rahmaningtyas et al., "Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger"
  • (Ni et al., 2021) Zhang et al., "Explore BiLSTM-CRF-Based Models for Open Relation Extraction"
  • (Yepes, 2018) Jimeno Yepes, "Confidence penalty, annealing Gaussian noise and zoneout for biLSTM-CRF networks for named entity recognition"
