BiLSTM-CRF: Neural Sequence Tagging
- BiLSTM-CRF is a neural sequence labeling model that merges bidirectional LSTM encoding with CRF decoding to capture both context and structured label dependencies.
- It supports diverse input features including word embeddings, character-level representations, and syntactic cues for robust performance on tasks such as NER, POS tagging, and chunking.
- Empirical results across multiple languages reveal consistent F₁ improvements, highlighting its effectiveness in handling structured prediction challenges.
A Bidirectional Long Short-Term Memory network with Conditional Random Fields (BiLSTM-CRF) is a neural sequence labeling architecture that integrates context-sensitive encoding from bidirectional LSTM layers with structured prediction via a linear-chain CRF output layer. This model class is state-of-the-art or close to state-of-the-art for named entity recognition (NER), part-of-speech (POS) tagging, chunking, and other sequence tagging tasks across numerous languages and evaluation settings (Huang et al., 2015, Kocoń et al., 2019, Ganesh et al., 13 Oct 2025, Tang et al., 2020, Hoesen et al., 2020, Ni et al., 2021, Yepes, 2018).
1. Architectural Components and Mathematical Formulation
The BiLSTM-CRF model consists of two principal stages: a bidirectional LSTM encoder and a linear-chain CRF decoder. The model input sequence is typically embedded via pre-trained or learned word vectors, optionally augmented with additional character- or POS-based features.
Bidirectional LSTM Encoder:
At each time step $t$, an LSTM cell computes its gates and states as
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \qquad f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f),$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \qquad h_t = o_t \odot \tanh(c_t),$$
where $\sigma$ is the logistic sigmoid and $\odot$ denotes elementwise multiplication. Two copies of the LSTM process the sequence in the forward and backward directions, producing $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, which are concatenated as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ to capture context from both directions (Huang et al., 2015, Kocoń et al., 2019, Ganesh et al., 13 Oct 2025, Ni et al., 2021, Yepes, 2018, Tang et al., 2020, Hoesen et al., 2020).
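The encoder can be sketched in a few lines of NumPy. This is a minimal illustration, not any of the cited papers' implementations: the parameter shapes, the stacked-gate layout, and the helper names (`lstm_step`, `bilstm_encode`) are assumptions made for compactness.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. Gate pre-activations are stacked as [input; forget; cell; output]."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                 # pre-activations, shape (4d,)
    i = 1.0 / (1.0 + np.exp(-z[:d]))           # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2 * d]))      # forget gate
    g = np.tanh(z[2 * d:3 * d])                # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3 * d:]))       # output gate
    c = f * c_prev + i * g                     # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

def bilstm_encode(xs, params_fwd, params_bwd, d):
    """Run one LSTM left-to-right, one right-to-left, and concatenate per step."""
    def run(seq, params):
        h, c = np.zeros(d), np.zeros(d)
        out = []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    hs_f = run(xs, params_fwd)
    hs_b = run(xs[::-1], params_bwd)[::-1]     # re-reverse so indices align
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]
```

Each output vector has dimension $2d$, so the downstream emission projection sees both left and right context at every position.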
CRF Output Layer:
A linear "emission" projection generates per-label scores $P_{t,y} = (W_e h_t + b_e)_y$ from the BiLSTM output $h_t$ at each time step. The CRF scores a tag sequence $y = (y_1, \dots, y_n)$ jointly using:
$$s(x, y) = \sum_{t=1}^{n} P_{t, y_t} + \sum_{t=1}^{n-1} A_{y_t, y_{t+1}},$$
where $A$ is a learned transition score matrix. The conditional probability and partition function are:
$$p(y \mid x) = \frac{\exp(s(x, y))}{Z(x)}, \qquad Z(x) = \sum_{y'} \exp(s(x, y')).$$
Training maximizes the log-likelihood of the correct tag sequence: $\log p(y \mid x) = s(x, y) - \log Z(x)$. Decoding is performed via the Viterbi algorithm to find the highest-scoring tag sequence, and both the forward–backward and Viterbi algorithms run in $O(nk^2)$, where $n$ is the sequence length and $k$ the tag vocabulary size (Huang et al., 2015, Kocoń et al., 2019, Ni et al., 2021, Ganesh et al., 13 Oct 2025, Yepes, 2018, Tang et al., 2020, Hoesen et al., 2020).
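Viterbi decoding under this scoring function is a short dynamic program. The sketch below is a generic $O(nk^2)$ implementation against the score $s(x,y)$ above, with the tag-at-position-1 prior folded into the first emission row; start/stop transition scores, which some of the cited systems also learn, are omitted for brevity.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions:   (n, k) array, emissions[t, y] = P_{t,y}
    transitions: (k, k) array, transitions[y_prev, y] = A_{y_prev, y}
    Returns (best_path, best_score).
    """
    n, k = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag at t=0
    backptr = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # total[y_prev, y] = best score ending in y_prev + transition y_prev -> y
        total = score[:, None] + transitions
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0) + emissions[t]
    # Follow back-pointers from the best final tag.
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())
```

Because the max and argmax are vectorized over tag pairs, the per-step cost is a single $k \times k$ operation, matching the stated $O(nk^2)$ complexity.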
2. Feature Representations and Input Layer Choices
The BiLSTM-CRF framework admits a variety of input representations:
- Word embeddings: Standard approaches include pretrained Word2Vec, GloVe, FastText, or domain-specific embeddings. For Polish, KGR10 300D FastText skip-gram vectors incorporating subword n-grams yield superior results, particularly for morphologically rich or low-resource languages (Kocoń et al., 2019).
- Character-level features: Many variants employ a secondary character-level feature extractor, either bidirectional LSTM or CNN, feeding its output to the word-level encoder for capturing morphological or orthographic properties (Ganesh et al., 13 Oct 2025, Hoesen et al., 2020, Yepes, 2018).
- POS-tag embeddings and syntactic features: Augmenting the input with learned POS-tag vectors (e.g., 25D for Indonesian) boosts performance, especially in languages with explicit syntactic marking (Hoesen et al., 2020).
- Contextualized embeddings: Alternatives include ELMo, BERT, or other transformer-based representations, which greatly increase the input dimensionality (up to 768 or 1024) (Ni et al., 2021).
Empirical ablations demonstrate that the addition of subword or syntactic embeddings consistently improves recall and F₁, especially for rare words or boundary cases (Kocoń et al., 2019, Ganesh et al., 13 Oct 2025, Hoesen et al., 2020).
3. Training Procedures, Regularization, and Optimization
BiLSTM-CRF models are trained end-to-end using minibatch stochastic gradient descent (SGD) or variants such as Adam, with backpropagation through time (BPTT) for the recurrent layers and gradient flow through the CRF using the forward–backward algorithm.
- Hyperparameters: Reported best practices include BiLSTM hidden sizes of $100$–$300$ per direction, dropout rates of $0.33$–$0.5$, batch sizes of $10$–$50$, and up to $300$ training epochs depending on data scale and task (Huang et al., 2015, Ganesh et al., 13 Oct 2025, Tang et al., 2020, Hoesen et al., 2020, Kocoń et al., 2019, Ni et al., 2021).
- Optimizers: Adam generally outperforms SGD, notably for NER in morphologically rich or low-resource settings (Tang et al., 2020). Learning rates and momentum settings vary, e.g., $0.005$–$0.015$ for SGD, with gradient clipping at norm $5.0$ (Ganesh et al., 13 Oct 2025).
- Regularization: Confidence penalty (entropy regularization), annealed Gaussian gradient noise, and zoneout have been shown to further increase F₁ and generalization, especially on challenging datasets. For example, applying a confidence penalty and zoneout in Spanish NER raises F₁ over the unregularized baseline to $87.18$ (Yepes, 2018).
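The training loss itself is the negative log-likelihood $\log Z(x) - s(x, y)$, with $\log Z(x)$ computed by the forward algorithm in log space. A minimal NumPy sketch, matching the notation of Section 1 (start/stop transitions again omitted):

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of a gold tag sequence under a linear-chain CRF.

    emissions:   (n, k) per-step label scores P
    transitions: (k, k) transition matrix A
    tags:        (n,) gold tag indices
    """
    n, k = emissions.shape
    # Score of the gold path: emissions plus pairwise transitions.
    gold = emissions[np.arange(n), tags].sum()
    gold += transitions[tags[:-1], tags[1:]].sum()
    # Forward algorithm: alpha[y] = log sum of exp-scores of all prefixes ending in y.
    alpha = emissions[0].copy()
    for t in range(1, n):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    log_Z = logsumexp(alpha, axis=0)
    return float(log_Z - gold)
```

Gradients of this scalar with respect to the emissions and transitions (obtained analytically via forward–backward, or by autodiff in practice) then flow back through the BiLSTM during BPTT.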
4. Structured Decoding via Linear-Chain CRF
The conditional random field output layer enables global sequence-level scoring, enforcing valid transitions (e.g., discouraging illegal IOB/IOBES tag transitions) and leveraging sentence-level dependencies. The CRF layer is parameterized by the transition matrix $A$, and both the score function $s(x, y) = \sum_{t} P_{t, y_t} + \sum_{t} A_{y_t, y_{t+1}}$ and the partition function $Z(x) = \sum_{y'} \exp(s(x, y'))$ are efficiently computed by dynamic programming. Inference is performed via Viterbi decoding, and all derivatives required for learning are available via the forward–backward algorithm (Huang et al., 2015, Ni et al., 2021, Ganesh et al., 13 Oct 2025, Kocoń et al., 2019, Yepes, 2018, Tang et al., 2020, Hoesen et al., 2020).
A key empirical result is the consistent improvement of the CRF output relative to softmax classifiers, especially for tasks where the label structure is highly constrained—notably, NER and chunking (Huang et al., 2015, Ni et al., 2021, Ganesh et al., 13 Oct 2025).
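One common way to enforce valid IOB transitions is a constraint mask added to the learned transition matrix before decoding, so that an illegal move such as `O → I-PER` can never win in Viterbi. This is a sketch of that idea under an assumed small IOB label set; the cited systems either learn such transitions implicitly or hard-constrain them in a similar fashion.

```python
import numpy as np

NEG_INF = -1e9  # effectively forbids a transition under max-scoring
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def iob_transition_mask(labels):
    """Return a (k, k) additive mask: I-X may only follow B-X or I-X of the same type.

    (A full implementation would also forbid starting a sentence with I-X.)
    """
    k = len(labels)
    mask = np.zeros((k, k))
    for j, to in enumerate(labels):
        if not to.startswith("I-"):
            continue
        ent = to[2:]
        for i, frm in enumerate(labels):
            if frm not in ("B-" + ent, "I-" + ent):
                mask[i, j] = NEG_INF
    return mask

# Usage: decode with (learned_transitions + iob_transition_mask(labels)).
```

Because the mask is additive, it composes directly with the learned scores: legal transitions keep their trained values while illegal ones are driven far below any achievable path score.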
5. Empirical Performance and Applications
In the studies surveyed here, BiLSTM-CRF models outperform or match prior neural and feature-based sequence taggers across POS tagging, chunking, and NER.
Selected empirical results:
- English CoNLL-2003 NER: BiLSTM-CRF F₁ = $90.10$ (with SENNA embeddings and gazetteers); BiLSTM–CNN–CRF reaches $91.18$ (GloVe, char-CNNs, 2025 reproduction) (Huang et al., 2015, Ganesh et al., 13 Oct 2025).
- Indonesian NER: Adding POS tag embeddings yields a 4–6 point F₁ gain over the baseline; character-level and softmax/CRF comparisons confirm marginal but consistent gains for the CRF (Hoesen et al., 2020).
- Chinese Judicial NER: BiLSTM-CRF with Adam produces F₁ = $0.855$ vs. $0.813$ (RMSProp) and $0.688$ (GD) (Tang et al., 2020).
- Polish Timex Tagging: KGR10 embeddings + BiLSTM-CRF achieve a strict F₁ of $92.36\%$, a 3–5 point improvement over non-CRF or non-specialized embeddings (Kocoń et al., 2019).
These results confirm the robustness, cross-lingual applicability, and utility of BiLSTM-CRF for varied languages and entity types.
6. Extensions and Variations
The foundational BiLSTM-CRF architecture admits a number of extensions:
- Deeper/stacked encoders and residual connections: Multi-layered and skip-connected architectures, as in (Yepes, 2018), further boost accuracy.
- Character-level CNNs or LSTMs: Additional input features such as char-CNN, char-LSTM, or even language-model embeddings (Ganesh et al., 13 Oct 2025, Yepes, 2018, Hoesen et al., 2020).
- Advanced regularizers: Confidence-penalty, annealed Gaussian noise, and zoneout regularize the model and avoid overfitting (Yepes, 2018).
- New tagging schemes: Modifications for tasks like open relation extraction enable improved performance on overlapping labels by restructuring the label space (Ni et al., 2021).
- Integration with contextualized embeddings: Incorporation of BERT, ELMo, or similar representations, significantly increases representational capacity for state-of-the-art sequence tagging (Ni et al., 2021).
7. Technical Summary Table
| Study / Setting | Input Features | Hidden Size / Dropout | Optimizer / LR | CRF Output | SOTA F₁ / Accuracy |
|---|---|---|---|---|---|
| (Huang et al., 2015) English POS/NER/Chunking | SENNA/Random | 300, ~0 | SGD | Linear-CRF | 97.55 / 90.10 / 94.46 |
| (Ganesh et al., 13 Oct 2025) CoNLL NER | GloVe + char-CNN | 100/200, 0.5 | SGD, 0.015 | Linear-CRF | 91.18 (NER, F₁) |
| (Kocoń et al., 2019) Polish Timex | KGR10 FastText | ~200, 0.5 (PolDeepNer) | Adam | Linear-CRF | 92.36 (strict F₁) |
| (Tang et al., 2020) Chinese Judicial NER | Random/char-emb | 200, 0.5 | Adam, 0.001 | Linear-CRF | 0.855 (F₁) |
| (Hoesen et al., 2020) Indonesian NER w/ POS | W2V + char-LSTM + POS | 100, tanh layer | Not specified | Linear-CRF | Softmax+POS: ~83; CRF+POS: ~80 (F₁) |
| (Yepes, 2018) Spanish NER / Regularized | char-LSTM, word-emb | 100 x 3, zoneout | SGD+m, 0.005 | Linear-CRF | 87.18 (F₁) |
References
- (Huang et al., 2015) Huang et al., "Bidirectional LSTM-CRF Models for Sequence Tagging"
- (Ganesh et al., 13 Oct 2025) Ganesh & Reddy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF"
- (Kocoń et al., 2019) Kocoń & Gawor, "Evaluating KGR10 Polish word embeddings in the recognition of temporal expressions using BiLSTM-CRF"
- (Tang et al., 2020) Li et al., "Recognizing Chinese Judicial Named Entity using BiLSTM-CRF"
- (Hoesen et al., 2020) Rahmaningtyas et al., "Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger"
- (Ni et al., 2021) Zhang et al., "Explore BiLSTM-CRF-Based Models for Open Relation Extraction"
- (Yepes, 2018) Jimeno Yepes, "Confidence penalty, annealing Gaussian noise and zoneout for biLSTM-CRF networks for named entity recognition"