
Bidirectional LM-Based Classifiers

Updated 30 January 2026
  • Bidirectional LM-based classifiers are models that integrate both left and right token information to improve discriminative tasks such as sequence tagging and document classification.
  • Modern approaches predominantly use Transformer-based architectures like BERT, employing masked language modeling and fine-tuning to capture rich contextual representations.
  • Empirical evidence shows these classifiers outperform unidirectional methods, achieving notable gains in accuracy across domains like legal text, hate speech, and cross-lingual sentiment analysis.

Bidirectional LM-based classifiers are systems utilizing language models with bidirectional context (typically via masked language modeling or bidirectional recurrent neural networks) for discriminative prediction tasks such as sequence tagging, document classification, or span selection. Such models leverage information from both preceding and following tokens, conferring superior contextual sensitivity compared to strictly unidirectional approaches. Modern bidirectional classifiers predominantly adopt Transformer-based masked language modeling paradigms (e.g., BERT, RoBERTa), while earlier architectures employed bidirectional LSTMs, convolutional layers, or hybrid mechanisms. Empirical evidence indicates that bidirectional context is critical in scenarios where prediction depends on both left and right context, or where headword-centered discriminative cues are prevalent.

1. Architectural Taxonomy and Formulations

Bidirectional classifiers fall into two broad architectural patterns: (i) models utilizing masked language modeling pretraining (e.g., BERT, RoBERTa, ERNIE), and (ii) neural networks employing explicit bidirectional sequence encoders (bidirectional LSTM/GRU, or hybrid ACNN+BLSTM) (Liang et al., 2016, Peters et al., 2017, ZiqiZhang et al., 25 Aug 2025). Both patterns extract a contextualized representation $h_t$ at each position $t$ by integrating information from tokens before and after $t$.

Transformer-based Bidirectional LM Classifiers

These models use the Transformer encoder with bidirectional self-attention. Prediction is posed as masked language modeling (MLM), with the [MASK] token used to predict entities, span boundaries, or classifiers. For instance, Chinese classifier prediction is cast as:

$$\log P(c \mid X) = \log \left(\mathrm{softmax}\left(\mathrm{BERT}(X_f(c))_{[I_1]}\right)_c\right)$$

where the classifier slot [MASK] is replaced by candidate $c$ and per-candidate likelihoods are compared (ZiqiZhang et al., 25 Aug 2025). The MLM classifier is trained by minimizing cross-entropy over the softmax outputs at mask positions.
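The per-candidate comparison can be sketched as follows. This is a simplified single-pass variant: `mask_logits` stands in for the vocabulary-sized logits a masked LM produces at the [MASK] position (the model call itself is stubbed out, and the candidate ids are illustrative, not from the paper):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def score_candidates(mask_logits, candidate_ids):
    """Return log P(c | X) for each candidate classifier id,
    reading the softmax at the [MASK] position."""
    log_probs = log_softmax(mask_logits)
    return {c: log_probs[c] for c in candidate_ids}

# Toy example: a 10-word vocabulary, candidate classifiers at ids 3 and 7.
logits = np.zeros(10)
logits[3] = 2.0   # the stub "model" strongly prefers candidate 3
scores = score_candidates(logits, candidate_ids=[3, 7])
best = max(scores, key=scores.get)  # -> 3
```

In the full formulation each candidate yields its own input $X_f(c)$ and hence its own forward pass; the comparison over per-candidate log-likelihoods is the same.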

Bidirectional RNN-Based Classifiers

Bidirectional LSTM/GRU (BLSTM/BGRU) models process input in both left-to-right and right-to-left directions. For example, AC-BLSTM combines asymmetric convolutional filters with stacked BLSTMs, producing contextually rich hidden states $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ at each timestep, which are often concatenated over the sequence before classification (Liang et al., 2016).
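The forward/backward concatenation can be illustrated with a minimal numpy recurrence. A plain tanh RNN stands in for the LSTM cells, and the weights are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, T = 4, 8, 5          # input dim, hidden dim, sequence length
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))

def run_rnn(xs):
    # Simple tanh recurrence standing in for one LSTM direction.
    h = np.zeros(d_hid)
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

x_seq = [rng.normal(size=d_in) for _ in range(T)]
fwd = run_rnn(x_seq)                # left-to-right pass
bwd = run_rnn(x_seq[::-1])[::-1]    # right-to-left pass, re-aligned to timesteps
# h_t = [forward state ; backward state] at each timestep
h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each `h[t]` thus sees the whole sequence: its first half summarizes tokens up to `t`, its second half summarizes tokens from `t` onward.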

Pseudo-Bidirectional Embedding Concatenation

Recent work shows that concatenating representations from separate forward and backward unidirectional LMs can yield competitive pseudo-bidirectional features for downstream classification, even outperforming base BERT in full-data and few-shot setups (Goto et al., 2024). For token $t$,

$$h_t = [h_t^{\rightarrow};\, h_t^{\leftarrow}]$$

is fed to a classification head trained for the target task.
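A sketch of this pseudo-bidirectional setup: features from two frozen unidirectional LMs (stubbed here as random projections) are concatenated per token and fed to a small trainable head. The dimensions and the softmax head are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_fwd, d_bwd, n_labels, T = 16, 8, 3, 6

# Stand-ins for the frozen forward and backward LM encoders.
def forward_lm(tokens):
    return rng.normal(size=(len(tokens), d_fwd))

def backward_lm(tokens):
    return rng.normal(size=(len(tokens), d_bwd))

tokens = list(range(T))
# Per-token concatenation h_t = [h_t_forward ; h_t_backward].
feats = np.concatenate([forward_lm(tokens), backward_lm(tokens)], axis=1)

# Trainable classification head over the concatenated features.
W = rng.normal(scale=0.1, size=(d_fwd + d_bwd, n_labels))
logits = feats @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
preds = probs.argmax(axis=1)
```

Note the asymmetry the findings exploit: the forward LM can be large and frozen while the backward LM stays small, since only the head is trained.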

2. Training Paradigms and Optimization

Bidirectional LM-based classifiers exploit pretraining on large unlabeled corpora via LM objectives to produce generalizable representations. Downstream classification is performed via supervised fine-tuning or by training shallow classification heads (Peters et al., 2017, ZiqiZhang et al., 25 Aug 2025).

  • Supervised Fine-Tuning: BERT-like models are fine-tuned for classification tasks using cross-entropy loss on labeled data. Typical settings employ 12 Transformer layers, 768 hidden units, and dropout on classification layers (Zhang et al., 23 May 2025).
  • Frozen LM Feature Extraction: In semi-supervised sequence tagging (TagLM), pretrained BiLM weights are frozen, and their outputs are used as features in a supervised BiRNN-CRF, refining with only the task-specific parameters (Peters et al., 2017).
  • Contrastive Regularization: For cross-lingual transfer, twin BiLSTM branches with shared parameters are trained using contrastive (cosine-based) losses to align task-specific representations from resource-rich and resource-poor languages (Choudhary et al., 2018).
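The contrastive (cosine-based) alignment in the last bullet can be sketched as a margin loss over paired sentence representations from the two branches. The margin value and the pairing logic are illustrative assumptions:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(h_rich, h_poor, same_label, margin=0.4):
    """Pull cross-lingual pairs with the same label together;
    push differing-label pairs below the similarity margin."""
    sim = cosine(h_rich, h_poor)
    if same_label:
        return 1.0 - sim               # maximize similarity for positives
    return max(0.0, sim - margin)      # penalize similar negatives

a = np.array([1.0, 0.0])   # resource-rich branch representation
b = np.array([1.0, 0.0])   # aligned resource-poor representation
c = np.array([0.0, 1.0])   # unrelated representation
pos = contrastive_loss(a, b, same_label=True)    # identical -> loss 0
neg = contrastive_loss(a, c, same_label=False)   # orthogonal -> loss 0
```

Because the twin branches share parameters, gradients from this loss align the two languages' task representations in a single embedding space.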

Optimization methods typically use AdamW or RMSProp, with regularization via dropout and gradient clipping. Margin-based losses are adopted for metric learning settings.

3. Empirical Performance and Comparative Analysis

Quantitative measures across benchmarks substantiate the advantage of bidirectionality for various classification tasks:

  • Chinese Classifier Prediction: BERT MLM achieves 62.31% accuracy, improving to 69.54% post fine-tuning, outperforming GPT-4 (50.70%), DeepSeek-R1 (59.64%), and Qwen3 (31.80–47.69% with fine-tuning). BERT consistently yields lower R-Rank for gold classifiers (ZiqiZhang et al., 25 Aug 2025).
  • Text Classification and Sequence Tagging: On hate-speech, legal, and code datasets, BERT-like models deliver 88.1–99.7% accuracy, generally 10–20 points above LLM probing or zero-shot methods for pattern-driven tasks. LLMs (e.g., GPT-4o, Qwen2.5, LLaMA-3) lead only in knowledge-intensive tasks (hallucination detection) (Zhang et al., 23 May 2025).
  • NER and Chunking: TagLM increases CoNLL-2003 NER F₁ from 90.87 to 91.93 (+1.06), and Chunking F₁ from 95.0 to 96.37 (+1.37), with largest gains in low-resource regimes (Peters et al., 2017).
  • Contrastive BiLSTM: For cross-lingual sentiment analysis and emoji prediction, Siamese BiLSTM models provide +13–23 points over baselines on resource-poor languages, demonstrating robust cross-lingual transfer (Choudhary et al., 2018).
  • Pseudo-Bidirectionality: Concatenating a small backward LM's hidden states to a frozen large forward LM yields +10 F₁ improvements (CoNLL-2003 NER: 65→77 for Llama-2-7B), surpassing BERT in few-shot and rare-domain settings (Goto et al., 2024).

4. Mechanistic Insights and Ablation Studies

Bidirectional LM-based classifiers exploit right-context information absent in left-to-right architectures. In Chinese classifier tasks, masking the head noun drops BERT accuracy from 62% to 33%, confirming the head noun's centrality for semantic disambiguation. Blocking left context entirely reduces performance below 26%; right-context masking exacts a smaller penalty (ZiqiZhang et al., 25 Aug 2025). Self-attention maps indicate strong peaks at the head noun. In sequence tagging, appending bidirectional LM features to the intermediate layers of BiRNN models yields the largest gains, with backward context especially correcting forward-only errors at sentence-initial positions (Goto et al., 2024, Peters et al., 2017).
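The left-/right-context ablations amount to replacing every token on one side of the target position with [MASK] before re-running the classifier. A hypothetical helper (the token list and the [MASK] convention are illustrative):

```python
def mask_context(tokens, target_idx, side, mask_token="[MASK]"):
    """Blank out all tokens to the left or right of the target
    position, as in left-/right-context blocking ablations."""
    out = list(tokens)
    if side == "left":
        for i in range(target_idx):
            out[i] = mask_token
    elif side == "right":
        for i in range(target_idx + 1, len(out)):
            out[i] = mask_token
    else:
        raise ValueError("side must be 'left' or 'right'")
    return out

sent = ["三", "[MASK]", "苹果"]  # numeral, classifier slot, head noun
left_blocked = mask_context(sent, 1, "left")    # hides the numeral
right_blocked = mask_context(sent, 1, "right")  # hides the head noun
```

Comparing accuracy on the two ablated variants isolates how much each context side contributes to the prediction.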

Pseudo-bidirectional approaches are most effective under few-shot constraints, rare domains, and cases of span boundary ambiguity. Backward LM size exhibits diminishing returns beyond ~100–200M parameters when paired with larger forward LMs (Goto et al., 2024).

5. Task-Based Model Selection and Recommendations

Not all classification tasks benefit equally from bidirectionality. PCA and classifier probing reveal:

  • Bidirectional models excel in tasks with high pattern intensity and moderate rule regularity—implicit hate speech, code, and legal detection.
  • LLM-based approaches can surpass bidirectional classifiers in tasks demanding semantic/world knowledge, such as hallucination detection (Zhang et al., 23 May 2025).

Formally, the TaMAS strategy defines task features $\phi(t) = [p(t), r(t), s(t)]^\top$ for pattern intensity, rule regularity, and semantic depth, respectively. Scoring functions $S_{\mathrm{BLM}}(t)$ and $S_{\mathrm{LLM}}(t)$ prescribe whether BERT-like fine-tuning or LLM utilization is optimal. For empirically pattern-driven tasks, bidirectional LM classifiers should be preferred for efficiency and accuracy; for knowledge-heavy tasks, internal-state LLM methods or zero-shot inference are recommended.
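The source does not give the concrete forms of the scoring functions, so the sketch below assumes simple linear weightings purely for illustration; the feature values and weight vectors are hypothetical:

```python
import numpy as np

def phi(pattern, rule, semantic):
    # Task feature vector phi(t) = [p(t), r(t), s(t)]^T, each in [0, 1].
    return np.array([pattern, rule, semantic])

# Hypothetical linear weights: bidirectional LMs favor pattern-driven
# tasks; LLMs favor semantically deep, knowledge-heavy tasks.
w_blm = np.array([1.0, 0.5, -0.5])
w_llm = np.array([-0.5, 0.0, 1.0])

def select_model(features):
    s_blm = w_blm @ features   # stands in for S_BLM(t)
    s_llm = w_llm @ features   # stands in for S_LLM(t)
    return "BLM" if s_blm >= s_llm else "LLM"

hate_speech = phi(pattern=0.9, rule=0.6, semantic=0.3)    # pattern-heavy
hallucination = phi(pattern=0.2, rule=0.1, semantic=0.9)  # knowledge-heavy
```

Under these assumed weights, the pattern-heavy task routes to the bidirectional classifier and the knowledge-heavy task to the LLM, matching the qualitative recommendation above.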

6. Hybrid and Semi-Supervised Extensions

Hybrid classifiers, such as combining bidirectional encoders for classification with unidirectional decoders for generation, are advocated to leverage the strengths of both architectures (ZiqiZhang et al., 25 Aug 2025). AC-BLSTM demonstrates how asymmetric convolutional feature extractors combined with BLSTM encoders enhance classification by capturing multi-scale phrase features (Liang et al., 2016). Semi-supervised extensions (G-AC-BLSTM, TagLM) can integrate unlabeled or generated data, further improving performance in low-resource conditions without additional labeled data or external lexica.

7. Future Directions

Future work recommends:

  • Integration of bidirectional attention mechanisms into large-scale LLM architectures to close the pattern-driven performance gap observed versus BERT-like models (ZiqiZhang et al., 25 Aug 2025, Zhang et al., 23 May 2025).
  • Exploration of fine-tuning and prompt-engineering strategies that explicitly utilize noun-centered or span representations to enhance bidirectional disambiguation (ZiqiZhang et al., 25 Aug 2025).
  • Expansion of cross-lingual, resource-efficient classifiers via tied-weight bidirectional encoders and contrastive losses as per the Siamese BiLSTM framework (Choudhary et al., 2018).
  • Architectural innovation toward encoder-decoder hybrids, pseudo-bidirectional representation concatenation, or lightweight backward LMs for rapid adaptation tasks (Goto et al., 2024).

Bidirectional LM-based classifiers remain foundational for discriminative text tasks requiring nuanced context sensitivity, with architectural and training strategies continuously evolving to adapt for efficiency, generalization, and task-specific efficacy.
