TENER: Adapting Transformer Encoder for Named Entity Recognition (1911.04474v3)

Published 10 Nov 2019 in cs.CL and cs.LG

Abstract: The Bidirectional long short-term memory networks (BiLSTM) have been widely used as an encoder in models solving the named entity recognition (NER) task. Recently, the Transformer is broadly adopted in various NLP tasks owing to its parallelism and advantageous performance. Nevertheless, the performance of the Transformer in NER is not as good as it is in other NLP tasks. In this paper, we propose TENER, a NER architecture adopting adapted Transformer Encoder to model the character-level features and word-level features. By incorporating the direction and relative distance aware attention and the un-scaled attention, we prove the Transformer-like encoder is just as effective for NER as other NLP tasks.

TENER: Adapting Transformer Encoder for Named Entity Recognition

The paper introduces TENER, an architecture designed to improve Named Entity Recognition (NER) with a modified Transformer encoder. The model addresses the limitations of the vanilla Transformer when applied to NER, aiming to match, on a task traditionally dominated by BiLSTMs, the effectiveness Transformers show elsewhere in NLP.

Background and Motivation

NER tasks involve identifying and classifying entities within text, benefiting applications like relation extraction and coreference resolution. Traditionally, BiLSTMs have been a prevalent choice due to their ability to capture contextual word representations effectively. However, with the rise of Transformer models, known for their self-attention mechanisms and parallel processing capabilities, there is a compelling need to adapt this architecture for NER.

Despite the success of Transformers in domains such as machine translation and language modeling, their performance on NER has been suboptimal. The authors investigate and address two primary shortcomings: the lack of directionality in the positional encoding and the tendency of scaled attention to dilute the essential signal amid noise.
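
The first shortcoming can be stated precisely. A one-line check (with c_j = 1/10000^{2j/d} denoting the frequency of the j-th sine/cosine pair of the sinusoidal embedding) shows that the dot product between two vanilla position embeddings depends only on their distance, not on its sign:

```latex
PE_t^{\top} PE_{t+k}
  = \sum_{j=0}^{d/2-1}\Bigl[\sin(c_j t)\,\sin\bigl(c_j (t+k)\bigr) + \cos(c_j t)\,\cos\bigl(c_j (t+k)\bigr)\Bigr]
  = \sum_{j=0}^{d/2-1}\cos(c_j k)
```

Because cosine is even, offsets +k and -k are indistinguishable, so the vanilla encoding reflects distance but not direction; the paper further observes that even this distance signal is lost once the learned projections W_q and W_k sit between the two embeddings.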

Key Contributions

  1. Direction- and Distance-Aware Attention:
    • The proposed model incorporates a relative positional encoding that is aware of both distance and direction, addressing an intrinsic limitation of the sinusoidal embeddings in the vanilla Transformer. By leveraging a revised formulation that introduces fewer parameters, TENER preserves the direction and distance cues that matter for NER, such as which side of a word a neighboring token lies on.
  2. Un-Scaled Attention:
    • The authors identify that the smoothing induced by the scaling factor in dot-product attention may not benefit NER, where precise, sparse attention is preferable. They therefore drop the 1/sqrt(d_k) scaling, which empirically yields sharper attention distributions and better entity recognition; both attention changes are sketched in the code after this list.
  3. Transformer as a Character Encoder:
    • Character-level encoding augments word representation, addressing data sparsity and OOV issues. The authors utilize a Transformer-based encoder at the character level, capable of identifying complex linguistic patterns more effectively than traditional CNN-based encoders.
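
To make the two attention changes concrete, below is a minimal PyTorch sketch of a direction- and distance-aware, un-scaled multi-head attention layer in the spirit of TENER. It is an illustration, not the authors' implementation: the module and variable names (`RelativeMultiHeadAttention`, `u`, `v`, `_gather_offsets`) are chosen here, and details such as masking, dropout, the output projection, and exact weight sharing are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeMultiHeadAttention(nn.Module):
    """Sketch of TENER-style attention: direction- and distance-aware relative
    position term, and no 1/sqrt(d_k) scaling of the logits. Illustrative only."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0 and (d_model // n_heads) % 2 == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Global content bias u and position bias v (one pair per head).
        self.u = nn.Parameter(torch.zeros(n_heads, self.d_head))
        self.v = nn.Parameter(torch.zeros(n_heads, self.d_head))

    def _relative_sinusoid(self, seq_len: int, device):
        # Signed offsets -(L-1)..(L-1): sine is odd in the offset, so the
        # embedding of +k differs from that of -k, i.e. direction is preserved.
        offsets = torch.arange(-(seq_len - 1), seq_len, device=device, dtype=torch.float)
        inv_freq = 1.0 / (10000 ** (torch.arange(0, self.d_head, 2, device=device).float() / self.d_head))
        angles = offsets[:, None] * inv_freq[None, :]            # (2L-1, d_head/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (2L-1, d_head)

    @staticmethod
    def _gather_offsets(bd, L):
        # bd: (B, H, L, 2L-1), columns indexed by relative offset.
        # Select, for every (t, j) pair, the column holding offset t - j.
        idx = torch.arange(L, device=bd.device)
        gather = (idx[:, None] - idx[None, :]) + (L - 1)          # values in [0, 2L-2]
        gather = gather.view(1, 1, L, L).expand(bd.size(0), bd.size(1), L, L)
        return bd.gather(-1, gather)

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        B, L, _ = x.shape
        q = self.q_proj(x).view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        val = self.v_proj(x).view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        r = self._relative_sinusoid(L, x.device)                  # (2L-1, d_head)

        # A_{t,j} = (Q_t + u) K_j^T + (Q_t + v) R_{t-j}^T
        content = torch.einsum("bhqd,bhkd->bhqk", q + self.u[None, :, None, :], k)
        position = torch.einsum("bhqd,rd->bhqr", q + self.v[None, :, None, :], r)
        scores = content + self._gather_offsets(position, L)      # note: NOT divided by sqrt(d_head)

        attn = F.softmax(scores, dim=-1)
        out = torch.einsum("bhqk,bhkd->bhqd", attn, val)
        return out.transpose(1, 2).reshape(B, L, -1)


# Example usage: a batch of 2 sequences of length 7 with d_model=128, 8 heads.
layer = RelativeMultiHeadAttention(d_model=128, n_heads=8)
h = layer(torch.randn(2, 7, 128))   # -> (2, 7, 128)
```

The same block can be stacked for the word-level encoder and reused, with smaller dimensions, over character sequences; the key departures from the vanilla Transformer are the signed relative offsets (direction survives because sine is odd in the offset) and the absence of the 1/sqrt(d_k) scaling before the softmax.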

Experimental Evaluation

The efficacy of TENER is validated on six NER datasets, two English and four Chinese. The results consistently demonstrate superior performance compared to existing BiLSTM- and CNN-based models as well as the vanilla Transformer. Notably, TENER achieves state-of-the-art results among models that do not rely on pre-trained language models, highlighting its robustness.

  1. Performance on English Datasets:
    • TENER outperforms existing methods such as BiLSTM-CRF and CNN-BiLSTM-CRF on CoNLL2003 and OntoNotes 5.0 datasets. Even without contextualized embeddings, the model shows marked improvement, underlining the effectiveness of its architectural adaptations.
  2. Performance on Chinese Datasets:
    • On Chinese datasets including MSRA and Weibo, TENER surpasses traditional NER models, confirming the Transformer's potential once it is adapted for direction-aware relative positions and sharper, un-scaled attention.

Implications and Future Directions

The findings suggest significant potential for Transformer-based models in NER when the positional encoding and attention mechanism are tailored to the task. The work implies that further exploration of attention patterns and positional encodings can yield substantial returns for other sequence labeling tasks. Future research could integrate such models with contextualized embeddings or adapt similar mechanisms to related tasks such as dependency parsing and sentiment analysis.

In conclusion, TENER represents a meaningful advance in adapting Transformer architectures to NER, demonstrating that with appropriate modifications a Transformer encoder can match or surpass BiLSTM-based baselines on sequence labeling benchmarks.

Authors (4)
  1. Hang Yan (86 papers)
  2. Bocao Deng (1 paper)
  3. Xiaonan Li (48 papers)
  4. Xipeng Qiu (257 papers)
Citations (256)