Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging (1808.04208v3)

Published 13 Aug 2018 in cs.CL

Abstract: Character-level models of tokens have been shown to be effective at dealing with within-token noise and out-of-vocabulary words. But these models still rely on correct token boundaries. In this paper, we propose a novel end-to-end character-level model and demonstrate its effectiveness in multilingual settings and when token boundaries are noisy. Our model is a semi-Markov conditional random field with neural networks for character and segment representation. It requires no tokenizer. The model matches state-of-the-art baselines for various languages and significantly outperforms them on a noisy English version of a part-of-speech tagging benchmark dataset. Our code and the noisy dataset are publicly available at http://cistern.cis.lmu.de/semiCRF.
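The abstract describes the model only at a high level: a semi-Markov CRF that jointly segments a raw character sequence and tags each segment, with neural networks providing the character and segment representations. As a rough illustration of the semi-CRF machinery such a model relies on, the sketch below shows the forward (log-partition) dynamic program over character segments. The function name, tensor layout, and `max_len` cap are assumptions made for this sketch and are not taken from the authors' released code (linked above).

```python
import torch

def semicrf_log_partition(seg_scores, trans, max_len):
    """
    Log-partition of a semi-Markov CRF over a length-T character sequence.

    seg_scores: (T, T, K) tensor; seg_scores[i, j, k] scores labeling the
                character segment i..j (inclusive) with tag k.
    trans:      (K, K) tensor of tag-transition scores trans[prev, cur].
    max_len:    maximum segment length considered by the dynamic program.
    """
    T, _, K = seg_scores.shape
    neg_inf = -1e9
    # alpha[t, k]: log-sum over all segmentations of the first t characters
    # whose final segment carries tag k.
    alpha = seg_scores.new_full((T + 1, K), neg_inf)
    for t in range(1, T + 1):
        candidates = []
        for d in range(1, min(max_len, t) + 1):
            i = t - d                      # segment spans characters i..t-1
            emit = seg_scores[i, t - 1]    # (K,) segment-label scores
            if i == 0:
                # First segment of the sequence: no transition score.
                candidates.append(emit.unsqueeze(0))                        # (1, K)
            else:
                # (K_prev, K_cur): previous alpha + transition + emission.
                candidates.append(alpha[i].unsqueeze(1) + trans + emit.unsqueeze(0))
        stacked = torch.cat(candidates, dim=0)
        alpha[t] = torch.logsumexp(stacked, dim=0)
    return torch.logsumexp(alpha[T], dim=0)

# Example: 5 characters, 3 tags, segments of up to 3 characters (random scores).
T, K = 5, 3
seg_scores = torch.randn(T, T, K)
trans = torch.randn(K, K)
print(semicrf_log_partition(seg_scores, trans, max_len=3))
```

Training a model of this form would maximize the score of the gold segmentation and tag sequence minus this log-partition; decoding replaces the log-sum-exp with a max (semi-Markov Viterbi). In the paper's setting, `seg_scores` would come from the neural character and segment encoders rather than random values.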

Authors (3)
  1. Apostolos Kemos (1 paper)
  2. Heike Adel (51 papers)
  3. Hinrich Schütze (250 papers)
Citations (10)
