Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

TAMS: Translation-Assisted Morphological Segmentation (2403.14840v2)

Published 21 Mar 2024 in cs.CL

Abstract: Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to dramatically speed up this process. But in typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high quality models. However, translation data is often much more abundant, and, in this work, we present a method that attempts to leverage this data in the canonical segmentation task. We propose a character-level sequence-to-sequence model that incorporates representations of translations obtained from pretrained high-resource monolingual LLMs as an additional signal. Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Enora Rice (5 papers)
  2. Ali Marashian (5 papers)
  3. Luke Gessler (13 papers)
  4. Alexis Palmer (14 papers)
  5. Katharina von der Wense (22 papers)

Summary

We haven't generated a summary for this paper yet.