Language Segmentation (1510.01717v1)
Abstract: Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks. The problem can be solved by training language models on language data. However, in the case of low- or no-resource languages, this is problematic. I therefore investigate whether unsupervised methods perform better than supervised methods when it is difficult or impossible to train supervised approaches. A special focus is given to difficult texts, i.e. texts that are rather short (one sentence), contain abbreviations, or involve low-resource languages and non-standard language. I compare three approaches: supervised n-gram language models, unsupervised clustering, and weakly supervised n-gram language model induction. I devised the weakly supervised approach in order to deal with difficult texts specifically. To test the approach, I compiled a small corpus of different text types, ranging from one-sentence texts to texts of about 300 words. The weakly supervised language model induction approach works well on short and difficult texts, outperforming the clustering algorithm and reaching scores in the vicinity of the supervised approach. The results look promising, but there is room for improvement and a more thorough investigation should be undertaken.
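To make the supervised n-gram approach concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the paper's implementation. It trains one character trigram model per language with add-one smoothing, labels each word with the highest-scoring model, and places a language boundary wherever the label changes. All identifiers (`CharNgramLM`, `segment`), the trigram order, the smoothing choice, and the toy training data are hypothetical.

```python
from collections import defaultdict
import math


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded at the start."""
    padded = "#" * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class CharNgramLM:
    """Character n-gram language model with add-one smoothing.

    A hypothetical stand-in for the supervised models compared in the
    paper; the smoothing scheme is an assumption for this sketch.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.charset = set()

    def train(self, text):
        for gram in char_ngrams(text.lower(), self.n):
            self.ngram_counts[gram] += 1
            self.context_counts[gram[:-1]] += 1
            self.charset.add(gram[-1])

    def log_prob(self, text):
        """Smoothed log-probability of text under this model."""
        v = max(len(self.charset), 1)
        total = 0.0
        for gram in char_ngrams(text.lower(), self.n):
            total += math.log((self.ngram_counts.get(gram, 0) + 1) /
                              (self.context_counts.get(gram[:-1], 0) + v))
        return total


def segment(words, models):
    """Label each word with the best-scoring language model, then merge
    consecutive words sharing a label into (language, span) pairs."""
    labels = [max(models, key=lambda lang: models[lang].log_prob(w))
              for w in words]
    spans, start = [], 0
    for i in range(1, len(words) + 1):
        if i == len(words) or labels[i] != labels[start]:
            spans.append((labels[start], " ".join(words[start:i])))
            start = i
    return spans


if __name__ == "__main__":
    # Toy training data; a real system would train on much larger corpora.
    models = {"en": CharNgramLM(), "de": CharNgramLM()}
    models["en"].train("the quick brown fox jumps over the lazy dog and runs away")
    models["de"].train("der schnelle braune fuchs springt ueber den faulen hund und rennt weg")
    mixed = "the lazy dog sagte der schnelle fuchs runs away".split()
    print(segment(mixed, models))
```

On a short mixed sentence like the one above, the sketch emits alternating (language, span) pairs; it also shows why very short or abbreviation-heavy texts are hard, since a handful of character trigrams gives each model little evidence to score.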