
Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German (1912.00159v3)

Published 30 Nov 2019 in cs.CL

Abstract: This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, our approach will run continuously to keep increasing the corpus over time.

Authors (5)
  1. Lucy Linder (3 papers)
  2. Michael Jungo (4 papers)
  3. Jean Hennebert (3 papers)
  4. Claudiu Musat (38 papers)
  5. Andreas Fischer (54 papers)
Citations (15)
