Automatic register identification for the open web using multilingual deep learning (2406.19892v3)

Published 28 Jun 2024 in cs.CL

Abstract: This article investigates how well deep learning models can identify web registers -- text varieties such as news reports and discussion forums -- across 16 languages. We introduce the Multilingual CORE corpora, which contain 72,504 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Our multilingual models achieve state-of-the-art results (79% F1 score) using multi-label classification. This performance matches or exceeds previous studies that used simpler classification schemes, showing that models can perform well even with a complex register scheme at a massively multilingual scale. However, we observe a consistent performance ceiling around 77-80% F1 score across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid documents -- texts combining multiple registers -- reveals that the main challenge is not in classifying hybrids themselves, but in distinguishing between hybrid and non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly helping languages with limited training data. While zero-shot performance drops by an average of 7% on unseen languages, this decrease varies substantially between languages (from 3% to 20%), indicating that while registers share many features across languages, they also maintain language-specific characteristics.
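To make the multi-label evaluation concrete, here is a minimal sketch of how an F1 score can be computed when a document may carry more than one register label. It assumes micro-averaging (the abstract only says "F1 score"), treats any document with more than one label as a hybrid (a simplification of the hierarchical taxonomy), and uses made-up register names and predictions purely for illustration. None of it is taken from the authors' code.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool label-level TP/FP/FN over all documents."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def is_hybrid(labels: Set[str]) -> bool:
    """Assumption: a document is 'hybrid' if it has more than one label."""
    return len(labels) > 1


# Hypothetical gold and predicted label sets (not data from the paper).
gold = [{"news"}, {"discussion"}, {"news", "opinion"}, {"opinion"}]
pred = [{"news"}, {"discussion"}, {"news"}, {"opinion", "news"}]

score = micro_f1(gold, pred)

# Share of documents whose hybrid/non-hybrid status is predicted correctly.
hybrid_agreement = sum(
    is_hybrid(g) == is_hybrid(p) for g, p in zip(gold, pred)
) / len(gold)

print(f"micro-F1: {score:.2f}")
print(f"hybrid vs. non-hybrid agreement: {hybrid_agreement:.2f}")
```

In this toy example every error comes from the hybrid boundary: one hybrid document is predicted as a single register, and one single-register document is predicted as a hybrid. That is the kind of confusion the abstract identifies as the main challenge.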
