
Multilingual Clustering of Streaming News (1809.00540v1)

Published 3 Sep 2018 in cs.CL and cs.IR

Abstract: Clustering news across languages enables efficient media monitoring by aggregating articles from multilingual sources into coherent stories. Doing so in an online setting allows scalable processing of massive news streams. To this end, we describe a novel method for clustering an incoming stream of multilingual documents into monolingual and crosslingual story clusters. Unlike typical clustering approaches that consider a small and known number of labels, we tackle the problem of discovering an ever-growing number of cluster labels in an online fashion, using real news datasets in multiple languages. Our method is simple to implement, computationally efficient, and produces state-of-the-art results on datasets in German, English and Spanish.
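
To make the online setting in the abstract concrete, here is a minimal sketch of single-pass clustering in which the number of clusters is not fixed in advance. It is not the authors' method: the bag-of-words representation, the cosine similarity measure, the `threshold` value, and the names `OnlineClusterer` and `add` are all assumptions chosen for illustration, and the sketch handles a single language only (no crosslingual linking).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class OnlineClusterer:
    """Hypothetical single-pass clusterer: each document is seen once."""

    def __init__(self, threshold=0.3):
        # Assumed similarity cutoff; a real system would tune this.
        self.threshold = threshold
        # One accumulated term-count vector per cluster discovered so far.
        self.centroids = []

    def add(self, text):
        """Assign a document to the closest cluster, or open a new one."""
        vec = Counter(text.lower().split())
        best_id, best_sim = None, 0.0
        for cluster_id, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        if best_id is not None and best_sim >= self.threshold:
            # Fold the document's counts into the matched cluster.
            self.centroids[best_id].update(vec)
            return best_id
        # No cluster is similar enough: the label set grows by one.
        self.centroids.append(vec)
        return len(self.centroids) - 1
```

Calling `add` on each incoming article returns a cluster id immediately, so the stream never has to be stored or re-clustered. The linear scan over all centroids is the simplest possible choice and would likely need an index or a time window on a large stream.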

Authors (4)
  1. Sebastião Miranda (5 papers)
  2. Artūrs Znotiņš (1 paper)
  3. Shay B. Cohen (78 papers)
  4. Guntis Barzdins (9 papers)
Citations (26)