
Hierarchical Optimal Transport for Document Representation (1906.10827v2)

Published 26 Jun 2019 in cs.LG, cs.CL, cs.IR, and stat.ML

Abstract: The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.
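The two-level construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `ot_cost` and `hierarchical_ot`, the toy vocabulary of four words placed on a line, and the use of SciPy's general-purpose `linprog` solver are all assumptions made for illustration. The idea shown is only what the abstract states: compute a word-level transport cost between every pair of topics, then solve a second, much smaller transport problem between the documents' topic distributions using those costs.

```python
import numpy as np
from scipy.optimize import linprog


def ot_cost(a, b, cost):
    """Exact optimal transport cost between histograms a and b (each sums to 1)."""
    n, m = cost.shape
    # Row-marginal constraints: sum_j P[i, j] = a[i] (P is flattened row-major).
    row = np.kron(np.eye(n), np.ones((1, m)))
    # Column-marginal constraints: sum_i P[i, j] = b[j].
    col = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([row, col]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
    )
    return res.fun


def hierarchical_ot(doc_a, doc_b, topics, word_cost):
    """Transport cost between two documents given as distributions over topics.

    topics[i] is topic i's distribution over words (hypothetical input layout).
    """
    k = topics.shape[0]
    topic_cost = np.zeros((k, k))
    # Ground cost between topics: word-level transport cost between them.
    for i in range(k):
        for j in range(i + 1, k):
            c = ot_cost(topics[i], topics[j], word_cost)
            topic_cost[i, j] = topic_cost[j, i] = c
    # Second, smaller transport problem on the topic space.
    return ot_cost(doc_a, doc_b, topic_cost)


# Toy data (assumed): four words embedded at positions 0, 1, 2, 3 on a line.
positions = np.array([0.0, 1.0, 2.0, 3.0])
word_cost = np.abs(positions[:, None] - positions[None, :])
topics = np.array([
    [0.5, 0.5, 0.0, 0.0],  # topic 0 uses the first two words
    [0.0, 0.0, 0.5, 0.5],  # topic 1 uses the last two words
])
d_far = hierarchical_ot(np.array([1.0, 0.0]), np.array([0.0, 1.0]), topics, word_cost)
d_mid = hierarchical_ot(np.array([1.0, 0.0]), np.array([0.5, 0.5]), topics, word_cost)
```

In this toy setup the word-level problem has 16 variables, while the document-level problem has only 4, which is the source of the scalability gain the abstract mentions: the expensive topic-to-topic costs depend only on the topics and can be computed once and reused for every document pair.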

Authors (5)
  1. Mikhail Yurochkin (68 papers)
  2. Sebastian Claici (8 papers)
  3. Edward Chien (9 papers)
  4. Farzaneh Mirzazadeh (6 papers)
  5. Justin Solomon (86 papers)
Citations (84)
