Wasserstein-Fisher-Rao Document Distance (1904.10294v2)

Published 23 Apr 2019 in cs.LG, cs.CL, and stat.ML

Abstract: Measuring the distance between documents is a fundamental problem in natural language processing. Among the existing methods, the Word Mover's Distance (WMD) has shown remarkable success in document semantic matching, owing to its clear physical insight as a parameter-free model. However, WMD is essentially based on the classical Wasserstein metric, so it often fails to robustly represent the semantic similarity between texts of different lengths. In this paper, we apply the newly developed Wasserstein-Fisher-Rao (WFR) metric from unbalanced optimal transport theory to measure the distance between documents. The proposed WFR document distance retains the interpretability and simplicity of WMD. We demonstrate that the WFR document distance has significant advantages when comparing texts of different lengths. In addition, an accelerated Sinkhorn-based algorithm with a GPU implementation is developed for the fast computation of WFR distances. KNN classification results on eight datasets show a clear improvement over WMD.
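
To make the idea of unbalanced transport between texts of different lengths concrete, the following is a minimal sketch of a generic entropic unbalanced Sinkhorn iteration with KL marginal penalties. It is not the paper's accelerated GPU algorithm and does not use the WFR cost; the function name `unbalanced_sinkhorn`, the squared Euclidean cost, the toy embeddings, and the parameters `eps` and `rho` are all assumptions chosen for illustration.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.5, rho=1.0, n_iter=200):
    """Generic entropic unbalanced OT with KL marginal penalties.

    a, b : nonnegative word-mass vectors (they need not have equal total mass)
    C    : pairwise cost matrix of shape (len(a), len(b))
    eps  : entropic regularization strength (illustrative value)
    rho  : weight of the KL penalty on each marginal (illustrative value)
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    lam = rho / (rho + eps)       # exponent < 1 softens the marginal constraints
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** lam
        v = (b / (K.T @ u)) ** lam
    # Transport plan: diag(u) K diag(v)
    return u[:, None] * K * v[None, :]

# Hypothetical 2-D "word embeddings" for a longer and a shorter document.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y = np.array([[0.0, 0.1], [1.0, 1.0]])
a = np.array([2.0, 1.0, 1.0])     # word counts, total mass 4
b = np.array([1.0, 1.0])          # word counts, total mass 2

# Squared Euclidean cost between every pair of embeddings.
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

P = unbalanced_sinkhorn(a, b, C)
print("row marginals:", P.sum(axis=1))
print("column marginals:", P.sum(axis=0))
```

Because the marginal constraints are only penalized, the plan's row and column sums need not match `a` and `b` exactly, which is what lets documents of different total mass be compared without normalizing them first.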

Authors (5)
  1. Zihao Wang (216 papers)
  2. Datong Zhou (11 papers)
  3. Yong Zhang (660 papers)
  4. Hao Wu (625 papers)
  5. Chenglong Bao (42 papers)
Citations (4)
