LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation (2202.13590v2)

Published 28 Feb 2022 in cs.CL and cs.LG

Abstract: In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback: because it is deterministic, generating multiple segmentations is difficult. To overcome this difficulty, we focus on a probabilistic string algorithm, called locally-consistent parsing (LCP), that has been applied to achieve optimal compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation, which improves on BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
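
To illustrate the general idea of dropout-style multiple subword segmentation that the abstract builds on, the sketch below applies BPE-style merge rules while randomly skipping individual merges, so repeated runs can segment the same word differently. This is a minimal approximation in the spirit of BPE-dropout, not the authors' LCP-based algorithm; the `segment` function, the merge table, and the `dropout` parameter are illustrative assumptions.

```python
# Minimal sketch of dropout-style multiple subword segmentation (assumption:
# BPE-dropout-like behavior, not the paper's LCP-dropout). Each applicable
# merge is skipped with probability `dropout`, so repeated calls can yield
# different segmentations of the same word.
import random

def segment(word, merges, dropout=0.1, seed=None):
    """Segment `word` into subwords using merge rules in priority order,
    randomly dropping individual merge applications."""
    rng = random.Random(seed)
    symbols = list(word)                              # start from characters
    rank = {pair: i for i, pair in enumerate(merges)} # lower rank = higher priority
    while True:
        # Candidate adjacent pairs with a learned merge rule, each kept
        # with probability (1 - dropout). Simplification: if all surviving
        # candidates are dropped in one pass, segmentation stops early.
        pairs = [(rank[(a, b)], i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                 if (a, b) in rank and rng.random() >= dropout]
        if not pairs:
            break
        _, i = min(pairs)                             # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical merge table; different seeds may segment "lower" differently.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
for s in range(3):
    print(segment("lower", merges, dropout=0.3, seed=s))
```

With `dropout=0.0` this reduces to deterministic BPE-style segmentation; the paper's contribution is to obtain the multiple-segmentation effect through the probabilistic label assignment of locally-consistent parsing rather than by dropping learned merges.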

Authors (6)
  1. Keita Nonaka (1 paper)
  2. Kazutaka Yamanouchi (1 paper)
  3. Tomohiro I (37 papers)
  4. Tsuyoshi Okita (13 papers)
  5. Kazutaka Shimada (3 papers)
  6. Hiroshi Sakamoto (16 papers)
Citations (7)
