Comparison of Modified Kneser-Ney and Witten-Bell Smoothing Techniques in Statistical Language Model of Bahasa Indonesia (1706.07786v1)

Published 23 Jun 2017 in cs.CL

Abstract: Smoothing is one technique to overcome data sparsity in statistical language models. Although its mathematical definition has no explicit dependency on a specific natural language, the differing natures of natural languages mean that smoothing techniques affect them differently. This is true for the Russian language, as shown by Whittaker (1998). In this paper, we compared the Modified Kneser-Ney and Witten-Bell smoothing techniques in statistical language models of Bahasa Indonesia. We used training sets totaling 22M words that we extracted from the Indonesian version of Wikipedia. As far as we know, this is the largest training set used to build a statistical language model for Bahasa Indonesia. Experiments with 3-gram, 5-gram, and 7-gram models showed that Modified Kneser-Ney consistently outperforms Witten-Bell smoothing in terms of perplexity. Interestingly, our experiments showed that the 5-gram model with Modified Kneser-Ney smoothing outperforms the 7-gram model, whereas Witten-Bell smoothing improves consistently as the n-gram order increases.
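
As a rough illustration of the experimental setup described in the abstract (not the authors' code), the sketch below trains 3-, 5-, and 7-gram models with both smoothing techniques and compares them by test-set perplexity using NLTK's nltk.lm package. The corpus file name, tokenization, and 90/10 train/test split are illustrative assumptions, and NLTK provides plain interpolated Kneser-Ney rather than the Modified variant evaluated in the paper; at 22M words a dedicated toolkit such as SRILM or KenLM would normally be used, but NLTK keeps the example self-contained.

# Minimal sketch of the comparison, under the assumptions stated above.
# "idwiki.txt" is a hypothetical file with one pre-tokenized sentence
# per line; it is not a path from the paper.
from nltk.lm import KneserNeyInterpolated, WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

with open("idwiki.txt", encoding="utf-8") as f:
    sents = [line.split() for line in f if line.strip()]
cut = int(0.9 * len(sents))
train_sents, test_sents = sents[:cut], sents[cut:]

for order in (3, 5, 7):
    for name, cls in (("Kneser-Ney (interpolated)", KneserNeyInterpolated),
                      ("Witten-Bell", WittenBellInterpolated)):
        # The pipeline returns generators, so rebuild it for every model.
        train_data, vocab = padded_everygram_pipeline(order, train_sents)
        lm = cls(order)
        lm.fit(train_data, vocab)
        # Score the test set as padded n-grams. In practice you would
        # also cap the vocabulary so out-of-vocabulary test words map
        # to <UNK> instead of producing infinite perplexity.
        test_ngrams = [ng for s in test_sents
                       for ng in ngrams(pad_both_ends(s, n=order), order)]
        print(f"{order}-gram {name}: perplexity {lm.perplexity(test_ngrams):.1f}")

Lower perplexity on held-out text indicates the better-smoothed model, which is the criterion the paper uses to rank the two techniques across n-gram orders.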

Authors (1)
  1. Ismail Rusli (2 papers)
Citations (2)
