
Self-Normalized Importance Sampling for Neural Language Modeling

Published 11 Nov 2021 in cs.CL, cs.SD, and eess.AS | arXiv:2111.06310v2

Abstract: To mitigate the problem of having to traverse the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large-vocabulary word-based neural language models. These training criteria typically enjoy the benefit of faster training and testing, at the cost of slightly degraded performance in terms of perplexity and almost no visible drop in word error rate. While noise contrastive estimation is one of the most popular choices, we recently showed that other sampling-based criteria can also perform well, as long as an extra correction step is applied in which the intended class posterior probability is recovered from the raw model outputs. In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized, and there is no need for a further correction step. Through self-normalized language model training as well as lattice rescoring experiments, we show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.


Summary

  • The paper introduces self-normalized importance sampling, simplifying softmax normalization in neural language models by eliminating extra correction steps.
  • It compares variants that include or exclude target classes and demonstrates competitive perplexity and WER against methods like NCE.
  • Experimental results on datasets such as Switchboard and LibriSpeech show improved training efficiency and strong performance in ASR applications.


Introduction

The paper "Self-Normalized Importance Sampling for Neural Language Modeling" (2111.06310) presents advancements in sampling-based training criteria for neural LMs, specifically addressing the inefficiency of traversing large vocabularies during softmax normalization. Neural LMs, which consistently outperform count-based methods in terms of perplexity, face challenges in efficient training and testing due to large vocabulary sizes. This study builds on previous work by proposing self-normalized importance sampling, which removes the need for an additional correction step to recover the intended class posteriors, thus simplifying the training procedure while maintaining competitive performance in both research and production automatic speech recognition (ASR) tasks.

Methodology

The methodology compares various sampling-based training criteria such as Binary Cross Entropy (BCE), Noise Contrastive Estimation (NCE), and Importance Sampling (IS). The paper observes that IS is not self-normalized by default; several modifications are therefore proposed to enable self-normalization directly:

  1. Mode 1 - Sampling including the Target Class: subtracts the target term to exclude the target class from the summation, resulting in a self-normalized model output.
  2. Mode 2 - Sampling excluding the Target Class: uses a zero expected count for the target class during sampling, so the target class is never drawn.
  3. Mode 3 - Sampling excluding the Target Class: employs a function that evaluates to zero whenever the target class is sampled, thus excluding it from the approximation.
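As a rough illustration (variable names and the uniform proposal are assumptions, and this greatly simplifies the paper's derivation), the Mode 1 and Mode 3 mechanisms for keeping the target class out of the summation can be sketched in NumPy. On any fixed draw, both end up estimating the same importance-weighted sum over non-target classes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 1_000, 50                      # vocabulary size, number of samples
scores = rng.normal(size=V)           # raw (unnormalized) model scores
q = np.full(V, 1.0 / V)               # proposal distribution (uniform here)
target = 7
samples = rng.choice(V, size=K, p=q)  # K classes drawn from q

# Importance weights for the drawn classes.
w = np.exp(scores[samples]) / q[samples]

# Mode 1 sketch: sample freely (the target may be drawn), then subtract
# the target's contribution from the summation afterwards.
hits = (samples == target).sum()
z_mode1 = np.mean(w) - hits * np.exp(scores[target]) / (K * q[target])

# Mode 3 sketch: keep all draws, but the term evaluates to zero whenever
# the target class is sampled.
z_mode3 = np.sum(w * (samples != target)) / K
```

The two estimators differ only in mechanism, so they agree up to floating-point error on the same draw; the practical differences the paper studies lie in how the sampling interacts with training.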

These modifications aim to provide an approximation consistent with the empirical posterior distribution p(c|x), thus achieving self-normalization without the need for correction steps.
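The core idea can be sketched as follows (a toy setup, not the paper's exact criteria; all names and sizes are illustrative): the softmax partition function is estimated from K samples drawn from a proposal distribution q, so the loss never touches the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10_000, 500                    # vocabulary size, number of samples
scores = rng.normal(size=V)           # model scores for one context
q = np.full(V, 1.0 / V)               # proposal distribution
target = 42

# Exact cross-entropy: requires normalizing over all V classes.
log_Z = np.log(np.exp(scores).sum())
exact_loss = log_Z - scores[target]

# Sampled estimate: Z is approximated by (1/K) * sum_k exp(s_k) / q(k)
# with k drawn from q, touching only K classes instead of V.
samples = rng.choice(V, size=K, p=q)
z_hat = np.mean(np.exp(scores[samples]) / q[samples])
sampled_loss = np.log(z_hat) - scores[target]
```

With a reasonable K, the sampled loss closely tracks the exact one while avoiding the O(V) normalization at every training step.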

Experimental Results

Experiments conducted on Switchboard, LibriSpeech, and an in-house English dataset demonstrate the effectiveness of self-normalized IS. The study evaluates models by perplexity and by rescoring word error rate (WER), following lattice rescoring protocols for ASR outputs. Notably, self-normalized IS (Mode 3) shows competitive results compared to NCE across datasets and model sizes. The experimental results emphasize the training-efficiency advantage of self-normalized IS and its performance parity with established training criteria.

Figure 1: The influence of the number of samples K.

The influence of the number of samples K reveals a trade-off between training speed and perplexity: larger K offers a better approximation and reduced variance, at the cost of slower training. Mode 3 is highlighted for its efficiency and competitiveness.
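This trade-off can be illustrated with a quick Monte-Carlo check (a toy experiment under assumed settings, unrelated to the paper's actual runs): the spread of the importance-sampled partition estimate shrinks roughly as 1/sqrt(K).

```python
import numpy as np

rng = np.random.default_rng(1)
V = 5_000
scores = rng.normal(size=V)           # fixed model scores for one context
q = np.full(V, 1.0 / V)               # uniform proposal distribution

def estimate_spread(K, trials=200):
    """Empirical std of the importance-sampled partition estimate."""
    est = []
    for _ in range(trials):
        s = rng.choice(V, size=K, p=q)
        est.append(np.mean(np.exp(scores[s]) / q[s]))
    return float(np.std(est))

# Larger K: lower-variance estimate, but each training step touches
# more classes and is therefore slower.
spreads = [estimate_spread(K) for K in (10, 100, 1000)]
```

The spread drops monotonically as K grows, mirroring the approximation-versus-speed trade-off discussed above.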

Implications and Future Developments

The implications of this research extend to practical deployment scenarios where training speed and model accuracy are paramount. The simplicity of the self-normalized method paves a path for efficient integration into production systems, particularly in tasks requiring extensive lexical coverage. Future developments may explore extensions to other types of LMs and broaden the scope to multilingual tasks, potentially enhancing generalization and robustness across diverse language corpora.

Conclusion

The paper establishes self-normalized importance sampling as a viable alternative to traditional sampling-based criteria like NCE, simplifying the training process while delivering competitive performance without additional correction steps. The demonstrated benefits in terms of training efficiency and competitive perplexity suggest promising avenues for broader application in neural language modeling and ASR contexts.
