Fast Decoding in Sequence Models Using Discrete Latent Variables
The paper "Fast Decoding in Sequence Models Using Discrete Latent Variables" introduces a novel method for enhancing the parallelism of sequence model decoding, particularly within the field of neural machine translation. Traditional autoregressive models, including Recurrent Neural Networks (RNNs), WaveNet, and the Transformer architecture, although effective, encounter significant constraints when it comes to parallelized processing of long sequences due to their inherently sequential decoding methodology. This paper proposes utilizing discrete latent variables to transform the decoding process into one that is more computationally efficient.
Methodology
The paper extends sequence models with a short sequence of discrete latent variables. During training, an autoencoder compresses the target sequence into a latent sequence several times shorter than the output. At inference time, only this short latent sequence is generated autoregressively, conditioned on the source; the full output sequence is then reconstructed from the latents and the source in parallel. Because the autoregressive part operates on a much shorter sequence, the main sequential bottleneck of autoregressive decoding is largely removed.
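A minimal sketch of this two-stage inference procedure, assuming a compression factor of 8, is shown below. The functions predict_next_latent and parallel_decode are hypothetical placeholders standing in for the latent prediction model and the parallel decoder; in the actual model both are neural networks conditioned on the encoded source sentence.

```python
import numpy as np

# Hypothetical placeholders: in the real model these are neural networks
# conditioned on the encoded source sentence.
def predict_next_latent(source_encoding, latent_prefix, codebook_size=512):
    """Predicts the next discrete latent symbol given the latents so far."""
    rng = np.random.default_rng(len(latent_prefix))
    return int(rng.integers(codebook_size))

def parallel_decode(source_encoding, latents, target_length):
    """Reconstructs all target tokens from the latents in one parallel step."""
    rng = np.random.default_rng(0)
    return rng.integers(0, 32000, size=target_length).tolist()  # token ids

def fast_decode(source_encoding, target_length, compression=8):
    # 1) Autoregressive loop, but only over the *short* latent sequence
    #    (target_length / compression steps instead of target_length).
    num_latents = target_length // compression
    latents = []
    for _ in range(num_latents):
        latents.append(predict_next_latent(source_encoding, latents))
    # 2) A single parallel pass maps latents (plus source) to all output tokens.
    return parallel_decode(source_encoding, latents, target_length)

print(fast_decode(source_encoding=None, target_length=32)[:8])
```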
The paper explores several techniques for learning these discrete latents: the Gumbel-Softmax trick, improved semantic hashing, the vector quantized variational autoencoder (VQ-VAE), and a newly proposed decomposed vector quantization (DVQ). DVQ is particularly noteworthy: by splitting the code into several smaller codebooks, it avoids the index collapse that plain VQ-VAE exhibits when a single large codebook is used.
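To illustrate the decomposition idea, the sketch below quantizes each slice of the encoder output with its own small codebook and concatenates the results. The codebooks here are random NumPy arrays rather than learned parameters, so this shows only the nearest-neighbour lookup; the training machinery (straight-through gradients, codebook updates, commitment loss) is omitted.

```python
import numpy as np

def decomposed_vq(encoder_output, codebooks):
    """Sliced decomposed vector quantization (a sketch).

    encoder_output: array of shape (seq_len, d_model)
    codebooks: list of n_d arrays, each of shape (codebook_size, d_model // n_d)
    Returns the quantized vectors and the discrete index per slice.
    """
    n_d = len(codebooks)
    slices = np.split(encoder_output, n_d, axis=-1)  # n_d slices of size d_model / n_d
    quantized, indices = [], []
    for x, codebook in zip(slices, codebooks):
        # Nearest codebook entry for each position, computed per slice.
        dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        quantized.append(codebook[idx])
        indices.append(idx)
    # Concatenating the slice indices gives an effective codebook of size
    # codebook_size ** n_d while each table stays small, which mitigates index collapse.
    return np.concatenate(quantized, axis=-1), np.stack(indices, axis=-1)

rng = np.random.default_rng(0)
d_model, n_d, K = 8, 2, 16
books = [rng.normal(size=(K, d_model // n_d)) for _ in range(n_d)]
z, ids = decomposed_vq(rng.normal(size=(5, d_model)), books)
print(z.shape, ids.shape)  # (5, 8) (5, 2)
```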
Results
The model was evaluated on neural machine translation and compared with autoregressive baselines. Although its BLEU scores fall short of strong autoregressive models, it outperforms previously proposed non-autoregressive models. Most importantly, decoding is an order of magnitude faster than with comparable autoregressive models, demonstrating the benefit of the increased parallelism.
The Latent Transformer (LT), the model built from these components, is a promising approach to neural machine translation: when candidate translations are rescored, its BLEU scores come close to those of the autoregressive baseline decoded without beam search.
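One way such rescoring can work is sketched below: sample several latent sequences, decode each in parallel, and keep the candidate that an autoregressive scorer rates highest. The functions sample_latents, parallel_decode, and autoregressive_score are hypothetical stand-ins, and the exact candidate-generation scheme in the paper may differ from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the latent prior, the parallel decoder, and an
# autoregressive scoring model; here they just produce random candidates/scores.
def sample_latents(source_encoding, length):
    return rng.integers(0, 512, size=length).tolist()

def parallel_decode(source_encoding, latents, target_length):
    return rng.integers(0, 32000, size=target_length).tolist()

def autoregressive_score(source_encoding, candidate):
    return float(rng.normal())  # stand-in for log p(candidate | source)

def decode_with_rescoring(source_encoding, target_length, num_candidates=4, compression=8):
    candidates = []
    for _ in range(num_candidates):
        latents = sample_latents(source_encoding, target_length // compression)
        candidates.append(parallel_decode(source_encoding, latents, target_length))
    # Keep the candidate the autoregressive model scores highest; scoring uses
    # teacher forcing, so it is parallel over positions and stays fast.
    return max(candidates, key=lambda c: autoregressive_score(source_encoding, c))

print(decode_with_rescoring(None, 32)[:8])
```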
Implications and Future Directions
Fast decoding in sequence models matters most for applications that require real-time or near-real-time processing of long sequences, such as speech generation or summarization of large documents. The paper lays the groundwork for future work that may further improve the speed and accuracy of such models.
Going forward, speed and accuracy could be improved by hierarchical methods of latent generation, which would reduce the autoregressive bottleneck within the latent sequence itself. Combining the approach with sampling methods or semi-autoregressive decoding could also balance the speed of non-autoregressive models against the fidelity of autoregressive ones. Such advances could close the remaining performance gap and broaden the range of applications in artificial intelligence.