Fast Decoding in Sequence Models using Discrete Latent Variables (1803.03382v6)

Published 9 Mar 2018 in cs.LG

Abstract: Autoregressive sequence models based on deep neural networks, such as RNNs, Wavenet and the Transformer attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallelizable during training, yet still operate sequentially during decoding. Inspired by [arxiv:1711.00937], we present a method to extend sequence models using discrete latent variables that makes decoding much more parallelizable. We first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. To this end, we introduce a novel method for constructing a sequence of discrete latent variables and compare it with previously introduced methods. Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models. While lower in BLEU than purely autoregressive models, our model achieves higher scores than previously proposed non-autoregressive translation models.

Fast Decoding in Sequence Models Using Discrete Latent Variables

The paper "Fast Decoding in Sequence Models Using Discrete Latent Variables" introduces a novel method for enhancing the parallelism of sequence model decoding, particularly within the field of neural machine translation. Traditional autoregressive models, including Recurrent Neural Networks (RNNs), WaveNet, and the Transformer architecture, although effective, encounter significant constraints when it comes to parallelized processing of long sequences due to their inherently sequential decoding methodology. This paper proposes utilizing discrete latent variables to transform the decoding process into one that is more computationally efficient.

Methodology

The paper outlines a method for extending sequence models with discrete latent variables so that decoding becomes more parallelizable. Rather than generating the output sequence entirely autoregressively, the model first auto-encodes the target sequence into a much shorter sequence of discrete latent variables. At inference time, only this short latent sequence is generated autoregressively; the full output sequence is then reconstructed from the latents in parallel, removing the main sequential bottleneck of autoregressive models. A sketch of this two-stage procedure follows.
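The sketch below illustrates the two-stage decoding loop described above, under stated assumptions: `encoder`, `latent_prior`, and `parallel_decoder` are hypothetical callables standing in for the model's components, and `eos_latent` is an assumed stop code; none of these names come from the paper's released code.

```python
# Minimal sketch of two-stage decoding: autoregressive over a short latent
# sequence, then a single parallel pass to reconstruct the output tokens.
def fast_decode(source_tokens, encoder, latent_prior, parallel_decoder,
                max_latents=64, eos_latent=0):
    """Decode a target sequence via a short sequence of discrete latents."""
    # 1) Encode the source sentence once; this step is fully parallel.
    source_states = encoder(source_tokens)

    # 2) Autoregressively generate the *short* latent sequence, which takes
    #    roughly len(target) / compression_factor steps instead of len(target).
    latents = []
    for _ in range(max_latents):
        next_latent = latent_prior(latents, source_states)  # pick the next discrete code
        if next_latent == eos_latent:                        # assumed stop code
            break
        latents.append(next_latent)

    # 3) Reconstruct the full target sequence from the latents in one
    #    parallel pass; there is no token-by-token loop over the output.
    return parallel_decoder(latents, source_states)
```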

The paper explores several techniques for constructing these discrete latent variables, including the Gumbel-Softmax trick, improved semantic hashing, the Vector Quantised-Variational Autoencoder (VQ-VAE), and a novel Decomposed Vector Quantization (DVQ). DVQ is particularly noteworthy: it addresses the index-collapse problem of plain VQ-VAE, in which only a small fraction of a large codebook ends up being used.
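The core operation in DVQ can be illustrated with a short sketch: the encoder output is split into slices, and each slice is quantized against its own smaller codebook, so the effective code space grows multiplicatively without a single huge table. This is an illustrative NumPy sketch, not the paper's implementation; the shapes and function name are assumptions.

```python
import numpy as np

def decomposed_vector_quantize(z, codebooks):
    """Quantize each slice of z against its own codebook (DVQ-style sketch).

    z         : (d,) encoder output at one latent position
    codebooks : list of arrays; codebooks[i] has shape (K, d // len(codebooks))
    Returns the chosen indices and the concatenated quantized vector.
    """
    slices = np.split(z, len(codebooks))     # break z into equal sub-vectors
    indices, quantized = [], []
    for z_i, book in zip(slices, codebooks):
        # nearest codebook entry for this slice (squared Euclidean distance)
        dists = np.sum((book - z_i) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        quantized.append(book[k])
    # the discrete code is the tuple of slice indices; the continuous
    # representation passed to the decoder is the concatenation of the entries
    return indices, np.concatenate(quantized)
```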

Results

The model was evaluated end-to-end on neural machine translation and compared against traditional autoregressive models. Although it scores lower in BLEU than leading autoregressive models, it outperforms previously proposed non-autoregressive translation models. Most importantly, it decodes an order of magnitude faster than comparable autoregressive models, demonstrating the core benefit of the increased parallelism.

The Latent Transformer (LT) model built with these discretization techniques is a promising approach to neural machine translation, particularly when rescoring is applied, which lifts its BLEU scores close to those of the autoregressive baselines decoded without beam search.
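One way such rescoring can work, sketched below under assumptions, is to draw several latent samples, decode each candidate in parallel, and keep the candidate scored highest by an autoregressive model (scoring a complete candidate is itself parallel). The callables `sample_latents`, `parallel_decoder`, and `ar_log_prob` are hypothetical placeholders, not functions from the paper.

```python
# Hedged sketch of candidate rescoring with an autoregressive scorer.
def decode_with_rescoring(source_states, sample_latents, parallel_decoder,
                          ar_log_prob, n_candidates=10):
    """Draw several latent samples, decode each in parallel, keep the best."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        latents = sample_latents(source_states)               # stochastic latent draw
        candidate = parallel_decoder(latents, source_states)  # parallel reconstruction
        score = ar_log_prob(candidate, source_states)         # log-likelihood under the scorer
        if score > best_score:
            best, best_score = candidate, score
    return best
```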

Implications and Future Directions

Fast decoding in sequence models has significant implications for applications that require real-time or near-real-time processing of long sequences, such as speech generation or large-document summarization. The paper lays the groundwork for future work that may further improve the speed and accuracy of such models.

Going forward, speed and accuracy could be improved further by exploring hierarchical latent generation, which would reduce the autoregressive bottleneck within the latent sequence itself. Integrating sampling methods or semi-autoregressive decoding could also help balance the trade-off between the speed of non-autoregressive models and the fidelity of autoregressive ones. Such advances could narrow the performance gap between model families and broaden their applicability.

Authors (7)
  1. Aurko Roy (18 papers)
  2. Ashish Vaswani (23 papers)
  3. Niki Parmar (17 papers)
  4. Samy Bengio (75 papers)
  5. Jakob Uszkoreit (23 papers)
  6. Noam Shazeer (37 papers)
  7. Łukasz Kaiser (17 papers)
Citations (225)