Cold Fusion: Training Seq2Seq Models Together with Language Models (1708.06426v1)

Published 21 Aug 2017 in cs.CL

Abstract: Sequence-to-sequence (Seq2Seq) models with attention have excelled at tasks which involve generating natural language sentences such as machine translation, image captioning and speech recognition. Performance has further been improved by leveraging unlabeled data, often in the form of a language model. In this work, we present the Cold Fusion method, which leverages a pre-trained language model during training, and show its effectiveness on the speech recognition task. We show that Seq2Seq models with Cold Fusion are able to better utilize language information enjoying i) faster convergence and better generalization, and ii) almost complete transfer to a new domain while using less than 10% of the labeled training data.

Authors (4)
  1. Anuroop Sriram (32 papers)
  2. Heewoo Jun (14 papers)
  3. Sanjeev Satheesh (14 papers)
  4. Adam Coates (11 papers)
Citations (277)

Summary

Cold Fusion: Enhancements in Training Sequence-to-Sequence Models with Integrated Language Models

The paper "Cold Fusion: Training Seq2Seq Models Together with LLMs" examines the integration of a pretrained LLM (LM) with sequence-to-sequence (Seq2Seq) models to enhance the performance of tasks like machine translation and speech recognition. This paper introduces an innovative approach known as "Cold Fusion," which addresses the limitations of previous methodologies like Deep Fusion.

Seq2Seq models augmented with attention mechanisms are state-of-the-art on many NLP tasks because of their proficiency at mapping between sequences. A common practice is to incorporate an external LM, trained on large unlabeled text corpora, to improve fluency and generalization. Shallow Fusion does this at inference time by log-linearly combining the LM's scores with the Seq2Seq model's output scores, while Deep Fusion goes further by fusing the LM's hidden states with the Seq2Seq decoder through a gating function. However, Deep Fusion pre-trains the Seq2Seq model and the LM independently and only then learns the fusion parameters, so the decoder has already spent capacity redundantly learning linguistic patterns that the LM could supply.
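For concreteness, here is a minimal sketch of the Shallow Fusion scoring rule, a per-step log-linear interpolation applied during decoding; the array values and the weight are illustrative assumptions, not numbers from the paper:

```python
import numpy as np

def shallow_fusion_score(seq2seq_log_probs, lm_log_probs, lam=0.3):
    """Log-linear interpolation of Seq2Seq and LM scores (Shallow Fusion).

    Both arguments are next-token log-probability arrays over the same
    vocabulary; lam is a tunable weight on the external language model.
    """
    return seq2seq_log_probs + lam * lm_log_probs

# Toy usage: score candidates for the next token of one beam hypothesis.
s2s = np.log(np.full(5, 0.2))                        # flat Seq2Seq prediction
lm = np.log(np.array([0.05, 0.6, 0.2, 0.1, 0.05]))   # LM prefers token 1
best_next_token = int(np.argmax(shallow_fusion_score(s2s, lm)))
```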

The principal contribution of Cold Fusion lies in training the Seq2Seq model together with a fixed, pretrained LM, encouraging the decoder to rely on the external LM for language-specific knowledge from the outset. This promotes efficient parameter utilization, since the decoder's capacity is devoted to task-specific learning rather than to implicitly learning a language model. As a result, Cold Fusion converges faster and achieves almost complete transfer to a new domain while using less than 10% of the labeled training data.
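A minimal PyTorch-style sketch of that training setup, using stand-in modules (the paper's actual architectures are an attention-based Seq2Seq model and a pretrained recurrent LM): the LM is frozen, so gradients flow only into the Seq2Seq and fusion parameters.

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only; not the paper's implementation.
lm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
seq2seq = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

# Freeze the pretrained LM so its parameters receive no gradient updates;
# only the Seq2Seq (and fusion) parameters are trained on labeled task data.
for p in lm.parameters():
    p.requires_grad = False
lm.eval()

optimizer = torch.optim.Adam(seq2seq.parameters(), lr=1e-3)
```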

Cold Fusion's architectural innovations, sketched in code after this list, are particularly compelling:

  1. A gating mechanism computed from both the Seq2Seq decoder state and the LM state lets the model lean on external language information when the input is uncertain.
  2. Fine-grained gating assigns a separate gate value to each node of the LM state, increasing the model's flexibility.
  3. Probabilistic projection replaces the LM's hidden state with a projection of its token probabilities (logits) into a common embedding space, decoupling the fusion mechanism from the LM's internal representation and permitting direct substitution of a different LM at test time.
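Taken together, these pieces can be sketched as a small fusion module in PyTorch: project the LM logits, compute a fine-grained sigmoid gate from both states, concatenate the gated LM features with the decoder state, and project to vocabulary logits. The class name, dimensions, and the single output projection are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ColdFusionLayer(nn.Module):
    """Illustrative fusion of a Seq2Seq decoder state with a frozen LM's logits."""

    def __init__(self, decoder_dim, vocab_size, fusion_dim=256):
        super().__init__()
        self.lm_proj = nn.Linear(vocab_size, fusion_dim)              # probabilistic projection of LM logits
        self.gate = nn.Linear(decoder_dim + fusion_dim, fusion_dim)   # fine-grained, per-node gate
        self.output = nn.Linear(decoder_dim + fusion_dim, vocab_size)

    def forward(self, decoder_state, lm_logits):
        h_lm = self.lm_proj(lm_logits)                                # h_t^LM = DNN(l_t^LM)
        g = torch.sigmoid(self.gate(torch.cat([decoder_state, h_lm], dim=-1)))
        fused = torch.cat([decoder_state, g * h_lm], dim=-1)          # s_t^CF = [s_t; g_t * h_t^LM]
        return self.output(fused)                                     # logits for the output softmax

# Toy usage with random tensors (batch of 8, decoder dim 512, vocab size 1000).
layer = ColdFusionLayer(decoder_dim=512, vocab_size=1000)
fused_logits = layer(torch.randn(8, 512), torch.randn(8, 1000))
```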

Experiments on automatic speech recognition highlight Cold Fusion's effectiveness. Using datasets from two domains, search queries as the source and movie transcripts as the target, the paper measures performance with Word Error Rate (WER) and Character Error Rate (CER). The results show clear improvements in both in-domain and out-of-domain settings over baseline and Deep Fusion models. Notably, Cold Fusion closed roughly 38% of the domain-adaptation gap without retraining the underlying Seq2Seq parameters, demonstrating the robustness gained from LM integration.
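As a reminder of the metric, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words (CER is the same computation over characters). A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference words -> WER = 0.25
assert word_error_rate("the cat sat down", "the cat sat dawn") == 0.25
```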

Additionally, the experiments underscore Cold Fusion's decoder efficiency: because the external LM supplies linguistic knowledge, performance remains largely stable as the decoder is shrunk, whereas conventional attention-based Seq2Seq models degrade substantially when their decoders are downsized.

In practical terms, these characteristics make Cold Fusion especially useful where domain adaptation is critical, as in speech recognition. Fine-tuning with even a small fraction of labeled target-domain data yields substantial gains, allowing Cold Fusion models to approach the performance of fully domain-adapted systems. These results highlight Cold Fusion's ability to adapt across tasks and domains, making it a versatile option for a range of NLP applications.

Future research could extend Cold Fusion to a broader range of NLP tasks and adapt its integration technique to newer architectures such as transformers. As models continue to evolve, Cold Fusion offers a principled way to balance efficiency, convergence speed, and transferability in Seq2Seq models.