Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities (2405.18669v2)

Published 29 May 2024 in cs.LG, cs.AI, cs.CL, and eess.AS

Abstract: Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that contain similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks, without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal (e.g., text-to-text generation) generation performance by freezing the corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR) where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS) where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.

Authors (4)
  1. Vicky Zayats (14 papers)
  2. Peter Chen (9 papers)
  3. Dirk Padfield (7 papers)
  4. Melissa Ferrari (1 paper)

Summary

  • The paper presents Zipper, which integrates independently pre-trained unimodal decoders using a multi-tower design.
  • It employs gated cross-attention and autoregressive masking to align the text and speech modalities while retaining each backbone's core unimodal capabilities.
  • Empirical results show a 12-point absolute WER reduction for TTS and competitive ASR performance, even with minimal aligned data.

Integrating Independently Pre-trained Unimodal Decoders in Multimodal Generation Tasks: The Zipper Model

Introduction

The paper presents a novel approach, Zipper, designed to address the challenges of integrating multiple generative foundation models trained on different modalities. The main difficulties are the scarcity of aligned cross-modal data and the need to leverage unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities.

Methodology

Zipper introduces a multi-tower decoder architecture that uses cross-attention to fuse independently pre-trained unimodal decoders. Evaluated here on speech and text, the design composes multimodal generative models from existing unimodal backbones rather than training a single multimodal decoder from scratch. The architecture consists of two autoregressive decoder towers, a text backbone and a speech backbone, combined through gated cross-attention layers. Each backbone is pre-trained independently with next-token prediction on its own modality.
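
To make the fusion mechanism concrete, the sketch below shows one way a gated cross-attention layer could let one decoder tower attend to the other tower's hidden states. This is a minimal PyTorch illustration, not the paper's implementation: the class name, the dimensions, and the Flamingo-style tanh gate initialized to zero are assumptions, and the paper's exact projection, gating, and layer-placement choices are not reproduced here.

    import torch
    import torch.nn as nn

    class GatedCrossAttentionLayer(nn.Module):
        """Lets one decoder tower attend to the other tower's hidden states.
        Names, dimensions, and the tanh gate are illustrative only."""

        def __init__(self, d_model: int, d_other: int, n_heads: int):
            super().__init__()
            self.proj = nn.Linear(d_other, d_model)   # map the other tower into this tower's width
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed, so fusion begins as a no-op
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor, other_states: torch.Tensor) -> torch.Tensor:
            # x:            (batch, seq_len, d_model)    hidden states of this tower
            # other_states: (batch, other_len, d_other)  hidden states of the other tower
            kv = self.proj(other_states)
            attn_out, _ = self.attn(self.norm(x), kv, kv)
            return x + torch.tanh(self.gate) * attn_out  # gated residual fusion

In a multi-tower setup, layers like this would be interleaved with each tower's existing pre-trained decoder layers, so that, for example, the speech tower can condition on text representations while both towers continue to generate autoregressively.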

Autoregressive masking is adapted for multimodal sequences, allowing the model to generate output in a specified sequence of modalities at inference time. The design also lets unimodal generation performance be preserved by freezing the corresponding modality tower when necessary; for example, the text tower can be kept frozen to retain text-to-text generation capability while the model is aligned on cross-modal tasks such as automatic speech recognition (ASR).
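
The freezing described above is mechanically simple. A minimal sketch, assuming a Zipper-style module that exposes its towers and fusion layers as attributes (the attribute names here are hypothetical):

    import torch

    def freeze_tower(tower: torch.nn.Module) -> None:
        """Freeze one modality tower (e.g. the text backbone) so that
        cross-modal training cannot alter its unimodal behaviour."""
        for p in tower.parameters():
            p.requires_grad_(False)
        tower.eval()  # also disables dropout inside the frozen tower

    # Hypothetical usage: only the cross-attention layers (and, if desired,
    # the unfrozen tower) receive gradient updates.
    #   freeze_tower(zipper.text_tower)
    #   trainable = [p for p in zipper.parameters() if p.requires_grad]
    #   optimizer = torch.optim.AdamW(trainable, lr=1e-4)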

Experiments and Results

Automatic Speech Recognition (ASR)

On ASR, Zipper performs competitively with the conventional Single Decoder baseline, which instead expands a text decoder's vocabulary to include speech tokens. With the text backbone frozen, Zipper remains comparable, with only small differences in WER, particularly on the noisier test-other subset.
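
Both the ASR and TTS comparisons are reported as word error rate, i.e., the word-level edit distance between hypothesis and reference normalized by the number of reference words. As a toy illustration of the metric (the paper does not specify its evaluation tooling), the jiwer library computes it directly:

    import jiwer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"

    # WER = (substitutions + deletions + insertions) / number of reference words
    print(jiwer.wer(reference, hypothesis))  # two substitutions over nine words, ~0.22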

Text-to-Speech Generation (TTS)

On TTS, Zipper delivers a substantial WER reduction over the Single Decoder model: with an unfrozen speech backbone, WER drops by 12 absolute points (a 40% relative error reduction). The advantage is attributed to building on a strong pre-trained speech backbone rather than a vocabulary-expanded text decoder, a benefit that matters more as the output token sequences in speech generation grow longer.
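
Taken together, the two reported figures imply an approximate operating point. A back-of-the-envelope check (these derived numbers are inferred from the stated reduction, not quoted from the paper):

    # A 12-point absolute drop that is also a 40% relative reduction implies
    # a baseline WER near 30 and a Zipper WER near 18.
    absolute_drop = 12
    relative_drop = 0.40
    baseline_wer = absolute_drop / relative_drop  # = 30.0
    zipper_wer = baseline_wer - absolute_drop     # = 18.0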

Limited Aligned Data Scenarios

Empirical results highlight Zipper's ability to learn from small amounts of aligned data. Trained on as little as 1% of the original aligned training data, Zipper reaches a WER in the mid-twenties on ASR, significantly outperforming the Single Decoder model under identical conditions. This underscores Zipper's advantage in data-constrained settings, where it can lean on strong unimodal pre-training.

Implications and Future Work

Zipper's ability to retain unimodal generative performance while adding cross-modal capabilities addresses several challenges in multimodal model integration. Given its flexibility and reduced dependence on large amounts of aligned data, it is particularly well suited to settings where aligned cross-modal data is scarce.

Going forward, several extensions are envisaged. The model can be extended to integrate more than two modalities, incorporating text, speech, images, video, and other niche modalities such as protein sequences. Future work will also explore scaling Zipper to larger model sizes and more diverse datasets.

Conclusion

The Zipper architecture demonstrates a robust method for integrating unimodal generative models, preserving their core capabilities while adding cross-modal functionalities. The empirical results in ASR and TTS tasks affirm its competitive edge against traditional approaches, especially in data-constrained scenarios. This work lays a foundation for more flexible and scalable multimodal generative models, potentially transforming various applications in AI where multimodal data integration is paramount.